
Format auto-detection


Post by finale00 »

I initially asked this in the Noesis thread, but figured it should be separated into its own topic.

So I'm looking for ideas on some good ways to do format auto-detection.
It's mainly for 3D formats, but I suppose a generic discussion on auto-detection would also generate ideas.

Basically I've written a small 3D batch converter in Python, sort of like karpati's 3D Object Converter in concept, except the only real purpose is so that I don't have to write export functions every time I try to figure out another format. So ya, for convenience and not really for use.

It currently only supports 12 different formats (most of them are documented), and I have found that trying to detect the format is getting pretty difficult.

Some formats use the same file extension as another format
Some formats use the same header as another format
Some formats don't even have a header, and may share the same extension as another header-less format!

There are probably more issues that can arise during the detection step which I have not encountered or thought of.

I've recently switched from a hardcoded design (just lots of cases) to a plugin design (where the plugin itself will tell the engine what it needs to know about the format), and even that can get a little messy with the above 3 conditions.
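For illustration, here is a minimal sketch of what such a plugin design could look like. Everything in it is an assumption: the `FormatPlugin` class, the `register` helper, and the two format names are hypothetical and not taken from the actual converter.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FormatPlugin:
    """What a plugin tells the engine about its format (hypothetical fields)."""
    name: str
    extensions: tuple = ()   # lowercase, with leading dot
    magics: tuple = ()       # byte strings expected at offset 0, if any

REGISTRY = []

def register(plugin):
    """Add a plugin to the engine's list and return it."""
    REGISTRY.append(plugin)
    return plugin

# Two made-up formats that share an extension; only one has a magic word.
register(FormatPlugin("format_a", extensions=(".mdl",), magics=(b"MDL",)))
register(FormatPlugin("format_b", extensions=(".mdl",)))

def plugins_for_extension(ext):
    """Return the names of all plugins that claim the given extension."""
    ext = ext.lower()
    return [p.name for p in REGISTRY if ext in p.extensions]
```

The point of keeping the detection data declarative is that the engine can see every plugin's claims at once, which makes extension and header collisions visible instead of hiding them inside per-format code.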

I'm currently checking two things:

1. headers, and then
2. extension

If neither is recognized, I just report that I couldn't recognize the file, and if the format is actually supported you can type in the input format manually lol

But the header itself can also be a headache.

-some formats have a 3-byte ID string
-some formats have a 4-byte ID string
-some formats have a 5-byte ID string
-some formats have ...

and of course some don't have any.

Well, most of the ones I've come across (that have the magic word in the first place) tend to have between 3 and 7, more commonly 3 or 4, so I can just take the first 10 bytes or something and then start reading substrings and checking if the format is supported. Or at least try.
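The "take the first 10 bytes and check substrings" idea could be sketched roughly like this. The magic words and format names in `MAGICS` are made up for the example; the only real design choice shown is trying longer magics first, so that a 4-byte ID is not shadowed by a 3-byte ID that happens to be its prefix.

```python
# Hypothetical magic words mapped to hypothetical format names.
MAGICS = {
    b"MDL": "format_a",
    b"MDLX": "format_b",
    b"SMESH": "format_c",
}

def read_head(path, size=10):
    """Read the first `size` bytes of a file."""
    with open(path, "rb") as f:
        return f.read(size)

def detect_by_magic(head, magics=MAGICS):
    """Return the format whose magic word prefixes `head`, or None.

    Longer magics are tried first so b"MDLX" wins over b"MDL".
    """
    for magic in sorted(magics, key=len, reverse=True):
        if head.startswith(magic):
            return magics[magic]
    return None
```

Usage would be something like `detect_by_magic(read_head("some_file.mdl"))`, falling back to the extension check when it returns `None`.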

Anyone have ideas?

Dinoguy mentioned checking the folder that it is being loaded from, which is an interesting idea. I've been focused on individual files themselves so haven't really considered the big picture.

Some things I've thought of

-check header.
-If no header, check extension

-check both header AND extension and make sure both match the format
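One way to combine the two checks is to score every known format instead of stopping at the first hit, so the result is a ranked list of candidates rather than a single guess. The sketch below assumes a small hypothetical table of formats; the scoring weights are arbitrary.

```python
import os

# (name, extension, magic or None). All entries are hypothetical.
FORMATS = [
    ("format_a", ".mdl", b"MDL"),
    ("format_b", ".mdl", None),
    ("format_c", ".dat", None),
    ("format_d", ".dat", None),
]

def candidates(filename, head, formats=FORMATS):
    """Rank formats by how well header and extension agree with the file."""
    ext = os.path.splitext(filename)[1].lower()
    scored = []
    for name, fmt_ext, magic in formats:
        ext_ok = (ext == fmt_ext)
        if magic is not None:
            if not head.startswith(magic):
                continue              # format has a magic, file lacks it: rule out
            score = 2 + int(ext_ok)   # header match, plus bonus if extension agrees
        elif ext_ok:
            score = 1                 # header-less format, extension is all we have
        else:
            continue
        scored.append((score, name))
    # Stable sort: ties keep their order from the table.
    scored.sort(key=lambda item: -item[0])
    return [name for _, name in scored]
```

When two header-less formats share an extension, both come back with the same score, which is the signal that something smarter (or the user) has to decide.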

I would try brute forcing it (just running every parser on the file), but a file might be parsed successfully by the wrong parser and the result would still be wrong.
Last edited by finale00 on Wed Jul 06, 2011 12:07 am, edited 1 time in total.

Re: Format auto-detection

Post by Dinoguy1000 »

There is also the TrID method of simply looking for patterns that are likely to be unique to the format in question. TrID uses a very naive approach of only looking for static patterns, but my guess is that you could actually get quite sophisticated with this, if you have a deep enough understanding of a format.
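A rough sketch of static-pattern matching in that spirit: each format lists byte patterns expected at fixed offsets, and a file is scored by how many of them it contains. This is not TrID's actual definition format or algorithm; the `SIGNATURES` table and its format names are invented for the example.

```python
# Hypothetical signatures: format name -> list of (offset, expected bytes).
SIGNATURES = {
    "format_a": [(0, b"MDL"), (8, b"\x00\x00")],
    "format_b": [(4, b"VERT")],
}

def match_score(data, patterns):
    """Fraction of (offset, bytes) patterns found in `data`."""
    hits = sum(1 for off, pat in patterns if data[off:off + len(pat)] == pat)
    return hits / len(patterns)

def best_match(data, signatures=SIGNATURES, threshold=1.0):
    """Return the best-scoring format, or None if it scores below `threshold`."""
    best = max(signatures, key=lambda name: match_score(data, signatures[name]))
    if match_score(data, signatures[best]) >= threshold:
        return best
    return None
```

Lowering `threshold` allows partial matches, which is where a deeper understanding of the format would come in: patterns could be weighted, or replaced by checks on computed values rather than fixed bytes.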

Re: Format auto-detection

Post by finale00 »

"...if you have a deep enough understanding of a format."
I suppose one could read specific bytes at particular offsets in sequence and check each value against some additional definitions supplied by the format.

So for a header-less file with no extension, one might read an int and check whether it makes sense or not (e.g. 35489122 probably won't for most formats), then skip that many values and then read a float and check whether it makes sense or not, such that anything beyond a particular interval would probably be "unrealistic" (e.g. 2.34e-23).

For very complex formats, the detector would probably realize quickly that the file is not that format and move on, whereas ironically for simple formats this might take a little longer!

Of course, this would be after filtering the list so you don't have to check too much.

Definitely makes use of the knowledge of the format itself.
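As a sketch of this kind of sanity checking, assume a made-up header-less layout: a little-endian uint32 vertex count followed by that many xyz triples of float32. The layout, the function names, and the thresholds are all assumptions chosen for the example, and a real format would need its own limits.

```python
import math
import struct

def plausible_count(n, max_count=1_000_000):
    """A count should be positive and not absurdly large."""
    return 0 < n <= max_count

def plausible_coord(x, limit=1e6, tiny=1e-6):
    """Reject NaN/inf, huge values, and denormal-looking tiny values."""
    if not math.isfinite(x):
        return False
    return x == 0.0 or tiny <= abs(x) <= limit

def looks_like_vertex_blob(data):
    """Check `data` against the hypothetical layout described above."""
    if len(data) < 4:
        return False
    (count,) = struct.unpack_from("<I", data, 0)
    if not plausible_count(count):
        return False
    if len(data) < 4 + count * 12:     # 3 float32 values per vertex
        return False
    floats = struct.unpack_from("<%df" % (count * 3), data, 4)
    return all(plausible_coord(x) for x in floats)
```

Each check fails fast, so a file that is clearly not this format gets rejected after reading a few bytes, while a simple format with few constraints needs more of the file examined before the answer is convincing.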