
PROJECT: CODENAME Eureka

Posted: Wed Aug 11, 2004 8:16 pm
by Mr.Mouse
WATTO, I, and anyone else who wishes to participate can use this thread to discuss ways to create the support program, codename Eureka, that enables the innocent user of Game Resource Archivers to figure out newly encountered GRAs.

Stuff that should be discussed:

- strategies of the program to identify whole formats of archives or parts of formats of archives
- routines needed to accomplish subtasks
- methods to ensure easy additions of new format specifications
- known file identity tags to scan for specific file types in an archive
- ways to identify files that don't have readily distinguishable tags
- ways to locate compressed files and if possible recognize the format used
- support executables or inclusion of decompression/encryption code to be used on matched formats

I think this kind of discussion and research may help us reach a common strategy and approach, before we even have to code a single line! :D

Posted: Wed Aug 11, 2004 9:11 pm
by Rahly
LOL, that's funny; for my coding site, I just did a forum. It's a little more specific than yours, though. :)

http://dl748.com/forums/

[EDIT]I pretty much know the languages that are listed in there[/EDIT]

Posted: Wed Aug 11, 2004 9:50 pm
by Mr.Mouse
Make that A LOT more specified :) .

The purpose of the Code Talk forum is to be able to separate the talk about code (MultiEx related) on this forum from the other subjects, and to offer a place where MultiEx related projects can be discussed. :)

Posted: Wed Aug 11, 2004 9:52 pm
by Rahly
Yeah, I know, I like to talk about a lot of things about coding, including ideas.

Posted: Thu Aug 12, 2004 5:22 am
by Captain
Rahly wrote: [EDIT]I pretty much know the languages that are listed in there[/EDIT]
Judging by that list, you may like to learn Python. It's a fun language that's easy to pick up, versatile, and powerful.

Posted: Thu Aug 12, 2004 10:53 am
by Rahly
Captain wrote:
Rahly wrote: [EDIT]I pretty much know the languages that are listed in there[/EDIT]
Judging by that list, you may like to learn Python. It's a fun language that's easy to pick up, versatile, and powerful.
I know that already, I said, I know pretty much all the languages there, I didn't say those were the only languages I know. I also know COBOL and ADA and BrainF*ck, and a bunch more.

Posted: Fri Aug 13, 2004 9:10 am
by Mr.Mouse
Well, anyway, let's stay on topic in this thread. :)

I think one of the key issues in creating Eureka is to define how it should operate.

The first thing it should do, in my opinion, is check whether it has encountered an archive format it already knows. Let's call this the Current_Match process. It would be advisable to create the program in such a way that we can feed it additional known format identity tags (FIT). These FITs can be seen as a list of things Eureka has to confirm are present in the archive. The format of a FIT can be very simple: all you might need is a list of strings or bytes that have the following structure:

FIT-structure:

[0] : Type - What type of data is Eureka to load from the file? (e.g. string, null-string, word, dword etc.)
[1] : Offset - Where in the file is Eureka to load this variable?
[2] : Match - The variable of type Type loaded at offset Offset should match this variable

In VB you can easily convert strings to words, to bytes, to dwords etc. if you define an array of strings. Thus, you can easily make some kind of tab-delimited text file in which you put these components.

For instance, you can start a line with:
1. Archive Format Name (e.g. "Painkiller .PAK format")
2. Archive Format Index (if we create a new list of locations of multiex scripts and plugins)
3. Number of criteria to match (and that follow, that Eureka has to load)
4-XX - The three-string entries for each criterion.
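
To make the line layout above a bit more concrete, here is a rough sketch in Python (not VB) of reading one such tab-delimited line. The field order follows the list above; everything else is an assumption for illustration only: the names `Criterion`, `FormatSpec` and `parse_fit_line`, the type names, decimal offsets, and the made-up "Example .XYZ format" values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Criterion:
    type: str    # e.g. "string", "word", "dword" (assumed type names)
    offset: int  # where in the archive the value is read
    match: str   # expected value, kept as text until it is compared


@dataclass
class FormatSpec:
    name: str
    index: int
    criteria: List[Criterion]


def parse_fit_line(line: str) -> FormatSpec:
    """Parse one line: name, index, criterion count, then 3 fields per criterion."""
    fields = line.rstrip("\r\n").split("\t")
    name, index, count = fields[0], int(fields[1]), int(fields[2])
    rest = fields[3:]
    if len(rest) != 3 * count:
        raise ValueError(f"expected {3 * count} criterion fields, got {len(rest)}")
    criteria = [
        Criterion(rest[i], int(rest[i + 1]), rest[i + 2])
        for i in range(0, len(rest), 3)
    ]
    return FormatSpec(name, index, criteria)


# Hypothetical line; the format name, index and tag are invented for the example.
spec = parse_fit_line("Example .XYZ format\t7\t2\tstring\t0\tXYZ1\tdword\t4\t3")
```

Keeping the match value as text until comparison time mirrors the "array of strings" idea; the conversion to a word or dword would then happen per criterion type.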

When all criteria are met, Eureka could declare the archive known and try to open it. When many but not all criteria are met, Eureka could report "The format resembles XXXX format for NN %".
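
A minimal sketch of how the Current_Match check and the "NN %" figure could be computed, in Python. The tuple layout `(type, offset, expected)`, the type names, and the little-endian byte order are assumptions of mine, not something agreed in this thread.

```python
import struct

# Assumed integer types, read as little-endian.
INT_FORMATS = {"byte": "<B", "word": "<H", "dword": "<I"}


def criterion_met(data: bytes, type_: str, offset: int, expected) -> bool:
    """Check one (Type, Offset, Match) criterion against the archive bytes."""
    if type_ == "string":
        raw = expected.encode("ascii")
        return data[offset:offset + len(raw)] == raw
    fmt = INT_FORMATS[type_]
    if offset + struct.calcsize(fmt) > len(data):
        return False  # the value would lie past the end of the file
    return struct.unpack_from(fmt, data, offset)[0] == expected


def match_percentage(data: bytes, criteria) -> float:
    """Percentage of criteria that the archive satisfies."""
    if not criteria:
        return 0.0
    met = sum(criterion_met(data, t, off, exp) for t, off, exp in criteria)
    return 100.0 * met / len(criteria)


# Invented sample: tag "XYZ1", a dword 3 at offset 4, a word 9 at offset 8.
sample = b"XYZ1" + struct.pack("<I", 3) + struct.pack("<H", 9)
criteria = [("string", 0, "XYZ1"), ("dword", 4, 3), ("word", 8, 7)]
pct = match_percentage(sample, criteria)
print(f"The format resembles XXXX format for {pct:.0f} %")
```

A result of 100 would mean "known format"; where to draw the line between "resembles" and "too few criteria" is left open here.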

If too few criteria are met, Phase 2 should start, identifying patterns, files etc. in the archive. But I think it best if we're sure of the principle of the Current_Match phase first, before we move on.

Posted: Fri Aug 13, 2004 11:03 am
by Rahly
I have some idea, but I hesitate to share, only cuz that would get me involved, and it would get me distracted from the plugin manager. :) I'd mostly start writing code for it, to test the theory.

Posted: Fri Aug 13, 2004 11:19 am
by Mr.Mouse
You can always share your views without getting involved. You don't have to start writing code! This thread is useful for discussion, not code. We can all agree on an approach and then only some will actually code it. Or someone (like WATTO or me) will make the decision, based on the discussion, and start to code, limiting further discussion about that particular part. ;)

Posted: Fri Aug 13, 2004 12:21 pm
by Rahly
I know that :). But it's just the kind of person I am, though.

Anyway, I'm thinking of a drag-and-drop method. Drop large batches of files, like "everything" in the data directory, and find pattern similarities between them: for example, the first 4 bytes in this group of files are all the same, or in this other group of files the value at position $22 looks like it represents the file size, or the file size - $22 (maybe $22 is a header size). Also have different checking methods that the user can select, like "Check all known", "Check for simple patterns", "Thorough check of files list (warning: this option can take a while, especially if the files are large)", "Try raw compression schemes".
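
The two "simple pattern" checks mentioned here could look roughly like the Python sketch below: grouping a batch by its first 4 bytes, and looking for a position whose dword equals the file size minus one constant shared by the whole group. Function names, the little-endian dword assumption, and the 64-byte search window are all hypothetical.

```python
import struct
from collections import defaultdict


def group_by_magic(files: dict) -> dict:
    """Group file names by the first 4 bytes of their contents."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[data[:4]].append(name)
    return dict(groups)


def size_field_candidates(files: dict, max_offset: int = 64):
    """Find (offset, delta) where dword-at-offset == file size - delta in every file."""
    datas = list(files.values())
    # Files of differing sizes are needed, otherwise any constant would match.
    if len({len(d) for d in datas}) < 2:
        return []
    limit = min(max_offset, min(len(d) for d in datas) - 4)
    results = []
    for off in range(0, limit + 1):
        deltas = {len(d) - struct.unpack_from("<I", d, off)[0] for d in datas}
        if len(deltas) == 1:
            delta = deltas.pop()
            if delta >= 0:
                results.append((off, delta))
    return results


def make(payload: bytes) -> bytes:
    """Build an invented file: 4-byte tag, dword payload size, payload."""
    return b"ABCD" + struct.pack("<I", len(payload)) + payload


files = {"a.dat": make(b"x" * 10), "b.dat": make(b"y" * 25), "c.bin": b"ZZZZ" + b"\x00" * 10}
groups = group_by_magic(files)
abcd = {name: files[name] for name in groups[b"ABCD"]}
print(size_field_candidates(abcd))  # offset of the size field and the implied header size
```

In the invented sample the delta plays the role of the header size, the same way $22 might in the example above.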

Posted: Fri Aug 13, 2004 3:38 pm
by friendsofwatto
OK, these are the things I can think up at the moment, so bear with me :)

Filename searches shouldn't be too difficult to do - we should consider something a filename if it contains at least 5 characters in the English character set - this gets around the format types that include 4-byte extension strings or tags (I call them chunk-based archives). We could also confirm the presence of a filename by locating a . character in it, but not every filename contains one.

Once we find a filename, we should first try the bytes before it for something which may provide the filename size - this will either be a 1-byte, 2-byte, or a 4-byte value. Failing this, we should then read through until the first null, making sure to check each character as we go, because if a non-English character appears before the null, it probably isn't a file name.
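
As a sketch of that order of checks (length field first, null terminator second), something like the following Python could work. Printable ASCII stands in for "English characters", the length is assumed to be little-endian, and the 255-character cap and all function names are assumptions of mine.

```python
import struct

PRINTABLE = set(range(0x20, 0x7F))  # printable ASCII stands in for "English characters"
MIN_LEN, MAX_LEN = 5, 255           # minimum from the post above; maximum is assumed


def read_length_prefixed(data: bytes, start: int):
    """Try a 1-, 2- or 4-byte little-endian length stored just before `start`."""
    for width, fmt in ((1, "<B"), (2, "<H"), (4, "<I")):
        pos = start - width
        if pos < 0:
            continue
        n = struct.unpack_from(fmt, data, pos)[0]
        name = data[start:start + n]
        if MIN_LEN <= n <= MAX_LEN and len(name) == n and all(b in PRINTABLE for b in name):
            return name.decode("ascii")
    return None


def read_null_terminated(data: bytes, start: int):
    """Read up to the first null, rejecting the name on any non-printable byte."""
    end = start
    while end < len(data) and end - start <= MAX_LEN:
        if data[end] == 0:
            return data[start:end].decode("ascii") if end - start >= MIN_LEN else None
        if data[end] not in PRINTABLE:
            return None
        end += 1
    return None


def read_filename(data: bytes, start: int):
    """Return (name, how) for the candidate at `start`, or None."""
    name = read_length_prefixed(data, start)
    if name is not None:
        return name, "length-prefixed"
    name = read_null_terminated(data, start)
    if name is not None:
        return name, "null-terminated"
    return None


print(read_filename(b"\x09\x00\x00\x00music.wav\x01\x02", 4))
print(read_filename(b"\xff\xfflevel1.map\x00", 2))
```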

Once we have a file name confirmed, we should find the next filename. This will help us discover an important fundamental of the archives - the entry size for each file. Alternatively, it will also tell us if there is a separate filename directory, usually constructed of all the filenames separated by a single null character - obviously this is the case if the second filename starts directly after the first.

If we can find out the entry size, we should then verify it by skipping forward X bytes and checking that the next filename starts at that position - if not, then we are probably dealing with a variable-length entry size, usually brought about by allowing filenames of any length.
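
A small Python sketch of that decision, given the offsets of the filenames already found. The labels and the function name are made up; the first sample reuses the offsets from a directory with 264-byte spacing.

```python
def classify_directory(names):
    """names: list of (offset, name) in file order.

    Returns ("name-directory", None) when each name starts right after the
    previous name's null, ("fixed", size) when the spacing is constant,
    and ("variable", None) otherwise.
    """
    if len(names) < 2:
        return ("unknown", None)
    deltas = [b[0] - a[0] for a, b in zip(names, names[1:])]
    # Checked first: equal-length names packed back to back also have a constant delta.
    if all(d == len(name) + 1 for d, (_, name) in zip(deltas, names)):
        return ("name-directory", None)
    if len(set(deltas)) == 1:
        return ("fixed", deltas[0])
    return ("variable", None)


fixed = [(17, "C:\\TOOLS\\MC\\BREAD.EXE"), (281, "C:\\TOOLS\\MC\\EDITOR.MCT"), (545, "C:\\TOOLS\\MC\\FILE.MKD")]
packed = [(0, "a.txt"), (6, "bb.dat"), (13, "c.bin")]
mixed = [(0, "a.txt"), (20, "bb.dat"), (45, "c.bin")]
for sample in (fixed, packed, mixed):
    print(classify_directory(sample))
```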

Anyway, from there we should try to figure out the rest of the fields in each entry. Knowing that most fields are 4-byte fields, we should start with that and see what patterns we can find. We should look for patterns of increasing value first, which would indicate either a File ID field or a File Offset field. Once we determine the File Offset field, the next logical step is to look for a File Size field.
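
One way to sketch this field hunt in Python: split fixed-size entries into 4-byte columns, keep the columns that always increase, then look for a column that equals the gap between consecutive offsets. That last test assumes the files are stored back to back with no padding, which is my assumption and will not hold for every archive; the names and the little-endian reading are hypothetical too.

```python
import struct


def dword_columns(entries):
    """Split equally sized entries into columns of little-endian dwords."""
    width = len(entries[0]) // 4
    rows = [struct.unpack_from(f"<{width}I", e) for e in entries]
    return [list(col) for col in zip(*rows)]


def increasing_columns(columns):
    """Indexes of columns whose values strictly increase (File ID or File Offset)."""
    return [i for i, col in enumerate(columns)
            if all(a < b for a, b in zip(col, col[1:]))]


def size_column_for(columns, offset_idx):
    """Index of the column matching offset[i+1] - offset[i], or None."""
    offsets = columns[offset_idx]
    gaps = [b - a for a, b in zip(offsets, offsets[1:])]
    for i, col in enumerate(columns):
        if i != offset_idx and col[:-1] == gaps:
            return i
    return None


# Invented entries: (file id, offset, size)
entries = [struct.pack("<III", *e) for e in [(0, 100, 40), (1, 140, 10), (2, 150, 300)]]
cols = dword_columns(entries)
for idx in increasing_columns(cols):
    print(idx, size_column_for(cols, idx))
```

In the invented sample both the ID and the offset column increase, but only the offset column has a matching size column, which is what tells the two apart.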

Where we go from here is to be determined. We could go to each offset and look for a known file header, or even any string of 4-bytes that might indicate a header. We could also try to pick out whether the files are compressed by searching in the entries for a field which is always bigger than the File Size. We should also look for a pointer either at the start or the end of the archive which points to the directory, and a numFiles field.
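
For the directory pointer and numFiles search, a rough Python sketch: once the directory offset and the file count are known from the earlier steps, scan the first and last few bytes of the archive for dwords holding those values. The 32-byte window, the byte-by-byte (unaligned) scan, and the function name are assumptions.

```python
import struct


def find_header_fields(data: bytes, dir_offset: int, num_files: int, window: int = 32):
    """Positions near the start or end whose dword equals dir_offset or num_files."""
    head = range(0, max(0, min(window, len(data)) - 3))
    tail = range(max(0, len(data) - window), max(0, len(data) - 3))
    hits = {"dir_offset": [], "num_files": []}
    for pos in sorted(set(head) | set(tail)):
        value = struct.unpack_from("<I", data, pos)[0]
        if value == dir_offset:
            hits["dir_offset"].append(pos)
        if value == num_files:
            hits["num_files"].append(pos)
    return hits


# Invented archive: tag, numFiles = 3, directory offset = 64, filler, 36 directory bytes.
archive = b"ARCH" + struct.pack("<II", 3, 64) + b"\x00" * 52 + b"\xAA" * 36
print(find_header_fields(archive, dir_offset=64, num_files=3))
```

Small values such as a file count of 1 or 2 will produce plenty of false hits, so this is better used to confirm a guess than to make one.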


The things to look out for...
1. Archives that are chunk-based (ie archives that do not have a directory, rather they have a format like header-file-header-file-etc. for the entire file, thus the files are stored in chunks.)
2. Variable-length entries will be a pain
3. Long file headers may interfere with the determination of a filename
4. Similarly, if directories are stored in the archive (ie if it contains a nested directory structure) then it will be quite difficult to determine the structure of the archive, and the directory names will interfere with filename detection


Thats all I can think of at this point in time, feel free to critique.

WATTO
watto@watto.org
http://www.watto.org

Posted: Fri Aug 13, 2004 3:55 pm
by friendsofwatto
Here is a list of the many different fields that I have encountered, ranked approximately by their level of occurrence (with the most common fields at the top). Add some more if you find any I have missed.

Data Offset
Compressed File Size (also includes the File Size field in non-compressed archives)
Number Of Files
Archive Header (usually 4-byte, but sometimes 2, 4, 6, 8, or a really large number)
Archive Version (sometimes as a number, sometimes as a string)
Filename
Filename Fillers (usually 0-3 bytes, to pad the file entries to multiples of 4 bytes)
Filename Offset
Filename Length
File Tag / Extension (a 4-byte String representing the files extension)
Archive Size
Header Size
File Data Size (total size of all file data)
Directory Size / First File Offset (typically the same thing)
Nested Directory Offset
Filename Directory Offset
Uncompressed File Size
Number Of Directories
File Entry Size
Compression Type
Archive Name
Timestamp (typically as a 4-byte count of seconds from 1/1/1970)
Checksum (usually 8-bytes)
Unicode Filename
Archive Creation Year (as a string eg "2002")


Also, there are quite a few archives that use fixed padding sizes for directory entries, files, and anything else. Common padding sizes I have found are 2048, 512, 128, 64, 32, and 4. The 4 padding is most used in file entries in the directory, used when an arbitrary-length filename is allowed, where each filename is padded to a multiple of 4 bytes by adding 0-3 nulls.
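
The padding arithmetic is short enough to sketch in Python. The padding sizes are the ones listed above; the function names are made up, and `guess_padding` is only one possible heuristic (largest listed size that divides every file offset).

```python
COMMON_PADDINGS = (2048, 512, 128, 64, 32, 4)


def pad_len(length: int, multiple: int = 4) -> int:
    """Number of filler bytes needed to reach the next multiple."""
    return (-length) % multiple


def padded_size(length: int, multiple: int = 4) -> int:
    """Length after padding up to a multiple."""
    return length + pad_len(length, multiple)


def guess_padding(offsets):
    """Largest common padding size that divides every offset, or None."""
    if not offsets:
        return None
    for p in COMMON_PADDINGS:
        if all(o % p == 0 for o in offsets):
            return p
    return None


print(pad_len(9), padded_size(13))
print(guess_padding([512, 1536, 2048]))
```

Whether a terminating null counts towards the filename length before padding differs between formats, so the 0-3 filler count may be off by one against a real archive.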


Posted: Fri Aug 13, 2004 4:01 pm
by Mr.Mouse
Okay, well, we are now talking about Phase 2 of the identification process (skipping phase 1 I talked about: identifying known formats, the Current_Match function). :P Do you agree with the approach I came up with for that?

As for Phase 2, I did filename identification back in 2000 with Mexscan. Here's a piece of the docs that came with it:
mexscan.doc wrote:
msSCFILEHEADER
--------------

Example log text:

;**** Header Scan Initiated ****
;17 >> C:\TOOLS\MC\BREAD.EXE >> DELTA 17 >> AL 20 >> F 15
;281 >> C:\TOOLS\MC\EDITOR.MCT >> DELTA 264 >> AL 19 >> F 14
;545 >> C:\TOOLS\MC\FILE.MKD >> DELTA 264 >> AL 21 >> F 5
;809 >> C:\TOOLS\MC\LFE.EXE >> DELTA 264 >> AL 22 >> F 16
;1073 >> C:\TOOLS\MC\MC.BAT >> DELTA 264 >> AL 23 >> F 11
;1337 >> C:\TOOLS\MC\MC.INI >> DELTA 264 >> AL 23 >> F 11
;1601 >> C:\TOOLS\MC\MC.PIF >> DELTA 264 >> AL 23 >> F 5
;1865 >> C:\TOOLS\MC\MCMAIN.EXE >> DELTA 264 >> AL 19 >> F 9
;2129 >> C:\TOOLS\MC\MCVIEW.EXE >> DELTA 264 >> AL 19 >> F 14
;2393 >> C:\TOOLS\MC\MULTIAD.EXE >> DELTA 264 >> AL 18 >> F 9
;2657 >> C:\TOOLS\MC\MULTIEX.EXE >> DELTA 264 >> AL 18 >> F 13
;2921 >> C:\TOOLS\MC\multiex.imp >> DELTA 264 >> AL 18 >> F 4
;3185 >> C:\TOOLS\MC\MULTIEX.INI >> DELTA 264 >> AL 18 >> F 13
;3449 >> C:\TOOLS\MC\multiex.lst >> DELTA 264 >> AL 18 >> F 9
;3713 >> C:\TOOLS\MC\README.1ST >> DELTA 264 >> AL 19 >> F 9
;3977 >> C:\TOOLS\MC\selected.mkd >> DELTA 264 >> AL 17 >> F 8
;4241 >> C:\TOOLS\MC\TREAD.EXE >> DELTA 264 >> AL 20 >> F 15
;!! Possible HEADER at approx. 17, 16 files 264 spacing !!
;Most probable header : offset 374456, at least 48 files.
;=-*** Header Scan Ended ***-=

The log said it initiated the header scan and then reports any
filename found: this has the following format:

<offset> >>
<text> >>
DELTA <number of bytes distance from previous reported filename> >>
AL <percentage of non-letter character>
F <average frequency of all the different characters in the text>

When MEXSCAN sees that the DELTA is the same it counts this as a
possible header, and counts the number of possible files in the
datafile, until it finds a possible filename that has a different
DELTA. It then gives a summary of the previous header:
;!! Possible HEADER at approx. 17, 16 files 264 spacing !!
When the whole file has been scanned it will give the most probable
header(containing filenames) found and ends by saying that the scan
is done.
What it did was look for strings of Latin characters (I think that's the type we use in the Western world languages, if I remember correctly from my history lessons) of at least a set length (probably 8 or something) and calculate the percentage of non-letter characters (e.g. /, \, . ). If that was below a set percentage, it would consider it a valid text string, but only when the percentage of each different character in the string was also below a set percentage (to avoid reading strings that are actually just 1 specific character repeated multiple times; think of graphic files and you know what it was trying to avoid! :D ). It would remember where the string began and continue the search. When it stumbled upon a second valid string, it would subtract the position of the previous string from that of the new one to get a Delta. It would then search on for the next and start comparing deltas. If they were equal, it would start to count them as header entries. This would stop whenever it encountered a string that showed a different delta. Then it would note down the number of possible entries, the spacing, and the offset of the first "filename". The search would continue, so it might note down multiple possible headers.
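
A rough Python sketch of that scan, for discussion only. It is not the Mexscan code: the 8-character minimum is the "probably 8" from above, the two percentage limits (40 and 50) and the minimum of 3 names per header are guesses, and all names are hypothetical.

```python
import re
from collections import Counter

RUN = re.compile(rb"[\x20-\x7e]{8,}")  # runs of 8+ printable ASCII bytes


def plausible(text: bytes, max_nonletter: float = 40.0, max_single: float = 50.0) -> bool:
    """Reject strings with too many non-letters or dominated by one character."""
    nonletter = 100.0 * sum(not chr(b).isalpha() for b in text) / len(text)
    top = 100.0 * Counter(text).most_common(1)[0][1] / len(text)
    return nonletter <= max_nonletter and top <= max_single


def scan(data: bytes):
    """Return (offset, text) for every plausible string in the data."""
    return [(m.start(), m.group()) for m in RUN.finditer(data) if plausible(m.group())]


def group_by_delta(hits):
    """Group consecutive hits with equal spacing into (first_offset, count, spacing)."""
    headers = []
    i = 0
    while i + 1 < len(hits):
        delta = hits[i + 1][0] - hits[i][0]
        j = i + 1
        while j + 1 < len(hits) and hits[j + 1][0] - hits[j][0] == delta:
            j += 1
        if j - i >= 2:  # at least 3 names at the same spacing (assumed threshold)
            headers.append((hits[i][0], j - i + 1, delta))
        i = j
    return headers


# Invented data: 17 header bytes, four 32-byte entries, then a run of one repeated character.
names = [b"data\\level1.map", b"data\\level2.map", b"sound\\music.wav", b"textures\\wall.bmp"]
data = b"\x00" * 17 + b"".join(n.ljust(32, b"\x00") for n in names) + b"A" * 12
for first, count, spacing in group_by_delta(scan(data)):
    print(f";!! Possible HEADER at approx. {first}, {count} files {spacing} spacing !!")
```

The repeated-character run at the end of the sample is dropped by the single-character limit, which is the graphics-data case mentioned above.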

The problem with this approach was that pieces of text might be something completely different (game related) but still at regular intervals. What we should implement is stronger checking to see whether something is really a filename, and whether it is really a header. Perhaps you're right to say it should do this immediately whenever it finds such a string. I think, though, that the filename string identification code needs to be spot-on first.

[EDIT]Look above for the text in bold.

Posted: Fri Aug 13, 2004 4:10 pm
by Rahly
this is similar to the unix "strings" command.

http://linux.about.com/library/cmd/blcmdl_strings.htm

Posted: Fri Aug 13, 2004 4:37 pm
by friendsofwatto
Sorry, I wasn't really going by phases or anything, that's just a lot of random drivel that I thought about when you first mentioned the project to me.

Anyway, I have now read what you posted about phase 1, and I think I see what you are trying to say - you think we should define some kind of standard system for checking files against known types? For example, you would like to develop another mexscript-like system where we check for certain aspects in a file and, if matching, present the probability that the file is of a type we already know. Is that what you mean?

If so, I think that is probably the way to go, or maybe we can simply have 2 buttons for the user so that they can perform the 2 different techniques individually from each other. I can see your phase 1 idea being pretty good because we can keep it updated with the latest formats using a plugin-like interface much like you do now in your MexCom, or like my Game Extractor.

So yes, this is a good idea, and wouldn't be too difficult to implement.

And I agree completely with the filename issue you mentioned, and I can see we will need to think very carefully about the technique we use to verify whether it is a valid filename. Here are some things I can see at the moment...

1. It must be at least 5 characters long, and less than 256 characters long
2. It should preferably contain a dot or a directory slash, this would greatly improve the probability of a filename
3. It must contain some identifier of the filename length, either by a trailing null or a length field before the name. If looking for a null, it must not encounter any non-English characters before it.

We can also look for a drive letter or a directory slash as the first character - some archives store filenames with them included.
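
The rules above could be turned into a simple score, sketched here in Python. Rules 1 and 3 are treated as hard requirements and the rest as bonus points; the one-point weights, printable ASCII as the "English" test, and the function name are assumptions, not anything decided here.

```python
import re

DRIVE = re.compile(r"^[A-Za-z]:[\\/]")  # e.g. "C:\" at the start of the name


def filename_score(candidate: bytes, terminated: bool) -> int:
    """Return 0 if a hard rule fails, otherwise a score (higher = more likely).

    `terminated` says whether a trailing null or a length field was found (rule 3).
    """
    if not (5 <= len(candidate) < 256):                 # rule 1
        return 0
    if any(b < 0x20 or b > 0x7E for b in candidate):    # non-English character
        return 0
    if not terminated:                                  # rule 3
        return 0
    text = candidate.decode("ascii")
    score = 1
    if "." in text:                                     # rule 2: dot
        score += 1
    if "\\" in text or "/" in text:                     # rule 2: directory slash
        score += 1
    if DRIVE.match(text) or text[0] in "\\/":           # drive letter or leading slash
        score += 1
    return score


for name in (b"C:\\TOOLS\\MC\\MC.INI", b"data/level1.map", b"README", b"abc"):
    print(name, filename_score(name, terminated=True))
```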


Just throwing around ideas.
