Posted: Fri Aug 13, 2004 5:24 pm
by Rahly
This could potentially be a bigger project than Mexcom, as in the future it will probably "merge" with Mexcom.

Here is a small outline that I think might work.

1) Try all known formats (plugin call (IsArchive) or by MexScript)

2) Search the archive for known archive formats (archive in an archive)

3) Attempt to uncompress the file if it's a known compression type

4) Search the file for known non-archive file types (picture files, word processing files, compressed files)

5) Attempt pattern recognition

Step 5 shouldn't be used to actually TRY to unarchive, but rather to email us, the developers, what it found in the archive.
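
As a rough illustration of step 4, here is a minimal Python sketch that scans a blob for a few well-known file signatures. The table name SIGNATURES and the function scan_for_signatures are made up for this example and are not part of Mexcom or any plugin API; only the magic bytes themselves are the standard ones for each format.

```python
# Hypothetical signature table. The byte strings are the well-known leading
# bytes of each format; the table and function names are illustrative only.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"GIF87a": "GIF image",
    b"GIF89a": "GIF image",
    b"PK\x03\x04": "ZIP local file header",
    b"\x1f\x8b\x08": "gzip stream (deflate)",
    b"RIFF": "RIFF container (WAV/AVI)",
}


def scan_for_signatures(data: bytes):
    """Return sorted (offset, description) pairs for every signature hit."""
    hits = []
    for magic, name in SIGNATURES.items():
        pos = data.find(magic)
        while pos != -1:
            hits.append((pos, name))
            pos = data.find(magic, pos + 1)
    return sorted(hits)
```

Short signatures such as RIFF will produce false positives inside binary data, so a hit is only a hint that a known file may start at that offset.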

back on this "new script", i'm thinking about maybe a specialized RegEx (Regular Expression) language for it.

Posted: Fri Aug 13, 2004 5:39 pm
by Rahly
Question: What do you do about split archives? Like in Prince of Persia, they have the index in one file, and binary data in the other file?

Posted: Fri Aug 13, 2004 7:12 pm
by Captain
friendsofwatto wrote:OK, these are the things I can think up at the moment, so bear with me :)

Filename searches shouldn't be too difficult to do - we should consider something a filename if it contains at least 5 characters in the English character set - this gets around the format types that include 4-byte extension strings or tags (i call them chunk-based archives). We could also confirm the presence of a filename by locating a . character in it, but this is not always the case.

Once we find a filename, we should first try the bytes before it for something which may provide the filename size - this will either be a 1-byte, 2-byte, or a 4-byte value. Failing this, we should then read through until the first null, making sure to check each character as we go, because if a non-english character appears before the null, it probably isn't a file name.
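
The filename search described so far could be sketched in Python roughly as follows. This is only one reading of the idea: the character class standing in for "the English character set", the little-endian byte order of the length prefix, and the name find_name_candidates are all assumptions made for the example.

```python
import re
import struct

# Assumption: "English character set" is read here as letters, digits and a
# few punctuation characters that commonly appear in file names.
NAME_RUN = re.compile(rb"[A-Za-z0-9_\-./\\ ]{5,}")


def find_name_candidates(data: bytes):
    """Yield (offset, name, kind) for each run of 5+ filename-like bytes."""
    for m in NAME_RUN.finditer(data):
        start, end = m.span()
        length = end - start
        kind = "unknown"
        # Look for a 4-, 2- or 1-byte little-endian length just before the run.
        for width, fmt in ((4, "<I"), (2, "<H"), (1, "<B")):
            if start >= width:
                (value,) = struct.unpack_from(fmt, data, start - width)
                if value == length:
                    kind = f"length-prefixed ({width}-byte)"
                    break
        # Failing that, accept the run if a null byte follows it directly.
        if kind == "unknown" and end < len(data) and data[end] == 0:
            kind = "null-terminated"
        yield start, m.group().decode("ascii"), kind
```

A real tool would also have to cope with length fields that count the trailing null, and with length bytes that happen to be printable and get absorbed into the run.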

Once we have a file name confirmed, we should find the next filename. This will help us discover an important fundamental of the archives - the entry size for each file. Alternatively, it will also tell us if there is a separate filename directory, usually constructed of all the filenames separated by a single null character - obviously this is the case if the second filename starts directly after the first.

If we can find out the entry size, we should then check by skipping forward those X bytes and check that the beginning of the filename starts at this position - if not then we are probably dealing with a variable-length entry size, usually brought about by allowing filenames of any length.
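
The entry-size check can be sketched in a few lines of Python, assuming we already have the offsets at which filenames were found. The function name guess_entry_size and the "at least three names" threshold are arbitrary choices for the example.

```python
def guess_entry_size(name_offsets):
    """Return the fixed entry size if the names are evenly spaced, else None."""
    if len(name_offsets) < 3:
        return None  # too few samples to trust a pattern
    strides = {b - a for a, b in zip(name_offsets, name_offsets[1:])}
    # Exactly one distinct stride means every entry has the same size;
    # anything else suggests variable-length entries.
    return strides.pop() if len(strides) == 1 else None
```

For example, names found at offsets 16, 48, 80 and 112 suggest 32-byte entries, while 16, 30 and 80 suggest variable-length entries.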

Anyway, from there we should try to figure out the rest of the fields in each entry. Knowing that most fields are 4-byte fields, we should start with that and see what patterns we can find. We should look for patterns of increasing value first, which would indicate either a File ID field or a File Offset field. Once we determine the File Offset field, the next logical step is to look for a File Size field.
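
Here is a hedged Python sketch of that field hunt. It assumes fixed-size entries, 4-byte little-endian fields aligned on 4-byte boundaries, and files stored back to back with no padding (so each offset plus its size equals the next offset). The function name and its parameters are invented for the example.

```python
import struct


def find_offset_and_size_fields(data, dir_start, entry_size, num_entries):
    """Guess which 4-byte columns hold file offsets and file sizes.

    Returns (offset_column, size_column); either value may be None.
    """
    # Read every aligned 4-byte column of the directory into a list.
    columns = {}
    for col in range(0, entry_size - 3, 4):
        columns[col] = [
            struct.unpack_from("<I", data, dir_start + i * entry_size + col)[0]
            for i in range(num_entries)
        ]
    # Strictly increasing columns are candidates for File ID or File Offset.
    increasing = [c for c, v in columns.items()
                  if all(a < b for a, b in zip(v, v[1:]))]
    for off_col in increasing:
        offs = columns[off_col]
        for size_col, sizes in columns.items():
            if size_col == off_col:
                continue
            # Tightly packed files: offset[i] + size[i] == offset[i + 1].
            if all(o + s == n for o, s, n in zip(offs, sizes, offs[1:])):
                return off_col, size_col
    return (increasing[0] if increasing else None), None
```

The size check is what separates a File Offset column from a File ID column, since both increase from entry to entry. With only one or two entries the checks pass trivially, so the result means little on tiny directories.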

Where we go from here is to be determined. We could go to each offset and look for a known file header, or even any string of 4-bytes that might indicate a header. We could also try to pick out whether the files are compressed by searching in the entries for a field which is always bigger than the File Size. We should also look for a pointer either at the start or the end of the archive which points to the directory, and a numFiles field.


The things to look out for...
1. Archives that are chunk-based (ie archives that do not have a directory, rather they have a format like header-file-header-file-etc. for the entire file, thus the files are stored in chunks.)
2. Variable-length entries will be a pain
3. Long file headers may interfere with the determination of a filename
4. Similarly, if directories are stored in the archive (ie if it contains a nested directory structure) then it will be quite difficult to determine the structure of the archive, and the directory names will interfere with filename detection


That's all I can think of at this point in time, feel free to critique.

WATTO
watto@watto.org
http://www.watto.org
I agree. This is an approach I would use. It wouldn't be suitable for all archives around, but would fill in some significant blanks in most of them.

It's probably impossible to automate the entire process, so I think the best thing would be to create a semi-automatic hex-editor with some specified controls to easily find patterns, while the user and his superior brain figures out how the general structure works. The user would then use different controls to feed the app that knowledge, and the tool spits out a script.

I would suggest a fault-tolerant interface in which the user can instantly see the result of his actions. By this I mean that whenever the user changes something in the test structure he's creating, the tool can immediately perform a scan on the file we're researching, and show the result in a results window. One window would hold the directory and filename structure, as interpreted by the tool. When clicking on a filename in that structure, if the tool has enough knowledge of the archive type, the tool immediately shows the contents of that file, in hex or whatever format the user chooses. I think that would at least speed up the discovery process by an order of magnitude.

For Mr.Mouse's idea I would suggest an object oriented approach where, for each archive type, a specific class is derived from a main ArchiveSignature class, and has specific implementations of methods for detection, analysis, and other stuff. I don't know if this can be done in VB, but if it's possible it would keep things tidy. This way, the tool can just loop through the available classes, and apply them to the archive(s) we're investigating.
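
In Python terms, the class-per-format idea might look like the sketch below. Only the name ArchiveSignature comes from the post; the subclasses, the detect method and the identify loop are assumptions, and the two signatures are the standard leading bytes of ZIP and gzip files.

```python
class ArchiveSignature:
    """Base class; each known format overrides detect()."""
    name = "unknown"

    def detect(self, data: bytes) -> bool:
        raise NotImplementedError


class ZipSignature(ArchiveSignature):
    name = "ZIP"

    def detect(self, data: bytes) -> bool:
        # Standard ZIP local file header magic.
        return data.startswith(b"PK\x03\x04")


class GzipSignature(ArchiveSignature):
    name = "gzip"

    def detect(self, data: bytes) -> bool:
        # Standard gzip magic bytes.
        return data.startswith(b"\x1f\x8b")


def identify(data: bytes):
    """Loop through every registered signature class and report the first hit."""
    for cls in ArchiveSignature.__subclasses__():
        if cls().detect(data):
            return cls.name
    return None
```

Adding a format then means adding one subclass; the loop itself never changes.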

Posted: Fri Aug 13, 2004 7:22 pm
by Mr.Mouse
Rahly wrote:this is similar to the unix "strings" command.

http://linux.about.com/library/cmd/blcmdl_strings.htm
Similar but not the same. I don't think the two use the same approach in identifying a valid string. :)

@PHASE 1

Although there may be two buttons for starting Phase 1 or Phase 2, I don't think it would make much sense to offer them. I think Phase 1 should just precede Phase 2. If an archive is known to the program, I need not start Phase 2, right? Likewise, suppose a user clicks to start Phase 2 (which may have some settings to set, as Rahly suggests) but the archive format is already known? If Phase 1 did not precede Phase 2, Phase 2 would be initialized, taking quite some time and trying to figure out something Phase 1 could have told it beforehand.

Yes, I do suggest some kind of very easy script to go through a file, looking for clues that define it as a certain format. It must be able to compare almost anything anywhere in the file with what we ask it to compare with. Besides that, it must also be able to perform specific tasks, such as following a route (go to offset A, read the long there, try to jump to it... SUCCESS? Good, next criterion. FAILURE? Not the known format).
This jumping around is necessary, because many formats don't use clear identity tags whatsoever, so you must try to follow the route (e.g. get the resource offset, try to jump to it, get the resource size, try to virtually extract it; OOPS! Resource larger than the actual archive? FAILURE! Not the known format, etc.). Perhaps it need not be a script, but a set of commands that we can easily use to code additional formats.
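
To make the route idea concrete, here is a small Python sketch for an entirely made-up format: a 4-byte little-endian pointer at some offset leads to a (resource offset, resource size) pair, and both must stay inside the file. The layout, the function name follows_route and the parameter pointer_at are hypothetical.

```python
import struct


def follows_route(data: bytes, pointer_at: int = 0) -> bool:
    """Follow pointer -> (resource offset, resource size) and sanity-check it."""
    if pointer_at + 4 > len(data):
        return False  # FAILURE: no room for the pointer itself
    (dir_offset,) = struct.unpack_from("<I", data, pointer_at)
    if dir_offset + 8 > len(data):
        return False  # FAILURE: the jump lands outside the file
    res_offset, res_size = struct.unpack_from("<II", data, dir_offset)
    if res_offset + res_size > len(data):
        return False  # FAILURE: resource larger than the actual archive
    return True       # SUCCESS: move on to the next criterion
```

Each early return corresponds to one "FAILURE, not the known format" exit on the route; a command set or script language would string several such checks together.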

[EDIT]@Captain: That would be nice, to do it with classes, yet it would require creating new executables each time you wish to add a new check, would it not? I'm trying to avoid that.
But indeed, it will be a big challenge to make the tool user-friendly. We must make sure, however, not to create a vastly complicated tool that will put off those I personally would like to see use it: anybody. We may offer an advanced version for those with the superior brains you describe, but a stripped-down yet smart version would certainly be a good thing.
It should not be for coders. Thus, the interface should not be for coders either. And that's not always easy.

Posted: Fri Aug 13, 2004 7:35 pm
by Mr.Mouse
Rahly wrote:back on this "new script", i'm thinking about maybe a specialized RegEx (Regular Expression) language for it.
Enlighten me, please, there's some terminology I haven't heard, don't forget, I'm your basic hobbycoder. :D
Rahly wrote:This could potentially be a bigger project than Mexcom, as in the future it will probably "merge" with Mexcom.

Here is a small outline that I think might work.

1) Try all known formats (plugin call (IsArchive) or by MexScript)

2) Search the archive for known archive formats (archive in an archive)

3) Attempt to uncompress the file if it's a known compression type

4) Search the file for known non-archive file types (picture files, word processing files, compressed files)

5) Attempt pattern recognition

Step 5 shouldn't be used to actually TRY to unarchive, but rather to email us, the developers, what it found in the archive.
I agree with the latter, but it may be an option for the user to try at their own risk (or for people who know what they're doing; I'm thinking along the lines of Captain's suggestions, and my own of having an Advanced mode or version).
Your roadmap to Total Archive Demystification is one I imagine as well.

Posted: Fri Aug 13, 2004 8:52 pm
by Rahly
Mr.Mouse wrote:That would be nice, to do it with classes, yet it would require creating new executables each time you wish to add a new check, would it not? I'm trying to avoid that.
But indeed, it will be a big challenge to make the tool user-friendly. We must make sure, however, not to create a vastly complicated tool that will put off those I personally would like to see use it: anybody. We may offer an advanced version for those with the superior brains you describe, but a stripped-down yet smart version would certainly be a good thing.
It should not be for coders. Thus, the interface should not be for coders either. And that's not always easy.
It could be something as simple as making our own classable scripting engine, or using a popular modular one that can be imported. That way users just download a "compiled" script and the program auto-imports it. Or even better, they just download the script itself, and the program compiles it as needed.
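
For what "download the script and compile it as needed" can look like in an interpreted language, here is a minimal Python sketch. It assumes each plugin script defines a function called is_archive (borrowing the IsArchive name from the outline earlier in the thread); load_plugin is a made-up helper, and running downloaded code this way is only safe if the scripts are trusted.

```python
def load_plugin(source: str, name: str = "plugin"):
    """Compile plugin source text at runtime and return its is_archive function."""
    namespace = {}
    # compile() turns the text into a code object; exec() runs it so that the
    # functions it defines end up in `namespace`.
    exec(compile(source, name, "exec"), namespace)
    return namespace["is_archive"]


# Usage: the script text could just as well come from a downloaded file.
SCRIPT = (
    "def is_archive(data):\n"
    "    return data.startswith(b'ABCD')\n"
)
is_archive = load_plugin(SCRIPT)
```

No new executable is needed to add a format this way; only a new script file.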
Mr.Mouse wrote:Enlighten me, please, there's some terminology I haven't heard, don't forget, I'm your basic hobbycoder.
Regular expressions are one-liners that let you test for string validity. They became really popular with Perl, and a lot of languages (PHP, Python, lots of others) use PCRE (Perl Compatible Regular Expressions) or a similar Perl-style engine. I'll show you some of the ones I've written:

/\s*(.*?)\s*\=\s*(.*?)\s*$/

/(".*?"|.*?)(\s+|\Z)/

/^(.*?)(?:\s|\Z)(.*)$/

now i'll give you a few simple ones

string =~ /^:/

This means "does the string begin with a :" (true/false). You may think, oh.. wow.. so what?.. that's easy to write. But wait.

string =~ /^H\s+E\s+L.+L\s*O$/

whoa.... what's that?! It's basically saying
"does the string begin with an H, followed by one or more whitespace characters, followed by an E, followed by one or more whitespace characters, followed by L, followed by one or more of any character, followed by L, followed by zero or more whitespace characters, followed by O at the end of the line". Try writing the code to parse that in one line ;-).

H E L L O = True
H E L LO = True
H E LxL O = True
HxE L L O = False
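
The four results above can be reproduced with Python's re module, which accepts the same Perl-style syntax for this pattern. The helper name matches is just for the example.

```python
import re

# Same pattern as above, written as a Python raw string.
HELLO = re.compile(r"^H\s+E\s+L.+L\s*O$")


def matches(text: str) -> bool:
    """True if the whole pattern is found in text, like Perl's =~ test."""
    return HELLO.search(text) is not None


for sample in ("H E L L O", "H E L LO", "H E LxL O", "HxE L L O"):
    print(sample, "=", matches(sample))
```

The last sample fails because \s+ needs at least one whitespace character after the H, and it finds an x instead.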

A lot of programming utilities allow regex search expressions. I'm just saying we could "adapt" one to our standards. I had started one but it was only meant for * or ? wildcards in filenames.
Mr. Mouse wrote:I agree with the latter, but it may be an option for the user to try at their own risk (or for people who know what they're doing; I'm thinking along the lines of Captain's suggestions, and my own of having an Advanced mode or version).
Your roadmap to Total Archive Demystification is one I imagine as well.
I'm for this. I wrote a visual control for Delphi, for hex editing, that looks exactly like Hex Workshop (my favorite hex editor); it wouldn't be hard to convert it to an ActiveX control to import into VB. It's unfinished, though, as I only meant it for display purposes; I still need to add highlighting and editing. I can give you screenshots of the control if you want. I would still like a popup, though: "Format Not Found, email data to developers?".

Posted: Fri Aug 13, 2004 10:40 pm
by Mr.Mouse
Yeah, there must be an option for users to send whatever log the program created after any run.

Ah, now I understand the RegEx, though the syntax is something I had not seen before, but then again, I haven't looked. Captain's been bugging me for a long time: "Dammit man, you should get off your lazy ass and learn Python!" and I reply: "Yes, man, you are right", but never get to it. Guess it's just because I have another schedule. :wink: Captain's done a fab job on OpenMex, a project now cancelled, but one that showed good promise to replace MultiEx Commander. The amount of time to actually do such a thing was just too much for him to handle, and it's true, and that's what puts me off as well: having to rewrite everything in another language.
Ok, so far for my excuses for not coding in 'better' languages. :oops:

I see, however, the possibilities of creating our own Regular Expression set.
The problem is that Eureka would have to perform very specific tasks, like those I mentioned before, before it could successfully 'tag' an archive.
If you could offer Eureka a script to do that, that would ultimately not be unlike having Eureka compile a script first as it would compile it to tasks set by the commands of the script; ergo it might as well perform them instead of compiling them. Still, the question is in any case: what script to use? I could try and adapt the MexScript and make it perform the tasks needed, and return the results. But that would interfere with or rely on 'legacy' code, that may not all be needed, and may not be what we want in a new program. Decisions, decisions.. :)

Posted: Fri Oct 01, 2004 6:32 pm
by Guest
Mr.Mouse wrote:having to rewrite everything in another language.
Ok, so far for my excuses for not coding in 'better' languages. :oops:
Why not use .NET? You can use multiple programming languages in 1 application. Also, you don't need to create your own scripting language for plugins, .NET can import (and run) uncompiled code during runtime!

Posted: Fri Oct 01, 2004 10:04 pm
by Rahly
Anonymous wrote:
Mr.Mouse wrote:having to rewrite everything in another language.
Ok, so far for my excuses for not coding in 'better' languages. :oops:
Why not use .NET? You can use multiple programming languages in 1 application. Also, you don't need to create your own scripting language for plugins, .NET can import (and run) uncompiled code during runtime!
That's not the same thing. .NET can import uncompiled .NET code, which is a byte code similar to Java's. It cannot import readable languages such as Pascal, C, C++, Basic, etc. You still need a compiler for that, like Visual Studio. You've always been able to use multiple languages in one application: I can compile a Pascal unit down and import it into a C program, or vice versa. The only true advantage of .NET would be its ability to run on multiple platforms. Since PDAs are limited in memory and Mexcom usually works with large files (archives), running on a PDA really isn't feasible. The only thing left would be Linux, and that is still beta; you still can't run a compiled .NET app from VS on Linux with Mono. If we truly wanted to be multiplatform we would go with Java, since it's supported on 100x the platforms that .NET is.

Posted: Sat Oct 02, 2004 12:01 pm
by Captain
Rahly wrote:
Anonymous wrote:
Mr.Mouse wrote:having to rewrite everything in another language.
Ok, so far for my excuses for not coding in 'better' languages. :oops:
Why not use .NET? You can use multiple programming languages in 1 application. Also, you don't need to create your own scripting language for plugins, .NET can import (and run) uncompiled code during runtime!
That's not the same thing. .NET can import uncompiled .NET code, which is a byte code similar to Java's. It cannot import readable languages such as Pascal, C, C++, Basic, etc. You still need a compiler for that, like Visual Studio. You've always been able to use multiple languages in one application: I can compile a Pascal unit down and import it into a C program, or vice versa. The only true advantage of .NET would be its ability to run on multiple platforms. Since PDAs are limited in memory and Mexcom usually works with large files (archives), running on a PDA really isn't feasible. The only thing left would be Linux, and that is still beta; you still can't run a compiled .NET app from VS on Linux with Mono. If we truly wanted to be multiplatform we would go with Java, since it's supported on 100x the platforms that .NET is.
Yes I agree. Java seems like a good choice for this, although I prefer Python for building tools (I have had very good experiences with the Python/wxWindows combo). Java can get overly complex for simple tasks at times, as in, metaphorically speaking, when I want a simple candlestick I don't want to have to build an entire christmas tree, just to support my candle.

And Mr. Mouse doesn't know Java (or OOP in general as far as I know), so we'll have to drag him in kicking and screaming. And let me tell you, that's not an easy thing to do. Furthermore, Mr.Mouse is on his honeymoon now and there's no telling what his priorities will be when he returns as a married man. lol.

Oh and just to finish up this rambling, garbled post, I know regular expressions very well so I can help out when necessary.