algorithm for searching/recognizing strings automatically?
-
Rheini
- Moderator
- Posts: 652
- Joined: Wed Oct 18, 2006 9:48 pm
- Location: Germany
algorithm for searching/recognizing strings automatically?
Just wondered if there is a way to scan a file for strings, i.e. to recognize that "this is not some random binary data, this is a string", just like humans do.
- aluigi
- VVIP member

- Posts: 1916
- Joined: Thu Dec 08, 2005 12:26 pm
- Location: www.ZENHAX.com
Re: algorithm for searching/recognizing strings automatically?
The problem is the charset, because a wider charset results in tons of false positives.
For example, English text is basically made up of the first 128 values of the ASCII table (0 to 127), excluding most of the control characters at the start, which is good because you can scan an arbitrary sequence of bytes and tell fairly easily whether it is a string or not. (Obviously a string can contain any byte between 1 and 255 that its creator chose, so it's not possible to be 100% sure of the result... that's logical.)
But other languages use other characters too, for example the accented letters used in Italian, French, Spanish and German, which are all encoded with bytes of 128 and above, and the consequence is a larger number of false positives, which grows further if we consider the Asian languages (practically any byte).
Another important thing is the length of the string, which should be at least 4 bytes to reduce the false positives caused by 32-bit numbers stored in the file.
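To make the two rules above concrete (a restricted charset plus a minimum length), here is a minimal Python sketch. The function name `find_ascii_strings`, the choice of tab plus bytes 0x20 to 0x7E as the "English" charset, and the sample data are assumptions for illustration, not taken from any particular tool.

```python
import re

def find_ascii_strings(data: bytes, min_len: int = 4):
    """Yield (offset, text) for each run of printable ASCII of at least min_len bytes."""
    # Assumed charset: tab plus 0x20-0x7E. Control bytes and anything >= 0x7F end a run.
    pattern = re.compile(rb"[\t\x20-\x7e]{%d,}" % min_len)
    for match in pattern.finditer(data):
        yield match.start(), match.group().decode("ascii")

# Made-up sample: two real strings, one 3-byte run that is too short, some binary noise.
sample = b"\x00\x01\x02hello world\x00\xff\xfeabc\x00\x10data.pak\x00"
print(list(find_ascii_strings(sample)))
# → [(3, 'hello world'), (22, 'data.pak')]
```

Widening the character class to include bytes of 128 and above would catch accented letters, at the cost of the extra false positives described above.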
If you are interested, I worked on this some months ago for a string recognition project.
To avoid false positives, and above all to allow the strings to be reinserted (for example, to translate an executable into another language using a new string that is longer than the original), I added scanning of the assembly instructions in the executable and checking of the strings referenced by those instructions (like "PUSH offset", "mov [eax], offset" and so on).
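As a rough illustration of the cross-reference idea (this is not how exestringz is implemented, just a sketch), the Python below looks for the 32-bit x86 `PUSH imm32` opcode (0x68) and checks whether its operand points at the start of a known string. It rests on loud assumptions: a hypothetical `image_base` of 0x400000, a flat mapping where virtual address = image base + file offset (real executables map sections differently), and no real disassembly, so a 0x68 byte inside data can still be misread as an instruction.

```python
import struct

def referenced_strings(data: bytes, string_offsets, image_base: int = 0x400000):
    """Return the string offsets that appear as the operand of a PUSH imm32."""
    wanted = set(string_offsets)
    found = set()
    # Stop early enough that 4 operand bytes always follow the opcode byte.
    for i in range(len(data) - 4):
        if data[i] == 0x68:  # x86 opcode for PUSH imm32
            (imm,) = struct.unpack_from("<I", data, i + 1)  # little-endian operand
            offset = imm - image_base  # assumed flat VA-to-file-offset mapping
            if offset in wanted:
                found.add(offset)
    return sorted(found)

# Made-up image: PUSH 0x400008, three NOPs, then a string at file offset 8.
code = b"\x68" + struct.pack("<I", 0x400000 + 8) + b"\x90\x90\x90"
image = code + b"Save game\x00"
print(referenced_strings(image, [8, 100]))
# → [8]
```

A candidate string that is never referenced this way is more likely to be a false positive, and the instruction locations found are the ones that would need patching if a longer replacement string were placed elsewhere in the file.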
The tool is called exestringz: http://aluigi.org/mytoolz.htm#exestringz
Obviously it can also be used with blind binary scanning (-b) of the bytes inside a file, without considering the assembly instructions.
Hope it helps.
-
Rheini
- Moderator
- Posts: 652
- Joined: Wed Oct 18, 2006 9:48 pm
- Location: Germany
Re: algorithm for searching/recognizing strings automatically?
Good point.
You could also use some language statistics to check whether it is English text and so on, but obviously this only works well for longer texts.
I wonder if there is a neural-network approach to that.
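One very crude form of the language-statistics idea, sketched in Python: score a candidate by the fraction of its characters that are among the most frequent English letters (plus space). The letter set, the function name `english_score` and the example inputs are assumptions for illustration; a real classifier would more likely use full letter or n-gram frequency tables.

```python
# Assumed set of the most frequent English letters, plus the space character.
COMMON = set("etaoinshrdlu ")

def english_score(text: str) -> float:
    """Return the fraction of characters that are common English letters or spaces."""
    if not text:
        return 0.0
    hits = sum(1 for ch in text.lower() if ch in COMMON)
    return hits / len(text)

print(english_score("this is a readable sentence") > english_score("xQ9#zK@vW!"))
# → True
```

As noted above, the score is only meaningful for longer candidates: a 4-byte run can land anywhere between 0.0 and 1.0 by chance.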
-
Darkstar
- advanced
- Posts: 67
- Joined: Thu Jun 14, 2007 1:14 pm
- Location: Southern Germany
Re: algorithm for searching/recognizing strings automatically?
There's a UNIX command called "strings" (which has Cygwin and probably even native Win32 ports) that does exactly that. It's not foolproof either, but since nothing in this area can really be foolproof (because of codepages, encodings, flags in the high bits, etc.), it's better than nothing.
I use it to quickly scan its output to see whether there's anything interesting in a file.
-Darkstar
-
Mr.Mouse
- Site Admin
- Posts: 4073
- Joined: Wed Jan 15, 2003 6:45 pm
- Location: Dungeons of Doom
Re: algorithm for searching/recognizing strings automatically?
MexScan is a little app I created in 2000 to scan a file for hints that it is an archive of some sort. It also had a routine that searched for strings and estimated the distance between them: the program recognizes certain patterns in the distance between strings, such as those found in archives that store filenames.
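The distance idea can be sketched like this in Python (this is not MexScan's actual code, which isn't shown here): take the start offsets of the strings found, compute the gap between each consecutive pair, and see whether one gap dominates. The function name `dominant_gap` and the sample offsets are made up for illustration.

```python
from collections import Counter

def dominant_gap(offsets):
    """Return (gap, share) for the most common distance between consecutive offsets."""
    gaps = [b - a for a, b in zip(offsets, offsets[1:])]
    if not gaps:
        return None  # fewer than two strings, so there is no distance to measure
    gap, count = Counter(gaps).most_common(1)[0]
    return gap, count / len(gaps)

# Made-up offsets: four strings spaced 64 bytes apart, then one stray string.
print(dominant_gap([0, 64, 128, 192, 200]))
# → (64, 0.75)
```

A single gap covering most of the pairs suggests fixed-size records, for instance a directory of filenames each followed by the same number of bytes of offset and size fields.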
It's easy to find standard ASCII strings: just check for runs of the appropriate ASCII codes (e.g. ASC('a') through ASC('z'), ASC('A') through ASC('Z'), etc.).
