Home       Industry       Research       Personal   

 String Pattern Discovery


String Pattern Discovery

This is a project to create a utility that will allow some level of meaningful data extraction from files of unknown format; In essence, automating the process of reverse-engineering. Working on a solution has led me to the field of string pattern discovery and Combinatorics. The initial premise is that the best mathematical description of a pattern is that which is least probable.

When something controls the generation of a string, the possible results are fewer than those of a purely random string. It follows then that the fitness measure of string pattern description candidates is in their probability with the least probable being the best or fittest. With enough resulting output, probability calculation can reveal the string control system with a degree of certainty relative to the severity of the pattern.

To make sense of data extracted using this method, we must at the minimum, discover the encoding method. (After all, 'cat' would show us the contents of a file) In addition to the encoding method, meaningful extraction would allow selecting certain portions of the data by classifying it into like groups. In short, an automatic parser.

The success and resulting sophisticaton of such a classifying parser depends on the size of the candidate pattern description space and the complexity of the arrangement of data in the file. For now, jumping to different locations in the file by following pointers is not considered. But even a limited or crude extraction from pointer controlled files may still be quite useful.

The event considered in the probability calculation is that of adjacency. The two calculations consider the probability of like characters (AA) and those of unlike characters (AB). I was fortunate to get the help of Kevin Brown to answer the question of the probability of AB.

Kevin later provided a formula for the AA probability but the calculation is quite complicated. Although nowhere near as skilled in Combinatorics as Kevin, I was able discover a simple solution to the probability of AA.

The probability calculations in typical problems well exceed the range of standard floating point arithmetic. So for now, I'm using the GNU Multiple Precision Arithmetic or GMP math library. Scaling might bring these calculations back within native floating point range but for this research phase, I'm still using GMP for everything.