Flagrant Badassery

A JavaScript and regular expression centric blog

Automatic Regex Generation

Has it ever been done before? I'm interested in building an app (although I will possibly never get to it) which can automatically generate relatively-basic regular expressions by simply having the user enter some test data and mark the parts which the regex should match (all other text would serve as examples of what the regex should not match). By "relatively basic," I mean that the regexes should at least be capable of using literal characters, character classes, shorthand character classes like \d and \s, greedy quantification, alternation, grouping, and word boundaries.

Of course, there are major issues involved with this. For example, where would the algorithm draw the line between exactness and flexibility? I would imagine that it would attempt to create regexes which are as flexible as possible, since it's easy to rule out many non-matches by including a few non-matching examples which are very similar to strings which should match. Secondly, what would be the regex operator precedence (e.g., would it choose to attempt quantification before alternation)? There are numerous other implementation details which might present challenges, as well.

One thing that could sometimes improve the quality of the generated regexes is that the generator can cheat, by filtering expected results though a library of common patterns when this can be done with a reasonable level of certainty.

Have you ever considered this kind of thing in the past or seen it done elsewhere? Any thoughts on the issues or how the implementation might work? I have some ideas about simple algorithm approaches, but haven't proved out that they would work well in a respectable number of cases. Can you see something like this being useful to you?

There Are 8 Responses So Far. »

  1. You’d have to provide it with a really large sample size. Otherwise you’d end up with a regexp that works but that’s not what you want.

    My guess is that the amount of work to design the regexp is less than the work to collect the large sample, although if there is a program in which you want to embed the automatic regexp generator, that might be interesting.

  2. nordsieck, that is exactly the challenge… being able to provide just a few examples of what should match and a few examples of similar text which should not match, and having the generator come up with something that is good enough for your needs. Obviously, the larger the sample data, the more precise the regex could become, but I don’t think a massive dataset is necessary for the vast majority of simple patterns. Having the generator start out too precise would drastically reduce the chance that the regex would be useful with anything other than the sample data, so it’s by doing the reverse — having the generated regexes be as flexible as possible given the data (within a set hierarchy of which tokens and operations are most flexible) — that, in theory, would prevent the need to provide a huge amount of sample text. You would only need to provide samples which are very similar but should not be matched.

    Now, I would not advocate using automatically generated regexes in production code unless you fully understood their meaning (and if you did, you could probably write a better regex yourself), but such regexes could be useful for less-critical tasks like searching though personal documents or code, etc.

    If you subscribe to my views on the subject, the problem then lies not in amassing huge sample data, but in devising a system for constructing flexible patterns which match any number of provided strings, while not matching all other provided input. This would work nothing like a diff algorithm, for instance, since such programs generally can only compare two or three variations of a string. I am not aware of programs which are similar in nature to what I’ve described, but I would like to know about them if there are. I think it can be done, and have some ideas (which are not by any means fully fleshed out yet) about how to approach this in a relatively simplistic way.

    My interest in the subject comes simply from hearing a number of people express a desire for such a program.

  3. I think the main uses of generated RegExes would lie primarily in one-time use. You know, the n00bs who are trying to clean up an HTML page or format something or other in a single document and are like “how do I match tags?” For these, the data sets would be a pretty reliable sample because in general the set is limited and intended for single use. It doesn’t matter if the RegEx is flexible as long as it does it’s job.

    Another application of a RegEx generator is basically like a stub generator. It might provide a simple RegEx that is 90% what you want, and then you can tweak it to make it more flexible or correct as needed. Obviously this is most useful for those with good RegEx knowledge, but it could be a time saver at the very least. This is the principle behind most code generation techniques (e.g., scaffolding).

    Finally, another possibility, depending on how much time you want to put in to it, would be the option to “optimize for flexibility” or “optimize for efficiency” where the latter would match the data set more exactly with less regard for other similar sets of data that might need to be matched.

    At its core, this is a pattern recognition problem, traditionally a part of the artificial intelligence / neural networks area of computing and IT. You could probably apply a lot of those ideas and algorithms to this problem in the hopes of formulating the best possible RegEx. You could have a lite version and a smarter version which would be “trained” by providing test cases and then soliciting feedback for whether or not the pattern was matched. These NN then adjust the weights corresponding to the factors that were considered to provide the match.

  4. I have put a great deal of thought into this, I just haven’t put a great deal of code into it. I think you could build a pretty solid regex pattern matcher from a fairly small source of sample data, if the sample data is decidedly unambiguous and you have pretty good truth and negatives. I’m hoping that someone is really trying to tackle this, I for one would be willing to help with code that does this. Any one found any good examples?

  5. Jeff, it sounds like you and I are of the same mind on this. Unfortunately, I haven’t put much code into it yet either and probably won’t have the opportunity to do so for some time (and if I do, it could turn into a major endeavor which ends up nowhere). However, I think the principle is sound, and the potential is there. A big step in the right direction would be just creating a good interface, which would make gathering input for testing much easier.

    I don’t think any good generators exist yet, but I would love to be shown otherwise.

  6. http://www.txt2re.com
    nuff’ said

  7. O rly?

    I’ve seen that site before. It’s quite far from what I would wish for in a regex generator.

  8. Hi

    I am also planning to do the same kind of work. So any luck so far? and txt2re.com wud be very cumbersome to automate i think.

Post a Response

If you are about to post code, please escape your HTML entities (&, >, <).