Automatic Regex Generation

Has it ever been done before? I'm interested in building an app (although I will possibly never get to it) which can automatically generate relatively basic regular expressions simply by having the user enter some test data and mark the parts which the regex should match (all other text would serve as examples of what the regex should not match). By "relatively basic," I mean that the regexes should at least be capable of using literal characters, character classes, shorthand character classes like \d and \s, greedy quantification, alternation, grouping, and word boundaries.
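
To make the idea concrete, here is roughly the input and output I have in mind. The sample text, the marked spans, and the resulting regex below are all made up for illustration; nothing like this exists yet:

```python
import re

# Hypothetical input: the user marks the spans the regex should match;
# the rest of the sample text serves as implicit non-matches.
sample   = "Order #1042 shipped on 2007-03-15, order #7 pending"
marked   = ["#1042", "#7"]                      # spans the user highlighted
unmarked = ["2007-03-15", "shipped", "pending"]

# One plausible regex a generator could produce from that input, using only
# the "relatively basic" features listed above: a literal character, a
# shorthand class, greedy quantification, and a word boundary.
candidate = r"#\d+\b"

assert all(re.fullmatch(candidate, m) for m in marked)
assert not any(re.fullmatch(candidate, u) for u in unmarked)
print(candidate)   # #\d+\b
```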

Of course, there are major issues involved with this. For example, where would the algorithm draw the line between exactness and flexibility? I would imagine that it would attempt to create regexes which are as flexible as possible, since it's easy to rule out many non-matches by including a few non-matching examples which are very similar to strings which should match. Secondly, what would be the regex operator precedence (e.g., would it choose to attempt quantification before alternation)? There are numerous other implementation details which might present challenges, as well.
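
One way to make the flexibility question concrete is to fix a ladder of tokens, ordered from least to most flexible, and only climb it when a more specific rung fails to cover the data. This is only a sketch of that ordering; the ladder itself and the function name are assumptions, not a working generator:

```python
import re

# A possible flexibility ladder, from most specific to most general.
LADDER = [
    ("literal", lambda chars: len(set(chars)) == 1),
    (r"\d",     lambda chars: all(c.isdigit() for c in chars)),
    (r"\s",     lambda chars: all(c.isspace() for c in chars)),
    (r"\w",     lambda chars: all(c.isalnum() or c == "_" for c in chars)),
    (r".",      lambda chars: True),
]

def least_flexible_token(chars):
    """Return the least flexible token that covers every character seen
    at one position across the marked samples."""
    for token, covers in LADDER:
        if covers(chars):
            return re.escape(chars[0]) if token == "literal" else token

print(least_flexible_token(["3", "3"]))   # 3
print(least_flexible_token(["3", "7"]))   # \d
print(least_flexible_token(["a", "7"]))   # \w
print(least_flexible_token(["a", "-"]))   # .
```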

One thing that could sometimes improve the quality of the generated regexes is that the generator can cheat by filtering expected results through a library of common patterns when this can be done with a reasonable level of certainty.
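
A sketch of that cheating step, assuming a small hand-made library of well-known patterns is consulted before any real synthesis is attempted. The patterns and the 0.8 threshold here are illustrative only:

```python
import re

# Illustrative library of common patterns; a real tool would have many more
# entries and far more careful versions of each.
COMMON_PATTERNS = {
    "integer":   r"\d+",
    "iso date":  r"\d{4}-\d{2}-\d{2}",
    "hex color": r"#[0-9a-fA-F]{6}",
    "word":      r"[A-Za-z]+",
}

def library_guess(matches, threshold=0.8):
    """Return (name, pattern) of the first library entry that fully matches
    at least `threshold` of the marked samples, or None if nothing does."""
    for name, pattern in COMMON_PATTERNS.items():
        hits = sum(bool(re.fullmatch(pattern, m)) for m in matches)
        if hits / len(matches) >= threshold:
            return name, pattern
    return None

print(library_guess(["2007-03-15", "1999-12-31"]))   # ('iso date', '\\d{4}-\\d{2}-\\d{2}')
print(library_guess(["foo42", "x-1"]))               # None
```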

Have you ever considered this kind of thing in the past or seen it done elsewhere? Any thoughts on the issues or how the implementation might work? I have some ideas about simple algorithm approaches, but haven't proved out that they would work well in a respectable number of cases. Can you see something like this being useful to you?

21 thoughts on “Automatic Regex Generation”

  1. You’d have to provide it with a really large sample size. Otherwise you’d end up with a regexp that works on the sample data but isn’t what you actually want.

    My guess is that the amount of work to design the regexp is less than the work to collect the large sample, although if there is a program in which you want to embed the automatic regexp generator, that might be interesting.

  2. nordsieck, that is exactly the challenge… being able to provide just a few examples of what should match and a few examples of similar text which should not match, and having the generator come up with something that is good enough for your needs. Obviously, the larger the sample data, the more precise the regex could become, but I don’t think a massive dataset is necessary for the vast majority of simple patterns. Having the generator start out too precise would drastically reduce the chance that the regex would be useful with anything other than the sample data. Doing the reverse, making the generated regexes as flexible as possible given the data (within a set hierarchy of which tokens and operations are most flexible), is what would, in theory, remove the need to provide a huge amount of sample text. You would only need to provide samples which are very similar but should not be matched.

    Now, I would not advocate using automatically generated regexes in production code unless you fully understood their meaning (and if you did, you could probably write a better regex yourself), but such regexes could be useful for less-critical tasks like searching through personal documents or code, etc.

    If you subscribe to my views on the subject, the problem then lies not in amassing huge sample data, but in devising a system for constructing flexible patterns which match any number of provided strings while matching none of the other provided input. This would work nothing like a diff algorithm, for instance, since such programs generally can only compare two or three variations of a string. I am not aware of programs which are similar in nature to what I’ve described, but I would like to know about them if any exist. I think it can be done, and have some ideas (which are not by any means fully fleshed out yet) about how to approach this in a relatively simplistic way; a rough sketch of one such idea is below.

    My interest in the subject comes simply from hearing a number of people express a desire for such a program.
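
    A rough sketch of that idea: order candidate patterns from most to least flexible and keep the first one that matches every positive sample and none of the negatives. The candidate list here is hand-written purely for illustration; generating it automatically is the real problem.

    ```python
    import re

    def pick_pattern(candidates, positives, negatives):
        """Return the first (i.e. most flexible) candidate that matches every
        positive sample and no negative sample, or None if none qualifies."""
        for pattern in candidates:
            if (all(re.fullmatch(pattern, p) for p in positives)
                    and not any(re.fullmatch(pattern, n) for n in negatives)):
                return pattern
        return None

    # Hand-written candidates, ordered from most to least flexible.
    candidates = [r"\w+", r"[a-z]+\d+", r"item\d+"]
    positives  = ["item1", "item42"]
    negatives  = ["order7", "item"]   # similar strings that must not match

    print(pick_pattern(candidates, positives, negatives))   # item\d+
    ```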

  3. I think the main uses of generated RegExes would lie primarily in one-time use. You know, the n00bs who are trying to clean up an HTML page or format something or other in a single document and are like “how do I match tags?” For these, the data sets would be a pretty reliable sample because in general the set is limited and intended for single use. It doesn’t matter if the RegEx is flexible as long as it does its job.

    Another application of a RegEx generator is basically like a stub generator. It might provide a simple RegEx that is 90% what you want, and then you can tweak it to make it more flexible or correct as needed. Obviously this is most useful for those with good RegEx knowledge, but it could be a time saver at the very least. This is the principle behind most code generation techniques (e.g., scaffolding).

    Finally, another possibility, depending on how much time you want to put into it, would be the option to “optimize for flexibility” or “optimize for efficiency”, where the latter would match the data set more exactly, with less regard for other similar sets of data that might need to be matched.

    At its core, this is a pattern recognition problem, traditionally a part of the artificial intelligence / neural networks area of computing and IT. You could probably apply a lot of those ideas and algorithms to this problem in the hopes of formulating the best possible RegEx. You could have a lite version and a smarter version which would be “trained” by providing test cases and then soliciting feedback on whether or not the pattern was matched correctly. The NNs would then adjust the weights corresponding to the factors that were considered to produce the match.
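
    For what it’s worth, the feedback idea could start out much simpler than a neural network: score each candidate by the positives it matches, minus the negatives it matches, minus a small length penalty, and keep the best. This is just an illustrative fitness function, not any published algorithm:

    ```python
    import re

    def fitness(pattern, positives, negatives):
        """Higher is better: reward matched positives, punish matched
        negatives and overly long patterns."""
        pos_hits = sum(bool(re.fullmatch(pattern, s)) for s in positives)
        neg_hits = sum(bool(re.fullmatch(pattern, s)) for s in negatives)
        return pos_hits - neg_hits - 0.01 * len(pattern)

    positives  = ["10:30", "07:15"]
    negatives  = ["1030", "10-30"]
    candidates = [r".+", r"\d+:\d+", r"\d{2}:\d{2}"]

    best = max(candidates, key=lambda p: fitness(p, positives, negatives))
    print(best)   # \d+:\d+ (both time patterns exclude the negatives; this one is shorter)
    ```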

  4. I have put a great deal of thought into this; I just haven’t put a great deal of code into it. I think you could build a pretty solid regex pattern matcher from a fairly small source of sample data, if the sample data is decidedly unambiguous and you have pretty good positive and negative examples. I’m hoping that someone is really trying to tackle this; I for one would be willing to help with code that does this. Has anyone found any good examples?

  5. Jeff, it sounds like you and I are of the same mind on this. Unfortunately, I haven’t put much code into it yet either and probably won’t have the opportunity to do so for some time (and if I do, it could turn into a major endeavor which ends up nowhere). However, I think the principle is sound, and the potential is there. A big step in the right direction would be just creating a good interface, which would make gathering input for testing much easier.

    I don’t think any good generators exist yet, but I would love to be shown otherwise.

  6. No, enough has not been said, @me. I’ve seen that site before. I consider it mostly useless, and very far from what I would wish for in a regex generator.

  7. Hi

    I am also planning to do the same kind of work. Any luck so far? And txt2re.com would be very cumbersome to automate, I think.

  8. Just to know… did you ever pursue this further or find any resource for doing such a job? Right now, I’m looking at a way to define patterns from an interface where a user with no expertise would select a few elements on an arbitrary webpage (as many as it takes to make it work) until all the elements of the same kind could be located automatically. Say, finding all the names on a blog where the signature always has the same position within a block. I could cook up a regex to do this, but I’d like to build a tool that would already propose a reasonable regex to find that pattern. Another need would be to “find” a single item within a page, knowing its relative position but not its (exact) content; say, the name of the special of the day on a particular webshop. Any leads or advice?

    Thanks!

  9. 1. Keep the test data units short.
    2. Identify the fields in each test unit.
    3. Use simple regexes as prototypes for each field (sketched below).
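
    A minimal sketch of that outline, assuming fields are split on whitespace and the prototype list is hand-picked (both of which are assumptions, not part of the original suggestion):

    ```python
    import re

    # Illustrative field prototypes, ordered from most to least specific.
    PROTOTYPES = [r"\d+", r"[A-Za-z]+", r"\S+"]

    def prototype_for(field):
        """Pick the first simple prototype that fully matches the field."""
        return next(p for p in PROTOTYPES if re.fullmatch(p, field))

    def unit_to_regex(unit):
        """Split a short test unit into whitespace-separated fields and
        join the per-field prototypes back into one pattern."""
        return r"\s+".join(prototype_for(f) for f in unit.split())

    print(unit_to_regex("GET /index.html 200"))   # [A-Za-z]+\s+\S+\s+\d+
    ```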

  10. Well, I am looking for exactly the same thing, because I am writing some small plugins which convert data (from the web or from files) and store it in MySQL or react in some other way. So no new tools have turned up on the web?

  11. No. You should study finite automata. From the input, you can generate a simple non-deterministic finite automaton. This automaton should be determinized and then minimized. Going further, you can generate the regular expression for the language which the minimal deterministic finite automaton accepts. This is the solution.

    It will be a lot of code (I think somewhere around 20-40 kB in Perl), but I’m on my way to doing this.

  12. And it is not NP-complete; it is simple polynomial time. That won’t be a problem. We need minimal-length regexps which match the inputs.
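
    For illustration only, here is a first step in that direction: a prefix-tree acceptor built from the positive samples (itself already deterministic), converted straight to a regex. It skips the minimization step described above, so the result is not minimal, and the helper names are made up:

    ```python
    import re

    def build_pta(samples):
        """Build a prefix-tree acceptor as a dict-of-dicts trie;
        the key "" marks an accepting state."""
        root = {}
        for s in samples:
            node = root
            for ch in s:
                node = node.setdefault(ch, {})
            node[""] = {}
        return root

    def to_regex(node):
        """Convert the trie to a regex by alternating over outgoing edges;
        an accepting state makes whatever follows it optional."""
        branches, optional = [], False
        for ch, child in node.items():
            if ch == "":
                optional = True
                continue
            branches.append(re.escape(ch) + to_regex(child))
        if not branches:
            return ""
        body = branches[0] if len(branches) == 1 else "(?:" + "|".join(branches) + ")"
        return "(?:" + body + ")?" if optional else body

    samples = ["cat", "car", "cart"]
    pattern = "^" + to_regex(build_pta(samples)) + "$"
    print(pattern)                                           # ^ca(?:t|r(?:t)?)$
    print(all(re.fullmatch(pattern, s) for s in samples))    # True
    ```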

  13. @Andrea De Lorenzo, that’s awesome! I’ll have to play with your webapp and closely read over your paper when I have some time available, but at first glance, that is exactly the kind of thing I was talking about and hoping to see when I wrote this blog post five years ago (though your approach of using genetic programming was not something I’d considered). Do you have any plans to make the code for this very cool project of yours open source or available to view?

  14. @Steven Levithan, thanks for your feedback. We really appreciate it!
    We developed this webapp just as a prototype to present our work at the GECCO conference.
    We are pleased to hear that it was appreciated; we didn’t expect so much interest in it. We are currently working to improve the server capacity, but we have limited resources.
    We don’t plan to release the code in the short term, since it still needs minor tweaks: as I wrote before, we have limited resources and we prefer to improve the webapp.
