Has it ever been done before? I'm interested in building an app (although I may never get to it) which can automatically generate relatively basic regular expressions by having the user enter some test data and mark the parts which the regex should match (all other text would serve as examples of what the regex should not match). By "relatively basic," I mean that the regexes should at least be capable of using literal characters, character classes, shorthand character classes like \s, greedy quantification, alternation, grouping, and word boundaries.
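To make the input model concrete, here is a minimal sketch in Python of how the marked test data might be represented and how a candidate regex could be checked against it. The function name `regex_fits` and the use of (start, end) offsets for the marked parts are my own assumptions for illustration, not a settled design.

```python
import re

def regex_fits(pattern, text, marked_spans):
    """Return True if the pattern's matches in `text` are exactly the marked spans.

    Anything the user did not mark counts as a negative example, so an extra
    match anywhere in the text makes the candidate fail.
    """
    found = [m.span() for m in re.finditer(pattern, text)]
    return found == sorted(marked_spans)

# Hypothetical test data: the user has marked the two numbers.
text = "order 66 shipped, order 7 pending"
marked = [(6, 8), (24, 25)]  # (start, end) offsets of "66" and "7"

print(regex_fits(r"\d+", text, marked))  # True
print(regex_fits(r"\d", text, marked))   # False: "66" is split into two matches
print(regex_fits(r"\w+", text, marked))  # False: also matches the unmarked words
```

A check like this would be the acceptance test for whatever the generator proposes; it says nothing yet about how candidates are produced.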
Of course, there are major issues involved with this. For example, where would the algorithm draw the line between exactness and flexibility? I would imagine that it would attempt to create regexes which are as flexible as possible, since it's easy to rule out many non-matches by including a few non-matching examples which are very similar to strings which should match. Secondly, what would be the regex operator precedence (e.g., would it choose to attempt quantification before alternation)? There are numerous other implementation details which might present challenges, as well.
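One way to picture the "flexible first" idea is a short ladder of candidates: try a generalized pattern, and fall back to exact literals only if the flexible one also hits unmarked text. The sketch below is a rough illustration under my own assumptions (the names `generalize` and `pick_pattern`, and the choice of classes for digits, letters, and whitespace, are hypothetical), not a worked-out algorithm.

```python
import re

# Split an example into runs of digits, letters, whitespace, or a single other char.
TOKEN = re.compile(
    r"(?P<digits>[0-9]+)|(?P<letters>[A-Za-z]+)|(?P<space>\s+)|(?P<other>.)",
    re.DOTALL,
)
CLASS_FOR = {"digits": r"\d+", "letters": r"[A-Za-z]+", "space": r"\s+"}

def generalize(example):
    """Replace each run with a quantified class; keep other characters literal."""
    parts = []
    for tok in TOKEN.finditer(example):
        parts.append(CLASS_FOR.get(tok.lastgroup) or re.escape(tok.group()))
    return "".join(parts)

def alternation(patterns):
    """Join distinct patterns with |, longest first so prefixes do not win."""
    return "|".join(sorted(set(patterns), key=lambda p: (-len(p), p)))

def pick_pattern(text, marked_spans):
    """Return the most flexible candidate whose matches equal the marked spans."""
    examples = [text[start:end] for start, end in marked_spans]
    candidates = [
        alternation(generalize(x) for x in examples),  # flexible end of the ladder
        alternation(re.escape(x) for x in examples),   # exact end of the ladder
    ]
    for candidate in candidates:
        found = [m.span() for m in re.finditer(candidate, text)]
        if found == sorted(marked_spans):
            return candidate
    return None

# Both numbers are marked, so the flexible candidate survives.
print(pick_pattern("ids: 42, 1337", [(5, 7), (9, 13)]))
# Only "42" is marked; the unmarked "99" rules out the flexible candidate.
print(pick_pattern("keep 42, skip 99", [(5, 7)]))
```

A real generator would need many more rungs between these two extremes, which is exactly where the precedence question comes in.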
One thing that could sometimes improve the quality of the generated regexes is for the generator to cheat, by filtering expected results through a library of common patterns when this can be done with a reasonable level of certainty.
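As a rough sketch of that cheat in Python: keep a small table of known patterns and suggest one only when every marked example fits it. The table entries below are simplified illustrations rather than validated patterns, and both the name `suggest_from_library` and the `min_examples` threshold (a crude stand-in for "reasonable certainty") are my own assumptions.

```python
import re

# Hypothetical, deliberately simplified library of common patterns.
COMMON_PATTERNS = {
    "iso_date": r"\d{4}-\d{2}-\d{2}",
    "ipv4_like": r"(?:\d{1,3}\.){3}\d{1,3}",
    "hex_color": r"#[0-9A-Fa-f]{6}",
}

def suggest_from_library(examples, min_examples=2):
    """Return (name, pattern) if every example fully matches a library entry.

    With fewer than `min_examples` examples there is too little evidence,
    so nothing is suggested.
    """
    if len(examples) < min_examples:
        return None
    for name, pattern in COMMON_PATTERNS.items():
        if all(re.fullmatch(pattern, x) for x in examples):
            return name, pattern
    return None

print(suggest_from_library(["2024-01-15", "1999-12-31"]))  # suggests the iso_date entry
print(suggest_from_library(["2024-01-15"]))                # None: only one example
print(suggest_from_library(["2024-01-15", "hello"]))       # None: no entry fits both
```

Any suggestion would still have to pass the same check against the unmarked text before being offered to the user.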
Have you ever considered this kind of thing in the past or seen it done elsewhere? Any thoughts on the issues or how the implementation might work? I have some ideas about simple algorithm approaches, but haven't proved out that they would work well in a respectable number of cases. Can you see something like this being useful to you?