Code + data here: https://github.com/nicholaslocascio/deep-regex
In our case, we have no examples to test against, only a natural language (English) description of what the user wants the regex to do. This is an inference problem more than a search problem as we've got one shot to give our best guess without any tests to check against and modify our answer.
Some positive ones:
1) Spot-on prediction:
PROMPT: lines with 3 or more characters or lower-case letters
PRED: ((.)|([a-z])){3,}
GOLD: ((.)|([a-z])){3,}
2) Learned to generalize and produced a simpler regex:
PROMPT: lines with a character and the string 'dog'
PRED: .*(.)&(dog).*
GOLD: .*((.)+)&(dog).*
3) Also learned to generalize and produced a simpler regex without duplicate logic:
PROMPT: lines not containing a letter
PRED: .*~(([A-z])+).*
GOLD: (.*)(.*~([A-z]).*)
4) Handling multiple references correctly:
PROMPT: lines using 'su' after 'sun' or 'soon'.
PRED: .*(sun|soon).*su.*
GOLD: .*(sun|soon).*su.*
Though I find the mistakes interesting as well!
1) Issues counting properly:
PROMPT: lines containing a 5 letter word beginning with 't'
PRED: .*\bt[A-z]{5}\b.*
GOLD: .*\bt[A-z]{4}\b.*
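The off-by-one is easy to confirm by hand. Here is a minimal sketch using Python's `re` module; the two sample lines and the `matches` helper are my own illustrations, not from the dataset. Both patterns happen to be valid standard regex, so no translation of the `&` / `~` operators used in other examples is needed.

```python
import re

# Patterns copied from the example above.
PRED = re.compile(r".*\bt[A-z]{5}\b.*")  # 't' plus 5 more characters: a 6-letter word
GOLD = re.compile(r".*\bt[A-z]{4}\b.*")  # 't' plus 4 more characters: a 5-letter word


def matches(pattern, line):
    """Return True if the whole line matches the pattern (hypothetical helper)."""
    return pattern.fullmatch(line) is not None


five_letter = "a tiger sleeps"      # contains 'tiger' (5 letters)
six_letter = "some tigers sleep"    # contains 'tigers' (6 letters)

print(matches(GOLD, five_letter), matches(PRED, five_letter))
print(matches(GOLD, six_letter), matches(PRED, six_letter))
```

As a side note, `[A-z]` is wider than "a letter": in ASCII it also covers the six punctuation characters that sit between `Z` and `a`, including `_`.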
2) Misallocation of parentheses (to be fair, the prompt is slightly ambiguous):
PROMPT: lines with 'dog' follwed by 'truck' and a lower-case
PRED: (dog).*((truck)&([a-z])).*
GOLD: (dog.*truck.*)&(.*[a-z].*)
Have you considered generating something like a formal grammar, a recursive-descent parser or a software library?
So, the old ways are still better for this domain if it's a production system whose cost or results matter. These methods might be useful for search/query by casual users, though. Or for people whose first language isn't English and who are likely to phrase queries in unusual ways.
This idea sounds good, but as soon as you start getting slightly more complicated you'll be writing paragraphs:
Try writing this in a natural language format:
<a\s+(?:[^>]*?\s+)?href="([^"]*)"
That's a regex to get the value of an href from anchor links.
"match "<a " then do not match a ">" if it exists, followed by a space, if it exists, then match a "href=", then begin a capturing group, then match anything but a '"' 0 or more times, then close a capturing group, then match a '"'"
I'm sure that's not even correct but you can see what I mean. I can see this idea being a good tool for learning though, especially for smaller regexes.
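For comparison, here is roughly how the same pattern reads when it is annotated piece by piece instead of narrated in one sentence. This is only a sketch using Python's `re.VERBOSE` flag; the name `HREF`, the comments and the sample HTML are my own reading of the regex, not something from the project.

```python
import re

# The href regex from above, unchanged, split across lines with re.VERBOSE.
HREF = re.compile(r'''
    <a\s+            # literal "<a" followed by at least one whitespace character
    (?:[^>]*?\s+)?   # optionally: other attributes (anything but ">"), then whitespace
    href="           # literal href="
    ([^"]*)          # capture group 1: everything up to the next double quote
    "                # the closing double quote
''', re.VERBOSE)

# Illustrative input, made up for this example.
html = '<a class="x" href="https://example.com/page">link</a>'
match = HREF.search(html)
print(match.group(1))  # https://example.com/page
```

Even annotated, it takes five comments to explain one short line of regex, which is about the ratio the paragraph above was complaining about.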
> [...] That's a regex to get the value of an href from anchor links.
A true natural language to regex system would take your short natural language description as input, not paragraphs describing the task in more detail. Of course this would require a lot of domain knowledge about HTML, but that knowledge is readily available out there on the internet. I think it's no longer crazy to imagine a system which could read the internet, learn about HTML, and apply that knowledge to answer your natural language regex query.
This is clearly far beyond where we are today, but I think a few orders of magnitude larger neural nets would be able to handle this task, and the hardware guys are hard at work getting us there. The pace of improvement will be much faster than Moore's law over the next couple of years as the first optimized neural net hardware becomes available.
http://stackoverflow.com/questions/1732348/regex-match-open-...