DeepRegex: Neural Generation of Regular Expressions from Natural Language (opens in new tab)

(arxiv.org)

145 pointsnicklo9y ago21 comments

21 comments

18 comments · 3 top-level

nickloOP9y ago· 8 in thread

Hey HN! One of the authors here. This was a fun project to work on. Happy to answer any questions :)

Code + data here: https://github.com/nicholaslocascio/deep-regex

Are you familiar with the work of the Machine Learning Lab at the University of Trieste? And, if so, could you quickly comment on how the two approaches differ?

http://machinelearning.inginf.units.it/

http://regex.inginf.units.it/

nickloOP9y ago

Yeah their work is very cool! The two tasks are a bit different however. Their system is for the task of generating regular expressions given a set of positive and negative examples of what the regex should match. It uses genetic algorithms and other techniques to optimize and search for a regex that fits all the given examples.

In our case, we have no examples to test against, only a natural language (English) description of what the user wants the regex to do. This is an inference problem more than a search problem as we've got one shot to give our best guess without any tests to check against and modify our answer.

1 more reply

utopkara9y ago

Congrats! The regex translation dataset generation is a great idea! 10000 lines could be the MNIST for regex :-)

nickloOP9y ago

Thanks! Starting this work, we realized that large regex datasets (large enough to apply deep-learning to) were difficult to come by. So we came up with a methodology that allowed us to make a pretty decent-sized dataset for cheap. We are glad to share it :)

vessenes9y ago

Hey, awesome work. Do you have any generated regexs? It would be nice to see some examples, especially if surprising in some way.

nickloOP9y ago

Sure!

Some positive ones:

1) Spot-on prediction:

  PROMPT: lines with 3 or more characters or lower-case letters

  PRED: ((.)|([a-z])){3,}

  GOLD: ((.)|([a-z])){3,}

2) Learned to generalize and produced a simpler regex:

  PROMPT: lines with a character and the string 'dog'

  PRED: .*(.)&(dog).*

  GOLD: .*((.)+)&(dog).*

3) Also learned to generalize and produced simpler regex without duplicate logic:

  PROMPT: lines not containing a letter

  PRED: .*~(([A-z])+).*

  GOLD: (.*)(.*~([A-z]).*)

4) Handling multiple references correctly:

  PROMPT: lines using 'su' after 'sun' or 'soon'.

  PRED: .*(sun|soon).*su.*

  GOLD: .*(sun|soon).*su.*

Though I find the mistakes interesting as well!

1) Issues counting properly:

  PROMPT: lines containing a 5 letter word beginning with 't'

  PRED: .*\bt[A-z]{5}\b.*

  GOLD: .*\bt[A-z]{4}\b.*

2) Misallocation of parenthesis (to be fair, the prompt is slightly ambiguous):

  PROMPT: lines with 'dog' follwed by 'truck' and a lower-case

  PRED: (dog).*((truck)&([a-z])).*

  GOLD: (dog.*truck.*)&(.*[a-z].*)

1 more reply

joe_the_user9y ago

It's an interesting project but to be honest, Regular expressions are kind of terrible as a piece of software if one is looking to embed them in larger software, especially if the regex gets at all large.

Have you can considered generating something like a formal grammar, a recursive-descent parser or a software library?

nickpsecurity9y ago

Despite this work being neat, that's still the best way to do it that I'm aware of. First reason is that English is imprecise enough that CompSci invented formal specifications to solve the problems that created. This is an immediate step back from precise requirements. Second, there's a ton tooling to automatically generate parsers from precise grammars. The languages, even BNF, are pretty easy to teach people. There's also text languages & spec methods that can make comprehensible stuff that regex's or BNF might muddy up. All of these take almost no CPU effort to deterministically produce a result from.

So, the old ways are still better for this domain if it's a production system whose cost or results matter. These methods might be useful for search/query by casual users, though. Or people that come from a foreign language likely to express queries in a weird way.

RankingMember9y ago· 6 in thread

As soon as I saw the title I started thinking of how great it would be to have a solid NaturalLanguage -> Regex translator, something I've always wanted. Regex is so powerful, but sometimes it takes so long to get it to do what you want it to.

flanbiscuit9y ago

> how great it would be to have a solid NaturalLanguage -> Regex translator

This idea sounds good, but as soon as you start getting slightly more complicated you'll be writing paragraphs:

Try writing this in a natural language format:

    <a\s+(?:[^>]*?\s+)?href="([^"]*)"

That's a regex to get the value of an href from anchor links.

"match "<a " then do not match a ">" if it exists, followed by a space, if it exists, then match a "href=", then begin a capturing group, then match anything but a '"' 0 or more times, then close a capturing group, then match a '"'

I'm sure that's not even correct but you can see what I mean. I can see this idea being a good tool for learning though, especially for smaller regex

modeless9y ago

> Try writing this in a natural language format:

> [...] That's a regex to get the value of an href from anchor links.

A true natural language to regex system would take your short natural language description as input, not paragraphs describing the task in more detail. Of course this would require a lot of domain knowledge about HTML, but that knowledge is readily available out there on the internet. I think it's no longer crazy to imagine a system which could read the internet, learn about HTML, and apply that knowledge to answer your natural language regex query.

This is clearly far beyond where we are today, but I think a few orders of magnitude larger neural nets would be able to handle this task, and the hardware guys are hard at work getting us there. The pace of improvement will be much faster than Moore's law over the next couple of years as the first optimized neural net hardware becomes available.

hamandcheese9y ago

It's not wise to attempt to parse html with a regex :)

http://stackoverflow.com/questions/1732348/regex-match-open-...

mynewtb9y ago

There is one that I always meant to try but I cannot find it again. One chained methods together or something.

Retra9y ago

https://github.com/VerbalExpressions/JSVerbalExpressions

1 more reply

jasonjmcghee9y ago

I couldn't see myself ever using a nl to regex translator. English especially is incredibly ambiguous and regex are simple and, as you noted, very powerful. Implementing a regex interpreter personally enabled to create even the trickiest regex. I'd highly recommend it! Knowing how to write complex regex also makes sed/grep your best friend. :)

1 more reply

duaneb9y ago· 1 in thread

The applications of this for scraping alone are amazing

staticautomatic9y ago

Imagine translating natural language into XPath queries. Now that would be something.

j / k navigate · click thread line to collapse

21 comments

18 comments · 3 top-level

nickloOP9y ago· 8 in thread

Hey HN! One of the authors here. This was a fun project to work on. Happy to answer any questions :)

Code + data here: https://github.com/nicholaslocascio/deep-regex

jgalt2129y ago

Are you familiar with the work of the Machine Learning Lab at the University of Trieste? And, if so, could you quickly comment on how the two approaches differ?

http://machinelearning.inginf.units.it/

http://regex.inginf.units.it/

nickloOP9y ago

1 more reply

utopkara9y ago

Congrats! The regex translation dataset generation is a great idea! 10000 lines could be the MNIST for regex :-)

nickloOP9y ago

vessenes9y ago

Hey, awesome work. Do you have any generated regexs? It would be nice to see some examples, especially if surprising in some way.

nickloOP9y ago

Sure!

Some positive ones:

1) Spot-on prediction:

  PROMPT: lines with 3 or more characters or lower-case letters

  PRED: ((.)|([a-z])){3,}

  GOLD: ((.)|([a-z])){3,}

2) Learned to generalize and produced a simpler regex:

  PROMPT: lines with a character and the string 'dog'

  PRED: .*(.)&(dog).*

  GOLD: .*((.)+)&(dog).*

3) Also learned to generalize and produced simpler regex without duplicate logic:

  PROMPT: lines not containing a letter

  PRED: .*~(([A-z])+).*

  GOLD: (.*)(.*~([A-z]).*)

4) Handling multiple references correctly:

  PROMPT: lines using 'su' after 'sun' or 'soon'.

  PRED: .*(sun|soon).*su.*

  GOLD: .*(sun|soon).*su.*

Though I find the mistakes interesting as well!

1) Issues counting properly:

  PROMPT: lines containing a 5 letter word beginning with 't'

  PRED: .*\bt[A-z]{5}\b.*

  GOLD: .*\bt[A-z]{4}\b.*

2) Misallocation of parenthesis (to be fair, the prompt is slightly ambiguous):

  PROMPT: lines with 'dog' follwed by 'truck' and a lower-case

  PRED: (dog).*((truck)&([a-z])).*

  GOLD: (dog.*truck.*)&(.*[a-z].*)

1 more reply

joe_the_user9y ago

Have you can considered generating something like a formal grammar, a recursive-descent parser or a software library?

nickpsecurity9y ago

RankingMember9y ago· 6 in thread

flanbiscuit9y ago

> how great it would be to have a solid NaturalLanguage -> Regex translator

This idea sounds good, but as soon as you start getting slightly more complicated you'll be writing paragraphs:

Try writing this in a natural language format:

    <a\s+(?:[^>]*?\s+)?href="([^"]*)"

That's a regex to get the value of an href from anchor links.

I'm sure that's not even correct but you can see what I mean. I can see this idea being a good tool for learning though, especially for smaller regex

modeless9y ago

> Try writing this in a natural language format:

> [...] That's a regex to get the value of an href from anchor links.

hamandcheese9y ago

It's not wise to attempt to parse html with a regex :)

http://stackoverflow.com/questions/1732348/regex-match-open-...

mynewtb9y ago

There is one that I always meant to try but I cannot find it again. One chained methods together or something.

Retra9y ago

https://github.com/VerbalExpressions/JSVerbalExpressions

1 more reply

jasonjmcghee9y ago

1 more reply

duaneb9y ago· 1 in thread

The applications of this for scraping alone are amazing

staticautomatic9y ago

Imagine translating natural language into XPath queries. Now that would be something.

j / k navigate · click thread line to collapse