RegExpBuilder – Create regular expressions using chained methods (github.com)

85 points · jrullmann · 11y ago · 54 comments

UnoriginalGuy · 11y ago · 7 in thread

Looks like Linq (from .Net/C#). Pretty sexy way to write Regular Expressions if you ask me.

I've "learned" regular expressions multiple times but it just never sticks, I have no idea why. It certainly doesn't help that there are several different incompatible syntaxes (so what I remember and think "should" work doesn't).

I'd prefer to write RegX's in this style, however I would pay attention to performance (not that Regular Expressions are high performance, however I wouldn't want to see a large performance loss either).

UK-AL · 11y ago

Regular expressions are high performance if you use automata-style (regular language) regular expressions, which limits which features you can use.

Modern regular expression engines in a lot of languages actually go beyond the expressiveness of a regular language. This is what damages performance.

There is no reason why this would reduce performance... if it's not doing anything crazy.

If anything, you're taking work away from it. You're building the tree directly here, whereas a parser would normally build a tree from the string. But since this integrates with the language's RE library, I'm guessing it's writing that tree as a string, which is then passed into the regular expression engine, to be turned into a tree again :)

UnoriginalGuy · 11y ago

I guess it depends on your definition of "high performance."

If a regular expression runs too often, even pre-compiled (as they should be), you'll want to replace them with code written in the native language. I've gone in and replaced a one line search/replace written in RegX (compiled), with just a C-style for() loop over the wchar array, and had the memory usage drop by near 80% and performance increase by over 60%.

So high performance is all relative. However RegX isn't something I'd describe that way, even compiled. It is a nice way to write complex string parsing code quickly however.

Blackthorn · 11y ago

I wasn't sure to be impressed or horrified when I learned that Perl supported recursive regular expressions.

amageed · 11y ago

If you're interested in something similar for .NET / C#, check out my Regextra library, specifically the Passphrase Regex Builder: https://github.com/amageed/Regextra

As the name suggests though, the focus was on passphrase criteria and it wasn't to produce a DSL for general regex building. The library also supports named templates and a few utility methods.

Retra · 11y ago

This is why I dislike the design of Linq. The pattern of chaining function calls to implement a DSL is common enough that they should have employed a general solution, not just a wonky SQL-specific version.

amageed · 11y ago

LINQ isn't SQL-specific and does apply generally. It can be used against the standard .NET framework objects and collections. There are different LINQ focuses or flavors, and there are 2 ways to write queries. There's LINQ to Objects, LINQ to XML, LINQ to SQL (no longer actively maintained; nowadays Entity Framework is the Microsoft alternative), and you can write your own LINQ providers to target other purposes.

As for syntax, there's the fluent syntax (chained methods), and there's the query syntax which is syntactic sugar that gets compiled to the methods. The query syntax is probably the biggest reason people mistake LINQ for being SQL specific since it resembles SQL.

E.g.,

  var results = SomeCollection.Where(c => c.SomeProperty < 10)
                              .Select(c => new { c.SomeProperty, c.OtherProperty });

The same thing in query syntax:

  var results = from c in SomeCollection
                where c.SomeProperty < 10
                select new { c.SomeProperty, c.OtherProperty };

Then you can iterate over both the same way:

  foreach (var result in results)
  {
      Console.WriteLine(result);
  }

rripken · 11y ago

Performance is unaffected. This provides a fluent and verbose way of building a regular expression. Users of the library then feed the built regular expression into their standard regular expression engine.
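
A minimal sketch of that flow in Python, assuming a hypothetical `PatternBuilder` class (the class and its method names are made up for illustration and are not the linked library's API): the chained calls only assemble a pattern string, and the matching itself is done by the standard `re` engine.

```python
import re


class PatternBuilder:
    """Hypothetical fluent builder; NOT the API of the linked RegExpBuilder."""

    def __init__(self):
        self._parts = []

    def then(self, literal):
        # Escape the literal so metacharacters such as "$" and "." match themselves.
        self._parts.append(re.escape(literal))
        return self

    def digits(self, minimum=1, maximum=None):
        # Emits \d{m,} when no maximum is given, otherwise \d{m,n}.
        upper = "" if maximum is None else str(maximum)
        self._parts.append(r"\d{%d,%s}" % (minimum, upper))
        return self

    def exactly_digits(self, count):
        self._parts.append(r"\d{%d}" % count)
        return self

    def pattern(self):
        # The builder's only product is an ordinary pattern string.
        return "".join(self._parts)

    def compile(self):
        # Hand the string to the stock engine; the builder adds no matching logic.
        return re.compile(self.pattern())


price = PatternBuilder().then("$").digits().then(".").exactly_digits(2)
print(price.pattern())                             # prints \$\d{1,}\.\d{2}
print(bool(price.compile().fullmatch("$10.00")))   # prints True
```

Any cost the builder adds is paid once, when the pattern is assembled; after `compile()` the resulting object is the same kind of compiled pattern a hand-written string would produce.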

jluxenberg · 11y ago · 7 in thread

S-expressions are a natural fit for construction of regular expressions, see http://community.schemewiki.org/?scheme-faq-programming#H-1w...

e.g.

  (: (or (in ("az")) (in ("AZ"))) 
    (* (uncase (in ("az09")))))

maratd · 11y ago

Regular expressions are a natural fit for construction of regular expressions.

Look, I know it takes a while, but once you get the hang of it, you won't need any crutches to write regular expressions. The only tool that's really needed is a way to rigorously test a regular expression to make sure it does what it needs to do and there are a ton of those around.

andrewflnr · 11y ago

No, they're really not, as evidenced by all the quoting and meta-character nonsense you have to deal with. Sure, it's not too difficult to figure out, most of the time, but I think a solution that puts characters and logic on different quoting levels will almost always be better from an expressiveness standpoint (ignoring ecosystem issues).

Ronsenshi · 11y ago

I agree with you. Every now and then I see mentions of "all-new-regex-builder" on the HN front page. What is up with regex and the desire to write wrappers upon wrappers on top of it?

I see regex like this: if you have to use it often enough, it's better to learn it as it is - it will be more helpful in the long run. If you don't use regex too often, then just google your question - there's a very high chance that somebody has already written a regex for your problem or a similar one.

Only tools I ever use are regex testers (like regexr.com) when I need to make sure that pattern works correctly.

to3m · 11y ago

But alternative syntaxes are regular expressions too.

davelnewton · 11y ago

It's not a "crutch", it's an "alternative". Couching it in negative terms isn't really fair.

While I prefer writing regexes, a regex DSL isn't fundamentally better or worse, just different. In addition, it allows non-computer people to write, or at least specify, regexes in a way that makes more sense to non-developers.

skymt · 11y ago

Alternate representations of regexes aren't necessarily a crutch to avoid learning the normal syntax. S-expressions in particular could be useful for runtime manipulation or generation of patterns without the bother of string mangling. (I can't think of a reason to do so off-hand, but it's a nifty capability.)
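
To make the "patterns as data" point concrete, here is a rough Python sketch that uses nested tuples in place of S-expressions. The node names (`seq`, `or`, `some`, `digit`) and the `repeat` helper are assumptions chosen for this example, not any existing library's vocabulary.

```python
import re


def to_regex(node):
    """Render a nested-tuple pattern tree as a regex string (illustrative only)."""
    if isinstance(node, str):
        return re.escape(node)              # plain strings are literals
    op, *args = node
    if op == "seq":                         # concatenation
        return "".join(to_regex(a) for a in args)
    if op == "or":                          # alternation
        return "(?:" + "|".join(to_regex(a) for a in args) + ")"
    if op == "some":                        # one or more repetitions
        return "(?:" + to_regex(args[0]) + ")+"
    if op == "digit":
        return r"\d"
    raise ValueError("unknown node: %r" % (op,))


def repeat(node, count):
    # Patterns are ordinary data, so they can be transformed before rendering,
    # with no string splicing or re-escaping involved.
    return ("seq",) + (node,) * count


price = ("seq", "$", ("some", ("digit",)), ".", repeat(("digit",), 2))
print(to_regex(price))                                  # prints \$(?:\d)+\.\d\d
print(bool(re.fullmatch(to_regex(price), "$10.00")))    # prints True
```

Because escaping happens in exactly one place (`to_regex`), generated sub-patterns cannot accidentally inject metacharacters into the final string.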

coldtea · 11y ago

>Regular expressions are a natural fit for construction of regular expressions.

The particular syntax we use (which is not that great) is not THE "regular expressions"; it's just one syntax we arrived at.

That is, the "regular expressions" name doesn't refer to the syntax, but to the concept.

chris-at · 11y ago · 6 in thread

Thanks, this is a lot better than writing this (even if the formatting worked here):

``` (?xi) \b ( # Capture 1: entire matched URL (?: [a-z][\w-]+: # URL protocol and colon (?: /{1,3} # 1-3 slashes | # or [a-z0-9%] # Single letter or digit or '%' # (Trying not to match e.g. "URI::Escape") ) | # or www\d{0,3}[.] # "www.", "www1.", "www2." … "www999." | # or [a-z0-9.\-]+[.][a-z]{2,4}/ # looks like domain name followed by a slash ) (?: # One or more: [^\s()<>]+ # Run of non-space, non-()<> | # or \(([^\s()<>]+|(\([^\s()<>]+\)))\) # balanced parens, up to 2 levels )+ (?: # End with: \(([^\s()<>]+|(\([^\s()<>]+\)))\) # balanced parens, up to 2 levels | # or [^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct chars ) ) ```

_lce0 · 11y ago

actually most of the comments seem to imply that whoever wrote that doesn't fully understand regexp syntax -- or, worse, she expects that whoever reads it will not

    /{1,3}                        # 1-3 slashes
    |                             #   or
    [a-z0-9%]                     # Single letter or digit or "%";

GhotiFish · 11y ago

err... sorry?

https://www.debuggex.com/r/EpocMU_7Fq_B_p9z

edit:

wait, I thought about it for a second and I see what you meant. You're not saying it's wrong, you're saying it's obvious.

I wasn't sure if it was obvious because I wasn't sure if {1,3} was supposed to be {1-3} and there was a mistake in the expression, or if there was some kind of unexpected error in the [a-z0-9%] expression.

Because even in this simple example, there is room for error.

tlrobinson · 11y ago

Properly formatted (to be fair this is from a blog post explaining how the regex works: http://daringfireball.net/2010/07/improved_regex_for_matchin...):

    (?xi)
    \b
    (                           # Capture 1: entire matched URL
      (?:
        [a-z][\w-]+:                # URL protocol and colon
        (?:
          /{1,3}                        # 1-3 slashes
          |                             #   or
          [a-z0-9%]                     # Single letter or digit or '%'
                                        # (Trying not to match e.g. "URI::Escape")
        )
        |                           #   or
        www\d{0,3}[.]               # "www.", "www1.", "www2." … "www999."
        |                           #   or
        [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
      )
      (?:                           # One or more:
        [^\s()<>]+                      # Run of non-space, non-()<>
        |                               #   or
        \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
      )+
      (?:                           # End with:
        \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
        |                                   #   or
        [^\s`!()\[\]{};:'".,<>?«»“”‘’]        # not a space or one of these punct chars
      )
    )

raiph · 11y ago

cf the Perl 6 community module for parsing URIs which features Perl 6's unique unification of regexes and grammars:

https://github.com/perl6-community-modules/uri/blob/master/l...

whichdan · 11y ago

HN doesn't support Markdown. You'll need to prefix each line with >= 2 spaces for it to be treated as code.

https://news.ycombinator.com/formatdoc

UnoriginalGuy · 11y ago

That really is Hacker News' worst limitation. I understand if they want to limit what formatting is available, but the fact that basic listing is so clunky is annoying.

draegtun · 11y ago · 4 in thread

Thought this might be of interest; below shows how the examples provided would look in Rebol:

    digits: digit: charset "0123456789"

    rule: [
        thru "$"
        some digits
        "."
        digit
        digit
    ]

    parse "$10.00" rule    ;; true


    pattern: [
        some "p"
        2 "q" any "q"
    ]

    new-rule: [
        2 pattern
    ]

    parse "pqqpqq" new-rule    ;; true

Rebol doesn't have regular expressions; instead it comes with a parse dialect, which is a TDPL - http://en.wikipedia.org/wiki/Top-down_parsing_language

Some parse refs: http://en.wikibooks.org/wiki/REBOL_Programming/Language_Feat... | http://www.rebol.net/wiki/Parse_Project | http://www.rebol.com/r3/docs/concepts/parsing-summary.html

_lce0 · 11y ago

hey, thanks for sharing!

TIL

    Although Rebol can be used for programming, 
    writing functions, and performing processes, 
    its greatest strength is the ability to 
    easily create domain-specific languages or 
    dialects.
        — Carl Sassenrath [Rebol author]

https://en.wikipedia.org/wiki/Rebol

carlob · 11y ago

Mathematica also has its own string pattern syntax

http://reference.wolfram.com/language/ref/StringExpression.h...

Something like that would be

    StringExpression[
        "$",
        Repeated[DigitCharacter],
        ".",
        DigitCharacter,
        DigitCharacter
    ]

    StringExpression[
        "$",
        Repeated[DigitCharacter],
        ".",
        Repeated[DigitCharacter, {2}]
    ]

    StringExpression[
        "$",
        NumberString
    ]

and the other is

    StringExpression[
        Repeated[
           StringExpression[
               Repeated["p", {1, Infinity}],
               Repeated["q", {2, Infinity}]
           ],
           {2}
        ]
    ]

This can be made more concise since StringExpression has an infix form (~~) and Repeated can sometimes be replaced by postfix ..

akater · 11y ago

> Repeated can sometimes be replaced by postfix ..

Always, not sometimes. ;-)

raiph · 11y ago

Perl 6 unifies "regexes" and recursive descent parsing:

  '$10.00' ~~ rx{ \$ \d+ \. \d\d };

  my $pat = rx{ p+ q**2..Inf }; 'pqqpqq' ~~ rx{ <$pat>**2 }

Note that these "regexes" are syntax, not strings, checked and converted into a hybrid DFA/NFA at compile-time.

marktangotango · 11y ago · 4 in thread

Generally, I find that if one's regexes are so complex that one needs visualizers or other aids in writing them, one doesn't have a regex problem, but a parsing problem. The method of parsing by recursive descent can often lead to much more understandable (if more verbose) "pattern matching".
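
As a small illustration of the recursive-descent style, here is a Python sketch that parses parenthesised groups to any depth, something the URL regex quoted earlier in the thread only handles "up to 2 levels". The function name `parse_group` and the tiny grammar are assumptions made up for this example.

```python
def parse_group(text, pos=0):
    """Parse '(' item* ')' where an item is a nested group or a run of
    non-parenthesis characters. Returns (tree, next_pos).
    Raises ValueError on malformed input."""
    if pos >= len(text) or text[pos] != "(":
        raise ValueError("expected '(' at position %d" % pos)
    pos += 1
    items = []
    while pos < len(text) and text[pos] != ")":
        if text[pos] == "(":
            # One function per grammar rule; recursion handles any nesting depth.
            child, pos = parse_group(text, pos)
            items.append(child)
        else:
            start = pos
            while pos < len(text) and text[pos] not in "()":
                pos += 1
            items.append(text[start:pos])
    if pos >= len(text):
        raise ValueError("unbalanced input: missing ')'")
    return items, pos + 1


tree, end = parse_group("(a(b(c))d)")
print(tree)   # prints ['a', ['b', ['c']], 'd']
```

It is more verbose than a pattern, but each branch is a place to attach an error message or handle a format exception explicitly.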

otakucode · 11y ago

The worst regexes I've had to write involved parsing the various IMDB data files, which seem to have been formatted specifically to make them as difficult to parse as possible. I hear mediawiki syntax is similarly arcane and evil, but I've never tried to parse it (though last night I started writing some tools to deal with wikipedia dumps so I might end up in that corner). I'd really like to see different approaches to parsing really ugly formats that feature an exception to almost every single pattern you think you've found. I honestly think the regex is easiest...

DenisM · 11y ago

Recursive descent is imperative, while regex is declarative.

Regex may be ugly, but you lose something important when you move from declarative to imperative.

jerf · 11y ago

"Recursive descent" has that name precisely because it is not the only parsing alternative, hence we can not simply call it "parsing".

raiph · 11y ago

Perl 6 unifies "regexes" and recursive descent. See https://news.ycombinator.com/item?id=9039680 or, say, https://github.com/Mouq/json5/blob/master/lib/JSON5/Tiny/Gra...

jgalt212 · 11y ago · 2 in thread

Definitely a debuggable way to write regexes. Whenever I have to maintain a hairy regex, I like to plot the regex as a railroad diagram.

These web based tools can do it:

https://www.debuggex.com/

http://jex.im/regulex/

philjohn · 11y ago

Love it - just visualised the PCRE generated from the EBNF for the N-Triples RDF serialisation format[1] :)

https://www.debuggex.com/r/Yxqws81Uif-BGBN8

Important note - this is built up programmatically, it's not just a string dumped in a parser!

[1] http://www.w3.org/TR/n-triples/#n-triples-grammar

jgalt212 · 11y ago

That is one hairy regex. Now the inverse would be even better. You modify the railroad chart and the regex updates.

tragomaskhalos · 11y ago

There have been many efforts similar to this in many languages, but most of us seem happy to stick to the more succinct canonical form, supplemented via /x # comments when things get too hairy.
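
For readers who have not used the /x style: Python's `re` module offers the same thing through the `re.VERBOSE` flag. A short sketch, using the "$10.00" price example from elsewhere in the thread (the name `PRICE` is just for illustration):

```python
import re

# With re.VERBOSE, unescaped whitespace in the pattern is ignored and
# "#" starts a comment that runs to the end of the line.
PRICE = re.compile(
    r"""
    \$          # literal dollar sign
    \d+         # one or more digits
    \.          # literal decimal point
    \d{2}       # exactly two digits
    """,
    re.VERBOSE,
)

print(bool(PRICE.fullmatch("$10.00")))   # prints True
print(bool(PRICE.fullmatch("$10.0")))    # prints False
```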

dkarapetyan · 11y ago

Generalize just a little bit and you got parser combinators.
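
A rough Python sketch of what that generalisation might look like. Here a "parser" is any function taking `(text, pos)` and returning the new position or `None`; the helper names `char_in`, `seq`, `many1`, and `matches` are invented for this example. Note that `many1` is greedy and never backtracks, unlike a typical regex engine.

```python
def char_in(chars):
    # Parser that consumes one character if it is in `chars`.
    def parse(text, pos):
        return pos + 1 if pos < len(text) and text[pos] in chars else None
    return parse


def seq(*parsers):
    # Run parsers one after another; fail as soon as any of them fails.
    def parse(text, pos):
        for parser in parsers:
            pos = parser(text, pos)
            if pos is None:
                return None
        return pos
    return parse


def many1(parser):
    # One or more repetitions, greedy, no backtracking.
    # Assumes `parser` always consumes input when it succeeds.
    def parse(text, pos):
        pos = parser(text, pos)
        if pos is None:
            return None
        while True:
            nxt = parser(text, pos)
            if nxt is None:
                return pos
            pos = nxt
    return parse


def matches(parser, text):
    # True only if the parser consumes the whole input.
    return parser(text, 0) == len(text)


digit = char_in("0123456789")
price = seq(char_in("$"), many1(digit), char_in("."), digit, digit)

print(matches(price, "$10.00"))   # prints True
print(matches(price, "$10.0"))    # prints False
```

The chained-method builders discussed in this thread stop at producing a pattern string; combinators like these are the matcher themselves, which is why they extend naturally to recursive grammars.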

zzzcpan · 11y ago

Regexpes exist to avoid cumbersome code like this, to make it less error prone. Makes me sad to see so many upvotes.

I get that some people have a hard time understanding regexpes with all the backtracking and greediness. Yes, the syntax is a bit complicated. Maybe a simplified, predictable default mode could help. But there is no problem with a DSL being used as an abstraction. In fact, we need more DSLs, for everything!

psychometry · 11y ago

Now you have three problems.

kazinator · 11y ago

Yes, regexes can have other syntactic representations, like:

    (compound "$" (1+ :digit) "." :digit :digit)

Run:

    $ txr -p "(regex-compile '(compound \"$\" (1+ :digit) \".\" :digit :digit))"
    #/$\d+\.\d\d/

epicureanideal · 11y ago

Nice work! I don't know if it'll be ideal for all use cases, but it does add some readability.

otakucode · 11y ago

Now do an example where you create a regex to parse the IMDB movies.list data file!

gcao · 11y ago

Great work! This is very intriguing!
