Oh Yes You Can Use Regexes to Parse HTML

(stackoverflow.com)

26 points · JackeJR · 5y ago · 7 comments

kevincox · 5y ago

To be pedantic, this isn't parsing HTML with a regex. This is using regexes to write an HTML parser. You can definitely use regexes inside an HTML parser, but that doesn't mean you can parse HTML with just a regex.

dokem · 5y ago

Which is how these things usually go. You tokenize with regex then parse the token stream.
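To make the tokenize-then-parse split concrete, here is a minimal sketch in Python. The `TOKEN` pattern and the `tokenize` and `parse` helpers are hypothetical names made up for illustration, and the pattern deliberately ignores real-world cases such as a `>` inside an attribute value, `<script>` bodies, and implied end tags. The regex only finds token boundaries; the nesting is handled by an ordinary stack.

```python
import re

# Hypothetical token pattern: comment, closing tag, opening tag, or text.
TOKEN = re.compile(
    r"<!--.*?-->"
    r"|</(?P<close>[A-Za-z][\w-]*)\s*>"
    r"|<(?P<open>[A-Za-z][\w-]*)[^>]*?(?P<selfclose>/?)>"
    r"|(?P<text>[^<]+)",
    re.S,
)

def tokenize(html):
    """Yield (kind, value, self_closing) tuples; comments are dropped."""
    for m in TOKEN.finditer(html):
        if m.group("open"):
            yield ("open", m.group("open").lower(), bool(m.group("selfclose")))
        elif m.group("close"):
            yield ("close", m.group("close").lower(), False)
        elif m.group("text") and m.group("text").strip():
            yield ("text", m.group("text").strip(), False)

def parse(tokens):
    """Build a nested dict tree from the token stream using a stack."""
    root = {"tag": "#root", "children": []}
    stack = [root]
    for kind, value, self_closing in tokens:
        if kind == "text":
            stack[-1]["children"].append(value)
        elif kind == "open":
            node = {"tag": value, "children": []}
            stack[-1]["children"].append(node)
            if not self_closing:
                stack.append(node)
        elif kind == "close":
            # Pop back to the nearest matching open tag; ignore strays.
            for i in range(len(stack) - 1, 0, -1):
                if stack[i]["tag"] == value:
                    del stack[i:]
                    break
    return root

tree = parse(tokenize("<p>Hi <b>there</b><br/></p>"))
```

The stack is the part a single regex cannot express, which is the distinction the parent comment draws.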

poisson_myfish5y ago

I mean, I've seen parsers made up of Regex only. My eyes hurt after that.

inshadows · 5y ago

I do it often, partly to troll people who religiously claim I must not. When I extract stuff from HTML with a regex I don't really care about reliability, robustness, interface design, encapsulation, separation of concerns, etc. I just want that damn string and to be done with it, and what's easier than a curl | sed pipeline?

EDIT: OK, the article actually talks about parsing the tree. I meant just extracting some strings.
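For the "just extract one string" case, a rough Python equivalent of that kind of one-liner might look like the sketch below. The `extract_title` helper is a hypothetical name, and the pattern assumes a single well-formed `<title>` element; it makes no attempt at the robustness the comment says it doesn't need.

```python
import re

def extract_title(html):
    """Return the text of the first <title> element, or None if absent."""
    m = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
    return m.group(1).strip() if m else None

page = "<html><head><TITLE>\n Hello </TITLE></head><body></body></html>"
title = extract_title(page)
```

This pulls a substring out of the page; it does not build or walk the tree, which is the distinction the EDIT note makes.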

perl4ever · 5y ago

I made an anti-regex comment (https://news.ycombinator.com/item?id=26310559) a few days ago that could be misconstrued as inviting this sort of response. But I was talking about XML in files used for data interchange that were sometimes a gigabyte or two.

There's a link in the Stack Overflow thread to a purported XML parser using regexes at https://www2.cs.sfu.ca/~cameron/REX.html#IV.3

Looking at the CDATA section, I think it's an example of the sort of pitfalls you run into. He thinks you can put "arbitrary content" in it, but that's a pet peeve of mine; it's not so. I once got a file to ingest that made that mistake, so I'm sure he isn't the only one.
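One well-known restriction of this kind is that a CDATA section cannot contain the sequence `]]>`, because that sequence ends the section. The comment does not say which restriction the bad file violated, so treat the following Python sketch as an illustration of that one case only. The `cdata` and `round_trips` helpers are hypothetical names; the parsing uses the standard-library `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

def cdata(text):
    """Wrap text in CDATA, splitting any "]]>" across two adjacent sections."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

def round_trips(doc, payload):
    """True if doc parses and its root element's text equals payload."""
    try:
        return ET.fromstring(doc).text == payload
    except ET.ParseError:
        return False

payload = "if (a[b[0]]>c) run();"

# Naive wrapping: the "]]>" inside the payload closes the section early.
naive = "<d><![CDATA[" + payload + "]]></d>"

# Safe wrapping: the terminator is split so neither section contains it.
safe = "<d>" + cdata(payload) + "</d>"

naive_ok = round_trips(naive, payload)
safe_ok = round_trips(safe, payload)
```

Here `safe_ok` is true while `naive_ok` is false, so "arbitrary content" only works if the writer escapes the terminator.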

avmich · 5y ago

Should we follow the French example, when the Academy stopped accepting perpetual-motion claims for consideration?

1996 · 5y ago

It is dangerous but it often pays.

Perl parsers for HTML are still one of the fastest ways to handle it. There's no parallelism, so spin up one thread per core and keep it well fed. You will quickly become IO bound.
