Parsing HTML Using Regular Expressions

(stackoverflow.com)

22 points · yammesicka · 8y ago · 17 comments

14 comments · 6 top-level

You know, when you first read that you think: damn straight, you can't parse HTML with regex. But as it goes on, the idea gets strangely enticing. I mean, maybe, with the correct rituals, a gun, and a willingness to fight ancient evils, you could parse some HTML with regex. A sort of Lovecraft/action flick.

ythn · 8y ago

Using regex in Python to scrape some data from a website works just fine for me... shrugs

Analemma_ · 8y ago

It’s an acceptable choice for scraping, because with scraping you usually go into it with the expectation that it will break and need updates sometimes as the scraped site changes. The warning bells are more for things like validating user input on your own site, where there might be security implications of an impossible parsing task.

DonHopkins · 8y ago

And I use a handgun to hunt houseflies in my motel room after a week long crystal meth binge. Works just fine for me.

scarface74 · 8y ago

Is there ever a good reason to build a web scraper with regex instead of using a proper prebuilt parser and walking the DOM?

monk_e_boy · 8y ago

Beautiful Soup is a better tool.

Regex is fine for grabbing something in a page that you have looked at yourself.

When parsing millions of pages you don't have this option; you need something robust: a tool that is flexible, that doesn't barf out too many errors, and that is quick.
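A minimal sketch of the difference being described, using Python's standard-library `html.parser` as a stand-in for Beautiful Soup so the example has no dependencies (that substitution is an assumption, not something the comment says). The sample `PAGE`, the `NAIVE_HREF` pattern, and the helper names are all hypothetical.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page: quoting, case, and layout vary as they do on real sites.
PAGE = """
<a href="/one">one</a>
<A HREF='/two' class="x">two</A>
<a class="y"
   href = "/three">three</a>
<!-- <a href="/commented-out">ignored</a> -->
"""

# A pattern written against the one form of link you happened to look at.
NAIVE_HREF = re.compile(r'<a href="([^"]*)"')


class LinkCollector(HTMLParser):
    """Collects href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


def links_with_regex(html):
    return NAIVE_HREF.findall(html)


def links_with_parser(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

On this sample the regex finds only the first link plus the one inside the comment, while the parser finds the three real links and skips the comment. The regex could be patched for each case, which is fine for one page you have inspected and gets painful across millions.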

vermooten · 8y ago

yep me too, no problem.

BrandoElFollito · 8y ago

I am a moderately active user of SE (~25k of flair) and I find the contrast between the regular channel (say, Stack Overflow) and the Meta one (SO Meta) horrifying.

The SO Meta community is such a bunch of bullies that I now hardly go there (even though I recently found two bugs, which I did not bother to post). In contrast, the regular channels are pragmatically helpful (pragmatically because you still need to make some offerings to the gods, called "what effort have you put into the question", and suffer some psychotic downvoters). It is interesting to see that both populations are composed of the same individuals, who seem to have a personality flip when switching channels.

I would be interested someday to learn about the dynamics of such groups. There are plenty of places on the Internet populated by mentally deranged participants (cowards hiding behind the Internet), but the SE Meta ones are, I believe, more educated / intelligent on average and, sometimes, more traceable.

xenomachina · 8y ago

For a long time, the SO and Meta.SO scores were completely separate. This meant that even if you were a top contributor on SO, you might have very low meta-reputation. It was pretty screwed up. I have a fairly high SO rep (top < 0.2%), but in those days my meta-rep wasn't even high enough to unlock many basic features. I remember reporting bugs and getting treated like a noob.

I'd even complained about the reps being separate, pointing out how this gave the power on meta to people who didn't even necessarily contribute on the main site. A bunch of high meta-rep users descended, simultaneously shooting down the idea of merging the reps while admitting that the main reason they liked the status quo was that they didn't want to lose their precious karma. I was pretty surprised when SO eventually fixed this, but it's kind of too little, too late. I don't bother with meta anymore despite now having a high rep on it. Too many bad memories.

BrandoElFollito · 8y ago

If it were me, I would not have put any rep in Meta. The fact that one is a genius in Java and Python does not mean that he or she is a good moderator or site admin.

Anyway, I find it sad that they are losing some possibly useful feedback in the name of self-adoration. And this particularly because SE is a fantastic source of knowledge; just reading the Hot Topics made me learn about subjects I had never looked at.

jlhawn · 8y ago

Maybe I'm misunderstanding the question, but it sounds like the question is not asking how to parse HTML with a regex, but how to match HTML open tags specifically.

While you obviously can't match arbitrary HTML with a regex (because arbitrary levels of nested elements require a stack-based parser), can you not match HTML tags with a regex? It seems to me that it should be possible, since you always have the pattern '<' followed by the name of the tag, followed by zero or more "key=quoted-val" attributes, and finally a '>' token.

So, if the question is limited to just how to parse a single open token then it seems like all of the answers have just decided to echo what they've heard in the past which is "don't use regular expressions to parse HTML" when the truth is that a real HTML lexer/parser does use regular expressions for creating these "open" and "close" element tokens for the parser.
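A rough sketch of the pattern described above: '<', a tag name, zero or more key=quoted-val attributes, then '>'. It covers only that simplified grammar; unquoted or valueless attributes, self-closing tags, and tags sitting inside comments or scripts are outside it. The names `ATTR`, `OPEN_TAG`, and `find_open_tags` are made up for illustration.

```python
import re

# One attribute: whitespace, a name, '=', then a double- or single-quoted value.
ATTR = r"""\s+[A-Za-z_:][-A-Za-z0-9_:.]*\s*=\s*(?:"[^"]*"|'[^']*')"""

# '<' + tag name + zero or more attributes + optional whitespace + '>'.
OPEN_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)((?:" + ATTR + r")*)\s*>")


def find_open_tags(html):
    """Return the names of the open tags matched by the simplified pattern."""
    return [m.group(1) for m in OPEN_TAG.finditer(html)]
```

Because quoted values are matched as a unit, a '>' inside an attribute value does not end the tag early, and close tags like `</p>` are skipped since '/' is not a valid first character of a tag name.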

Tloewald · 8y ago

This is a fun (and classic) thread and it's worth reading the pro and con arguments.

It really falls under the old joke "you have a problem and you decide to solve it with regex, now you have two problems". HTML is very gnarly, and regex is very gnarly. Doesn't mean you can't get shit done if you're aware of the pitfalls.

dukoid · 8y ago

It should be possible to tokenize HTML with regular expressions, and that's all he seems to be asking for...
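To illustrate the split between tokenizing and parsing, here is a small sketch, not a real HTML lexer: a regex turns the input into comment/open/close/text tokens, and a separate stack checks nesting, which is the part a regex alone cannot do. `TOKEN`, `tokenize`, and `is_balanced` are hypothetical names, and void elements such as `<br>`, along with script and style content, are deliberately ignored.

```python
import re

TOKEN = re.compile(
    r"""
    <!--(?P<comment>.*?)-->                   # comment
  | </(?P<close>[A-Za-z][A-Za-z0-9]*)\s*>     # close tag
  | <(?P<open>[A-Za-z][A-Za-z0-9]*)           # open tag name
      (?:[^<>"']|"[^"]*"|'[^']*')*>           #   attributes, quoted values kept whole
  | (?P<text>[^<]+)                           # run of text
    """,
    re.VERBOSE | re.DOTALL,
)


def tokenize(html):
    """Split html into (kind, value) pairs using the TOKEN pattern."""
    tokens = []
    pos = 0
    while pos < len(html):
        m = TOKEN.match(html, pos)
        if m is None:
            # A stray '<' that starts no recognizable token: keep it as text.
            tokens.append(("text", html[pos]))
            pos += 1
            continue
        kind = m.lastgroup
        tokens.append((kind, m.group(kind)))
        pos = m.end()
    return tokens


def is_balanced(tokens):
    """Check open/close nesting with a stack (the non-regular part)."""
    stack = []
    for kind, value in tokens:
        if kind == "open":
            stack.append(value.lower())
        elif kind == "close":
            if not stack or stack.pop() != value.lower():
                return False
    return not stack
```

For example, `tokenize('<p class="a">hi <b>there</b></p>')` yields open `p`, text, open `b`, text, close `b`, close `p`, and `is_balanced` accepts that stream while rejecting the tokens of `<div><p></div>`.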

krallja · 8y ago

(2009)
