EDIT: OK, the article actually talks about parsing the tree. I meant just extracting some strings.
There's a link in the Stack Overflow thread to a purported XML parser built from regexes at https://www2.cs.sfu.ca/~cameron/REX.html#IV.3
Looking at the CDATA section, I think it's an example of the sort of pitfall you run into. He thinks you can put "arbitrary content" in it, which is a pet peeve of mine, because it's not so: at the very least a CDATA section can't contain the sequence "]]>", and it is still limited to the characters XML allows. I once got a file to ingest that made that mistake, so I'm sure he isn't the only one.
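To make the CDATA point concrete, here's a rough Python sketch (my own illustration, not anything from the REX article). It assumes the standard library's xml.etree.ElementTree as the parser, and wrap_cdata and parsed_text are hypothetical helper names I made up. It shows a payload containing "]]>" that does not survive naive wrapping, plus the usual workaround of splitting the sequence across two adjacent CDATA sections.

```python
import xml.etree.ElementTree as ET


def wrap_cdata(text):
    # Hypothetical helper: split every "]]>" across two CDATA sections
    # so that no single section contains the terminator sequence.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def parsed_text(xml):
    # Return the root element's text, or None if the parser rejects the input.
    try:
        return ET.fromstring(xml).text
    except ET.ParseError:
        return None


# Perfectly ordinary code that happens to contain "]]>".
payload = "if (a[b[0]]>c) x = 1;"

naive = "<r><![CDATA[" + payload + "]]></r>"
safe = "<r>" + wrap_cdata(payload) + "</r>"

# The naive version ends the CDATA section early, so the payload is lost
# (either the parser errors out or the text comes back different).
print("naive round-trips:", parsed_text(naive) == payload)
# The split version comes back as the original string.
print("safe round-trips: ", parsed_text(safe) == payload)
```

This only covers the "]]>" case. Characters that XML forbids outright (most control characters, for instance) can't be rescued by CDATA at all and need some other encoding.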
Perl parsers for HTML are still one of the fastest ways to handle it. There's no built-in parallelism, so spin up one thread per core and keep each one well fed. You will quickly become I/O bound.