Use an HTML parser to parse HTML.
You are also way off in your estimate of how common XHTML is on the web, since you thought this would be a useful PSA, and you seem unaware of what <!doctype html> means here: it specifically is not XML. I’m not trying to be mean, but you came in with guns blazing with weird advice, and it seems very misled.
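
To make the point concrete, here is a rough sketch using only Python's standard library. The sample document and the names `DOC`, `LinkCollector`, `parse_html`, and `xml_parse_ok` are made up for illustration; the idea is just that an HTML parser copes with a `<!doctype html>` page that has unclosed tags, while an XML parser rejects the same input.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical sample page: HTML5 doctype, unclosed <p> and <br>.
DOC = "<!doctype html><html><body><p>Hello<br><a href='/x'>link</a></body></html>"


class LinkCollector(HTMLParser):
    """Collects the doctype declaration and every <a href=...> value."""

    def __init__(self):
        super().__init__()
        self.doctype = None
        self.links = []

    def handle_decl(self, decl):
        # Called for declarations such as <!doctype html>.
        self.doctype = decl

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def parse_html(text):
    """Parse text as HTML and return (doctype, list of link targets)."""
    parser = LinkCollector()
    parser.feed(text)
    parser.close()
    return parser.doctype, parser.links


def xml_parse_ok(text):
    """Return True if text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(parse_html(DOC))
    print(xml_parse_ok(DOC))
```

The HTML parser pulls the link out without complaint, while `xml_parse_ok(DOC)` comes back `False` because the page is not well-formed XML. For real scraping you would probably want a full HTML5 parser rather than `html.parser`, but the split between the two kinds of parser is the same.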