I suspect this is a hard problem and that deep learning is the state of the art. But maybe I'm missing something?
Just to be clear: given the HTML of a WaPo article, I want to discard all the affiliate links and comments and keep only the article text. I want a general solution that works across many blogs and news sites.
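To make the task concrete, here is a rough non-ML baseline, not a claim about what the state of the art does. It is a minimal sketch that assumes article text mostly lives in `<p>` tags, that boilerplate sits inside tags like `<nav>` or `<footer>`, and that affiliate blurbs are paragraphs made up mostly of link text. The names `ParagraphCollector` and `extract_article_text` and the two thresholds are made up for illustration and would need tuning per site.

```python
from html.parser import HTMLParser

# Assumption: content inside these tags is never article text.
SKIP_TAGS = {"script", "style", "nav", "aside", "footer", "form"}


class ParagraphCollector(HTMLParser):
    """Collects the text of each <p> and how much of it is link text."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0    # > 0 while inside a SKIP_TAGS element
        self.in_p = False
        self.link_depth = 0    # > 0 while inside an <a> within a paragraph
        self.text = []         # text chunks of the current paragraph
        self.link_chars = 0    # characters of the current paragraph inside links
        self.paragraphs = []   # (text, link_density) pairs

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and not self.skip_depth:
            self._flush()      # handles a previous <p> that was never closed
            self.in_p = True
        elif tag == "a" and self.in_p:
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "p":
            self._flush()
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        if self.in_p and not self.skip_depth:
            self.text.append(data)
            if self.link_depth:
                self.link_chars += len(data.strip())

    def _flush(self):
        # Normalize whitespace, then record the paragraph if it has any text.
        text = " ".join("".join(self.text).split())
        if text:
            self.paragraphs.append((text, self.link_chars / len(text)))
        self.in_p = False
        self.link_depth = 0
        self.text = []
        self.link_chars = 0


def extract_article_text(html, min_chars=40, max_link_density=0.3):
    """Keep paragraphs that are long enough and not dominated by links."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser._flush()  # pick up a trailing unclosed paragraph
    kept = [t for t, density in parser.paragraphs
            if len(t) >= min_chars and density <= max_link_density]
    return "\n\n".join(kept)
```

Calling `extract_article_text(page_html)` returns the surviving paragraphs separated by blank lines. A heuristic like this will miss articles that use `<div>`s instead of `<p>`s and will keep long comment paragraphs, which is exactly why I suspect the general problem is hard.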
I'd like a daily feed of the prominent headlines from all the major global newspapers. Does anyone know of any good sources to start compiling this data, besides independently scraping hundreds of sites?