This is an evaluation of article body extraction for AutoExtract (ours), Diffbot, newspaper3k, readability-lxml, dragnet, boilerpipe and html-text. Since we're also evaluating the quality of our own system, we tried to be extra careful to be fair and transparent, releasing the dataset, the evaluation scripts, and full details in the technical report.
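For readers wondering what scoring an extractor against a ground-truth article body can look like, here is a minimal sketch. It assumes a plain bag-of-words overlap F1; the actual metric is defined in the technical report and the released evaluation scripts, and may differ. The names `tokenize` and `body_f1` are made up for this illustration.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase whitespace tokenization; the real scripts may differ.
    return text.lower().split()


def body_f1(predicted: str, reference: str) -> dict[str, float]:
    """Token-overlap precision, recall and F1 between two article bodies."""
    pred = Counter(tokenize(predicted))
    ref = Counter(tokenize(reference))
    # Multiset intersection: each token counts at most as often as in both texts.
    overlap = sum((pred & ref).values())
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    scores = body_f1("the quick brown fox", "the quick brown fox jumps")
    print(scores)
```

An extractor that returns only part of the article keeps precision high but loses recall, while one that also returns navigation or comments loses precision, so reporting both alongside F1 shows which way a library tends to fail.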
How do these compare with popular services like Pocket, Instapaper, and the Firefox and Safari extractors? Or do those services use these libraries/algorithms in the backend?
Browser extensions are in an interesting position here: they probably have access to much richer features from the browser context (element size, position, CSS properties), but still need to stay low-overhead. I think I saw such an implementation, maybe even from Mozilla, but can't find it right now.
In terms of the approach: the whole page is rendered in a headless browser, we take a full-page screenshot and extract the text and other features, and then feed all of them into a single neural network that joins the modalities and performs the extraction.
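To make "all modalities are joined" a bit more concrete, here is a toy sketch of feature fusion for a single page element. It is not the AutoExtract model: the feature choices, the `NodeFeatures` type, and the single logistic layer standing in for the neural network are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class NodeFeatures:
    """Hypothetical per-element inputs gathered from a rendered page."""
    text: str
    x: float            # bounding box in page pixels
    y: float
    width: float
    height: float
    visual: list[float]  # stand-in for features pooled from the screenshot region


def text_features(text: str) -> list[float]:
    words = text.split()
    n = len(words)
    # Log word count and comma density: simple hand-picked text signals.
    return [math.log1p(n), text.count(",") / max(n, 1)]


def layout_features(node: NodeFeatures, page_w: float, page_h: float) -> list[float]:
    # Normalize position and size by the page dimensions.
    return [node.x / page_w, node.y / page_h,
            node.width / page_w, node.height / page_h]


def fuse(node: NodeFeatures, page_w: float, page_h: float) -> list[float]:
    # "Joining" here is plain concatenation of the per-modality vectors.
    return text_features(node.text) + layout_features(node, page_w, page_h) + node.visual


def score(features: list[float], weights: list[float], bias: float) -> float:
    # One logistic unit as a placeholder for the real network.
    if len(features) != len(weights):
        raise ValueError("feature and weight vectors must have the same length")
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    node = NodeFeatures("Some paragraph text, with a comma.", 100, 400, 600, 80, [0.2, 0.7])
    feats = fuse(node, page_w=1200, page_h=3000)
    print(score(feats, weights=[0.0] * len(feats), bias=0.0))
```

The point of the sketch is only the shape of the pipeline: per-modality features are computed from the rendered page, combined into one vector, and scored by one model rather than by separate text-only and vision-only heuristics.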