Article extraction benchmark: open-source libraries and commercial services

(github.com)

19 points · lopuhin · 6y ago · 10 comments

7 comments · 2 top-level

lopuhin (OP) · 6y ago · 3 in thread

Author here, ready to answer any questions.

This is an evaluation of article body extraction for AutoExtract (ours), Diffbot, newspaper3k, readability-lxml, dragnet, boilerpipe and html-text. Since we're evaluating the quality of our system as well, we tried to be extra careful to be fair and transparent, releasing dataset, evaluation scripts and all details in the technical report.
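
For readers who want a feel for how extracted article bodies can be scored against a ground-truth text, here is a minimal sketch. It assumes a simple bag-of-tokens overlap metric; the benchmark's actual metric is defined in its technical report and evaluation scripts and may well differ. The names `tokenize` and `overlap_scores` are illustrative, not taken from the repository.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def overlap_scores(predicted, expected):
    """Return (precision, recall, F1) of token overlap between two texts.

    Assumption: tokens are compared as multisets, ignoring their order.
    """
    pred = Counter(tokenize(predicted))
    gold = Counter(tokenize(expected))
    # Multiset intersection: tokens present in both texts, with counts.
    true_positives = sum((pred & gold).values())
    precision = true_positives / sum(pred.values()) if pred else 0.0
    recall = true_positives / sum(gold.values()) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # An extractor that dropped the last word of the article body.
    print(overlap_scores("the quick brown fox", "the quick brown fox jumps"))
```

With a metric like this, an extractor that misses part of the article loses recall, while one that includes navigation or comments loses precision.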

ssivark · 5y ago

Thanks, as an avid reader who saves a lot of articles from the web, this is quite interesting. (I’m a Pocket subscriber btw)

How do these compare with popular services like Pocket, Instapaper, Firefox & Safari extractors, etc — or do those services use these libraries/algorithms in the backend?

kmike84 · 5y ago

Firefox Reader View uses https://github.com/mozilla/readability; if I'm not mistaken, it should be an algorithm which is similar to the one implemented in python-readability.

lopuhin (OP) · 5y ago

Great question! I'm an active Pocket user myself and would love to know what they use on the backend. Judging by their failures (deciding that something is not an article, or excluding some relevant content), I would guess they use something that works on pure HTML and is closer to current open-source solutions, whereas Diffbot's failures, for example, looked quite similar to ours, as we seem to use a similar approach (and it's quite rare to miss a large chunk of the article). I imagine Pocket's margins must be quite slim, so they can't throw a headless browser + neural network at every page. Maybe they can use higher-quality, more expensive extractors for popular articles.

Browser extensions are in an interesting position here as they can probably have access to much richer features from the browser context (element size, position, CSS properties), but still want to be low overhead. I think I saw such an implementation, maybe even from Mozilla, but can't find it right now.

freediver · 5y ago · 2 in thread

Do you plan to release this? If not, can you discuss your approach?

lopuhin (OP) · 5y ago

The service itself is already released: you can play with the demo without registration at https://www.scrapinghub.com/data-api-news (scroll to "Try it out here for yourself"), and we also have a free API trial. I would be curious to know how you plan to use it.

In terms of the approach: the whole page is rendered in a headless browser; we then extract a full-page screenshot, text and other features, and feed them into a single neural network, which joins all modalities and handles the extraction.

freediver · 5y ago

Did you base it off Dragnet? Can you comment on the other important parameter, the extraction speed?
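
For anyone wanting to measure extraction speed themselves, one simple option is to time each extractor over the same set of pages. This is only a rough sketch: `time_extractor` is a hypothetical helper, and `strip_tags` is a toy stand-in for a real extractor, not something from the benchmark.

```python
import re
import time
from statistics import mean


def time_extractor(extract, pages, repeats=3):
    """Return the mean wall-clock seconds per page for `extract`.

    `extract` is any callable that takes an HTML string; `pages` is a
    non-empty list of HTML strings. Each repeat processes every page once.
    """
    per_page = []
    for _ in range(repeats):
        start = time.perf_counter()
        for html in pages:
            extract(html)
        per_page.append((time.perf_counter() - start) / len(pages))
    return mean(per_page)


def strip_tags(html):
    """Toy stand-in extractor (an assumption for this sketch): drop tags."""
    return re.sub(r"<[^>]+>", " ", html)


if __name__ == "__main__":
    sample_pages = ["<html><body><p>Hello world</p></body></html>"] * 100
    seconds = time_extractor(strip_tags, sample_pages)
    print(f"{seconds * 1e6:.1f} microseconds per page")
```

For extractors that render pages in a headless browser, the browser start-up and rendering time would need to be included for the comparison to be fair.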
