lopuhin · 6y ago

Great question! I'm an active Pocket user myself and would love to know what they use on the backend. Judging by their failures (deciding that something is not an article, or excluding relevant content), I would guess they use something that works on pure HTML and is closer to current open source solutions. Diffbot's failures, by contrast, looked quite similar to ours, as we seem to use a similar approach (and it's quite rare to miss a large chunk of the article). I imagine Pocket's margins must be quite slim, so they can't throw a headless browser + neural network at every page. Maybe they can use higher-quality, more expensive extractors for popular articles.

Browser extensions are in an interesting position here: they can probably access much richer features from the browser context (element size, position, CSS properties), but still want to stay low overhead. I think I saw such an implementation, maybe even from Mozilla, but can't find it right now.


ssivark · 6y ago

Hmm, that’s interesting. BTW, are there known/standard ways to “ensemble” these different algorithms to build more robust solutions (at the price of some extra computation)? It’s not obvious to me how one would combine different extraction results, but maybe one could use some more heuristics to pick the best result for each example.

lopuhin (OP) · 6y ago

Yeah, I don't think it's trivial. If an algorithm made predictions at the level of HTML elements (whether an element should be part of the article body or not), it would be possible to combine probabilities or at least vote. But (a) that's probably a non-trivial modification, and (b) a lot of methods also use heuristics, postprocessing, or have other quirks that make combining results difficult.
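
To make the "combine probabilities or at least vote" idea concrete, here is a minimal sketch in Python. It assumes something that real extractors generally do not provide out of the box: that each one exposes a per-element score or selection, keyed by some shared element identifier (for example an XPath). The function names, the 0.5 threshold, and the choice to treat an unscored element as probability 0.0 are all illustrative assumptions, not any library's API.

```python
from collections import Counter


def average_probabilities(predictions, threshold=0.5):
    """Combine per-element probabilities from several extractors.

    predictions: list of dicts, one per extractor, mapping a hypothetical
    element id (e.g. an XPath) to the probability that the element
    belongs to the article body.
    """
    elements = set().union(*predictions)
    selected = set()
    for el in elements:
        # Assumption: an extractor that did not score an element
        # is treated as having assigned it probability 0.0.
        probs = [p.get(el, 0.0) for p in predictions]
        if sum(probs) / len(probs) >= threshold:
            selected.add(el)
    return selected


def majority_vote(selections):
    """Keep elements chosen by a strict majority of extractors.

    selections: list of sets of element ids, one set per extractor.
    """
    counts = Counter(el for chosen in selections for el in chosen)
    return {el for el, n in counts.items() if 2 * n > len(selections)}


# Hypothetical outputs from three extractors.
probs = [
    {"p1": 0.9, "p2": 0.2, "nav": 0.1},
    {"p1": 0.8, "p2": 0.7},
    {"p1": 0.6, "p2": 0.9, "nav": 0.4},
]
votes = [{"p1", "p2"}, {"p1"}, {"p1", "nav"}]

averaged = average_probabilities(probs)
voted = majority_vote(votes)
```

This only covers the easy part. It does not address the harder issue mentioned above: extractors that postprocess their output or do not map cleanly back to a shared set of elements.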
