
0 points · ramraj07 · 5y ago · 0 comments

ocr?


4 comments · 1 top-level

Triv888 · 5y ago · 3 in thread

Why go from text to image and back to text? Seems wasteful and error-prone...

It's a hard problem to figure out what's readable text on a page and what isn't. Even Google has a hard time figuring that out. OCR works very well with screenshots, and it costs only computation time. But the real reason is that timestamps, URLs, and screenshots are generally good enough on their own. I usually remember roughly when it was and some words in the URL, so I don't need the heavyweight text search setup.
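To make the "timestamps, URLs, and screenshots" idea concrete, here is a minimal sketch of that kind of lookup, with no text index at all. The `Visit` record, the `find_visits` helper, and the sample URLs and paths are all hypothetical and only illustrate the idea; they are not taken from the tool being discussed.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Visit:
    """One browsing-history entry (hypothetical record layout)."""
    when: datetime
    url: str
    screenshot: str  # path to the saved screenshot image


def find_visits(visits, url_words, start, end):
    """Return visits within [start, end] whose URL contains every given word."""
    words = [w.lower() for w in url_words]
    return [
        v for v in visits
        if start <= v.when <= end and all(w in v.url.lower() for w in words)
    ]


# Made-up sample history, for illustration only.
history = [
    Visit(datetime(2020, 3, 1, 9, 30), "https://example.com/blog/ocr-history", "shots/0001.png"),
    Visit(datetime(2020, 3, 2, 14, 5), "https://example.org/docs/search-setup", "shots/0002.png"),
    Visit(datetime(2020, 6, 10, 8, 0), "https://example.com/blog/ocr-revisited", "shots/0003.png"),
]

# "It was sometime in early spring, and the URL had 'ocr' in it."
hits = find_visits(history, ["ocr"], datetime(2020, 2, 1), datetime(2020, 4, 1))
for v in hits:
    print(v.when.isoformat(), v.url, v.screenshot)
```

A linear scan like this is probably fine for a personal history; the matching screenshots can then be skimmed by eye, which is the part OCR would otherwise have to cover.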

Moru · 5y ago

Just hard with the "read more" buttons.

ramraj07 · OP · 5y ago

Trying to parse the SPAs of today is just painful. Simpler to just render the page to a screenshot and OCR it! Guaranteed to only index text that actually matters.
