
0 points · smcin · 1y ago · 0 comments

Even the parsing of obfuscated HTML + CSS + dynamic JSON content?


2 comments · 2 top-level

Surprisingly yes, most of the time. I’ve put in a few optimizations:

1. Remove all <style> and <svg> tags. These rarely add value and can dramatically increase token counts.

2. For the “crawl” step, I pull out only the <a> tags and look at nothing else. The “extract” step looks at the full HTML.

3. For now, it only looks at the first 50k text characters, and the first 120k HTML characters. This is to stay within token limits.

The last part will be what I focus on improving in the next version.
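
For anyone curious what those three steps might look like, here is a rough sketch using only the Python standard library. It is not the actual implementation: the function names (`strip_noise`, `extract_links`, `visible_text`, `prepare_for_extract`) are made up for illustration, and the regex-based tag stripping is an assumption that will miss nested <svg> elements. Only the 50k and 120k limits come from the description above.

```python
import re
from html.parser import HTMLParser

# Limits quoted in the comment above; everything else here is hypothetical.
MAX_TEXT_CHARS = 50_000
MAX_HTML_CHARS = 120_000

# Matches a whole <style>...</style> or <svg>...</svg> element (non-nested).
_NOISE_RE = re.compile(r"<(style|svg)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)


def strip_noise(html: str) -> str:
    """Step 1: drop <style> and <svg> elements together with their contents."""
    return _NOISE_RE.sub("", html)


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html: str) -> list:
    """Step 2 ("crawl"): look only at <a> tags and return their hrefs."""
    parser = _LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


class _TextCollector(HTMLParser):
    """Collects non-empty text nodes (script contents are not filtered out)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def prepare_for_extract(html: str) -> tuple:
    """Step 3 ("extract"): clean the page, then cap text and HTML by length."""
    cleaned = strip_noise(html)
    return visible_text(cleaned)[:MAX_TEXT_CHARS], cleaned[:MAX_HTML_CHARS]
```

Cutting by character count is a blunt way to stay under a token limit, since it can slice through the middle of a tag. That is presumably part of why the truncation step is the one slated for improvement.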

Could go the Google way: capture a screenshot of the page state, OCR it, then parse the result.

They keep throwing it in my URL bar. I refuse to click (there's a big warning that it sends to Google's servers).
