Surprisingly, yes, most of the time. I’ve put in a few optimizations:
1. Remove all <style> and <svg> tags. These rarely add value and can dramatically increase token counts.
2. For the “crawl” step, I pull out only the <a> tags and look at nothing else. The “extract” step looks at the full HTML.
3. For now, it only looks at the first 50k text characters, and the first 120k HTML characters. This is to stay within token limits.
That last limit is what I’ll focus on improving in the next version.
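
To make the three optimizations concrete, here is a rough sketch in Python using only the standard library. It is not my actual implementation: the function names, the constants, and the regex-based tag stripping are all assumptions made for illustration, and a real HTML parser would be more robust than the regex shown here.

```python
import re
from html.parser import HTMLParser

# Limits taken from the list above; the constant names are hypothetical.
MAX_TEXT_CHARS = 50_000
MAX_HTML_CHARS = 120_000

# Matches a whole <style>...</style> or <svg>...</svg> block, attributes included.
# Assumption: the blocks are not nested and are properly closed.
_STRIP_RE = re.compile(r"<(style|svg)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)


def strip_style_and_svg(html):
    """Optimization 1: drop <style> and <svg> blocks to save tokens."""
    return _STRIP_RE.sub("", html)


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def links_for_crawl(html):
    """Optimization 2: the crawl step only needs the <a> tags."""
    parser = _LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def html_for_extract(html):
    """Optimization 3 (HTML side): clean first, then keep the first 120k characters."""
    return strip_style_and_svg(html)[:MAX_HTML_CHARS]


def text_for_model(text):
    """Optimization 3 (text side): keep the first 50k characters."""
    return text[:MAX_TEXT_CHARS]
```

Stripping before truncating means the character budget is spent on content rather than on stylesheets and vector graphics. Plain slicing can cut a tag in half at the boundary, which is one reason the truncation step is the weakest of the three.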