There are a few ways to do this that we've tried, namely Extractus[0] and dom-to-semantic-markdown[1].
Internally we use Apify[2] and Firecrawl[3] for Magic Loops[4] that run in the cloud, both of which have options for simplifying pages built-in, but for our Chrome Extension we use dom-to-semantic-markdown.
Similar to the article, we're currently exploring a user-assisted flow to generate XPaths for a given site, which we can then use to extract specific elements before hitting the LLM.
By simplifying the "problem" we've had decent success, even with GPT-4o mini.
[0] https://github.com/extractus
[1] https://github.com/romansky/dom-to-semantic-markdown
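For anyone curious what the "extract specific elements before hitting the LLM" step looks like, here's a minimal sketch using only the standard library. The page and selectors are made up; real-world HTML usually needs lxml, which supports full XPath 1.0 rather than ElementTree's limited subset:

```python
import xml.etree.ElementTree as ET

# Toy, well-formed page; real HTML generally needs lxml.html instead.
page = """<html><body>
  <div class="listing">
    <span class="title">Blue Widget</span>
    <span class="price">$19.99</span>
  </div>
</body></html>"""

root = ET.fromstring(page)
# ElementTree supports a limited XPath subset, enough for attribute predicates.
title = root.find(".//span[@class='title']").text
price = root.find(".//span[@class='price']").text

# Only this tiny snippet reaches the LLM, not the full page.
snippet = f"title: {title}\nprice: {price}"
print(snippet)
```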
We even have an iFrame-able live view of the browser, so your users can get real-time feedback on the XPaths they're generating: https://docs.browserbase.com/features/session-live-view#give...
Happy to answer any questions!
Do you handle authentication? We have lots of users that want to automate some part of their daily workflow but the pages are often behind a login and/or require a few clicks to reach the desired content.
Happy to chat: username@gmail.com
I do scraping, but I struggle to see what these tools are offering; maybe I'm just not the target audience. If the websites don't have much anti-scraping protection to speak of, and I only do a few pages per day, is there still something I can get out of a tool like Browserbase? And given all this talk about semantic markdown and LLMs: what's the benefit over writing (or even having an AI write) standard fetching and parsing code using playwright/beautifulsoup/cheerio?
I was just a bit confused that the sign-up buttons for the Hobby and Scale plans are grey; I thought they were disabled until I randomly hovered over them.
The page I found is labeled “Alpha Draft,” which suggests there isn’t a huge corpus of Semantic Markdown content out there. This might impede LLMs’ ability to understand it due to lack of training data. However, it seems sufficiently readable that LLMs could get by pretty well by treating its structured metadata as parentheticals:
=====
What is Semantic Markdown?
Semantic Markdown is a plain-text format for writing documents that embed machine-readable data. The documents are easy to author and both human and machine-readable, so that the structured data contained within these documents is available to tools and applications.
Technically speaking, Semantic Markdown is "RDFa Lite for Markdown" and aims at enhancing the HTML generated from Markdown with RDFa Lite attributes.
Design Rationale:
Embed RDFa-like semantic annotation within Markdown
Ability to mix unstructured human-text with machine-readable data in JSON-LD-like lists
Ability to semantically annotate an existing plain Markdown document with semantic annotations
Keep human-readability to a maximum
=====
I've been wanting to try the same approach and have been looking for the right tools.
Translating a complex JSON representing an execution graph to a simpler graphviz dot format first and then feeding it to an LLM. We had decent success.
Source: I have a toddler at home.
There is no need for a new API endpoint. Just send multiple requests at once.
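That is, fan the existing single-page endpoint out from the client side. A sketch with a thread pool, where `fetch` is a hypothetical stand-in for a real HTTP call like `requests.get`:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a real HTTP call (e.g. requests.get(url).text).
def fetch(url):
    return f"<html>content of {url}</html>"

urls = [f"https://example.com/page/{i}" for i in range(10)]

# Send multiple requests at once instead of waiting for a batch API.
with ThreadPoolExecutor(max_workers=5) as pool:
    pages = list(pool.map(fetch, urls))

print(len(pages))
```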
I recently built a web scraper to automatically work on any website [0] and built the initial version using AI, but I found that using heuristics based on element attributes and positioning ended up being faster, cheaper, and more accurate (no hallucinations!).
For most websites, the non-AI approach works incredibly well so I’d make sure AI is really necessary (e.g. data is unstructured, need to derive or format the output based on the page data) before incorporating it.
If you do like the author did and ask it to generate XPaths, you can use the LLM once, keep the generated XPaths for regular scraping, fall back to the LLM to regenerate them when they break, and fall back one more time to alerting a human if the data doesn't start flowing again, or if something breaks further down the pipeline because the data arrives in an unexpected format.
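A minimal sketch of that fallback ladder; all three helpers are hypothetical stand-ins for real scraping code, an LLM call, and an alerting hook:

```python
# Hypothetical stand-ins; swap in real lxml scraping and a real LLM call.
def scrape_with_xpaths(url, xpaths):
    # Pretend the cached selectors have gone stale.
    return None if xpaths == ["//old"] else {"title": "Widget"}

def regenerate_xpaths_with_llm(url):
    return ["//new"]

def alert_human(url):
    print(f"scraper for {url} needs attention")

def scrape(url, xpaths):
    data = scrape_with_xpaths(url, xpaths)        # cheap path: cached XPaths
    if data is None:                              # selectors broke, site changed
        xpaths = regenerate_xpaths_with_llm(url)  # expensive path: one LLM call
        data = scrape_with_xpaths(url, xpaths)
    if data is None:                              # LLM couldn't recover either
        alert_human(url)                          # last resort: page a human
    return data, xpaths

data, xpaths = scrape("https://example.com", ["//old"])
print(data, xpaths)
```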
> Turns out, a simple table from Wikipedia (Human development index) breaks the model because rows with repeated values are merged
If the cost of updating some XPaths every now and then is relatively low (which I guess means "your target site is not actively and deliberately obfuscating its website specifically to stop people scraping it"), running a basic XPath scraper would be maybe multiple orders of magnitude more efficient.
Using LLMs to monitor the changes and generate new xPaths is an awesome idea though - it takes the expensive part of the process and (hopefully) automates it away, so you get the benefits of both worlds.
I feel like if you used a DOM parser to walk the tree and only kept nodes with text, the HTML structure, and the necessary tag properties (class/id only, maybe?), you'd have significant savings. Perhaps the XPath thing might work better too. You could even drop unnecessary symbols and represent it as an indented text file.
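A rough sketch of that idea with the standard library's `html.parser`, keeping only tag names, class/id, and text, emitted as an indented outline (a production version would also need to handle void tags like `<br>` and `<img>`):

```python
from html.parser import HTMLParser

KEEP_ATTRS = {"class", "id"}
SKIP_TAGS = {"script", "style"}

class Simplifier(HTMLParser):
    """Walk the DOM, keeping tag names, class/id, and text as indented lines."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []
        self.skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skipping += 1
            return
        kept = " ".join(f"{k}={v}" for k, v in attrs if k in KEEP_ATTRS)
        self.lines.append("  " * self.depth + tag + (f" [{kept}]" if kept else ""))
        self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skipping -= 1
            return
        self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skipping:
            self.lines.append("  " * self.depth + text)

html = '<div id="main"><script>var x=1;</script><h1 class="title">Hello</h1><p>World</p></div>'
s = Simplifier()
s.feed(html)
print("\n".join(s.lines))
```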
We use Readability for things like this, but you lose the DOM structure, and its quality degrades with JS-heavy websites and pages with actions like "continue reading" that expand the text.
What's the gold standard for something like this?
Here's an example: https://r.jina.ai/https://simonwillison.net/2024/Sep/2/anato... - for this page: https://simonwillison.net/2024/Sep/2/anatomy-of-a-textual-us...
Their code is open source so you can run your own copy if you like: https://github.com/jina-ai/reader - it's written in TypeScript and uses Puppeteer and https://github.com/mozilla/readability
I've been using Readability (minus the Markdown) bit myself to extract the title and main content from a page - I have a recipe for running it via Playwright using my shot-scraper tool here: https://shot-scraper.datasette.io/en/stable/javascript.html#...
shot-scraper javascript https://simonwillison.net/2024/Sep/2/anatomy-of-a-textual-user-interface/ "
async () => {
  const readability = await import('https://cdn.skypack.dev/@mozilla/readability');
  return (new readability.Readability(document)).parse();
}"

It's adapted from vimium and works like a charm. Distills the HTML down to its important bits and handles a ton of edge cases along the way haha
DOM parsing wasn't enough for Google's SEO algo, either. I'll even see Safari's "reader mode" fail utterly on site after site for some of these reasons. I tend to have to scroll the entire page before running it.
If these readers do not use already rendered HTML to parse the information on the screen, then...
It strips all JS/event handlers, most attributes, and most CSS, and only keeps important text nodes.
I needed this because I was using an LLM to reimplement portions of a page using just Tailwind, so I needed to minimise input tokens.
Here is what we ended up with:
- Extraction: We use codegen to generate CSS selector or XPath extraction code. Calling an LLM for every data extraction would be expensive and slow, but using LLMs to generate the scraper code, and subsequently adapt it to website modifications, is highly efficient.
- Cleansing & transformation: We use small fine-tuned LLMs to clean and map data into the desired format.
- Validation: Unstructured data is a pain to validate. Alongside traditional data validation methods like reverse search, we use LLM-as-a-judge to evaluate data quality.
We quickly realized that doing this for a few data sources with low complexity is one thing, doing it for thousands of websites in a reliable, scalable, and cost-efficient way is a whole different beast.
Combining traditional ETL engineering methods with small, well-evaluated LLM steps was the way to go for us.
Using an LLM for the actual parsing is overkill, and it risks polluting your results with hallucinations.
I also forgot to mention another interesting scraper that's an LLM based service. A quick search here tells me it was mentioned once by simonw, but I think it should be better known just for the convenience! Prepend "r.jina.ai" to any URL to extract text. For ex., check out [2] or [3].
[1] https://aclanthology.org/2021.acl-demo.15.pdf
[2] https://r.jina.ai/news.ycombinator.com/
[3] (this discussion) https://r.jina.ai/news.ycombinator.com/item?id=41428274
We're doing a lot of tests with GPT-4o at NewsCatcher. We have to crawl 100k+ news websites and then parse news content. Our rule-based model for extracting data from any article works pretty well, and we never could find a way to improve it with GPT.
"Crawling" is much more interesting. We need to know all the places where news articles can be published: sometimes 50+ sub-sections.
Interesting hack: I think many projects (including us) can get away with generating the code for extraction since the per-website structure rarely changes.
So we're looking for an LLM to generate code to parse the HTML.
Happy to chat/share our findings if anyone is interested: artem [at] newscatcherapi.com
We've been working on this for quite a while. I'll contact you to show how far we've gotten
I’ve had problems with hallucinations, though, even for something as simple as city names; also, the model often ignores my prompt and returns country names. I'm thinking of trying a two-stage scrape, with one stage checking the output of the other.
It's very early still but check it out at https://FetchFoxAI.com
One of the cool things is that you can scrape non-uniform pages easily. For example I helped someone scrape auto dealer leads from different websites: https://youtu.be/QlWX83uHgHs . This would be a lot harder with a "traditional" scraper.
I got these results just now: https://fetchfoxai.com/s/UOqL5HtuNe
If you want to do the same scrape, here is the prompt I used: https://imgur.com/XhguCk4
- Using GPT-4o mini was the only cheap option; it worked well (although I have a feeling it's being dumbed down these days) and made it virtually free.
- Just extracting the webpage text from HTML, with `BeautifulSoup(html).text` slashes the number of tokens (but can be risky when dealing with complex tables)
- At some point I needed to scrape ~10,000 pages that have the same format and it was much more efficient speed-wise and price-wise to provide ChatGPT with the HTML once and say "write some python code that extracts data", then apply that code to the 10,000 pages. I'm thinking a very smart GPT-based web parser could do that, with dynamically generated scraping methods.
- Finally, because this article mentions tables: Pandas has a very nice function, `read_html("http://the-website.com")`, that will detect and parse all tables on a page. But the article does a good job pointing at websites where the method would fail because the tables don't use `<table/>`.
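The "generate the parser once, reuse it 10,000 times" trick from the third bullet might look something like this; the pattern and pages are made up, with the regex standing in for the LLM's one-time code output:

```python
import re

# Hypothetical output of the one-time "write me parsing code" LLM call:
# every page in this batch shares the same template, so a fixed pattern works.
def extract(html):
    m = re.search(r'<span class="price">([^<]+)</span>', html)
    return {"price": m.group(1) if m else None}

# Stand-ins for the ~10,000 same-format pages.
pages = [f'<html><span class="price">${i}.99</span></html>' for i in range(3)]
rows = [extract(p) for p in pages]
print(rows)
```

Applying a few kilobytes of generated code beats re-sending every page through the model, both on speed and on cost.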
Depending on how you use it, the wikitext may or may not be more ingestible if you're passing it through to an LLM anyway. You may also be able to pare it down a bit by heading/section, so that you can reduce it to only sections that are likely to be relevant (e.g. "Life and career"-type sections).
You can also download full dumps [0] from Wikipedia and query them via SQL to make your life easier if you're processing them.
[0] https://en.wikipedia.org/wiki/Wikipedia:Database_download#Wh...?
True, but I also managed to do this from HTML. I tried getting pages' wikitext through the API but couldn't figure out how.
Just querying the HTML page was less friction and fast enough that I didn't need a dump (although when AI becomes cheap enough, there is probably a lot of things to do from a wikipedia dump!).
One advantage of using online Wikipedia instead of a dump is that I have a pipeline on GitHub Actions where I just enter a composer name and it automagically scrapes the web and adds the composer to the database (it takes exactly one minute from the click of the button!).
> I also tried GPT-4o mini but yielded significantly worse results so I just continued my experiments with GPT-4o.
Would be interesting to compare with the other inexpensive top tier models, Claude 3 Haiku and Gemini 1.5 Flash.
With that said, we've noticed that for some sites with nested lists or tables, we get better results by reducing those elements to simplified HTML instead of markdown, essentially providing context for where the structures start and stop.
It's also been helpful for chunking docs, to ensure that lists / tables aren't broken apart in different chunks.
The fact that I can’t get my own receipt data from online retailers is unacceptable. I built a CLI Puppeteer scraper to scrape sites like Target, Amazon, Walmart, and Kroger for precisely this reason.
Any website that has my data and doesn’t give me access to it is a great target for scraping.
Parsing the rendered HTML is the only way to extract the data you need.
With serverless GPUs, the cost has been basically nothing.
You ship your code as a container within a library they provide that allows them to execute it, and then you're billed per-second for execution time.
Like most FaaS, if your load is steady-state it's more expensive than just spinning up a GPU instance.
If your use-case is more on-demand, with a lot of peaks and troughs, it's dramatically cheaper. Particularly if your trough frequently goes to zero. Think small-scale chatbots and the like.
Runpod, for example, would cost $3.29/hr or ~$2400/mo for a single H100. I can use their serverless offering instead for $0.00155/second. I get the same H100 performance, but it's not sitting around idle (read: costing me money) all the time.
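Quick sanity check on those numbers: at $0.00155/second, an hour of actual execution costs about $5.58, so serverless wins whenever utilization stays below roughly 59%:

```python
# Break-even utilization for the Runpod prices quoted above.
dedicated_per_hr = 3.29                       # H100 instance, $/hr
serverless_per_s = 0.00155                    # $/s of execution
serverless_per_hr = serverless_per_s * 3600   # $5.58/hr at 100% utilization

break_even = dedicated_per_hr / serverless_per_hr
print(f"{break_even:.0%}")  # → 59%
```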
This includes benchmarks around cold starts, performance consistency, scalability, and cost-effectiveness for models like Llama 2 7B and Stable Diffusion across different providers: https://www.inferless.com/learn/the-state-of-serverless-gpus... It can save months of your time. Do give it a read.
P.S: I am from Inferless.
Another thing I wonder about, regarding text extraction: would it be a crazy idea to just snapshot the page and ask the model to OCR it and generate a bare-minimum HTML table layout? That way both the content and the spatial relationships of elements are maintained (not sure how useful that is, but I'd like to keep it anyway).
Would recommend web scraping as a "growth hack" in that way, we got a lot of partnerships that we wouldn't otherwise have got.
I have the same opinion about a man and his animals crossing a river on a boat. Instead of spending tokens on trying to solve a word problem, have it create a constraint solver and then run that. Same thing.
Plus, you can probably use that until it fails (the website changes) and then just re-scrape it with an LLM request.
What are some good frameworks for web scraping and PDF document processing? Some sources are public and some are behind logins; some require multiple clicks before the sites display the relevant data.
We need to ingest a wide variety of data sources for one solution. Very few of those sources supply data as API / json.
"I'm starting to think computers are a solution in the need of a problem. Have we not already solved doing math?"