how do you combat silent failures?
for example, I am scraping website A, getting 500+ PDF files; then they change their layout, the ETL breaks, we auto-regenerate it with Claude, but then we get only 450 PDFs. The orchestrator still marks the run as successful, but we only get part of the data.
Or: the ETL for website B breaks. We repair it with our agentic solution, it completes without errors, but we start missing a few fields that were moved to another sub-page.
Did you encounter any such issues?
Quick clarification: the AI agent writes the config once and is out of the loop after that. You run crawls yourself or via cron. So the "auto-regenerate and silently get wrong data" scenario doesn't quite apply since there's no agent in the runtime loop.
But configs going stale is a real problem. Two things help:
1. The agent tests on 5 real pages before saving any config. Empty fields = rewrite before it hits production.
2. `./scrapai health --project <n>` tests all your spiders and flags extraction failures. We run it monthly via cron. Broken spider? Point the agent at it, it re-analyzes and fixes.
The gap: result count drops (your 500 to 450 example). Health checks catch broken extraction, not "fewer pages matched." We list structural change detection as an open contribution area in the README.
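One way to close that gap yourself today is a result-count guard between runs. This is a minimal sketch, not part of ScrapAI: `check_result_count` is a hypothetical helper, and the 5% threshold is an arbitrary illustration you'd tune per site.

```python
# Hypothetical guard against silent result-count drops: compare each run's
# item count against the median of recent runs and flag large regressions.
# The function name and the default threshold are illustrative only.

def check_result_count(current: int,
                       previous_counts: list[int],
                       max_drop: float = 0.05) -> bool:
    """Return True if the run looks healthy, False if the count fell
    more than `max_drop` below the median of recent runs."""
    if not previous_counts:
        return True  # no baseline yet; accept the first run
    baseline = sorted(previous_counts)[len(previous_counts) // 2]  # median
    if baseline == 0:
        return True
    return current >= baseline * (1 - max_drop)
```

A failing check doesn't tell you *why* the count dropped, only that it did, which is enough to page a human or re-run the agent instead of silently shipping partial data.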
Hi HN, I built this. It's been in production across 500+ websites.
We're a research group that studies online communications. We needed to scrape hundreds of sites regularly — news,
blogs, forums, policy orgs — and maintain all those scrapers. At 10 sites, individual scrapers were fine. At 200+
we were spending more time fixing broken scrapers than doing actual work. Every redesign broke something, every new
site meant another scraper from scratch.
ScrapAI flips the cost model. You tell an AI agent "add bbc.co.uk to my news project." It analyzes the site, writes
URL patterns and extraction rules, tests on 5 pages, and saves a JSON config to a database. After that it's just
Scrapy — no AI in the loop, no per-page inference calls. ~$1-3 in tokens per website with Sonnet 4.5, not per page.
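For a sense of what the agent produces, a config might look roughly like this. The field names and selector syntax here are illustrative; the real schema may differ.

```json
{
  "project": "news",
  "site": "bbc.co.uk",
  "url_patterns": ["https://www.bbc.co.uk/news/*"],
  "fields": {
    "title": "h1::text",
    "body": "article p::text",
    "published": "time::attr(datetime)"
  }
}
```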
Cloudflare was the hardest part. Most tools keep a browser open for every request (~5-10s per page). We use
CloakBrowser (open source, C++ stealth patches, 0.9 reCAPTCHA v3 score) to solve the challenge once, cache the
cookies, kill the browser, and hit the site with normal HTTP. Re-solves every ~10 minutes. 1,000 pages in ~8
minutes vs 2+ hours.
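The caching pattern itself is simple. A rough sketch, with `solver` standing in for the CloakBrowser challenge step (the class and names here are hypothetical, not ScrapAI's API):

```python
import time

# Solve-once, cache-cookies pattern: pay the browser cost rarely,
# serve cached cookies to plain-HTTP requests in between.

class CookieCache:
    def __init__(self, solver, ttl: float = 600.0):  # re-solve every ~10 min
        self.solver = solver          # callable returning a cookie dict
        self.ttl = ttl
        self._cookies = None
        self._solved_at = 0.0

    def cookies(self) -> dict:
        now = time.monotonic()
        if self._cookies is None or now - self._solved_at > self.ttl:
            self._cookies = self.solver()  # expensive: launches the browser
            self._solved_at = now
        return self._cookies               # cheap: reused by plain HTTP requests
```

Every request between solves is then just a normal HTTP call carrying the cached cookies, which is where the ~8 minutes vs 2+ hours difference comes from.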
The agent writes JSON configs, not Python. An agent that writes and runs code can do anything an unsupervised
developer can — one prompt injection from a malicious page and you have a real problem. JSON goes through Pydantic
validation before it touches the database. Worst case is a bad config that extracts wrong fields. This also makes
it safe to use as a tool for Claws — structured web data without arbitrary code execution.
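The validation step might look something like this with Pydantic. The model and field names are invented for illustration; the point is that anything that isn't a well-typed selector mapping is rejected before it's stored.

```python
from pydantic import BaseModel, ValidationError

# Illustrative config schema: a config is data about *what* to extract,
# never code about *how*. Field names here are hypothetical.

class ExtractionConfig(BaseModel):
    site: str
    url_patterns: list[str]
    fields: dict[str, str]  # field name -> CSS selector

raw = {
    "site": "example.com",
    "url_patterns": ["https://example.com/*"],
    "fields": {"title": "h1::text"},
}
config = ExtractionConfig.model_validate(raw)

# A payload smuggling code where a selector dict belongs fails validation:
try:
    ExtractionConfig.model_validate(
        {"site": "x", "url_patterns": [], "fields": "import os"}
    )
except ValidationError:
    pass  # rejected before it touches the database
```

Because the schema only admits strings and string mappings, a prompt-injected page can at worst corrupt selectors, not execute anything.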
~4,000 lines of Python. Scrapy, SQLAlchemy, Alembic. Apache 2.0. We recommend Claude Code with Sonnet 4.5 but it
works with any agent that can read instructions and run shell commands. We tried GLM 4.7 and it performed
comparably, just slower.