It also shows how to set up evals in different parts of the book. (Depending on what you want to do, the structured outputs chapter has evals comparing models/prompt changes to ground truth, and the agent chapter has evals using LLM as a judge.)
Most of my day gig is structured extraction and agents, at which the foundation LLMs are much better than any of the small models. (And I would not be able to provision the compute necessary for large models given our throughput.)
I do have evaluating Textract vs the smaller OCR models on the ToDo list though (in the book I show using docling; there are others, like the newer GLM-OCR). Our spend for that on AWS is large enough, and the OCR models are small enough, for me to be able to spin up resources sufficient to meet our demand.
Part of the reason the book goes through examples with AWS/Google (in addition to OpenAI/Anthropic) is that I suspect many individuals will be stuck with the cloud provider that their org uses out of the box. So I wanted to have as wide coverage as possible for those folks.
I do have a follow-up post planned on some reliability issues with the APIs that I uncovered while compiling the book so many times -- I would not use Google Maps grounding in production!
Or is this actually a law enforcement related example?
Right there it says, in a big page-width box:
CRIME De-Coder
Customized Consulting services, focused on crime analysis for police agencies.
Contact Me
Does this guide cover systematic eval at all?
For Chapter 5 on RAG, it goes through precision/recall (with emphasis typically on recall for RAG systems).
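To make the precision/recall point concrete, here is a minimal sketch of scoring one retrieval query. It is not taken from the book: the function name `precision_recall_at_k` and the chunk ids are made up for illustration, and it assumes you have a hand-labeled set of relevant chunks per query.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision and recall over the top-k retrieved chunk ids.

    retrieved: ranked list of chunk ids returned by the retriever
    relevant:  ids labeled as relevant for this query (ground truth)
    """
    top_k = retrieved[:k]
    relevant = set(relevant)
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    # Precision: share of what we returned that was relevant.
    precision = hits / len(top_k) if top_k else 0.0
    # Recall: share of the relevant chunks that we actually returned.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 5 chunks retrieved, 3 labeled relevant, 2 found.
retrieved = ["c7", "c2", "c9", "c4", "c1"]
relevant = ["c2", "c4", "c8"]
precision, recall = precision_recall_at_k(retrieved, relevant, k=5)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Recall tends to matter more here because a relevant chunk that never makes it into the context cannot be used by the model, while an extra irrelevant chunk mostly just costs tokens.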
For Chapter 6, I show a demo of LLM as a judge (using structured outputs to have specific errors it looks for) to evaluate a more fuzzy objective (writing a report based on table output).
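A rough sketch of the shape this can take, using only the standard library. Everything here is an assumption for illustration, not the book's code: the error categories in `JudgeVerdict` are invented, and `fake_judge` is a stand-in that returns a canned JSON reply where a real setup would call a model with a structured output schema.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class JudgeVerdict:
    # Hypothetical error categories the judge is asked to check for.
    wrong_number: bool          # report cites a figure not in the table
    unsupported_claim: bool     # report asserts something the table cannot show
    missing_key_finding: bool   # report omits the main result
    explanation: str

    @property
    def passed(self):
        return not (self.wrong_number or self.unsupported_claim
                    or self.missing_key_finding)


def parse_verdict(raw_json):
    """Validate the judge's JSON reply against the expected schema."""
    data = json.loads(raw_json)
    expected = {f.name for f in fields(JudgeVerdict)}
    if set(data) != expected:
        raise ValueError(f"unexpected keys: {sorted(set(data) ^ expected)}")
    for name in expected - {"explanation"}:
        if not isinstance(data[name], bool):
            raise ValueError(f"{name} must be a boolean")
    return JudgeVerdict(**data)


def fake_judge(report, table):
    # Stand-in for a real model call; returns a canned structured reply.
    return json.dumps({
        "wrong_number": True,
        "unsupported_claim": False,
        "missing_key_finding": False,
        "explanation": "Report says 120 incidents; table shows 102.",
    })


verdict = parse_verdict(fake_judge("...report text...", "...table text..."))
print(verdict.passed, verdict.explanation)
```

Asking for specific boolean flags rather than a single 1-10 score makes it possible to count each error type across many reports and compare prompts or models on them.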