It's surprisingly easy to extend this type of workflow to scanned PDFs (as opposed to software-generated, text-containing ones). tesseract(1) makes short work of ToC pages with --psm set to 6 (a page segmentation mode that treats the page as a single uniform block of text, which tends to collapse convoluted layouts into regular, software-parseable output).
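To make the "software-parseable" part concrete, here is a minimal sketch of turning the OCR text of a ToC page (e.g. what `tesseract toc.png stdout --psm 6` prints) into title/page pairs. The `parse_toc` name, the regex, and the assumption that every entry ends in an Arabic page number are mine, not part of any tool mentioned here; real OCR output will need more cleanup than this.

```python
import re

# Matches lines such as "1.2 Getting started . . . . . 15" or "Index 203".
# Assumption: the page number is the last token on the line and uses Arabic
# digits; dot leaders, underscores and dashes before it are discarded.
TOC_LINE = re.compile(r"^(?P<title>.+?)[\s.·…_-]*\s(?P<page>\d+)\s*$")


def parse_toc(ocr_text):
    """Return (title, printed_page) pairs from OCR'd table-of-contents text."""
    entries = []
    for line in ocr_text.splitlines():
        match = TOC_LINE.match(line.strip())
        if match:  # lines without a trailing number (e.g. "Contents") are skipped
            entries.append((match.group("title").strip(), int(match.group("page"))))
    return entries
```

Note that a heading line that happens to end in a number (say, "Chapter 3" on its own line) would be misread as an entry, so the result still wants a quick look by eye.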
It should also be straightforward to automate that example of extracting "text that looks like a header", based on page layout/relative positioning or font weight, but I don't know of an out-of-the-box solution. (I'm working on an adjacent problem: an automatic re-layout of raster documents to squeeze out whitespace and make them slightly nicer on small e-ink devices. Text islands are trivial to identify, but I don't know how to quantify font weight or things like that. I'm "wasting" a lot of time diving into mathematics rabbit holes, because I don't know in advance which ones will be productive.)
I wonder how difficult it is to develop a better OCR engine than tesseract.
In the end, I found that ChatGPT-4's multimodal capability recognizes text + page number pairs well if I feed it screenshots of the TOC, and I have settled on that.
What you can't do is deny others the same freedoms the license grants to you.
There's the additional oddity that a portion of the repository (the recipes directory) is licensed under CC-BY-NC-SA, and so the repository is not fully open source. This is particularly confusing, however, as the functional content of the recipes directory appears to be mostly records of direct observations of parameter choices in external documents and tools, and so doesn't seem like it would be copyrightable at all, at least in the US.
Apart from the code blocks. Syntax-highlighting in `<code>` elements, when, browser manufacturers?
> The typeface you are reading right now is Garibaldi by Henrique Beier, with some custom tweaks, as you might have noticed. I hope you enjoy it as much as I do. If you want some free alternatives, check out Alegreya ht and Vollkorn, though I still prefer the look and details of Garibaldi (just look at all the punctuation marks!).
It's almost like this page is part of the web from some parallel universe, which has been disenshittified to the same extent that our own web has been... well, you know.
*(Table of contents? Tables of content?)
The upside, compared to Krasjet's approach, is that this works not only for text-based PDFs but also for scanned PDFs, even old scanned journal papers.
The downside is that, before baking in the TOCs, you need to make adjustments to the PDF, because empty pages are sometimes not included. You also need to calculate the offset for the prologue, cover, etc. I have a script for this kind of adjustment, but there is always manual intervention involved.
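The offset part can be sketched roughly like this. It is not the script mentioned above, just an illustration with made-up names (`offset_from_anchor`, `apply_offset`), and it assumes a single constant offset, which only holds when no pages are missing from the scan.

```python
def offset_from_anchor(pdf_page, printed_page):
    """Offset such that printed_page + offset == pdf_page (1-based PDF pages)."""
    return pdf_page - printed_page


def apply_offset(entries, offset, page_count=None):
    """Shift (title, printed_page) pairs to (title, pdf_page) pairs.

    Raises ValueError when a shifted page falls outside the document, which
    usually means the offset is wrong or pages are missing from the scan.
    """
    shifted = []
    for title, printed_page in entries:
        pdf_page = printed_page + offset
        if pdf_page < 1 or (page_count is not None and pdf_page > page_count):
            raise ValueError(f"{title!r} maps to PDF page {pdf_page}, out of range")
        shifted.append((title, pdf_page))
    return shifted
```

For instance, if the page printed as "1" is the 13th page of the PDF, the offset is 12 and every ToC entry shifts by that amount. Skipped blank pages break this, which is where the manual intervention comes in.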
For example, paragraphs, code blocks, code inlined in paragraphs, etc.?
I tried tesseract but it recognises code blocks as tables.
There are also edge cases: for example, paragraphs that start with an indentation and paragraphs that don't are hard to tell apart.
Appreciate any help.
\tableofcontents does the job.