Pdf.tocgen

(krasjet.com)

175 points · nbernard · 2y ago · 39 comments

perihelions · 2y ago

> "That is, you shouldn’t expect it to work with scanned PDFs"

It's surprisingly easy to extend this type of workflow to scanned pdfs (as opposed to software-generated, text-containing ones). tesseract(1) makes short work of ToC pages with --psm set to 6 (an OCR setting that tends to collapse convoluted text layouts into a regular, software-parseable output).
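As a rough sketch of the invocation described above: the snippet below calls the `tesseract` command-line tool with `--psm 6` and returns the recognized text. The helper names `tesseract_command` and `ocr_toc_page` are made up for illustration, and the code assumes `tesseract` is installed and on `PATH`.

```python
import shutil
import subprocess


def tesseract_command(image_path: str, psm: int = 6) -> list[str]:
    """Build a tesseract invocation that prints recognized text to stdout."""
    # Passing "stdout" as the output base makes tesseract write to standard output.
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]


def ocr_toc_page(image_path: str) -> str:
    """Run tesseract on one ToC page image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    result = subprocess.run(
        tesseract_command(image_path),
        capture_output=True,
        text=True,
        check=True,  # raise if tesseract exits with an error
    )
    return result.stdout
```

Calling `ocr_toc_page("toc-page.png")` (a hypothetical file name) would return the page text as one string, ready for line-by-line parsing.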

It should also be straightforward to automate that example of extracting "text that looks like a header" (based on page layout, relative positioning, or font weight), though I don't know of an out-of-the-box solution. (I'm working on an adjacent problem: an automatic re-layout of raster documents to squeeze out whitespace and make them slightly nicer on small e-ink devices. Text islands are trivial to identify. I don't know how to quantify font weight, or things like that. I'm "wasting" a lot of time diving into lots of mathematics rabbit holes, but I don't know in advance which ones will be productive or not.)

felipefar · 2y ago

tesseract is fine for basic use cases, but it fails when the image is tilted (and thus the text isn't laid out horizontally), which can happen several times with scanned books. Compared to how well the Google OCR engine works, tesseract should be much better than it is.

I wonder how difficult it is to develop a better OCR engine than tesseract.

aidenn0 · 2y ago

You are supposed to deskew (and de-warp if the image isn't flat) images before running through tesseract. There are other tools for doing that.

HeatrayEnjoyer · 2y ago

Tesseract is last gen. Multimodal is SOTA, and can handle even heavily distorted or destroyed text.

perihelions · 2y ago

Am I overlooking something, or is automating page rotation no more work than just a 2d FFT?

bsharper · 2y ago

I've found EasyOCR to work much better at pulling text out of irregular or unknown images. Requires more resources than tesseract but gets much better results in my projects.

aidenn0 · 2y ago

It seems to be not significantly better than tesseract for non-mixed images though, and it takes about 5 orders of magnitude longer to process a page on my machine; I can literally read a book 100 times faster than EasyOCR can process a book on my Ryzen 7 2700.

chazeon · 2y ago

I have thought about using tesseract to OCR the TOC and generate something like this. But there are just so many edge cases that make the whole process fail. For example, how do you handle a title that breaks across two lines? What if the page number is not recognized correctly (10 can come out as 1o)? What if there are dots? Maybe you can use GPT to clean the extracted text.

In the end, I found ChatGPT-4's multimodal capability can recognize text + page number pairs well if I feed screenshots of TOC into it, and I have settled on that.
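For comparison, the three edge cases listed above (wrapped titles, misread digits, dot leaders) can be partly handled with plain heuristics. This is only a sketch under stated assumptions: the function name `parse_toc`, the digit-confusion map, and the rule "a line with no trailing page number continues onto the next line" are all illustrative choices, and it will miss cases such as Roman-numeral pages or a page number misread with no digit left in it.

```python
import re

# Letters that OCR commonly confuses with digits (illustrative, not exhaustive).
DIGIT_FIXES = str.maketrans({"o": "0", "O": "0", "l": "1", "I": "1", "S": "5"})

# Title, then spaces and/or dot leaders, then a short page-like token at line end.
ENTRY = re.compile(r"^(?P<title>.+?)[\s.]+(?P<page>[0-9oOlIS]{1,4})$")


def parse_toc(text: str) -> list[tuple[str, int]]:
    """Parse OCR'd ToC text into (title, page) pairs."""
    entries = []
    pending = ""  # title fragment carried over from a line with no page number
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = ENTRY.match(line)
        # Require at least one real digit so words like "Is" are not read as pages.
        if m and any(c.isdigit() for c in m["page"]):
            title = f"{pending} {m['title']}".strip()
            entries.append((title, int(m["page"].translate(DIGIT_FIXES))))
            pending = ""
        else:
            # No page number found: treat the line as the start of a wrapped title.
            pending = f"{pending} {line}".strip()
    return entries
```

Anything this heuristic gets wrong still needs manual review, which is the trade-off against the multimodal approach described in the comment.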

janpmz · 2y ago

Recently I found the getToc function in PyMuPDF was too slow. I told them about it in their Discord, and a day later they had fixed it. Now it only takes a couple of milliseconds. I'm using it for my project pdftomp3. Pdf.tocgen looks useful too, but I'm not sure if I can use it because of the license?

karma_pharmer · 2y ago

Of course you can use it.

What you can't do is deny others the same freedoms the license grants to you.

cge · 2y ago

There does appear to be some licensing awkwardness here. The license is nominally GPLv3, but it says it is based on AGPLv3 projects. It also appears to misidentify (it may have been correct at the time) PyMuPDF as GPLv3 when that appears to actually be AGPLv3. My assumption is that using this would require complying with AGPLv3?

There's the additional oddity that a portion of the repository (the recipes directory) is licensed under CC-BY-NC-SA, and so the repository is not fully open source. This is particularly confusing, however, as the functional content of the recipes directory appears to be mostly records of direct observations of parameter choices in external documents and tools, and so doesn't seem like it would be copyrightable at all, at least in the US.

zerop · 2y ago

Interested to know: what is pdftomp3?

janpmz · 2y ago

You can upload a PDF and convert the chapters into MP3s (either original text or simplified text). But for PDFs without a table of contents, you can only convert single pages.
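To illustrate how a table of contents can drive a per-chapter split like this, here is a minimal sketch. It is not pdftomp3's actual code. It assumes the `[level, title, page]` entries with 1-based pages that PyMuPDF's `Document.get_toc()` returns; the helper names `chapter_ranges` and `load_chapters` are hypothetical.

```python
def chapter_ranges(toc, page_count):
    """Turn [level, title, page] ToC entries into (title, first, last) ranges.

    Pages are 1-based. Only level-1 entries are treated as chapters.
    """
    top = [(title, page) for level, title, page, *_ in toc if level == 1]
    ranges = []
    for i, (title, first) in enumerate(top):
        # A chapter ends on the page before the next chapter starts.
        last = top[i + 1][1] - 1 if i + 1 < len(top) else page_count
        # Guard against two chapters that start on the same page.
        ranges.append((title, first, max(first, last)))
    return ranges


def load_chapters(pdf_path):
    """Read the ToC of a PDF with PyMuPDF and return its chapter page ranges."""
    import fitz  # PyMuPDF; imported lazily so chapter_ranges works without it

    with fitz.open(pdf_path) as doc:
        return chapter_ranges(doc.get_toc(), doc.page_count)
```

An empty ToC yields an empty list, which matches the limitation mentioned above: without a table of contents there are no chapter boundaries to split on.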

mrtx01 · 2y ago

What a beautiful website!

porker · 2y ago

It took a bit of digging from the Pdf.tocgen page, but https://krasjet.com/colophon/ tells us how it's created.

lelandfe · 2y ago

Uncommon to see someone so caring about the specifics of their chosen font. Love it.

GrumpyNl · 2y ago

And built with very little CSS and basic HTML.

oneeyedpigeon · 2y ago

> basic HTML

Apart from the code blocks. Syntax-highlighting in `<code>` elements, when, browser manufacturers?

mbana · 2y ago

I love the typography on the site. What fonts are you using? I'm on a mobile browser so I can't really see.

porker · 2y ago

According to https://krasjet.com/colophon/:

> The typeface you are reading right now is Garibaldi by Henrique Beier, with some custom tweaks, as you might have noticed. I hope you enjoy it as much as I do. If you want some free alternatives, check out Alegreya ht and Vollkorn, though I still prefer the look and details of Garibaldi (just look at all the punctuation marks!).

karma_pharmer · 2y ago

I was going to post the same thing. This has to be the most beautifully typeset webpage I've seen in quite a while. Not just the font but the layout too.

It's almost like this page is part of the web from some parallel universe, which has been disenshittified to the same extent that our own web has been... well, you know.

StayTrue · 2y ago

Garibaldi, $300 for up to 10k page views per month.

bionade24 · 2y ago

Does someone know a tool that is sed- or awk-like for PDFs?

maxerickson · 2y ago

Qpdf has tools that go in that direction (but not a flat text format that allows arbitrary edits).

https://qpdf.readthedocs.io/en/stable/qdf.html#qdf

manaskarekar · 2y ago

Perhaps you can use lesspipe with sed/awk?

https://github.com/wofr06/lesspipe

perihelions · 2y ago

pdftk is a CLI tool that can extract and edit PDF metadata such as tables of contents*, if that's what you mean?

*(Table of contents? Tables of content?)
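The bookmark metadata pdftk works with is plain text, so it can be generated by a script. The sketch below renders ToC entries in the `BookmarkBegin` / `BookmarkTitle` / `BookmarkLevel` / `BookmarkPageNumber` layout that `pdftk in.pdf dump_data` prints; the function name `to_pdftk_bookmarks` is made up for illustration, and titles are assumed to be plain ASCII.

```python
def to_pdftk_bookmarks(entries):
    """Render (level, title, page) tuples in pdftk's bookmark text layout.

    Pages are 1-based, as in pdftk's own dump_data output.
    """
    lines = []
    for level, title, page in entries:
        lines += [
            "BookmarkBegin",
            f"BookmarkTitle: {title}",
            f"BookmarkLevel: {level}",
            f"BookmarkPageNumber: {page}",
        ]
    return "\n".join(lines) + "\n"
```

Saved to a file such as `bookmarks.txt` (a hypothetical name), the output can then be applied with `pdftk in.pdf update_info bookmarks.txt output out.pdf`. For non-ASCII titles, the `dump_data_utf8` and `update_info_utf8` variants are probably the safer choice.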

chazeon · 2y ago

I have been thinking about this, but for a while now, I have settled on using ChatGPT's GPT-4v's multimodal capability to generate a text file containing the titles and pages based on screenshots of the TOC. After that, I used a pikepdf-based Python script to bake the TOC into the PDF I had.

The upside, compared to Krasjet's approach, is that this works not only for text-based PDFs but also for scanned PDFs, even old scanned journal papers.

The downside is that, before baking the TOCs, you need to make adjustments to the PDF as sometimes the empty pages are not included. You also need to calculate the offset for the prologs, cover, etc. I have a script for this kind of adjustment, but there always is manual intervention involved.
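A minimal sketch of the offset step and the baking step described above, assuming pikepdf is installed. It is not the commenter's script: the names `to_pdf_index` and `bake_toc` are hypothetical, the outline is kept flat (no nesting), and `offset` is assumed to be the number of PDF pages that come before printed page 1.

```python
def to_pdf_index(printed_page, offset):
    """Map a printed page number to a zero-based PDF page index.

    `offset` counts the PDF pages before printed page 1 (cover, prologue, ...).
    """
    return printed_page + offset - 1


def bake_toc(src, dst, entries, offset):
    """Write (title, printed_page) entries into the PDF outline with pikepdf."""
    from pikepdf import OutlineItem, Pdf  # imported lazily; needs pikepdf

    with Pdf.open(src) as pdf:
        # The outline is written back to the PDF when this block exits.
        with pdf.open_outline() as outline:
            for title, printed_page in entries:
                index = to_pdf_index(printed_page, offset)
                outline.root.append(OutlineItem(title, index))
        pdf.save(dst)
```

With 12 pages of front matter, `to_pdf_index(1, 12)` gives index 12, the thirteenth page of the file. Missing blank pages in a scan shift the offset partway through the book, which is where the manual intervention comes in.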

papichulo2023 · 2y ago

Looks like a very good tool to integrate with knowledge graphs or just RAG (LLM).

rajaravivarma_r · 2y ago

Is it possible to extract different patterns of text from a PDF document?

For example, paragraphs, code blocks, code inlined in paragraphs etc?

I tried tesseract but it recognises code blocks as tables.

There are also edge cases: paragraphs that start with an indentation and those that don't are hard to tell apart.

Appreciate any help.

jbecke · 2y ago

We (macro.com) have something similar but without the recipe part in our pdf/word processor. It works pretty well on numbered headings but not so well on non-numbered. We’re thinking of porting over to LLMs at some point.

pseingatl · 2y ago

Since when do you need the hyperref package to generate a table of contents under LaTeX (as the author claims)?

\tableofcontents does the job.

maCDzP · 2y ago

That is a beautiful website. I got lost in it and it created a sense of wonder. Nice.

zerop · 2y ago

Can I use this tool to get a ToC for arXiv papers?
