Shameless plug: I am one of the maintainers of PolyFile, which, among other things, can produce an interactive HTML hex editor with an annotated syntax tree for dozens of filetypes, including PDF. For PDF, it uses a dynamically instrumented version of the PDFminer parser. It sounds like it might satisfy your use case.
Here's a CCC talk on it: https://media.ccc.de/v/MRMCD2014_-_6008_-_en_-_grossbaustell...
And the slides from the talk: https://www.slideshare.net/ange4771/schizophrenic-files-v2
Bravo! Best wordplay I've read today.
I wish you good luck; this file format has tripped up many, many developers. It blew up on a PDF I had lying around:
ValueError: could not convert string to float: b'5.0.0'
104 0 obj <<
/Producer (pdfTeX-1.40.10)
/Creator (TeX)
/CreationDate (D:20131209161146-08'00')
/ModDate (D:20131209161146-08'00')
/Trapped /False
/PTEX.Fullbanner (This is pdfTeX, Version 3.1415926-1.40.10-2.2 (TeX Live 2009/Debian) kpathsea version 5.0.0)
>> endobj
as it seems a string with nested parens jams up the parser.

There is no distinction between values and references in Python; everything is a reference. In fact, even primitives like numbers are full-blown struct objects in CPython: you cannot just manipulate the raw numbers.
The difference is rather whether you can modify an object or not. You cannot modify numbers, as they are immutable: any increment produces a new object. But you can modify a list. This gives the feeling that numbers are passed by value and lists are passed by reference.
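As for the nested-parens failure a few comments up: the PDF spec allows balanced parentheses inside literal strings without escaping, so a string scanner has to track nesting depth rather than stop at the first `)`. A minimal sketch (illustrative only, not a full tokenizer):

```python
# Minimal sketch of scanning a PDF literal string, honoring balanced
# nested parentheses and backslash escapes. Illustrative only.
def scan_literal_string(data: bytes, start: int) -> bytes:
    """Return the contents of the literal string beginning at data[start] == b'('."""
    assert data[start:start + 1] == b"("
    depth = 0
    out = bytearray()
    i = start
    while i < len(data):
        c = data[i:i + 1]
        if c == b"\\":              # escape sequence: keep both bytes verbatim
            out += data[i:i + 2]
            i += 2
            continue
        if c == b"(":
            depth += 1
            if depth > 1:           # nested open paren is part of the content
                out += c
        elif c == b")":
            depth -= 1
            if depth == 0:          # matching close of the outermost paren
                return bytes(out)
            out += c
        else:
            out += c
        i += 1
    raise ValueError("unterminated literal string")

banner = b"(This is pdfTeX (TeX Live 2009/Debian) kpathsea version 5.0.0)"
print(scan_literal_string(banner, 0))
# -> b'This is pdfTeX (TeX Live 2009/Debian) kpathsea version 5.0.0'
```

A parser that treats the first `)` as the end of the string would stop after `(TeX Live 2009/Debian)` and then try to interpret `kpathsea version 5.0.0` as tokens, which is consistent with the `ValueError` above.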
In particular, there are so many PDF libraries/tools that simply hide all the structure and try to provide an easy interface to the user, but they are always limited in various ways. Something like your project that focuses on parsing and browsing is really needed IMO.
Commiting a change (from Jun 5 20:57) that I don't understand any more
// From real life, lightly modified. Note the "/companyName, LLC" as key!
With absolutely no slight toward the author, that matches my mental model of dealing with PDFs: `git commit -mwtf`

Overall, at least so far, I haven't encountered much "WTF" dealing with PDFs, actually. The spec (especially the Adobe version; the ISO version based on it is only slightly different but feels much worse) is quite pleasant to read. There are some warts from backward compatibility with earlier poor decisions, but not too many of them. And while it's surprising what different PDF programs will produce as long as some PDF reader in existence happens to accept it (Hyrum's law; e.g. in this example, the dictionary key having a space in it), for my purposes it hasn't been a big deal, as I'm only trying to do the first level of parsing, and when even that is problematic I can happily just declare the PDF malformed.
PyPDF [1] is great for reading and writing PDF files, especially dealing with pages, but it’s not great for generating paths, shapes, graphics, etc.
However, reportlab [2] has a great API for generating those things, but is lacking in the file IO and page management department. But the content streams it generates can be plugged into PyPDF pretty easily.
Finally, there’s pdfplumber which does an amazing job of parsing tabular data from PDF structures, and pytesseract which can perform OCR on PDFs that are actually just image data rather than structured data.
There’s not really a one-stop-shop for PDFs, but some pretty good tools that can be combined to get the job done.
Will be curious to see how this project develops!
If you can, grab yourself a copy of the most recent PDF 2.0 specification since it contains much more information and is much more correct in terms of how to implement things. Also have a look at the errata at https://pdf-issues.pdfa.org/32000-2-2020/index.html.
As I'm implementing a PDF library (in Ruby), I have started to collect some situations that arise in the wild but are not spec-compliant; see https://github.com/gettalong/annotated-pdf-spec. That might help you parse some invalid PDFs.
That also more closely matches the mental model of those items: they are bugs against the specification, whether or not the official PDF Association agrees.
Maybe this will be a good solution.
If what you need is very simple (e.g. no word wrapping, same number of variable strings in the same positions), even manipulating the code of a template PDF directly is not too hard. This library would help with that.
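A sketch of what "manipulating a template PDF directly" can look like in the simplest case. This only works if the template's content stream is uncompressed (no /FlateDecode), and the placeholder name here is hypothetical; replacing with a string padded to the same length keeps all byte offsets and /Length values valid, so the xref table does not need rebuilding:

```python
# Illustrative sketch: fill a fixed-position placeholder in an
# uncompressed template PDF. Same-length replacement preserves all
# byte offsets, so the existing xref table stays correct.
def fill_placeholder(pdf: bytes, placeholder: bytes, value: bytes) -> bytes:
    if len(value) > len(placeholder):
        raise ValueError("value longer than placeholder")
    padded = value.ljust(len(placeholder))  # pad with spaces to keep length
    return pdf.replace(placeholder, padded)

# A fragment of a hypothetical template's content stream:
template = b"... BT /F1 12 Tf 72 700 Td (NAME________) Tj ET ..."
print(fill_placeholder(template, b"NAME________", b"Alice"))
```

For compressed streams you would need a real library to decode, edit, and re-encode the stream, at which point the /Length and xref entries must be recomputed.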
See pandoc: https://pandoc.org/
And it supports a variety of intermediate or input text formats, so you can pick your preferred poison, whether for book publishing, research papers, math papers, technical documentation, slides, etc.
Including the author's own djot: https://github.com/jgm/djot
EDIT:
Sibling reply suggests latex. OK, but then you're also learning latex.
More open source PDF code is good. If you can find a copy of the iText RUPS application somewhere on the internet, it's a useful tool for viewing the syntax/structure.
You mean this, right? https://github.com/itext/i7j-rups#readme
Presumably the key word here is "proper" because LibreOffice, etc., read docx and write pdf. For example, `libreoffice --headless --convert-to pdf myfile.docx`.
Converting to PDF is actually quite easy. Before Office 2010, you had to print to PostScript and then convert to PDF using Ghostscript. Nowadays Word gives you the option of saving to PDF.
And related: the best tools to generate PDFs from HTML.
There's a rather comprehensive list at: https://www.print-css.rocks/tools
As far as FOSS tools go, I've only found paged.js (a polyfill) in combination with a browser print-to-PDF (e.g. wkhtmltopdf (WebKit) or Puppeteer (Chrome)) to have any semblance of CSS support.
There's also Ghostscript, but AFAIK it doesn't support much/any CSS3 for print.
Makes me miss freshmeat.net which would have been my answer a few years ago (freshcode.club just isn't the same, although bless them for trying)
She was, to put it mildly, immediately suspicious of my browsing habits.
Also: the only software I know of written in Mercury.
I would be willing to help make this happen, but I do not know much about the PDF format.
One trick for getting started: PDFs are read from the bottom. The first thing that is read is actually an offset pointing back to the xref table, at the end of the file. Then, the xref table itself points to the latest version of all of the objects.
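The bottom-up read described above can be sketched with nothing but the standard library (illustrative only; a real reader then has to parse the xref table itself). The `startxref` keyword near the end of the file is followed by the decimal byte offset of the xref table:

```python
# Minimal sketch: locate the xref-table offset by reading the PDF tail.
# A PDF ends with:  startxref\n<byte offset>\n%%EOF
def find_startxref(data: bytes) -> int:
    tail = data[-1024:]                 # common practice: search the last 1024 bytes
    idx = tail.rfind(b"startxref")
    if idx < 0:
        raise ValueError("startxref not found")
    after = tail[idx + len(b"startxref"):]
    return int(after.split()[0])        # the decimal byte offset that follows

sample = b"%PDF-1.4\n...objects...\nxref\n...\nstartxref\n123\n%%EOF\n"
print(find_startxref(sample))  # -> 123
```

In a real file that offset points at the `xref` keyword (or, for newer files, at a cross-reference stream object), which in turn gives the offset of every object.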
The part you're most likely interested in is the content streams, which contain postscript-like drawing commands. To get a feel for it, following the official spec when reading a simple-looking document can help.
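For a taste of those operators, here is a minimal text-drawing content stream (the font name /F1 assumes a corresponding entry in the page's /Resources dictionary):

```
BT                 % begin a text object
/F1 24 Tf          % select font F1 at 24 points
72 720 Td          % move the text position to (72, 720)
(Hello, PDF) Tj    % show the string
ET                 % end the text object
```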
edit: I didn't link any actually useful resources, in part because I actually just have a corpus of files in various file formats that I keep handy as a reference for some weird reason. However, Googling for simple PDF files yielded this, which I feel is very readable in a text editor. https://www.africau.edu/images/default/sample.pdf
I use the method with `canvas.clipPath(path, stroke=False, fill=True)` on a path I've parsed manually from SVG then `canvas.linearGradient`.
Since you read and write, maybe also consider the use case of programmatically filling some form fields in an editable PDF form, such as pre-filling some of the fields for a particular web site user in a dynamically modified PDF form they download. The source PDF form can then be hand-crafted and maintained separately, as people often want to do, not generated from scratch by your code.
Eventually I turned it into a website, added AWS API Gateway + Lambda and put the whole thing up for other daycare parents to use. Two weeks later the daycare switched to google forms and my project was not useful anymore.
That has been on my wishlist for several years: build a "PDF annotation" service that takes in a PDF that is not an XObject form (e.g. this random example: https://www.dentalworks.com/wp-content/uploads/2021/08/Patie... ) and replace those _____ areas with actual PDF inputs. My handwriting is terrible, and it's a waste of human capital for some poor soul to try and decipher handwriting only to (almost undoubtedly) re-type it into a computer on their end
I am sure we ended up in this situation because people just "File > Print to PDF" from Word or whatever, because knowing that PDF forms exist and then how to use Adobe(R) whatever(tm) to make a real editable PDF is "too much to ask."
I have had about 10% success with Preview.app detecting the lines and allowing me to click on them and type, but having https://notstupidpdf.example.com/www.dentalworks.com/wp-cont... would be much better for humanity
That shit was hard. Writing PDF is one thing, but there are some psychopathic PDFs out there when you scratch below the surface. People do .... well, you'll find out.
This is a real thing I dealt with.
We were so naive and didn’t know.
I wish more readers supported video but IIRC the standard doesn't actually support a normal modern format.
People who think of the format as "adversarial" are wrong. Adobe never gave a shit about being adversarial in that sense.
The problem is that PDF is not a file format, it's a defined subset of a programming language (PostScript) used for portable rendering with fidelity. It's portable, in the sense that it should render the same way on whatever device it's rendered on (printed on a page or mastered to a display). And it's portable because it doesn't allow any postscript job-level commands, and it tries to ensure that each PDF File is standalone and can be concatenated together into a multi-page document or embedded in another document.
Postscript (and PDF) are also postfix, which can be confusing.
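A small PostScript fragment showing the postfix style: operands are pushed first, and the operator at the end consumes them.

```
% PostScript (like PDF content streams) is postfix: operands, then operator.
/Helvetica findfont 12 scalefont setfont
72 720 moveto          % push 72 and 720, then 'moveto' consumes both
(Hello, world) show    % push the string, then 'show' paints it
showpage
```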
My hope is that computer vision + OCR will solve this once and for all in near future.
From a past project we have a Python PDF renderer left over; it might be somehow useful or inspirational…
https://metacpan.org/pod/CAM::PDF
I have used it in the past.
PyMuPDF and MuPDF are both available under dual open source AGPL and commercial licenses. They have been around for many years and are under continual development.
[Disclaimer, i work for Artifex, who wrote MuPDF and recently acquired PyMuPDF.]
I hate using ReportLab … reading its code is fascinating. Interesting seeing what 1990s Python code looked like.
https://pdfminersix.readthedocs.io/en/latest/reference/comma...
Plus all the fun of the fact that you can embed the following formats inside a PDF:
PNG, JPEG (including CMYK), JPEG 2000 (dead), JBIG2 (dead), CCITT G4 (dead, fax machines), PostScript Type 1 fonts (dead), PostScript Type 3 fonts (dead), PostScript CIDFonts (pre-Unicode, dead), CFF fonts (the inside of an OTF), TrueType fonts, ICC profiles, PostScript functions defining color spaces, XML forms (the worst), LZW-compressed data, run-length compressed data, Deflate-compressed data.
All of which Acrobat will allow to be malformed in various non-standard ways so you need to write your own parsers.
Note the lack of OpenType fonts, also lack of proper Unicode!
Not sure what you mean by "dead", but tons of book scans, particularly those at archive.org, are PDFs of entirely JPEG2000 images.
For example, I once had to try to parse PDF invoices generated by some legacy system, and at the bottom was a line that read something like, "Total: $32.56". But in the PDF there was an instruction to write out the string "Total:" and a separate one to write out the amount string, but there was nothing in the PDF itself that correlated the two in any way at all (they didn't appear anywhere close to either other in the page's hierarchy, they weren't at a fixed set of coordinates, etc, etc.).
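Concretely, the page's content stream can look something like this (coordinates, font name, and spacing made up for illustration): two completely independent text objects that only happen to land next to each other on the rendered page.

```
BT /F1 10 Tf 400 95 Td (Total:) Tj ET
% ...possibly hundreds of unrelated operators later...
BT /F1 10 Tf 440 95 Td ($32.56) Tj ET
```

Nothing in the file records that these two strings belong together; only their rendered positions do, so extraction tools must reconstruct lines geometrically.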
2. Many PDF documents do not conform to the PDF specification in a multitude of ways, yet Adobe Acrobat Reader still accepts them, and so PDF parsers have to implement a lot of kludgy logic in an attempt to replicate Adobe’s behavior.
3. The format has grown to be quite complex, with a lot of features added over the years. Implementing a parser even for spec-compliant PDFs is a decidedly nontrivial effort.
So PDF is a reasonably good output format for fixed-layout pages for display and especially for print, but a really bad input format.
Should a modern, open version of PDF be created, knowing how it evolved from the original concept in 1991? Shouldn't we at some point say we need to start over and create PDF 2?
And PDF is a subset of PostScript, the product that made Adobe and the DTP industry.
It's janky because the goal was to render identically everywhere. If you think it's easy look at the code abortion that is CSS.
Unfortunately in practice it would mean that everyone would have to support both PDF and PDF2.
(I have written a PDF parser myself.)