Shameless plug: I am one of the maintainers of PolyFile, which, among other things, can produce an interactive HTML hex editor with an annotated syntax tree for dozens of filetypes, including PDF. For PDF, it uses a dynamically instrumented version of the PDFminer parser. It sounds like it might satisfy your use case.
Here's a CCC talk on it: https://media.ccc.de/v/MRMCD2014_-_6008_-_en_-_grossbaustell...
And the slides from the talk: https://www.slideshare.net/ange4771/schizophrenic-files-v2
Bravo! Best wordplay I've read today.
I wish you good luck; this file format has tripped up many, many developers. It blew up on a PDF I had lying around:
ValueError: could not convert string to float: b'5.0.0'
104 0 obj <<
/Producer (pdfTeX-1.40.10)
/Creator (TeX)
/CreationDate (D:20131209161146-08'00')
/ModDate (D:20131209161146-08'00')
/Trapped /False
/PTEX.Fullbanner (This is pdfTeX, Version 3.1415926-1.40.10-2.2 (TeX Live 2009/Debian) kpathsea version 5.0.0)
>> endobj
as it seems a string with nested parens jams up the parser.

There is no distinction between values and references in Python; everything is a reference. In fact, even primitives like numbers are full-blown struct objects in CPython: you cannot just manipulate the raw numbers.
The difference is rather whether you can modify an object or not. You cannot modify numbers, as they are immutable: any increment produces a new object. But you can modify a list. This gives the feeling that numbers are passed by value and lists are passed by reference.
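As for the nested-parens failure a few comments up: the PDF spec allows balanced parentheses inside literal strings without escaping, so a string scanner has to track nesting depth rather than stop at the first `)`. A minimal sketch (illustrative only, not a full tokenizer):

```python
# Minimal sketch of scanning a PDF literal string, honoring balanced
# nested parentheses and backslash escapes. Illustrative only.
def scan_literal_string(data: bytes, start: int) -> bytes:
    """Return the contents of the literal string beginning at data[start] == b'('."""
    assert data[start:start + 1] == b"("
    depth = 0
    out = bytearray()
    i = start
    while i < len(data):
        c = data[i:i + 1]
        if c == b"\\":              # escape sequence: keep both bytes verbatim
            out += data[i:i + 2]
            i += 2
            continue
        if c == b"(":
            depth += 1
            if depth > 1:           # nested open paren is part of the content
                out += c
        elif c == b")":
            depth -= 1
            if depth == 0:          # matching close of the outermost paren
                return bytes(out)
            out += c
        else:
            out += c
        i += 1
    raise ValueError("unterminated literal string")

banner = b"(This is pdfTeX (TeX Live 2009/Debian) kpathsea version 5.0.0)"
print(scan_literal_string(banner, 0))
# -> b'This is pdfTeX (TeX Live 2009/Debian) kpathsea version 5.0.0'
```

A parser that treats the first `)` as the end of the string would stop after `(TeX Live 2009/Debian)` and then try to interpret `kpathsea version 5.0.0` as tokens, which is consistent with the `ValueError` above.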
In particular, there are so many PDF libraries/tools that simply hide all the structure and try to provide an easy interface to the user, but they are always limited in various ways. Something like your project that focuses on parsing and browsing is really needed IMO.
Commiting a change (from Jun 5 20:57) that I don't understand any more
// From real life, lightly modified. Note the "/companyName, LLC" as key!
With absolutely no slight toward the author, that matches my mental model of dealing with PDFs: `git commit -mwtf`

Overall, at least so far, I haven't encountered much "WTF" dealing with PDFs, actually. The spec (especially the Adobe version; the ISO version based on it is only slightly different but feels much worse) is quite pleasant to read. There are some warts from backward compatibility with earlier poor decisions, but not too many of them. And while it's surprising what different PDF programs will produce as long as some PDF reader in existence happens to accept it (Hyrum's law; e.g. in this example, the dictionary key having a space in it), for my purposes it hasn't been a big deal, as I'm only trying to do the first level of parsing, and when even that is problematic I can happily just declare the PDF malformed.
PyPDF [1] is great for reading and writing PDF files, especially dealing with pages, but it’s not great for generating paths, shapes, graphics, etc.
However, reportlab [2] has a great API for generating those things, but is lacking in the file IO and page management department. But the content streams it generates can be plugged into PyPDF pretty easily.
Finally, there’s pdfplumber which does an amazing job of parsing tabular data from PDF structures, and pytesseract which can perform OCR on PDFs that are actually just image data rather than structured data.
There’s not really a one-stop-shop for PDFs, but some pretty good tools that can be combined to get the job done.
Will be curious to see how this project develops!
If you can, grab yourself a copy of the most recent PDF 2.0 specification since it contains much more information and is much more correct in terms of how to implement things. Also have a look at the errata at https://pdf-issues.pdfa.org/32000-2-2020/index.html.
As I'm implementing a PDF library (in Ruby), I have started to collect some situations that arise in the wild but are not spec-compliant; see https://github.com/gettalong/annotated-pdf-spec. That might help you parse some invalid PDFs.
That also more closely matches the mental model of those items: they are bugs against the specification, whether or not the official PDF Association agrees.
Maybe this will be a good solution.
If what you need is very simple (e.g. no word wrapping, same number of variable strings in the same positions), even manipulating the code of a template PDF directly is not too hard. This library would help with that.
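A sketch of what "manipulating a template PDF directly" can look like in the simplest case. This only works if the template's content stream is uncompressed (no /FlateDecode), and the placeholder name here is hypothetical; replacing with a string padded to the same length keeps all byte offsets and /Length values valid, so the xref table does not need rebuilding:

```python
# Illustrative sketch: fill a fixed-position placeholder in an
# uncompressed template PDF. Same-length replacement preserves all
# byte offsets, so the existing xref table stays correct.
def fill_placeholder(pdf: bytes, placeholder: bytes, value: bytes) -> bytes:
    if len(value) > len(placeholder):
        raise ValueError("value longer than placeholder")
    padded = value.ljust(len(placeholder))  # pad with spaces to keep length
    return pdf.replace(placeholder, padded)

# A fragment of a hypothetical template's content stream:
template = b"... BT /F1 12 Tf 72 700 Td (NAME________) Tj ET ..."
print(fill_placeholder(template, b"NAME________", b"Alice"))
```

For compressed streams you would need a real library to decode, edit, and re-encode the stream, at which point the /Length and xref entries must be recomputed.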
See pandoc: https://pandoc.org/
And it supports a variety of intermediate or input text formats, so you can pick your preferred poison, whether for book publishing, research papers, math papers, technical documentation, slides, etc.
Including the author's own djot: https://github.com/jgm/djot
EDIT:
Sibling reply suggests latex. OK, but then you're also learning latex.
More open source PDF code is good. If you can find a copy of the iText RUPS application somewhere on the internet, it's a useful tool for viewing the syntax/structure.
You mean this, right? https://github.com/itext/i7j-rups#readme
Presumably the key word here is "proper" because LibreOffice, etc., read docx and write pdf. For example, `libreoffice --headless --convert-to pdf myfile.docx`.
Converting to PDF is actually quite easy. Before Office 2010, you had to print to PostScript and then convert to PDF using Ghostscript. Nowadays Word gives you the option of saving to PDF.
And related: the best tools to generate PDFs from HTML.
There's a rather comprehensive list at: https://www.print-css.rocks/tools
As far as FOSS tools go, I've only found paged.js (a polyfill) in combination with a browser print-to-PDF (e.g. wkhtmltopdf (WebKit) or Puppeteer (Chrome)) to have any semblance of CSS support.
There's also Ghostscript, but AFAIK it doesn't support much/any CSS3 for print.
Makes me miss freshmeat.net which would have been my answer a few years ago (freshcode.club just isn't the same, although bless them for trying)
She was, to put it mildly, immediately suspicious of my browsing habits.
Also: the only software I know of written in Mercury.
I would be willing to help make this happen, but I do not know much about the PDF format.
One trick for getting started: PDFs are read from the bottom. The first thing that is read is actually an offset pointing back to the xref table, at the end of the file. Then, the xref table itself points to the latest version of all of the objects.
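The bottom-up read described above can be sketched with nothing but the standard library (illustrative only; a real reader then has to parse the xref table itself). The `startxref` keyword near the end of the file is followed by the decimal byte offset of the xref table:

```python
# Minimal sketch: locate the xref-table offset by reading the PDF tail.
# A PDF ends with:  startxref\n<byte offset>\n%%EOF
def find_startxref(data: bytes) -> int:
    tail = data[-1024:]                 # common practice: search the last 1024 bytes
    idx = tail.rfind(b"startxref")
    if idx < 0:
        raise ValueError("startxref not found")
    after = tail[idx + len(b"startxref"):]
    return int(after.split()[0])        # the decimal byte offset that follows

sample = b"%PDF-1.4\n...objects...\nxref\n...\nstartxref\n123\n%%EOF\n"
print(find_startxref(sample))  # -> 123
```

In a real file that offset points at the `xref` keyword (or, for newer files, at a cross-reference stream object), which in turn gives the offset of every object.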
The part you're most likely interested in is the content streams, which contain postscript-like drawing commands. To get a feel for it, following the official spec when reading a simple-looking document can help.
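For a taste of those operators, here is a minimal text-drawing content stream (the font name /F1 assumes a corresponding entry in the page's /Resources dictionary):

```
BT                 % begin a text object
/F1 24 Tf          % select font F1 at 24 points
72 720 Td          % move the text position to (72, 720)
(Hello, PDF) Tj    % show the string
ET                 % end the text object
```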
edit: I didn't link any actually useful resources, in part because I actually just have a corpus of files in various file formats that I keep handy as a reference for some weird reason. However, Googling for simple PDF files yielded this, which I feel is very readable in a text editor. https://www.africau.edu/images/default/sample.pdf
I use the method with `canvas.clipPath(path, stroke=False, fill=True)` on a path I've parsed manually from SVG then `canvas.linearGradient`.
Since you read and write, maybe also consider the use case of programmatically filling some form fields in an editable PDF form, such as pre-filling some of the fields for a particular web site user in a dynamically modified PDF form they download. The source PDF form can then be hand-crafted and maintained separately, as people often want to do, not generated from scratch by your code.
Eventually I turned it into a website, added AWS API Gateway + Lambda and put the whole thing up for other daycare parents to use. Two weeks later the daycare switched to google forms and my project was not useful anymore.
That has been on my wishlist for several years: build a "PDF annotation" service that takes in a PDF that is not an XObject form (e.g. this random example: https://www.dentalworks.com/wp-content/uploads/2021/08/Patie... ) and replace those _____ areas with actual PDF inputs. My handwriting is terrible, and it's a waste of human capital for some poor soul to try and decipher handwriting only to (almost undoubtedly) re-type it into a computer on their end
I am sure we ended up in this situation because people just "File > Print to PDF" from Word or whatever, because knowing that PDF forms exist and then how to use Adobe(R) whatever(tm) to make a real editable PDF is "too much to ask."
I have had about 10% success with Preview.app detecting the lines and allowing me to click on them and type, but having https://notstupidpdf.example.com/www.dentalworks.com/wp-cont... would be much better for humanity
That shit was hard. Writing PDF is one thing, but there are some psychopathic PDFs out there when you scratch below the surface. People do .... well, you'll find out.
This is a real thing I dealt with.
We were so naive and didn’t know.
I wish more readers supported video but IIRC the standard doesn't actually support a normal modern format.
People who think of the format as "adversarial" are wrong. Adobe never gave a shit about being adversarial in that sense.
The problem is that PDF is not a file format, it's a defined subset of a programming language (PostScript) used for portable rendering with fidelity. It's portable, in the sense that it should render the same way on whatever device it's rendered on (printed on a page or mastered to a display). And it's portable because it doesn't allow any postscript job-level commands, and it tries to ensure that each PDF File is standalone and can be concatenated together into a multi-page document or embedded in another document.
Postscript (and PDF) are also postfix, which can be confusing.
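A small PostScript fragment showing the postfix style: operands are pushed first, and the operator at the end consumes them.

```
% PostScript (like PDF content streams) is postfix: operands, then operator.
/Helvetica findfont 12 scalefont setfont
72 720 moveto          % push 72 and 720, then 'moveto' consumes both
(Hello, world) show    % push the string, then 'show' paints it
showpage
```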
My hope is that computer vision + OCR will solve this once and for all in near future.
From a past project we have a Python PDF renderer left over; it might be somehow useful or inspirational…
https://metacpan.org/pod/CAM::PDF
I have used it in the past.
PyMuPDF and MuPDF are both available under dual open source AGPL and commercial licenses. They have been around for many years and are under continual development.
[Disclaimer, i work for Artifex, who wrote MuPDF and recently acquired PyMuPDF.]
I hate using ReportLab … reading its code is fascinating. Interesting seeing what 1990s Python code looked like.
https://pdfminersix.readthedocs.io/en/latest/reference/comma...
Plus all the fun of the fact that you can embed the following formats inside a PDF:
PNG, JPEG (including CMYK), JPEG 2000 (dead), JBIG2 (dead), CCITT G4 (dead, fax machines), PostScript Type 1 fonts (dead), PostScript Type 3 fonts (dead), PostScript CIDFonts (pre-Unicode, dead), CFF fonts (the inside of an OTF), TrueType fonts, ICC profiles, PostScript functions defining color spaces, XML forms (the worst), LZW-compressed data, run-length compressed data, Deflate-compressed data.
All of which Acrobat will allow to be malformed in various non-standard ways so you need to write your own parsers.
Note the lack of OpenType fonts, also lack of proper Unicode!
Not sure what you mean by "dead", but tons of book scans, particularly those at archive.org, are PDFs of entirely JPEG2000 images.
For example, I once had to try to parse PDF invoices generated by some legacy system, and at the bottom was a line that read something like, "Total: $32.56". But in the PDF there was an instruction to write out the string "Total:" and a separate one to write out the amount string, but there was nothing in the PDF itself that correlated the two in any way at all (they didn't appear anywhere close to either other in the page's hierarchy, they weren't at a fixed set of coordinates, etc, etc.).
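Concretely, the page's content stream can look something like this (coordinates, font name, and spacing made up for illustration): two completely independent text objects that only happen to land next to each other on the rendered page.

```
BT /F1 10 Tf 400 95 Td (Total:) Tj ET
% ...possibly hundreds of unrelated operators later...
BT /F1 10 Tf 440 95 Td ($32.56) Tj ET
```

Nothing in the file records that these two strings belong together; only their rendered positions do, so extraction tools must reconstruct lines geometrically.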
2. Many PDF documents do not conform to the PDF specification in a multitude of ways, yet Adobe Acrobat Reader still accepts them, and so PDF parsers have to implement a lot of kludgy logic in an attempt to replicate Adobe’s behavior.
3. The format has grown to be quite complex, with a lot of features added over the years. Implementing a parser even for spec-compliant PDFs is a decidedly nontrivial effort.
So PDF is a reasonably good output format for fixed-layout pages for display and especially for print, but a really bad input format.
Should a modern, open version of PDF be created, knowing how it evolved from the original concept in 1991? Shouldn't we at some point say we need to start over and create PDF 2?
And PDF is a subset of PostScript, the product that made Adobe and the DTP industry.
It's janky because the goal was to render identically everywhere. If you think it's easy look at the code abortion that is CSS.
Unfortunately in practice it would mean that everyone would have to support both PDF and PDF2.
(I have written a PDF parser myself.)