It's also a fun example of the futility of DRM. The spec includes password-based encryption, and allows for different "owner" and "user" passwords. There's a bitfield with options for things like "prevent printing", "prevent copying text", and so forth,[3] but because reading the document necessarily involves decrypting it, one can use the "user" password to open an encrypted PDF in a non-compliant tool,[4] then save the unencrypted version to get an editable equivalent.
[1] "More than just transparency" section of https://blog.adobe.com/en/publish/2022/01/31/20-years-of-tra...
[2] https://blog.didierstevens.com/2008/05/07/solving-a-little-p...
[3] Page 61 of https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...
[4] For example, a script that uses the pypdf library.
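A minimal sketch of that approach with pypdf (the file names and password here are placeholders, not from the original comment):

    from pypdf import PdfReader, PdfWriter

    # Open the encrypted PDF with the "user" password, then
    # save an unencrypted copy of every page.
    reader = PdfReader("encrypted.pdf")
    reader.decrypt("user-password")  # hypothetical password

    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    writer.write("decrypted.pdf")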
qpdf --decrypt <source pdf> <destination pdf>

http://ssp.impulsetrain.com/porterduff.html
The Porter-Duff operators are appealingly rigorous and easy to implement because they're simply the possible combinations of a simple formula. But many of the operators are not very useful in practice.
The Photoshop blending modes are practically the opposite: they are not derived from anything mathematically appealing; they're really just a collection of algorithms that Photoshop's designers originally found useful. They reflect the limitations of their early-1990s desktop computer implementations (for example, no attempt is made to account for gamma correction when combining layers, which makes many of these operations behave very differently from the actual light they are meant to emulate).
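For contrast, every Porter-Duff operator is just a choice of the two fractions Fa and Fb in the blend co = Fa·αa·Ca + Fb·αb·Cb. A rough Python sketch (my illustration, not from the linked article) of the familiar "over" operator, where Fa = 1 and Fb = 1 − αa, on premultiplied RGBA:

    def over(src, dst):
        # Porter-Duff "source over destination" on premultiplied RGBA,
        # all components in [0, 1]: out = src + dst * (1 - src_alpha)
        sr, sg, sb, sa = src
        dr, dg, db, da = dst
        k = 1.0 - sa
        return (sr + dr * k, sg + dg * k, sb + db * k, sa + da * k)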
They are a graph of objects of different types. The types themselves are well described in the official spec (I’m a sadist, I read it for fun).
My advice is always to convert the pdf to a version without compressed data like the author here has. My tool of choice is mutool (mutool clean -d in.pdf out.pdf). Then just have a rummage. You’ll be surprised by how much you can follow.
In the article the author missed a step where you look at the page object to see the resources. That's where the mapping from the font name used in the content stream to the underlying font object is made.
There’s also another important bit missing - most fonts are subset into the PDF. I.e., only the glyphs that are actually needed are kept in the font. I think that’s often where the re-encoding happens. ToUnicode is maintained to allow you to copy text (or search in a PDF). It’s a nice-to-have for users (in my experience it’s normally there and correct, though).
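A quick way to poke at that mapping, sketched with pypdf (the /F1-style names vary per document; "in.pdf" is a placeholder):

    from pypdf import PdfReader

    reader = PdfReader("in.pdf")
    page = reader.pages[0]

    # /Resources maps the names used in the content stream (e.g. /F1)
    # to the underlying font objects.
    for name, ref in page["/Resources"]["/Font"].items():
        font = ref.get_object()
        print(name, font.get("/BaseFont"), "/ToUnicode" in font)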
I think that's called being a masochist. Now, if you participated in writing the spec or were making others read it...
Their design philosophy of creating a read-only format was flawed to begin with. What's the first feature people are going to ask for??
PDF was not designed to be editable, nor for anyone to "work with" it in that way.
It was designed (at least the original purpose, circa 1989) to represent printed pages electronically in a format that would view and print identically everywhere. In fact, the initial advertising for the "value" of the PDF format was exactly this: no matter where a recipient viewed your PDF output, it would look, and print, identically to everywhere else.
It was originally meant to be "electronic paper".
Do you have any other insights on how to do a good job at that natively, i.e. without a cloud provider? Especially when dealing with tables.
pdf2ps a.pdf # convert to postscript "a.ps"
vim a.ps # edit postscript by hand
ps2pdf a.ps # convert back to pdf
Some complex PDFs (with embedded JavaScript, animations, etc.) fail to work correctly after this round trip. Yet for "plain" documents this works alright. You can easily remove watermarks, change some words and numbers, etc. Spacing is harder to modify. Of course, you need to know some PostScript.

Even if the output was correct, we still needed to re-order pages, apply barcodes for mail machine processing and postal sorting, and produce reports - which usually involved text-scraping the page to get an address via perl and other tools. Much easier in PS than PDF usually, but sometimes very unreliable, e.g. when their PDFs were 'secure' and didn't have correct glyph mappings.
In the worst cases, they would supply single-document PDFs, and merging those would cause an explosion of subset fonts in the output, which would fill the printer's memory and crash it. When I stopped working in the area, I think there still wasn't a useful tool to consolidate and merge subset fonts known to come from the same parent font - it would have been a very useful tool and should be possible, but I didn't have the time or knowledge to look into it.
Honestly, it seems like only malware authors benefit from the complexity of pdfs.
http://sdh33b.blogspot.com/2008/07/icfp-contest-2008.html?m=...
[1] source: made it up
If you modify things within the file, typically these offsets will change and the file will be corrupt. It looks like in this article they were only interested in changing one number to another, so none of the positions changed.
But generally, adding, removing, or modifying things in the middle of the file requires recomputing the xref table, at which point it becomes much easier to use a library than to edit the text directly.
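For a sense of why, here's roughly what a classic (uncompressed) cross-reference table looks like - each entry is a fixed 20 bytes: a 10-digit byte offset, a 5-digit generation number, and an in-use flag (the offsets below are made up):

    xref
    0 3
    0000000000 65535 f
    0000000017 00000 n
    0000000081 00000 n
    trailer
    << /Size 3 /Root 1 0 R >>
    startxref
    123
    %%EOF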
This made a coherent point in a digital workflow that could be saved and reprinted with ease. This was a big deal before the portable document format came to be.
I once made a workflow that took PDF files from Word, FileMaker, Excel, and MiniCad. This all got combined into a single 9,000-page PDF. The final PDF had coherent thumbnails, page numbers, and headers and footers.
Only took a couple of hours to get the final document after pushing the go button.
It indeed started life as “not-Turing-complete PostScript with an index” (the index makes it easy to render just the third page of a PDF file, something that’s impossible in PostScript without rendering the first and second pages first). Like PostScript, it was a pure text format.
One nice feature is that you can append a few pieces and a new index to an existing PDF file and get a new valid PDF file (which would still contain its old index as a piece of “junk DNA”)
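Schematically, such an incremental update just appends the changed objects plus a new xref section whose trailer points back at the previous one via /Prev (all offsets here are invented):

    %PDF-1.4
    ...original objects, xref table and trailer...

    4 0 obj              % replacement for an existing object
    << /Type /Page ... >>
    endobj
    xref
    4 1
    0000012345 00000 n
    trailer
    << /Size 5 /Root 1 0 R /Prev 11000 >>
    startxref
    12400
    %%EOF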
I think compression was added because users complained about file sizes. Ascii85 (https://en.m.wikipedia.org/wiki/Ascii85) grows binary data by 25%.
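The 25% follows from the encoding itself - every 4 bytes of input become 5 ASCII characters. Python's standard library can show it:

    import base64

    data = bytes(range(1, 9))         # 8 arbitrary binary bytes
    encoded = base64.a85encode(data)  # 10 ASCII characters
    print(len(data), len(encoded))    # 8 10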
> but then requiring the xref table to have fixed-length entries is odd
My guess is that made it easier to hack together a tool to convert PDF to postscript.
You can modify binaries all you want as long as you preserve the length of everything.
Some piece of software we had authenticated against a server, but everything was done on the client. The client executed SQL against the server directly, etc. Basically, the server checked to see if this client would put you over the number of licenses you purchased and that's it.
I had run it against a disassembler, found the part where it performed the check, and was able to change it to a straight JMP and then pad the rest of the space with NOPs.
My top 3 fun PDF facts:
1) Although PDF documents are typically 8-bit binary files, you can make one that is valid UTF-8 “plain text”, even including images, through the use of the ASCII85 filter.[0]
2) PDF allows an incredible variety of freaky features (3D objects, JavaScript, movies in an embedded Flash object, invisible annotations…). PDF/A is a much saner, safer subset.
3) The PDF spec allows you to write widgets (e.g. form controls) using “rich text”, which is a subset of XHTML and CSS - but this feature is very sparsely supported outside the official Adobe Reader.
[0] For example: https://lab6.com/2
The primary purpose of a PDF file is to tell you what to display (or print), with perfect clarity, in far fewer bytes than an actual image would take. It exploits the fact that the document creator knows about patterns in the document structure that, if expressed properly, make the document much more compressible than anything that an actual image compression algorithm could accomplish. For example, if you have access to the actual font, it's better to say "put these characters at these coordinates with that much spacing between them" than to include every occurrence of every character as a part of the image, hoping that the compression algorithm notices and compresses away the repetitions. Things like what character is part of what word, or even what unicode codepoint is mapped to which font glyph are basically unimportant if all you're after is efficiently transferring the image of a document.
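Concretely, that "put these characters at these coordinates" is a handful of content-stream operators; a trivial example:

    BT                   % begin text
    /F1 12 Tf            % select font resource /F1 at 12 points
    72 720 Td            % position the text cursor
    (Hello, world) Tj    % paint the string
    ET                   % end text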
If you have an editable document, you care a lot more about the semantics of the content, not just about its presentation. It matters to you whether a particular break in the text is supposed to be multiple spaces, the next column in a table or just a weird page layout caused by an image being present.

If you have some text at the bottom of each page, you care whether that text was put there by the document author multiple times, or whether it was entered once and set as a footer. If you add a new paragraph and have to change page layout, it matters to you that the last paragraph on this page is a footnote and should not be moved to the next one. If a section heading moves to another page, you care about the fact that the table of contents should update automatically and isn't just some text that the author has manually entered.

If you're a printer or screen, you care about none of these things, you just print or display whatever you're told to print or display. For a PDF, footnotes, section headings, footers or tables of contents don't have to be special, they can just be text with some meaningless formatting applied to it. This is why making PDF work for any purpose which isn't displaying or printing is never going to be 100% accurate. Of course, there are efforts to remedy this, and a PDF-creating program is free to include any metadata it sees fit, but it's by no means required to do so.
This isn't necessarily the mental model that the PDF authors had in mind, but it's a useful way to look at PDF and understand why it is the way it is.
If you printed it on a postscript printer it would look exactly the same (or better, if you used type 1 fonts).
That's the primary requirement of PDF, and has been since the beginning.
They added a bunch of interactive stuff to it, which are used occasionally (forms). But to understand PDF you need to understand the above first.
One should not attempt to edit a PDF, one should edit the document from which the PDF is generated.
Somehow it became "unprofessional" to just send meant-to-be-editable documents around for everyone to enjoy, so this is where we end up...
When I send a client a final document (that's not intended to be edited) in a .PDF format you can almost guarantee that it will look the same to them as it did for me. When I send someone a Word document, I can't guarantee that it will look the same between different versions of Word, Mac Word, Pages, Google doc etc.
I'm not saying .PDF formats are perfect, but they're certainly more consistently presented to the end user.
The same thing goes on with Word docs being sent out, formatted in a particular way like PDFs (e.g. a questionnaire), where the recipient needs to edit the document and then send it back.
HTML forms are other examples.
All these years later, still no globally standard way to achieve this quickly and easily, and yet it would seem perfect for the Open Source world to tackle.
Maybe they should have called it ‘Page Description Format’ then, instead of ‘Portable Document Format’?
I think his name was Bill. He took me, a 17 year old, to a Sigur Ros concert. Great dude. Wow two stories that don't involve pdfs!
https://github.com/whyboris/PDF-to-JPG-to-PDF
A government form didn't have editable fields that needed to be filled out. And editing the PDF was impossible (password protection). This was my solution.
Converting to JPG unnecessarily rasterizes text and introduces ugly compression artefacts.
It's relatively performant and it's a mature and supported codebase that can accomplish most pdf tasks.
https://pdfbox.apache.org/2.0/migration.html#why-was-the-rep...
> ReplaceText example has been removed as it gave the incorrect illusion that text can be replaced easily
That’s a matter of the toolset. I program in C#, and I have had good experience with this open source library: https://www.nuget.org/packages/iTextSharp-LGPL/ It’s a decade old by now, but PDF ain’t exactly a new format. The library is not terribly bad for many practical use cases. It’s particularly good when you only need to create documents as opposed to editing them, because for that use case you’d want to use an old version of the format anyway, for optimal compatibility.
What followed was a deep dive down the rabbit hole, a lot of fiddling with the same tools the author of this gist is using trying to make sense of it all.
The final solution worked better than I thought while at the same time felt incredibly wrong.
I'm very thankful for all the (probably painful) work that went into those open source PDF tools.
Very hard?
I worked on a tool that generated PDFs based on API responses. The tool added charts from the API data.
Those PDFs were reports with some hardcoded text.
Yeesh, what a fun ride that was.
It would reach the point where things would start to break, and .... "good times were had, by all".
This means that we can open those files, read them as one single string and match the expected text in unit tests. I've got a few projects doing that and it was fine.
If the text is compressed, pipe its content to qpdf first.
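A sketch of that kind of test (assumes qpdf is installed; file names and the expected string are made up):

    import subprocess

    # Rewrite with streams uncompressed so the text is matchable.
    subprocess.run(["qpdf", "--qdf", "report.pdf", "report.qdf.pdf"],
                   check=True)

    with open("report.qdf.pdf", "rb") as f:
        assert b"Expected heading" in f.read()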
2. The number of user devices with unpatched PDF readers is likely large.
3. The system of paywalled scientific knowledge drives millions of students and researchers to get their science PDFs from scihub and libgen pirate sites hosted in former Soviet countries, sometimes over http (not https).
These three facts combine into a huge vulnerability space.
On the flip side, a sane and open PDF replacement format that also offered reduced file size could gain many users quickly by convincing scihub and libgen to convert and offer their files in the new format, to cut costs and shorten download times, with reduced vulnerability as a positive externality.
https://qpdf.readthedocs.io/en/stable/overview.html
It turns a PDF (typically everything in it is compressed binary blobs) into a mixed binary/ASCII file (which itself is a PDF) that can be edited with vim.
> To view the compressed data, you can use a command line tool called qpdf.
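Presumably via QDF mode, which rewrites the file with streams uncompressed and objects laid out for hand editing (the companion fix-qdf tool then repairs the xref offsets after your edits):

    qpdf --qdf --object-streams=disable in.pdf out.pdf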