It's also a fun example of the futility of DRM. The spec includes password-based encryption, and allows for different "owner" and "user" passwords. There's a bitfield with options for things like "prevent printing", "prevent copying text", and so forth,[3] but because reading the document necessarily involves decrypting it, one can use the "user" password to open an encrypted PDF in a non-compliant tool,[4] then save the unencrypted version to get an editable equivalent.
[1] "More than just transparency" section of https://blog.adobe.com/en/publish/2022/01/31/20-years-of-tra...
[2] https://blog.didierstevens.com/2008/05/07/solving-a-little-p...
[3] Page 61 of https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...
[4] For example, a script that uses the pypdf library.
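A minimal sketch of that approach with pypdf (the file names and password here are placeholders, not from the original comment):

    from pypdf import PdfReader, PdfWriter

    # Open the encrypted PDF with the "user" password, then
    # save an unencrypted copy of every page.
    reader = PdfReader("encrypted.pdf")
    reader.decrypt("user-password")  # hypothetical password

    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    writer.write("decrypted.pdf")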
qpdf --decrypt <source pdf> <destination pdf>

http://ssp.impulsetrain.com/porterduff.html
The Porter-Duff operators are appealingly rigorous and easy to implement because they're simply the possible combinations of a simple formula. But many of the operators are not very useful in practice.
The Photoshop blending modes are practically the opposite: they are not derived from anything mathematically appealing; they're really just a collection of algorithms that Photoshop's designers originally found useful. They reflect the limitations of their early-1990s desktop computer implementations (for example, no attempt is made to account for gamma correction when combining layers, which makes many of these operations behave very differently from the actual light they are meant to emulate).
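For contrast, every Porter-Duff operator is just a choice of the two fractions Fa and Fb in the blend co = Fa·αa·Ca + Fb·αb·Cb. A rough Python sketch (my illustration, not from the linked article) of the familiar "over" operator, where Fa = 1 and Fb = 1 − αa, on premultiplied RGBA:

    def over(src, dst):
        # Porter-Duff "source over destination" on premultiplied RGBA,
        # all components in [0, 1]: out = src + dst * (1 - src_alpha)
        sr, sg, sb, sa = src
        dr, dg, db, da = dst
        k = 1.0 - sa
        return (sr + dr * k, sg + dg * k, sb + db * k, sa + da * k)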
They are a graph of objects of different types. The types themselves are well described in the official spec (I’m a sadist, I read it for fun).
My advice is always to convert the pdf to a version without compressed data like the author here has. My tool of choice is mutool (mutool clean -d in.pdf out.pdf). Then just have a rummage. You’ll be surprised by how much you can follow.
In the article the author missed a step where you look at the page object to see the resources. That's where the mapping from the font name used in the content stream to the underlying font object is made.
There’s also another important bit missing - most fonts are subset into the PDF. I.e., only the glyphs that are actually needed are kept in the font. I think that’s often where the re-encoding happens. ToUnicode is maintained to allow you to copy text (or search in a PDF). It’s a nice-to-have for users (in my experience it’s normally there and correct, though).
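A quick way to poke at that mapping, sketched with pypdf (the /F1-style names vary per document; "in.pdf" is a placeholder):

    from pypdf import PdfReader

    reader = PdfReader("in.pdf")
    page = reader.pages[0]

    # /Resources maps the names used in the content stream (e.g. /F1)
    # to the underlying font objects.
    for name, ref in page["/Resources"]["/Font"].items():
        font = ref.get_object()
        print(name, font.get("/BaseFont"), "/ToUnicode" in font)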
I think that's called being a masochist. Now, if you participated in writing the spec or were making others read it...
Their design philosophy of creating a read-only format was flawed to begin with. What's the first feature people are going to ask for??
PDF was not designed to be editable, nor for anyone to "work with" it in that way.
It was designed (at least the original purpose, circa 1989) to represent printed pages electronically in a format that would view and print identically everywhere. In fact, the initial advertising for the "value" of the PDF format was exactly this: no matter where a recipient viewed your PDF output, it would look, and print, identically to everywhere else.
It was originally meant to be "electronic paper".
Do you have any other insights on how to do a good job at that natively, i.e. without a cloud provider? Especially when dealing with tables.
pdf2ps a.pdf # convert to postscript "a.ps"
vim a.ps # edit postscript by hand
ps2pdf a.ps # convert back to pdf
Some complex PDFs (with embedded JavaScript, animations, etc.) fail to work correctly after this round trip. Yet for "plain" documents this works alright. You can easily remove watermarks, change some words and numbers, etc. Spacing is harder to modify. Of course, you need to know some PostScript.

Even if the output was correct, we still needed to re-order pages, apply barcodes for mail machine processing and postal sorting, and produce reports - which usually involved text-scraping the page to get an address via perl and other tools. Much easier in PS than PDF usually, but sometimes very unreliable, e.g. when their PDFs were 'secure' and didn't have correct glyph mappings.
In the worst cases, they would supply single-document PDFs, and merging those would cause an explosion of subset fonts in the output, which would fill the printer's memory and crash it. When I stopped working in the area, I think there still wasn't a useful tool to consolidate and merge subset fonts known to come from the same parent font - it would have been a very useful tool and should be possible, but I didn't have the time or knowledge to look into it.
Honestly, it seems like only malware authors benefit from the complexity of pdfs.
http://sdh33b.blogspot.com/2008/07/icfp-contest-2008.html?m=...
[1] source: made it up
If you modify things within the file, typically these offsets will change and the file will be corrupt. It looks like in this article they were only interested in changing one number to another, so none of the positions changed.
But generally, adding, removing, or modifying things in the middle of the file requires recomputing the xref table, at which point it becomes much easier to use a library than to edit the text directly.
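For a sense of why, here's roughly what a classic (uncompressed) cross-reference table looks like - each entry is a fixed 20 bytes: a 10-digit byte offset, a 5-digit generation number, and an in-use flag (the offsets below are made up):

    xref
    0 3
    0000000000 65535 f
    0000000017 00000 n
    0000000081 00000 n
    trailer
    << /Size 3 /Root 1 0 R >>
    startxref
    123
    %%EOF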
This made a coherent point in a digital workflow that could be saved and reprinted with ease. This was a big deal before the portable document format came to be.
I once made a workflow that took PDF files from Word, FileMaker, Excel, and MiniCad. This all got combined into a single 9,000-page PDF. The final PDF had coherent thumbnails, page numbers, and headers and footers.
Only took a couple of hours to get the final document after pushing the go button.
It indeed started life as “not-Turing-complete PostScript with an index” (the index makes it easy to render just the third page of a PDF file, something that’s impossible in PostScript without rendering the first and second pages first). Like PostScript, it was a pure text format.
One nice feature is that you can append a few pieces and a new index to an existing PDF file and get a new valid PDF file (which would still contain its old index as a piece of “junk DNA”)
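Schematically, such an incremental update just appends the changed objects plus a new xref section whose trailer points back at the previous one via /Prev (all offsets here are invented):

    %PDF-1.4
    ...original objects, xref table and trailer...

    4 0 obj              % replacement for an existing object
    << /Type /Page ... >>
    endobj
    xref
    4 1
    0000012345 00000 n
    trailer
    << /Size 5 /Root 1 0 R /Prev 11000 >>
    startxref
    12400
    %%EOF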
I think compression was added because users complained about file sizes. Ascii85 (https://en.m.wikipedia.org/wiki/Ascii85) grows binary data by 25%.
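The 25% follows from the encoding itself - every 4 bytes of input become 5 ASCII characters. Python's standard library can show it:

    import base64

    data = bytes(range(1, 9))         # 8 arbitrary binary bytes
    encoded = base64.a85encode(data)  # 10 ASCII characters
    print(len(data), len(encoded))    # 8 10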
> but then requiring the xref table to have fixed-length entries is odd
My guess is that made it easier to hack together a tool to convert PDF to postscript.
You can modify binaries all you want as long as you preserve the length of everything.
Some piece of software we had authenticated against a server, but everything was done on the client. The client executed SQL against the server directly, etc. Basically, the server checked to see if this client would put you over the number of licenses you purchased and that's it.
I had run it against a disassembler, found the part where it performed the check, and was able to change it to a straight JMP and then pad the rest of the space with NOPs.
My top 3 fun PDF facts:
1) Although PDF documents are typically 8-bit binary files, you can make one that is valid UTF-8 “plain text”, even including images, through the use of the ASCII85 filter.[0]
2) PDF allows an incredible variety of freaky features (3D objects, JavaScript, movies in an embedded Flash object, invisible annotations…). PDF/A is a much saner, safer subset.
3) The PDF spec allows you to write widgets (e.g. form controls) using “rich text”, which is a subset of XHTML and CSS - but this feature is very sparsely supported outside the official Adobe Reader.
[0] For example: https://lab6.com/2
The primary purpose of a PDF file is to tell you what to display (or print), with perfect clarity, in far fewer bytes than an actual image would take. It exploits the fact that the document creator knows about patterns in the document structure that, if expressed properly, make the document much more compressible than anything that an actual image compression algorithm could accomplish. For example, if you have access to the actual font, it's better to say "put these characters at these coordinates with that much spacing between them" than to include every occurrence of every character as a part of the image, hoping that the compression algorithm notices and compresses away the repetitions. Things like what character is part of what word, or even what unicode codepoint is mapped to which font glyph are basically unimportant if all you're after is efficiently transferring the image of a document.
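Concretely, that "put these characters at these coordinates" is a handful of content-stream operators; a trivial example:

    BT                   % begin text
    /F1 12 Tf            % select font resource /F1 at 12 points
    72 720 Td            % position the text cursor
    (Hello, world) Tj    % paint the string
    ET                   % end text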
If you have an editable document, you care a lot more about the semantics of the content, not just about its presentation. It matters to you whether a particular break in the text is supposed to be multiple spaces, the next column in a table or just a weird page layout caused by an image being present.

If you have some text at the bottom of each page, you care whether that text was put there by the document author multiple times, or whether it was entered once and set as a footer. If you add a new paragraph and have to change page layout, it matters to you that the last paragraph on this page is a footnote and should not be moved to the next one. If a section heading moves to another page, you care about the fact that the table of contents should update automatically and isn't just some text that the author has manually entered.

If you're a printer or screen, you care about none of these things, you just print or display whatever you're told to print or display. For a PDF, footnotes, section headings, footers or tables of contents don't have to be special, they can just be text with some meaningless formatting applied to it. This is why making PDF work for any purpose which isn't displaying or printing is never going to be 100% accurate. Of course, there are efforts to remedy this, and a PDF-creating program is free to include any metadata it sees fit, but it's by no means required to do so.
This isn't necessarily the mental model that the PDF authors had in mind, but it's a useful way to look at PDF and understand why it is the way it is.
If you printed it on a postscript printer it would look exactly the same (or better, if you used type 1 fonts).
That's the primary requirement of PDF, and has been since the beginning.
They added a bunch of interactive stuff to it, which are used occasionally (forms). But to understand PDF you need to understand the above first.
One should not attempt to edit a PDF, one should edit the document from which the PDF is generated.
Somehow it became "unprofessional" to just send meant-to-be-editable documents around for everyone to enjoy, so this is where we end up...
When I send a client a final document (that's not intended to be edited) in a .PDF format you can almost guarantee that it will look the same to them as it did for me. When I send someone a Word document, I can't guarantee that it will look the same between different versions of Word, Mac Word, Pages, Google doc etc.
I'm not saying .PDF formats are perfect, but they're certainly more consistently presented to the end user.
The same thing goes on with Word docs being sent out, formatted in a particular way like PDFs (e.g. a questionnaire), where the recipient needs to edit the document and then send it back.
HTML forms are other examples.
All these years later, still no globally standard way to achieve this quickly and easily, and yet it would seem perfect for the Open Source world to tackle.
Maybe they should have called it ‘Page Description Format’ then, instead of ‘Portable Document Format’?
I think his name was Bill. He took me, a 17 year old, to a Sigur Ros concert. Great dude. Wow two stories that don't involve pdfs!
https://github.com/whyboris/PDF-to-JPG-to-PDF
A government form didn't have editable fields that needed to be filled out. And editing the PDF was impossible (password protection). This was my solution.
Converting to JPG unnecessarily rasterizes text and introduces ugly compression artefacts.
It's relatively performant and it's a mature and supported codebase that can accomplish most pdf tasks.
https://pdfbox.apache.org/2.0/migration.html#why-was-the-rep...
> ReplaceText example has been removed as it gave the incorrect illusion that text can be replaced easily
That’s a matter of the toolset. I program in C#, and I have had good experience with this open source library: https://www.nuget.org/packages/iTextSharp-LGPL/ It’s a decade old by now, but PDF ain’t exactly a new format. The library is not terribly bad for many practical use cases. It’s particularly good when you only need to create documents as opposed to editing them, because for that use case you’d want to use an old version of the format anyway, for optimal compatibility.
What followed was a deep dive down the rabbit hole, a lot of fiddling with the same tools the author of this gist is using trying to make sense of it all.
The final solution worked better than I thought while at the same time felt incredibly wrong.
I'm very thankful for all the (probably painful) work that went into those open source PDF tools.
Very hard?
I worked on a tool that generated PDFs based on API responses. The tool added charts from the API data.
Those PDFs were reports with some hardcoded text.
Yeesh, what a fun ride that was.
It would reach the point where things would start to break, and .... "good times were had, by all".
This means that we can open those files, read them as one single string and match the expected text in unit tests. I've got a few projects doing that and it was fine.
If the text is compressed, pipe its content to qpdf first.
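A sketch of that kind of test (assumes qpdf is installed; file names and the expected string are made up):

    import subprocess

    # Rewrite with streams uncompressed so the text is matchable.
    subprocess.run(["qpdf", "--qdf", "report.pdf", "report.qdf.pdf"],
                   check=True)

    with open("report.qdf.pdf", "rb") as f:
        assert b"Expected heading" in f.read()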
2. The number of user devices with unpatched PDF readers is likely large.
3. The system of paywalled scientific knowledge drives millions of students and researchers to get their science PDFs from scihub and libgen pirate sites hosted in former Soviet countries, sometimes over http (not https).
These three facts combine into a huge vulnerability space.
On the flip side, a sane and open PDF replacement format that also offered reduced file size could gain many users quickly by convincing scihub and libgen to convert and offer their files in the new format, to cut costs and shorten download times, with reduced vulnerability as a positive externality.
https://qpdf.readthedocs.io/en/stable/overview.html
It turns a PDF (typically everything in it is compressed binary blobs) into a mixed binary/ASCII file (which itself is a PDF) that can be edited with vim.
> To view the compressed data, you can use a command line tool called qpdf.
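Presumably via QDF mode, which rewrites the file with streams uncompressed and objects laid out for hand editing (the companion fix-qdf tool then repairs the xref offsets after your edits):

    qpdf --qdf --object-streams=disable in.pdf out.pdf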