Here are some other similar(?) tools, for seeing the inner contents of a PDF file (the raw objects etc), but I haven't compared them to this tool here:
- https://github.com/itext/i7j-rups (java -jar ~/Downloads/itext-rups-7.2.5.jar)
- https://github.com/desgeeko/pdfsyntax (python3 -m pdfsyntax inspect foo.pdf > output.html)
- https://github.com/trailofbits/polyfile (polyfile --html output.html foo.pdf)
- https://www.reportmill.com/snaptea/PDFViewer/ = https://www.reportmill.com/snaptea/PDFViewer/pviewer.html (drag PDF onto it)
- https://sourceforge.net/projects/pdfinspector/ (an "example" of https://superficial.sourceforge.net/)
- https://www.o2sol.com/pdfxplorer/overview.htm
More?
mutool clean -d in.pdf out.pdf
At that point you’ll realise that a PDF is mostly just a list of objects and that those objects can reference each other. After that you’ll journey through the spec understanding what each type of object does and what the fields in it control. The graphics stream itself is just a stack-based coordinate drawing system that’s easy to follow too.
By way of example, here's an object that represents a Page. You can see the dimensions in the MediaBox. The contents themselves are in object "9 0 obj" ("9 0 R" is the pointer to it):
2 0 obj
<<
/Type /Page
/MediaBox [ 0 0 612 792 ]
/Contents 9 0 R
>>
endobj
Meanwhile "9 0 obj" has the drawing instructions. They seem a little weird at first glance, but you can see that the values ".23999999 0 0 -.23999999 0 792" each get pushed on the stack and then "cm" pops them and interprets them as the transformation matrix.
9 0 obj
<<
/Length 18266
>>
stream
.23999999 0 0 -.23999999 0 792 cm
q
0 0 2551 3301 re
...
The depth and detail of all of the different possible things that can be represented in a PDF is insane. But understanding the structure above is all you need to begin your journey!
EDIT: The rest of your journey is contained in this epic document: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...
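To make the stack description concrete, here is a minimal Python sketch of an interpreter for just the three operators that appear in the quoted stream (cm, q, re). The function names `run` and `apply_matrix` are made up for illustration, and the sketch assumes a single cm applied to an identity matrix; a real viewer handles many more operators and multiplies each cm into the current matrix.

```python
def apply_matrix(m, x, y):
    """Map the point (x, y) through a PDF matrix given as [a, b, c, d, e, f]."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def run(stream):
    """Interpret a tiny subset of content-stream operators (cm, q, re)."""
    stack = []                             # operands wait here until an operator consumes them
    ctm = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]   # identity transformation matrix
    saved = []                             # graphics states saved by "q"
    rects = []                             # transformed corners of every "re" rectangle
    for token in stream.split():
        try:
            stack.append(float(token))     # numbers are operands: push them
            continue
        except ValueError:
            pass                           # not a number, so treat it as an operator
        if token == "cm":
            # Simplification (assumption): with one cm starting from identity,
            # storing the six operands equals multiplying them into the matrix.
            ctm = stack[-6:]
            del stack[-6:]
        elif token == "q":
            saved.append(list(ctm))        # save a copy of the graphics state
        elif token == "re":
            x, y, w, h = stack[-4:]
            del stack[-4:]
            rects.append((apply_matrix(ctm, x, y), apply_matrix(ctm, x + w, y + h)))
    return rects


rects = run(".23999999 0 0 -.23999999 0 792 cm q 0 0 2551 3301 re")
print(rects)
```

With the values from the example, the 2551 by 3301 rectangle lands on roughly the 612 by 792 MediaBox with the y axis flipped, which is what the negative scale and the 792 offset in the matrix are for.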
My tool can do exactly the same (viewing the internal structure, exporting objects, and seeing the uncompressed raw content of a stream) with a graphical interface and without all these flags (which is one of the reasons I started designing this project with egui), but thanks for posting yours too.
The HTML output is like a pretty print where you can view objects and follow links to other objects.
Since then I have added a new command (disasm) that is CLI-oriented and displays a greppable summary of the structure. Here is an explanation: https://github.com/desgeeko/pdfsyntax/blob/main/docs/disasse...
Now regarding the tools you mentioned, I haven't checked out all of them, but some of them are interesting (and more mature in terms of testing and compatibility). However, some (at least the ones I tried) are very basic and don't offer "Save object as.." or let you uncompress an object. I like the feature of displaying the PDF for preview :)
> You may not deploy it on a network without disclosing the full source code of your own applications under the AGPL license. You must distribute all source code, including your own product and web-based applications.
They also have this delightful nagware encoded as a base64 string that spits this out in your logs [1]:
> You are using iText under the AGPL.
> If this is your intention, you have published your own source code as AGPL software too. Please let us know where to find your source code by sending a mail to agpl@apryse.com We'd be honored to add it to our list of AGPL projects built on top of iText and we'll explain how to remove this message from your error logs.
> If this wasn't your intention, you are probably using iText in a non-free environment. In this case, please contact us by filling out this form: http://itextpdf.com/sales If you are a customer, we'll explain how to install your license key to avoid this message. If you're not a customer, we'll explain the benefits of becoming a customer.
For using RUPS on a local computer you're probably safe, but I avoid the company because everything about their approach to the AGPL suggests that they chose it as a marketing technique for their paid products (with an extremely strong desire that it never be used commercially without pay), not out of a serious commitment to free software.
[0] https://itextpdf.com/how-buy/AGPLv3-license
[1] https://github.com/itext/itext-dotnet/blob/develop/itext/ite...
Stuff I do with it: modify content streams, extract images/content, investigate the general structure of PDF documents, remove pages, repair documents, ... It's literally a Swiss Army knife when working with PDFs.
Also, "There is no DOM, HTML, JS or CSS" is some uh-huh given the considerable amount of silliness involved in view-source:https://www.egui.rs/
https://web.archive.org/web/20110902114238/http://www.zynami...
We never got around to open sourcing it, so I'm happy to see that there is work being done in this space.
Congrats to seekbytes for releasing this!
Depending on your transformation use case, you may write an incremental update with only a few bytes at the end of the original file instead of rewriting it entirely. To my knowledge this feature of the PDF specification is often overlooked and not a lot of libraries implement it.
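As a rough illustration of where an incremental update hooks in: a PDF ends with a "startxref" keyword, the byte offset of the latest cross-reference section, and "%%EOF". An update appends new objects plus a new cross-reference section whose trailer records the previous offset under /Prev, leaving the original bytes untouched. The Python sketch below only reads that final offset; `last_startxref` is a made-up helper name, not part of any library mentioned here.

```python
import re


def last_startxref(pdf_bytes):
    """Return the byte offset written after the final 'startxref' keyword."""
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", pdf_bytes)
    if not matches:
        raise ValueError("no startxref/%%EOF pair found")
    # Each incremental update adds one more pair; the last one is current.
    return int(matches[-1])


# Synthetic stand-in for a file that already received one incremental update.
sample = (
    b"%PDF-1.7\n(original body)\nstartxref\n123\n%%EOF\n"
    b"(appended objects)\nstartxref\n456\n%%EOF\n"
)
print(last_startxref(sample))
```

A writer that appends an update would put the value returned here into /Prev of its new trailer, so readers can walk back through earlier revisions.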
It is a work in progress and I have not developed functions for images yet, though.
I've been needing something to see the x/y bounds of tables to fix some edge cases with camelot; there seem to be some good links in the comments here.
Installed from git using cargo 1.80.1 on Ubuntu 22.04 on an AMD Framework laptop if that's of any help.
Would appreciate any tool suggestions!
1. Use pdftk to uncompress it: pdftk input.pdf output uncompressed.pdf uncompress
2. Look at the PDF code (it's text based) to find the image insertion code.
3. Replace all instances of the image insertion code with strings of spaces the same length (there's a table of object byte offsets at the end that you don't want to mess up).
4. Use pdftk to compress it again: pdftk edited.pdf output output.pdf compress
I have a script that does this to remove pen strokes of particular colours so I can e.g. strip out marking rubric on test solutions written on a tablet.
Get the PDF 1.7 spec from https://pdfa.org/resource/pdf-specification-archive/. You're looking for the "Do" operator invoking a named image object defined elsewhere with "/Subtype /Image". See section 4.8, particularly the example on p343. Or, if it's badly done, it might instead be an inline image using the "BI" operator (a bit later in the same section).
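Step 3 above can be sketched in a few lines of Python. This assumes the file has already been uncompressed (step 1) and that the image is drawn with a sequence like "/Im0 Do"; the name "/Im0" is only a placeholder, so check what the actual file uses. `blank_out` is a made-up helper name.

```python
def blank_out(data: bytes, needle: bytes) -> bytes:
    """Overwrite every occurrence of needle with spaces of the same length.

    Keeping the length identical means the byte offsets in the
    cross-reference table at the end of the file stay valid.
    """
    return data.replace(needle, b" " * len(needle))


# Synthetic content-stream fragment; "/Im0" is a hypothetical image name.
before = b"q 100 0 0 100 50 50 cm /Im0 Do Q\n"
after = blank_out(before, b"/Im0 Do")
print(after)
```

For a real file you would read uncompressed.pdf in binary mode, pass the bytes through `blank_out`, write the result to edited.pdf, and then recompress with pdftk as in step 4.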
And what's "public bucket"?