Show HN: PDF Debugger – Inspect Structure of PDF Files

(pdf.hyzyla.dev)

157 points | hyzyla | 2y ago | 26 comments

svat 2y ago

This is great, thanks for sharing!

It is also inspiring to me, because I have had the same idea and been working on something like this on-and-off since early 2022, but the contrast between your project and the state of mine[1] is like a textbook example of how to ship vs how not to ship:

• Instead of using an existing parser like pdf.js as you do, I started writing my own parser from scratch, in the process learning Rust and its Nom library for parsing, its integration with Webassembly, etc.

• I wrote not just a straightforward parser, but a crazy one that preserves all the details like whitespace etc. (what a typical parser is supposed to ignore), so that I can test whether it round-trips successfully.

• After I got it working, I didn't stop at "works on almost all PDFs in practice" (the same as with PDF.js or any other PDF implementation) but actually chased down and investigated every single failure, checking whether they work in any other PDF application/library (Preview, Chrome, qpdf, Adobe Reader, etc), until I could prove to my satisfaction that it's not a fault in the parser. (This is still not complete…)

• When I returned to this project again after several months, instead of making further progress I spent time starting to document the code, making minor improvements and tweaks, etc.
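The round-trip idea from the second bullet can be sketched roughly as follows. This is a minimal illustration in Python, not svat's Rust/Nom parser: the token classes, the `tokenize_lossless` and `round_trips` names, and the simplified grammar (a name is split into `/` plus a regular token, and strings are not treated specially) are all assumptions made for the example. The point is only that whitespace and comments are kept as tokens, so joining the tokens must reproduce the input byte for byte.

```python
import re

# Hypothetical token classes for a tiny subset of PDF syntax.
# Whitespace and comments are kept as tokens instead of being skipped.
TOKEN_RE = re.compile(
    rb"(?P<ws>[\x00\t\n\x0c\r ]+)"                      # PDF whitespace characters
    rb"|(?P<comment>%[^\r\n]*)"                         # comment up to end of line
    rb"|(?P<delim><<|>>|[\[\]{}()<>/])"                 # delimiter characters
    rb"|(?P<regular>[^\x00\t\n\x0c\r \[\]{}()<>/%]+)"   # everything else
)


def tokenize_lossless(data: bytes) -> list[tuple[str, bytes]]:
    """Split data into (kind, bytes) tokens without dropping any byte."""
    tokens = []
    pos = 0
    while pos < len(data):
        m = TOKEN_RE.match(data, pos)
        if m is None:
            raise ValueError(f"cannot tokenize at offset {pos}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def round_trips(data: bytes) -> bool:
    """True if concatenating the tokens reproduces the input exactly."""
    return b"".join(tok for _, tok in tokenize_lossless(data)) == data
```

A check such as `round_trips(b"<< /Type /Page   % note\r\n/Count 3 >>\n")` would then pass only if no whitespace or comment byte was lost along the way.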

So the end result is that my project basically does nothing still, while you have a working PDF debugger. :) This is the difference between a project that intends to actually produce something and one that ends up being mostly for learning/fun with the goal mostly forgotten… not that I have any regrets :)

[Meta: Something similar is true of this comment too, which I started two days ago but left as a draft… until I finally had a burst of energy and posted just now.]

Returning to your project, a couple of feature requests:

- Provide a shortcut to jump directly to the node for page N, for any user-provided page number N.

- (Where possible) Some annotation of the page content stream operators — the Tj, Td, etc.
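As a rough idea of what such operator annotation might look like, here is a small Python sketch. It is only an illustration under loud assumptions: the `OPERATORS` table covers a handful of operators with descriptions paraphrased from the PDF specification, the tokenizer is heavily simplified (no nested parentheses, no inline images, and keywords such as `true` would be mistaken for operators), and none of this reflects how PDF Debugger or PDF.js work internally.

```python
import re

# A few content stream operators; descriptions are paraphrased, not exhaustive.
OPERATORS = {
    "BT": "begin text object",
    "ET": "end text object",
    "Tf": "set font and size",
    "Td": "move to start of next line, offset by (tx, ty)",
    "Tj": "show text string",
    "TJ": "show text with individual glyph positioning",
    "Tm": "set text matrix",
    "q": "save graphics state",
    "Q": "restore graphics state",
    "cm": "concatenate matrix to current transformation matrix",
}

# Simplified tokens: literal string, array brackets, hex string, name, other.
TOKEN_RE = re.compile(
    r"\((?:\\.|[^\\()])*\)|\[|\]|<[0-9A-Fa-f\s]*>|/[^\s\[\]()<>/]*|[^\s\[\]()<>/]+"
)


def is_operand(tok: str) -> bool:
    """Names, strings, arrays and numbers are operands; the rest are operators."""
    if tok[0] in "/(<[]":
        return True
    try:
        float(tok)
        return True
    except ValueError:
        return False


def annotate(stream: str) -> list[str]:
    """Return one line per operator, with its operands and a short description."""
    lines, operands = [], []
    for tok in TOKEN_RE.findall(stream):
        if is_operand(tok):
            operands.append(tok)
        else:
            desc = OPERATORS.get(tok, "(not in this table)")
            lines.append(f"{' '.join(operands + [tok])}  % {desc}")
            operands = []
    return lines
```

For example, `annotate("BT /F1 12 Tf 72 720 Td (Hello, PDF) Tj ET")` yields five lines, one per operator, each followed by a `%` comment describing it.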

(Do consider making it an open-source project, whatever the quality of the code…)

[1]: https://github.com/shreevatsa/pdf-explorer / https://shreevatsa.net/pdf-explorer/

hyzyla (OP) 2y ago

Thanks for sharing your story! My goal was to have an MVP as fast as possible; otherwise, I could lose interest in it. It is the biggest reason why I chose to use an existing parser instead of writing my own (I've initialized an empty Rust project on my OS for that).

A few things that I have on my nice-to-do feature list, but that are hard to implement without writing my own parser:

- edit nodes (with XREF table update)

- raw source editor

- show actual position in source
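To illustrate why editing a node drags the XREF table along with it: a classic cross-reference table stores the byte offset of every object, so changing the length of one object shifts the offsets of everything after it. The sketch below is a minimal Python illustration, not how PDF Debugger or PDF.js handle this. `build_pdf` is a hypothetical helper that writes objects back to back and regenerates the table; it assumes object 1 is the catalog and ignores object streams, generation numbers other than 0, and incremental updates (which append a new xref section instead of rewriting the file).

```python
def build_pdf(objects: list[bytes]) -> bytes:
    """Serialize object bodies and emit a matching classic xref table.

    Hypothetical helper: objects are numbered from 1 in list order,
    all with generation 0, and object 1 is assumed to be the catalog.
    """
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" starts
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    # Each entry is exactly 20 bytes: 10-digit offset, 5-digit generation,
    # a type flag (n = in use, f = free) and a two-byte end of line.
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)
```

Calling `build_pdf` again after lengthening the first object produces a different offset for the second one, which is exactly the bookkeeping an in-place node editor would need to redo.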

svat 2y ago

For editing, I was able to make some simple edits (not of individual objects, but things like removing or duplicating pages, or editing crop boxes) using pdf-lib instead of pdf.js: see for example (just right-click and "view source") https://shreevatsa.net/pdf-pages/ and https://shreevatsa.net/pdf-unspread/

For seeing the raw source, after using such things for a bit (e.g. the output HTML file generated by https://github.com/desgeeko/pdfsyntax which is very good), I'm starting to feel it's nice to look at the first few times / in some cases, but in the long run / for large PDFs, maybe it's not really so useful or worth it.

Uptrenda 2y ago

What's your aim making this here? Is it mostly just to show the sub-components of a PDF? Because if you added more support for Javascript it would fill a need (IDK how niche, lol.)

When I was trying to improve my resume I added some custom Javascript to the PDF using Adobe Reader and what I learned is even Adobe's product makes it painful. Basically the process was something like this:

1. You add a script that loads at certain sections in the document. Let's say this is the equivalent of document.load.

2. To do this there's a field to add the full script which must be typed up correctly beforehand. Only after it's added do you know it works and every syntax error requires you to edit your previous script, delete the script you added, and hope your new version works.

3. There's really no interactive way to work with the scripts. Their 'debugger' has almost no features at all or hints of syntax errors. Even getting a script to run in it requires finding the right combination of key strokes in a 1000+ line document on PDF scripting.

The programming itself though is quite simple. It's just Javascript with a different DOM and security model. You can still do event-based programming and write powerful programs - all running inside a PDF. But it will only run in Firefox (using PDF.js I think) and Adobe Reader (for the JS support.) I just thought I'd tell you that writing these JS programs in PDFs (1) actually seems to have a lot of unrealized potential and (2) the tooling to do so is terrible. So with better JS support it would be useful.

pixelgeek 2y ago

So I don't know what your intended audience was but there are elements of this that would be handy for people trying to trouble check PDFs for Print on Demand projects. I'm going to repost the link to a few groups I am in that do this sort of work. Definitely not for everyone but it might be handy for some.

phonon 2y ago

hyzyla (OP) 2y ago

I took inspiration from RUPS by iText [1]. It's maybe the most popular tool for inspecting PDF files.

1. https://github.com/itext/i7j-rups

qingcharles 2y ago

I needed this last night when I was trying to edit some text in a PDF and I ended up having to use iText Rups to browse the tree.

Is there a way to view the stream content in ASCII/Unicode instead of Base64/Hex?

hyzyla (OP) 2y ago

Thanks for the idea. I added this feature a few minutes ago [1]. I'm trying to convert stream content to UTF-8. Maybe later, I will add a more flexible solution to convert to other encodings.

1. https://imgur.com/a/69rbYMw
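A rough sketch of what "convert stream content to UTF-8" could involve, written in Python purely for illustration (the tool itself runs in the browser, and its actual implementation is not shown in this thread). The `stream_to_text` name and its `filter_name` parameter are assumptions made for the example: the stream is inflated if it uses the common FlateDecode filter, then decoded as UTF-8 with undecodable bytes replaced, so that binary streams still display something instead of raising an error.

```python
import zlib
from typing import Optional


def stream_to_text(raw: bytes, filter_name: Optional[str] = None) -> str:
    """Decode a PDF stream body for display.

    Hypothetical helper: only the FlateDecode filter is handled here;
    any other stream is decoded as-is.
    """
    data = zlib.decompress(raw) if filter_name == "FlateDecode" else raw
    # errors="replace" swaps each undecodable byte for U+FFFD.
    return data.decode("utf-8", errors="replace")
```

Supporting other encodings later, as mentioned above, would mostly mean making the `"utf-8"` argument selectable.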

qingcharles 2y ago

This works perfectly now on my test files. I can find the pages that have specific strings I was looking for.

This has saved me clogging up my PC by installing the Java runtime and iText RUPS.

qingcharles 2y ago

My bad, I meant UTF-8. My brain had a relapse to 1988.

And, thank you!

Gys 2y ago

An example output would have been nice. There is also nothing about privacy and stuff. What happens to the uploaded pdf? Or is it processed inside the browser? But then there could still be a call to a back end?

mkl 2y ago

You can get an example output by clicking the "try example PDF file" link.

hyzyla (OP) 2y ago

The uploaded PDF is processed entirely inside the browser. I use Posthog analytics, so there are definitely requests to the analytics server.

anymoonus 2y ago

That's very cool!

1. Is the source available anywhere? I'm curious to see how it works.

2. Is there a way to connect the structure displayed here, to the rendered version in the PDF? To visually display the subcomponents?

zauguin 2y ago

Regarding 2.: Most of these objects do not directly correspond to rendered elements. Basically every page has one (typically) content stream which will contain all rendered elements. The biggest rendered thing you see outside of that are annotations (link boxes, form fields, actual annotations, ...).

It's a bit different if you are looking at a tagged PDF, where the tagging structure is in there, but if you want to look at that in detail you are probably better served with e.g. ngPDF (https://ngpdf.com/) which will show the tagging structure including the mapping to rendered elements.

hyzyla (OP) 2y ago

I haven't decided if I want to create an open-source version. In the first place, I made it private to worry less about my code quality and to finish the product faster before I lose interest in it.

It heavily relies on the core part of PDF.js: I've made a fork of the PDF.js project, removed everything not related to the core part, and added an export for low-level primitives [1].

Also, as inspiration, I used the pdf.js.utils [2] project, which almost does the same but in a different form.

1. https://github.com/hyzyla/pdf.js-core

2. https://github.com/brendandahl/pdf.js.utils

pixelgeek 2y ago

Very nice work.

I wouldn't worry about the quality of the code. You get better by seeing other people's work and seeing alternative solutions to the problems you had.

Also, as I mentioned in another comment, this could easily be built into a quick trouble-checking app for POD work. Posting it would also let people fork it to make more task-specific apps.

vendiddy 2y ago

Very interesting. I'm dealing with a lot of PDF generation at work using pdf-lib so this could come in handy. Thanks for the share!

mkl 2y ago

Very neat!

I spotted a typo, which led me to a bug. When I click on a "stream contents" node, the right panel says "It's a actual content" (instead of "an"), and there is some mouse handling issue that prevents me from selecting the text in the right panel.

hyzyla (OP) 2y ago

Thanks! I've fixed the typo and also allowed the selection of the node's text on the left panel (it was disabled by default).

nraynaud 2y ago

very cool. Do you think you could hot link the PDF spec of each element for the casual observer?

hyzyla (OP) 2y ago

Yeah, it's definitely going on my to-do list.

davedx 2y ago

This is really cool, thank you. Maintaining good PDF parsers is a full time job...

hyzyla (OP) 2y ago

I want to be honest and open here: I did not write the PDF parser on my own. I heavily relied on the PDF.js project from Mozilla. I have a disclosure in the footer, but perhaps I should communicate about it more clearly.

davedx 2y ago

No, it was clear already from the webpage that it uses PDF.js. I've also used it. I just think this is a really great way of visualizing PDFs; I shared it with my team as we deal with them a lot.
