Show HN: Paper to HTML Converter

(papertohtml.org)

153 points · codeviking · 4y ago · 58 comments

45 comments · 14 top-level

codeviking (OP) · 4y ago · 15 in thread

Hi all,

I’m one of the engineers at AI2 that helped make this happen. We’re excited about this for several reasons, which I’ll explain below.

Most academic papers are currently inaccessible. This means, for instance, that researchers who are vision impaired can’t access that research. Not only is this unfair, but it probably prevents breakthroughs from happening by limiting opportunities for collaboration.

We think this is partly because the PDF format isn’t easy to work with, and is therefore hard to make accessible. HTML, on the other hand, has benefited from years of open contributions. There are a lot of accessibility affordances, and they’re well documented and easy to add. In fact, our long-term hope is to use ML to make papers more accessible without (much) effort on the author’s part.

We’re also excited about distributing papers in their HTML form, as we think it’ll allow us to greatly improve the UX of reading papers. We think papers should be easy to read regardless of the device you’re on, and want to provide interactive, ML-provided enhancements to the reading experience like those provided via the Semantic Reader.

We’re eager to hear what you think, and happy to answer questions.

Do you remove the pdf files we send to your servers?

Edit: per https://allenai.org/terms point 5, you own all the uploads! So if by mistake we send a medical PDF, for example, or something else that is under GDPR, we can't ask you to delete it???? Wtfffff

codeviking (OP) · 4y ago

We don't retain the uploaded document. We cache the extracted content to make things more efficient.

See https://papertohtml.org/about:

> What data do we keep? We cache a copy of the extracted content as well as the extracted images. This allows us to serve the results more quickly when a user uploads the same file again. We do not retain the uploaded files themselves. Cached content is never served to a user who has not provided the exact same document.

Also, we can delete the extracted data on request. Just send a note to accessibility@semanticscholar.org.

Sorry for the confusion!
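
For readers wondering how a cache can be limited to people who upload "the exact same document": one common approach is to key the cache on a cryptographic hash of the uploaded bytes. The sketch below is only an assumption about how such a scheme could work, not a description of what papertohtml.org actually runs. `ExtractionCache` and `fake_extract` are hypothetical names made up for illustration.

```python
import hashlib
from typing import Callable, Dict


class ExtractionCache:
    """Hypothetical content-addressed cache keyed by the SHA-256 of the upload."""

    def __init__(self, extract: Callable[[bytes], str]) -> None:
        self._extract = extract            # the expensive extraction step (assumed)
        self._results: Dict[str, str] = {}  # digest -> extracted HTML
        self.misses = 0                     # how many times extraction actually ran

    def get_html(self, pdf_bytes: bytes) -> str:
        # Only someone holding the same bytes can produce the same key.
        key = hashlib.sha256(pdf_bytes).hexdigest()
        if key not in self._results:
            self.misses += 1
            self._results[key] = self._extract(pdf_bytes)
        # The uploaded bytes are never stored, only the digest and the result.
        return self._results[key]


def fake_extract(pdf_bytes: bytes) -> str:
    # Stand-in for a real extraction pipeline.
    return f"<article>extracted {len(pdf_bytes)} bytes</article>"


cache = ExtractionCache(fake_extract)
first = cache.get_html(b"%PDF-1.7 example A")
second = cache.get_html(b"%PDF-1.7 example A")        # served from the cache
other = cache.get_html(b"%PDF-1.7 another example")   # different bytes, new entry
```

With this design the second upload of identical bytes skips extraction, while any change to the file, even a single byte, produces a different key and a fresh extraction.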

Telemakhos · 4y ago

Is there any thought about presenting the papers as TEI XML with XSLT to display the paper in a browser or screenreader? TEI provides pagination support (needed for citing page numbers, because most of academia still needs that) and extensive semantic markup for things like bibliographic information. It also serves as one data model that can be converted easily with existing tools (XSLT) to provide many representations for humans, while also serving as a machine-parsable text for datamining. Digital humanities has made heavy use of TEI for years, and this project seems like it could benefit from it.
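
To make the "one data model, many representations" idea concrete, here is a minimal sketch that renders a tiny TEI fragment as HTML. It swaps XSLT for Python's standard-library `xml.etree.ElementTree` purely to stay self-contained. The sample document is simplified (not schema-valid TEI) and the `tei_to_html` helper is made up for illustration; real TEI is much richer.

```python
import xml.etree.ElementTree as ET
from html import escape

# The TEI namespace; the "tei" prefix is just a local alias for lookups.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Simplified, hypothetical TEI fragment with a title, a section and a page break.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>An Example Paper</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div>
        <head>Introduction</head>
        <p>First paragraph.</p>
        <pb n="2"/>
        <p>Second paragraph, on page 2.</p>
      </div>
    </body>
  </text>
</TEI>"""


def tei_to_html(tei_xml: str) -> str:
    """Render title, section heads, paragraphs and page breaks as HTML."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    parts = [f"<h1>{escape(title)}</h1>"]
    for div in root.iterfind(".//tei:body/tei:div", TEI_NS):
        for child in div:
            tag = child.tag.split("}")[-1]  # strip the "{namespace}" prefix
            text = escape("".join(child.itertext()).strip())
            if tag == "head":
                parts.append(f"<h2>{text}</h2>")
            elif tag == "p":
                parts.append(f"<p>{text}</p>")
            elif tag == "pb":
                # Keep the page number so citations by page remain possible.
                page = escape(child.get("n", ""))
                parts.append(f'<span class="page-break" data-page="{page}"></span>')
    return "\n".join(parts)


html_output = tei_to_html(SAMPLE_TEI)
print(html_output)
```

The same parsed tree could feed a plain-text or data-mining export instead, which is the appeal of keeping a single semantic source.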

1vuio0pswjnm7 · 4y ago

"We're eager to hear what you think, ..."

I think I will stick with pdftohtml, pdftotext, and pdfimages https://en.wikipedia.org/wiki/Poppler_(software). These take seconds not minutes.

From a user's perspective, I don't understand why you don't release the source code and let people compile a native application. (Did I miss the link to the source code?) Instead it looks like this is just a means of collecting free data (metadata, more training data, data from submitted papers by default) every time someone submits a paper.

politelemon · 4y ago

I've never actually questioned the why, so maybe you could shine some light... why are they usually published as PDFs?

ephbit · 4y ago

I always assumed the main reason for using PDFs is that an author/distributor can be pretty sure they're rendered almost exactly the same (fonts, layout) no matter which viewer they're viewed with.

This probably evokes some kind of sense of authenticity. Like some physical paper document it has exactly one appearance.

kartoshechka · 4y ago

Unfortunately for my mental health, my thesis was exactly about converting arXiv papers to modern-looking HTML, and there are so many more broken, unjust and ugly things in academia than using PDFs...

Regarding your question, I'd say that it is a natural continuation of the centuries-long tradition of writing on actual paper. The invention of TeX actually made it easier to produce more papers, then came PDF, and you could produce virtual papers. Also, science journals pretty much have a monopoly on scientific knowledge distribution, and they are mostly paper too.

codeviking (OP) · 4y ago

Y'know, that's a good question. I'm not sure I know the answer.

My guess is it's largely for historical reasons. At the time most venues were organized, PDF was probably the best (or only) mechanism for sharing documents for print distribution.

But we think it's time to change that :).

DoreenMichele · 4y ago

I have no idea at all but as a wild guess, I would assume it's because you can't edit PDFs. So you know it says the same thing forever and no one went and changed it in response to reading criticism of their paper or something.

What alternative do you have? Word file?

PDF is the only widely supported format that can guarantee an accurate reprint.

znpy · 4y ago

I'd love to see a way to re-export a paper into a digital-friendly format, say epub/mobi to use on my e-reader.

Any plans on that?

kwhitefoot · 4y ago

You could give Calibre a try. The result will probably be a long way from perfect for complicated documents but it does work reasonably well for most things. Formulas don't translate well unfortunately.

isaacimagine · 4y ago

Looks great! Have you considered linking this up to something like arxiv or other preprint sites?

_delirium · 4y ago

There's already this for arXiv: https://www.arxiv-vanity.com/

Their job is a little bit easier because arXiv papers have the .tex source available, so you can use one of the various tex2html variants, instead of having to extract the paper's contents from a rendered PDF.

codeviking (OP) · 4y ago

Yup, we're definitely thinking about this.

Our focus right now is on providing a tool folks can run on whatever papers they have access to. For instance, some researchers might have access to documents that aren't available to the public. We want them to be able to run this against those.

That said, as we expand the effort I imagine we'll eventually pre-convert things that are publicly available, like those on arXiv, etc.

chrisMyzel · 4y ago · 6 in thread

This is amazing! Will make my (offline-only) Kindle finally display scientific papers. Took a random link from arXiv and it worked like a charm, including the TOC. Will this be OS'ed?

kartoshechka · 4y ago

You may check out https://arxiv-vanity.com as well. It's open source, and conversion rates are close to 70% on random arXiv papers if I'm not mistaken, but it can hardly be called stable.

Isthatablackgsd · 4y ago

There is an offline solution if you are looking for one: Calibre. It is basically an ebook manager with extras. It can convert PDFs to mobi, customizable to your preferences, and it has presets for Kindles. It also works with DRM'ed files via DeDRM plugins, and it can export directly to your Kindle. A fair warning: don't use Calibre if you have already structured your ebook folder. The app imports everything into its own database folder, which doubles the disk space used.

codeviking (OP) · 4y ago

Yay, glad to hear it! If you end up viewing one of these on your Kindle, let us know how well (or not) things work.

We're not sure if it's something that we can distribute as OSS just yet. It relies on a few internal libraries that would also need to be publicly released, so it's not as simple as adjusting a single repository's visibility.

mintplant · 4y ago

See also KOReader [0], if jailbreaking is an option for you. The built-in column splitter works pretty well for the papers I've used it to read.

[0] https://github.com/koreader/koreader

I've used KOReader in the past, and it's awesome! Keeping the jailbreak when my Kindle randomly decides to update itself, not so much. (Yes, I followed instructions to disable updates, but it still somehow managed to update.) At some point it becomes too much of a hassle.

Though OP has their Kindle offline all the time, so it's not an issue for them.

chrisMyzel · 4y ago

(HTML->Mobi is totally possible)

gregsadetsky · 4y ago · 3 in thread

Great site, congrats!

One comment is that the slowest page to load was the Gallery [0] as it loads an ungodly amount of PNG files from what appears to be a single IP (a GCP Compute instance?)

I see 421 requests and 150 Mb loaded. As it seems to be mostly thumbnails, have you considered using JPEGs instead of PNGs, potentially using lazy loading (i.e. not loading images outside of the viewport), and potentially using GCP's (or another provider's) CDN offering?

Once I clicked a thumbnail, loading the article itself (for example [1]) was quite breezy.

The gallery is a great showcase of what your site does -- I think that it'd be worth making it snappier :-)

Cheers and congrats again

P.S. Also, the paper linked below [1] seems to have a few conversion problems -- I see "EQUATION (1): Not extracted; please refer to original document", and also some (formula? Greek?) characters that seem out of place after the words "and the next token is generated by sampling"

[0] https://papertohtml.org/gallery

[1] https://papertohtml.org/paper?id=02f033482b8045c687316ef81ba...

codeviking (OP) · 4y ago

> One comment is that the slowest page to load was the Gallery [0] as it loads an ungodly amount of PNG files from what appears to be a single IP (a GCP Compute instance?)

Yup. There's no CDN or anything like that right now. We kept things simple to get this out the door. But we definitely intend to make improvements like this as we improve the tool.

The more adoption we see, the more it motivates these types of fixes!

> P.S. Also, the paper linked below [1] seems to have a few conversion problems -- I see "EQUATION (1): Not extracted; please refer to original document", and also some (formula? Greek?) characters that seem out of place after the words "and the next token is generated by sampling"

Thanks for the catch. As you noted there's still a fair number of extraction errors for us to correct!

mintplant · 4y ago

Another sample paper that caused some trouble with figure extraction: https://www.cs.utexas.edu/~hovav/dist/vera.pdf

Very cool project, looking forward to seeing how it develops!

> have you considered using jpegs instead of pngs

For thumbs of text papers, perhaps a GIF or PNG would be smaller than a JPEG while retaining pixel accurate crispness?

nanis · 4y ago · 2 in thread

This seems to be pdf2tohtml combined with GROBID [1].

It seems to me the masheen learningz technikz boil down to a generalization of my lightbulb moment here[2].

[1]: https://grobid.readthedocs.io/en/latest/

[2]: https://www.nu42.com/2014/09/scraping-pdf-documents-without-...

codeviking (OP) · 4y ago

Yup, right now we use GROBID, do some post processing and combine the output with other extraction techniques. For instance, we use a model to extract document figures[1], so that we can render them in the resulting HTML document.

Also, we're working hard on a new extraction mechanism that should allow us to replace GROBID [2].

There's a lot of really smart people at AI2 working on this, I'm excited to see the resulting improvements and the cool things (like this) that we build with the results!

[1]: https://api.semanticscholar.org/CorpusID:4698432

[2]: https://api.semanticscholar.org/CorpusID:235265639

tailspin2019 · 4y ago

> It seems to me the masheen learningz technikz...

Off-topic low-value comment, but I'm now going to be getting a T-shirt made with the caption "i can haz masheen learningz?"

oolonthegreat · 4y ago · 1 in thread

cool project, though the name was confusing for me: I believe to most people "paper" first means actual paper, so I thought this was some kind of OCR system converting printed material to html?

codeviking (OP) · 4y ago

Thanks for the feedback. There's two hard problems n' all that... :)

p4bl0 · 4y ago · 1 in thread

I tried that a few days ago with one of my papers (a PDF generated using pdflatex) and it didn't work that well: the text was fine but some section titles were off, and all of the math and code parts were broken.

But clearly it is a nice idea and I can't wait that such tools work better!

codeviking (OP) · 4y ago

> all of the math and code parts were broken.

Yup, this is a known issue that we're working towards fixing.

> But clearly it is a nice idea and I can't wait that such tools work better!

Glad to hear it!

NmAmDa · 4y ago · 1 in thread

I tried several physics papers and none of them had any equations extracted. Is it by design that it has problems with LaTeX equations?

codeviking (OP) · 4y ago

Yup, this is a known limitation:

> What are the limitations? There are several known limitations. Tables are currently extracted from PDFs as images, which are not accessible. Mathematical content is either extracted with low fidelity or not being extracted at all from PDFs. Processing of LaTeX source and PubMed Central XML may lack some of the features implemented for PDF processing. We are working to improve these components, but please let us know if you would like some of these features prioritized over others.

But we intend to fix this!

jimmySixDOF · 4y ago · 1 in thread

I am so amazed at the work you guys are doing at AI2 & the Semantic Scholar project. You guys are really fixing a broken system of research and discovery which suffers from organization design principles based on university library index card filing cabinets as magnified by the exponential content growth.

Can't wait to see what people do with this...

codeviking (OP) · 4y ago

Thanks!

There's a lot of amazing people here, doing really great work. It's a really inspiring place to be. I feel really lucky to work with such great people on interesting, important problems.

Also, I should mention...we're hiring!

https://allenai.org/careers#current-openings

johnhenry · 4y ago · 1 in thread

Retro mode should be default.

codeviking (OP) · 4y ago

I agree!

Maybe we'll work on vi bindings next...

Klasiaster · 4y ago

For non-reflow conversion there is pdf2htmlEX. The original repository, https://github.com/coolwanglu/pdf2htmlEX, is discontinued, but development continues under https://github.com/pdf2htmlEX/pdf2htmlEX

Demo: https://pdf2htmlex.github.io/pdf2htmlEX/doc/tb108wang.html

kartoshechka · 4y ago

Looks like exactly the type of crunch work ML would do, but have you considered using brute-force converters like LaTeXML or pandoc where appropriate?

When are we, as people, going to ditch PDF? It's an awful format.

My friend wrote his PhD in LaTeX, but it all ends up being PDFed anyway for what, eye candy?

It's time to move on. #ditchpdf

tailspin2019 · 4y ago

Haven't tried it yet, but a very cool concept.

As per other recent discussions on HN I think the general accessibility of academic papers is ripe for improvement.

Please make it popular in the research field so you can spin up your own Sci-Hub!
