I’m one of the engineers at AI2 who helped make this happen. We’re excited about this for several reasons, which I’ll explain below.
Most academic papers are currently inaccessible. This means, for instance, that researchers who are vision impaired can’t access that research. Not only is this unfair, but it probably prevents breakthroughs from happening by limiting opportunities for collaboration.
We think this is partly because the PDF format isn’t easy to work with, and is therefore hard to make accessible. HTML, on the other hand, has benefited from years of open contributions. There are a lot of accessibility affordances, and they’re well documented and easy to add. In fact, our long-term hope is to use ML to make papers more accessible without (much) effort on the author’s part.
We’re also excited about distributing papers in HTML form, as we think it’ll allow us to greatly improve the UX of reading papers. We think papers should be easy to read regardless of the device you’re on, and we want to provide interactive, ML-provided enhancements to the reading experience like those provided via the Semantic Reader.
We’re eager to hear what you think, and happy to answer questions.
Edit: per https://allenai.org/terms point 5, you own all the uploads! So if we mistakenly send a medical PDF, for example, or something else that falls under GDPR, we can't ask you to delete it? WTF.
See https://papertohtml.org/about:
> What data do we keep? We cache a copy of the extracted content as well as the extracted images. This allows us to serve the results more quickly when a user uploads the same file again. We do not retain the uploaded files themselves. Cached content is never served to a user who has not provided the exact same document.
Also, we can delete the extracted data on request. Just send a note to accessibility@semanticscholar.org.
Sorry for the confusion!
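For illustration only, here is a minimal sketch of how a cache with the property quoted above could work. It assumes the cache key is a SHA-256 digest of the uploaded bytes, so only someone holding the exact same file can look up its extraction. The class and method names are hypothetical, and this is not necessarily how papertohtml.org implements it.

```python
import hashlib


class ExtractionCache:
    """Hypothetical in-memory cache keyed by a digest of the uploaded bytes.

    Only the extracted output is stored, never the upload itself.
    """

    def __init__(self):
        self._store = {}

    @staticmethod
    def key_for(pdf_bytes):
        # Assumption: identical files are recognized by their SHA-256 digest.
        return hashlib.sha256(pdf_bytes).hexdigest()

    def get(self, pdf_bytes):
        # Returns None unless the caller supplies the exact same bytes.
        return self._store.get(self.key_for(pdf_bytes))

    def put(self, pdf_bytes, extracted_html):
        self._store[self.key_for(pdf_bytes)] = extracted_html

    def delete(self, pdf_bytes):
        # Deletion on request: True if an entry was actually removed.
        return self._store.pop(self.key_for(pdf_bytes), None) is not None
```

With a scheme like this, a lookup with any other document simply misses, and a deletion request only needs the digest of the file in question.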
I think I will stick with pdftohtml, pdftotext, and pdfimages from Poppler (https://en.wikipedia.org/wiki/Poppler_(software)). These take seconds, not minutes.
From a user's perspective, I don't understand why you don't release the source code and let people compile a native application. (Did I miss the link to the source code?) Instead, it looks like this is just a means of collecting free data (metadata, more training data, data from submitted papers by default) every time someone submits a paper.
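For anyone who wants to try the local route, a minimal sketch of driving those three tools from Python follows. It assumes the Poppler utilities are installed and on PATH; the function names and the output naming scheme are my own choices, not anything prescribed by Poppler.

```python
import shutil
import subprocess
from pathlib import Path


def poppler_commands(pdf, out_dir):
    """Build the three command lines (hypothetical output naming)."""
    out = Path(out_dir)
    stem = Path(pdf).stem
    return [
        # Plain text, keeping the physical page layout where possible.
        ["pdftotext", "-layout", pdf, str(out / f"{stem}.txt")],
        # HTML as a single document, without a frameset.
        ["pdftohtml", "-s", "-noframes", pdf, str(out / stem)],
        # Embedded images, written as PNG files with the given prefix.
        ["pdfimages", "-png", pdf, str(out / f"{stem}-img")],
    ]


def convert(pdf, out_dir):
    """Run each tool in turn; raises if a tool is missing or fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in poppler_commands(pdf, out_dir):
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} not found on PATH")
        subprocess.run(cmd, check=True)
```

Calling `convert("paper.pdf", "out")` would then write the text, HTML, and image files under `out/`.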
This probably evokes some sense of authenticity. Like a physical paper document, it has exactly one appearance.
Regarding your question, I'd say that it is a natural continuation of the centuries-long tradition of writing on actual paper. The invention of TeX actually made it easier to produce more papers, then came PDF, and you could produce virtual papers. Also, science journals pretty much have a monopoly on the distribution of scientific knowledge, and they are mostly paper too.
My guess is it's largely for historical reasons. At the time most venues were organized, PDF was probably the best (or only) mechanism for sharing documents for print distribution.
But we think it's time to change that :).
PDF is the only widely supported format that can guarantee accurate reprints.
Any plans on that?
Their job is a little bit easier because arXiv papers have the .tex source available, so you can use one of the various tex2html variants, instead of having to extract the paper's contents from a rendered PDF.
Our focus right now is on providing a tool folks can run on whatever papers they have access to. For instance, some researchers might have access to documents that aren't available to the public. We want them to be able to run this against those.
That said, as we expand the effort, I imagine we'll eventually pre-convert things that are publicly available, like those on arXiv, etc.
We're not sure if it's something that we can distribute as OSS just yet. It relies on a few internal libraries that would also need to be publicly released, so it's not as simple as adjusting a single repository's visibility.
Though OP has their Kindle offline all the time, so it's not an issue for them.
One comment is that the slowest page to load was the Gallery [0], as it loads an ungodly number of PNG files from what appears to be a single IP (a GCP Compute instance?).
I see 421 requests and 150 MB loaded. Since it seems to be mostly thumbnails, have you considered using JPEGs instead of PNGs, lazy loading (i.e. not loading images outside of the viewport), and possibly GCP's (or another provider's) CDN offering?
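To make the lazy-loading suggestion concrete, here is a minimal sketch that generates thumbnail markup using the browser-native `loading="lazy"` attribute. I don't know how the gallery is actually rendered, so the function names, paths, and dimensions below are all made up for illustration.

```python
from html import escape


def thumbnail_tag(src, alt, width, height):
    """Return an <img> tag that the browser may defer until it nears the viewport."""
    return (
        f'<img src="{escape(src, quote=True)}" alt="{escape(alt, quote=True)}" '
        # Explicit dimensions let the page reserve space before the image loads.
        f'width="{width}" height="{height}" loading="lazy" decoding="async">'
    )


def gallery_html(items):
    """Join one tag per item; each item is a dict with src, alt, width, height."""
    return "\n".join(thumbnail_tag(**item) for item in items)


if __name__ == "__main__":
    # Hypothetical thumbnail paths, served as JPEGs rather than PNGs.
    print(gallery_html([
        {"src": "/thumbs/paper-1.jpg", "alt": "Paper 1", "width": 200, "height": 260},
        {"src": "/thumbs/paper-2.jpg", "alt": "Paper 2", "width": 200, "height": 260},
    ]))
```

With markup like this, only the thumbnails near the viewport are fetched up front, which should cut the initial request count considerably.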
Once I clicked a thumbnail, loading the article itself (for example [1]) was quite breezy.
The gallery is a great showcase of what your site does -- I think that it'd be worth making it snappier :-)
Cheers and congrats again
P.S. Also, the paper linked below [1] seems to have a few conversion problems -- I see "EQUATION (1): Not extracted; please refer to original document", and also some (formula? Greek?) characters that seem out of place after the words "and the next token is generated by sampling"
[0] https://papertohtml.org/gallery
[1] https://papertohtml.org/paper?id=02f033482b8045c687316ef81ba...
Yup. There's no CDN or anything like that right now. We kept things simple to get this out the door. But we definitely intend to make improvements like this as we improve the tool.
The more adoption we see, the more it motivates these types of fixes!
> P.S. Also, the paper linked below [1] seems to have a few conversion problems -- I see "EQUATION (1): Not extracted; please refer to original document", and also some (formula? Greek?) characters that seem out of place after the words "and the next token is generated by sampling"
Thanks for the catch. As you noted, there are still a fair number of extraction errors for us to correct!
Very cool project, looking forward to seeing how it develops!
For thumbnails of text papers, perhaps a GIF or PNG would be smaller than a JPEG while retaining pixel-accurate crispness?
It seems to me the masheen learningz technikz boil down to a generalization of my lightbulb moment here[2].
[1]: https://grobid.readthedocs.io/en/latest/
[2]: https://www.nu42.com/2014/09/scraping-pdf-documents-without-...
Also, we're working hard on a new extraction mechanism that should allow us to replace GROBID [2].
There are a lot of really smart people at AI2 working on this. I'm excited to see the resulting improvements and the cool things (like this) that we build with the results!
Off-topic low-value comment, but I'm now going to be getting a T-shirt made with the caption "i can haz masheen learningz?"
But clearly it is a nice idea and I can't wait that such tools work better!
Yup, this is a known issue that we're working towards fixing.
> But clearly it is a nice idea and I can't wait that such tools work better!
Glad to hear it!
> What are the limitations? There are several known limitations. Tables are currently extracted from PDFs as images, which are not accessible. Mathematical content is either extracted with low fidelity or not being extracted at all from PDFs. Processing of LaTeX source and PubMed Central XML may lack some of the features implemented for PDF processing. We are working to improve these components, but please let us know if you would like some of these features prioritized over others.
But we intend to fix this!
Can't wait to see what people do with this...
There are a lot of amazing people here, doing really great work. It's a really inspiring place to be. I feel really lucky to work with such great people on interesting, important problems.
Also, I should mention...we're hiring!
Maybe we'll work on vi bindings next...
Demo: https://pdf2htmlex.github.io/pdf2htmlEX/doc/tb108wang.html
My friend wrote his PhD in LaTeX, but it all ends up being PDFed anyway, for what, eye candy?
It's time to move on. #ditchpdf
As per other recent discussions on HN, I think the general accessibility of academic papers is ripe for improvement.