Show HN: PDFLayoutTextStripper – Converts PDF to text while keeping the layout

(github.com)

283 points · jlink · 9y ago · 92 comments


55 comments · 11 top-level

jlinkOP9y ago· 12 in thread

Who would be interested in a website that does the job online?

logicallee9y ago

If you really want to rake it in, serve text versions of the top 10,000 web sites at static speeds (meaning instantly: boot a ramdrive (tmpfs) and serve static HTML from nginx, all from RAM). There is so much crap on most sites. Re-crawl hourly.

Monetize via Google AdWords.

EDIT: I'm not sure why I'm being downvoted. I am not suggesting serving PDFs. I am suggesting serving tiny text renders of top sites that are otherwise much too bloated.

The hard part is getting the text and layout right. Many people read many sites for the text IMO.

So I am suggesting you make an all-text version.

As an example, the front page of the New York Times right now, copied into Microsoft Word, is 2504 words. When I save what I copied into Word as .txt, I get a 16.4 KB file.

By comparison, when I put the site into a Page Size Checker -- http://smallseotools.com/website-page-size-checker/ -- I get 214.23 KB. That is impressively small, and it's a fast page.

If I try their competition, the Washington Post, I get 237 KB. If I try the Wall Street Journal, I get 938.15 KB -- nearly a full Megabyte. (This is actually more what I was expecting - I'm impressed by the Times.)

Suppose someone desperately wants to glance at the Wall Street Journal from a poor connection where they barely get data. The difference between 12 KB and nearly a megabyte is huge. It's the difference between 4 seconds and 312 seconds: 4 seconds as compared with 5 full minutes.

So there is a large need in my opinion for such a service in case someone desperately wants to see a text render. Preserving any formatting at all, helps hugely.
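The timing comparison above can be checked with a few lines of arithmetic. This is only a sketch: the 3 KB/s link speed is an assumption inferred from the "12 KB in 4 seconds" figure, and latency and protocol overhead are ignored.

```python
def transfer_seconds(size_kb: float, rate_kb_per_s: float) -> float:
    """Time to download size_kb at a steady rate, ignoring latency and overhead."""
    return size_kb / rate_kb_per_s


# Assumed link speed: the "12 KB in 4 seconds" figure implies about 3 KB/s.
RATE_KB_PER_S = 12 / 4

text_render = transfer_seconds(12, RATE_KB_PER_S)      # text-only version
full_page = transfer_seconds(938.15, RATE_KB_PER_S)    # full WSJ page size quoted above

print(f"text render: {text_render:.0f} s, full page: {full_page / 60:.1f} min")
```

At that assumed rate the full page takes a little over five minutes, which matches the comparison in the comment.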

cozzyd9y ago

You can use the SHA-1 of the PDFs to avoid serving the same PDF twice.
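A minimal sketch of that idea, assuming the service keeps converted text in memory keyed by the digest of the uploaded bytes. `ConversionCache` and `fake_convert` are hypothetical names made up for this illustration; a real service would plug in its actual converter and a persistent store.

```python
import hashlib


def sha1_of(data: bytes) -> str:
    """Hex SHA-1 digest of the raw file bytes."""
    return hashlib.sha1(data).hexdigest()


class ConversionCache:
    """Runs the (hypothetical) converter once per distinct file content."""

    def __init__(self, convert):
        self._convert = convert
        self._by_digest = {}

    def text_for(self, pdf_bytes: bytes) -> str:
        digest = sha1_of(pdf_bytes)
        if digest not in self._by_digest:
            # First time this exact content is seen: convert and remember it.
            self._by_digest[digest] = self._convert(pdf_bytes)
        return self._by_digest[digest]


# Stand-in converter that just records how often it is called.
calls = []


def fake_convert(pdf_bytes: bytes) -> str:
    calls.append(len(pdf_bytes))
    return f"<text extracted from {len(pdf_bytes)} bytes>"


cache = ConversionCache(fake_convert)
cache.text_for(b"%PDF-1.4 same file")
cache.text_for(b"%PDF-1.4 same file")   # served from the cache
print(len(calls))
```

Hashing the bytes means two uploads of the same file under different names still share one conversion.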

Sounds cool and all but two huge problems:

1) Copyright: completely re-serving the complete content of the top 100 sites with your own ads does not fall under fair use and would almost certainly be a magnet for lawsuits.

2) Distribution: how do you find your niche of people with poor internet connections and get them to use your mirror instead of whatever site it is they want to read?

Doesn't Opera Mini or Turbo already provide this service? Perhaps add PPMD proxy text compression with an English dictionary, with a JavaScript browser plugin on top of that. You can't get more efficient than that.

rodw9y ago

For what it's worth, here's a service that does that https://documentalchemy.com/demo/pdf2txt (and more: https://documentalchemy.com/demo)

flexie9y ago

Just tried the demos on this website.

I tried to extract text from a PDF that already has searchable text, which can be copy-pasted. This should be the easiest task of all, but it made mistakes in every second word.

Then I asked the website to convert a PDF into a Word file. It just inserted the whole PDF as a picture in Word.

jlinkOP9y ago

Thanks for sharing this one, I didn't know it.

Yeah, sure, a public one for PDFs that aren't privacy-critical, plus something like a Heroku button to build your own secure app (with auth and no storage).

See e.g. my file sharing app https://github.com/andreif/SecretFile

akouri9y ago

I bet most would, but privacy would be a big concern for me at least. A script is the optimal format for me.

Privacy is the reason I'd prefer to do this in house.

2_listerine_pls9y ago

check docparser.com

jlinkOP9y ago

Interesting service, which didn't exist yet back in 2015 when I wrote my class.

nemild9y ago· 10 in thread

For those interested in converting PDF tables into CSV, there's also Tabula ( http://tabula.technology/ )

(Used by many journalists to analyze the data in PDFs)

scrollaway9y ago

I find it absolutely ridiculous that we have to resort to these kinds of tools :/

We have digital formats, and we decided to standardize document distribution on the one that makes it as hard to extract data as if it were on physical paper.

PDF is a perfectly fine and rich digital format. It also allows you to do proper copy and paste, which is much saner than anything paper offers.

Sure, PDF is light on context clues for automation and is targeted purely at humans. But formats targeted at both computers and humans consistently fail (XML with accompanying XSLT comes to mind), and/or only have terrible tools for creating files (easily parsable, pretty HTML).

Either there is very little real demand or we consistently fail at making alternatives viable.

vog9y ago

That's Adobe. Look at their other formats, and PDF seems to be one of their better ones. Compare to SWF, PSD, AI and so on.

PDF is the successor of PostScript. PostScript is a stack-based programming language where anything can happen, while PDF enforces some document structure and metadata structure on top of it, so you can at least determine, e.g., where page breaks are without having to interpret ("run the code of") the whole document.

Still, PDF is simpler than PostScript in the same sense that XML is a simplification of SGML. Jumping from PDF to a well-designed format would be like jumping from XML to JSON or S-Expr.

It's because PDFs have no concept of lines or paragraphs. It's just characters at x,y coordinates which happen to line up. So figuring out what's a line or a column is a pain in the ass.

That's most likely why copying and pasting sucks too.
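To make the "characters at x,y coordinates" point concrete, here is a rough sketch of how positioned glyphs can be turned back into lines of text on a character grid. It is not the PDFLayoutTextStripper algorithm; the `Glyph` type, the fixed `char_width` and the `line_tolerance` are assumptions chosen for the illustration, and real PDFs with proportional fonts need more careful heuristics.

```python
from typing import NamedTuple


class Glyph(NamedTuple):
    char: str
    x: float  # horizontal position in points
    y: float  # baseline position in points (assumed to grow upward, as in PDF)


def layout_text(glyphs, char_width=6.0, line_tolerance=2.0):
    """Group glyphs into lines by baseline, then place them on a character grid."""
    rows = []  # list of (baseline_y, [glyphs]) from top of page to bottom
    for g in sorted(glyphs, key=lambda g: (-g.y, g.x)):
        if rows and abs(rows[-1][0] - g.y) <= line_tolerance:
            rows[-1][1].append(g)      # close enough to the current baseline
        else:
            rows.append((g.y, [g]))    # start a new line

    lines = []
    for _, row in rows:
        cells = {}
        for g in sorted(row, key=lambda g: g.x):
            col = round(g.x / char_width)
            while col in cells:        # avoid overwriting a neighbouring glyph
                col += 1
            cells[col] = g.char
        width = max(cells) + 1
        lines.append("".join(cells.get(i, " ") for i in range(width)))
    return "\n".join(lines)


sample = [
    Glyph("H", 0, 100), Glyph("i", 6, 100), Glyph("x", 60, 100),
    Glyph("o", 0, 88), Glyph("k", 6, 88),
]
print(layout_text(sample))
```

The gap between "Hi" and "x" survives as spaces because the column index comes from the x coordinate, not from the order in which the glyphs appear in the file.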

acbart9y ago

I had to use Tabula to extract a decade of SAT scores from PDFs for each state/year. It was a nightmare, but I managed it. More recently, I was hoping to do something similar with decennial census data, but it was just too much. Far, far too many groups publish data to PDF, which is about as bad as if they just deleted it straight-out. It's very upsetting.

krick9y ago

PDF is fucked up beyond all doubt. But there seems to be no better (even if unpopular) alternative.

How do you imagine a better alternative to PDF? On the one hand, we have text-based formats. They are not a serialization of the exact rendering. On the other hand, we have PostScript, which is probably too complex to be manipulated as text when rendered. PDF and DjVu kinda do both, even if quite imperfectly.

So how do we construct a file format, which can render a symbol (not necessarily a unicode one) anywhere, pixel-perfect, but still has concept of words, paragraphs, preferably tables and such?

krakaukiosk9y ago

Tabula is a great tool. In my experience it's the most reliable open source software for extracting tables from PDFs. We are using their underlying Tabula-Java library for some parts of https://docparser.com and are happily sponsoring their project.

jlinkOP9y ago

I didn't know about Tabula, so I gave it a try right away. Apparently it only extracts tables and ignores everything around them. This might be good in some cases, but it is a problem if you want to extract a form, a whole textbook, your bank statements or anything else. Also, I noticed that Tabula has some slight trouble when column borders are not drawn in the table. But overall it is a good tool for extracting only tables, that's true.

eumm9y ago

Tabula is a nice free tool but requires a technical background to run. There is also the free https://pdf.co, with both online and offline (Windows) tools for PDF to CSV. (disclaimer: I work on it)

Hi there. We try to make a tool that's as simple to use as possible (given the constraints of a volunteer-run project such as Tabula). What technical background do you think is required to use it? (disclaimer: I'm the main author of Tabula)

tyingq9y ago· 7 in thread

Curious if this works better than the pdftotext utility that comes in the Debian poppler-utils package.

That has a -layout option that works really well sometimes and really terribly other times. Doesn't seem to be related to document complexity either.
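For anyone who wants to try it from a script, a small wrapper might look like the sketch below. It assumes `pdftotext` (from poppler-utils or xpdf) is installed and on the PATH; the flag is spelled with a single dash, `-layout`. The file names are placeholders.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for pdftotext; -layout keeps the physical layout."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")
    cmd += [pdf_path, txt_path]
    return cmd


def convert(pdf_path, txt_path, keep_layout=True):
    """Run pdftotext, failing loudly if the tool is missing or exits non-zero."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    subprocess.run(pdftotext_command(pdf_path, txt_path, keep_layout), check=True)


# Placeholder file names, for illustration only.
print(" ".join(pdftotext_command("statement.pdf", "statement.txt")))
```

Running the same file with and without `keep_layout` is a quick way to see whether the layout mode helps or hurts for a given document.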

vram229y ago

I used the xpdf [1] package, a C library and a set of CLI tools (mentioned by others in this thread too; the pdftotext command-line utility and the xpdf/pdftotext library are part of it), in a consulting project for a client some years ago. (The client had asked me to evaluate some libraries for PDF text extraction and recommend one. I chose xpdf, then consulted for them on their product, using xpdf for part of the work. I also did some post-processing of the extracted text in Python. Interesting project, overall.)

As part of this work, I corresponded for a while with one of the key technical people at Glyph & Cog, the company behind xpdf. I learned from him about some of the issues with text extraction from PDF, a key point being that in some or many cases the extraction can be imperfect or incomplete, due to factors inherent in the PDF format itself and its differences from plain text. PDFTextStream (for Java) is another one I had heard of, from someone I know personally, who said it was quite good. But those inherent issues of text extraction do exist.

So wherever possible, a good option is to go to the source from which the PDF was originally generated, instead of trying to reverse-engineer it, and get the text you want from there. Not always possible, of course, but a preferred approach, particularly for cases where maximum accuracy of text extraction is desired.

[1] Not to be confused with xtopdf, my PDF toolkit for PDF generation from other formats.

jlinkOP9y ago

During development I compared my results with those of the pdftotext utility and obtained more or less similar results. The objective of my code was to have an equivalent tool that is easily embeddable in any Java/Android project, and to learn more about Apache PDFBox.

tyingq9y ago

I imagine it's not an easy task guessing about proportionally spaced fonts, overlapping bounding boxes, columns, tables, wrapping, and so forth.

It probably works reasonably well with the documents it has been tested with. It's a very hard problem to crack if you ask me. (edit: word choice)

dmoo9y ago

Also available for Windows and Mac at http://www.foolabs.com/xpdf/download.html

krylon9y ago

Last year, my boss gave me a task that looked simple enough at first glance - get data on how many vacation days each employee has in total, how many they have used in the current year, and how many they have left, and put that data in our SharePoint server (so people can see when filling out a vacation request if they actually have enough days left).

Most of that was fairly easy, except that the POS program that sits on the actual data only allows exporting data in one single format - PDF. Converting that PDF file to a CSV that I can feed into SharePoint was one of the nastiest things I did last year. I did manage to get it to work though, by toying around with pdftotext for a while and exploring its command line parameters.

It was a pleasure to use! It took me a while to discover the correct set of command line parameters I needed, but I got it to work! Thanks, xpdf!
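The last step of such a pipeline, turning layout-preserved text into CSV, can be sketched as follows. This assumes columns are separated by at least two spaces while values contain at most single spaces; the employee row is invented sample data, not output from any real export.

```python
import csv
import io
import re


def layout_lines_to_csv(text):
    """Split each non-empty line on runs of 2+ whitespace characters and emit CSV."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between rows
        writer.writerow(re.split(r"\s{2,}", line.strip()))
    return out.getvalue()


# Invented sample in the shape pdftotext -layout tends to produce.
sample_text = (
    "Name          Total  Used  Left\n"
    "\n"
    "Jane Doe         30    12    18\n"
)
print(layout_lines_to_csv(sample_text))
```

The approach breaks down when a cell is empty, because the gap then merges with its neighbours; fixed column offsets are more robust in that case.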

pdftotext from xpdf (http://www.foolabs.com/xpdf/download.html) also has the -table option which usually works better than -layout. Unfortunately the poppler-utils fork doesn't have this option.

rsync9y ago· 5 in thread

This is important for (al)pine users ... when reading email in a terminal it is very useful to be able to open a PDF attachment as text and view it in the (terminal) mailtool ...

Yes, (al)pine is my mailtool in 2017.

Also mutt, which I've switched back to recently. I've got a little Atom-powered Chromebook converted to Linux that just does not like modern heavy webmail clients (even Gmail when it was still running ChromeOS, and this is one still on the market, Acer CB3-131), so a combination of mutt, mbsync, and msmtp is a much nicer combo. Mutt is a terrific mail reader but its internal SMTP and IMAP handling can be a bit iffy, hence mbsync and msmtp.

Though I can generally open attachments just fine, this text rendering of PDFs would be useful for when I'm SSH'd into my home machine and reading stuff remotely (usually from work, where I don't want to download my personal email).

JetSpiegel9y ago

Mutt + mbsync + msmtp is my setup too. I'm using Neomutt, since that's being actively maintained by a sizeable community of friendly people.

shakna9y ago

I use alpine in 2017.

It's easy to use, pluggable, and faster than any GUI I've touched.

As do I, because it's faster than browser-based email and many GUI clients (like Thunderbird). I also like the fact that I can just copy my .pinerc file across to a new computer and my mail client is set up.

I had not considered the PDF issue. I just open them with an external application. The potential of reading them within Alpine hadn't occurred to me, but now it has, I want it!

zatkin9y ago

No need to feel ashamed. I set up my own email server in 2016 and use mutt, squirrelmail, and iOS Mail very frequently.

WalterGR9y ago· 4 in thread

Fairly frequently, OCR engines are posted here. But almost without exception, they lack layout analysis, which renders them largely useless.

Is this something that could be combined with those OCR engines? (e.g. TesseractOCR...)

RandomBookmarks9y ago

I would not call these services useless ;) - but I wonder the same... Some APIs like https://ocr.space return the coordinates of each converted word. Can that be used as input? (I have not tried it yet)

PretzelFisch9y ago

Ephesoft seems to use this for classifying documents and extracting data from them.

2_listerine_pls9y ago

some services allow you to set the layout manually: Docparser

eumm9y ago

The PDF.co offline tool (for Windows) supports OCR and partial OCR for PDF to text and PDF to CSV with layout preserved. (disclaimer: I work on it)

marak8309y ago· 2 in thread

Ahh, this will be useful for my kitchen receipts. Thanks. Now I just need to roll that with an auto translator too. (I guess I have my day off project now :-) )

jlinkOP9y ago

Happy to know it could help you. Good cooking to you!

Both I and my accountant thank you haha.

Animats9y ago· 2 in thread

Is there a PDF to HTML converter which can consistently get line breaks right?

jlinkOP9y ago

Could be a nice feature, but it's not an easy task. I'll give it a try, though.

ganwar9y ago

Please update us/me when you do. I'm also working on the same problem, would love to chat.

curiousgal9y ago· 2 in thread

Although I haven't tested this yet, these utilities tend to fail when fed a table with empty cells.

gpvos9y ago

The first example image in the linked article shows a conversion from a table with some empty cells. It looks fine.

curiousgal9y ago

Those are at the end. I meant empty cells in the middle. The ones I tried don't account for them.
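The failure mode with empty middle cells is easy to reproduce, and one common workaround is to cut rows at fixed character offsets instead of splitting on whitespace. The sketch below assumes the text has already been laid out on a character grid and that values are left-aligned under single-word headers; the table contents are invented.

```python
import re


def column_spans(header):
    """Derive (start, end) character spans from where each header word begins."""
    starts = [m.start() for m in re.finditer(r"\S+", header)]
    ends = starts[1:] + [None]   # the last column runs to the end of the line
    return list(zip(starts, ends))


def split_fixed(line, spans):
    """Cut a row at the header's column offsets, keeping empty cells as ''."""
    return [line[start:end].strip() for start, end in spans]


# Invented table: the Qty cell of the data row is empty.
header = "Item".ljust(10) + "Qty".ljust(6) + "Price"
row = "Widget".ljust(10) + "".ljust(6) + "9.99"

print(row.split())                            # whitespace split loses the empty cell
print(split_fixed(row, column_spans(header)))  # offsets keep it as an empty string
```

Right-aligned numbers or multi-word headers would need the spans to be detected differently, for example from the gaps shared by all rows.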

agumonkey9y ago

Fun, I did the same thing as a Clojure REPL exploration to pipe PDF text to a bare Swing GUI (I know, a little absurd in a way).

The déjà vu made me squint for a minute.

ps: pdfbox is nice

robinhowlett9y ago

Nice. I recently got very familiar with PDFBox and parsing complex layouts - it is a great library.

But does it keep both the layout and the SHA-1 hash? Not sure it's HN worthy otherwise.
