Show HN: Open Paperless – Scan, index, and archive paper documents

(github.com)

285 points · zhoubear · 8y ago · 96 comments

jopsen · 8y ago · 16 in thread

Question: why bother organizing papers?

I just throw everything in a box. If I ever need it again later, it'll take a long time to find, but I rarely need to find a document again.

Complexity of archiving a document is O(1) with a very small constant. Complexity of retrieval is O(N) for a large N.

But I have few retrievals in my system, so why pay a higher per document cost?

tombrossman · 8y ago

> Question: why bother organizing papers?

Because being organised makes you more effective. With your 'throw it all in a box' system, you have a high barrier to finding documents in the future and this discourages you from doing so. However, with a more organised approach you are more likely to retrieve specific documents.

One example: Some mid-priced electronic device breaks a few months after you buy it. You might weigh digging through all the paperwork versus shrugging your shoulders and throwing it away. I would go straight to the warranty document and also look at my credit card issuer's warranty/returns policies (if any), and I would return the item for a replacement or refund. No biggie, only a few minutes' work, and I as a consumer prevail in exercising my rights.

Sounds boring but I believe it is definitely worth making the effort.

ekianjo · 8y ago

> Because being organised makes you more effective

I agree with the sentiment, but I tend to agree with the previous poster. The value of paper documents tends to be really low in the long run. I think you can keep them maybe for a few years when it comes to bills, but anything longer than that and there's not really a lot of value in going back to what you purchased/did or even where you traveled. I am also fairly organized but I tend to see it as kind of futile, since I don't really need to go back and search for stuff that often.

jcelerier · 8y ago

> With your 'throw it all in a box' system, you have a high barrier to finding documents in the future and this discourages you from doing so.

In fifteen years of keeping my mail, I have maybe once or twice had to go back more than a month or two.

purerandomness · 8y ago

Then that system obviously doesn't solve one of your pains.

I have to dig out older documents almost daily.

pymai · 8y ago

- if you need to retrieve things often

- if you need more than one box. will you still have a single box 10 years from now or will it be 2? good luck trying to retrieve something in 20 years when you have even more boxes

- it only takes 1 fire or flood to destroy all of your documents. once they are digital you can easily make copies and store them in a few different locations

tbh making digital copies isn't that complicated compared to throwing something in a box. i just scan everything into a year/month folder and do as you do... worry about finding them later. spending time tagging or naming stuff after they've been digitised is optional.
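A minimal sketch of that year/month filing, in case it helps. The function name `archive_scan` and the `YYYY/MM` layout are assumptions for illustration, not something from the comment above or from Open Paperless:

```python
from datetime import date
from pathlib import Path
import shutil


def archive_scan(scan: Path, archive_root: Path, scanned_on: date) -> Path:
    """Move a scan into archive_root/YYYY/MM/ and return its new path."""
    # Hypothetical layout: one folder per year, one sub-folder per month.
    target_dir = archive_root / f"{scanned_on:%Y}" / f"{scanned_on:%m}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / scan.name
    if target.exists():
        # Refuse to overwrite an earlier scan that has the same file name.
        raise FileExistsError(target)
    shutil.move(str(scan), str(target))
    return target
```

Calling `archive_scan(Path("scan0001.pdf"), Path("archive"), date.today())` would file the scan under the current year and month; tagging or renaming can still happen later, or never.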

phreeza · 8y ago

I do basically this but I store everything chronologically in big binders. Insertion is still O(1) and finding things is O(log n) if I know the date the document was generated. I keep large amounts of documents I never retrieve, and plan to do some culling in the future, but not having to agonize over whether something is worth keeping is worth it to me.
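The O(log n) lookup is just binary search over the dates. A small sketch, assuming the binder is modelled as a list of dates already in chronological order (the name `find_by_date` is made up for illustration):

```python
from bisect import bisect_left
from datetime import date
from typing import List, Optional


def find_by_date(binder: List[date], wanted: date) -> Optional[int]:
    """Return the position of the first page dated `wanted`, or None."""
    # bisect_left halves the search range each step, hence O(log n).
    i = bisect_left(binder, wanted)
    if i < len(binder) and binder[i] == wanted:
        return i
    return None
```

Appending a new page at the end keeps the list sorted as long as documents arrive in date order, which is what makes insertion O(1).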

mattlondon · 8y ago

You'd be surprised how useful having all of your documents digitally available can be.

I've been doing this for years already (albeit using Google Drive - yes, some people don't trust Google - I understand that, but it doesn't bother me) and these are some of the common use-cases where I find it really useful:

* Tax returns. This alone makes it worthwhile.

* Call-centers. You'll often have the reference/account number/etc you need quicker than the operator at the other end can look them up.

* Travelling. Pulling up travel insurance etc details from your phone when you're stranded in an airport is so useful.

* The inevitable "we need to see a copy of your birth certificate/driving license/passport/last 3 months of bank statements/bill from a utility to validate your identity" requests. Either email the PDF directly or print out and go - no box rummaging.

* Health Records. Looking up the name of those pills you were prescribed 18 months ago, or when you had vaccinations etc.

* Those moments when you're convinced you've already had your car serviced/paid that bill/renewed the warranty on your washing machine/dealt with something but can't quite remember. Tap-tap yeah there is the confirmation from the appliance manufacturer saying I am still covered by warranty - emailed it to the support people. Ball is now in their court. Job done - what's next on my to-do list?

It comes up surprisingly often. Sure, not every single day, but certainly enough to justify the very very very small time investment of scanning it and uploading the PDFs to Google Drive (I am sure there are alternatives that work just as well). When you retrieve something, the time-savings totally make it worthwhile.

As someone else has said elsewhere, often when you need these documents it is a stressful situation - a death, an illness, an accident, something "wrong" going on financially etc. You don't want to waste time frantically digging through dusty boxes of paperwork trying to find something when you can just tap a few keys and get it in seconds with zero hassle.

I highly recommend it.

I "bootstrapped" my archive by heading into the office at the weekend and using the huge printers to scan in boxes full of old paper documents to PDFs. Now I top-up with a home scanner attached to my LAN - if you're looking to buy a scanner for home, make sure you get one with an automatic document feeder that can do both sides at once so you can just chuck the papers in and hit go, then collect the PDFs from your network drive.

ValentineC · 8y ago

> Now I top-up with a home scanner attached to my LAN - if you're looking to buy a scanner for home, make sure you get one with an automatic document feeder that can do both sides at once so you can just chuck the papers in and hit go, then collect the PDFs from your network drive.

Any suggestions for one? I have a Canon P-208, but it's close to useless for batch scanning.

slantyyz · 8y ago

I'm partial to Fujitsu ScanSnap document scanners. It's worth paying the extra money for a Fujitsu, imo.

The only catch is that you are supposed to replace the pick roller and pad assembly every year or so.

I would recommend buying a couple of extras in advance, as they will get harder to find and more expensive as time goes on.

Having said that, I didn't replace my pick roller/pad assembly until after 9 years of operation. After a few years it would have trouble feeding multipage documents, but most of the stuff I was scanning was only one or two pages. When I did eventually replace those parts, the scanner was basically as good as new.

mattlondon · 8y ago

I have an HP OfficeJet one which I guess is kinda "prosumer" (can't remember the number - 8600 or something). The ADF works, but is single-sided so that is why I am urging others to learn from my mistake :-)

I know that the cheapish Canon Maxify printers do duplex ADF (the Canon Maxify MB5150 is what I plan to buy next time my current HP one runs out of ink since the ink costs almost as much as the printer!), but I can't recommend it since I've not used it yet.

HTH.

nh2 · 8y ago

HP OfficeJet Pro 8620. It's an all-in-one, but interestingly the best scanner for such purposes I've found so far (I'd love to find a better one).

Scans both sides, up to 50 pages in a batch. Open-source drivers, works well with Linux. Ethernet. Can drop scans onto your SMB share.

pjc50 · 8y ago

If you have zero retrievals, you can just bin everything. As you've said, it depends how often you want to retrieve something and whether you ever need to retrieve a number of things in a hurry.

sschueller · 8y ago

I like this idea. I only need to keep 10 years and almost never need to retrieve anything. So if I use 10 boxes I can just throw away the oldest.

paulie_a · 8y ago

Would it cause any real issues if you were unable to retrieve something? If the answer is no, toss the other 9 boxes.

matt_the_bass · 8y ago

I used to organize every bill in a file cabinet. Then a few years ago I switched to "throw everything for a quarter in a single folder." Now I'm about to switch to "throw everything for the year in a single folder" (with a few exceptions for major expenses like house, car etc). This allows me to reduce the possible search space a lot with very little effort. Over the past few years I've found myself looking for one or two things from the past year. Total time spent was 10 minutes. This was way less time than filing everything.

dhruvkar · 8y ago

I agree with the sentiment that most papers have very little value in the long run.

So I have 4 'bins' to categorize physical papers - scan & keep, scan & shred, shred, throw.

This system usually eliminates a lot of papers that I would otherwise mindlessly/OCD'ly scan.

gravypod · 8y ago · 10 in thread

Will this automatically center and apply perspective transforms to pictures taken with phone cameras?

bvrlt · 8y ago

You should check Genius Scan (for iOS and Android) that does that for you automatically, and much more: https://www.thegrizzlylabs.com/genius-scan/

It also doesn't tie you to a specific ecosystem.

Disclaimer: I'm one of the authors of Genius Scan :)

gravypod · 8y ago

There are many solutions for systems like this. Unfortunately none are open source.

Most documents I do want to keep record of are sensitive. I don't trust a closed-source app with that kind of information.

towndrunk · 8y ago

Microsoft's Office Lens for iOS is pretty nice at doing this automagically.

billbrown · 8y ago

The best version of this that I've seen is Scanner Pro by Readdle. I had to scan three months' worth of food receipts for an insurance claim and this feature was a lifesaver.

bketelsen · 8y ago

seriously, thank you for mentioning this. Great app!

zhoubear (OP) · 8y ago

Not at the moment. I'm guessing this information is available in the EXIF properties?

joshvm · 8y ago

It's a little bit more tricky than that. What the EXIF might tell you is the camera calibration parameters like focal length, distortion, perspective center, etc. That can be used to fix systematic errors in images like pincushion/barrel distortion.

To unwarp photos that were taken at odd angles you need to do some image processing. The mathematics aren't particularly difficult, it's a homography transform in most cases (rectangles). The problem is robustly detecting the page.

Dropbox has some nice write-ups on this: https://blogs.dropbox.com/tech/2016/08/fast-document-rectifi...
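To make the "mathematics aren't particularly difficult" part concrete, here is a rough pure-Python sketch of the homography step only. It assumes the four page corners have already been found (the hard part, as noted above) and fixes the bottom-right matrix entry to 1. The function names are made up for illustration; this is not how Genius Scan, Office Lens or Dropbox implement it:

```python
def solve_linear(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate corner configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def homography(src, dst):
    """Return the 3x3 matrix mapping four src corners onto four dst corners."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # Two linear equations per correspondence, eight unknowns in total.
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def warp_point(h, point):
    """Apply the homography to one (x, y) point."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)
```

With the detected corners as `src` and the corners of an upright page as `dst`, every output pixel can be filled by mapping it through the inverse transform and sampling the photo. A real pipeline would more likely lean on an image-processing library for both the corner detection and the resampling.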

zhoubear (OP) · 8y ago

Thanks for the link. That blew my mind! I wish it could be added; my phone would replace my scanner instantly.

voltagex_ · 8y ago

The most complicated implementation I've seen has been Office Lens, which also corrects distortion and does edge detection.

gravypod · 8y ago

No. It requires OpenCV and image processing.

pw0nka · 8y ago · 9 in thread

Looks great. Love the idea behind it, but...

There is at least one country (mine - Switzerland) which is not able to use software like yours. The problems are the current laws that force people and organizations to store physical copies of the documents (for several years). Electronic documents have no value in front of the law, which is why we have no choice but to do all of that offline, manually.

I've tried many archiving solutions, but none of them saved any time. The one missing feature was a way to automatically print a serial code (the electronic document ID) back on the original document. This way you could just scan it, print it, put it in a large box where you sort it by its ID - that simple. And this would even work if you used spacers to split the documents during the scanning process.

anotheryou · 8y ago

There is a proprietary solution for it, sadly overpriced and cloud-based. But the principle could be implemented in any software:

Shoe Box + QR: you fill linearly and always snap the QR with the scan. It can then tell you roughly at which height in the stack and in which shoe box something is.

edit: here: https://box.fileee.com/

pw0nka · 8y ago

Thank you. This brings up another problem we have with our laws: Corporations are not allowed to store such documents on servers which are outside of our country. And that's usually the case with cloud services, for obvious reasons.

kuschku · 8y ago

That’s not a problem with your laws, but more a problem with SaaS. Data should always stay as local as possible, ideally even within your own organization.

jopsen · 8y ago

I have always archived my documents by throwing them in an unsorted black box.

If someone really needs me to retrieve an old document, it'll take forever to find, but why would I want to pay sorting costs upfront?

copperx · 8y ago

Because you often need documents on stressful occasions such as after the death of a family member, after an accident, after losing your job, after an IRS audit. You really want to be going over n documents with the possibility of missing one or more during such times?

outsideoflife · 8y ago

Because some businesses generate very large volumes of documents that have to be stored.

ar0 · 8y ago

Can you elaborate on the laws you mention a bit? I understand that there are requirements for organizations (GebüV and VAT-related laws), but for people? Which documents do you need to store physically as a private individual in Switzerland?

pw0nka · 8y ago

I'm not an attorney, so I can't give you the exact references to laws. Basically you are not obligated to store those documents as a private person, as long as you don't need to show them to a judge during a court case. That's why I'm a bit on the paranoid side, because you never know when you might need a specific document to prove something. In my case I do store all the documents which were signed (e.g. contracts) or are government related.

tehabe · 8y ago

You could store them chronologically on paper and do the actual sorting on your computer or server.

curioussavage · 8y ago · 7 in thread

Any good open source desktop software with linux support to do this? I don't see why I would personally want a web app for this.

joelhaasnoot · 8y ago

It's a little clunky but here's the one I found best that just worked on Ubuntu: http://gscan2pdf.sourceforge.net/ . It can combine some of the best tools for OCR/cleanup/etc.

My main gripe is that I have a document feeder, and manually selecting pages with shift to combine into a single document and clicking "Save as" is far too much of a hassle. There needs to be a better flow for that.

coaxial · 8y ago

I wrote a collection of bash scripts for that. https://github.com/coaxial/insaned-config

It was initially to use with insaned, but I later came up with a script to tie it all together (scan.sh) because it's faster than jamming the scan button waiting for insaned to register. And with the script, I can queue commands provided I'm fast enough to swap the physical pages in the flatbed scanner.

It also uses the excellent textcleaner imagemagick script to clean up the scans and make them more ocr friendly.

The readme isn't totally up to date, parallel isn't required anymore, and there is no mention of the scan.sh script. But when you run it, it prompts for commands. You might need to edit the scripts to set your own output directories and textcleaner location.

jackvalentine · 8y ago

I haven't tried this yet, but - https://openpaper.work/

Edit: tried it, it's crap.

jk2323 · 8y ago

May I ask why? Installation is a bit cumbersome but it seems to be an outstanding program to me. I have been looking very long for something like this.

I have not tried yet how it reacts to huge amounts of data. But best thing: NOT written in Java!

dm319 · 8y ago

Maybe just put the scanned pdfs into a hierarchical folder system, then keep a text file at the root with comma or tab-separated location, ISO date and keywords.

Then your documents are a grep away. Maybe awk to find documents from a date range?

Maybe someone clever could automate this with the OCR output...
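For anyone who prefers a script to grep/awk, a rough sketch of the same index lookup. The three tab-separated columns (location, ISO date, space-separated keywords), the sample rows and the name `search_index` are all assumptions for illustration:

```python
import csv
from datetime import date


def search_index(lines, keyword=None, start=None, end=None):
    """Return index rows matching an optional keyword and date range."""
    hits = []
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        location, iso_date, keywords = row
        when = date.fromisoformat(iso_date)
        if start and when < start:
            continue
        if end and when > end:
            continue
        if keyword and keyword.lower() not in keywords.lower().split():
            continue
        hits.append((location, when, keywords.split()))
    return hits


# Hypothetical index contents, one document per line.
INDEX = """\
2016/05/isp-bill.pdf\t2016-05-09\tisp bill internet
2017/01/car-service.pdf\t2017-01-20\tcar service receipt
2017/12/isp-bill.pdf\t2017-12-18\tisp bill internet
"""
```

`search_index(INDEX.splitlines(), keyword="isp", start=date(2017, 1, 1))` keeps only the December 2017 bill. Because ISO dates sort as plain text, the grep/awk route works on the same file without any date parsing.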

arca_vorago · 8y ago

There are; I just can't think of them at the moment. I know because I set up a book scanner with a Linux box. If I remember right, the scan/OCR/archive tools are all separate, so you would have to script them together.

JustSomeNobody · 8y ago

Well, if you have a home server, having a web app works quite well. But if you don't, then a desktop app would probably be better.

Spearchucker · 8y ago · 4 in thread

This is arguably a lot more than I need. I'm a hoarder in that I have every email I've ever sent or received (bar junkmail), and every piece of paper I've ever received.

Most of my paper is now scanned - I think I have two boxes left in my garden shed. I don't bother with OCR because search doesn't help me when I don't know what to search for (e.g. invoice for a jumper I bought in 2010 - fashion labels rarely call their jumpers jumper).

And so I rely on meta data. There's not much out there in terms of open-source tagging software, and even less in terms of an open tagging approach. I ended up with tagspaces, which is a web app packaged up as a native app. The approach to tagging is good (tags appended to file name), but the app is abysmally poor. Slow - waiting up to 30 seconds for a pop-up menu to appear. It assumes tag-based searches work in only one way.
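The tags-in-the-file-name idea is easy to script without any particular app. A small sketch, assuming the tags sit in square brackets at the end of the name, separated by spaces; that exact format and the name `split_tags` are assumptions for illustration:

```python
import re
from pathlib import PurePath

# Hypothetical convention: "title[tag1 tag2].ext"
TAG_SUFFIX = re.compile(r"^(?P<title>.*?)\s*\[(?P<tags>[^\]]*)\]$")


def split_tags(filename):
    """Split a file name into (title, list of tags)."""
    stem = PurePath(filename).stem  # drop the extension
    match = TAG_SUFFIX.match(stem)
    if not match:
        return stem, []
    return match.group("title"), match.group("tags").split()
```

Since the tags live in the name itself, they survive copying, backups and a change of tool, and an ordinary file search can find them.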

The intent is to write some native apps to solve my biggest problems. For now I'm still trying to clear the backlog of un-scanned paper docs (not going to get this done for me, because privacy). I tag important stuff, like employment contracts, mortgage agreements, passports and birth certificates...

Hope to have everything done by the time I cash in my chips. Might make for a useful dataset for someone somewhere some day.

ishi · 8y ago

A few years ago I was involved with a startup that built a document management system for consumers, and we actually got pretty good results with OCR + automatic tagging based on a very simple database that maps keywords to tags.

Let's say you want to auto-tag bills and other documents from your ISP. So you add the ISP's name, phone number, website address etc. into the database - any uniquely-identifying keywords that typically appear on the documents that they send. Now any document that contains these keywords will get tagged as "ISP", making it very easy to find in the future.

Even if the OCR quality isn't perfect, at least one of these keywords will most likely get matched.

Another example - you could add the names of your family members as keywords, making it easy to find all documents related to Jenny or Susan.

You could argue that full-text search would achieve the same result, but uploading documents into the system and having them auto-tagged as "ISP", "car-payments", "Walmart", "Susan" and so on feels a little bit like magic, as if the system is actively helping you organize your papers.

The keyword approach is also very easy to understand and tweak, unlike more rigorous but opaque methods of document clustering (such as tf-idf).
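A minimal sketch of the keyword-to-tag lookup described above. The table entries, the sample text and the name `auto_tag` are invented for illustration and are not the startup's actual code:

```python
# Hypothetical keyword database: lower-case keyword -> tag.
KEYWORD_TAGS = {
    "example isp": "ISP",
    "555-0100": "ISP",
    "www.example-isp.test": "ISP",
    "susan": "Susan",
    "jenny": "Jenny",
}


def auto_tag(ocr_text, keyword_tags=KEYWORD_TAGS):
    """Return the set of tags whose keywords appear in the OCR text."""
    text = ocr_text.lower()
    # Plain substring matching: one surviving keyword is enough for a tag.
    return {tag for keyword, tag in keyword_tags.items() if keyword in text}
```

If OCR mangles the ISP's name ("Examp1e ISP") but reads the phone number correctly, the document still gets the "ISP" tag, which is the robustness argument made above. Substring matching is deliberately crude; "susan" would also match "Susanna".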

myaso · 8y ago

Out of curiosity, what is the state of the art today for extracting text or other data from scanned documents (forms, legal docs, receipts, etc.)?

matt_the_bass · 8y ago

I don't have an exact answer but can tell you that Expensify still resorts to human parsing sometimes. How often "sometimes" is, I have no idea. I would guess a lot.

Spearchucker · 8y ago

Everything you say is true, and the value, I think, is clear. The part I don't like is that I have to create a database manually. Granted, the results will save me time as I don't have to manually tag the routine.

Food for thought.

prashnts · 8y ago · 4 in thread

I've been using iOS and Mac's native notes app to do that. In my opinion what these solutions lack is an integration between both note-taking (I sometimes like to write a few sentences relevant to a document, and I'd like to have it shown right next to it) while also letting you have the individual documents available in PDF or whatever if you need. Notes app does it perfectly now after iOS 11.1 and High Sierra.

An example is this screenshot from my notes https://imgur.com/a/xuZqW

lucaspiller · 8y ago

As long as you don't mind being tied into the Apple ecosystem (and trust they won't lose your data) it's a good solution. Notes used to be backed by IMAP, but the new 'rich notes' since iOS 9/OS X 10.11 are backed by iCloud.

mark_l_watson · 8y ago

I used to use Notes but stopped after trying to back up my notes. For me, exporting one note at a time to pdf is not good enough, and finding the opaque binary file in ~/Library does not help because it is not a standard file format.

I switched to using Notes in Fastmail.

davidrupp · 8y ago

I've been using http://writeapp.net/notesexporter/ for Mac's Notes app; happy with it so far.

zhoubear (OP) · 8y ago

Right now you can enter comments for a document that's already uploaded. A few UI tweaks could be added to show the latest comment under the document preview.

Exporting to PDF sounds like a really good idea.

EGreg · 8y ago · 4 in thread

I have a question

Is there a service anyone knows about which will print your email and send it with tracking of receipt or signature, so you can prove what was physically sent?

Or you mail it to them and they open your mail, scan it and forward it on with signature required, with your address as the return address?

Because right now you can only prove that the ENVELOPE was received, not what was in it.

craftyguy · 8y ago

you can just ask your recipient to send you some uuid on the letter after they receive it. i have no idea what problem you are trying to solve though.

y4mi · 8y ago

i once heard of a case where the employer sent an unrelated document to an employee.

later on, the payment stopped and the employer claimed that they fired him at that time.

I don't recall how that case ultimately turned out, but maybe something like that? would be incredibly rare though and of dubious worth for mostly anyone

craftyguy · 8y ago

I believe there are courier services that help guarantee document delivery to the correct person, i.e. for sending/serving court summons.

chrisweekly · 8y ago

there are lots of virtual mailbox service offerings out there, to open and scan mail for you. what you describe is right up their alley.

theomega · 8y ago · 3 in thread

I want to show an alternative approach to managing your documents:

Store them in your IMAP/Mails. Either on an own account or in a dedicated sub-folder.

I wrote some small Python scripts [1] which allow you to:

- Add an email with the PDF attached to your document collection. The script supports adding a subject and adding tags to it.
- Go over all the emails and run OCR (tesseract) on them, attaching the OCR result together with the PDF to the email.

Big advantages:

- Search on IMAP is a solved problem
- Clients for every operating system in the world, including web and mobile
- Super simple backup and restore

Of course, very geeky, nothing for your parents, but maybe something for you?

[1]: https://github.com/theomega/IMAP_DMS
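To give a feel for the approach, here is a rough standard-library sketch of building such a document email. It is not taken from the IMAP_DMS scripts; the function name, the placeholder address and the "tags in square brackets in the subject" convention are assumptions for illustration:

```python
from email.message import EmailMessage


def build_document_mail(pdf_bytes, filename, subject, tags, ocr_text=""):
    """Wrap a scanned PDF in an email: tags in the subject, OCR text as body."""
    msg = EmailMessage()
    # Hypothetical convention: tags go in square brackets after the subject.
    msg["Subject"] = f"{subject} [{' '.join(tags)}]" if tags else subject
    msg["From"] = "archive@example.invalid"  # placeholder address
    msg["To"] = "archive@example.invalid"
    msg.set_content(ocr_text)  # the plain-text body is what IMAP search sees
    msg.add_attachment(pdf_bytes, maintype="application",
                       subtype="pdf", filename=filename)
    return msg
```

The resulting message could then be stored with the standard library's `imaplib`, for example `imaplib.IMAP4_SSL(host).append(mailbox, None, None, msg.as_bytes())` after logging in, with the host and mailbox being whatever your provider uses.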

mdaniel · 8y ago

Just ensure you pick a "good" encoding scheme, because base64-ing PDFs will balloon that storage space in a very big hurry
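To put a number on that: base64 turns every 3 bytes into 4 characters, and MIME-style line wrapping adds a little more. A quick sketch that measures the growth (the function name is made up; the payload is just zero bytes, since the content does not change the size):

```python
import base64


def base64_growth(n_bytes):
    """Return encoded size divided by raw size for a MIME-style base64 body."""
    raw = bytes(n_bytes)  # placeholder payload of n_bytes zero bytes
    encoded = base64.encodebytes(raw)  # inserts a newline every 76 characters
    return len(encoded) / n_bytes
```

For any reasonably sized attachment this comes out at a bit over 4/3, so a mailbox holding 3 GB of scans stored this way would need a little over 4 GB.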

JohnStrange · 8y ago

Don't you have to run your own IMAP server for that to work?

Although my mail provider is fairly generous about storage space, it's not unlimited.

theomega · 8y ago

Depends on two things: Your space and your privacy requirements. Google Mail works for example if you are willing to trust Google. A lot of email providers offer you a lot of space.

tjoff · 8y ago · 1 in thread

I don't know what Mayan EDMS is, and all this readme does is say what it is in relation to Mayan EDMS. Extremely frustrating.

Fnoord · 8y ago

According to the documentation it is Ubuntu only, as it requires Ubuntu 16.10 or later. What about other Linux distributions? No mention of the other 2 popular desktop OSes, Windows and macOS?

karinato · 8y ago · 1 in thread

For those wondering about the relationship between Mayan EDMS, Paperless and Open Paperless here is a story line summary of the saga.

Roberto Rosario (the creator of Mayan) is a very well known name in the Django, Python, document management, maker, hacking, open health and open source in government circles.

- https://speakerdeck.com/siloraptor
- https://en.wikipedia.org/wiki/Roberto_Rosario
- https://www.pycon.it/conference/p/roberto-rosario
- http://pyvideo.org/djangocon-us-2014/liberation-and-moderniz...
- https://cpucadviceletters.org/login/?next=/
- https://twit.tv/shows/floss-weekly/episodes/253
- https://en.wikipedia.org/wiki/Mayan_(software)
- https://www.youtube.com/watch?v=rubzEAojf-k

Mayan EDMS was initially released on February 3, 2011 (Wikipedia and git log). In June 2015, Roberto gave a workshop at DjangoCon named "From zero to paperless with Mayan EDMS" (https://archive.is/FDpYS). Daniel Quinn (the creator of Paperless) also attended and presented at the same DjangoCon event (https://vimeo.com/135907408) and 6 months later, after working on it for several months (Daniel's own words), he released Paperless on December 20, 2015 (https://github.com/danielquinn/paperless/commits/master?afte...). By January 24, 2016, Paperless had "exploded in popularity" (https://twitter.com/danielagquinn/status/691242822431830016).

Both projects used Python, Django, same Django 3rd party apps like DjangoSuit, same document consumer model, same OCR engine, REST API, among other things. On the surface it appeared that Paperless was a copy of Mayan EDMS concepts and implementations without giving credit or mention. Many additions were planned for Paperless that were features and implementations already in Mayan (https://www.reddit.com/r/selfhosted/comments/44mh88/scan_ind...).

A separate point of contention was that the name "Paperless" had been in use by other projects much earlier than Daniel's Paperless (https://github.com/search?utf8=%E2%9C%93&q=paperless&type=). Since there is no trademark on the name or description, other projects appeared with the same name and description (https://github.com/lrnt/paperless).

On March 15, 2016, Daniel presented Paperless at CodeNode (https://skillsmatter.com/skillscasts/7843-intro-to-paperless).

It was Daniel's February 27, 2016 tweet suggesting to be paid to work on Paperless that sparked the animosity between the users of the two projects (https://twitter.com/danielagquinn/status/703629488932970500).

Many heated debates ensued. Even then, the main critique of Paperless remained technical; its lack of maturity and implementation was described by one Reddit user as: "I've looked into paperless and it currently lacks a lot of...nearly well everything. Maybe in a year or two" (https://www.reddit.com/r/linux/comments/6m9evn/want_to_go_pa...)

On April 9, 2016, Daniel added a reference to Mayan to the documentation of Paperless (https://github.com/danielquinn/paperless/commit/674d54ec3878...).

On April 17, 2016, Daniel posted on his old twitter account: "It looks like my idea for Paperless wasn't all that unique. This other project uses a lot of the same tools: http://www.mayan-edms.com" (https://twitter.com/danielagquinn/status/721726208606646272).

On April 14, 2017, Daniel Quinn posted in his blog a summary of his experiences at DjangoCon Europe 2017 where he mentions meeting Roberto in person. He describes Roberto as a "rival geek" in what appears to be jest and uses positive adjectives to describe Roberto in the rest of the post. (https://danielquinn.org/blog/djangocon-2017/)

On April 16, 2017 Daniel posted a tweet mentioning the popularity of Paperless (https://twitter.com/danielagquinn/status/853701257051205632).

The last release of Paperless was made on Sep 9, 2017.

On Oct 18, 2017 Daniel posted: "I changed my Twitter name! This isn't me any more, so if you're looking for me, you should keep head over to @danielagquinn." (https://twitter.com/searchingfortao/status/92077862371561062...). Only 7 commits have been made to Paperless since, with the last commit happening on November 5, 2017.

On December 18, 2017 a user named "zhoubear" announced on Reddit's r/selfhosted "Open Paperless: Scan, index, and archive all of your paper documents" (https://www.reddit.com/r/selfhosted/comments/7kjocg/scan_ind...). It turned out that Open Paperless was a forked Mayan EDMS with cosmetic changes but with copyrights changed and no attribution to Mayan EDMS. After a heated debate, copyrights and attributions were restored and the project's description has been updated to show that it is a new front end for Mayan among other usability changes meant for home users.

In 4 days, Open Paperless has surpassed Mayan EDMS in popularity on Github.

No posts or comments from Roberto can be found in reference of Paperless or Open Paperless.

https://twitter.com/search?q=paperless%20from%3Asearchingfor...

kerridge0 · 8y ago

Mayan isn't hosted on GitHub so that may explain the difference in popularity.

ikawe · 8y ago · 1 in thread

Let's put this in a room with [The Screenless Office](https://news.ycombinator.com/item?id=15960056) and see what happens.

beamatronic · 8y ago

Is that the one where you scan a barcode and they mail you a printout of a web page?

SomewhatLikely · 8y ago · 1 in thread

Something I've wanted that might be possible is software that takes in a video of me flipping the pages of a notebook and converts that to a PDF of the notebook.

mdaniel · 8y ago

I regret that I can't immediately find the video which discussed it, but this gets in the ballpark of what I saw: https://www.researchgate.net/publication/271462470_OCR_from_...

IIRC, it wasn't ~vaporware~ researchware, but nor was it "clone this repo, away you go"

pingec · 8y ago

Does anyone know any similar free/open products for archiving documents, tracking etc.?

What I am after is a system like expensive solutions have in some companies where the mailbox department prints (or has preprinted) labels with unique bar codes, for any incoming mail, they open it, stick a label on it, scan it with the label on it and then physically deliver it. Some departments also input recipient and sender details, add tags etc. So in the end they have a searchable database by persons involved, content type, tags and also all documents (physical and digital) have a referenceable id that can be used for various purposes.

y4mi · 8y ago

a nontrivial name conflict with Paperless (https://github.com/danielquinn/paperless) ...

carwyn · 8y ago

There's also this paperless https://github.com/danielquinn/paperless

bob_theslob646 · 8y ago

Please correct me if I am wrong, but this looks like you have to "name" each page. I would also want to see how accurate the OCR is. Historically, OCR on handwriting has been a problem unless the data is perfectly formatted. I guess the goal is just to get enough accuracy so that you can find the image of that page with the indexed search term you were looking for.

mickael-kerjean · 8y ago

Well done! Will definitely give it a try back home!

mauritzio · 8y ago

Maybe it would be better to "archive" on good paper (encoded). Can't imagine a 1000-year-old magnetic device... ;)

rootsudo · 8y ago

Okay, wow, this is cool.


I would recommend buying a couple of extras in advance, as they will get harder to find and more expensive as time goes on.

mattlondon8y ago

HTH.

nh28y ago

HP OfficeJet Pro 8620. It's an all-in-one, but interestingly the best scanner for such purposes I've found so far (I'd love to find a better one).

Scans both sides, up to 50 pages in a batch. Open-source drivers, works well with Linux. Ethernet. Can drop it on on your SMB share.

pjc508y ago

If you have zero retrievals, you can just bin everything. As you've said, it depends how often you want to retrieve something and whether you ever need to retrieve a number of things in a hurry.

sschueller8y ago

I like this idea. I only need to keep 10 years and almost never need to retrieve anything. So if I use 10 boxes I can just throw away the oldest.

paulie_a8y ago

Would it cause any real issues if you were unable to retrieve something? If the answer is no, toss the other 9 boxes

1 more reply

matt_the_bass8y ago

dhruvkar8y ago

I agree with the sentiment that most papers have very little value in the long run.

So I have 4 'bins' to categorize physical papers - scan & keep, scan & shred, shred, throw.

This system usually eliminates a lot of papers that I would otherwise mindlessly/OCD'ly scan.

gravypod8y ago· 10 in thread

Will this automatically center and apply perspective transforms to pictures taken with phone cameras?

bvrlt8y ago

You should check Genius Scan (for iOS and Android) that does that for you automatically, and much more: https://www.thegrizzlylabs.com/genius-scan/

It also doesn't tie you to a specific ecosystem.

Disclaimer: I'm on of the authors of Genius Scan :)

gravypod8y ago

There are many solutions for systems like this. Unfortunately none are open source.

Most documents I do want to keep record of are sensitive. I don't trust a closed-source app with that kind of information.

towndrunk8y ago

Microsofts OfficeLen's for iOS is pretty nice doing this automagically.

billbrown8y ago

The best version of this that I've seen is Scanner Pro by Readdle. I had to scan three months worth of food receipts for an insurance claim and this feature was a lifesaver.

bketelsen8y ago

seriously, thank you for mentioning this. Great app!

zhoubearOP8y ago

Not at the moment. I'm guessing this information is available in the EXIF properties?

joshvm8y ago

Dropbox has some nice write-ups on this: https://blogs.dropbox.com/tech/2016/08/fast-document-rectifi...

zhoubearOP8y ago

Thanks for the link. That blew my mind! I wish it could be added would, my phone would replace my scanner instantly.

4 more replies

voltagex_8y ago

The most complicated implementation I've seen has been Office Lens, which also corrects distortion and does edge detection.

gravypod8y ago

No. It requires opencv and image processing

pw0nka8y ago· 9 in thread

Looks great. Love the idea behind it, but...

anotheryou8y ago

There is a proprietary solution for it, sadly overpriced and cloud-based. But the principle could be implemented in any software:

Shoe Box + QR, you fill linearly and always snap the QR with the scan. It than can tell you at roughly which height in the stack and in which shoe box something is.

edit: here: https://box.fileee.com/

pw0nka8y ago

kuschku8y ago

That’s not a problem with your laws, but more a problem with SaaS. Data should always stay as local as possible, ideally even within of your own organization.

1 more reply

jopsen8y ago

I have always archived my documents by throwing them in an unsorted black box.

If someone really need me to retrieve an old document. It'll take forever to find, but why would I want to pay sorting costs upfront?

copperx8y ago

outsideoflife8y ago

Because some businesses generate very large volumes of documents that have to be stored

ar08y ago

pw0nka8y ago

tehabe8y ago

You could store them chronically on paper and do the actual sorting on your computer or server.

curioussavage8y ago· 7 in thread

Any good open source desktop software with linux support to do this? I don't see why I would personally want a web app for this.

joelhaasnoot8y ago

It's a little clunky but here's the one I found best that just worked on Ubuntu: http://gscan2pdf.sourceforge.net/ . It can combine some of the best tools for OCR/cleanup/etc.

coaxial8y ago

I wrote a collection of bash scripts for that. https://github.com/coaxial/insaned-config

It also uses the excellent textcleaner imagemagick script to clean up the scans and make them more ocr friendly.

jackvalentine8y ago

I haven't tried this yet, but - https://openpaper.work/

Edit: tried it, it's crap.

jk23238y ago

May I ask why? Installation is a bit cumbersome but it seems to be an outstanding program to me. I have been looking very long for something like this.

I have not tired yet how it reacts to huge amounts of data. But best thing: NOT written in Java!

1 more reply

dm3198y ago

Maybe just put the scanned pdfs into a hierarchical folder system, then keep a text file at the root with comma or tab-separated location, ISO date and keywords.

Then your documents are a grep away. Maybe awk to find documents from a date range?

Maybe someone clever could automate this with the OCR output...

arca_vorago8y ago

JustSomeNobody8y ago

Well, if you have a home server, having a web app works quite well. But if you don't, then a desktop app would probably be better.

Spearchucker8y ago· 4 in thread

This is arguably a lot more than I need. I'm a hoarder in that I have every email I've ever sent or received (bar junkmail), and every piece of paper I've ever received.

Hope to have everything done by the time I cash in my chips. Might make for a useful dataset for someone somewhere some day.

ishi8y ago

Even if the OCR quality isn't perfect, at least one of these keywords will most likely get matched.

Another example - you could add the names of your family members as keywords, making it easy to find all documents related to Jenny or Susan.

The keyword approach is also very easy to understand and tweak, unlike more rigorous but opaque methods of document clustering (such as tf-idf).

myaso8y ago

Out of curiosity what is the state of the art today for extracting text or other data from scanned documents (forms, legal docs, receipts, etc) ?

matt_the_bass8y ago

I don't have an exact answer but can tell you that Expensify still resorts to human parsing sometimes. How often "sometimes" is, I have no idea. I would guess a lot.

Spearchucker8y ago

Food for thought.

prashnts8y ago· 4 in thread

An example is this screenshot from my notes https://imgur.com/a/xuZqW

lucaspiller8y ago

mark_l_watson8y ago

I switched to using Notes in Fastmail.

davidrupp8y ago

I've been using http://writeapp.net/notesexporter/ for Mac's Notes app; happy with it so far.

zhoubearOP8y ago

Right now you can enter comments for a document that's already uploaded. A few UI tweaks could be added to show the latest comment under the document preview.

Exporting to PDF sounds like a really good idea.

EGreg8y ago· 4 in thread

I have a question

Is there a service anyone knows about which will print your email and send it with tracking of receipt or signature, so you can prove what was physically sent?

Or you mail it to them and they open your mail, scan it and forward it on with signature required, with your address as the return address?

Because righy now you can only prove that the ENVELOPE was received, not what was in it.

craftyguy8y ago

you can just ask your recipeint to send you some uuid on the letter after they receive it. i have no idea what problem you are trying to solve though.

y4mi8y ago

i once heard of a case were the employer send an unrelated document to an employee.

later on, the payment stopped and the employer claimed that they fired him at that time.

I don't recall how that case ultimately turned out, but maybe something like that? would be incredibly rare though and of dubious worth for mostly anyone

craftyguy8y ago

I believe there are courier services that help guarantee document delivery to the correct person, i.e. for sending/serving court summons.

chrisweekly8y ago

there are lots of virtual mailbox service offerings out there, to open and scan mail for you. what you describe is right up their alley.

theomega8y ago· 3 in thread

I want to show an alternative approach to managing your documents:

Store them in your IMAP/Mails. Either on an own account or in a dedicated sub-folder.

Big advantage: - Search on IMAP is a solved problem - Clients for every operating system in the world, including web, mobile - Super simple backup and restore

Over course, very geeky, nothing for your parents, but maybe something for you?

[1]: https://github.com/theomega/IMAP_DMS

mdaniel8y ago

Just ensure you pick a "good" encoding scheme, because base64-ing PDFs will balloon that storage space in a very big hurry

JohnStrange8y ago

Don't you have to run your own IMAP server for that to work?

Although my mail provider is fairly generous about storage space, it's not unlimited.

theomega8y ago

Depends on two things: Your space and your privacy requirements. Google Mail works for example if you are willing to trust Google. A lot of email providers offer you a lot of space.

tjoff8y ago· 1 in thread

I don't know what Mayan EDMS is and all this readme does is saying what it is in relation to Mayan EDMS. Extremely frustrating.

Fnoord8y ago

According to the documentation it is Ubuntu only, as it requires Ubuntu 16.10 or later. What about other Linux distributions? No mention of the other 2 popular desktop OSes, Windows and macOS?

karinato8y ago· 1 in thread

For those wondering about the relationship between Mayan EDMS, Paperless and Open Paperless here is a story line summary of the saga.

Roberto Rosario (the creator of Mayan) is a very well known name in the Django, Python, document management, maker, hacking, open health and open source in the goverment circles.

On March 15, 2016, Daniel presented Paperless at CodeNode (https://skillsmatter.com/skillscasts/7843-intro-to-paperless).

On April 9, 2016, Daniel added a reference to Mayan to the documentation of Paperless (https://github.com/danielquinn/paperless/commit/674d54ec3878...).

On April 16, 2017 Daniel posted a tweet mentioning the popularity Paperless (https://twitter.com/danielagquinn/status/853701257051205632).

The last release of Paperless is made on Sep 9, 2017.

In 4 days, Open Paperless has surpassed Mayan EDMS in popularity on Github.

No posts or comments from Roberto can be found in reference of Paperless or Open Paperless.

https://twitter.com/search?q=paperless%20from%3Asearchingfor...

kerridge08y ago

Mayan isn't hosted on GitHub so that may explain the difference in popularity.

ikawe8y ago· 1 in thread

Let's put this in a room with [The Screenless Office](https://news.ycombinator.com/item?id=15960056) and see what happens.

beamatronic8y ago

Is that the one where you scan a barcode and they mail you a printout of a web page ?

SomewhatLikely8y ago· 1 in thread

Something I've wanted that might be possible is software that takes in a video of me flipping the pages of a notebook and converts that to a PDF of the notebook.

mdaniel8y ago

I regret that I can't immediately find the video which discussed it, but this gets in the ballpark of what I saw: https://www.researchgate.net/publication/271462470_OCR_from_...

IIRC, it wasn't ~vaporware~ researchware, but nor was it "clone this repo, away you go"

pingec8y ago

Does anyone know any similar free/open products for archiving documents, tracking etc.?

y4mi8y ago

a nontrivial name conflict with Paperless (https://github.com/danielquinn/paperless) ...

2 more replies

carwyn8y ago

There's also this paperless https://github.com/danielquinn/paperless

bob_theslob6468y ago

mickael-kerjean8y ago

Well done! Will definitly give it a try back home !

mauritzio8y ago

Maybe it would be better to "archive" on good paper (encoded) Can not imagine a 1000 year old magnetic device... ;)

rootsudo8y ago

Okay, wow, this is cool.

j / k navigate · click thread line to collapse