DIY Book Scanner

(diybookscanner.org)

340 points | bcaa7f3a8bbc | 4y ago | 124 comments

My first job out of college was scanning books for the Internet Archive down in the basement of the Library of Congress. Their scanning machines used a foot pedal to raise and lower the glass platen, so I'd use one hand to flip the page and wiggle the cradle to get things nice and flat, and the other to snap the photo. You can get pretty fast after a while, but boy is it mindless. Older books that had been rebound a couple of times already were the hardest to work with, as they have the least margin. There were a bunch of different-sized dowels that we would put under the spine in the cradle so the glass could gain a couple of millimeters of margin, just enough to avoid cutting off text. Worst case, the book had to be unbound in order to be captured. I did get to flip through a lot of cool old illustrated catalogues like this: https://archive.org/details/illustratedcatal00keil/page/14/m...

daniel_reetz4y ago

Later on I worked briefly for the Archive. The scanner I designed later became their "ttscribe". It was fascinating to see their process up close.

daniel_reetz4y ago

(I'm the founder of DIY Book Scanner, ran it from 2009 to 2015)

subset4y ago

I currently work for the university I study at, in the (biomedical) library. I scan lots of old journal articles and periodicals for academics who need them for their research. A typical job might be scanning an article from 1960 on Potato Research for an agriculturalist, or a graph of human energy expenditure, or an analysis of fibres for forensic medicine. We get researchers from all around the world requesting articles on all sorts of topics from our archives. It's a sandstone university, so we have some very old collections that are definitely getting crumbly!

Other than locating the books, by far the most tedious aspect is the scanning. We only have a terrible flatbed scanner, that is completely unforgiving - it only has a 25 page email limit, otherwise you have to split it into separate emails. And if you mis-scan a page accidentally (some of the book margins are super tight), then you have to restart the entire scan - there's no delete page button!

tablespoon4y ago

> And if you mis-scan a page accidentally (some of the book margins are super tight), then you have to restart the entire scan - there's no delete page button!

Sounds like you need a license for Adobe Acrobat Pro or some other application that will let you reshuffle/insert pages.

0wis4y ago

Oh, I understand you! I would have loved to have design hints like that for one summer job I had. I was tasked with scanning medical books with thousands of pages. I had few time constraints and could do it wherever I wanted. I was paid $80 per book, which I found crazy huge before starting. Even by optimizing every parameter of the hardware, software and my workspace, I couldn’t do more than a few 2-hour sessions per day and it took me several days per book. A boring job if there is one. I would certainly use it as a zen practice today but as a teenager, not really needing the money, I couldn’t find any value in it (after optimization).

pbhjpbhj4y ago

Books that aren't rare, ie aren't valuable as artefacts, you would surely cut off the spine and run through an automated scanner?

But then medical texts probably cost way more than $80. How much was your boss making from those scans? Were they taking account of copyright law?

raybb4y ago

Wow that's awesome! I take it you're responsible for a chunk of the books available now on openlibrary.org?

When scanning books like that did you ever see anything interesting or are you so zoned out you don't really pay attention?

subpar4y ago

A very small chunk, I only lasted a couple months. Most of it was pretty boring, think volume after volume of copyright records or issues of the national stamp collector's magazine. Eventually I started working on some of the contract work they did for other agencies, e.g. declassified FBI case reports. The best was the stuff for the Smithsonian, which often included beautiful naturalist illustrations. I'm not sure how much of that stuff was public domain though.

mymythisisthis4y ago

Nice old batteries https://ia800608.us.archive.org/BookReader/BookReaderImages....

bobbytakeitwith4y ago

Back when this was a more popular problem, I saw a number of projects that used a rubber tipped stick to automate page turning.

I wonder why this never took off?

pfp4y ago

I'd really like to see this too.

Building a scanner would be interesting, but the mind-numbing idea of turning pages manually isn't very enticing

axiosgunnar4y ago

Were you using gloves or something like that?

david_allison4y ago

Unsure about IA, but gloves are typically advised against[0] unless you have a suspicion that the book will be dangerous (arsenic ink in bindings[1], dust, mold or frass (sadly)[2]). Hand-washing before is typical advice but YMMV

[0] https://www.nationaltrust.org.uk/features/why-wearing-gloves...

[1] https://daily.jstor.org/some-books-can-kill/

[2] https://www.ifla.org/node/93094

thrdbndndn4y ago

I get the points against gloves in your links, but in my experience my hands will constantly leave fingerprints and oil on the books no matter how much I wash them.

How do they avoid this issue? Or just don't bother?

subpar4y ago

Nope. I'm not sure if this is the right way to put it, but we were basically a scanning factory. I think the really sensitive documents got routed elsewhere. Books and folios that had large format foldouts got more specialized, ahem, white glove treatment. Many scanners wore these little textured rubber finger tips: https://rubber_finger_tips.jpg.so/

l00sed4y ago

That sounds incredible

zwayhowder4y ago

I built one of these out of pine 2x4s and plywood. I thought it would be cheaper than buying one (I was wrong) but I'm also not a skilled woodworker and had to buy most of the tools.

It works quite well and I digitised dozens of textbooks I'd purchased and needed to reference but couldn't carry around every day while finishing my masters. My one had 2 Nikon mirrorless cameras controlled via Pi-Scan. https://github.com/Tenrec-Builders/pi-scan

I had a smaller toggle switch wired to the GPIO pins so I could click the scan next button without having to take my hands off the book. Once I got used to the workflow I could scan about 1000 pages per hour while watching Netflix.

I replaced it with a Czur scanner that isn't as good, but is a lot smaller and is good enough for my less demanding needs now that I'm not doing a masters degree :D

timeinput4y ago

The dual camera is a design choice I hadn't thought of. I've thought of scanning a couple books, and that's probably the trick for me. Though maybe I'll rotate the book and scan / rotate images separately.

zwayhowder4y ago

It let me capture the pages with the correct orientation, and the cameras have a fixed focus on the platen, so it works really well. Then Scan Tailor can crop automagically and deal with the rest.

SamBam4y ago

What was the purpose? Did the digitized books go to the public domain/your university, or something, or was it purely for personal research?

zwayhowder4y ago

Just for personal use as all the books were still in copyright and I own the paper versions, it was a (probably legal) fair use of them purely for reference while studying.

I often needed to find information in the books and couldn't reasonably carry them all with me every day between work and uni.

pbhjpbhj4y ago

In my opinion, buying a book (or other media) should give one the right to a digital copy from any source.

flakiness4y ago

In Japan, meanwhile, people in the book scanning community (it exists) often just cut off the book's spine, scan the pages with a normal scanner, and throw the book away once all the pages are scanned.

People there (rightly) value living space more than books. It's called "Ji-sui" (scanning by oneself), and gear recommendation sites like [1] are abundant. Another reason for the prevalence of "Ji-sui" was the poor availability of ebooks, although that reason is less relevant today.

[1] http://monomania.sblo.jp/article/60578693.html

wpietri4y ago

One business I kinda want to exist is a book warehouse/scanning operation. I send them boxes of books; they give me an app with access to digital versions of every book I send them. The whole operation is somewhere in, say, Nebraska, so storage cost is very low.

flakiness4y ago

Similar scanning services exist in Japan as well (e.g. [1]). The difference is that they kindly discard the books for you once they have been scanned!

[1] https://www.bookscan.co.jp/

SolonIslandus4y ago

Something like this?

https://1dollarscan.com/sp/

wintermutestwin4y ago

$.01 per page and they are based in Fremont, CA.

This looks like a cheaper way to get most ebooks and you can ship the books direct from Amazon. If enough people did this, maybe ebook pricing will come down to something rational.

mattowen_uk4y ago

For a while there existed services that did this with your CD collection. You'd send them a crate of your music CDs and they'd send back all the music ripped as mp3s.

Obviously, they didn't re-rip music they'd already ripped so technically for popular music, you got 'someone else's' mp3s.

---

My multi function laser printer has a duplex scanner on it. It can scan pages at quite a rate. The problem is not the scanning, but the accurate OCRing, and for things like magazines, the storage of all those high resolution pages. Right now, I cut out the articles I want to keep from my monthly mags, and just scan those. It seems like a fair compromise right now.

pbhjpbhj4y ago

Also, in UK with our strict "fair-dealing" (as opposed to USA's Fair Use) none of this is lawful - including ripping CDs for personal use (though that format-shifting was briefly allowed for a couple of years).

Disclaimer: This is my personal opinion; not legal advice.

_virtu4y ago

When I was in college the iPad had just come out. I was determined to save money so I snagged an iPad to use as my omnitextbook and built a scanner based upon one of the schematics on this site with a friend.

I would usually be the guy that made an email group for everyone to share notes and questions for classes pre all of the blackboard garbage, so I started leveraging those connections and would ask if anyone would let me borrow their book for a scanned version in return. My friends and I would have a book scanning party and would help to scan each others’ books. We’d grab some drinks, find some favorite albums and hang out all night until the wee hours taking turns scanning texts.

After one semester the setup paid for itself. I would supplement some texts with learning trackers like bitme before amazing resources came around like libgen. Good times.

daniel_reetz4y ago

Thanks for sharing your story, I'm so glad to hear it was useful to you and your crew. Were you ever active on the forums?

fernly4y ago

Nice to provide hardware hints and designs, but geez, that is almost the least of it. The cleverest hardware still only gets you a thumb drive full of page images. Now what? There needs to be a software workflow ending with a readable book in PDF, EPUB or MOBI format, and there are many, many choices to be made along that path.

Edit: "Finishing a book" is discussed at a very superficial level here: https://vimeo.com/user33752051 at about 1:00:

"In order to turn these raw images into an ebook, the very minimum you need to do is A, you need to rotate them, B you need to crop them down to use the page [?], and C you need to combine them into one document like a PDF... You can do OCR to make it searchable ... color correction... de-skewing, de-warping ..."

BeetleB4y ago

Back in 2012, there was a guy who started an open source project that did exactly this - he wrote it specifically for the DIY Book scanner. It had a local Django project as the interface. I don't remember the details, but it did a decent job of taking the images, OCR'ing them and creating an output PDF.

I believe he abandoned the project some years later as life got busy and he never found enough volunteers to help him.

Would have to go through my email records to find the name of the project.

markvdb4y ago

Spreads[0] is probably what you are referring to. Some backstory:

I saw the diybookscanner community - which at that point mostly had Daniel Reetz [1] as its active contributor- struggle with mechanical contraptions for triggering cameras and very little software experience. I built a simple proof of concept to reliably trigger cheap consumer cameras using software. I built it on CHDK[2], the Canon Hack Development Kit, alternative firmware for cheap consumer cameras. The proof of concept worked.

I then had a fairly large number of book scanner kits built and shipped mostly around the EU [3]. More of a work of love than a business really, even if it was formally under an llc umbrella. Johannes initially was just a customer. He wanted to build a better software solution, and within the spirit of the project did so as free software. I tried to support him at this as well as I could, setting up build infrastructure, trying to reel in more people, getting him some cameras to test, get the amazing CHDK people to port to new camera models, ...

Then real life intervened indeed.

Johannes, if you read this, I'm still grateful for the experience of having worked with a great developer like you!

[EDIT] And of course, I should also mention Dan Reetz' incredibly inspiring work bootstrapping an incredible open hardware project! Hats off!

[0] https://github.com/DIYBookScanner/spreads

[1] https://danreetz.com/

[2] https://chdk.fandom.com/wiki/CHDK

[3] http://diybookscanner.eu

daniel_reetz4y ago

Hi Mark! We had electronic and USB triggering working with SDM and CHDK before you joined. But no image transfer or control of settings by USB. We deliberately pursued mechanical triggering for places where a computer and crazy firmware wasn't an option. I donated quite a few scanners to projects and people who simply couldn't use that stuff at the digitization site.

Johannes (spreads) was one of the most inspiring people I've ever worked with, so thankful for the energy and intellect he brought to the project- and the software he built. I donated a pair of DSLRs to him as a thank-you. Last I heard he was still working in a related space, but at a higher level.

Personally, I left the project to join Apple (they refused to let me continue any work on the open-source project while I was employed there), and gave Jonathon (tenrec) control. He redesigned the scanner again and sold kits as well as produced a Raspberry Pi based controller with nice software. Seems he has closed the store.

BeetleB4y ago

Actually, it was a project called Paper Upgrade. Here is an old archive link:

http://web.archive.org/web/20140101000000*/http://www.paperu...

I don't know if you can find the code through there, but I'm pretty sure he had made it free. I think spreads is a bit newer.

Edit: Found some more info. It did indeed use Scantailor in the backend. His SW was more of a Web based frontend to all the parts. You can see a video demo of it here:

https://www.youtube.com/watch?v=Ad7aFYdbDos

Start at about 4:40.

The source is here:

https://code.google.com/archive/p/diy-ebook-creator/

fernly4y ago

Not the same as Scan Tailor[1,2] ? Which was referenced from the Instructables link cited earlier. That apparently was a comprehensive toolkit in C++ and Qt, now archived.

[1] https://scantailor.org/

[2] https://web.archive.org/web/20210304015939/https://github.co...

jccalhoun4y ago

There are a couple folks that forked scantailor. I'm not sure the status of those. Here are a couple: https://github.com/4lex4/scantailor-advanced https://github.com/trufanov-nok/scantailor-universal

hackeyed4y ago

Right, step 1 -> get page images, step 2 -> author images into book file. While OCR is obviously useful for search, a rotated phone screen will let you comfortably read a pdf book just fine unless you are talking about something like a textbook, in which case you probably wanted a tablet anyway.

I wrote up a guide on the authoring process using FOSS tools for some Digital Humanities folks a couple years ago: https://github.com/wikey/bookscan

It gives some background on the problem and covers a Scantailor (page crop, rotate, deskew), pdfbeads (compression, book metadata) authoring workflow, with pdftk for some general odds and ends.

jccalhoun4y ago

scantailor will get you most of the way there. the original project is dead but there are a few forks on github. It has been a while since I did any serious scanning so I can't remember which version I used. https://github.com/4lex4/scantailor-advanced https://github.com/trufanov-nok/scantailor-universal

Mediterraneo104y ago

I scan heavily from academic libraries in order to contribute to LibGen, but even with Scantailor it is very time-consuming. For example, if you are scanning scientific literature from the Eastern Bloc, it was often printed on low-quality, speckled paper, which means Scantailor often identifies too much of the scan as the page block, and then you have to manually tweak the rectangle.

walrus014y ago

The simplest of which would be to turn the images into a multi page raster PDF, using freely licensed linux based command line tools for PDF generation. Which will of course result in a rather large file size vs doing OCR, but might be the best preservation method for books with illustrations, unusual fonts, catalogs, mixed text and photos, etc.

I am not clear on to what extent the existing workflow does a de-skew of the camera images to deal with page curvature towards the spine.

I think I recall the Internet Archive having an open source design for something similar to this? And other projects which accomplish generally the same idea.

vixen994y ago

Just page images? No, Czur software with its OCR generates searchable pdfs and Word or Excel files with no further input. With careful attention to the scanned area, it's easy to get .xlsx files needing zero or minimal editing. The other advantage of the Czur is the automatic correction for curvature when scanning books with narrow margins on either side of the spine.

No, I have no connection with Czur - just an enthusiastic user!

ajot4y ago

This is an old article, so maybe some software isn't the best option nowadays, but you can get the idea of postprocessing: http://natecraun.net/articles/linux-guide-to-book-scanning.h...

rahimnathwani4y ago

https://www.instructables.com/Bargain-Price-Book-Scanner-Fro...

"Step 10 - Post-processing" has some steps

sandeep13384y ago

Video looks interesting, I'll check it out!

david_allison4y ago

I'm very interested in getting into archival (getting started this month after a few more conversations).

Your buy button[0] is broken. You're potentially missing out on a few sales due to this.

Is 2x 4GB SD card sufficient for your purposes? I've been quoted 50MB TIFF images as a standard, and a lot of books wouldn't fit without swapping SDs at that size.

[0] http://store.diybookscanner.org/
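
A quick back-of-envelope check of that storage concern, as a sketch only. The helper name `images_per_card` is made up for illustration, and the estimate assumes decimal units (1 GB = 1000 MB), one card per camera, one page per image, and no filesystem overhead.

```shell
# images_per_card CARD_GB IMAGE_MB
# Prints how many images of IMAGE_MB megabytes fit on a CARD_GB card.
images_per_card() {
    echo $(( $1 * 1000 / $2 ))
}

# Two 4 GB cards (one per camera) and 50 MB per TIFF, as in the question.
per_card=$(images_per_card 4 50)
echo "images per card: $per_card"
echo "pages across both cards: $(( per_card * 2 ))"
```

Under those assumptions a pair of 4 GB cards holds roughly 160 page images, so any book longer than that would need a card swap or a different storage target.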

zwayhowder4y ago

If you use pi-scan the images are saved to a USB drive instead of the SDcards.

fernly4y ago

archiving what? just curious.

david_allison4y ago

I want to digitize the entire linguistic and spoken corpus of a critically endangered language[0] and convert it to a searchable format to aid in language revival, academic research, and ensuring that an informed debate can occur when the modern usages of the language differ from traditional usages of the language.

Most of the printed books are scattered, but available, but it's akin to an iceberg: there's a significant amount of 'submerged' knowledge about the language in written manuscripts and recorded audio, and this is where a lot of the value comes from. Printed texts are primarily religious, and getting the colloquial usages of words and phrases is very useful.

Many manuscripts aren't digitized at all, or are available and need transcription.

The language is relatively well-recorded (dating back to at least the late 16th century in written form), and yet small enough that a comprehensive reference is viable: estimates of about 5MM words crop up, but even 3x could easily fit in memory on a Digital Ocean droplet, even if fully POS tagged[1]. Texts are also mostly in the public domain, and there's a lot of bilingual texts (which act as a Rosetta Stone).

[0] https://en.wikipedia.org/wiki/Manx_language#Revival

[1] https://en.wikipedia.org/wiki/Part-of-speech_tagging

EDIT: More than happy to talk in depth about this if anyone wants, via comments, or email on my profile.

timonoko4y ago

If you have a fast flatbed scanner, you can scan 300 pages in thirty minutes. Not worth the effort to build automation. The bigger problem was sorting out all the errors and missed pages afterwards. A real-time display (from ImageMagick) solved this problem:

    # Every 5 seconds, cycle through all scans; the last one stays on screen.
    while true ; do
     for x in *.pnm ; do
      killall display
      display -rotate 90 "$x" &
     done
     sleep 5
    done

timonoko4y ago

Nobody asked, but for the record, this is how to make a real one-page PDF from two-page scans. (gm = GraphicsMagick)

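    mkdir -p kaksi          # output directory ("kaksi" is Finnish for "two")
    rm -f kaksi/*
    scale=600               # output width in pixels
    size=500x730            # crop size of a single page
    yla=27                  # top offset of the crop
    # Left half of each scan: even numbers starting at 102.
    j=102
    for x in *.pnm ; do
        echo "$x"
        gm convert "$x" -rotate 90 -crop "$size+20+$yla" -resize "$scale" -normalize "kaksi/k$j.jpg"
        j=$((j+2))
    done
    # Right half of each scan: odd numbers starting at 103.
    j=103
    for x in *.pnm ; do
        echo "$x"
        gm convert "$x" -rotate 90 -crop "$size+530+$yla" -resize "$scale" -normalize "kaksi/k$j.jpg"
        j=$((j+2))
    done
    cd kaksi
    # Three-digit names keep the glob in page order.
    gm convert *.jpg -format pdf TheBook.pdf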
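
A note on the numbering in the script above: starting the counters at 102 and 103 interleaves the two halves of each scan and keeps every file name at three digits, so the `*.jpg` glob returns pages in order, but only up to k999. A minimal sketch of a zero-padded variant that lifts that limit; `page_name` is a hypothetical helper, not part of the original script.

```shell
# page_name SPREAD SIDE
# SPREAD is the 0-based index of the two-page scan; SIDE is 0 for the
# left half and 1 for the right half. Prints a zero-padded file name.
page_name() {
    printf 'k%05d.jpg\n' $(( $1 * 2 + $2 ))
}

# Names for the first three scans, already in the order a glob returns them.
for spread in 0 1 2; do
    page_name "$spread" 0
    page_name "$spread" 1
done
```

With five digits the lexical order of the names matches the page order for books of any realistic length.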

pflanze4y ago

If I read this right, it means "every 5 seconds, open the last scanned page (and nothing else / close the previous one)". But this seems like an inefficient way to do it, opening and killing all the irrelevant pages every time. This will be more efficient and react more quickly:

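    lastfile=
    pid=
    while true; do
        # Newest scan by file name (assumes names sort in scan order).
        newestfile=$(ls *.pnm 2>/dev/null | tail -n 1)
        if [ -n "$newestfile" ] && [ "$newestfile" != "$lastfile" ]; then
            # Close the previous viewer, if any, before opening the new page.
            [ -n "$pid" ] && kill "$pid" 2>/dev/null
            display -rotate 90 "$newestfile" &
            pid=$!
            lastfile=$newestfile
        fi
        sleep 0.3
    done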

timonoko4y ago

"Saving precious bits like it is 1969".

This would make an excellent song title.

pflanze4y ago

Sorry, but I find your answer disappointing and crossing over into offensive. I spent some time first trying to understand how your code makes sense, then writing up a better solution and posting it, and you don't seem to be thankful at all and are instead dissing my effort. Sure, if it works for you, fine; I was under the impression that you didn't know better. You could have saved me time by indicating that you know your solution is hacky but you don't care.

shard4y ago

This really needs to be redesigned for ergonomics.

- Lever should have a button for capture

- Display should be visible while looking down

But now I see why destructive scanning (slicing the binding off and using a sheet feeding scanner) is so attractive. For any non-rare books, this is just too tedious and time consuming to go through for more than a few books.

markvdb4y ago

Display should be visible indeed.

Your capture triggering suggestion is not as great though. The systems that I shipped with http://diybookscanner.eu actually used a USB foot pedal for triggering the cameras. That's by far a superior user experience to pressing a button while both hands are busy moving a cradle...

Destructive scanning feels incredibly cruel to the books. A non-destructive system like this actually works fairly well. You can expect to get up to about 1000-1200 pages an hour with it.
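
To put the 1000-1200 pages an hour figure in operator terms, here is a rough sketch of the arithmetic. It assumes a two-camera rig where each capture photographs a two-page spread; the function names are made up for illustration.

```shell
# seconds_per_capture PAGES_PER_HOUR
# One capture covers two pages, so there are PAGES_PER_HOUR / 2 captures
# per hour. The result is truncated to one decimal place.
seconds_per_capture() {
    printf '%d.%d\n' $(( 72000 / $1 / 10 )) $(( 72000 / $1 % 10 ))
}

# minutes_for_book PAGES PAGES_PER_HOUR
# Scanning time in minutes, rounded to the nearest minute.
minutes_for_book() {
    echo $(( ($1 * 60 + $2 / 2) / $2 ))
}

for rate in 1000 1200; do
    echo "$rate pages/hour: $(seconds_per_capture "$rate") s per capture," \
         "$(minutes_for_book 300 "$rate") min for a 300-page book"
done
```

That works out to a page turn and a pedal press every 6 to 7 seconds, sustained for the whole book.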

shard4y ago

> Destructive scanning feels incredibly cruel to the books.

I suppose it depends on whether it has sentimental value. When I was young, I'd treat my books like treasures, putting covers on them (even paperbacks), making sure I didn't crease the spine when I read them. Now I consider books to be a temporary store of knowledge as the contents pass to my brain. I fold pages, underline, scribble notes in them. There are thousands more copies out there, I don't feel any need to baby my copy.

BeetleB4y ago

> For any non-rare books, this is just too tedious and time consuming to go through for more than a few books.

If your goal is to scan a whole bunch, it's tedious. If you want to do it once in a while, it's not really a problem.

dang4y ago

One past thread, a long time ago:

DIY book scanning - https://news.ycombinator.com/item?id=991897 - Dec 2009 (7 comments)

djoldman4y ago

Here it is in action:

http://tenrec.builders/quill/guide/scanning/scan/

azureel4y ago

For anyone interested, there is also the https://libreflip.org/ website about a similar device.

dahart4y ago

> While there are some computer algorithms that can help dewarp the pages after capture, it is always more reliable to just capture flat pages in the first place.

I’m sure this is technically true, but curious how much it matters in practice today? Reading Google’s book scanning patents I found a description of a de-warper based on capturing a 3d depth scan of the book, which I assumed they were using in order to achieve the scale of scanning all books on earth. Capturing and de-warping a 3d depth scan would also be leagues more reliable than trying to do a purely 2d image based de-warp.

> The lights must also be positioned to minimize glare and reflections.

For my personal photo scanning and archiving project, I used a polarizing filter on the light and on the camera in order to eliminate specular glare, it works amazingly well. Would that be impractical, and/or not work as well on books for some reason?

usui4y ago

These kinds of discussions need more real examples to accurately depict the tradeoffs of destructive vs non-destructive scanning, so I'll add scans I personally made.

Here are two pages from Cracking the Coding Interview, 6th Edition, that I preferred over the digital versions I found online that were hard on the eyes because I disliked the black-and-white scans. Feel free to ask me about "details" in the process

https://imgur.com/2ZQFZ5p

It's entirely possible to accomplish post-processing without writing code if you have Adobe Photoshop.

I used a free-to-the-public bookscanner built by the Digital Archivists at Noisebridge in San Francisco to take pictures of all pages in my textbooks (it took a while). In Photoshop, you can record a macro to automatically crop to a rectangular region determined by just one or more points that are guaranteed to be on the page in every photo. The selection is made by the quick selection tool (selects similar pixels to the page color in the same region). With this macro recorded, you can run it in bulk through all files.

The textbook size was still large digitally (a gigabyte) because I wanted the highest quality possible for studying, but it beat having to carry heavy textbooks for sure. I also shared these files with friends and we were able to study without any physical textbooks for books that were not available digitally—it was amazing.

Personally I avoided all the deskewing technologies and preferred just pictures, all in color, as close to the real thing as possible, because Noisebridge's scanner used two DSLRs and the pictures were high quality. It was better than converting everything to black-and-white for reading enjoyability. OCR through ABBYY FineReader.

Overall it gets more annoying the thicker the textbook is. If destructive scanning is acceptable, one can just buy the book, go to FedEx and ask them to cut the spine off for $4 to convert it to loose-leaf, then run it through a document scanner such as the ScanSnap iX500, which is much faster at around 25 pages/min at its slowest.

One really cool feature about Noisebridge's scanner (picture below) was that you could view the camera's viewfinder live in real-time, thereby speeding up iteration and catching errors much faster

https://imgur.com/4Pkdp1j

tunesmith4y ago

Are any of you part of book scanner clubs that might have a database of word counts of famous fiction books? I've found several lists online but it's not a wide selection of books - I'd imagine book scanners might have more. I'd be happy to share the database I've cobbled together.

userbinator4y ago

I briefly participated in an eBookz scene group at the turn of the century, although we didn't keep track of any word counts (nor did we OCR) and we focused on non-fiction, mainly automotive repair manuals. I doubt it's a statistic that the scanners (people/groups) pay attention to.

totetsu4y ago

Last year I bought a czur book scanner that looks kind of like a lamp to try and archive some 100 year old books I had limited access to. The resolution of the camera was so low I ended up balancing my phone on top and getting better images just using it as a light.

ghaff4y ago

I find the latest Czur works well enough for non-glossy stuff. The broader scanning problem I find is that, beyond the small must scan category, I find I have so much stuff that generally it's not very practical.

ebr4him4y ago

The store seems to be down, any idea how much it costs?

nanna4y ago

Anyone have thoughts on the Easy Book Scanner design by David Landin?

https://www.instructables.com/Book-Scanner-Low-cost-easy-to-...

hackeyed4y ago

Looks incredibly serviceable and well engineered. I would expect reasonable and consistent results from the rig. The biggest question would come down to the cameras.

With these kind of rigs (two cameras, not computer controlled, no computer display) your big potential sources of error are either accidentally failing to trigger one camera or cameras losing focus on the page (especially if you are at something like the end of a chapter where there is often empty space in the middle of the page where the camera's auto-focus area is). His solution of using the IR remote should significantly reduce the issue of failing to capture on one camera. Cameras exist with manual focus settings, but they are often pricier or too old to reliably find one worth recommending to others. The CHDK alternative firmware for certain cheap Canon cameras generally adds a manual focus option for the less expensive cameras (though the individual features depend on who is making the firmware build you get).

Another option worth investigating is the newest Raspberry Pi camera modules with external lenses. Those should give you manual focus and the ability to build up an automated workflow you like around things like moving files around and any pre-processing you need. A roughly 9-megapixel camera gets you 300 dpi resolution on a full sheet of A4 paper, which is larger than most books.
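
The megapixel figure can be checked with a short calculation. This is a sketch that assumes the page fills the frame exactly and that A4 is 210 x 297 mm; `required_pixels` is a made-up helper name.

```shell
# required_pixels WIDTH_MM HEIGHT_MM DPI
# Prints the pixel width, pixel height and megapixel count needed to
# cover a page of the given size at the given resolution.
required_pixels() {
    awk -v w="$1" -v h="$2" -v dpi="$3" 'BEGIN {
        px_w = w / 25.4 * dpi      # 25.4 mm per inch
        px_h = h / 25.4 * dpi
        printf "%.0f %.0f %.1f\n", px_w, px_h, px_w * px_h / 1000000
    }'
}

# A4 at 300 dpi.
required_pixels 210 297 300
```

For A4 at 300 dpi this gives about 2480 x 3508 pixels, or roughly 8.7 megapixels. A real rig needs some headroom on top of that, since the page never fills the sensor exactly.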

daniel_reetz4y ago

A lot of people on the DIY Book Scanner forums got a lot of value out of this design and David's presence on the forum. In my opinion it's an excellent starting point.

Gedxx4y ago

Here is a homemade way to digitize a book with a compact camera: https://www.ikkaro.com/en/como-digitalizar-libro/

TaylorAlexander4y ago

Hello if anyone is in the Bay Area and has a book scanner I’d love to scan my copy of this book which was only printed in India in 2001 and seems relatively rare:

https://www.abebooks.com/9780140298246/Patents-Myths-Reality...

I did fill out the form for the internet archive but it talked about scanning a library and I’m not sure they want to deal with just one book.

daniel_reetz4y ago

NoiseBridge maintained a DIY Book Scanner for a long time.

TaylorAlexander4y ago

Oooh good tip thank you

mcguire4y ago

Are the kits back in stock?

I fooled around with the DIY option, but realized I was incompetent. Ended up buying a cheap Czur scanner, which works surprisingly well.

For it, you hold the book open on the black mat on a table. The scanner uses a laser to measure and correct page curvature, and takes a picture of both pages.

It produces decent PDFs (I'm not sure about the comparative resolution) with (bad) OCR'ed text. (The IA re-OCR's the book after upload, right?)

braincode4y ago

I'd love to see something like this made out of entirely recycled phones and their cameras instead of going with discrete components... any leads?

mattowen_uk4y ago

I've just built a basic overhead 'rostrum' type rig using some wood, screws, and an older Android phone.

The phone's camera points down at a height of about 20cm/8" and can see an area on the base plate big enough for the object I'm capturing (in this case, 3.5" floppy disks).

I use an app called IP Camera (it's on the Play store for free) to serve the image via HTTP. I then remotely grab it, process/crop it and store it. The project is in its early stages, but is working quite well so far.

ngold4y ago

I have most of the entire collection of hardback national geographics from 1930 to 1970. Wonder how legal it would be to scan them. Always wondered.

jccalhoun4y ago

Unless you really want to do it for fun, they have almost certainly been scanned and put online by someone else. Archive.org has several: https://archive.org/search.php?query=national%20geographic

bcaa7f3a8bbcOP4y ago

> from 1930

Good news: It's very likely that the copyright has expired. If you were to scan them, remember to upload them to archive.org for everyone else to see.

Bad news: It's only the case if the copyright hasn't been renewed by the owner. Usually most owners don't renew them, but to determine whether or not this is the case, you need to go through huge catalogs of registered entries from the U.S. copyright office.

ahi4y ago

Incorrect. It is almost certainly still in copyright in the United States. Anything published after 1925 will be in copyright, except for works published without notice in 1926-1977 or works from 1926-1963 whose copyright was not renewed. The exceptions almost certainly don't apply to NatGeo.

Scanning typically falls under fair use, so copyright only applies to distribution of the scans.

edit: Maybe you edited or maybe I'm just dumb. Anyway, the problem with relying on a lack of renewal is that you have to prove a negative. NYPL among others have been doing interesting work on this problem: https://www.nypl.org/blog/2019/09/01/historical-copyright-re...
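The rule of thumb stated in this comment can be written down as a small decision function. This is only a sketch of the comment's own wording as it stood when the thread was written: the function name and the yes/no arguments are invented for illustration, it ignores every other wrinkle of US copyright law, and it is not legal advice.

```shell
# Rough encoding of the rule in the comment above.
# Arguments: publication year, published with notice (yes/no), renewed (yes/no).
us_copyright_guess() {
    year=$1 notice=$2 renewed=$3
    if [ "$year" -le 1925 ]; then
        echo "public domain"          # published 1925 or earlier
    elif [ "$year" -le 1977 ] && [ "$notice" = no ]; then
        echo "public domain"          # 1926-1977, published without notice
    elif [ "$year" -le 1963 ] && [ "$renewed" = no ]; then
        echo "public domain"          # 1926-1963, never renewed
    else
        echo "likely in copyright"
    fi
}

us_copyright_guess 1930 yes yes   # prints: likely in copyright
```

The hard part, as the comment notes, is not the rule but establishing the inputs: showing that a work was never renewed means proving a negative against the renewal records.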

dredmorbius4y ago

Under US law, between 1925 and 1964, it was necessary to renew copyright every 28 years.[1] There are many works which have fallen out of copyright for nonrenewal, and there are projects which are now going through copyright renewal records to determine which of these are in fact now public domain in the US. Roughly 3/4 of potentially copyrighted works (published since 1924 and initially registered) have proved to be public domain.

Once copyright has lapsed, it cannot be reinstated.

https://www.crummy.com/2019/07/22/0

https://www.nypl.org/blog/2019/05/31/us-copyright-history-19...

________________________________

Notes:

1. Technically, the obligation existed until the 1976 act, at which time copyright status was automatic, but was retroactively waived for works never having lapsed, back to 1964, in the 1992 act. It's complicated. (NYPL link above.)

salamandersauce4y ago

It's just as legal as ripping your CDs into MP3s.

Topgamer74y ago

I've used Scan Tailor in the past to convert an outboard motor manual to PDF, and it's pretty powerful. I didn't have a proper setup, but my results still came out decently.

https://github.com/4lex4/scantailor-advanced

fiftyacorn4y ago

I remember reading about Larry Page spending time developing a book scanner using a scanner and a hoover to turn pages

fortran774y ago

I built a similar one of these from a kit that Dan Reetz made. (Technology has improved since I built mine.)

I have eliminated most printed books. I had to pass a "psychological barrier" before I was able to discard the books I scanned.

The last holdout was music scores, but I now use an iPad for music at the piano.

daniel_reetz4y ago

Thanks for buying and building a kit. I appreciated everyone who did that so much.

Finnotesorg4y ago

I also want this.

jbergens4y ago

The link http://store.diybookscanner.org/ goes to a shop page that is not configured yet.

Topgamer74y ago

I remember reading about Google's book scanner that operated automatically using vacuum pressure to gently flip pages. I'd love to see an open source variety of that.

hellbannedguy4y ago

I think about Google's scanner project a lot. I wonder if they are still scanning?

I would love it if they found a way to offer those books (authors would have to agree) to the world.

If they did, I would forgive them for tracking me all these years.

I'd even put up with ads in the books.

david_allison4y ago

They do if the books are out of copyright, and they're often copied to archive.org (and then transcribed via Wikisource) when this happens.

sample: https://books.google.im/books?id=Me8CAAAAQAAJ&printsec=front...

floathub4y ago

And you can order a bound paperback book made from the scans of just about any out-of-copyright google book. From the Harvard Bookstore (or anyone else that has an Espresso Book Machine):

https://www.ondemandbooks.com/as/?t=Hamlet&c=google

supernova87a4y ago

I don't know where to find a picture of it, but I remember seeing a video of one of their book scanners. Or maybe it was just one version of them.

Picture a wedge (shaped like the V of a book lying open on a table) that could move up and down. The wedge would descend and insert itself into the V of the book's pages, then rising up would suck the 2 left and right sheets against the sides of the wedge, scanning as it went.

Then it would blow those 2 pages to the left (or right I forget) and descend again and do the next set.

I thought it was pretty cool. Have never seen something like it since.

markvdb4y ago

Automated page turning is an incredibly complex problem to tackle. It won't gain you as much as you'd think either. "Book" and "page" are surprisingly difficult categories to define.

If my experience helping to build the diybookscanner.org project taught me anything, it's that picking the low hanging fruit of small efficiency improvements to the (semi-)manual process is so much more effective...

donalhunt4y ago

Which is probably why some of the scanning locations used something very similar to this DIY effort (I'm familiar with the Oxford, UK location that existed in the mid 00s). Humans turned the pages with the finger tips mentioned in another comment.

dredmorbius4y ago

Google Linear Book Scanner

https://yewtu.be/watch?v=7MNqINDm1lk

rckoepke4y ago

ggm4y ago

UCL-CS had one of these which was deployed in conjunction with the British Library. This is when high pixel count CCDs were super expensive back in the 1980s. Amazing device.

indiantinker4y ago

Nice! A foot-pedal can improve his over-all efficiency and reduce lower back and neck pain.

failwhaleshark4y ago

I need this for some vintage IBM/PC-compatible programming books that are a zillion pages long.

subpar4y ago

subset4y ago

tablespoon4y ago

> And if you mis-scan a page accidentally (some of the book margins are super tight), then you have to restart the entire scan - there's no delete page button!

Sounds like you need a license for Adobe Acrobat Pro or some other application that will let you reshuffle/insert pages.

0wis4y ago

pbhjpbhj4y ago

Books that aren't rare, ie aren't valuable as artefacts, you would surely cut off the spine and run through an automated scanner?

But then medical texts probably cost way more than $80. How much was your boss making from those scans? Were they taking account of copyright law?

raybb4y ago

Wow that's awesome! I take it you're responsible for a chunk of the books available now on openlibrary.org?

When scanning books like that did you ever see anything interesting or are you so zoned out you don't really pay attention?

subpar4y ago

mymythisisthis4y ago

Nice old batteries https://ia800608.us.archive.org/BookReader/BookReaderImages....

bobbytakeitwith4y ago

Back when this was a more popular problem, I saw a number of projects that used a rubber tipped stick to automate page turning.

I wonder why this never took off?

pfp4y ago

I'd really like to see this too.

Building a scanner would be interesting, but the mind-numbing idea of turning pages manually isn't very enticing

axiosgunnar4y ago

Were you using gloves or something like that?

david_allison4y ago

[0] https://www.nationaltrust.org.uk/features/why-wearing-gloves...

[1] https://daily.jstor.org/some-books-can-kill/

[2] https://www.ifla.org/node/93094

thrdbndndn4y ago

I get the points against gloves in your links, but in my experience my hands will constantly leave fingerprints and oil on the books no matter how much I wash them.

How do they avoid this issue? Or just don't bother?

subpar4y ago

l00sed4y ago

That sounds incredible

zwayhowder4y ago

I built one of these out of pine 2x4s and plywood. I thought it would be cheaper than buying one (I was wrong) but I'm also not a skilled woodworker and had to buy most of the tools.

I replaced it with a Czur scanner that isn't as good, but is a lot smaller and is good enough for my less demanding needs now that I'm not doing a masters degree :D

timeinput4y ago

zwayhowder4y ago

It let me capture the pages with the correct orientation, and the cameras have a fixed focus on the platen, so it works really well. Then Scan Tailor can crop automagically and deal with the rest.

SamBam4y ago

What was the purpose? Did the digitized books go to the public domain/your university, or something, or was it purely for personal research?

zwayhowder4y ago

Just for personal use as all the books were still in copyright and I own the paper versions, it was a (probably legal) fair use of them purely for reference while studying.

I often needed to find information in the books and couldn't reasonably carry them all with me every day between work and uni.

pbhjpbhj4y ago

In my opinion, buying a book (or other media) should give one right to a digital copy from any source.

flakiness4y ago

In Japan, meanwhile, people in the book scanning community (it exists) often just cut off the book's spine, scan the pages with a normal scanner, and throw the book away once all the pages are scanned.

[1] http://monomania.sblo.jp/article/60578693.html

wpietri4y ago

flakiness4y ago

Similar scanning services exist in Japan as well (e.g. [1]). The difference is that they kindly discard the books for you once they have been scanned!

[1] https://www.bookscan.co.jp/

SolonIslandus4y ago

Something like this?

https://1dollarscan.com/sp/

wintermutestwin4y ago

$.01 per page and they are based in Fremont, CA.

This looks like a cheaper way to get most ebooks and you can ship the books direct from Amazon. If enough people did this, maybe ebook pricing will come down to something rational.

mattowen_uk4y ago

For a while there existed services that did this with your CD collection. You'd send them a crate of your music CDs and they'd send back all the music ripped as mp3s.

Obviously, they didn't re-rip music they'd already ripped so technically for popular music, you got 'someone else's' mp3s.

---

pbhjpbhj4y ago

Disclaimer: This is my personal opinion; not legal advice.

_virtu4y ago

After one semester the setup paid for itself. I would supplement some texts with learning trackers like bitme, before amazing resources like libgen came around. Good times.

daniel_reetz4y ago

Thanks for sharing your story, I'm so glad to hear it was useful to you and your crew. Were you ever active on the forums?

fernly4y ago

Edit: "Finishing a book" is discussed at a very superficial level here: https://vimeo.com/user33752051 at about 1:00:

BeetleB4y ago

I believe he abandoned the project some years later as life got busy and he never found enough volunteers to help him.

Would have to go through my email records to find the name of the project.

markvdb4y ago

Spreads[0] is probably what you are referring to. Some backstory:

Then real life intervened indeed.

Johannes, if you read this, I'm still grateful for the experience of having worked with a great developer like you!

[EDIT] And of course, I should also mention Dan Reetz' incredibly inspiring work bootstrapping an incredible open hardware project! Hats off!

[0] https://github.com/DIYBookScanner/spreads

[1] https://danreetz.com/

[2] https://chdk.fandom.com/wiki/CHDK

[3] http://diybookscanner.eu

daniel_reetz4y ago

BeetleB4y ago

Actually, it was a project called Paper Upgrade. Here is an old archive link:

http://web.archive.org/web/20140101000000*/http://www.paperu...

I don't know if you can find the code through there, but I'm pretty sure he had made it free. I think spreads is a bit newer.

Edit: Found some more info. It did indeed use Scantailor in the backend. His SW was more of a Web based frontend to all the parts. You can see a video demo of it here:

https://www.youtube.com/watch?v=Ad7aFYdbDos

Start at about 4:40.

The source is here:

https://code.google.com/archive/p/diy-ebook-creator/

fernly4y ago

Not the same as Scan Tailor[1,2] ? Which was referenced from the Instructables link cited earlier. That apparently was a comprehensive toolkit in C++ and Qt, now archived.

[1] https://scantailor.org/

[2] https://web.archive.org/web/20210304015939/https://github.co...

jccalhoun4y ago

There are a couple of folks who forked Scan Tailor. I'm not sure of the status of those forks. Here are a couple: https://github.com/4lex4/scantailor-advanced https://github.com/trufanov-nok/scantailor-universal

hackeyed4y ago

I wrote up a guide on the authoring process using FOSS tools for some Digital Humanities folks a couple years ago: https://github.com/wikey/bookscan

It gives some background on the problem and covers a Scantailor (page crop, rotate, deskew), pdfbeads (compression, book metadata) authoring workflow, with pdftk for some general odds and ends.

jccalhoun4y ago

Mediterraneo104y ago

walrus014y ago

It is not clear to me to what extent the existing workflow de-skews the camera images to deal with page curvature towards the spine.

I think I recall the Internet Archive having an open source design for something similar to this? And other projects which accomplish generally the same idea.

vixen994y ago

No, I have no connection with Czur - just an enthusiastic user!

ajot4y ago

This is an old article, so maybe some software isn't the best option nowadays, but you can get the idea of postprocessing: http://natecraun.net/articles/linux-guide-to-book-scanning.h...

rahimnathwani4y ago

https://www.instructables.com/Bargain-Price-Book-Scanner-Fro...

"Step 10 - Post-processing" has some steps

sandeep13384y ago

Video looks interesting, I'll check it out!

david_allison4y ago

I'm very interested in getting into archival (getting started this month after a few more conversations).

Your buy button[0] is broken. You're potentially missing out on a few sales due to this.

Is 2x 4GB SD card sufficient for your purposes? I've been quoted 50MB TIFF images as a standard, and a lot of books wouldn't fit without swapping SDs at that size.

[0] http://store.diybookscanner.org/
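For a rough sense of the numbers in this question, the capacity works out as below. This is a back-of-the-envelope sketch only: the function name is made up, 4 GB is taken as 4000 MB, and a real card holds somewhat less once formatted.

```shell
# Whole images that fit on one card.
# Arguments: card size (MB), size of one image (MB).
images_per_card() {
    echo $(( $1 / $2 ))
}

images_per_card 4000 50   # prints 80
```

With one card per camera and one page per shot, two cards give about 160 page images at that file size, so any longer book would indeed need a card swap or a different storage path.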

zwayhowder4y ago

If you use pi-scan the images are saved to a USB drive instead of the SDcards.

fernly4y ago

archiving what? just curious.

david_allison4y ago

Many manuscripts aren't digitized at all, or are available and need transcription.

[0] https://en.wikipedia.org/wiki/Manx_language#Revival

[1] https://en.wikipedia.org/wiki/Part-of-speech_tagging

EDIT: More than happy to talk in depth about this if anyone wants, via comments, or email on my profile.

timonoko4y ago

    while true ; do
     for x in *.pnm ; do
      killall display
      display -rotate 90 "$x" &
     done
     sleep 5
    done

timonoko4y ago

Nobody asked, but for the record, this is how to make a real one-page PDF from two-page scans. (gm = GraphicsMagick)

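    # Split each two-page scan into two single-page JPEGs, then bundle them into a PDF.
    mkdir -p kaksi
    rm -f kaksi/*
    scale=600       # value passed to -resize
    size=500x730    # crop size of one page
    yla=27          # top offset of the crop
    # First pass: the page at x offset 20 gets even numbers (k102, k104, ...).
    j=102
    for x in *.pnm ; do
      echo "$x"
      gm convert "$x" -rotate 90 -crop "$size+20+$yla" -resize "$scale" -normalize "kaksi/k$j.jpg"
      j=$((j+2))
    done
    # Second pass: the page at x offset 530 gets odd numbers (k103, k105, ...).
    j=103
    for x in *.pnm ; do
      echo "$x"
      gm convert "$x" -rotate 90 -crop "$size+530+$yla" -resize "$scale" -normalize "kaksi/k$j.jpg"
      j=$((j+2))
    done
    cd kaksi
    # Starting at 102 keeps the numbers three digits wide, so *.jpg expands in page order.
    gm convert *.jpg -format pdf TheBook.pdf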

pflanze4y ago

    lastfile=
    while true; do
        newestfile=$(ls *.pnm | tail -1)
        if [ "$newestfile" != "$lastfile" ]; then
            kill %
            display -rotate 90 "$newestfile" &
            lastfile=$newestfile
        fi
        sleep 0.3
    done

timonoko4y ago

"Saving precious bits like it is 1969".

This would make an excellent song title.

pflanze4y ago

shard4y ago

This really needs to be redesigned for ergonomics.

- Lever should have a button for capture

- Display should be visible while looking down

markvdb4y ago

Display should be visible indeed.

Destructive scanning feels incredibly cruel to the books. A non-destructive system like this actually works fairly well. You can expect to get up to about 1000-1200 pages an hour with it.
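To put that throughput in terms of a single book, here is a quick sketch. The function name and the 300-page book are illustrative assumptions, and the result ignores setup time and re-shoots.

```shell
# Minutes to capture a book at a given rate (integer arithmetic).
# Arguments: number of pages, pages per hour.
minutes_for_book() {
    echo $(( $1 * 60 / $2 ))
}

minutes_for_book 300 1000   # prints 18
minutes_for_book 300 1200   # prints 15
```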

shard4y ago

> Destructive scanning feels incredibly cruel to the books.

BeetleB4y ago

> For any non-rare books, this is just too tedious and time consuming to go through for more than a few books.

If your goal is to scan a whole bunch, it's tedious. If you want to do it once in a while, it's not really a problem.

dang4y ago

One past thread, a long time ago:

DIY book scanning - https://news.ycombinator.com/item?id=991897 - Dec 2009 (7 comments)

djoldman4y ago

Here it is in action:

http://tenrec.builders/quill/guide/scanning/scan/

azureel4y ago

For anyone interested, there is also the https://libreflip.org/ website about a similar device.

dahart4y ago

> While there are some computer algorithms that can help dewarp the pages after capture, it is always more reliable to just capture flat pages in the first place.

> The lights must also be positioned to minimize glare and reflections.

usui4y ago

These kinds of discussions need more real examples to accurately depict the tradeoffs of destructive vs non-destructive scanning, so I'll add scans I personally made.

https://imgur.com/2ZQFZ5p

It's entirely possible to accomplish post-processing without writing code if you have Adobe Photoshop.

One really cool feature of Noisebridge's scanner (picture below) was that you could view the camera's viewfinder live, which sped up iteration and let you catch errors much faster.

https://imgur.com/4Pkdp1j

tunesmith4y ago

userbinator4y ago

totetsu4y ago

ghaff4y ago

ebr4him4y ago

The store seems to be down, any idea how much it costs?

nanna4y ago

Anyone have thoughts on the Easy Book Scanner design by David Landin?

https://www.instructables.com/Book-Scanner-Low-cost-easy-to-...

hackeyed4y ago

Looks incredibly serviceable and well engineered. I would expect reasonable and consistent results from the rig. The biggest question would come down to the cameras.
