Other than locating the books, by far the most tedious aspect is the scanning. We only have a terrible flatbed scanner that is completely unforgiving: it has a 25-page email limit, so anything longer has to be split into separate emails. And if you accidentally mis-scan a page (some of the book margins are super tight), you have to restart the entire scan, since there's no delete-page button!
Sounds like you need a license for Adobe Acrobat Pro or some other application that will let you reshuffle/insert pages.
But then medical texts probably cost way more than $80. How much was your boss making from those scans? Were they taking account of copyright law?
When scanning books like that did you ever see anything interesting or are you so zoned out you don't really pay attention?
I wonder why this never took off?
Building a scanner would be interesting, but the mind-numbing idea of turning pages manually isn't very enticing
[0] https://www.nationaltrust.org.uk/features/why-wearing-gloves...
It works quite well and I digitised dozens of textbooks I'd purchased and needed to reference but couldn't carry around every day while finishing my masters. My one had 2 Nikon mirrorless cameras controlled via Pi-Scan. https://github.com/Tenrec-Builders/pi-scan
I had a smaller toggle switch wired to the GPIO pins so I could click the scan next button without having to take my hands off the book. Once I got used to the workflow I could scan about 1000 pages per hour while watching Netflix.
I replaced it with a Czur scanner that isn't as good, but is a lot smaller and is good enough for my less demanding needs now that I'm not doing a masters degree :D
I often needed to find information in the books and couldn't reasonably carry them all with me every day between work and uni.
People there (rightly) value room space more than books. It's called "Ji-sui" (scanning by oneself), and gear recommendation sites like [1] are abundant. Another reason for the prevalence of "Ji-sui" was the poor availability of ebooks, although that reason is less relevant today.
Obviously, they didn't re-rip music they'd already ripped so technically for popular music, you got 'someone else's' mp3s.
---
My multi function laser printer has a duplex scanner on it. It can scan pages at quite a rate. The problem is not the scanning, but the accurate OCRing, and for things like magazines, the storage of all those high resolution pages. Right now, I cut out the articles I want to keep from my monthly mags, and just scan those. It seems like a fair compromise right now.
I would usually be the guy that made an email group for everyone to share notes and questions for classes, before all of the Blackboard garbage, so I started leveraging those connections and would ask if anyone would let me borrow their book in return for a scanned version. My friends and I would have a book scanning party and help scan each other's books. We’d grab some drinks, find some favorite albums and hang out all night until the wee hours taking turns scanning texts.
After one semester the setup paid for itself. I would supplement some texts with learning trackers like bitme, before amazing resources like libgen came around. Good times.
Edit: "Finishing a book" is discussed at a very superficial level here: https://vimeo.com/user33752051 at about 1:00:
"In order to turn these raw images into an ebook, the very minimum you need to do is A, you need to rotate them, B you need to crop them down to use the page [?], and C you need to combine them into one document like a PDF... You can do OCR to make it searchable ... color correction... de-skewing, de-warping ..."
I believe he abandoned the project some years later as life got busy and he never found enough volunteers to help him.
Would have to go through my email records to find the name of the project.
I saw the diybookscanner community, which at that point mostly had Daniel Reetz [1] as its active contributor, struggle with mechanical contraptions for triggering cameras, with very little software experience. I built a simple proof of concept to reliably trigger cheap consumer cameras using software. I built it on CHDK[2], the Canon Hack Development Kit, alternative firmware for cheap consumer cameras. The proof of concept worked.
I then had a fairly large number of book scanner kits built and shipped, mostly around the EU [3]. More of a work of love than a business really, even if it was formally under an LLC umbrella. Johannes initially was just a customer. He wanted to build a better software solution, and within the spirit of the project did so as free software. I tried to support him in this as well as I could: setting up build infrastructure, trying to reel in more people, getting him some cameras to test, getting the amazing CHDK people to port to new camera models, ...
Then real life intervened indeed.
Johannes, if you read this, I'm still grateful for the experience of having worked with a great developer like you!
[EDIT] And of course, I should also mention Dan Reetz' incredibly inspiring work bootstrapping an incredible open hardware project! Hats off!
[0] https://github.com/DIYBookScanner/spreads
[2] https://web.archive.org/web/20210304015939/https://github.co...
I wrote up a guide on the authoring process using FOSS tools for some Digital Humanities folks a couple years ago: https://github.com/wikey/bookscan
It gives some background on the problem and covers a Scantailor (page crop, rotate, deskew), pdfbeads (compression, book metadata) authoring workflow, with pdftk for some general odds and ends.
I am not clear to what extent the existing workflow de-skews the camera images to deal with page curvature towards the spine.
I think I recall the Internet Archive having an open source design for something similar to this? And other projects which accomplish generally the same idea.
No, I have no connection with Czur - just an enthusiastic user!
"Step 10 - Post-processing" has some steps
Your buy button[0] is broken. You're potentially missing out on a few sales due to this.
Is 2x 4GB SD card sufficient for your purposes? I've been quoted 50MB TIFF images as a standard, and a lot of books wouldn't fit without swapping SDs at that size.
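For a rough sense of scale, here is the arithmetic as a minimal shell sketch. It assumes the quoted 50MB per TIFF, treats a 4GB card as 4000MB, and assumes (this is my guess about the rig, not something stated above) that each of the two cameras writes one page per shot to its own card.

```shell
# Assumed figures: 4 GB cards (taken as 4000 MB) and 50 MB per TIFF.
card_mb=4000
image_mb=50

# Images that fit on a single card.
per_card=$(( card_mb / image_mb ))

# Assumption: two cameras, one card each, one page per image.
pages_total=$(( 2 * per_card ))

echo "$per_card images per card, $pages_total pages across both cards"
```

Under those assumptions a full pair of cards covers 160 pages, so anything longer would mean swapping or offloading cards mid-book.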
Most of the printed books are scattered but available. It's akin to an iceberg, though: there's a significant amount of 'submerged' knowledge about the language in written manuscripts and recorded audio, and this is where a lot of the value comes from. Printed texts are primarily religious, and getting the colloquial usages of words and phrases is very useful.
Many manuscripts aren't digitized at all, or are available and need transcription.
The language is relatively well-recorded (dating back to at least the late 16th century in written form), and yet small enough that a comprehensive reference is viable: estimates of about 5MM words crop up, but even 3x could easily fit in memory on a Digital Ocean droplet, even if fully POS tagged[1]. Texts are also mostly in the public domain, and there's a lot of bilingual texts (which act as a Rosetta Stone).
[0] https://en.wikipedia.org/wiki/Manx_language#Revival
[1] https://en.wikipedia.org/wiki/Part-of-speech_tagging
EDIT: More than happy to talk in depth about this if anyone wants, via comments, or email on my profile.
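  # Preview loop: every 5 seconds, redisplay the scans rotated by 90 degrees
  # (the last file in the directory is the one left on screen).
  while true ; do
    for x in *.pnm ; do
      killall display
      display -rotate 90 "$x" &
    done
    sleep 5
  done

  # Split each two-page scan into left and right pages, then build a PDF.
  mkdir -p kaksi
  rm -f kaksi/*
  j=102            # even numbers: left-hand pages
  scale=600
  size=500x730     # crop box, WIDTHxHEIGHT
  yla=27           # top (y) offset of the crop box
  for x in *.pnm ; do
    echo "$x"
    gm convert "$x" -rotate 90 -crop "$size+20+$yla" -resize "$scale" -normalize "kaksi/k$j.jpg"
    j=$((j+2))
  done
  j=103            # odd numbers: right-hand pages
  for x in *.pnm ; do
    echo "$x"
    gm convert "$x" -rotate 90 -crop "$size+530+$yla" -resize "$scale" -normalize "kaksi/k$j.jpg"
    j=$((j+2))
  done
  cd kaksi
  gm convert *.jpg -format pdf TheBook.pdf

  # Preview loop, alternative: only redisplay when a new scan appears.
  lastfile=
  while true; do
    newestfile=$(ls *.pnm | tail -1)
    if [ "$newestfile" != "$lastfile" ]; then
      kill % 2>/dev/null    # close the previous preview, if any
      display -rotate 90 "$newestfile" &
      lastfile=$newestfile
    fi
    sleep 0.3
  done

This would make an excellent song title.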
- Lever should have a button for capture
- Display should be visible while looking down
But now I see why destructive scanning (slicing the binding off and using a sheet feeding scanner) is so attractive. For any non-rare books, this is just too tedious and time consuming to go through for more than a few books.
Your capture triggering suggestion is not as great though. The systems that I shipped with http://diybookscanner.eu actually used a USB foot pedal for triggering the cameras. That's by far a superior user experience to pressing a button while both hands are busy moving a cradle...
Destructive scanning feels incredibly cruel to the books. A non-destructive system like this actually works fairly well. You can expect to get up to about 1000-1200 pages an hour with it.
I suppose it depends on whether it has sentimental value. When I was young, I'd treat my books like treasures, putting covers on them (even paperbacks), making sure I didn't crease the spine when I read them. Now I consider books to be a temporary store of knowledge as the contents pass to my brain. I fold pages, underline, scribble notes in them. There are thousands more copies out there, I don't feel any need to baby my copy.
If your goal is to scan a whole bunch, it's tedious. If you want to do it once in a while, it's not really a problem.
DIY book scanning - https://news.ycombinator.com/item?id=991897 - Dec 2009 (7 comments)
I’m sure this is technically true, but curious how much it matters in practice today? Reading Google’s book scanning patents I found a description of a de-warper based on capturing a 3d depth scan of the book, which I assumed they were using in order to achieve the scale of scanning all books on earth. Capturing and de-warping a 3d depth scan would also be leagues more reliable than trying to do a purely 2d image based de-warp.
> The lights must also be positioned to minimize glare and reflections.
For my personal photo scanning and archiving project, I used a polarizing filter on the light and on the camera in order to eliminate specular glare, it works amazingly well. Would that be impractical, and/or not work as well on books for some reason?
Here are two pages from Cracking the Coding Interview, 6th Edition, that I preferred over the digital versions I found online that were hard on the eyes because I disliked the black-and-white scans. Feel free to ask me about "details" in the process
It's entirely possible to accomplish post-processing without writing code if you have Adobe Photoshop.
I used a free-to-the-public bookscanner built by the Digital Archivists at Noisebridge in San Francisco to take pictures of all pages in my textbooks (it took a while). In Photoshop, you can record a macro to automatically crop to a rectangular region determined by just one or more points that are guaranteed to be on the page in every photo. The selection is made by the quick selection tool (selects similar pixels to the page color in the same region). With this macro recorded, you can run it in bulk through all files.
The textbook size was still large digitally (a gigabyte) because I wanted the highest quality possible for studying, but it beat having to carry heavy textbooks for sure. I also shared these files with friends and we were able to study without any physical textbooks for books that were not available digitally—it was amazing.
Personally I avoided all the deskewing technologies and preferred just pictures, all in color, as close to the real thing as possible, because Noisebridge's scanner used two DSLRs and the pictures were high quality. It was better than converting everything to black-and-white for reading enjoyability. OCR through ABBYY FineReader.
Overall it gets more annoying the thicker the textbook is. If destructive scanning is acceptable, one can just buy the book, go to FedEx and ask them to cut the spine off for $4 to convert it to loose-leaf, then run it through a document scanner such as the ScanSnap ix500, which is much faster at around 25 pages/min at its slowest.
One really cool feature about Noisebridge's scanner (picture below) was that you could view the camera's viewfinder live in real-time, thereby speeding up iteration and catching errors much faster
https://www.instructables.com/Book-Scanner-Low-cost-easy-to-...
With this kind of rig (two cameras, not computer controlled, no computer display) your big potential sources of error are either accidentally failing to trigger one camera or a camera losing focus on the page (especially at something like the end of a chapter, where there is often empty space in the middle of the page where the camera's auto-focus area is). His solution of using the IR remote should significantly reduce the issue of failing to capture on one camera. Cameras exist with manual focus settings, but they are often pricier or too old to reliably find one worth recommending to others. The CHDK alternative firmware for certain cheap Canon cameras generally adds a manual focus option for the less expensive cameras (though the individual features depend on who is making the firmware build you get).
Another option worth investigating is the newest Raspberry Pi camera modules with external lenses. Those should give you manual focus and the ability to build up an automated workflow you like around things like moving files around and any pre-processing you need. An ~9 mega pixel camera gets you 300dpi resolution on a full sheet of A4 paper, which is a lot more than most books.
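The ~9 megapixel figure is easy to check. A minimal sketch, using the standard A4 size (210 mm x 297 mm) and 25.4 mm per inch; the function name `a4_pixels` is just for illustration:

```shell
# a4_pixels DPI: print the pixel dimensions and megapixel count needed
# to capture a full A4 sheet (210 mm x 297 mm) at the given resolution.
a4_pixels() {
  local dpi=$1 w h px tenths
  w=$(( 210 * dpi * 10 / 254 ))     # mm -> inches -> pixels, integer maths
  h=$(( 297 * dpi * 10 / 254 ))
  px=$(( w * h ))
  # Round to one decimal place of megapixels without floating point.
  tenths=$(( (px + 50000) / 100000 ))
  echo "${w} x ${h} px = $(( tenths / 10 )).$(( tenths % 10 )) MP"
}

a4_pixels 300
```

At 300dpi that comes out a little under 9 MP, and this assumes the sheet fills the frame exactly, so a real rig needs some margin on top.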
https://www.abebooks.com/9780140298246/Patents-Myths-Reality...
I did fill out the form for the internet archive but it talked about scanning a library and I’m not sure they want to deal with just one book.
I fooled around with the DIY option, but realized I was incompetent. Ended up buying a cheap Czur scanner, which works surprisingly well.
For it, you hold the book open on the black mat on a table. The scanner uses a laser to measure and correct page curvature, and takes a picture of both pages.
It produces decent PDFs (I'm not sure about the comparative resolution) with (bad) OCR'ed text. (The IA re-OCRs the book after upload, right?)
The phone's camera points down at a height of about 20cm/8" and can see an area on the base plate big enough for the object I'm capturing (in this case, 3.5" floppy disks).
I use an app called IP Camera (it's on the Play store for free) to serve the image via http. I then remotely grab it, process/crop it and store it. The project is in its early stages, but is working quite well so far.
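A rough sketch of what a grab/crop/store step like that could look like in shell, using curl and GraphicsMagick. The URL, crop box and file names here are all hypothetical placeholders, not what the app or my setup actually uses, so substitute your own. Setting DRY_RUN=1 just prints the commands instead of running them.

```shell
# Hypothetical endpoint: use whatever URL your camera app actually serves.
CAMERA_URL="${CAMERA_URL:-http://192.168.1.50:8080/shot.jpg}"
# Hypothetical crop box (WIDTHxHEIGHT+X+Y) around the object on the base plate.
CROP="${CROP:-1200x1200+400+300}"

# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# capture N: fetch one frame and store a cropped copy as capture_NNN.jpg.
capture() {
  local n
  n=$(printf '%03d' "$1")
  run curl -fsS -o "raw_${n}.jpg" "$CAMERA_URL" &&
    run gm convert "raw_${n}.jpg" -crop "$CROP" "capture_${n}.jpg"
}
```

Calling `capture 7` would then leave `raw_007.jpg` and `capture_007.jpg` in the current directory; keeping the raw frame around makes it cheap to re-crop later if the box turns out to be off.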
Good news: It's very likely that the copyright has expired. If you were to scan them, remember to upload them to archive.org for everyone else to see.
Bad news: It's only the case if the copyright hasn't been renewed by the owner. Usually most owners don't renew them, but to determine whether or not this is the case, you need to go through huge catalogs of registered entries from the U.S. copyright office.
Scanning typically falls under fair use, so copyright only applies to distribution of the scans.
edit: Maybe you edited or maybe I'm just dumb. Anyway, the problem with relying on a lack of renewal is that you have to prove a negative. NYPL among others have been doing interesting work on this problem: https://www.nypl.org/blog/2019/09/01/historical-copyright-re...
I have eliminated most printed books. I had to pass a "psychological barrier" before I was able to discard the books I scanned.
The last holdout was music scores, but I now use an iPad for music at the piano.
I would love if they found a way to offer those books (authors would have to agree) to the world.
If they did, I would forgive them for tracking me all these years.
I'd even put up with ads in the books.
sample: https://books.google.im/books?id=Me8CAAAAQAAJ&printsec=front...
Picture a wedge (shaped like the V of a book lying open on a table) that could move up and down. The wedge would descend and insert itself into the V of the book's pages, then rising up would suck the 2 left and right sheets against the sides of the wedge, scanning as it went.
Then it would blow those 2 pages to the left (or right, I forget), descend again and do the next set.
I thought it was pretty cool. Have never seen something like it since.
If my experience helping to build the diybookscanner.org project taught me anything, it's that picking the low hanging fruit of small efficiency improvements to the (semi-)manual process is so much more effective...