Digitising books
20 Apr 2013
Having acquired a new document scanner, I chomped through most of the paper in my life, scanning receipts, letters from the bank and so on. It took a few hours.
To really put my scanner through its paces, I wanted to digitise a few books. Here are a few thoughts and pointers for future reference.
- It’s particularly useful to digitise reference books. This is a matter of opinion, but I think they are better suited to the illuminated-screen, random-access type of reading/research that things like iPads are so good at.
- You need to get the pages out. I’ve read of all sorts of ways of doing this. Circular saws look good fun! If you don’t have one, and live near a Mail Boxes Etc., go in there and ask them to guillotine the spine off. Other stationery shops may do this too, but the friendly staff at the Cambridge branch don’t even charge for it. I don’t know how long that’ll last…
- Chop off as little as possible when removing the spine. This helps avoid cut-off lines of text and makes rebinding easier, if you choose to do that.
- After you’ve got the pages out, flick through them to make sure they really are separated. Sometimes you will find a few pages still ‘glued’ together.
- Scan one or two pages as a way to get the right settings for resolution/compression/scan workflow. Zoom in, check OCR will succeed, and do a back-of-envelope calculation so you can strike your preferred balance between image resolution, compression level and file size. 150 dpi seems fine to me, but I choose large file size over compression artefacts.
- Start with a small book - large ones aren’t necessarily harder, but if you make a mistake, you’ve wasted less time.
- Does your book rely on double page spreads? If so, see if your software will join them up for you. I haven’t found this yet in ScanSnap. I made the mistake of scanning a whole book and then had to use the supplied “Page merger” tool on tens of double page spreads. Tedious!
- You can’t put every page into the scanner at once, so do ‘em in batches. Look for the “continue scanning” option if using ScanSnap software. At the scanning (“grab”) stage, I chose not to do OCR. It takes time, so save it until the end, in case you make a mistake on the way.
- If your scanner automatically discards blank pages, consider disabling this feature (see note about page numbers below).
- Stitching together PDFs digitally is possible but fiddly. Macs have a Python script to help: "/System/Library/Automator/Combine PDF Pages.action/Contents/Resources/join.py"
- You’ll have to scan the front cover separately (it is usually too stiff to go through an auto feeder). You can use Preview (on a Mac) to put it at the front of your document.
- Remember to set metadata like title, author(s) in the PDF file.
- Consider the page labels. Usually the visible page number will not match the electronic page number, so finding content based on page number by hitting ⌥⌘G won’t work. If you’ve scanned every page and included blanks, the relationship will be simple (n’ = n + 4 or similar). I had success using jpdftweak to set the page labels so that the two match.
- Now perform OCR, check the results, and upload to Dropbox, iBooks etc.
- Consider having the pages rebound. You’ve already pulled all the pages out and digitised them, so keeping the book is an optional bonus. Helpfully, the only cost-effective rebinding mechanism is likely to be one that leaves you with a “stays flat” book - ideal for referring to while you have both hands full (e.g. tying a knot or fiddling with a bike).
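The back-of-envelope file size calculation mentioned in the list above might look something like this in Python. It is only a sketch: the page dimensions, colour depth and compression ratio are illustrative assumptions rather than measured values, and `estimate_scan_size_mb` is a made-up helper, not part of any scanner software.

```python
def estimate_scan_size_mb(pages, width_in, height_in, dpi,
                          bits_per_pixel=24, compression_ratio=10.0):
    """Rough size in megabytes of a scanned book.

    bits_per_pixel=24 assumes full colour; compression_ratio=10 is a
    guess at what JPEG-style compression might achieve, so treat the
    result as an order-of-magnitude figure only.
    """
    # Pixels on one page: (width in pixels) x (height in pixels)
    pixels_per_page = (width_in * dpi) * (height_in * dpi)
    # Uncompressed bytes for the whole book
    raw_bytes = pages * pixels_per_page * bits_per_pixel / 8
    # Apply the assumed compression ratio and convert to MB (10^6 bytes)
    return raw_bytes / compression_ratio / 1_000_000


# Hypothetical 300-page book with 6 x 9 inch pages
size_150 = estimate_scan_size_mb(300, 6, 9, dpi=150)
size_300 = estimate_scan_size_mb(300, 6, 9, dpi=300)
print(f"150 dpi: about {size_150:.0f} MB")
print(f"300 dpi: about {size_300:.0f} MB")
```

The point to notice is that file size grows with the square of the resolution, so doubling the dpi roughly quadruples the file, all else being equal.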
Books are still useful in full sunlight, rain, or in your garage while you’re spraying WD-40 around!
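For the page label point above, here is a minimal Python sketch of the simple offset case. It assumes n is the printed page number and n’ the electronic (PDF) page number, that every page including blanks was scanned, and that the offset is 4 as in the example. The function names are made up for illustration and have nothing to do with jpdftweak.

```python
def pdf_page_for(printed_page, offset=4):
    """Electronic page number n' for a printed page number n (n' = n + offset)."""
    return printed_page + offset


def printed_page_for(pdf_page, offset=4):
    """Printed page number for an electronic page, or None for front matter.

    The first `offset` electronic pages (cover, title page and so on)
    are assumed to carry no printed page number.
    """
    printed = pdf_page - offset
    return printed if printed >= 1 else None


print(pdf_page_for(120))      # 124
print(printed_page_for(124))  # 120
print(printed_page_for(3))    # None (front matter)
```

If a blank page was dropped somewhere in the middle of the book, a single offset like this stops working from that point on, which is why keeping the blanks matters.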
Tags: books, scanning, digital, paper, data