It’s 2012 and I still keep folders full of papers. Time to change that.
I’ve bought an HP Officejet 4500, mainly because it’s cheap, works well under Linux and has an ADF (automatic document feeder). It looks solid, but sometimes sounds as if it is falling apart, so I hope it will keep doing its work for a while.
Here’s my goal:
- Use the command line to scan one or more pages
- Apply OCR
- Store text and image somewhere
- Put a full-text index on top of the text/image
- Have the ability to search for these documents
Scanning an image is straightforward. There is either scanimage or scanadf; both are part of the SANE project.
scanimage --device 'hpaio:/net/Officejet_4500_G510g-m?ip=192.168.178.100' \
    --format=pnm \
    --resolution 300 \
    -x 210 \
    -y 297 > scan1.pnm
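If you would rather drive the scanner from a script than retype that command for every page, here is a minimal Python sketch. It only wraps the scanimage call shown above; the function names `scan_command` and `scan_page` are made up for this example and are not taken from any existing tool.

```python
import subprocess

# Device URI copied from the command above; replace it with your own scanner.
DEVICE = "hpaio:/net/Officejet_4500_G510g-m?ip=192.168.178.100"


def scan_command(device=DEVICE, resolution=300, width_mm=210, height_mm=297):
    """Build the scanimage argument list for one page (A4 by default)."""
    return [
        "scanimage",
        "--device", device,
        "--format=pnm",
        "--resolution", str(resolution),
        "-x", str(width_mm),
        "-y", str(height_mm),
    ]


def scan_page(path, **options):
    """Run scanimage and write the image it prints on stdout to `path`."""
    with open(path, "wb") as out:
        # check=True raises CalledProcessError if scanimage fails.
        subprocess.run(scan_command(**options), stdout=out, check=True)
```

Calling `scan_page("scan1.pnm")` should then behave like the shell command, and a loop over `scan_page(f"scan{n}.pnm")` takes care of several pages.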
To make life easier for the OCR tools, unpaper helps by cleaning up the scan first.
unpaper scan1.pnm unpapered1.pnm
After putting my strong Google-Fu to use, I found that tesseract apparently yields the best results among the open source solutions. There are also cuneiform, OCRopus, ocrad and gocr.
Tesseract requires a .tiff for its magic. But thanks to ImageMagick, converting the pnm couldn’t be easier:
convert unpapered1.pnm prepared1.tiff
And finally, extract the text. Tesseract appends .txt to the output name, so the result ends up in scan1.txt.
tesseract prepared1.tiff scan1 -l eng
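The three steps after scanning (unpaper, convert, tesseract) always run in the same order, so they are easy to chain. The following is a rough Python sketch of that chain, not the actual paperstore code. The helper names and the intermediate file names (`scan1-unpapered.pnm`, `scan1.tiff`) are choices made for this example only.

```python
import subprocess
from pathlib import Path


def ocr_commands(scan_path, lang="eng"):
    """Return the unpaper, convert and tesseract commands for one scan."""
    scan = Path(scan_path)
    cleaned = scan.with_name(scan.stem + "-unpapered.pnm")
    tiff = scan.with_suffix(".tiff")
    text_base = scan.with_suffix("")  # tesseract appends ".txt" itself
    return [
        ["unpaper", str(scan), str(cleaned)],
        ["convert", str(cleaned), str(tiff)],
        ["tesseract", str(tiff), str(text_base), "-l", lang],
    ]


def run_ocr(scan_path, lang="eng"):
    """Run the three steps in order and return the recognised text."""
    for command in ocr_commands(scan_path, lang):
        # Stop at the first failing step instead of feeding junk onwards.
        subprocess.run(command, check=True)
    return Path(scan_path).with_suffix(".txt").read_text(encoding="utf-8")
```

With the tools installed, `run_ocr("scan1.pnm")` would leave `scan1.txt` next to the scan and hand back its contents, ready to be stored or indexed.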
So far the only downside I have found with tesseract is that I constantly find myself typing tessarect instead of tesseract.
Okay, so far so good. Now what about the full-text index? Well, I have read lots of good things about Lucene, but Lucene is either Java or .NET. There is a Python wrapper around it, but I would still have had to pull in the whole Java dependency.
Good thing I learned about Whoosh, a full-text indexing library for Python that is itself written in pure Python.
All I had to do was plumb these pieces together, and the result became paperstore. It still needs some polish and I haven’t scanned a whole lot with it yet, but it might give you some pointers on how to do something similar.
It’s all open source, so feel free to fork it and hack away on it. Or simply leave me some suggestions for improvements.