Scan-OCR-pdf is a project mainly written in ..., it's free.
A bash script to scan a series of pages, ocr them, then combine the scans and the text into one pdf
This is a bash script to scan a series of pages from a document into a pdf, at the end of the pdf is the output from OCR on each page.
This provides the original scan but also something searchable, e.g. for use in google desktop.
I played around with tesseract, very good OCR but no layout analisys!! so cuneiform it is.
This was built and tested on a gnome machine but i dont see any reason for it to fail on kde or a non linux OSs as long as dependancies are met, feedback on this would be welcome.
To install, currently only manual install:
Download and copy the script to a location such as /usr/bin/ $ wget http://github.com/dougle/scan-OCR-pdf/raw/master/scan_ocr_pdf -o /usr/bin/scan_ocr_pdf && chmod +x /usr/bin/scan_ocr_pdf
You can then add it to your panel or applications menu or run it in a terminal using the command:
$ scan_ocr_pdf -d "$HOME/Documents/Scanned"
OPTIONS: -h Show this message -d Destination directory -r Resolution to use while scanning, (can cause cuneiform to buffer overflow when set too high) -u Use this device for scanning, instead of the first scanner listed See $ scanimage -L for your devices -g Interface mode, prompts use zenity -s Options to pass to scanimage. Default is: "--mode 'Color' -l 0 -t 0 -x 210 -y 297" (colour A4 area)
To debug simple add -x to the first line:
or run
$ /bin/bash -x /usr/bin/scan_ocr_pdf -g -d "/home/dougle/Documents"
This'll add a line by line transcription of the execution and would be usefull in any weird/occasional bugs.
More options and features will be added soon, like a device list if -u is not specified and more than one device is present Forks, Patches, Feature requests, Bug reports, comments welome.