
scan-OCR-pdf

Scan-OCR-pdf is a project mainly written in ..., it's free.

A bash script to scan a series of pages, OCR them, then combine the scans and the text into one PDF.

The script scans a series of pages from a document into a PDF. The OCR output for each page is appended at the end of the PDF.
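The per-page scan and OCR loop described above might be sketched like this. It is a dry run that only prints the commands it would execute, so it can be tried without a scanner attached. The function name `plan_pages`, the file naming scheme and the TIFF intermediate format are assumptions made for illustration; the actual script may do this differently, and the PDF-combining step is left out because the tool used for it is not stated here.

```shell
#!/bin/bash
# Hypothetical outline of the scan -> OCR loop (dry run only).
# Names and file layout are illustrative, not taken from the real script.

plan_pages() {
    pages="$1"        # number of pages to scan
    resolution="$2"   # DPI passed to scanimage
    work_dir="$3"     # scratch directory for intermediate files

    n=1
    while [ "$n" -le "$pages" ]; do
        img="$work_dir/page_$n.tiff"
        txt="$work_dir/page_$n.txt"
        # 1. Scan one page to an image file.
        echo "scanimage --resolution $resolution --format=tiff > $img"
        # 2. OCR that image to plain text. Cuneiform needs ImageMagick
        #    support to read TIFF; without it, convert to BMP first.
        echo "cuneiform -f text -o $txt $img"
        n=$((n + 1))
    done
    # 3. Combining the images and the text into one PDF would follow here.
}

plan_pages 2 300 /tmp/scan
```

Printing the commands instead of running them keeps the sketch safe to experiment with; replacing each `echo` with the real command would turn it into a working loop.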

This keeps the original scan while also making the document searchable, e.g. for use in Google Desktop.

I played around with Tesseract: very good OCR, but no layout analysis, so Cuneiform it is.

This was built and tested on a GNOME machine, but I don't see any reason for it to fail on KDE or on non-Linux OSs as long as the dependencies are met. Feedback on this would be welcome.

Installation is currently manual only.

Download the script to a location such as /usr/bin/ and make it executable (writing to /usr/bin/ normally requires root):

$ wget http://github.com/dougle/scan-OCR-pdf/raw/master/scan_ocr_pdf -O /usr/bin/scan_ocr_pdf && chmod +x /usr/bin/scan_ocr_pdf

You can then add it to your panel or applications menu or run it in a terminal using the command:

$ scan_ocr_pdf -d "$HOME/Documents/Scanned"

OPTIONS:
  -h    Show this message
  -d    Destination directory
  -r    Resolution to use while scanning (can cause cuneiform to buffer overflow when set too high)
  -u    Use this device for scanning, instead of the first scanner listed. See $ scanimage -L for your devices
  -g    Interface mode, prompts use zenity
  -s    Options to pass to scanimage. Default is: "--mode 'Color' -l 0 -t 0 -x 210 -y 297" (colour A4 area)
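For readers curious how a flag set like the one above is typically handled in bash, here is a minimal `getopts` sketch. The function and variable names (`parse_args`, `dest_dir`, `gui_mode` and so on) are assumptions for illustration and are not taken from the script itself. Only the default scanimage options string comes from the option list, and no default resolution is assumed.

```shell
#!/bin/bash
# Hypothetical option parser mirroring the flags listed above.
# Variable names are illustrative; the real script may differ.

parse_args() {
    dest_dir=""
    resolution=""
    device=""
    gui_mode=0
    show_help=0
    # Default taken from the option list: colour, A4 scan area.
    scan_opts="--mode 'Color' -l 0 -t 0 -x 210 -y 297"

    OPTIND=1   # reset so the function can be called more than once
    while getopts "hd:r:u:gs:" opt; do
        case "$opt" in
            h) show_help=1 ;;
            d) dest_dir="$OPTARG" ;;
            r) resolution="$OPTARG" ;;
            u) device="$OPTARG" ;;
            g) gui_mode=1 ;;
            s) scan_opts="$OPTARG" ;;
            *) return 1 ;;   # unknown flag or missing argument
        esac
    done
}

parse_args -g -d ./Scanned -r 300
echo "dest=$dest_dir resolution=$resolution gui=$gui_mode"
```

In the option string, a colon after a letter (`d:`, `r:`, `u:`, `s:`) marks a flag that takes an argument, while `h` and `g` are plain switches.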

To debug, simply add -x to the first line of the script:

#!/bin/bash -x

or run

$ /bin/bash -x /usr/bin/scan_ocr_pdf -g -d "/home/dougle/Documents"

This adds a line-by-line trace of the execution, which is useful for tracking down weird or occasional bugs.

More options and features will be added soon, such as a device list when -u is not specified and more than one device is present.

Forks, patches, feature requests, bug reports and comments are welcome.
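As a rough idea of how the planned device list could work, the sketch below extracts device names from text in the format printed by `scanimage -L`. The helper names (`list_devices`, `choose_device`, `sample_output`) are hypothetical, the sample lines are made up for illustration, and nothing here is part of the current script.

```shell
#!/bin/bash
# Hypothetical helpers for picking a scanner when -u is not given.
# They read `scanimage -L` style text on stdin, so no scanner is needed.

list_devices() {
    # Lines look like: device `NAME' is a VENDOR MODEL flatbed scanner
    sed -n "s/^device \`\(.*\)' is a .*/\1/p"
}

choose_device() {
    devices="$(list_devices)"
    count="$(printf '%s\n' "$devices" | grep -c . || true)"
    if [ "$count" -eq 1 ]; then
        # Exactly one scanner: use it without asking.
        printf '%s\n' "$devices"
    elif [ "$count" -gt 1 ]; then
        echo "More than one scanner found, pick one with -u:" >&2
        printf '%s\n' "$devices" >&2
        return 2
    else
        echo "No scanner found" >&2
        return 1
    fi
}

# Made-up sample standing in for real `scanimage -L` output.
sample_output() {
    cat <<'EOF'
device `epson2:libusb:001:004' is a Epson GT-1500 flatbed scanner
device `v4l:/dev/video0' is a Noname Integrated Camera virtual device
EOF
}

sample_output | list_devices
```

In real use the input would come from `scanimage -L | choose_device`, and in -g mode the list could be handed to a zenity prompt instead of being printed to the terminal.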
