Home > rpdf

rpdf

Rpdf is a project mainly written in PERL and SHELL, it's free.

extract text content and layout information from PDF documents

rpdf extracts text and layout from PDF documents, using pdftohtml and cuneiform.

version 0.3 (2010-06-05)


INSTALLATION

rpdf requires

  • pdftohtml (the Poppler version)
  • cuneiform 0.9
  • convert (ImageMagick)
  • pdftk
  • the Perl modules XML::Writer, XML::XPath, HTML::TreeBuilder

No further installation required. To test if it works, run

./rpdf test/simple.pdf ./rpdf test/ocr.pdf


USAGE

rpdf [-d int] [-p int] file.pdf

-d int : debug level -p int : max. number of pages to parse

If -p is not specified, what happens depends on whether rpdf can use pdftohtml or has to fall back on OCR.

Previous:tweet-link