Sunday, February 13, 2011

Extracting text from a PDF file with the PDFBox ExtractText command

http://pdfbox.apache.org/commandlineutilities/ExtractText.html

Some setup is needed first, unless PDFBox is already installed (e.g. by your Linux distribution):

Check out the PDFBox trunk from SVN, then:

apt-get install maven2

mvn clean install

Once the build succeeds, use it like this:

usage: java -jar pdfbox-app-x.y.z.jar ExtractText [OPTIONS] <PDF file> [Text file]

e.g.:
rbarraud@thinky:~/Desktop/tools/PDFBox/trunk/app/target$ java -jar pdfbox-app-1.5.0-SNAPSHOT.jar ExtractText -html /home/rbarraud/Desktop/Reference/pdfs/IMX25RM.pdf /tmp/MX25RM.html
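
If there is a whole directory of manuals to convert, the same command can be wrapped in a loop. This is only a rough sketch: PDFBOX_JAR, html_name_for and extract_all are names I made up for illustration (they are not part of PDFBox), and it assumes the jar built above is in the current directory and that java is on the PATH.

```shell
#!/bin/sh
# Hypothetical helper, not part of PDFBox: run ExtractText -html on every PDF in a directory.
# PDFBOX_JAR is an assumed variable; point it at the jar produced by the Maven build.
PDFBOX_JAR="${PDFBOX_JAR:-pdfbox-app-1.5.0-SNAPSHOT.jar}"

# Map /some/dir/foo.pdf to <outdir>/foo.html
html_name_for() {
    printf '%s/%s.html\n' "$2" "$(basename "$1" .pdf)"
}

# Convert each *.pdf in $1, writing the HTML files into $2
extract_all() {
    indir=$1
    outdir=$2
    for pdf in "$indir"/*.pdf; do
        [ -e "$pdf" ] || continue   # skip when the glob matched nothing
        java -jar "$PDFBOX_JAR" ExtractText -html "$pdf" "$(html_name_for "$pdf" "$outdir")"
    done
}
```

After sourcing the file, something like `extract_all ~/Desktop/Reference/pdfs /tmp` would write one .html file per PDF into /tmp.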

I want to grab tables out of processor reference manuals (which come as PDFs) so I can build tools for browsing machine state by register name, etc.
