How to convert regular pdf into a pdf with text using pdftk?

Postby s3a » 2010-09-09 14:12

Can someone tell me which pdftk parameter converts my PDF (which is a bunch of images "stapled" together) into a PDF with text that I can Ctrl+F? The manual isn't clear enough for me.

Any input would be greatly appreciated!
Thanks!

Postby paivakil » 2010-09-12 17:17

Short answer - you cannot.

Long answer:-

Extract the images from the PDF, run them through an OCR programme to get text, reformat the text, and convert it back to PDF.

Not very easy.

BTW, PDFs with images are "normal" pdf files.

Postby s3a » 2010-09-12 21:34

How do I do that? With what OCR program? (I am not aware of any). Also, is there a for loop or something involved? (because I can't convert 600+ pages manually).

Postby Soul Singin' » 2010-09-12 22:11

s3a wrote: How do I do that? With what OCR program? (I am not aware of any). Also, is there a for loop or something involved? (because I can't convert 600+ pages manually).

Aren't you a computer science major? If so, how do you not know how to write a simple loop?

Oh, that's right! Because you didn't do your homework! . . :roll:


So here's Lesson One: Find a suitable OCR program to read each page. Then write a "for loop" like this:

Code:
#!/bin/bash

## script to:
##   *  split a PDF up by pages
##   *  convert them to an image format
##   *  read the text from each page
##   *  concatenate the pages


## pass name of PDF file to script
INFILE=$1

## split PDF file into pages, resulting files will be
## numbered: pg_0001.pdf  pg_0002.pdf  pg_0003.pdf
pdftk "$INFILE" burst

for i in pg*.pdf ; do

    ## convert it to a PNG image file
    convert "$i" "${i%.pdf}.png"

    ## read text from each page
    ## (replace <OCR command> with whatever OCR program you pick)
    <OCR command> "${i%.pdf}.png" > "${i%.pdf}.txt"

done

## concatenate the pages into a single text file
cat pg*.txt > "${INFILE%.pdf}.txt"

exit
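
The ${i%.pdf} bits in the loop above are ordinary shell parameter expansion: they strip a trailing ".pdf" from the value, so each page keeps its name and only the extension changes. A minimal sketch of just that step, assuming a page name like the ones pdftk burst produces (strip_pdf is a made-up helper name, used only for illustration):

```shell
#!/bin/bash

## strip_pdf is a hypothetical helper, only here to show the expansion
strip_pdf() {
    local name=$1
    ## ${name%.pdf} removes a trailing ".pdf" if there is one
    printf '%s\n' "${name%.pdf}"
}

page=pg_0001.pdf                  ## example name, as produced by "pdftk ... burst"
png="$( strip_pdf "$page" ).png"  ## name for the image of that page
txt="$( strip_pdf "$page" ).txt"  ## name for the OCR output of that page

printf '%s -> %s -> %s\n' "$page" "$png" "$txt"
```

A name that does not end in ".pdf" passes through unchanged, which is why the same expansion is safe to use on $INFILE in the last line of the script.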

Postby Soul Singin' » 2010-09-13 03:59

This problem captivated me today, so I might as well share the solution. (Even though you don't deserve it).

First:

Code:
apt-get install gocr imagemagick libjpeg-progs pdftk poppler-utils
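
To confirm the install worked (or to see what is missing on a system without apt-get), something like the following checks that the commands used by the script further down are on the PATH. This is only a sketch: missing_tools is a made-up helper name, and the command-to-package mapping in the comment is an assumption based on the package list above.

```shell
#!/bin/bash

## print every command from the argument list that is not on the PATH
missing_tools() {
    local tool
    for tool in "$@" ; do
        command -v "$tool" > /dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

## commands the script calls, with the package assumed to provide each:
##   gocr (gocr), convert (imagemagick), djpeg (libjpeg-progs),
##   pdftk (pdftk), pdftoppm (poppler-utils)
MISSING=$( missing_tools gocr convert djpeg pdftk pdftoppm )

if [ -n "$MISSING" ] ; then
    printf 'Still missing:\n%s\n' "$MISSING"
else
    printf 'All tools found.\n'
fi
```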


Next, you burst the PDF into single pages and convert each page to an enormous PPM image file. Then you can either read the text straight from the PPM file, or convert the PPM file to JPEG and read from the JPEG. I recommend the latter.

The whole process takes about 2 minutes per page on an old Pentium 3 (1200 MHz) with 1 GB of RAM, so if you really are going to convert 600 pages, then you might want to run the script overnight because it's going to take a while.
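
To put a number on "a while": at the roughly 2 minutes per page quoted above, 600 pages comes to about 1200 minutes, or 20 hours on that machine. A quick sketch of the arithmetic, in case you want to plug in your own page count or timing (estimate_minutes is a made-up name, and the per-page time will differ on other hardware):

```shell
#!/bin/bash

## estimate_minutes PAGES MINUTES_PER_PAGE
## prints the total run time in minutes
estimate_minutes() {
    printf '%d\n' $(( $1 * $2 ))
}

PAGES=600                                 ## page count from the question
TOTAL=$( estimate_minutes "$PAGES" 2 )    ## 2 = minutes per page quoted above

printf '%d pages: about %d minutes (%d h %d min)\n' \
    "$PAGES" "$TOTAL" $(( TOTAL / 60 )) $(( TOTAL % 60 ))
```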

Disregards,
- Soul Singin'


Code:
#!/bin/bash

## script to:
##   *  split a PDF up by pages
##   *  convert them to an image format
##   *  read the text from each page
##   *  concatenate the pages

## we will do all work in a temporary directory
## so remember where we started
DIR=$( pwd )

## pass name of PDF file to script
INFILE=$1

if [ -z "$INFILE" ] ; then
    printf "No file specified. Exiting.\n"
    exit 1
fi

if [ ! -f "$INFILE" ] ; then
    printf "%s is not a file. Exiting.\n" "$INFILE"
    exit 1
fi

## file name without leading directories, so the script still
## works when it is called with a path such as docs/book.pdf
BASE=$( basename "$INFILE" )

## create temp directory and CD into it
## but get rid of anything that used to live there first
if [ -d /tmp/image2text ] ; then
    rm -rf /tmp/image2text
fi

mkdir /tmp/image2text
cp "$INFILE" /tmp/image2text/.
cd /tmp/image2text || exit 1


## split PDF file into pages, resulting files will be
## numbered: pg_0001.pdf  pg_0002.pdf  pg_0003.pdf
pdftk "$BASE" burst

## make sure file was burst
if [ ! -f pg_0001.pdf ] ; then
    printf "Failed to burst %s. Exiting.\n" "$INFILE"
    exit 1
else
    ## do you really need doc_data.txt ???
    rm -f doc_data.txt
fi


## now let's turn each PDF page into text
for i in pg*.pdf ; do

    ## convert it to a PPM image file at 600 dots per inch
    pdftoppm -r 600 "$i" "${i%.pdf}.ppm"

    ## make sure the command worked
    if [ -f "${i%.pdf}.ppm-1.ppm" ] ; then

        ## change the goofy file name
        mv "${i%.pdf}.ppm-1.ppm" "${i%.pdf}.ppm"

    else
        printf "The PPM file: %s was not created. Exiting.\n" "${i%.pdf}.ppm-1.ppm"
        exit 1
    fi

    ## convert the file to a JPEG image with ImageMagick
    ## scanning the JPEG yields slightly better results
    ## and you get a much smaller file size
    convert "${i%.pdf}.ppm" "${i%.pdf}.jpg"

    ## make sure the command worked
    if [ -f "${i%.pdf}.jpg" ] ; then

        ## get rid of the massive PPM file and the PDF file
        rm "${i%.pdf}.ppm" "$i"

    else
        printf "The JPG file: %s was not created. Exiting.\n" "${i%.pdf}.jpg"
        exit 1
    fi

    ## read text from the page
    djpeg -pnm "${i%.pdf}.jpg" | gocr - > "${i%.pdf}.txt"

    ## make sure the command worked
    if [ -f "${i%.pdf}.txt" ] ; then

        ## get rid of the JPG file
        rm "${i%.pdf}.jpg"

    else
        printf "The TXT file: %s was not created. Exiting.\n" "${i%.pdf}.txt"
        exit 1
    fi

done

## concatenate the pages into a single text file
## and put it in the directory we started from
cat pg*.txt > "$DIR/${BASE%.pdf}.txt"

## remove the temporary directory
cd "$DIR" || exit 1

if [ -f "$DIR/${BASE%.pdf}.txt" ] ; then

    rm -rf /tmp/image2text

    ## get out of here!
    printf "All done. Have fun! \n"

else
    printf "Failed to generate %s\n" "${BASE%.pdf}.txt"
    printf "Individual text files can be found in: /tmp/image2text/ \n"
fi

exit
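
One thing to watch in the script above is the fixed /tmp/image2text directory: two runs at the same time would wipe each other's files. A possible variation, not part of the original script, is to let mktemp pick a unique directory and clean it up with a trap. This is a sketch under those assumptions; make_workdir is a made-up helper name and the image2text prefix is just carried over from the script.

```shell
#!/bin/bash

## create a unique work directory under ${TMPDIR:-/tmp} and print its path
make_workdir() {
    mktemp -d "${TMPDIR:-/tmp}/image2text.XXXXXX"
}

WORKDIR=$( make_workdir ) || exit 1

## remove the work directory when the script exits, even after an error
trap 'rm -rf "$WORKDIR"' EXIT

printf 'Working in %s\n' "$WORKDIR"
```

In the script you would then cd into $WORKDIR instead of /tmp/image2text and drop the rm -rf at the start. Note that with the trap in place the per-page text files are gone after a failure, so the "Individual text files can be found in" message would no longer apply.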