How to convert regular pdf into a pdf with text using pdftk?


Postby s3a » 2010-09-09 14:12

Can someone tell me which of the parameters in pdftk is responsible for converting my PDF (which is a bunch of images "stapled" together) into a PDF with text that I can search with Ctrl+F? The manual isn't clear enough for me.

Any input would be greatly appreciated!
Thanks!
s3a
 
Posts: 779
Joined: 2008-07-17 22:13

Re: How to convert regular pdf into a pdf with text using pdftk?

Postby paivakil » 2010-09-12 17:17

Short answer - you cannot.

Long answer:-

Extract the images from the PDF, run them through an OCR programme to get text, reformat the text, and convert the result back to PDF.

Not very easy.

BTW, PDFs with images are "normal" pdf files.
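
The steps in the long answer can be sketched as a small shell function. This is only a sketch under some assumptions: it uses pdftoppm (from poppler-utils) instead of extracting the embedded images, it assumes a tesseract version that supports the "pdf" output config, and it relies on pdftoppm zero-padding page numbers so that the glob sorts pages in order. The function name and the 300 dpi value are made up for illustration.

```bash
#!/bin/bash
# Sketch: rasterise each page, OCR it into a one-page searchable PDF,
# then join the pages again with pdftk.
make_searchable_pdf() {
    local src=$1 dst=$2 work png
    work=$(mktemp -d) || return 1

    # one PNG per page: page-1.png, page-2.png, ... (zero-padded as needed)
    pdftoppm -r 300 -png "$src" "$work/page" || return 1

    for png in "$work"/page-*.png ; do
        # writes ${png%.png}.pdf: the page image plus an invisible text layer
        tesseract "$png" "${png%.png}" pdf || return 1
    done

    # join the single-page PDFs in glob (page) order
    pdftk "$work"/page-*.pdf cat output "$dst" || return 1

    # clean up only after success, so failed runs can be inspected
    rm -rf "$work"
}

# usage: make_searchable_pdf scanned.pdf searchable.pdf
```

The resulting file keeps the scanned page images, so the layout is unchanged, while the text layer is what makes Ctrl+F work.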
paivakil
 
Posts: 434
Joined: 2009-02-15 11:57

Re: How to convert regular pdf into a pdf with text using pdftk?

Postby s3a » 2010-09-12 21:34

How do I do that? With what OCR program? (I am not aware of any). Also, is there a for loop or something involved? (because I can't convert 600+ pages manually).
s3a
 
Posts: 779
Joined: 2008-07-17 22:13

Re: How to convert regular pdf into a pdf with text using pdftk?

Postby Soul Singin' » 2010-09-12 22:11

s3a wrote:How do I do that? With what OCR program? (I am not aware of any). Also, is there a for loop or something involved? (because I can't convert 600+ pages manually).

Aren't you a computer science major? If so, how do you not know how to write a simple loop?

Oh, that's right! Because you didn't do your homework! . . :roll:


So here's Lesson One: Find a suitable OCR program to read each page. Then write a "for loop" like this:

Code: Select all
#!/bin/bash

## script to:
##   *  split a PDF up by pages
##   *  convert them to an image format
##   *  read the text from each page
##   *  concatenate the pages


## pass name of PDF file to script
INFILE=$1

## split PDF file into pages, resulting files will be
## numbered: pg_0001.pdf  pg_0002.pdf  pg_0003.pdf
pdftk "$INFILE" burst

for i in pg*.pdf ; do

    ## convert it to a PNG image file
    convert "$i" "${i%.pdf}.png"

    ## read text from each page
    ## (replace <OCR command> with the OCR program you chose)
    <OCR command> "${i%.pdf}.png" > "${i%.pdf}.txt"

done

## concatenate the pages into a single text file
cat pg*.txt > "${INFILE%.pdf}.txt"

exit
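
The `${i%.pdf}` expansion used in the loop strips the shortest trailing match of `.pdf` from the value of `i`, which is how each page gets a matching `.png` and `.txt` name. A minimal illustration follows; the helper name page_stem is made up for this example and is not part of the script.

```bash
#!/bin/bash
# ${var%pattern} removes the shortest match of pattern from the END of var
page_stem() {
    local i=$1
    printf '%s\n' "${i%.pdf}"
}

page_stem pg_0001.pdf                  # prints: pg_0001
echo "$(page_stem pg_0001.pdf).png"    # prints: pg_0001.png
```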
Soul Singin'
 
Posts: 1466
Joined: 2008-12-21 07:02

Re: How to convert regular pdf into a pdf with text using pdftk?

Postby Soul Singin' » 2010-09-13 03:59

This problem captivated me today, so I might as well share the solution. (Even though you don't deserve it).

First:

Code: Select all
apt-get install gocr imagemagick libjpeg-progs pdftk poppler-utils


Next, you burst the PDF into single pages and convert each page to an enormous PPM image file. Then you can either read the text from the PPM file directly or convert the PPM file to JPEG and read from the JPEG. I recommend the latter.

The whole process takes about 2 minutes per page on an old Pentium 3 (1200 MHz) with 1 GB of RAM, so if you really are going to convert 600 pages, then you might want to run the script overnight because it's going to take a while.
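
As a rough check on that estimate, assuming the 2 minutes per page holds for every page, the total can be worked out with shell arithmetic. The function name is made up for this example.

```bash
#!/bin/bash
# rough run-time estimate: pages x minutes-per-page,
# reported in whole hours (integer arithmetic, rounded up)
estimate_hours() {
    local pages=$1 min_per_page=$2
    echo $(( (pages * min_per_page + 59) / 60 ))
}

estimate_hours 600 2    # prints: 20
```

So on that hardware 600 pages would take roughly 20 hours, which is more than a single night.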

Disregards,
- Soul Singin'


Code: Select all
#!/bin/bash

## script to:
##   *  split a PDF up by pages
##   *  convert them to an image format
##   *  read the text from each page
##   *  concatenate the pages

## we will do all work in a temporary directory
## so remember where we started
DIR=$( pwd )

## pass name of PDF file to script
INFILE=$1

if [ -z "$INFILE" ] ; then
    printf "No file specified. Exiting.\n"
    exit 1
fi

if [ ! -f "$INFILE" ] ; then
    printf "%s is not a file. Exiting.\n" "$INFILE"
    exit 1
fi

## create temp directory and CD into it
## but get rid of anything that used to live there first
if [ -d /tmp/image2text ] ; then
    rm -rf /tmp/image2text
fi

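mkdir /tmp/image2text
cp "$INFILE" /tmp/image2text/.
cd /tmp/image2text

## from here on use only the file name, because the copy
## lives in the temp directory without its original path
INFILE=$( basename "$INFILE" )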


## split PDF file into pages, resulting files will be
## numbered: pg_0001.pdf  pg_0002.pdf  pg_0003.pdf
pdftk "$INFILE" burst

## make sure file was burst
if [ ! -f pg_0001.pdf ] ; then
    printf "Failed to burst %s. Exiting.\n" "$INFILE"
    exit 1
else
    ## do you really need doc_data.txt ???
    rm -f doc_data.txt
fi


## now let's turn each PDF page into text
for i in pg*.pdf ; do

    ## convert it to a PPM image file at 600 dots per inch
    pdftoppm -r 600 "$i" "${i%.pdf}.ppm"

    ## make sure the command worked
    if [ -f "${i%.pdf}.ppm-1.ppm" ] ; then

        ## change the goofy file name
        mv "${i%.pdf}.ppm-1.ppm" "${i%.pdf}.ppm"

    else
        printf "The PPM file: %s was not created. Exiting.\n" "${i%.pdf}.ppm-1.ppm"
        exit 1
    fi

    ## convert the file to a JPEG image with ImageMagick
    ## scanning the JPEG yields slightly better results
    ## and you get a much smaller file size
    convert "${i%.pdf}.ppm" "${i%.pdf}.jpg"

    ## make sure the command worked
    if [ -f "${i%.pdf}.jpg" ] ; then

        ## get rid of the massive PPM file and the PDF file
        rm "${i%.pdf}.ppm" "$i"

    else
        printf "The JPG file: %s was not created. Exiting.\n" "${i%.pdf}.jpg"
        exit 1
    fi

    ## read text from the page
    djpeg -pnm "${i%.pdf}.jpg" | gocr - > "${i%.pdf}.txt"

    ## make sure the command worked
    if [ -f "${i%.pdf}.txt" ] ; then

        ## get rid of the JPG file
        rm "${i%.pdf}.jpg"

    else
        printf "The TXT file: %s was not created. Exiting.\n" "${i%.pdf}.txt"
        exit 1
    fi

done

## concatenate the pages into a single text file
cat pg*.txt > "$DIR/${INFILE%.pdf}.txt"

## remove the temporary directory
cd "$DIR"

if [ -f "${INFILE%.pdf}.txt" ] ; then

    rm -rf /tmp/image2text

    ## get out of here!
    printf "All done. Have fun! \n"

else
    printf "Failed to generate %s\n" "${INFILE%.pdf}.txt"
    printf "Individual text files can be found in: /tmp/image2text/ \n"
fi

exit
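
The script above works in a fixed directory, /tmp/image2text, and wipes it at the start of each run, so two runs at the same time would clobber each other. One possible alternative, shown only as a sketch, is to let mktemp pick a unique directory for each run. The function name run_in_tempdir is made up for this example.

```bash
#!/bin/bash
# run a command inside a fresh temporary directory and remove the
# directory afterwards, passing the command's exit status back
run_in_tempdir() {
    local work status
    work=$(mktemp -d) || return 1

    # the subshell keeps the cd from affecting the caller
    ( cd "$work" && "$@" )
    status=$?

    rm -rf "$work"
    return $status
}

# usage: run_in_tempdir pdftk /full/path/to/file.pdf burst
```

Because the directory is removed as soon as the command returns, anything worth keeping has to be written outside it, for example with an absolute output path.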
Soul Singin'
 
Posts: 1466
Joined: 2008-12-21 07:02

Re: How to convert regular pdf into a pdf with text using pd

Postby aytack » 2014-10-06 02:32

Hello Soul Singin'

The example script in your first post helped me convert PDF files to txt with ease, where many others did not.

It worked especially well after I changed the convert line to "convert -density 200 -quality 100 $i ${i%.pdf}.png" and used tesseract for the OCR.

Thank you!

I am not a computer science major, so I hope I can ask something that may seem very basic.

This:

I have a folder full of PDF files. Some 15 GB of them.

And I want to mass-convert them to txt files, preferably saved to another folder.

Can you make an addition to your first script?

Is this even possible?

Best!

- Aytaç
aytack
 
Posts: 6
Joined: 2014-10-06 02:24

Re: How to convert regular pdf into a pdf with text using pd

Postby aytack » 2014-10-07 17:49

I learned how to mass-convert PDF files into txt using OCR.

I will post the shell script and explain how to use it, but first I want to mention its advantages over other techniques shared on the internet. I saw many scripts, especially on the Ubuntu forums, and unfortunately most of them did not even work when I tested them.

* This script does not use TIFF images, so the conversion will not take up tens of gigabytes of space on your hard disk.

* It is CPU-intensive. The speed of your CPU determines how fast your PDF files get converted. I have a fairly old machine with a processor of about 2 GHz, and it converts approximately 120 pages per hour.

* It does not slow your system down. Other scripts that use the TIFF conversion method slow the system down severely.

With that said, let me share the script, which is Soul Singin's script with a few changes. Since I don't know how to write code, I asked in the Unix & Linux category of the Stack Exchange forums, and they helped. Here it is:

Code: Select all
#!/bin/bash

## script to:
##   *  split a PDF up by pages
##   *  convert them to an image format
##   *  read the text from each page
##   *  concatenate the pages


## pass the names of one or more PDF files to the script
for INFILE
do

## split PDF file into pages, resulting files will be
## numbered: pg_0001.pdf  pg_0002.pdf  pg_0003.pdf
pdftk "$INFILE" burst

for i in pg*.pdf ; do

    ## convert it to a PNG image file
    convert -density 200 -quality 100 "$i" "${i%.pdf}.png"

    ## read text from each page
    ## (tesseract appends .txt to the output name by itself)
    tesseract "${i%.pdf}.png" "${i%.pdf}"

done

## concatenate the pages into a single text file
cat pg*.txt > "${INFILE%.pdf}.txt"
rm pg*.pdf
rm pg*.png
rm pg*.txt
rm doc_data.txt
mv "${INFILE%.pdf}.txt" /home/your_user_ID/path/to/your/desired/output/folder/.
done

exit


It works perfectly on Debian Wheezy. It should work on other Linux distros too, but I have not tried.

You need these packages: pdftk, imagemagick, tesseract-ocr.

Yes, tesseract-ocr. Why?

Its conversion quality is equal to Abbyy Fine Reader. This is why.

How to use it?

Let me explain how to use it, because some people may not know how to use shell scripts. I did not know once, either.

You should copy this code into an empty file. A plain text file, if you want to call it that.

The file does not need any extension at the end. (A .sh extension would also work, because the shell does not care about the extension, but I tested it without one, so that is what I describe here.)

Of course, edit the output folder path in the code (the "mv" line near the end). This is the place where you want the resulting .txt files to go.
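
Instead of editing the path inside the script, the output folder could come from an environment variable with a fallback. This is only a sketch: the variable OUTDIR, its default of ./txt and the function name are assumptions for illustration, not part of the posted script.

```bash
#!/bin/bash
# move a finished text file into the output folder, creating it if needed;
# OUTDIR is a hypothetical variable, defaulting to ./txt
move_to_outdir() {
    local txt=$1
    local dest=${OUTDIR:-./txt}
    mkdir -p "$dest" && mv "$txt" "$dest"/
}

# usage inside the script, in place of the hard-coded mv line:
#   move_to_outdir "${INFILE%.pdf}.txt"
# and on the command line:
#   OUTDIR=~/converted ./script *.pdf
```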

Then copy this script file, containing the code shared above, into the folder where you have all of your PDFs.

(If you have many PDF files scattered around your hard disk, there are commands that can move all of them into a single folder. You can find them on the net.)
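
One way to gather scattered PDFs is find combined with mv. The sketch below assumes GNU find and GNU mv (as on Debian), and it assumes the target folder lies outside the tree being searched. The function name and the example paths are made up. Files whose name already exists in the target are left where they are rather than overwritten.

```bash
#!/bin/bash
# move every PDF found under $1 into the single folder $2
#   -iname : match .pdf and .PDF alike
#   mv -n  : never overwrite a file already present in the target (GNU mv)
#   mv -t  : name the target directory first, so find can append the files
gather_pdfs() {
    local src=$1 dest=$2
    mkdir -p "$dest" || return 1
    find "$src" -type f -iname '*.pdf' -exec mv -n -t "$dest" {} +
}

# usage: gather_pdfs ~/Documents ~/all_pdfs
```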

Then open a terminal and "cd" to the folder that holds all your PDF files. (What does it mean to "cd" to a folder? You can learn it by searching for "Linux command line cd" on the net. In short, you write "cd /home/your_id/whatever/folder/to/cd" and you are inside that folder, able to work with the files in it.) This is the folder that now holds both your PDFs and the script file.

Let's assume the name of your script file is "script". It has no extension, as you can see.

Make it executable once with "chmod +x script", then write this on the command line: "./script *.pdf"

That's it.

The temporary files are also deleted in the process. A clean script.

You don't need Abbyy Fine Reader to convert your files into .txt format.

Linux has it, and has it for free.

:wink:

Note: The "-density 200" value in the code is probably the best setting. With lower values you get wrong results in the txt file. With higher values you get a much longer processing time. I have tested it, and it works perfectly, so I advise not changing it.
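
Which density works best may still depend on the scans, so it can be worth trying a few values on a single page before committing to a long run. A sketch, assuming ImageMagick's convert and tesseract as in the script above; the function name and the list of densities are only examples.

```bash
#!/bin/bash
# OCR one single-page PDF at several densities so the results can be compared
try_densities() {
    local page=$1 d
    for d in 150 200 300 ; do
        convert -density "$d" "$page" "${page%.pdf}_d${d}.png" || return 1
        # tesseract adds .txt itself, so pass the name without extension
        tesseract "${page%.pdf}_d${d}.png" "${page%.pdf}_d${d}" || return 1
    done
}

# usage: try_densities pg_0001.pdf
# then compare pg_0001_d150.txt, pg_0001_d200.txt and pg_0001_d300.txt
```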


I think this script can be useful for many.

Thanks to all!

- Aytaç
aytack
 
Posts: 6
Joined: 2014-10-06 02:24

