Can someone tell me which of the parameters in pdftk is responsible for converting my PDF (which is a bunch of images "stapled" together) into a PDF with text that I can Ctrl+F? The manual isn't clear enough for me.
Any input would be greatly appreciated!
Thanks!
How to convert regular pdf into a pdf with text using pdftk?
Re: How to convert regular pdf into a pdf with text using pdftk?
Short answer - you cannot.
Long answer:
Extract the images from the PDF, run the images through an OCR programme to convert them to text, format the text back, and convert it to PDF.
Not very easy.
BTW, PDFs with images are "normal" pdf files.
It's not the software that's free; it's you.
Re: How to convert regular pdf into a pdf with text using pdftk?
How do I do that? With what OCR program? (I am not aware of any.) Also, is there a "for loop" or something involved? (Because I can't convert 600+ pages manually.)
Use Mnemosyne to Study for School!
- Soul Singin'
- Posts: 1605
- Joined: 2008-12-21 07:02
Re: How to convert regular pdf into a pdf with text using pdftk?
s3a wrote: How do I do that? With what OCR program? (I am not aware of any.) Also, is there a for loop or something involved? (Because I can't convert 600+ pages manually.)
Aren't you a computer science major? If so, how do you not know how to write a simple loop?
Oh, that's right! Because you didn't do your homework! . .
So here's Lesson One: find a suitable OCR program to read each page. Then write a "for loop" like this:
Code:
#!/bin/bash
## script to:
## * split a PDF up by pages
## * convert them to an image format
## * read the text from each page
## * concatenate the pages
## pass name of PDF file to script
INFILE=$1
## split PDF file into pages, resulting files will be
## numbered: pg_0001.pdf pg_0002.pdf pg_0003.pdf
pdftk "$INFILE" burst
for i in pg*.pdf ; do
## convert it to a PNG image file
convert "$i" "${i%.pdf}.png"
## read text from each page
<OCR command> "${i%.pdf}.png" > "${i%.pdf}.txt"
done
## concatenate the pages into a single text file
cat pg*.txt > "${INFILE%.pdf}.txt"
exit
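The "${i%.pdf}" expansions in the loop above use bash's suffix-removal parameter expansion to swap file extensions. A tiny standalone demonstration (pure bash, no OCR tools needed):

```shell
#!/bin/bash
## "${i%.pdf}" strips a trailing ".pdf" from $i,
## so appending ".png" effectively swaps the extension
i="pg_0001.pdf"
echo "${i%.pdf}.png"   # prints: pg_0001.png
echo "${i%.pdf}.txt"   # prints: pg_0001.txt
```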
- Soul Singin'
- Posts: 1605
- Joined: 2008-12-21 07:02
Re: How to convert regular pdf into a pdf with text using pdftk?
This problem captivated me today, so I might as well share the solution. (Even though you don't deserve it).
First:
Code:
apt-get install gocr imagemagick libjpeg-progs pdftk poppler-utils
Next, you burst the PDF into single pages and convert each page to an enormous PPM image file. Then you can either read the text from the PPM file or you can convert the PPM file to JPEG and read from the JPEG. I recommend the latter.
The whole process takes about 2 minutes per page on an old Pentium 3 (1200 MHz) with 1 GB of RAM, so if you really are going to convert 600 pages, then you might want to run the script overnight because it's going to take a while.
Disregards,
- Soul Singin'
Code:
#!/bin/bash
## script to:
## * split a PDF up by pages
## * convert them to an image format
## * read the text from each page
## * concatenate the pages
## we will do all work in a temporary directory
## so remember where we started
DIR=$( pwd )
## pass name of PDF file to script
INFILE=$1
if [ -z "$INFILE" ] ; then
printf "No file specified. Exiting.\n"
exit 1
fi
if [ ! -f "$INFILE" ] ; then
printf "%s is not a file. Exiting.\n" "$INFILE"
exit 1
fi
## create temp directory and CD into it
## but get rid of anything that used to live there first
if [ -d /tmp/image2text ] ; then
rm -rf /tmp/image2text
fi
mkdir /tmp/image2text
cp "$INFILE" /tmp/image2text/.
cd /tmp/image2text
## from here on we only need the file's base name,
## in case a path like ../foo.pdf was passed in
INFILE=$( basename "$INFILE" )
## split PDF file into pages, resulting files will be
## numbered: pg_0001.pdf pg_0002.pdf pg_0003.pdf
pdftk "$INFILE" burst
## make sure file was burst
if [ ! -f pg_0001.pdf ] ; then
printf "Failed to burst %s. Exiting.\n" "$INFILE"
exit 1
else
## do you really need doc_data.txt ???
rm doc_data.txt
fi
## now let's turn each PDF page into text
for i in pg*.pdf ; do
## convert it to a PPM image file at 600 dots per inch
pdftoppm -r 600 "$i" "${i%.pdf}.ppm"
## make sure the command worked
if [ -f "${i%.pdf}.ppm-1.ppm" ] ; then
## change the goofy file name
mv "${i%.pdf}.ppm-1.ppm" "${i%.pdf}.ppm"
else
printf "The PPM file: %s was not created. Exiting.\n" "${i%.pdf}.ppm-1.ppm"
exit 1
fi
## convert the file to a JPEG image with ImageMagick
## scanning the JPEG yields slightly better results
## and you get a much smaller file size
convert "${i%.pdf}.ppm" "${i%.pdf}.jpg"
## make sure the command worked
if [ -f "${i%.pdf}.jpg" ] ; then
## get rid of the massive PPM file and the PDF file
rm "${i%.pdf}.ppm" "$i"
else
printf "The JPG file: %s was not created. Exiting.\n" "${i%.pdf}.jpg"
exit 1
fi
## read text from the page
djpeg -pnm "${i%.pdf}.jpg" | gocr - > "${i%.pdf}.txt"
## make sure the command worked
if [ -f "${i%.pdf}.txt" ] ; then
## get rid of the JPG file
rm "${i%.pdf}.jpg"
else
printf "The TXT file: %s was not created. Exiting.\n" "${i%.pdf}.txt"
exit 1
fi
done
## concatenate the pages into a single text file
cat pg*.txt > "$DIR/${INFILE%.pdf}.txt"
## remove the temporary directory
cd "$DIR"
if [ -f "${INFILE%.pdf}.txt" ] ; then
rm -rf /tmp/image2text
## get out of here!
printf "All done. Have fun! \n"
else
printf "Failed to generate ${INFILE%.pdf}.txt\n"
printf "Individual text files can be found in: /tmp/image2text/ \n"
fi
exit
Re: How to convert regular pdf into a pdf with text using pdftk?
Hello Soul Singin'
The example script in your first post helped me - where many others did not - to convert PDF files into txt with ease.
It worked especially well after I changed the conversion line to "convert -density 200 -quality 100 $i ${i%.pdf}.png" and used tesseract for the OCR.
Thank you!
I am not a computer science major, so I hope I may ask something that may seem very basic.
It is this: I have a folder full of PDF files - some 15 GB of them - and I want to mass-convert them to txt files, ideally saved into another folder.
Could you extend your first script to do that?
Is it possible?
Best!
- Aytaç
Re: How to convert regular pdf into a pdf with text using pdftk?
I have learned how to mass-convert PDF files into txt using OCR.
I will share the shell script and explain how to use it, but first I have to mention its advantages over other techniques shared on the internet. I have seen many scripts, especially on the Ubuntu forums, and unfortunately most of them do not even work, as I have tested.
* This script does not use TIFF images for the conversion, so it will not consume tens of gigabytes of hard disk space.
* It is CPU-bound: how fast your CPU is determines how fast your PDF files get converted. On my fairly old machine with a roughly 2 GHz processor, it converts approximately 120 pages per hour.
* It does not slow your system down. Scripts that use the TIFF conversion method slow the system badly.
With all that said, let me share the script. It is Soul Singin's script, slightly changed. Since I don't know how to write code, I asked for help in the Unix & Linux category of the Stack Exchange forums, and they contributed. Here it is:
It works perfectly on Debian Wheezy, and it should work on any Linux distro, though I have not tried.
Code:
#!/bin/bash
## script to:
## * split a PDF up by pages
## * convert them to an image format
## * read the text from each page
## * concatenate the pages
## pass name of PDF file to script
for INFILE
do
## split PDF file into pages, resulting files will be
## numbered: pg_0001.pdf pg_0002.pdf pg_0003.pdf
pdftk "$INFILE" burst
for i in pg*.pdf ; do
## convert it to a PNG image file
convert -density 200 -quality 100 "$i" "${i%.pdf}.png"
## read text from each page
## (tesseract appends the .txt extension to the output name itself)
tesseract "${i%.pdf}.png" "${i%.pdf}"
done
## concatenate the pages into a single text file
cat pg*.txt > "${INFILE%.pdf}.txt"
rm pg*.pdf
rm pg*.png
rm pg*.txt
rm doc_data.txt
mv "${INFILE%.pdf}.txt" /home/your_user_ID/path/to/your/desired/output/folder/.
done
exit
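One step that is easy to forget before the "./script *.pdf" invocation will work: the script file must be marked executable. A hedged sketch of the run, using a harmless stand-in named "script" (a placeholder that just echoes its arguments) instead of the real OCR script:

```shell
#!/bin/bash
## work in a scratch directory with a stand-in "script"
## and some fake PDFs, just to show the mechanics
cd "$(mktemp -d)"
printf '#!/bin/bash\nfor INFILE ; do echo "would OCR $INFILE" ; done\n' > script
touch a.pdf b.pdf
## the easy-to-forget step: make the script executable
chmod +x script
## then invoke it exactly as described, one argument per PDF
./script *.pdf   # prints: would OCR a.pdf / would OCR b.pdf
```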
You need these packages: pdftk, imagemagick, tesseract-ocr.
Why tesseract?
Its conversion quality is on par with ABBYY FineReader. This is why.
How to use it?
Let me explain how to use it, because some people may not know how to use shell scripts. I did not once.
Copy the code above into an empty file - a plain text file, if you want to call it that.
Give it a .sh extension or no extension at all; either works.
Edit the output folder path at the end of the code. This is where the resulting .txt files will be placed.
Then copy the script file into the folder that holds all of your PDFs.
(If your PDF files are scattered around your hard disk, there are commands that can gather them all into a single folder; you can find them on the net.)
Open a terminal and "cd" into that folder, for example: "cd /home/your_id/whatever/folder". Now you are inside the folder with all your PDFs and the script file.
Let's assume the name of your script file is "script", with no extension.
Then run it from the command line: "./script *.pdf"
That's it.
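When you type "./script *.pdf", the shell (not the script) expands the wildcard into the full list of file names, and the bare "for INFILE" loop in the script then walks exactly that argument list. A small pure-bash demonstration of the mechanism:

```shell
#!/bin/bash
## work in a scratch directory with some fake "PDFs"
cd "$(mktemp -d)"
touch a.pdf b.pdf c.pdf
## simulate what the shell passes to ./script *.pdf
set -- *.pdf
## "for VAR" with no "in ..." list iterates the positional arguments
count=0
for INFILE ; do
    echo "got: $INFILE"
    count=$((count + 1))
done
echo "$count PDFs"   # prints: 3 PDFs
```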
The temporary files are deleted along the way - a clean script.
You don't need ABBYY FineReader to convert your files into .txt format.
Linux can do it, and do it for free.
Note: the "density" value in the code is probably the best setting. Lower values produce wrong results in the txt files; higher values take much longer to process. I have tested it and it works well, so I advise not changing it.
I think this script can be useful for many.
Thanks to all!
- Aytaç