pdf text only partially selectable - though it should be

Postby stillsen » 2019-11-11 11:06

Hi,
I have a pdf of roughly 190 lines of text per page, in which I can only select the first 153 lines. But I need all lines to be selectable.
If I use Chromium's built-in pdf viewer or Windows, I can select all lines.

Hence I thought this issue might be connected to the default paper size. I changed it between a4 and letter using 'dpkg-reconfigure libpaper1', which did not solve the problem.
Does anyone have a clue what I might do or where else I might find help?

I know that this is maybe not the ideal place to ask, but I could not come up with another idea.
Thanks in advance,
stillsen

Postby arochester » 2019-11-11 14:24

stillsen wrote:If I use Chromium's built-in pdf viewer or Windows, I can select all lines.

Which application are you using that only shows 153 lines?

Postby Soul Singin' » 2019-11-11 16:50

stillsen wrote:I have a pdf of roughly 190 lines of text per page, in which I can only select the first 153 lines. But I need all lines to be selectable.

Instead of selecting text, first install the poppler-utils package:

Code: Select all
# apt-get install poppler-utils

Then run:

Code: Select all
$ pdftotext your-file.pdf

It's much easier that way. It will output all of the text to: your-file.txt.

Or if you would like to direct the output to another file:

Code: Select all
$ pdftotext your-file.pdf some-other-file.txt
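
To check whether all of the roughly 190 lines on a page actually come through, one option is to extract a single page and count its lines. This is only a minimal sketch: the function name count_page_lines is made up for illustration, and it assumes pdftotext from poppler-utils is installed.

```shell
# Hypothetical helper (not part of poppler-utils): print the number of text
# lines that pdftotext extracts from a single page of a PDF.
# The count includes any blank lines in the extracted text.
count_page_lines() {
    pdf=$1
    page=$2
    # Fail early if the file is missing or unreadable.
    [ -r "$pdf" ] || { echo "cannot read: $pdf" >&2; return 1; }
    # -f/-l select the first and last page, -layout keeps the physical layout,
    # and "-" sends the text to standard output instead of a file.
    pdftotext -f "$page" -l "$page" -layout "$pdf" - | wc -l
}
```

For example, "count_page_lines your-file.pdf 1" prints the count for the first page, which can then be compared with what the viewer shows.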

Postby stillsen » 2019-11-12 12:48

The tools I have been using to display the pdfs and select text within them are Atril Document Viewer and Okular.

I've tried the poppler approach, which gives me a text file, but the lines from around 153 until the end of the page are missing there too :(

This is the pdf in question:
https://static-content.springer.com/esm/art%3A10.1038%2Fs41540-018-0069-9/MediaObjects/41540_2018_69_MOESM2_ESM.pdf

Postby Soul Singin' » 2019-11-12 16:00

Whoa! That's one huge file.

Because it's a data table of some kind, your task would be much easier if you could obtain the spreadsheet (or other file) that was used to generate it.

I would write to the authors and ask them to share it with you. Tell them what you're working on, why you think their work is important and how you would like to build on it. Who knows? They might say: "Yes."

Good luck!
- Soul

Postby bester69 » 2019-11-15 12:51

stillsen wrote:Hi,
I have a pdf of roughly 190 lines of text per page, in which I can only select the first 153 lines. But I need all lines to be selectable.
If I use Chromium's built-in pdf viewer or Windows, I can select all lines.

Hence I thought this issue might be connected to the default paper size. I changed it between a4 and letter using 'dpkg-reconfigure libpaper1', which did not solve the problem.
Does anyone have a clue what I might do or where else I might find help?

I know that this is maybe not the ideal place to ask, but I could not come up with another idea.
Thanks in advance,
stillsen

I think you could do it like this:

0- Extract the range of pages you need:
Code: Select all
pdftk 41540_2018_69_MOESM2_ESM.pdf cat 1-2 output sal1-2.pdf


1- Convert the pages to a plain text file with Ghostscript:
Code: Select all
gs -sDEVICE=txtwrite -dNOPAUSE -dBATCH -sOutputFile=out1-2.txt sal1-2.pdf


2- And finally, if you need it and feel more comfortable that way, convert it to an OpenDocument text file (odt):
Code: Select all
unoconv -f odt out1-2.txt
or
libreoffice  --headless --convert-to odt out1-2.txt


Then you adjust the document to a bigger page size, like A3 or whichever one matches the original pdf document, and the result is this:
Image

Postby stillsen » 2019-11-24 07:42

Thank you so much for your help! It's solved now.
Yes, it is a huge file, and I want to convert it into CSV.

I absolutely didn't think of Ghostscript, which in turn did the trick.
On Debian I did not manage to get the correct character encoding, so I switched to Windows and converted the whole pdf using:

Code: Select all
gs -sDEVICE=txtwrite -dNOPAUSE  -dBATCH -sOutputFile=out.txt 41540_2018_69_MOESM2_ESM.pdf


BIG THANKS again!

Postby Soul Singin' » 2019-11-25 09:58

stillsen wrote:Thank you so much for your help! It's solved now.
Yes, it is a huge file, and I want to convert it into CSV.

I'm glad you got the text. Below is a Perl script that will convert the file to CSV.

You have made me so curious that I even tested it for you. It should work fine. Now could you please tell us what this data is?

Code: Select all
#!/usr/bin/env perl

use strict;
use warnings;

##  input file -- the output of "gs" command
my $infile = "out.txt";

##  output file -- formatted CSV
my $otfile = "formatted.csv";

##  open the files for reading and writing
##  (input first, so a missing input does not truncate the output file)
open( my $in_fh,  '<', $infile ) || die "could not open $infile: $!";
open( my $out_fh, '>', $otfile ) || die "could not overwrite $otfile: $!";

##  read in the input file and convert it to CSV
while ( my $line = <$in_fh> ) {

    ##  remove the newline (at end of each line)
    chomp $line;

    ##  remove excess space
    $line =~ s/\s+/ /g;
    $line =~ s/^ //;
    $line =~ s/ $//;

    ##  replace the spaces with commas
    $line =~ s/ /,/g;

    ##  if the line starts with a capital letter (text), add an initial newline
    ##  otherwise (floats), add an initial empty column
    $line = ( $line =~ /^[A-Z]/ ) ? "\n" . $line : ',' . $line;

    ##  print to the CSV file (adding a newline)
    print {$out_fh} $line . "\n";
}

##  close the files
close $in_fh;
close $out_fh;
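
A quick way to sanity-check the result is to count how many comma-separated fields each row has. The script above gives numeric rows an empty first column so that they line up with the text rows, which means rows with an odd field count should stand out. This is only a sketch: the function name csv_field_histogram is hypothetical, and it assumes a POSIX shell with awk and sort.

```shell
# Hypothetical helper: for each distinct number of comma-separated fields,
# print "<field count> <number of rows with that count>".
# Completely empty lines are skipped.
csv_field_histogram() {
    awk -F',' 'NF > 0 { count[NF]++ }
               END { for (n in count) print n, count[n] }' "$1" | sort -n
}
```

Running "csv_field_histogram formatted.csv" then shows at a glance whether the rows share one width or whether some lines were split differently.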

Postby stevepusser » 2019-11-25 23:31

Just as a point of interest, the free-as-in-beer-but-not-as-in-speech Master PDF Editor seems to be able to copy those lines, if this is the last one:

Code: Select all
FOX3FUS3STR3DOX3TMP3 FOX FUS STR DOX TMP FOX+FUS FOX+STR FOX+DOX FOX+TMP FUS+STR FUS+DOX FUS+TMP STR+DOX STR+TMP DOX+TMP FOX+FUS+STR FOX+FUS+DOX FOX+FUS+TMP FOX+STR+DOX FOX+STR+TMP FOX+DOX+TMP FUS+STR+DOX FUS+STR+TMP FUS+DOX+TMP STR+DOX+TMP FOX+FUS+STR+DOX FOX+FUS+STR+TMP FOX+FUS+DOX+TMP FOX+STR+DOX+TMP FUS+STR+DOX+TMP FOX+FUS+STR+DOX+TMP
95.77191621 92.0480993 83.35919317 84.55559984 97.30968513 101.3576416 87.70364624 95.46159814 66.15593483 63.49883631 118.7354538 125.2521334 101.6679597 107.8743212 113.889996 71.87742436 77.15283165 96.08223429 68.7742436 47.98293251 101.9782777 58.08766486 63.05275407 115.0116369 81.18696664 54.4996121 65.22498061 116.5632273 85.84173778 54.67416602 44.74398759
101.0549983 92.75337254 78.91732964 99.70660147 95.53382233 89.7094431 77.2570045 75.04323763 57.33310273 58.99342788 109.356624 108.8031823 92.47665168 102.4386026 92.92583537 71.44586648 66.46489104 63.97440332 46.54098928 44.88066413 90.81632653 57.88654445 76.98028364 109.356624 89.98616396 46.26426842 56.77966102 101.60844 79.74749222 68.6786579 50.968523
94.69523977 87.10697348 87.10697348 85.48565121 111.9757174 99.75408396 75.58405059 100.8782716 51.13297031 60.1264711 102.0024592 110.99596 90.19848937 93.29000527 108.7380427 79.51870718 79.51870718 56.19181451 66.3095029 27.80607764 95.25733357 53.38134551 67.9957843 115.7737572 88.79325487 47.19831372 54.50553311 98.34884946 74.17881609 64.6232215 62.0937994