how to process files separately in unzip to pipe?

bitrat · #1 Post by **bitrat** » 2023-08-30 21:53

Hi,

I'm trying to process a bunch of text (email) files read one by one from a zip file into a pipe.

Assuming process is a script that reads stdin, I'd like to do it this way:

Code: Select all

unzip -p file.zip | ./process

However, unzip's output is just a continuous stream of bytes, with nothing separating each file (not even a new line).

This method works for 'traditional' filenames:

Code: Select all

zipinfo -1 file.zip  | while read f ; do unzip -p file.zip "$f" | ./process ; done

Unfortunately, a large number of the filenames have non-word (ie, [^\w]) ascii characters and/or wide characters, like emojis, etc.

For these 'rich' filenames, unzip just prints:

Code: Select all

caution: filename not matched:

...which is an error, not a warning, since the file isn't found.

I'm extremely reluctant to engage in anything involving parsing the actual content for email boundaries. For example, I could maybe try something masochistic like...

Code: Select all

./print-stuff | sed 's|Message-Id: |^ZMessage-Id: |' | while read -d ^Z s ; do echo "$s" | ./do-stuff ; done

but I'd rather gouge my eyes out... For a start, I don't know what the first header field will be.

Another solution that suggests itself is to read the index in the zip file, then extract the contents manually, by offset. Unless there's an existing tool for this, it sounds like a lot of hard work...

So, can anybody suggest a way to:

separate individual files sent to stdout from unzip?
or iterate over and extract a zip file's contents, without using filenames?
or pass 'rich' filenames so they can be recognised by unzip? (there are a variety of switches for this, but none seem applicable)

Notes:
- None of the zip files contain directories.
- unzip's -U option doesn't help.

Update:

Code: Select all

zipdetails --scan file.zip

...outputs the filename, showing the hex values of the non-printable bytes. Unfortunately the output is mangled into columns, but it may be useful.

Another possibility is to get each file's size and read the stream in chunks of that size into a pipe. Maybe something like...

Code: Select all

mkfifo emails

unzip -p file.zip > emails

# in a separate terminal...

unzip -lqq file.zip | while read info ; do
         SIZE=$(echo "$info" | cut -c -9)
         printf "size: %10d\tfilename: %s\n"  $SIZE $(echo "$info" | cut -c 31-)
         cat emails | dd ibs=$SIZE count=1 2>/dev/null | ./process
done

...or something. I don't need the original filename.

I quite like this idea. I haven't tried it yet, but this looks promising...

Code: Select all

echo 1234567890abcdefghij > emails

============

cat emails | dd ibs=10 count=1 iseek=0 2>/dev/null
1234567890

cat emails | dd ibs=10 count=1 iseek=1 2>/dev/null
abcdefghij

The default block size is 512 bytes, but I don't know what the maximum is. I think I'd need to split the bytes into smaller blocks, since some of the emails are tens of MB.

something like....

Code: Select all

$ declare -f get_std_blks
get_std_blks () 
{ 
    SZ=$(($1));
    IBS=512;
    COUNT=$(echo "$SZ / $IBS" | bc);
    echo $IBS x $COUNT
}


$ declare -f get_last_blk
get_last_blk () 
{ 
    SZ=$(($1));
    BS=512;
    COUNT=1;
    IBS=$(echo "$SZ % $BS" | bc);
    echo $IBS x $COUNT
}


$ declare -f read_input
read_input () 
{ 
    while read l; do
        sz=$(echo "$l" | cut -f 1 -d ' ');
        fn=$(echo "$l" | cut -f 2- -d ' ');
        echo ______________________________ $sz [$fn];
        get_std_blks $sz;
        get_last_blk $sz;
    done
}


$ echo '1657 sas asasas 
1655 xzxzx zxz 
9441 jkghfkjgh kdf 
16429 jh fsdkfk 
627 kjhsdkjf kdf 
1828 sdk sdjhfksdj 
2706 fhk sdjhsfd 
2005 kj sdhfkhd' | read_input
______________________________ 1657 [sas asasas]
512 x 3
121 x 1
______________________________ 1655 [xzxzx zxz]
512 x 3
119 x 1
______________________________ 9441 [jkghfkjgh kdf]
512 x 18
225 x 1
______________________________ 16429 [jh fsdkfk]
512 x 32
45 x 1
______________________________ 627 [kjhsdkjf kdf]
512 x 1
115 x 1
______________________________ 1828 [sdk sdjhfksdj]
512 x 3
292 x 1
______________________________ 2706 [fhk sdjhsfd]
512 x 5
146 x 1
______________________________ 2005 [kj sdhfkhd]
512 x 3
469 x 1

Unfortunately, returning from the first application of dd breaks the pipe. It needs something like...

Code: Select all

echo 1234567890abcdefghij > emails

=========

$ cat emails | { dd ibs=2 count=2 2>/dev/null ; dd ibs=2 count=2 2>/dev/null ; }
12345678

.... but a loop doesn't work. Hmmmm.... Process substitution looks like the answer. Maybe I'll find out tomorrow.
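For comparison, here is a minimal Python sketch of the same "read N bytes, then the next N bytes" idea. It keeps one stream open and reads each chunk from it in turn, so nothing is lost between reads. The names `read_exact` and `split_by_sizes` are made up for illustration, and the sketch assumes the sizes are supplied in the same order as the data appears in the stream.

```python
import io
from typing import BinaryIO, Iterable, Iterator


def read_exact(stream: BinaryIO, size: int, block_size: int = 64 * 1024) -> bytes:
    """Read exactly `size` bytes from `stream`, in blocks of at most `block_size`."""
    parts = []
    remaining = size
    while remaining > 0:
        block = stream.read(min(block_size, remaining))
        if not block:
            # The stream ended early: the sizes do not match the data.
            raise EOFError(f"stream ended with {remaining} bytes still expected")
        parts.append(block)
        remaining -= len(block)
    return b"".join(parts)


def split_by_sizes(stream: BinaryIO, sizes: Iterable[int]) -> Iterator[bytes]:
    """Yield one bytes object per size, read back to back from a single stream."""
    for size in sizes:
        yield read_exact(stream, size)


# Same data as the dd experiment: two 10-byte chunks from one stream.
demo = io.BytesIO(b"1234567890abcdefghij\n")
for chunk in split_by_sizes(demo, [10, 10]):
    print(chunk)
```

To use it on a real pipe, pass `sys.stdin.buffer` as the stream. Whether `unzip -p` writes members in the same order as the listing is an assumption that would need checking.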

As shown below, the same process is read repeatedly in the loop, not called each time. There are two different 'i' variables.

Code: Select all

for i in {0..10} ; do dd ibs=2 count=2 skip="$i" 2>/dev/null < <(for i in {1..10} ; do echo a"$i"b"$i"c"$i"d"$i"e"$i"f"$i" ; done) ; echo ; done
a1b1
b1c1
c1d1
d1e1
e1f1
f1
a

a2b
2b2c
2c2d
2d2e
2e2f

At this point, a normal person might be tempted to do this with C, but that would be boring...

I curse the day spaces, special and wide characters were allowed in filenames!

lindi · #2 Post by **lindi** » 2023-08-31 06:24

I would consider writing this in python, it has support for zip files.
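A rough sketch of what that could look like, assuming the goal is still to run an existing command once per member with the member on its stdin. The function name `pipe_members` is hypothetical; `zipfile.ZipFile.infolist()`, `ZipFile.open()` and `subprocess.run()` are standard library calls. Members are opened through their `ZipInfo` entry, so the name never passes through the shell or the terminal.

```python
import subprocess
import zipfile
from typing import Iterator, Sequence, Tuple


def pipe_members(zip_path: str, command: Sequence[str]) -> Iterator[Tuple[str, int]]:
    """Run `command` once per archive member, feeding the member to its stdin.

    Yields (member name, exit status of the command).
    """
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            with archive.open(info) as member:
                # Loads the whole member into memory; acceptable for emails
                # of a few tens of MB, but not for arbitrarily large files.
                data = member.read()
            result = subprocess.run(command, input=data)
            yield info.filename, result.returncode
```

A call such as `for name, status in pipe_members("file.zip", ["./process"]): ...` would then play the role of the `unzip -p ... | ./process` loop.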

bitrat · #3 Post by **bitrat** » 2023-08-31 09:10

lindi wrote: ↑2023-08-31 06:24 I would consider writing this in python, it has support for zip files.

Hi, thanks lindi, I considered that...

python3 -m zipfile

It seems like an admission of defeat though.

Also, I have other bash scripts already to do the processing, and I'd rather not call them from a python script. I suspect using python would just add another layer of indirection, and probably call the same underlying zip library code, lol. Also, if I just use python to print out n bytes, it would probably be easier to just do that part in C.

But yes, you're right, it would probably be easier, and I'll check it out!

Actually, I'm surprised there aren't tools for this in unzip already. It doesn't seem that exotic a use case...

UPDATE:

Happily, Python's zipfile module handles the problematic filenames correctly (which my bash pipeline does not), so this is the best solution. Thanks!

Code: Select all

$ python3
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 
>>> import zipfile
>>> 
>>> with zipfile.ZipFile('file.zip') as myzip:
...     with myzip.open('file1.eml') as myfile:
...             print(myfile.read())
... 
b'Message-Id: <148366 ....

I'll finish the job before I mark this solved, but I'm optimistic. I don't regret the bash experiment, as I've learned a couple of good tricks!

One question:

Code: Select all

>>> import hashlib
>>> hashlib.md5('example string'.encode(encoding = 'UTF-8')).hexdigest()
'aaaa5d16ca437bd4564f25ec0cb71c9d'

Code: Select all

echo 'example string' | md5sum
5ae20f1b2a76baf4d18f8c02a89d6a9b  -

I'm hopeless with encoding questions. Can you (or anyone) explain why these hashes are different, and how I can align the input encoding so they're the same? My intuition is that 'example string' should only contain (identical) bytes in either case, since there are no wide chars in the string. xfce4-terminal doesn't seem to have an encoding setting...?

I have hashes generated from email contents with bash md5sum and I want to match them to newly generated ones. It's probably not too critical, as I can always print out the email contents and get the hash using bash:

Code: Select all

$ python3 -c 'print("example string")' | md5sum
5ae20f1b2a76baf4d18f8c02a89d6a9b  -

$ echo "example string" | md5sum
5ae20f1b2a76baf4d18f8c02a89d6a9b  -

This works:

Code: Select all

import zipfile
file='problems.zip'
#file='file.zip'

with zipfile.ZipFile(file) as myzip:
    for name in myzip.namelist():
        with myzip.open(name) as myfile:
            print(myfile.read())
            print('\x04')

So, hopefully, my problem is solved (thanks lindi). I just need to handle a marker (^D) between each file.

It seems that the key issue is aligning the encodings of the filenames:

from unzip -l to xfce4-terminal to unzip -p
or from python print(filename) to xfce4-terminal to unzip -p

Both of which cause errors on my system. However, when I read the filenames from the zip file and use them to extract that file all within python, it works, since there's no encoding change. Actually, I've yet to see what happens to the encoding of wide characters when they're printed by python, but I'm hoping I might have more control over that. Essentially, they're just bytes anyway, so who cares what they represent.

I'd be interested to know the correct way to handle the failure case, but I don't need to right now.
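One way to inspect how the names are stored is sketched below. The zip format has a general-purpose flag (bit 11, value 0x800) that marks a member name as UTF-8, and Python's zipfile decodes names as UTF-8 when it is set and as CP437 otherwise. The helper name `list_name_encodings` is made up for illustration; whether this flag explains the `unzip` failures here is only a guess.

```python
import zipfile
from typing import List, Tuple

UTF8_NAME_FLAG = 0x800  # general-purpose bit 11: name is stored as UTF-8


def list_name_encodings(zip_path: str) -> List[Tuple[str, bool]]:
    """Return (decoded name, stored-as-UTF-8?) for each archive member."""
    with zipfile.ZipFile(zip_path) as archive:
        return [
            (info.filename, bool(info.flag_bits & UTF8_NAME_FLAG))
            for info in archive.infolist()
        ]
```

Running it on a problem archive should show whether the 'rich' names carry the UTF-8 flag or were written in some legacy encoding.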

bitrat

bitrat · #4 Post by **bitrat** » 2023-08-31 22:35

PROGRESS SO FAR...

This is what I want to accomplish:

Code: Select all

$ zipinfo -1 file.zip  | while read f ; do unzip -p file.zip "$f" | md5sum ; done
cc4b468ab53601d447f744fca1648a2d  -
5461d70bf31919b18d26bb75228a3b48  -
936b7989fb4f6a93c9419b2cf9a99cbd  -
47bb075234dac9505c62f8755b6308b1  -
9937b1439051f8ae506b53aee8843466  -
1723095936b79b9c18e84181696aa8f0  -
26ed1e10eccae9ee1b0dbd5ec8738a7f  -
41f5e411c47b44c3ba57e69390d00e5d  -

But this method fails for filenames with wide chars:

Code: Select all

zipinfo -1 problems.zip  | while read f ; do unzip -p problems.zip "$f" | md5sum ; done
caution: filename not matched:  111 👤👤👤.eml
d41d8cd98f00b204e9800998ecf8427e  -
caution: filename not matched:  222 👤👤👤.eml
d41d8cd98f00b204e9800998ecf8427e  -
caution: filename not matched:  333 👤👤👤.eml
d41d8cd98f00b204e9800998ecf8427e  -
caution: filename not matched:  444 👤👤👤.eml
d41d8cd98f00b204e9800998ecf8427e  -
.
.
.

The filename problem is resolved in Python, but I have to figure out how to get the hash. I've been misled a few times by forgetting to suppress echo's new line, but this looks promising...

Code: Select all

$ python3 -c "import sys
sys.stdout.buffer.write(b'example string')" | md5sum
aaaa5d16ca437bd4564f25ec0cb71c9d  -

$ echo -n 'example string' | md5sum
aaaa5d16ca437bd4564f25ec0cb71c9d  -

$ python3 -c 'print("333 👤👤👤")' | md5sum
656919e847926a89c1e870135716a8dc  -

$ echo "333 👤👤👤" | md5sum
656919e847926a89c1e870135716a8dc  -
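The same comparison can be made entirely inside Python. This is a small sketch; the helper name `md5_hex` is made up for illustration. The only difference between the two inputs is the trailing newline that plain `echo` appends, so the two digests should correspond to the `echo -n` and plain `echo` results respectively.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of raw bytes, comparable to the first column of md5sum."""
    return hashlib.md5(data).hexdigest()


text = "example string"
# Equivalent input to: echo -n 'example string' | md5sum
without_newline = md5_hex(text.encode("utf-8"))
# Equivalent input to: echo 'example string' | md5sum
with_newline = md5_hex((text + "\n").encode("utf-8"))

print(without_newline)
print(with_newline)
```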

But I still have issues...

Code: Select all

$ cat print-files.py
import zipfile
import sys

file='file.zip'

with zipfile.ZipFile(file) as myzip:
    for name in myzip.namelist():
        with myzip.open(name) as myfile:
            sys.stdout.buffer.write(myfile.read())
            sys.stdout.buffer.write(b'\x04')  ## ^D

Code: Select all

$ python3 print-files.py | while read -d ^D s ; do echo -n "$s" | md5sum  ; done
cc4b468ab53601d447f744fca1648a2d  -
5461d70bf31919b18d26bb75228a3b48  -
b503ea92cd1d8af69ace157268d519fe  -
66df6ad9a581504b978f82a048a4667c  -
d54fe1cfed30f92f7cc1ac2206799528  -
c4dcfd03ff95fd9dbde1f2c3b0ac6dda  -
8d209914b13f2ebaff74cff6e8d7d5cc  -
7cfa864efec605fe484c490b4376850f  -

Only the first two lines agree with the bash only version...

Code: Select all

$ zipinfo -1 file.zip  | while read f ; do unzip -p file.zip "$f" | md5sum ; done
cc4b468ab53601d447f744fca1648a2d  -
5461d70bf31919b18d26bb75228a3b48  -
936b7989fb4f6a93c9419b2cf9a99cbd  -
47bb075234dac9505c62f8755b6308b1  -
9937b1439051f8ae506b53aee8843466  -
1723095936b79b9c18e84181696aa8f0  -
26ed1e10eccae9ee1b0dbd5ec8738a7f  -
41f5e411c47b44c3ba57e69390d00e5d  -

This is due to newlines being added or removed at the end of some of the files, somewhere in the pipeline, and possibly different handling of '\r' characters... Should be soluble...

Code: Select all

$ python3 print-files.py | while read -d ^D s ; do printf "%s" "$s" | tr -d -c '\n' ; done  | wc -l
794

$ zipinfo -1 file.zip  | while read f ; do unzip -p file.zip "$f" | tr -d -c '\n' ; done | wc -l
800
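A likely cause of the missing newlines is bash's `read`: with the default IFS it strips leading and trailing whitespace (newlines included) from the value, and without `-r` it also interprets backslashes. A sketch of a consumer that avoids this by splitting the stream in binary mode is below. `split_on_delimiter` and `md5_per_member` are hypothetical names, and the approach assumes that no email contains the 0x04 byte itself.

```python
import hashlib
import io
from typing import BinaryIO, Iterator, List

DELIMITER = b"\x04"  # the ^D marker written by print-files.py


def split_on_delimiter(stream: BinaryIO, delimiter: bytes = DELIMITER,
                       block_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield the byte strings that precede each delimiter in a binary stream."""
    pending = b""
    while True:
        block = stream.read(block_size)
        if not block:
            break
        pending += block
        # Everything before the last delimiter is complete; keep the rest.
        *complete, pending = pending.split(delimiter)
        yield from complete
    if pending:
        yield pending  # trailing data with no closing delimiter


def md5_per_member(stream: BinaryIO) -> List[str]:
    """MD5 hex digest of each delimited member, with every byte preserved."""
    return [hashlib.md5(m).hexdigest() for m in split_on_delimiter(stream)]
```

It could be fed from the existing script with something like `python3 print-files.py | python3 consumer.py`, where the hypothetical `consumer.py` calls `md5_per_member(sys.stdin.buffer)`.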

I've realised that line oriented buffering pretty much makes it impossible to accomplish this task without writing the data to a file. There are too many edge cases triggering different handling of newlines in the stream oriented approach. Short of using netcat and UDP, I've decided to give up on streams for now...

Since my main goal is just to get the hashes of the files without extracting them all (because there are many GBs and not much space), the obvious solution is just to write each file to a tempfile as it's extracted. Python was definitely a good choice though! Not least because it does a good job decoding utf-8 MIME encoding in email headers! I'm using it for the other processing now.

I'm still sticking with md5sum for the hash, though, calling it from Python as an external process. I'm not sure why, but I can't get Python's own hashing to deliver the same result, and I've run out of patience.
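For reference, a minimal sketch of hashing each member inside Python, with no pipe and no temp file, follows. The function name `md5_of_members` is made up for illustration. It feeds the raw, uncompressed bytes of each member to `hashlib.md5` in blocks, which should match `unzip -p file.zip NAME | md5sum` provided `unzip` does no text conversion on the way out (an assumption).

```python
import hashlib
import zipfile
from typing import List, Tuple


def md5_of_members(zip_path: str, block_size: int = 1024 * 1024) -> List[Tuple[str, str]]:
    """Return (member name, MD5 hex digest) for every file in the archive."""
    digests = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            md5 = hashlib.md5()
            with archive.open(info) as member:
                # Read in blocks so large emails are never fully in memory.
                for block in iter(lambda: member.read(block_size), b""):
                    md5.update(block)
            digests.append((info.filename, md5.hexdigest()))
    return digests
```

If the digests still differ from the stored ones, comparing byte counts per member (`info.file_size` against `wc -c`) would be a reasonable next check.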

kent_dorfman766 · #5 Post by **kent_dorfman766** » 2023-09-25 07:22

too much to read and digest above, but let's assume the zipfile is a single file containing a list of emails...
Since the standard UNIX mail facilities save all emails for a single user in a single file, you need to parse them out the way mail utilities do:

Look for the standard mail headers and message length fields as indicators of how much that follows is part of that message. Use that data to separate the messages in "awk fashion"

Otherwise, if the zipfile has a separate file for each message then write a script to extract the files to temporary directory and process each one individually. It will be far less work than trying to pipe zip data through file descriptors.

bitrat · #6 Post by **bitrat** » 2023-09-25 22:31

kent_dorfman766 wrote: ↑2023-09-25 07:22 too much to read and digest above, but let's assume the zipfile is a single file containing a list of emails...
Since the standard UNIX mail facilities save all emails for a single user in a single file, you need to parse them out the way mail utilities do:

Look for the standard mail headers and message length fields as indicators of how much that follows is part of that message. Use that data to separate the messages in "awk fashion"

Otherwise, if the zipfile has a separate file for each message then write a script to extract the files to temporary directory and process each one individually. It will be far less work than trying to pipe zip data through file descriptors.

Hi, thanks for this. I actually found the solution in switching to Python, which handled UTF-8 correctly. The unix mail tools weren't really up to it in that area.

I wanted to do this without extracting the files because there were quite a few GB involved and I wanted to process them more than once, so it just seemed easier to make a tool that could examine the contents in place. I agree though, in most situations, it would make sense to just extract the files!

Debian User Forums
