I'm trying to process a bunch of text (email) files read one by one from a zip file into a pipe.
Assuming process is a script that reads stdin, I'd like to do it this way:
unzip -p file.zip | ./process
However, unzip's output is just a continuous stream of bytes, with nothing separating one file from the next (not even a newline).
This method works for 'traditional' filenames:
zipinfo -1 file.zip | while IFS= read -r f ; do unzip -p file.zip "$f" | ./process ; done
Unfortunately, a large number of the filenames contain non-word (i.e., [^\w]) ASCII characters and/or wide characters such as emoji.
For these 'rich' filenames, unzip just prints:
caution: filename not matched:
...which is an error, not a warning, since the file isn't found.
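One thing that might explain part of this: as far as I know, unzip treats member names given on the command line as wildcard patterns, so `*`, `?`, `[` and `]` in a name are interpreted rather than matched literally. Below is a minimal sketch of a workaround, assuming the installed unzip honours backslash escapes in patterns. `escape_unzip_pattern` is a hypothetical helper, not part of unzip, and it probably does nothing for the wide-character names, where the name printed by zipinfo may not be byte-identical to the stored one.

```shell
# Hypothetical helper: backslash-escape the characters that unzip's
# pattern matching treats specially (* ? [ ] and backslash itself).
escape_unzip_pattern() {
    printf '%s\n' "$1" | sed 's/[][\\*?]/\\&/g'
}

# Assumed usage (not run against a real archive):
# zipinfo -1 file.zip | while IFS= read -r f ; do
#     unzip -p file.zip "$(escape_unzip_pattern "$f")" | ./process
# done
```

If the escaped names still come back as "filename not matched", the problem is presumably the character encoding rather than the pattern syntax.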
I'm extremely reluctant to engage in anything involving parsing the actual content for email boundaries. For example, I could maybe try something masochistic like...
./print-stuff | sed $'s|Message-Id: |\x1aMessage-Id: |' | while IFS= read -r -d $'\x1a' s ; do printf '%s\n' "$s" | ./do-stuff ; done
but I'd rather gouge my eyes out... For a start, I don't know what the first header field will be.
Another solution that suggests itself is to read the index in the zip file, then extract the contents manually, by offset. Unless there's an existing tool for this, it sounds like a lot of hard work...
So, can anybody suggest a way to:
- separate individual files sent to stdout from unzip?
- or iterate over and extract a zip file's contents, without using filenames?
- or pass 'rich' filenames so they can be recognised by unzip? (there are a variety of switches for this, but none seem applicable)
- None of the zip files contain directories.
- unzip's -U option doesn't help.
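For the second option, if writing to disk is acceptable, the filenames never have to be passed to unzip at all: extract everything into a scratch directory and let the shell glob iterate over whatever lands there. This is only a sketch under assumptions: `each_file` and `process_zip` are hypothetical names, there has to be enough temporary space, and unzip may substitute characters in the names it writes (which doesn't matter here, since the names are never used again).

```shell
# Hypothetical helper: run a command once per regular file in a directory,
# feeding each file to the command's stdin. Names starting with a dot are
# skipped, because the * glob does not match them.
each_file() {
    local dir=$1
    shift
    local f
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        "$@" < "$f"
    done
}

# Hypothetical wrapper: extract the archive into a temporary directory,
# process every file, then clean up.
process_zip() {
    local zip=$1
    shift
    local tmp
    tmp=$(mktemp -d) || return 1
    unzip -q "$zip" -d "$tmp" && each_file "$tmp" "$@"
    local rc=$?
    rm -rf "$tmp"
    return $rc
}
```

Assumed usage would be `process_zip file.zip ./process`. The files are visited in glob order, not in archive order.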
Update:
zipdetails --scan file.zip
Another possibility is to get each file's size and read the stream into a pipe in chunks of that size. Maybe something like...
mkfifo emails
unzip -p file.zip > emails
# in a separate terminal...
# IFS= and -r keep the leading spaces, so the column offsets below stay valid
unzip -lqq file.zip | while IFS= read -r info ; do
    SIZE=$(echo "$info" | cut -c -9 | tr -d ' ')
    printf "size: %10d\tfilename: %s\n" "$SIZE" "$(echo "$info" | cut -c 31-)"
    cat emails | dd ibs="$SIZE" count=1 2>/dev/null | ./process
done
...or something. I don't need the original filename.
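Pulling the sizes out of the listing can be done in one pass with awk instead of one `cut` per line. This sketch assumes the same column layout as above (size in the first field, name starting at column 31), which is an assumption about this unzip version's `-lqq` output. `size_and_name` is a hypothetical name.

```shell
# Hypothetical helper: read "unzip -lqq" style lines on stdin and print
# "size<TAB>name" for each. Assumes the name starts at column 31.
size_and_name() {
    awk '{ printf "%s\t%s\n", $1, substr($0, 31) }'
}

# Assumed usage: unzip -lqq file.zip | size_and_name
```

Taking the size from the first field rather than from fixed columns should also survive sizes wider than nine digits, although the name offset would shift in that case.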
I quite like this idea. I haven't tried it yet, but this looks promising...
echo 1234567890abcdefghij > emails
============
cat emails | dd ibs=10 count=1 iseek=0 2>/dev/null
1234567890
cat emails | dd ibs=10 count=1 iseek=1 2>/dev/null
abcdefghij
The default block size is 512 bytes, but I don't know what the maximum is. I think I'd need to chunk the bytes into smaller blocks, since some of the emails are tens of MB.
Something like...
$ declare -f get_std_blks
get_std_blks ()
{
    SZ=$(($1));
    IBS=512;
    COUNT=$(echo "$SZ / $IBS" | bc);
    echo $IBS x $COUNT
}
$ declare -f get_last_blk
get_last_blk ()
{
    SZ=$(($1));
    BS=512;
    COUNT=1;
    IBS=$(echo "$SZ % $BS" | bc);
    echo $IBS x $COUNT
}
$ declare -f read_input
read_input ()
{
    while read l; do
        sz=$(echo "$l" | cut -f 1 -d ' ');
        fn=$(echo "$l" | cut -f 2- -d ' ');
        echo ______________________________ $sz [$fn];
        get_std_blks $sz;
        get_last_blk $sz;
    done
}
$ echo '1657 sas asasas
1655 xzxzx zxz
9441 jkghfkjgh kdf
16429 jh fsdkfk
627 kjhsdkjf kdf
1828 sdk sdjhfksdj
2706 fhk sdjhsfd
2005 kj sdhfkhd' | read_input
______________________________ 1657 [sas asasas]
512 x 3
121 x 1
______________________________ 1655 [xzxzx zxz]
512 x 3
119 x 1
______________________________ 9441 [jkghfkjgh kdf]
512 x 18
225 x 1
______________________________ 16429 [jh fsdkfk]
512 x 32
45 x 1
______________________________ 627 [kjhsdkjf kdf]
512 x 1
115 x 1
______________________________ 1828 [sdk sdjhfksdj]
512 x 3
292 x 1
______________________________ 2706 [fhk sdjhsfd]
512 x 5
146 x 1
______________________________ 2005 [kj sdhfkhd]
512 x 3
469 x 1
Unfortunately, returning from the first application of dd breaks the pipe. It needs something like...
echo 1234567890abcdefghij > emails
=========
$ cat emails | { dd ibs=2 count=2 2>/dev/null ; dd ibs=2 count=2 2>/dev/null ; }
12345678
.... but a loop doesn't work. Hmmmm.... Process substitution looks like the answer. Maybe I'll find out tomorrow.
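A loop can be made to work, I think, if the list of sizes arrives on a different file descriptor from the byte stream, so that every dd in the loop keeps reading from the same stdin. The sketch below is untested against a real archive and rests on several assumptions: GNU dd (for `iflag=fullblock`), and `unzip -p` writing the members in the same order as the listing, with nothing in between. `take_bytes` and `split_stream` are hypothetical names.

```shell
# Hypothetical helper: copy exactly $1 bytes from stdin to stdout, as whole
# 64 KiB blocks followed by one short block for the remainder.
# iflag=fullblock (GNU dd) makes dd keep reading until each block is full,
# which matters on pipes where a single read can return less than asked for.
take_bytes() {
    local size=$1 bs=65536
    local full=$(( size / bs )) rest=$(( size % bs ))
    if [ "$full" -gt 0 ]; then
        dd bs="$bs" count="$full" iflag=fullblock 2>/dev/null
    fi
    if [ "$rest" -gt 0 ]; then
        dd bs="$rest" count=1 iflag=fullblock 2>/dev/null
    fi
    return 0
}

# Hypothetical helper: read one size per line from file descriptor 3 and
# run the given command once per size, feeding it that many bytes taken
# from the shared stdin.
split_stream() {
    local size
    while IFS= read -r size <&3; do
        take_bytes "$size" | "$@"
    done
}
```

Assumed usage in bash would be `unzip -p file.zip | split_stream ./process 3< <(unzip -lqq file.zip | awk '{ print $1 }')`. Because the sizes come in on descriptor 3, the loop's `read` never competes with dd for the email bytes.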
As shown below, the same process is read repeatedly in the loop, not called each time. There are two different 'i' variables.
for i in {0..10} ; do dd ibs=2 count=2 skip="$i" 2>/dev/null < <(for i in {1..10} ; do echo a"$i"b"$i"c"$i"d"$i"e"$i"f"$i" ; done) ; echo ; done
a1b1
b1c1
c1d1
d1e1
e1f1
f1
a
a2b
2b2c
2c2d
2d2e
2e2f
At this point, a normal person might be tempted to do this with C, but that would be boring...
I curse the day spaces, special and wide characters were allowed in filenames!