For a while now I have wanted more pipe functionality for the manipulation of archives. Sparse files can only be manipulated on disk, not really as the output of programs.
XZ'ing a sparse file is very slow; tarring it first is possible, but with heavy CPU overhead up front. XZ itself contains no error correction, and its checksums are useful only if you are happy to throw away the entire file when a single bit error occurs.
Still, XZ is the most popular format if you have the time to spare.
I had previously written something called "e2zerocat", which was nothing more than a reading of dumpe2fs output and then 'catting' the used blocks while outputting /dev/zero for the unused ones.
The result is a sparsified disk image being sent over a pipe.
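I don't have the e2zerocat source at hand, but the same end result can be sketched without parsing dumpe2fs, by using Linux's SEEK_DATA/SEEK_HOLE lseek whence values (the function name and chunk size here are my own invention):

```python
import os
import sys

def zerocat(path, out=None, chunk=64 * 1024):
    """Copy a file to a stream, writing literal zeroes for the holes.

    On filesystems without hole support, lseek's generic fallback
    treats the whole file as data, so this degrades to a plain cat.
    """
    out = out or sys.stdout.buffer
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        pos = 0
        while pos < size:
            try:
                data_start = os.lseek(fd, pos, os.SEEK_DATA)
            except OSError:                  # no data after pos: tail is a hole
                data_start = size
            out.write(b"\0" * (data_start - pos))   # emit the hole as zeroes
            if data_start >= size:
                break
            hole_start = os.lseek(fd, data_start, os.SEEK_HOLE)
            os.lseek(fd, data_start, os.SEEK_SET)
            left = hole_start - data_start
            while left:                      # emit the data extent verbatim
                buf = os.read(fd, min(chunk, left))
                out.write(buf)
                left -= len(buf)
            pos = hole_start
    finally:
        os.close(fd)
```

The advantage of e2zerocat's dumpe2fs route over this one is that it works on a raw ext filesystem image, not just on files the kernel already knows to be sparse.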
But I really mean to say...
I wanted a way to compress a "sparse file sent over a stream" better than is possible with XZ, without depending on tar. The obvious thing is a simple format that encodes data blocks as data blocks and sparse runs as mere headers. When sending a file over a stream, information is lost, because "real zeroes" become sparse as well.
So it would be better to retain this information from the originating program, but anyway....
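The kind of format I mean could be sketched like this (the record types, header layout, and names are all provisional, not a spec): every record is a small header saying either "hole of N bytes" or "N bytes of data follow".

```python
import struct

HOLE, DATA = 0, 1
HDR = struct.Struct("<BQ")       # 1-byte record type, 8-byte length

def pack(records):
    """records: iterable of (type, payload); a HOLE payload is a byte
    count, a DATA payload is the bytes themselves."""
    out = bytearray()
    for kind, payload in records:
        if kind == HOLE:
            out += HDR.pack(HOLE, payload)              # header only, no body
        else:
            out += HDR.pack(DATA, len(payload)) + payload
    return bytes(out)

def unpack(blob):
    pos, records = 0, []
    while pos < len(blob):
        kind, length = HDR.unpack_from(blob, pos)
        pos += HDR.size
        if kind == HOLE:
            records.append((HOLE, length))
        else:
            records.append((DATA, blob[pos:pos + length]))
            pos += length
    return records
```

A gigabyte-sized hole costs nine bytes on the wire, and a receiver can recreate it with a seek instead of writing zeroes.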
I am wondering if anyone would be interested in commenting on what features a simple sparse-packing tool should have:
- no compression
- does have checksumming
- can have a form of Hamming encoding
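For the checksumming, what I have in mind is per-block rather than whole-file, so a bit error costs you one block instead of the archive. A minimal sketch (CRC32 and the 4-byte trailer layout are just an illustration, not a decided format):

```python
import zlib

def checksum_block(data):
    # append a 4-byte CRC32 trailer; damage is then localized to one
    # block instead of invalidating the whole stream, unlike XZ
    return data + zlib.crc32(data).to_bytes(4, "little")

def verify_block(blob):
    """Return the payload if the trailer matches, else None."""
    data, crc = blob[:-4], int.from_bytes(blob[-4:], "little")
    return data if zlib.crc32(data) == crc else None
```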
Do I...
- ... create hierarchical summary information for groups of blocks?
Without it, a block's position can only be calculated by accumulating the sizes of all the preceding blocks, nothing more.
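What I mean by the summary information is roughly this (the group size and layout are only an illustration): every G blocks, record the cumulative stream offset, so a reader can locate block k by starting from the nearest summary instead of summing everything before it.

```python
def build_index(block_sizes, group=256):
    """Cumulative byte offset at every group boundary."""
    index, offset = [0], 0
    for i, size in enumerate(block_sizes, 1):
        offset += size
        if i % group == 0:
            index.append(offset)
    return index

def locate(block_sizes, index, k, group=256):
    """Byte offset of block k: jump to the nearest summary entry,
    then sum only the tail of at most group-1 block sizes."""
    g = k // group
    return index[g] + sum(block_sizes[g * group:k])
```

The index costs one offset per G blocks, and seeking drops from O(k) additions to O(G).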
I have finished a Hamming encoder for a (255, 247) scheme. The thing works. It creates about 3% growth in file size (8 parity bits per 247 data bits).
It has no headers and is not meant as a standalone thing, but it can take any file and create an encoded version of it.
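The (255, 247) scheme is the classic Hamming construction for m = 8: a 255-bit codeword with parity bits at the eight power-of-two positions, correcting any single bit error per codeword. A minimal bit-level sketch of the idea (not my actual implementation, which works on whole files):

```python
def hamming_encode(bits):
    """247 data bits (0/1 ints) -> 255-bit Hamming codeword."""
    assert len(bits) == 247
    code = [0] * 256                       # 1-indexed; index 0 unused
    it = iter(bits)
    for pos in range(1, 256):
        if pos & (pos - 1):                # not a power of two: data bit
            code[pos] = next(it)
    for p in range(8):                     # parity bits at 1, 2, 4, ..., 128
        mask = 1 << p
        for pos in range(1, 256):
            if (pos & mask) and pos != mask:
                code[mask] ^= code[pos]
    return code[1:]

def hamming_decode(codeword):
    """255-bit codeword -> 247 data bits, repairing one flipped bit."""
    code = [0] + list(codeword)
    syndrome = 0
    for pos in range(1, 256):
        if code[pos]:
            syndrome ^= pos                # XOR of the set bit positions
    if syndrome:                           # nonzero syndrome = error position
        code[syndrome] ^= 1
    return [code[pos] for pos in range(1, 256) if pos & (pos - 1)]
```

The 8/247 overhead is where the "about 3% growth" comes from.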
Here is a screenshot of the visual output of the Hamming encoder:
The leftmost column is the input. The middle column is the encoded data. The green is the error byte. The right is the output of the decoder. The accentuated bytes show a random bit error, which was repaired during decoding.
So I have a library that can do (255, 247) Hamming encoding/decoding, and my sparsepack can easily integrate it to gain some modicum of data safety. In addition to this (small measure of) correction ability, are there any other safeguards that it would be common sense to include in a format designed to survive some bit rot?