tar improvements: Prepare smaller tarballs and obviate the need for --ignore-zeros: set tar --blocking-factor=1; trim EOF by piping tarred chunks to head --bytes -1024; set --sparse #8

@tomkinsc

Description

Context and problem

Naïve concatenation of incremental tarballs made by GNU tar with default settings results in a combined tarball that is larger than it could be. This size bloat occurs when separate tarballs are joined via gcloud storage objects compose or cat if zero padding is not first removed between tarballs, and if the tar block grouping is left at its default size of 20 blocks. Such tarballs can be successfully extracted using GNU tar if --ignore-zeros is specified, but the large size is still problematic, especially when many small tarballs are joined into a single archive.

The tar format standard has evolved over time, but the most commonly used flavor remains an older and very simple one[1]: it consists of a linear series of files concatenated together, where each file is stored as a 512-byte header describing the file, followed by zero or more 512-byte blocks containing the file data itself, padded with binary zeros so that the total file size fills out a whole number of 512-byte blocks. Additional files are added similarly until the tarball is terminated with two 512-byte blocks of zeros signifying the end of the archive. After that the tar archive can be compressed, moved around, etc.
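
To make the layout concrete, here is a minimal sketch (GNU tar and GNU stat assumed; on macOS substitute gtar and stat -f %z) that builds a one-file archive and inspects its block structure:

# a 6-byte file yields 512 (header) + 512 (zero-padded data) + 1024 (EOF) = 2048 bytes
printf 'hello\n' > hello.txt
tar --blocking-factor=1 -cf hello.tar hello.txt
stat -c %s hello.tar         # prints 2048
od -A d -c hello.tar | head  # header (starting with the file name) at offset 0, file data at offset 512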

It's a format that is almost perfect out of the box for simple concatenation of multiple archives into larger ones. A stream of several individually-compressed tarballs can be decompressed as a "single archive" without issue by gzip and other compressors, and incremental backups made by tar over time can be read linearly from an archive and applied in succession to replay changes and restore data state. That all works fine. Unfortunately, since decompressed tarballs in such a stream lack a file manifest or position index, the data boundaries between decompressed tar archives in a stream cannot be determined without fully reading the data. When tar attempts to unpack a tarball or stream made by naïve concatenation of other tarballs, it can successfully extract files from the first archive, but exits when it encounters two successive 512-byte blocks of zeros representing the end of that archive. GNU tar can be forced to continue by setting --ignore-zeros; however, this is problematic for a few reasons:

  1. --ignore-zeros is not universally available across tar implementations[2],
  2. it's difficult to know a priori if a particular tarball is a composite archive that requires --ignore-zeros for full extraction[3]
    a) This can be especially unintuitive to users unfamiliar with tarballs created by naïve concatenation, since extraction without --ignore-zeros causes a "silent" failure mode: it yields the data from the first tarball without continuing on to data from subsequent tarballs (demonstrated in the sketch after this list). The need for --ignore-zeros is also not very discoverable, since most documentation and discussion relating to the parameter concerns recovering data from partial or corrupt archives.
  3. Use of --ignore-zeros can mask true data loss or corruption of archives
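
The silent failure in point 2a is easy to reproduce; a minimal sketch, assuming GNU tar:

echo a > file_a; echo b > file_b
tar -czf a.tar.gz file_a
tar -czf b.tar.gz file_b
cat a.tar.gz b.tar.gz > ab.tar.gz   # concatenated gzip streams decompress as one stream
tar -tzf ab.tar.gz                  # lists only file_a: listing stops at the first EOF marker
tar --ignore-zeros -tzf ab.tar.gz   # lists both file_a and file_b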

If the EOF signal is two 512-byte blocks of zeros, can we simply remove the last 1 kB of a tarball before concatenating it to others? Not in general, since common tar-packing implementations batch 512-byte blocks into groups[4] of 20 to improve write performance on storage devices that are, by current expectations, frustratingly linear in operation (e.g. tapes) or relatively slow to seek (e.g. spinning rust hard drives). What happens if tar wants to write a group of 20 blocks to an archive but does not have enough headers or file data to fill all 20? By default, it fills out the group with more 512-byte blocks of zeros. That means that if the two EOF zero blocks are removed but are preceded by zero blocks that were added so tar could write a full batch of 20, the archive still appears, absent inspection or knowledge of how tar was grouping blocks when they were written, to end early: the EOF has simply moved closer to the start of the archive[5].
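
The padding is easy to observe; a minimal sketch, assuming GNU tar and GNU stat:

printf 'x' > tiny.txt
tar -cf default.tar tiny.txt                  # default --blocking-factor=20 pads the archive to 10240 bytes
tar --blocking-factor=1 -cf bf1.tar tiny.txt  # 512 header + 512 data + 1024 EOF
stat -c %s default.tar bf1.tar                # prints 10240 and 2048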

In addition to causing issues for tarball concatenation and extraction, another problem with having so many blocks of zeros in an archive is that they unnecessarily bloat the size of uncompressed tarballs. This is particularly problematic in scenarios where incremental snapshots capture mostly small files and yield archives containing many zero-padded block groups (in addition to the EOF zero blocks in concatenated tarballs).

Solution

Fortunately, we can address all of the issues above by doing the following:

  • set tar --blocking-factor=1: this ensures blocks are written to a tarball individually and tar's write buffer is not filled out with blocks of zeros.
  • pipe tarred chunks to head --bytes -1024 prior to gzip (or other compression) to trim the EOF blocks that would otherwise accumulate within concatenated tarballs. The last tarball appended is an exception: it should have the EOF blocks, which can either be retained or added by concatenating <(head -c 1024 /dev/zero)
  • set tar --sparse: this consolidates runs of zeros in the input data files, which can reduce file sizes
  • set tar --label="${RUN_BASENAME}": the --label parameter adds a human-readable note to a tarball that is printed during creation or extraction by GNU tar, and we can use it to store some extra info like the sequencing run ID (in case a tarball is renamed) or guidance for extracting the tarball. Since labels are stored in tarballs as the names of placeholder files, they're limited to 99 characters. This is tangential to this issue, but it was relevant when I was looking for an in-band way to alert the tarball consumer of the need to --ignore-zeros. (A combined sketch of the pieces above follows this list.)
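
Putting the pieces together, a minimal sketch of the per-chunk pipeline (RUN_BASENAME and chunk_dir are illustrative placeholders; GNU tar and GNU head assumed):

RUN_BASENAME="run_001"
tar --sparse --blocking-factor=1 --label="${RUN_BASENAME}" -cf - ./chunk_dir \
    | head --bytes -1024 \
    | gzip --best > "chunk_${RUN_BASENAME}.tar.gz"

# after concatenating chunks, terminate the stream with a fresh (gzipped) EOF marker
cat chunk_*.tar.gz <(head -c 1024 /dev/zero | gzip) > combined.tar.gz
tar -xzf combined.tar.gz   # extracts fully, with no need for --ignore-zeros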

A quick and dirty comparison script, and its output:

#!/bin/bash

# select GNU tools based on platform (Homebrew's gnu-tar and coreutils provide g-prefixed binaries on macOS)
# functions are used rather than aliases, since aliases are not expanded in non-interactive shells
if [[ $(uname) == "Darwin" ]]; then
    gtar()  { command gtar  "$@"; }
    ghead() { command ghead "$@"; }
    fsize() { stat -f %z "$@"; }
else
    gtar()  { command tar  "$@"; }
    ghead() { command head "$@"; }
    fsize() { stat -c %s "$@"; }
fi

# if input files do not exist, create them
if [[ ! -f ./input1.txt ]] || \
   [[ ! -f ./input2.txt ]] || \
   [[ ! -f ./input3.txt ]] || \
   [[ ! -f ./input4.txt ]]; then
    for n in {1..4}; do 
        echo "${n} $(date +'%Y-%m-%d_%H%M%S')" > input${n}.txt && sleep 2;
    done
fi

# ============

# Scenario 1: proper tarball made from multiple input files
#   all input data exists at the same time and can be added to a single tarball all at once
gtar --sparse -cvf - ./input{1..4}.txt > combined_at_start.tar

# ============

# Scenario 2: naively-joined tarballs made from several independent tarballs
#   this creates two tarballs independently that can be concatenated together,
#   but the result will be larger than necessary, and will fail to extract fully 
#   unless --ignore-zeros is passed to GNU tar
gtar --sparse -cvf - ./input{1,2}.txt > out1.tar
gtar --sparse -cvf - ./input{3,4}.txt > out2.tar
cat out1.tar out2.tar > combined_naively.tar

# ============

# Scenario 3: proper tarball from multiple input tarballs, joined by tar itself with --concatenate
#   this relies on tar fully reading the input tarballs to find and remove
#   end-of-file blocks and the extra blocks of zeros that were added to allow
#   the writing of block *groups* of consistent size. Note that --concatenate appears
#   to only append to an existing tarball file, hence the cp below.
cp out1.tar combined_using_tar_native_concat.tar
gtar --sparse --concatenate -f combined_using_tar_native_concat.tar out2.tar

# ============

# Scenario 4: properly prepare tarballs for concatenation via `cat` or `gcloud storage objects compose`
#   this configures block group size to be 1 to avoid adding extra blocks of zeros after files
#   each tarball is then also trimmed of the two 512-byte blocks of zeros that indicate EOF
gtar --sparse --blocking-factor=1 -cvf - ./input{1,2}.txt | ghead --bytes -1024 > out1_slimmed_and_eof_trimmed.tar
gtar --sparse --blocking-factor=1 -cvf - ./input{3,4}.txt | ghead --bytes -1024 > out2_slimmed_and_eof_trimmed.tar
cat out1_slimmed_and_eof_trimmed.tar out2_slimmed_and_eof_trimmed.tar <(head -c 1024 /dev/zero) > combined_with_cat_after_proper_prep.tar

# ============

# gather sizes for all combined tarballs, before and after compression
printf "tarball_name\tsize_before_gzip\tsize_after_gzip\textracts_with_default_params\textracts_with_ignore_zeros\n" > file_sizes_and_tar_unpacking_success
for tarball in combined_at_start.tar combined_naively.tar combined_using_tar_native_concat.tar combined_with_cat_after_proper_prep.tar; do
    gzip --to-stdout --force --best -k ${tarball} > ${tarball}.gz

    # check whether each tarball lists all four files (names are transformed to their digits and compared against "1234")
    extracts_with_plain_gtar="$([[ $(echo $(gtar --list --xform 's/\.\/input([0-9]+)\.txt/\1/gx' --show-transformed-names --file ${tarball}) | sed -e 's/[[:space:]]*//g') == "1234" ]] && echo 'yes' || echo 'no')"
    extracts_with_gtar_ignore_zeros="$([[ $(echo $(gtar --ignore-zeros --list --xform 's/\.\/input([0-9]+)\.txt/\1/gx' --show-transformed-names --file ${tarball}) | sed -e 's/[[:space:]]*//g') == "1234" ]] && echo 'yes' || echo 'no')"

    printf "$(basename ${tarball} .tar)\t$(stat -f %z ${tarball})\t$(stat -f %z ${tarball}.gz)\t${extracts_with_plain_gtar}\t${extracts_with_gtar_ignore_zeros}\n" >> file_sizes_and_tar_upacking_success
done

# ============

# if 'column' is available, print a table of the results. Otherwise, just cat
if command -v column > /dev/null; then
    column -t <(sort -k3,3 -g file_sizes_and_tar_unpacking_success)
else
    cat file_sizes_and_tar_unpacking_success
fi

tarball_name                         size_before_gzip  size_after_gzip  extracts_with_default_params  extracts_with_ignore_zeros
combined_at_start                    10240             217              yes                           yes
combined_naively                     20480             238              no                            yes
combined_using_tar_native_concat     20480             246              yes                           yes
combined_with_cat_after_proper_prep  5120              223              yes                           yes
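
As a sanity check that the properly prepared archive no longer needs GNU-specific flags, a hedged sketch using BSD tar (which lacks --ignore-zeros[2]; available as bsdtar on macOS and via libarchive elsewhere):

bsdtar -tf combined_with_cat_after_proper_prep.tar   # lists all four files
bsdtar -tf combined_naively.tar                      # stops after the first two files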

Footnotes

  1. The tar header and format are described in full in the POSIX standard of 1990, IEEE 1003.1-1990, pp. 169-173. Its design and limitations stem from the linear storage of data on archival magnetic tapes without the modern conveniences of file manifests, rapid random seeking, or data location indices.

  2. Notably, the BSD variant of tar that ships with macOS does not support --ignore-zeros, though GNU tar is available for macOS from Homebrew in the gnu-tar package, which adds GNU tar to the PATH as "gtar".
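
     For reference, the Homebrew installs assumed by the comparison script above:

     brew install gnu-tar    # provides GNU tar as "gtar"
     brew install coreutils  # provides GNU head as "ghead"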

  3. Until now, the solution implemented in this repo has been to add a readme file alongside concatenated tarballs noting the need to tell tar to --ignore-zeros.

  4. GNU tar calls these groups of blocks "records," and makes the group size configurable through the cryptically named parameter --blocking-factor.

  5. The block group size of an existing archive can sometimes be determined using GNU tar, and printed along with a list of block start positions for files within an archive, e.g.: tar --checkpoint=1 --checkpoint-action=exec='echo block group size: $TAR_BLOCKING_FACTOR' --totals --block-number --verbose --list --file ${my_archive_tar}, though YMMV for tarballs adhering to newer format standards, or for archives containing a variety of different block group sizes.
