tar improvements: Prepare smaller tarballs and obviate the need for --ignore-zeros: set tar --blocking-factor=1; trim EOF by piping tarred chunks to head --bytes -1024; set --sparse #8

@tomkinsc

Description

Context and problem

Naïve concatenation of incremental tarballs made by GNU tar with default settings results in a combined tarball that is larger than it could be. This size bloat occurs when separate tarballs are joined via gcloud storage objects compose or cat if zero padding is not first removed between tarballs, and if the tar block grouping is left at its default size of 20 blocks. Such tarballs can be successfully extracted using GNU tar if --ignore-zeros is specified, but the large size is still problematic, especially when many small tarballs are joined into a single archive.

The tar format standard has evolved over time, but the most commonly used flavor remains an older and very simple one[1]: it consists of a linear series of files concatenated together, where each file is stored as a 512-byte header describing the file, followed by zero or more 512-byte blocks containing the file data itself, padded with binary zeros so that the total file size fills out a whole number of 512-byte blocks. Additional files are added similarly until the tarball is terminated with two 512-byte blocks of zeros signifying the end of the archive. After that the tar archive can be compressed, moved around, etc.
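
To make the layout concrete, here is a minimal sketch (GNU tar and GNU stat assumed; on macOS substitute gtar and stat -f %z) that builds a one-file archive and inspects its block structure:

# a 6-byte file yields 512 (header) + 512 (zero-padded data) + 1024 (EOF) = 2048 bytes
printf 'hello\n' > hello.txt
tar --blocking-factor=1 -cf hello.tar hello.txt
stat -c %s hello.tar         # prints 2048
od -A d -c hello.tar | head  # header (starting with the file name) at offset 0, file data at offset 512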

It's a format that is almost perfect out of the box for simple concatenation of multiple archives into larger ones. A stream of several individually-compressed tarballs can be decompressed as a "single archive" without issue by gzip and other compressors, and incremental backups made by tar over time can be read linearly from an archive and applied in succession to replay changes and restore data state. That all works fine. Unfortunately, since decompressed tarballs in such a stream lack a file manifest or position index, the data boundaries between decompressed tar archives in a stream cannot be determined without fully reading the data. When tar attempts to unpack a tarball or stream made by naïve concatenation of other tarballs, it can successfully extract files from the first archive, but exits when it encounters two successive 512-byte blocks of zeros representing the end of that archive. GNU tar can be forced to continue by setting --ignore-zeros; however, this is problematic for a few reasons:

  1. --ignore-zeros is not universally available across tar implementations[2],
  2. it's difficult to know a priori if a particular tarball is a composite archive that requires --ignore-zeros for full extraction[3]
    a) This can be especially unintuitive to users unfamiliar with tarballs created by naïve concatenation, since extraction without --ignore-zeros causes a "silent" failure mode: it yields the data from the first tarball without continuing on to data from subsequent tarballs (demonstrated in the sketch after this list). The need for --ignore-zeros is also not very discoverable, since most documentation and discussion relating to the parameter concerns recovering data from partial or corrupt archives.
  3. Use of --ignore-zeros can mask true data loss or corruption of archives
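
The silent failure in point 2a is easy to reproduce; a minimal sketch, assuming GNU tar:

echo a > file_a; echo b > file_b
tar -czf a.tar.gz file_a
tar -czf b.tar.gz file_b
cat a.tar.gz b.tar.gz > ab.tar.gz   # concatenated gzip streams decompress as one stream
tar -tzf ab.tar.gz                  # lists only file_a: listing stops at the first EOF marker
tar --ignore-zeros -tzf ab.tar.gz   # lists both file_a and file_b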

If the EOF signal is two 512-byte blocks of zeros, can we simply remove the last 1 kB of a tarball before concatenating it to others? Not in general, since common tar-packing implementations batch 512-byte blocks into groups[4] of 20 to improve write performance on storage devices that are, by current expectations, frustratingly linear in operation (e.g. tapes) or relatively slow to seek (e.g. spinning rust hard drives). What happens if tar wants to write a group of 20 blocks to an archive but does not have enough headers or file data to fill all 20? By default, it fills out the group with more 512-byte blocks of zeros. That means that if the two EOF zero blocks are removed but are preceded by zero blocks that were added so tar could write a full batch of 20, the archive still appears, absent inspection or knowledge of how tar was grouping blocks when they were written, to end early: the EOF has simply moved closer to the start of the archive[5].
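
The padding is easy to observe; a minimal sketch, assuming GNU tar and GNU stat:

printf 'x' > tiny.txt
tar -cf default.tar tiny.txt                  # default --blocking-factor=20 pads the archive to 10240 bytes
tar --blocking-factor=1 -cf bf1.tar tiny.txt  # 512 header + 512 data + 1024 EOF
stat -c %s default.tar bf1.tar                # prints 10240 and 2048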

In addition to causing issues for tarball concatenation and extraction, another problem with having so many blocks of zeros in an archive is that they unnecessarily bloat the size of uncompressed tarballs. This is particularly problematic in scenarios where incremental snapshots capture mostly small files and yield archives containing many zero-padded block groups (in addition to the EOF zero blocks in concatenated tarballs).

Solution

Fortunately, we can address all of the issues above by doing the following:

  • set tar --blocking-factor=1: this ensures blocks are written to a tarball individually and tar's write buffer is not filled out with blocks of zeros.
  • pipe tarred chunks to head --bytes -1024 prior to gzip (or other compression) to trim the EOF blocks that would otherwise accumulate within concatenated tarballs. The last tarball appended is an exception: it should have the EOF blocks, which can either be retained or added by concatenating <(head -c 1024 /dev/zero)
  • set tar --sparse: this consolidates runs of zeros in the input data files, which can reduce file sizes
  • set tar --label="${RUN_BASENAME}": the --label parameter adds a human-readable note to a tarball that is printed during creation or extraction by GNU tar, and we can use it to store some extra info like the sequencing run ID (in case a tarball is renamed) or guidance for extracting the tarball. Since labels are stored in tarballs as the names of placeholder files, they're limited to 99 characters. This is tangential to this issue, but it was relevant when I was looking for an in-band way to alert the tarball consumer of the need to --ignore-zeros. (A combined sketch of the pieces above follows this list.)
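
Putting the pieces together, a minimal sketch of the per-chunk pipeline (RUN_BASENAME and chunk_dir are illustrative placeholders; GNU tar and GNU head assumed):

RUN_BASENAME="run_001"
tar --sparse --blocking-factor=1 --label="${RUN_BASENAME}" -cf - ./chunk_dir \
    | head --bytes -1024 \
    | gzip --best > "chunk_${RUN_BASENAME}.tar.gz"

# after concatenating chunks, terminate the stream with a fresh (gzipped) EOF marker
cat chunk_*.tar.gz <(head -c 1024 /dev/zero | gzip) > combined.tar.gz
tar -xzf combined.tar.gz   # extracts fully, with no need for --ignore-zeros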

A quick and dirty comparison script, and its output:

#!/bin/bash

# select GNU tools based on platform (Homebrew's gnu-tar and coreutils provide g-prefixed binaries on macOS)
# functions are used rather than aliases, since aliases are not expanded in non-interactive shells
if [[ $(uname) == "Darwin" ]]; then
    gtar()  { command gtar  "$@"; }
    ghead() { command ghead "$@"; }
    fsize() { stat -f %z "$@"; }
else
    gtar()  { command tar  "$@"; }
    ghead() { command head "$@"; }
    fsize() { stat -c %s "$@"; }
fi

# if input files do not exist, create them
if [[ ! -f ./input1.txt ]] || \
   [[ ! -f ./input2.txt ]] || \
   [[ ! -f ./input3.txt ]] || \
   [[ ! -f ./input4.txt ]]; then
    for n in {1..4}; do 
        echo "${n} $(date +'%Y-%m-%d_%H%M%S')" > input${n}.txt && sleep 2;
    done
fi

# ============

# Scenario 1: proper tarball made from multiple input files
#   all input data exists at the same time and can be added to a single tarball all at once
gtar --sparse -cvf - ./input{1..4}.txt > combined_at_start.tar

# ============

# Scenario 2: naively-joined tarballs made from several independent tarballs
#   this creates two tarballs independently that can be concatenated together,
#   but the result will be larger than necessary, and will fail to extract fully 
#   unless --ignore-zeros is passed to GNU tar
gtar --sparse -cvf - ./input{1,2}.txt > out1.tar
gtar --sparse -cvf - ./input{3,4}.txt > out2.tar
cat out1.tar out2.tar > combined_naively.tar

# ============

# Scenario 3: proper tarball from multiple input tarballs, joined by tar itself with --concatenate
#   this relies on tar fully reading the input tarballs to find and remove
#   end-of-file blocks and the extra blocks of zeros that were added to allow
#   the writing of block *groups* of consistent size. Note that --concatenate appears
#   to only append to an existing tarball file, hence the cp below.
cp out1.tar combined_using_tar_native_concat.tar
gtar --sparse --concatenate -f combined_using_tar_native_concat.tar out2.tar

# ============

# Scenario 4: properly prepare tarballs for concatenation via `cat` or `gcloud storage objects compose`
#   this configures block group size to be 1 to avoid adding extra blocks of zeros after files
#   each tarball is then also trimmed of the two 512-byte blocks of zeros that indicate EOF
gtar --sparse --blocking-factor=1 -cvf - ./input{1,2}.txt | ghead --bytes -1024 > out1_slimmed_and_eof_trimmed.tar
gtar --sparse --blocking-factor=1 -cvf - ./input{3,4}.txt | ghead --bytes -1024 > out2_slimmed_and_eof_trimmed.tar
cat out1_slimmed_and_eof_trimmed.tar out2_slimmed_and_eof_trimmed.tar <(head -c 1024 /dev/zero) > combined_with_cat_after_proper_prep.tar

# ============

# gather sizes for all combined tarballs, before and after compression
printf "tarball_name\tsize_before_gzip\tsize_after_gzip\textracts_with_default_params\textracts_with_ignore_zeros\n" > file_sizes_and_tar_unpacking_success
for tarball in combined_at_start.tar combined_naively.tar combined_using_tar_native_concat.tar combined_with_cat_after_proper_prep.tar; do
    gzip --to-stdout --force --best -k ${tarball} > ${tarball}.gz

    # check whether each tarball lists all four files (names are transformed to their digits and compared against "1234")
    extracts_with_plain_gtar="$([[ $(echo $(gtar --list --xform 's/\.\/input([0-9]+)\.txt/\1/gx' --show-transformed-names --file ${tarball}) | sed -e 's/[[:space:]]*//g') == "1234" ]] && echo 'yes' || echo 'no')"
    extracts_with_gtar_ignore_zeros="$([[ $(echo $(gtar --ignore-zeros --list --xform 's/\.\/input([0-9]+)\.txt/\1/gx' --show-transformed-names --file ${tarball}) | sed -e 's/[[:space:]]*//g') == "1234" ]] && echo 'yes' || echo 'no')"

    printf "$(basename ${tarball} .tar)\t$(stat -f %z ${tarball})\t$(stat -f %z ${tarball}.gz)\t${extracts_with_plain_gtar}\t${extracts_with_gtar_ignore_zeros}\n" >> file_sizes_and_tar_upacking_success
done

# ============

# if 'column' is available, print a table of the results. Otherwise, just cat
if command -v column > /dev/null; then
    column -t <(sort -k3,3 -g file_sizes_and_tar_unpacking_success)
else
    cat file_sizes_and_tar_unpacking_success
fi

tarball_name                         size_before_gzip  size_after_gzip  extracts_with_default_params  extracts_with_ignore_zeros
combined_at_start                    10240             217              yes                           yes
combined_naively                     20480             238              no                            yes
combined_using_tar_native_concat     20480             246              yes                           yes
combined_with_cat_after_proper_prep  5120              223              yes                           yes
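
As a sanity check that the properly prepared archive no longer needs GNU-specific flags, a hedged sketch using BSD tar (which lacks --ignore-zeros[2]; available as bsdtar on macOS and via libarchive elsewhere):

bsdtar -tf combined_with_cat_after_proper_prep.tar   # lists all four files
bsdtar -tf combined_naively.tar                      # stops after the first two files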

Footnotes

  1. The tar header and format are described in full in the POSIX standard of 1990, IEEE 1003.1-1990, pp. 169-173. Its design and limitations stem from the linear storage of data on archival magnetic tapes without the modern conveniences of file manifests, rapid random seeking, or data location indices.

  2. Notably, the BSD variant of tar that ships with macOS does not support --ignore-zeros, though GNU tar is available for macOS from Homebrew in the gnu-tar package, which adds GNU tar to the PATH as "gtar".
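
     For reference, the Homebrew installs assumed by the comparison script above:

     brew install gnu-tar    # provides GNU tar as "gtar"
     brew install coreutils  # provides GNU head as "ghead"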

  3. Until now, the solution implemented in this repo has been to add a readme file alongside concatenated tarballs noting the need to tell tar to --ignore-zeros.

  4. GNU tar calls these groups of blocks "records," and makes the group size configurable through the cryptically named parameter --blocking-factor.

  5. The block group size of an existing archive can sometimes be determined using GNU tar, and printed along with a list of block start positions for files within an archive, e.g.: tar --checkpoint=1 --checkpoint-action=exec='echo block group size: $TAR_BLOCKING_FACTOR' --totals --block-number --verbose --list --file ${my_archive_tar}, though YMMV for tarballs adhering to newer format standards, or for archives containing a variety of different block group sizes.
