## Context and problem
Naïve concatenation of incremental tarballs made by GNU tar with default settings results in a combined tarball that is larger than it could be. This size bloat occurs when separate tarballs are joined via `gcloud storage objects compose` or `cat` if zero padding is not first removed between tarballs, and if the tar block grouping is left at its default size of 20 blocks. Such tarballs can be successfully extracted using GNU tar if `--ignore-zeros` is specified, but the large size is still problematic, especially when many small tarballs are joined into a single archive.
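For reference, server-side composition in Google Cloud Storage looks like the following (the bucket and object names here are hypothetical):

```bash
# byte-for-byte concatenation of already-uploaded tarball chunks into one object
gcloud storage objects compose \
  gs://my-bucket/chunks/chunk_001.tar gs://my-bucket/chunks/chunk_002.tar \
  gs://my-bucket/combined.tar
```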
The tar format standard has evolved over time, but the most commonly used flavor remains an older and very simple one[^1]: it consists of a linear series of files concatenated together, where each file is stored as a 512-byte header describing the file, followed by zero or more 512-byte blocks containing the file data itself, padded with binary zeros so that the total file size fills out a whole number of 512-byte blocks. Additional files are added similarly until the tarball is terminated with two 512-byte blocks of zeros signifying the end of the archive. After that, the tar archive can be compressed, moved around, etc.
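That layout is easy to poke at with coreutils (a sketch; `gtar` is GNU tar, the file name is arbitrary, and `--blocking-factor=1`, explained below, keeps the default padding out of the way):

```bash
printf 'hello\n' > hello.txt
gtar --blocking-factor=1 -cf hello.tar hello.txt
wc -c hello.tar                                   # 2048 = 512 header + 512 data + 1024 EOF zeros
head -c 512 hello.tar | od -c | head -n 5         # header block: file name, mode, "ustar" magic, ...
tail -c 1024 hello.tar | od -An -tx1 | sort -u    # the final two blocks are all zeros
```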
It's a format that is almost perfect out of the box for simple concatenation of multiple archives into larger ones. A stream of several individually-compressed tarballs can be decompressed as a "single archive" without issue by gzip and other compressors, and incremental backups made by tar over time can be read linearly from an archive and applied in succession to replay changes and restore data state. That all works fine. Unfortunately, since decompressed tarballs in such a stream lack a file manifest or position index, the data boundaries between tar archives in the stream cannot be determined without fully reading the data. When tar attempts to unpack a tarball or stream made by naïve concatenation of other tarballs, it successfully extracts files from the first archive, but exits when it encounters two successive 512-byte blocks of zeros representing the end of that archive. GNU tar can be forced to continue by setting `--ignore-zeros`, however this is problematic for a few reasons:
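Both behaviors are easy to demonstrate with gzip and GNU tar (a sketch; file names are arbitrary):

```bash
echo a > a.txt; echo b > b.txt
gtar -cf - a.txt | gzip > a.tar.gz
gtar -cf - b.txt | gzip > b.tar.gz
cat a.tar.gz b.tar.gz > both.tar.gz
gzip -dc both.tar.gz | wc -c                       # gzip decompresses both members as one stream
gzip -dc both.tar.gz | gtar --list                 # lists only a.txt, then stops at the first EOF zeros
gzip -dc both.tar.gz | gtar --list --ignore-zeros  # lists a.txt AND b.txt
```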
1. `--ignore-zeros` is not universally available across tar implementations[^2]
2. it's difficult to know a priori whether a particular tarball is a composite archive that requires `--ignore-zeros` for full extraction[^3]
   - This can be especially unintuitive to users unfamiliar with tarballs created by naïve concatenation, since extraction without `--ignore-zeros` "silently" fails: it yields some data from the first tarball without continuing to data from subsequent tarballs. The need for `--ignore-zeros` is also not very discoverable, since most documentation and discussion of the parameter concerns recovering data from partial or corrupt archives.
3. use of `--ignore-zeros` can mask true data loss or corruption in archives
If the EOF signal is two 512-byte blocks of zeros, can we simply remove the last 1 kB of a tarball before concatenating it to others? Not in general, since common tar-packing implementations batch 512-byte blocks into groups[^4] of 20 to improve write performance on storage devices that are, by current expectations, frustratingly linear in operation (i.e. tapes) or relatively slow to seek (i.e. spinning rust hard drives). What happens if tar wants to write a group of 20 blocks to an archive but does not have enough headers or file data to fill all 20? By default, it fills out the group with (more) 512-byte blocks of zeros. That means that if the two EOF zero blocks are removed but they are preceded by zero blocks that were added so tar could write a full batch of 20, it effectively appears—in the absence of inspection or knowledge of how tar was grouping blocks when they were written—as if the EOF has simply moved closer to the start of the archive[^5].
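The padding overhead is easy to measure with a tiny input file (a sketch; sizes assume GNU tar defaults):

```bash
printf '1\n' > tiny.txt
gtar -cf - tiny.txt | wc -c                       # 10240: one 20-block group, mostly zero padding
gtar --blocking-factor=1 -cf - tiny.txt | wc -c   # 2048: header + data block + two EOF zero blocks
```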
In addition to causing issues for tarball concatenation and extraction, all of those blocks of zeros unnecessarily bloat the size of non-compressed tarballs. This is particularly problematic in scenarios where incremental snapshots capture mostly small files and yield archives containing many zero-padded block groups (in addition to the EOF zero blocks in concatenated tarballs).
## Solution
Fortunately, we can address all of the issues above by doing the following (a combined sketch follows this list):

- set tar `--blocking-factor=1`: this ensures blocks are written to a tarball individually, so tar's write buffer is not filled out with blocks of zeros
- pipe tarred chunks to `head --bytes -1024` prior to gzip (or other compression) to trim the EOF blocks from tarballs destined for concatenation. The last tarball appended is an exception: its EOF blocks should be present, either retained or added by concatenating `<(head -c 1024 /dev/zero)`
- set tar `--sparse`: this consolidates runs of zeros in the input data files, which can reduce file sizes
- set tar `--label="${RUN_BASENAME}"`: the `--label` parameter adds a human-readable note to a tarball that is printed during creation or extraction by GNU tar, and we can use it to store extra info like the sequencing run ID (in case a tarball is renamed) or guidance for extracting the tarball. Since labels are stored in tarballs as the names of placeholder files, they're limited to 99 characters. This is tangential to this issue, but it was relevant when I was looking for an in-band way to alert the tarball consumer of the need for `--ignore-zeros`.
A quick and dirty comparison script, and its output:
```bash
#!/bin/bash
# select GNU tar/head binaries and a file-size helper based on platform (Darwin or Linux)
shopt -s expand_aliases   # allow aliases to expand in this non-interactive shell
if [[ $(uname) == "Darwin" ]]; then
    alias gtar=gtar
    alias ghead=ghead
    statsize() { stat -f %z "$1"; }   # BSD stat
else
    alias gtar=tar
    alias ghead=head
    statsize() { stat -c %s "$1"; }   # GNU stat
fi
# if input files do not exist, create them
if [[ ! -f ./input1.txt ]] || \
[[ ! -f ./input2.txt ]] || \
[[ ! -f ./input3.txt ]] || \
[[ ! -f ./input4.txt ]]; then
for n in {1..4}; do
echo "${n} $(date +'%Y-%m-%d_%H%M%S')" > input${n}.txt && sleep 2;
done
fi
# ============
# Scenario 1: proper tarball made from multiple input files
# all input data exists at the same time and can be added to a single tarball all at once
gtar --sparse -cvf - ./input{1..4}.txt > combined_at_start.tar
# ============
# Scenario 2: naively-joined tarballs made from several independent tarballs
# this creates two tarballs independently that can be concatenated together,
# but the result will be larger than necessary, and will fail to extract fully
# unless --ignore-zeros is passed to GNU tar
gtar --sparse -cvf - ./input{1,2}.txt > out1.tar
gtar --sparse -cvf - ./input{3,4}.txt > out2.tar
cat out1.tar out2.tar > combined_naively.tar
# ============
# Scenario 3: proper tarball from multiple input tarballs, joined by tar itself with --concatenate
# this relies on tar fully reading the input tarballs to find and remove
# end-of-file blocks and extra blocks of zeros that were added to allow
# the writing of block *groups* of consistent size. It can also only (?) add to an existing tarball
cp out1.tar combined_using_tar_native_concat.tar
gtar --sparse --concatenate -f combined_using_tar_native_concat.tar out2.tar
# ============
# Scenario 4: properly prepare tarballs for concatenation via `cat` or `gcloud storage objects compose`
# this configures block group size to be 1 to avoid adding extra blocks of zeros after files
# each tarball is then also trimmed of the two 512-byte blocks of zeros that indicate EOF
gtar --sparse --blocking-factor=1 -cvf - ./input{1,2}.txt | ghead --bytes -1024 > out1_slimmed_and_eof_trimmed.tar
gtar --sparse --blocking-factor=1 -cvf - ./input{3,4}.txt | ghead --bytes -1024 > out2_slimmed_and_eof_trimmed.tar
cat out1_slimmed_and_eof_trimmed.tar out2_slimmed_and_eof_trimmed.tar <(head -c 1024 /dev/zero) > combined_with_cat_after_proper_prep.tar
# ============
# gather sizes for all combined tarballs, before and after compression
touch file_sizes_and_tar_unpacking_success
printf "tarball_name\tsize_before_gzip\tsize_after_gzip\textracts_with_default_params\textracts_with_ignore_zeros\n" > file_sizes_and_tar_unpacking_success
for tarball in combined_at_start.tar combined_naively.tar combined_using_tar_native_concat.tar combined_with_cat_after_proper_prep.tar; do
gzip --to-stdout --force --best -k ${tarball} > ${tarball}.gz
# check if tarball can list all files
extracts_with_plain_gtar="$([[ $(echo $(gtar --list --xform 's/\.\/input([0-9]+)\.txt/\1/gx' --show-transformed-names --file ${tarball}) | sed -e 's/[[:space:]]*//g') == "1234" ]] && echo 'yes' || echo 'no')"
extracts_with_gtar_ignore_zeros="$([[ $(echo $(gtar --ignore-zeros --list --xform 's/\.\/input([0-9]+)\.txt/\1/gx' --show-transformed-names --file ${tarball}) | sed -e 's/[[:space:]]*//g') == "1234" ]] && echo 'yes' || echo 'no')"
printf "$(basename ${tarball} .tar)\t$(stat -f %z ${tarball})\t$(stat -f %z ${tarball}.gz)\t${extracts_with_plain_gtar}\t${extracts_with_gtar_ignore_zeros}\n" >> file_sizes_and_tar_upacking_success
done
# ============
# if 'column' is available, print a table of the results. Otherwise, just cat
if [[ $(command -v column) ]]; then
    column -t <(sort -k3,3 -g file_sizes_and_tar_unpacking_success)
else
    cat file_sizes_and_tar_unpacking_success
fi
```

| tarball_name | size_before_gzip | size_after_gzip | extracts_with_default_params | extracts_with_ignore_zeros |
|---|---|---|---|---|
| combined_at_start | 10240 | 217 | yes | yes |
| combined_naively | 20480 | 238 | no | yes |
| combined_using_tar_native_concat | 20480 | 246 | yes | yes |
| combined_with_cat_after_proper_prep | 5120 | 223 | yes | yes |
## Footnotes

[^1]: The tar header and format are described in full in the POSIX standard of 1990, IEEE 1003.1-1990, pp. 169–173. Its design and limitations stem from the linear storage of data on archival magnetic tapes, without the modern conveniences of file manifests, rapid random seeking, or data location indices.

[^2]: Notably, the BSD variant of `tar` that ships with macOS does not support `--ignore-zeros`, though GNU `tar` is available for macOS from Homebrew in the `gnu-tar` package, which adds GNU tar to the `PATH` as `gtar`.

[^3]: Until now, the solution implemented by this repo has been to add a readme file alongside concatenated tarballs noting the need to tell `tar` to `--ignore-zeros`.

[^4]: GNU `tar` calls these groups of blocks "records," and makes the group size configurable through the cryptically-named parameter `--blocking-factor`.

[^5]: The block group size of an existing archive can sometimes be determined using GNU `tar` and printed along with a list of block start positions for files within an archive, e.g.: `tar --checkpoint=1 --checkpoint-action=exec='echo block group size: $TAR_BLOCKING_FACTOR' --totals --block-number --verbose --list --file ${my_archive_tar}`, though YMMV for tarballs adhering to newer format standards, or for archives containing a variety of different block group sizes.