A backup engineer once told a roomful of operations staff: "The first time you discover your archive is corrupt is the day you need to restore it." Compression algorithms are deterministic. Storage media are not. Network transfers are not. Human operators are not. Between the source data and the eventual restore, dozens of opportunities exist for a flipped bit, a truncated stream, or a substituted file to corrupt the archive. A pipeline that does not actively defend integrity is a pipeline that quietly accumulates risk.

This article walks through what integrity actually means for compressed files, what protections each common format provides natively, and the verification and signing layers that turn "compressed" into "compressed and trustworthy." The focus is on tools every operations team can deploy: zip, tar, gzip, zstd, sha256sum, and PGP, with notes on where Sigstore and code-signing certificates fit in.

The three failure modes a compression pipeline must address

Integrity failures in compressed data come in three flavors, each with a different defense.

Accidental corruption happens when storage flips a bit, a partial download truncates a file, or a buggy filesystem driver writes garbage. CRC-32 and similar checksums embedded in archive formats catch most of these with high probability.

Substitution happens when an attacker with write access replaces a file in the archive and recomputes the CRC so verification still passes. CRC-32 cannot detect this. Cryptographic hashes (SHA-256) can, but only if the hash itself is trusted.

Provenance forgery happens when an attacker creates an entirely new archive and presents it as yours. Hashes do not help here because the attacker controls both the archive and the hash. Digital signatures (PGP, Sigstore, code-signing certificates) are the defense.

A robust pipeline addresses all three. A typical pipeline addresses only the first, relying on the format's embedded CRC-32 and hoping nobody is hostile.
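
The first two failure modes can be sketched in a few lines, assuming GNU coreutils and gzip are installed. The file names and contents are hypothetical. The script truncates a gzip archive (accidental corruption, caught by the embedded check) and then swaps in a freshly built archive (substitution, which passes the embedded check and is caught only by a hash recorded beforehand).

```shell
#!/usr/bin/env bash
set -u

demo_dir=$(mktemp -d)   # scratch directory for the demo
printf 'quarterly ledger, version 1\n' > "$demo_dir/ledger.txt"
gzip -c "$demo_dir/ledger.txt" > "$demo_dir/ledger.gz"

# Hash recorded by the producer and kept where an attacker cannot write.
trusted_hash=$(sha256sum < "$demo_dir/ledger.gz" | cut -d' ' -f1)

# Failure mode 1: accidental corruption (here the last 4 bytes are lost).
size=$(wc -c < "$demo_dir/ledger.gz")
head -c "$((size - 4))" "$demo_dir/ledger.gz" > "$demo_dir/truncated.gz"
if gzip -t "$demo_dir/truncated.gz" 2>/dev/null; then
  truncation_verdict="missed"
else
  truncation_verdict="detected"
fi

# Failure mode 2: substitution. The attacker builds a new, internally
# consistent archive, so the embedded CRC-32 is valid by construction.
printf 'quarterly ledger, version 1 (altered)\n' > "$demo_dir/forged.txt"
gzip -c "$demo_dir/forged.txt" > "$demo_dir/substituted.gz"
if gzip -t "$demo_dir/substituted.gz" 2>/dev/null; then
  crc_verdict="passes"
else
  crc_verdict="fails"
fi
substituted_hash=$(sha256sum < "$demo_dir/substituted.gz" | cut -d' ' -f1)
if [ "$substituted_hash" = "$trusted_hash" ]; then
  hash_verdict="match"
else
  hash_verdict="mismatch"
fi

echo "truncated archive:   embedded check $truncation_verdict the damage"
echo "substituted archive: embedded check $crc_verdict, trusted SHA-256 $hash_verdict"
```

Provenance forgery is not shown because it needs a signing key; the hash comparison only helps here because trusted_hash was recorded before the attacker acted.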

"Civilization advances by extending the number of important operations which we can perform without thinking about them." Alfred North Whitehead

For most teams, integrity verification should be automatic. The operations staff should not have to remember to run sha256sum after every transfer; the pipeline should refuse to use unverified data.
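
One way to make refusal the default is to route every consumer of an archive through a small gate. The function below is a minimal sketch, assuming sha256sum is available; the name run_if_verified and its argument order are hypothetical, and the expected hash is assumed to come from a manifest you already trust.

```shell
#!/usr/bin/env bash
set -u

# run_if_verified FILE EXPECTED_SHA256 COMMAND [ARGS...]
# Runs COMMAND [ARGS...] FILE only if FILE hashes to EXPECTED_SHA256.
run_if_verified() {
  local file="$1" expected="$2" actual
  shift 2
  [ -r "$file" ] || { echo "REFUSED: cannot read $file" >&2; return 2; }
  actual=$(sha256sum < "$file" | cut -d' ' -f1)
  if [ "$actual" != "$expected" ]; then
    echo "REFUSED: $file does not match the expected SHA-256" >&2
    return 1
  fi
  "$@" "$file"
}

# Example (hypothetical file and variable):
#   run_if_verified project.tar.zst "$expected_hash" zstd -t
```

Because the command only ever runs after the comparison succeeds, forgetting to verify is no longer possible for anything that goes through the gate.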

What native protections each format provides

| Format | Embedded checksum | Cryptographic protection | Solid mode | Test mode |
|---|---|---|---|---|
| zip (deflate) | CRC-32 per file | Optional AES-256 if encrypted | No | unzip -t |
| zip64 | CRC-32 per file | Optional AES-256 | No | unzip -t |
| 7z | CRC-32 per file, SHA-256 optional | AES-256 with header encryption | Yes | 7z t |
| gzip | CRC-32 of stream | None | Single-stream only | gunzip -t |
| bzip2 | CRC-32 per block | None | Single-stream only | bunzip2 -t |
| xz | CRC-32, CRC-64, or SHA-256 | None | Single-stream | xz -t |
| zstd | XXH64 per frame (low 32 bits stored) | None | Single-stream | zstd -t |
| tar | Header checksum only, none for file data | None | Yes | tar -tf (lists entries, no data check) |
| lz4 | XXH32 of frame content, optional per block | None | Single-stream | lz4 -t |

Notice that tar by itself has no integrity protection for file contents; its only checksum covers each entry's header. The CRC checks on a tar.gz come from the gzip layer, not from tar. A truncated tar.gz fails gzip's check; a corrupted tar inside a valid gzip wrapper might silently produce garbage on extract.
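
The silent-garbage case is easy to reproduce. The sketch below assumes tar, gzip, and dd are installed and forces the ustar format so that file data starts right after the first 512-byte header; the file names are hypothetical. It damages one data byte inside the tar, wraps the damaged tar in a fresh gzip stream, and shows that both tools report success.

```shell
#!/usr/bin/env bash
set -u

work=$(mktemp -d)
mkdir "$work/src" "$work/out"
printf 'original payload\n' > "$work/src/data.txt"
tar --format=ustar -cf "$work/a.tar" -C "$work/src" data.txt

# Overwrite the first data byte (offset 512, just past the entry header).
printf 'X' | dd of="$work/a.tar" bs=1 seek=512 conv=notrunc 2>/dev/null

# Compress the already damaged tar: the gzip CRC now covers the bad bytes.
gzip -c "$work/a.tar" > "$work/a.tar.gz"

if gzip -t "$work/a.tar.gz" 2>/dev/null; then gzip_verdict="ok"; else gzip_verdict="bad"; fi
if gzip -dc "$work/a.tar.gz" | tar -xf - -C "$work/out" 2>/dev/null; then
  tar_verdict="ok"
else
  tar_verdict="bad"
fi
extracted=$(cat "$work/out/data.txt")

echo "gzip test: $gzip_verdict, tar extract: $tar_verdict, content: $extracted"
```

Only a hash of the original file contents, or of the tar stream taken before the damage, would reveal the change.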

The right defaults for new archives

For a 2026 production pipeline, the right defaults are clear.

For inter-team file transfer where compatibility matters: ZIP with deflate. Universally extractable, reasonably fast, supported by every operating system without third-party tools.

For internal backup or archive: tar piped through zstd. Modern, fast, excellent compression, with XXH64 stream integrity.

For long-term archival where the archive may sit untouched for a decade: tar piped through zstd, paired with a separately stored SHA-256 manifest, paired with a PGP signature of the manifest.

# Internal backup with zstd
tar --create --file=- ./project \
  | zstd --threads=8 -19 --long=27 \
  > project-2026-04-30.tar.zst

# Archival with manifest and signature
tar --create --file=project.tar ./project
zstd --threads=8 -19 project.tar -o project.tar.zst
sha256sum project.tar.zst > project.tar.zst.sha256
gpg --armor --detach-sign project.tar.zst.sha256

The four files (project.tar.zst, project.tar.zst.sha256, project.tar.zst.sha256.asc, plus a copy of the public key) constitute a self-verifying archival package.

Verifying without extracting

Every compression format has a test mode that walks the archive and verifies embedded checksums without writing files to disk. This is the cheapest periodic integrity check and should run on every archive in cold storage at least quarterly.

# ZIP
unzip -t archive.zip

# gzip
gunzip -t archive.gz

# zstd
zstd -t archive.tar.zst

# xz
xz -t archive.tar.xz

# 7z
7z t archive.7z

# tar.gz combined
gunzip -t archive.tar.gz && tar tf archive.tar.gz > /dev/null

For batch verification of many archives, a simple loop:

for f in /backup/*.tar.zst; do
  if ! zstd -t "$f" >/dev/null 2>&1; then
    echo "CORRUPT: $f"
  fi
done

The cost is roughly equal to a sequential read of the archive plus minor CPU for decompression, which is fast enough on modern hardware to verify terabytes overnight.
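
Cold storage rarely holds a single format, so the loop can be generalized. The following is a sketch under stated assumptions: the function names test_archive and verify_dir are hypothetical, the format is guessed from the file extension alone, and each format's tool is assumed to be installed.

```shell
#!/usr/bin/env bash
set -u

# test_archive FILE: run the format's own test mode, chosen by extension.
# Returns 64 for an extension it does not recognize.
test_archive() {
  case "$1" in
    *.zip)               unzip -tq "$1" ;;
    *.tar.gz|*.tgz|*.gz) gzip -t "$1" ;;
    *.zst)               zstd -t "$1" ;;
    *.xz)                xz -t "$1" ;;
    *.bz2)               bzip2 -t "$1" ;;
    *.7z)                7z t "$1" ;;
    *)                   return 64 ;;
  esac
}

# verify_dir DIR: test every regular file directly inside DIR.
# Prints one line per problem and returns non-zero if anything is corrupt.
verify_dir() {
  local dir="$1" failures=0 f status
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    status=0
    test_archive "$f" >/dev/null 2>&1 || status=$?
    case "$status" in
      0)   ;;
      64)  echo "SKIPPED (unknown extension): $f" ;;
      127) echo "CANNOT TEST (tool not installed): $f" ;;
      *)   echo "CORRUPT: $f"; failures=$((failures + 1)) ;;
    esac
  done
  [ "$failures" -eq 0 ]
}

# Example: verify_dir /backup || echo "at least one archive failed"
```

Returning a non-zero status lets a scheduler or monitoring job alert on failure instead of relying on someone reading the output.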

Cryptographic hashes: the trust layer

Embedded CRCs detect accidental corruption. They do not detect intentional substitution because anyone editing the archive can recompute the CRC. The defense is cryptographic hashes computed by you and stored separately from the archive.

# Generate hashes for an archive set
sha256sum *.tar.zst > MANIFEST.sha256

# Verify the entire set later
sha256sum -c MANIFEST.sha256

A SHA-256 hash is exactly 64 hex characters, so each manifest line is the hash plus the file path. A million-file manifest is therefore on the order of 100 MB with typical path lengths, which is rounding error compared to the archives. Always generate the manifest in the same atomic operation that produces the archive, and store the manifest in a different physical location than the archive.
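
A minimal sketch of manifest generation that never leaves a half-written manifest behind, assuming GNU coreutils. The function name write_manifest is hypothetical. It hashes into a temporary file in the same directory and renames it into place, and it records relative file names so the manifest still verifies after the directory is copied elsewhere.

```shell
#!/usr/bin/env bash
set -u

# write_manifest DIR [NAME]: hash every regular, non-hidden file in DIR
# into DIR/NAME (default MANIFEST.sha256) using relative paths.
write_manifest() {
  local dir="$1" manifest="${2:-MANIFEST.sha256}"
  (
    cd "$dir" || exit 1
    tmp=$(mktemp "./.manifest.XXXXXX") || exit 1
    for f in *; do
      [ -f "$f" ] || continue
      [ "$f" = "$manifest" ] && continue   # never hash the manifest itself
      sha256sum -- "$f" || { rm -f "$tmp"; exit 1; }
    done > "$tmp"
    # rename() within one filesystem is atomic: readers see the old
    # manifest or the new one, never a partial file.
    mv -- "$tmp" "$manifest"
  )
}

# Example:
#   write_manifest /backup
#   (cd /backup && sha256sum -c MANIFEST.sha256)
```

Copying the finished manifest to a second location is a separate step and is left out of the sketch.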

"If you think cryptography can solve your problem, then you don't understand your problem and you don't understand cryptography." Bruce Schneier, Secrets and Lies

Hashes alone do not prove integrity if the manifest itself can be tampered with. The manifest must be signed, and the signature must be verifiable against a key whose authenticity you trust independently.

Signing: the provenance layer

The third layer is digital signatures. PGP is the traditional choice; Sigstore is the modern, certificate-transparency-backed alternative; X.509 code-signing certificates are the enterprise standard.

# PGP detached signature
gpg --armor --detach-sign MANIFEST.sha256

# Verify the signature
gpg --verify MANIFEST.sha256.asc MANIFEST.sha256

# Then verify the contents the manifest covers
sha256sum -c MANIFEST.sha256

The full chain: you trust the public key, the public key validates the signature, the signature validates the manifest, the manifest validates the archives. Break any link and the chain is meaningless.
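
The chain can be enforced in order so that a broken early link stops everything after it. This is a sketch, not a definitive implementation: the function names check_signature and verify_chain are hypothetical, the signature is assumed to sit next to the manifest as MANIFEST.sha256.asc, and the signer's public key is assumed to be in the local GnuPG keyring already.

```shell
#!/usr/bin/env bash
set -u

# check_signature MANIFEST: succeed only if MANIFEST.asc is a valid
# signature over MANIFEST. Redefine this function to use another signer.
check_signature() {
  gpg --verify "$1.asc" "$1" >/dev/null 2>&1
}

# verify_chain MANIFEST: signature first, then the files the manifest lists.
# Returns 1 if the signature fails, 2 if any listed file fails.
verify_chain() {
  local manifest="$1" dir name
  dir=$(dirname -- "$manifest")
  name=$(basename -- "$manifest")
  if ! check_signature "$manifest"; then
    echo "BROKEN LINK: signature on $name did not verify" >&2
    return 1
  fi
  if ! (cd "$dir" && sha256sum -c "$name" >/dev/null 2>&1); then
    echo "BROKEN LINK: a file does not match $name" >&2
    return 2
  fi
  echo "VERIFIED: $manifest"
}

# Example: verify_chain /backup/MANIFEST.sha256
```

Checking the signature before the hashes matters: comparing files against a manifest that has not been authenticated proves nothing.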

For organizations distributing archives to external parties (software releases, evidence packages, contractual deliverables), Sigstore's cosign is increasingly preferred because it ties signatures to certificate-transparency logs that are independently verifiable.

# Sigstore signature with cosign (keyless mode)
cosign sign-blob MANIFEST.sha256 \
  --output-signature MANIFEST.sha256.sig \
  --output-certificate MANIFEST.sha256.crt

# Verify
cosign verify-blob \
  --signature MANIFEST.sha256.sig \
  --certificate MANIFEST.sha256.crt \
  --certificate-identity-regexp ".*@example.com" \
  --certificate-oidc-issuer-regexp "https://github.com/login/oauth" \
  MANIFEST.sha256

Per-file vs solid compression: a tradeoff for integrity

Solid compression treats all files in an archive as one continuous stream, allowing the compressor to exploit redundancy across files. The compression ratio is typically 5 to 30 percent better than per-file mode. The cost is fragility: a single corrupt byte damages everything from that point to the end of the stream.

| Mode | Compression ratio | Corruption blast radius | Random access |
|---|---|---|---|
| Per-file (deflate in ZIP) | Baseline | Damages one file | Fast |
| Solid (zstd long mode, 7z solid) | 5-30% better | Damages every file from corruption point | Sequential only |
| Block solid (zstd with long range) | 3-15% better | Damages files in same block | Block-aligned |

For backups where random restore of a single file matters, prefer per-file compression. For archives that will be restored as a whole or never, solid compression saves storage.
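
The ratio side of the tradeoff can be measured directly. The sketch below assumes tar, gzip, and seq are installed and uses gzip for both modes purely as a stand-in, since the point is cross-file redundancy rather than any particular algorithm. The twenty near-identical report files are hypothetical test data, and the exact byte counts will vary by tool version.

```shell
#!/usr/bin/env bash
set -u

work=$(mktemp -d)
mkdir "$work/files"

# Twenty files that differ only in their first line.
i=1
while [ "$i" -le 20 ]; do
  { echo "report $i"; seq 1 400; } > "$work/files/report-$i.txt"
  i=$((i + 1))
done

# Per-file mode: each file is compressed independently.
per_file_bytes=0
for f in "$work/files"/*.txt; do
  n=$(gzip -c "$f" | wc -c)
  per_file_bytes=$((per_file_bytes + n))
done

# Solid mode: one tar stream through one compressor, so later files
# can reuse matches found in earlier ones.
solid_bytes=$(( $(tar -cf - -C "$work" files | gzip -c | wc -c) ))

echo "per-file total: $per_file_bytes bytes"
echo "solid stream:   $solid_bytes bytes"
```

Highly repetitive inputs like these exaggerate the gap; real project trees usually land much closer to the range quoted above.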

Recovery strategies

When an archive is corrupt and verification fails, recovery options depend on what is corrupt.

Truncation: If the file is shorter than expected, the tail is lost. For tar archives, tar --ignore-zeros can sometimes extract whole files before the truncation point. For zip, zip -F (fix) can rebuild the central directory if the file data survives.

Bit flips in the middle: A single corrupt sector typically damages one or a few files. For solid archives, this propagates to the rest. Running a plain 7z x on the damaged archive can sometimes salvage the files that are still decodable; it reports an error for each entry it cannot extract.

Header corruption: If the archive's master header is damaged, formats with redundant per-file headers (zip) survive better than stream formats (gzip, zstd) where the header is at the start.

# Attempt to fix a corrupt zip
zip -FF broken.zip --out repaired.zip

# Salvage what you can from a corrupt 7z (extracts every entry that
# still decodes; the -r switch only controls recursion and is not needed)
7z x broken.7z

# For tar, extract until the first error and continue
tar --ignore-zeros -xf broken.tar

Always work on a copy. Recovery operations can damage further what they are trying to repair.

RAID and replication are not backups

A common mistake in integrity planning is treating RAID or replication as backup. RAID protects against drive failure. It does not protect against logical corruption (a file is overwritten with garbage and the change replicates to every mirror). It does not protect against ransomware (the encrypted version replicates everywhere). It does not protect against operator error (a misplaced rm -rf cascades to all mirrors).

The 3-2-1 rule remains the right default: three copies, on two different media, with one offsite. For long-term archival, add a fourth copy that is offline and write-once.

A reproducible backup pipeline

Here is the kind of pipeline a small operations team actually runs to back up a project tree to multiple destinations with full integrity verification.

#!/usr/bin/env bash
set -euo pipefail

SOURCE="${1:-/srv/project}"
STAGING="${2:-/tmp/backup-staging}"
TIMESTAMP=$(date -u +%Y%m%dT%H%M%SZ)
ARCHIVE="$STAGING/project-$TIMESTAMP.tar.zst"
MANIFEST="$ARCHIVE.sha256"

mkdir -p "$STAGING"

# Compress with high ratio and integrity
tar --create --file=- \
    --exclude-vcs --exclude='*.tmp' \
    -C "$(dirname "$SOURCE")" "$(basename "$SOURCE")" \
  | zstd --threads=8 -19 --long=27 \
  > "$ARCHIVE"

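# Generate hash from inside the staging directory so the manifest records a
# relative file name and still verifies after the files are copied elsewhere
( cd "$STAGING" && sha256sum "$(basename "$ARCHIVE")" > "$(basename "$MANIFEST")" )

# Sign
gpg --batch --yes --armor --detach-sign "$MANIFEST"

# Verify before shipping
zstd -t "$ARCHIVE"
gpg --verify "$MANIFEST.asc" "$MANIFEST"
( cd "$STAGING" && sha256sum -c "$(basename "$MANIFEST")" > /dev/null )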

# Ship to multiple destinations
rclone copy "$ARCHIVE" remote-s3:backups/
rclone copy "$MANIFEST" remote-s3:backups/
rclone copy "$MANIFEST.asc" remote-s3:backups/

rclone copy "$ARCHIVE" remote-b2:cold-archive/
rclone copy "$MANIFEST" remote-b2:cold-archive/
rclone copy "$MANIFEST.asc" remote-b2:cold-archive/

echo "Backup complete: $ARCHIVE"

This script verifies before shipping, fails loudly on any verification error, and ships to two independent providers. The same pattern scales to dozens of destinations and to enterprise contexts.

Cross-domain consistency

The integrity discipline shown here applies across every data-handling context: software releases, document archives, media masters, log retention. The same patterns show up in compliance-heavy domains such as company-formation document storage at corpy.xyz, evidence packages used in education-and-credential contexts at pass4-sure.us, and content archives at downundercafe.com.

"Trust, but verify." Russian proverb, popularized by Ronald Reagan

In compression pipelines this is literal advice. Trust the storage provider. Verify the archive every time you touch it.

Compression algorithm tradeoffs

Different compression algorithms balance speed, ratio, and integrity protection differently. The choice matters not just for storage cost but for how quickly verification can run on the archive.

| Algorithm | Compression speed | Decompression speed | Ratio | Integrity check | Notes |
|---|---|---|---|---|---|
| gzip (deflate) | Fast | Very fast | Baseline | CRC-32 of stream | Universal compatibility |
| bzip2 | Slow | Slow | 10-15% better than gzip | CRC-32 per block | Block-level recovery possible |
| xz (LZMA2) | Very slow | Medium | 25-30% better than gzip | CRC-32, CRC-64, or SHA-256 | Highest ratio of mainstream tools |
| zstd | Very fast | Very fast | Comparable to xz at high levels | XXH64 per frame (low 32 bits stored) | Modern default for most cases |
| lz4 | Extremely fast | Extremely fast | Worst ratio | XXH32 of frame content, optional per block | For latency-sensitive transfer |
| brotli | Slow | Fast | Comparable to xz | None native | Web content delivery |

For most production archives in 2026, zstd is the right default. It compresses at gzip speed at low levels and at xz-comparable ratio at high levels, with a built-in frame checksum derived from XXH64. Only the low 32 bits of that hash are stored, so its protection against accidental corruption is comparable to CRC-32, not stronger; it is not a substitute for a SHA-256 manifest. The format is stable, RFC-published, and supported by every major distribution.

Storage media considerations

The integrity protection in an archive interacts with the storage medium. Spinning disks fail by sectors; SSDs fail by entire pages or blocks; tape fails by linear runs of sectors; cloud object stores fail by occasional object loss. Archive design should match expected failure modes.

For tape archives, multi-volume tar with explicit blocking factor and per-volume checksums is the standard pattern. The tar --multi-volume flag handles tape change prompts; a separate hash-per-volume file gives recovery options if one tape fails.

For cloud object stores, multipart uploads with their own ETag verification provide one layer; the application-level SHA-256 manifest provides the second.

For local backup to external drives, rsync with --checksum mode plus a periodic full verification against the manifest catches drift over time. Magnetic media bit-rots; verification once a year is the minimum reasonable cadence.

Encryption versus integrity: distinct concerns

Operators often conflate encryption with integrity. Encryption protects confidentiality; integrity protects against undetected modification. A file can be encrypted but tampered with, or unencrypted but provably unmodified. The two protections are independent and require separate mechanisms.
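
The "encrypted but tampered with" case can be shown concretely. The sketch below assumes the openssl command-line tool is installed and uses AES-256 in CTR mode, chosen here only because it is a common unauthenticated mode; the key, IV, message, and the function name tamper_demo are all hypothetical. Without knowing the key, the attacker flips bits in one ciphertext byte and the decrypted message changes in a controlled way, with no error from the decryptor.

```shell
#!/usr/bin/env bash
set -u

tamper_demo() {
  local work key iv old new
  work=$(mktemp -d)
  key=000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f
  iv=00000000000000000000000000000001

  printf 'pay 100 to alice' > "$work/plain.txt"
  openssl enc -aes-256-ctr -K "$key" -iv "$iv" \
    -in "$work/plain.txt" -out "$work/cipher.bin" || return 1

  # Attacker: XOR ciphertext byte 4 (the "1") with 0x08. In CTR mode the
  # same bit flips in the plaintext, turning "1" (0x31) into "9" (0x39).
  old=$(dd if="$work/cipher.bin" bs=1 skip=4 count=1 2>/dev/null \
        | od -An -tu1 | tr -d ' \n')
  new=$((old ^ 8))
  printf "\\$(printf '%03o' "$new")" \
    | dd of="$work/cipher.bin" bs=1 seek=4 conv=notrunc 2>/dev/null

  # Decryption succeeds and returns the altered message.
  openssl enc -d -aes-256-ctr -K "$key" -iv "$iv" -in "$work/cipher.bin"
  echo
}

if command -v openssl >/dev/null 2>&1; then
  tamper_demo
fi
```

An authenticated mode, or a signed hash of the ciphertext, would have rejected the modified file instead of decrypting it.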

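For archives that need both, modern AEAD (authenticated encryption with associated data) ciphers like AES-GCM and ChaCha20-Poly1305 combine the two. age provides this for arbitrary streams using ChaCha20-Poly1305. The 7z format with -mhe=on encrypts both file data and headers with AES-256 but does not add a MAC, so an encrypted 7z archive still needs an externally stored hash or signature. The openssl enc command does not support AEAD modes, so it is not a drop-in alternative here.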

# age for encrypted, integrity-protected file
age --encrypt --recipients-file public_keys.txt \
  -o backup.age backup.tar.zst

# Verify and decrypt later
age --decrypt --identity private_key.txt backup.age \
  > backup.tar.zst

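age authenticates the payload chunk by chunk and never releases a chunk that fails its check; it exits with an error at the first failure. Treat a non-zero exit status as a failed restore and discard whatever output was written up to that point.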

Audit trail and logging

For regulated environments (financial records, legal evidence, medical data), every integrity check should produce an audit log entry. The log should record: which file, when, who initiated, what hash was verified against, what tool version was used, and the result.

log_check() {
  local file="$1"
  local result="$2"
  local hash
  hash=$(sha256sum "$file" | awk '{print $1}')
  # Fields: timestamp, file, initiating user, hash, tool version, result
  printf '%s\tfile=%s\tuser=%s\thash=%s\ttool=%s\tresult=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    "$file" "$(id -un)" "$hash" "$(zstd --version | head -1)" "$result" \
    >> /var/log/integrity-audit.log
}

Append-only logs on a system where the log file is owned by a different user than the verification process resist tampering by the operator. The next layer is forwarding the log to a SIEM where it cannot be altered locally.
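
An audit log is only useful if someone reads it. The sketch below assumes the tab-separated key=value layout produced by log_check and assumes that a passing check is recorded as result=ok, which is a convention chosen for this example and not something the logging function enforces. The function name failed_checks is hypothetical.

```shell
#!/usr/bin/env bash
set -u

# failed_checks LOGFILE: print "timestamp<TAB>file" for every entry whose
# result field is anything other than "ok". Fields are located by their
# key prefix, so extra fields in the line do not matter.
failed_checks() {
  awk -F '\t' '
    {
      file = ""; result = ""
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^file=/)   file = substr($i, 6)
        if ($i ~ /^result=/) result = substr($i, 8)
      }
      if (result != "ok") print $1 "\t" file
    }
  ' "$1"
}

# Example: failed_checks /var/log/integrity-audit.log
```

Lines with a missing result field are reported as failures, which errs on the side of attention for malformed or truncated entries.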

Common mistakes that survive years of practice

Three errors recur. First, treating CRC-32 as an integrity guarantee when it only catches accidents. Second, storing the hash manifest in the same place as the archive, where an attacker with write access can compromise both. Third, verifying archives only at restore time, when discovery is too late. A pipeline that hashes, signs, and verifies at every stage is the only one that earns the word "trustworthy."
