A backup engineer once told a roomful of operations staff: "The first time you discover your archive is corrupt is the day you need to restore it." Compression algorithms are deterministic. Storage media are not. Network transfers are not. Human operators are not. Between the source data and the eventual restore, dozens of opportunities exist for a flipped bit, a truncated stream, or a substituted file to corrupt the archive. A pipeline that does not actively defend integrity is a pipeline that quietly accumulates risk.
This article walks through what integrity actually means for compressed files, what protections each common format provides natively, and the verification and signing layers that turn "compressed" into "compressed and trustworthy." The focus is on tools every operations team can deploy: zip, tar, gzip, zstd, sha256sum, and PGP, with notes on where Sigstore and code-signing certificates fit in.
The three failure modes a compression pipeline must address
Integrity failures in compressed data come in three flavors, each with a different defense.
Accidental corruption happens when storage flips a bit, a partial download truncates a file, or a buggy filesystem driver writes garbage. CRC-32 and similar checksums embedded in archive formats catch most of these with high probability.
Substitution happens when an attacker with write access replaces a file in the archive and recomputes the CRC so verification still passes. CRC-32 cannot detect this. Cryptographic hashes (SHA-256) can, but only if the hash itself is trusted.
Provenance forgery happens when an attacker creates an entirely new archive and presents it as yours. Hashes do not help here because the attacker controls both the archive and the hash. Digital signatures (PGP, Sigstore, code-signing certificates) are the defense.
A robust pipeline addresses all three. A typical pipeline addresses only the first, relying on CRC-32 and hoping nobody is hostile.
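The first failure mode is easy to reproduce. The following is a minimal sketch, assuming bash, GNU coreutils, and gzip; the helper names `flip_byte` and `demo_accidental_corruption` are hypothetical. It damages one byte of the CRC-32 stored in a gzip trailer and confirms that the format's test mode notices.

```shell
#!/usr/bin/env bash
# flip_byte FILE OFFSET: invert every bit of one byte in place (hypothetical helper).
flip_byte() {
    local file="$1" offset="$2" orig flipped
    # Read the current byte as an unsigned decimal value.
    orig=$(od -An -tu1 -j "$offset" -N1 "$file" | tr -d ' \n')
    flipped=$(( orig ^ 255 ))
    # Write the inverted byte back without truncating the file.
    printf "\\$(printf '%03o' "$flipped")" \
        | dd of="$file" bs=1 seek="$offset" conv=notrunc 2>/dev/null
}

# demo_accidental_corruption: returns 0 if gzip -t notices the damage,
# 1 if the damage went unnoticed, 2 if the pristine archive failed its test.
demo_accidental_corruption() {
    local dir size rc=0
    dir=$(mktemp -d)
    seq 1 5000 > "$dir/data.txt"
    gzip -c "$dir/data.txt" > "$dir/data.txt.gz"
    if ! gzip -t "$dir/data.txt.gz"; then
        rm -rf "$dir"; return 2
    fi
    size=$(wc -c < "$dir/data.txt.gz")
    # A gzip member ends with 4 bytes of CRC-32 and 4 bytes of ISIZE; damage the CRC.
    flip_byte "$dir/data.txt.gz" $(( size - 8 ))
    if gzip -t "$dir/data.txt.gz" 2>/dev/null; then
        rc=1
    fi
    rm -rf "$dir"
    return "$rc"
}
```

Calling `demo_accidental_corruption && echo "damage detected"` exercises the whole path. The same check says nothing about substitution: anyone who rewrites the file with gzip gets a fresh, valid CRC for free.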
"Civilization advances by extending the number of important operations which we can perform without thinking about them." Alfred North Whitehead
For most teams, integrity verification should be automatic. The operations staff should not have to remember to run sha256sum after every transfer; the pipeline should refuse to use unverified data.
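One way to make verification automatic is to gate every consumer of the data behind the check. This is a sketch only, assuming GNU `sha256sum` and a manifest that lists paths relative to its own directory; `run_if_verified` is a hypothetical name, not part of any tool.

```shell
#!/usr/bin/env bash
# run_if_verified MANIFEST COMMAND...: run COMMAND only when every file listed
# in MANIFEST matches its recorded SHA-256 (hypothetical wrapper).
run_if_verified() {
    local manifest="$1"; shift
    local dir
    dir=$(cd "$(dirname "$manifest")" && pwd) || return 1
    # Paths inside the manifest are resolved relative to the manifest's directory.
    if ! (cd "$dir" && sha256sum --check --quiet "$(basename "$manifest")") >&2; then
        echo "REFUSED: verification failed for $manifest" >&2
        return 1
    fi
    "$@"
}

# Example (paths are placeholders):
#   run_if_verified /backup/MANIFEST.sha256 tar -xf /backup/project.tar.zst
```

The operator never types `sha256sum`; a restore simply does not start unless the hashes match.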
What native protections each format provides
| Format | Embedded checksum | Cryptographic protection | Solid mode | Test mode |
|---|---|---|---|---|
| zip (deflate) | CRC-32 per file | Optional AES-256 if encrypted | No | unzip -t |
| zip64 | CRC-32 per file | Optional AES-256 | No | unzip -t |
| 7z | CRC-32 per file, SHA-256 optional | AES-256 with header encryption | Yes | 7z t |
| gzip | CRC-32 of stream | None | Single-stream only | gunzip -t |
| bzip2 | CRC-32 per block | None | Single-stream only | bunzip2 -t |
| xz | CRC-32, CRC-64, or SHA-256 | None | Single-stream | xz -t |
| zstd | XXH64 of stream | None | Single-stream | zstd -t |
| tar | Header checksums only, none for file data | None | Yes | tar -tf (reads every header, does not verify data) |
| lz4 | XXH32 per block | None | Single-stream | lz4 -t |
The right defaults for new archives
For a 2026 production pipeline, the right defaults are clear.
For inter-team file transfer where compatibility matters: ZIP with deflate. Universally extractable, reasonably fast, supported by every operating system without third-party tools.
For internal backup or archive: tar piped through zstd. Modern, fast, excellent compression, with XXH64 stream integrity.
For long-term archival where the archive may sit untouched for a decade: tar piped through zstd, paired with a separately stored SHA-256 manifest, paired with a PGP signature of the manifest.
# Internal backup with zstd
tar --create --file=- ./project \
| zstd --threads=8 -19 --long=27 \
> project-2026-04-30.tar.zst
# Archival with manifest and signature
tar --create --file=project.tar ./project
zstd --threads=8 -19 project.tar -o project.tar.zst
sha256sum project.tar.zst > project.tar.zst.sha256
gpg --armor --detach-sign project.tar.zst.sha256
The four files (project.tar.zst, project.tar.zst.sha256, project.tar.zst.sha256.asc, plus a copy of the public key) constitute a self-verifying archival package.
Verifying without extracting
Each compression format covered here has a test mode that walks the archive and verifies embedded checksums without writing files to disk. This is the cheapest periodic integrity check and should run on every archive in cold storage at least quarterly.
# ZIP
unzip -t archive.zip
# gzip
gunzip -t archive.gz
# zstd
zstd -t archive.tar.zst
# xz
xz -t archive.tar.xz
# 7z
7z t archive.7z
# tar.gz combined
gunzip -t archive.tar.gz && tar tf archive.tar.gz > /dev/null
For batch verification of many archives, a simple loop:
for f in /backup/*.tar.zst; do
if ! zstd -t "$f" >/dev/null 2>&1; then
echo "CORRUPT: $f"
fi
done
The cost is roughly equal to a sequential read of the archive plus minor CPU for decompression, which is fast enough on modern hardware to verify terabytes overnight.
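Cold storage rarely holds a single format. The sketch below extends the loop to a mixed directory by choosing the test command from the file extension. It assumes the corresponding tools are installed; `test_archive` is a hypothetical name, and the extension list is illustrative rather than complete.

```shell
#!/usr/bin/env bash
# test_archive FILE: run the format's own test mode, chosen by file extension.
# Returns 0 if intact, 1 if the test fails, 2 if the extension is not recognized.
test_archive() {
    local f="$1"
    local -a cmd
    case "$f" in
        *.zip)      cmd=(unzip -tq) ;;
        *.7z)       cmd=(7z t) ;;
        *.gz|*.tgz) cmd=(gzip -t) ;;
        *.bz2)      cmd=(bzip2 -t) ;;
        *.xz)       cmd=(xz -t) ;;
        *.zst)      cmd=(zstd -t) ;;
        *.lz4)      cmd=(lz4 -t) ;;
        *)          echo "UNKNOWN FORMAT: $f" >&2; return 2 ;;
    esac
    if ! "${cmd[@]}" "$f" >/dev/null 2>&1; then
        echo "CORRUPT: $f"
        return 1
    fi
}

# Example (directory is a placeholder):
#   for f in /backup/*; do test_archive "$f"; done
```

Keeping the "unknown" case distinct from "corrupt" matters in practice: an unrecognized file is a gap in coverage, not evidence of damage.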
Cryptographic hashes: the trust layer
Embedded CRCs detect accidental corruption. They do not detect intentional substitution because anyone editing the archive can recompute the CRC. The defense is cryptographic hashes computed by you and stored separately from the archive.
# Generate hashes for an archive set
sha256sum *.tar.zst > MANIFEST.sha256
# Verify the entire set later
sha256sum -c MANIFEST.sha256
A SHA-256 hash is exactly 64 hex characters, and each manifest line adds two separator characters, the file path, and a newline. A million-file manifest is therefore on the order of 100 MB for typical path lengths, which is rounding error compared to the archives. Always generate the manifest in the same atomic operation that produces the archive, and store the manifest in a different physical location than the archive.
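One reading of "atomic" is that a manifest should never be visible half-written. The sketch below hashes into a temporary file and renames it into place only after every hash succeeded. It assumes GNU coreutils and that the temporary file and final manifest share a filesystem; `write_manifest` is a hypothetical name.

```shell
#!/usr/bin/env bash
# write_manifest DIR: hash every *.tar.zst in DIR into DIR/MANIFEST.sha256,
# publishing the manifest only once all hashes have been computed.
write_manifest() {
    local dir="$1" tmp
    tmp=$(mktemp "$dir/.MANIFEST.XXXXXX") || return 1
    if (cd "$dir" && sha256sum -- *.tar.zst) > "$tmp"; then
        # A rename within one filesystem replaces the old manifest in one step.
        mv -f "$tmp" "$dir/MANIFEST.sha256"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Example (directory is a placeholder):
#   write_manifest /backup && cp /backup/MANIFEST.sha256 /mnt/other-site/
```

Because the hashes are recorded with bare filenames, the manifest still verifies after the directory is copied elsewhere.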
"If you think cryptography can solve your problem, then you don't understand your problem and you don't understand cryptography." Bruce Schneier, Secrets and Lies
Hashes alone do not prove integrity if the manifest itself can be tampered with. The manifest must be signed, and the signature must be verifiable against a key whose authenticity you trust independently.
Signing: the provenance layer
The third layer is digital signatures. PGP is the traditional choice; Sigstore is the modern, transparency-log-backed alternative; X.509 code-signing certificates are the enterprise standard.
# PGP detached signature
gpg --armor --detach-sign MANIFEST.sha256
# Verify the signature
gpg --verify MANIFEST.sha256.asc MANIFEST.sha256
# Then verify the contents the manifest covers
sha256sum -c MANIFEST.sha256
The full chain: you trust the public key, the public key validates the signature, the signature validates the manifest, the manifest validates the archives. Break any link and the chain is meaningless.
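The order of the links matters: the hashes are only worth checking once the manifest itself has been authenticated. A minimal sketch of that ordering follows, assuming `gpg` and GNU `sha256sum`, a detached signature stored next to the manifest as `MANIFEST.sha256.asc`, and a hypothetical function name `verify_chain`. Deciding whether the public key deserves trust happens outside this script.

```shell
#!/usr/bin/env bash
# verify_chain MANIFEST: check the signature first, then the hashes it vouches for.
verify_chain() {
    local manifest="$1" dir name
    dir=$(dirname "$manifest")
    name=$(basename "$manifest")
    # Key -> signature -> manifest: the detached signature must validate.
    gpg --verify "$manifest.asc" "$manifest" 2>/dev/null \
        || { echo "BAD SIGNATURE: $manifest" >&2; return 1; }
    # Manifest -> archives: every listed file must match its recorded hash.
    (cd "$dir" && sha256sum --check --quiet "$name") >&2 \
        || { echo "HASH MISMATCH under $dir" >&2; return 1; }
}

# Example (path is a placeholder):
#   verify_chain /restore/MANIFEST.sha256 && echo "safe to extract"
```

Note that `gpg --verify` succeeds for any key present in the keyring, so a keyring containing only keys you have verified out of band is part of the chain.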
For organizations distributing archives to external parties (software releases, evidence packages, contractual deliverables), Sigstore's cosign is increasingly preferred because it records each signature in a public transparency log (Rekor) that anyone can audit independently.
# Sigstore signature with cosign (keyless mode)
cosign sign-blob MANIFEST.sha256 \
--output-signature MANIFEST.sha256.sig \
--output-certificate MANIFEST.sha256.crt
# Verify
cosign verify-blob \
--signature MANIFEST.sha256.sig \
--certificate MANIFEST.sha256.crt \
--certificate-identity-regexp '^[^@]+@example\.com$' \
--certificate-oidc-issuer "https://github.com/login/oauth" \
MANIFEST.sha256
Per-file vs solid compression: a tradeoff for integrity
Solid compression treats all files in an archive as one continuous stream, allowing the compressor to exploit redundancy across files. The compression ratio is typically 5 to 30 percent better than per-file mode. The cost is fragility: a single corrupt byte damages everything from that point to the end of the stream.
| Mode | Compression ratio | Corruption blast radius | Random access |
|---|---|---|---|
| Per-file (deflate in ZIP) | Baseline | Damages one file | Fast |
| Solid (tar piped through zstd, 7z solid) | 5-30% better | Damages every file from corruption point | Sequential only |
| Block solid (7z with a solid block size limit, e.g. -ms=64m) | 3-15% better | Damages files in same block | Block-aligned |
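The ratio side of the tradeoff can be measured directly. The sketch below uses gzip as a stand-in for both modes (an assumption for portability; the article's formats are ZIP, zstd, and 7z): it compresses ten similar files one by one, then compresses a single tar stream containing all of them. The function name `compare_modes` and the generated test data are hypothetical.

```shell
#!/usr/bin/env bash
# compare_modes: print "PER_FILE_BYTES SOLID_BYTES" for ten similar files.
compare_modes() {
    local dir i f per_file=0 solid
    dir=$(mktemp -d)
    mkdir "$dir/src"
    for i in 1 2 3 4 5 6 7 8 9 10; do
        { echo "report $i"; seq 1 2000; } > "$dir/src/file$i.txt"
    done
    # Per-file mode: each file is compressed independently.
    for f in "$dir"/src/*.txt; do
        per_file=$(( per_file + $(gzip -9 -c "$f" | wc -c) ))
    done
    # Solid mode: one compressed stream over the whole tar archive.
    solid=$(tar -C "$dir" -cf - src | gzip -9 -c | wc -c)
    rm -rf "$dir"
    echo "$per_file $(( solid ))"
}

# Example:
#   compare_modes    # prints the two byte counts
```

The files here are nearly identical, so the gap is far larger than the 5 to 30 percent typical of real data. The point is the mechanism: the solid stream can reference earlier files, and for exactly that reason damage early in the stream affects everything after it.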
Recovery strategies
When an archive is corrupt and verification fails, recovery options depend on what is corrupt.
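Truncation: If the file is shorter than expected, the tail is lost. For tar archives, members stored before the truncation point still extract normally, and tar --ignore-zeros keeps reading past zeroed blocks instead of treating them as the end of the archive. For zip, truncation usually destroys the central directory at the end of the file; zip -FF rescans the archive for file signatures and rebuilds it, whereas zip -F needs a mostly intact central directory.
Bit flips in the middle: A single corrupt sector typically damages one or a few files. For solid archives, this propagates to the rest. 7z has no repair command; 7z x extracts the entries it can still decode and reports errors for the others, which in a solid block means everything after the damage.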
Header corruption: If the archive's master header is damaged, formats with redundant per-file headers (zip) survive better than stream formats (gzip, zstd) where the header is at the start.
# Attempt to fix a corrupt zip
zip -FF broken.zip --out repaired.zip
# Salvage what you can from a corrupt 7z (extracts readable entries, reports the rest)
7z x broken.7z
# For tar, keep reading past zeroed blocks instead of stopping at them
tar --ignore-zeros -xf broken.tar
Always work on a copy. Recovery operations can damage further what they are trying to repair.
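For tar, the copy-first rule and the salvage step fit in one small function. This is a sketch under stated assumptions: GNU tar (or a tar that accepts `--ignore-zeros`), enough scratch space for a full copy, and a hypothetical function name `salvage_tar`.

```shell
#!/usr/bin/env bash
# salvage_tar ARCHIVE OUTDIR: extract whatever survives from a damaged tar,
# operating on a scratch copy so the original is never touched.
salvage_tar() {
    local archive="$1" outdir="$2" copy rc=0
    copy=$(mktemp) || return 1
    cp -- "$archive" "$copy" || { rm -f "$copy"; return 1; }
    mkdir -p "$outdir"
    # tar exits non-zero on damage; members read before the damage are still written.
    tar --ignore-zeros -xf "$copy" -C "$outdir" 2>/dev/null || rc=$?
    rm -f "$copy"
    return "$rc"
}

# Example (paths are placeholders):
#   salvage_tar broken.tar ./recovered || echo "partial recovery, inspect ./recovered"
```

A non-zero return here means "incomplete", not "nothing recovered". Compare the recovered files against the manifest to learn which ones are whole.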
RAID and replication are not backups
A common mistake in integrity planning is treating RAID or replication as backup. RAID protects against drive failure. It does not protect against logical corruption (a file is overwritten with garbage and the change replicates to every mirror). It does not protect against ransomware (the encrypted version replicates everywhere). It does not protect against operator error (a misplaced rm -rf cascades to all mirrors).
The 3-2-1 rule remains the right default: three copies, on two different media, with one offsite. For long-term archival, add a fourth copy that is offline and write-once.
A reproducible backup pipeline
Here is the kind of pipeline a small operations team actually runs to back up a project tree to multiple destinations with full integrity verification.
#!/usr/bin/env bash
set -euo pipefail
SOURCE="${1:-/srv/project}"
STAGING="${2:-/tmp/backup-staging}"
TIMESTAMP=$(date -u +%Y%m%dT%H%M%SZ)
ARCHIVE="$STAGING/project-$TIMESTAMP.tar.zst"
MANIFEST="$ARCHIVE.sha256"
mkdir -p "$STAGING"
# Compress with high ratio and integrity
tar --create --file=- \
--exclude-vcs --exclude='*.tmp' \
-C "$(dirname "$SOURCE")" "$(basename "$SOURCE")" \
| zstd --threads=8 -19 --long=27 \
> "$ARCHIVE"
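# Generate hash (record the bare filename so the manifest verifies from any directory)
(cd "$STAGING" && sha256sum "$(basename "$ARCHIVE")") > "$MANIFEST"
# Sign
gpg --batch --yes --armor --detach-sign "$MANIFEST"
# Verify before shipping
zstd -t "$ARCHIVE"
gpg --verify "$MANIFEST.asc" "$MANIFEST"
(cd "$STAGING" && sha256sum -c "$(basename "$MANIFEST")") > /dev/null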
# Ship to multiple destinations
rclone copy "$ARCHIVE" remote-s3:backups/
rclone copy "$MANIFEST" remote-s3:backups/
rclone copy "$MANIFEST.asc" remote-s3:backups/
rclone copy "$ARCHIVE" remote-b2:cold-archive/
rclone copy "$MANIFEST" remote-b2:cold-archive/
rclone copy "$MANIFEST.asc" remote-b2:cold-archive/
echo "Backup complete: $ARCHIVE"
This script verifies before shipping, fails loudly on any verification error, and ships to two independent providers. The same pattern scales to dozens of destinations and to enterprise contexts.
Cross-domain consistency
The integrity discipline shown here applies across every data-handling context: software releases, document archives, media masters, log retention. The same patterns show up in compliance-heavy domains such as company-formation document storage at corpy.xyz, evidence packages used in education-and-credential contexts at pass4-sure.us, and content archives at downundercafe.com.
"Trust, but verify." Russian proverb, popularized by Ronald Reagan
In compression pipelines this is literal advice. Trust the storage provider. Verify the archive every time you touch it.
Compression algorithm tradeoffs
Different compression algorithms balance speed, ratio, and integrity protection differently. The choice matters not just for storage cost but for how quickly verification can run on the archive.
| Algorithm | Compression speed | Decompression speed | Ratio | Integrity check | Notes |
|---|---|---|---|---|---|
| gzip (deflate) | Fast | Very fast | Baseline | CRC-32 of stream | Universal compatibility |
| bzip2 | Slow | Slow | 10-15% better than gzip | CRC-32 per block | Block-level recovery possible |
| xz (LZMA2) | Very slow | Medium | 25-30% better than gzip | CRC-32, CRC-64, or SHA-256 | Highest ratio of mainstream tools |
| zstd | Very fast | Very fast | Comparable to xz at high levels | XXH64 of stream | Modern default for most cases |
| lz4 | Extremely fast | Extremely fast | Worst ratio | XXH32 per block | For latency-sensitive transfer |
| brotli | Slow | Fast | Comparable to xz | None native | Web content delivery |
Storage media considerations
The integrity protection in an archive interacts with the storage medium. Spinning disks fail by sectors; SSDs fail by entire pages or blocks; tape fails by linear runs of sectors; cloud object stores fail by occasional object loss. Archive design should match expected failure modes.
For tape archives, multi-volume tar with explicit blocking factor and per-volume checksums is the standard pattern. The tar --multi-volume flag handles tape change prompts; a separate hash-per-volume file gives recovery options if one tape fails.
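The hash-per-volume idea can be tried without a tape drive. In the sketch below, `split` stands in for tape volumes (an assumption; a real deployment would hash each volume as tar --multi-volume finishes writing it). The function name `split_with_hashes` and the `.vol.` and `.volumes.sha256` naming are hypothetical.

```shell
#!/usr/bin/env bash
# split_with_hashes ARCHIVE VOLUME_BYTES: cut ARCHIVE into fixed-size volumes
# named ARCHIVE.vol.aa, ARCHIVE.vol.ab, ... and record one SHA-256 per volume.
split_with_hashes() {
    local archive="$1" volume_bytes="$2" dir name
    dir=$(dirname "$archive")
    name=$(basename "$archive")
    (
        cd "$dir" || exit 1
        split -b "$volume_bytes" "$name" "$name.vol." || exit 1
        sha256sum -- "$name".vol.* > "$name.volumes.sha256"
    )
}

# Example (path and size are placeholders):
#   split_with_hashes /staging/project.tar 1073741824
#   (cd /staging && sha256sum -c project.tar.volumes.sha256)
```

When verification later fails, the per-volume file tells you which volume to re-read or replace from another copy, rather than only that the whole set is bad.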
For cloud object stores, multipart uploads with their own ETag verification provide one layer; the application-level SHA-256 manifest provides the second.
For local backup to external drives, rsync with --checksum mode plus a periodic full verification against the manifest catches drift over time. Magnetic media bit-rots; verification once a year is the minimum reasonable cadence.
Encryption versus integrity: distinct concerns
Operators often conflate encryption with integrity. Encryption protects confidentiality; integrity protects against undetected modification. A file can be encrypted but tampered with, or unencrypted but provably unmodified. The two protections are independent and require separate mechanisms.
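For archives that need both, modern AEAD (authenticated encryption with associated data) ciphers like AES-GCM and ChaCha20-Poly1305 combine the two. age, which encrypts the payload with ChaCha20-Poly1305, provides this for arbitrary streams. Be careful with older tools: 7z's AES-256 encryption (including header encryption with -mhe=on) uses CBC mode without a MAC, and the openssl enc command does not support AEAD modes, so archives encrypted that way still need an external hash or signature.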
# age for encrypted, integrity-protected file
age --encrypt --recipient-file public_keys.txt \
-o backup.age backup.tar.zst
# Verify and decrypt later
age --decrypt --identity private_key.txt backup.age \
> backup.tar.zst
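age authenticates the payload in 64 KiB chunks and exits with an error as soon as a chunk fails its check, so it never emits unauthenticated plaintext. Plaintext from earlier chunks may already have been written, so treat the output of a failed run as incomplete and delete it.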
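Because the decrypt example writes through a shell redirect, the output file is created before age runs, and a failed decrypt can leave an empty or incomplete file under the final name. A minimal sketch that avoids this decrypts to a temporary name and renames only on success; `decrypt_verified` is a hypothetical wrapper, and it assumes the temporary file lives on the same filesystem as the output.

```shell
#!/usr/bin/env bash
# decrypt_verified ENCRYPTED OUTPUT IDENTITY: publish OUTPUT only if age exits 0.
decrypt_verified() {
    local encrypted="$1" output="$2" identity="$3" tmp
    tmp=$(mktemp "$output.partial.XXXXXX") || return 1
    if age --decrypt --identity "$identity" -o "$tmp" "$encrypted"; then
        mv -f "$tmp" "$output"
    else
        rm -f "$tmp"    # discard anything written before the failure
        return 1
    fi
}

# Example (file names are placeholders):
#   decrypt_verified backup.age backup.tar.zst private_key.txt
```

A restore script can then treat the presence of `backup.tar.zst` as proof that the whole stream authenticated.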
Audit trail and logging
For regulated environments (financial records, legal evidence, medical data), every integrity check should produce an audit log entry. The log should record: which file, when, who initiated, what hash was verified against, what tool version was used, and the result.
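log_check() {
    local file="$1"
    local result="$2"
    local hash
    # Hash of the file as it exists at check time
    hash=$(sha256sum "$file" | awk '{print $1}')
    # One tab-separated line: timestamp, initiating user, file, hash, tool version, result
    printf '%s\tuser=%s\tfile=%s\thash=%s\ttool=%s\tresult=%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(id -un)" \
        "$file" "$hash" "$(zstd --version | head -1)" "$result" \
        >> /var/log/integrity-audit.log
}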
Append-only logs on a system where the log file is owned by a different user than the verification process resist tampering by the operator. The next layer is forwarding the log to a SIEM where it cannot be altered locally.
Common mistakes that survive years of practice
Three errors recur. First, treating CRC-32 as an integrity guarantee when it only catches accidents. Second, storing the hash manifest in the same place as the archive, where an attacker with write access can compromise both. Third, verifying archives only at restore time, when discovery is too late. A pipeline that hashes, signs, and verifies at every stage is the only one that earns the word "trustworthy."
References
- National Institute of Standards and Technology, "FIPS PUB 180-4: Secure Hash Standard (SHS)." NIST, 2015. doi:10.6028/NIST.FIPS.180-4
- RFC 1952, "GZIP file format specification version 4.3." Internet Engineering Task Force, 1996. doi:10.17487/RFC1952
- RFC 8878, "Zstandard Compression and the application/zstd Media Type." Internet Engineering Task Force, 2021. doi:10.17487/RFC8878
- PKWARE, "ZIP File Format Specification, version 6.3.10." PKWARE Inc., 2022. Available: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
- RFC 4880, "OpenPGP Message Format." Internet Engineering Task Force, 2007. doi:10.17487/RFC4880
- Schneier, B., "Applied Cryptography: Protocols, Algorithms, and Source Code in C, 2nd ed." John Wiley and Sons, 1996.
- Sigstore Project, "Sigstore: Software Signing for Everybody." Available: https://www.sigstore.dev/
- Krawczyk, H., Bellare, M., and Canetti, R., "RFC 2104: HMAC: Keyed-Hashing for Message Authentication." Internet Engineering Task Force, 1997. doi:10.17487/RFC2104
