Compression is the unseen middle act of every file conversion. The input is decoded; the data is transformed into the new format's working representation; compression squeezes the result into bytes; the bytes are written. Get the compression step wrong and a 200 KB JPG becomes a 4 MB PNG, an Opus voice recording becomes a 1 MB FLAC for no reason, a CSV becomes a Parquet that is somehow larger than the source. Get it right and the conversion is invisible. This article is about the third step: what compression is actually doing inside a conversion, why the tradeoffs work the way they do, and how to tune them in production.

What Compression Actually Is

Compression is the substitution of a shorter description for a longer one. It works because real-world data has structure: repeated bytes, predictable correlations between neighboring values, and (for lossy coders) fine detail the human eye or ear does not register. Compression algorithms detect this structure and encode the data in fewer bits than a naive byte-for-byte representation requires.

Information theory gives the floor. Claude Shannon proved in 1948 that the average bits per symbol for any lossless code on a source is bounded below by the source's entropy: H(X) = - sum p(x) log2 p(x). A perfectly redundant source (always emitting the same byte) has zero entropy and compresses to nothing. A perfectly random source has 8 bits per byte of entropy and cannot be compressed at all. Most real data is neither extreme; it has structure that the coder must model.
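As a rough illustration of the entropy formula, the sketch below computes the order-0 (byte-frequency) entropy of a few inputs and compares it with what zlib achieves. The function name and the sample inputs are assumptions made for this example. Order-0 entropy only models individual byte frequencies, so a coder that also models repetition can beat it on structured input.

```python
import math
import os
import zlib
from collections import Counter


def shannon_entropy_bits_per_byte(data: bytes) -> float:
    """Order-0 empirical entropy: H = sum p(x) * log2(1 / p(x)) over byte values."""
    if not data:
        return 0.0
    total = len(data)
    return sum((n / total) * math.log2(total / n) for n in Counter(data).values())


if __name__ == "__main__":
    samples = {
        "constant": b"a" * 4096,             # one symbol, zero entropy
        "cyclic": bytes(range(256)) * 16,    # flat byte histogram, but repetitive
        "random": os.urandom(4096),          # close to 8 bits per byte
    }
    for name, data in samples.items():
        h = shannon_entropy_bits_per_byte(data)
        packed = len(zlib.compress(data, 9))
        print(f"{name:9s} H={h:.3f} bits/byte  zlib: {len(data)} -> {packed} bytes")
```

The "cyclic" input is the interesting case: its byte histogram is flat, so the order-0 estimate is 8 bits per byte, yet the stream repeats every 256 bytes and a dictionary coder exploits that. This is the sense in which the coder has to model the structure.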

"Information is the resolution of uncertainty. The job of a compression algorithm is to detect the uncertainty that is not really there and stop encoding it." A paraphrase of the central idea in Claude Shannon's A Mathematical Theory of Communication, 1948.

Lossless compression approaches the entropy limit. Lossy compression goes further by deciding which information to discard, accepting reconstruction error in exchange for smaller files. The choice of what to discard is the entire art of perceptual coding.

The Three Compression Stages in a Conversion

Almost every modern compressed format goes through three stages.

Stage 1: Decorrelation. Transform the data so that nearby samples, which are usually strongly correlated, become close to uncorrelated. For images, the Discrete Cosine Transform or wavelet transform concentrates spatial energy into a few low-frequency coefficients. For audio, the Modified Discrete Cosine Transform separates frequency bands. For generic data, dictionary methods like LZ77 play the same role by replacing repeated substrings with back-references.

Stage 2: Quantization (only in lossy). Round values to nearby grid points, reducing the alphabet size. JPEG divides DCT coefficients by entries in a quantization table. MP3 allocates fewer bits to perceptually masked frequencies. AV1 uses adaptive quantization per superblock.

Stage 3: Entropy coding. Encode the resulting symbols using fewer bits for common values and more bits for rare ones. Huffman coding, arithmetic coding, range coding, and ANS (Asymmetric Numeral Systems) are the dominant techniques.

Format conversion typically dismantles the source's three stages, applies the target's three stages, and writes the result. Each stage is an opportunity for loss, error, or inefficiency.
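The three stages can be sketched with a toy codec. This is not how any real format works: closed-loop DPCM (predict each sample from the previously reconstructed one) stands in for decorrelation, a uniform step size for quantization, and zlib for the entropy coder. All function names are hypothetical.

```python
import math
import struct
import zlib


def dpcm_encode(samples, step=1):
    """Stages 1 and 2: prediction residual, then uniform quantization (step=1 is lossless)."""
    indices = []
    recon = 0  # what the decoder will have reconstructed so far
    for s in samples:
        residual = s - recon                    # decorrelation: only the change is coded
        q = math.floor(residual / step + 0.5)   # quantization: nearest grid index
        indices.append(q)
        recon += q * step                       # track the decoder state to avoid drift
    return indices


def dpcm_decode(indices, step=1):
    out = []
    recon = 0
    for q in indices:
        recon += q * step
        out.append(recon)
    return out


def entropy_code(values):
    """Stage 3: zlib (LZ77 + Huffman) stands in for the format's entropy coder."""
    return zlib.compress(struct.pack(f"<{len(values)}i", *values), 9)


if __name__ == "__main__":
    signal = [int(2000 * math.sin(i / 50.0)) + 3 * i for i in range(5000)]
    print("raw samples:      ", len(entropy_code(signal)), "bytes")
    print("lossless (step 1):", len(entropy_code(dpcm_encode(signal, 1))), "bytes")
    lossy = dpcm_encode(signal, 8)
    worst = max(abs(a - b) for a, b in zip(signal, dpcm_decode(lossy, 8)))
    print("lossy (step 8):   ", len(entropy_code(lossy)), "bytes, max error", worst)
```

Because the predictor runs on reconstructed values, the per-sample error stays within half a step instead of accumulating, which is the same reason video encoders predict from decoded frames rather than source frames.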

Lossless Versus Lossy: The Decision

The first decision in any conversion pipeline is whether the target needs to be lossless. The rule of thumb:

| Content type | Lossless? | Reason |
| --- | --- | --- |
| Source code, configs, JSON | Yes | Every byte matters |
| Database dumps, Parquet | Yes | Data integrity |
| Photographs for web | No | Perceptual budget allows JPG/AVIF |
| Photographs for archive | Yes (TIFF/PNG/JXL) | Future re-derivation |
| Music for distribution | No (Opus/AAC) | Bandwidth |
| Music for mastering | Yes (FLAC/WAV) | Editing headroom |
| Voice for telephony | No (Opus low bitrate) | Bandwidth, intelligibility only |
| Documents for archive | Yes (PDF/A) | Legal admissibility |
| Video for streaming | No (AV1/H.264) | Bandwidth dominates |
| Video for archival master | Yes (FFV1) | Re-derivation |

The lossy choices look casual until you realize the decision propagates: a lossy source feeding a downstream conversion produces compounded loss, not equivalent loss.

Generation Loss and Why It Matters

Generation loss is the silent enemy of conversion pipelines. A JPG that is opened, edited, and saved again is a measurably worse JPG than the original. The DCT and Huffman stages are reversible, but the quantization step between them is not: each save rounds the coefficients onto a grid again, and whatever the rounding discards is gone for good.
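A toy scalar model makes the compounding visible. It treats each lossy save as rounding to a grid of a given step size, which is an assumption for illustration only; a real JPG save also involves color conversion, chroma subsampling, and clamping.

```python
import math


def requantize(value, step):
    """One lossy 'save': snap the value to the nearest multiple of `step`."""
    return math.floor(value / step + 0.5) * step


def max_error(values, steps):
    """Worst-case drift from the original after applying each save in order."""
    worst = 0
    for original in values:
        v = original
        for step in steps:
            v = requantize(v, step)
        worst = max(worst, abs(v - original))
    return worst


if __name__ == "__main__":
    pixels = range(256)
    print("one save at step 3:    ", max_error(pixels, [3]))     # 1
    print("save at step 4, then 3:", max_error(pixels, [4, 3]))  # 3
    print("same step twice (3, 3):", max_error(pixels, [3, 3]))  # 1
```

In this model, saving twice on the same grid costs nothing extra, while saving on two different grids stacks the two rounding errors. That is why re-encoding at a different quality setting, or after an edit that shifts the pixel values, degrades the image more than a single encode from the master.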

The mitigation is to keep losslessly-compressed masters and only encode lossily at the final publishing step. A canonical pipeline:

# Develop the RAW capture to a 16-bit linear TIFF master (lossless).
# -c sends the TIFF to stdout; without it dcraw writes capture.tiff itself.
dcraw -4 -T -c capture.cr2 > master.tiff

# Edit in 16-bit color, save the working file as PNG or TIFF
magick master.tiff -modulate 100,110,100 working.tiff

# Final export to the lossy publishing format from the lossless working file;
# never re-encode an already published JPG
magick working.tiff -strip -quality 82 \
  -sampling-factor 4:2:0 publish.jpg

The same pattern applies to audio (WAV master, FLAC archive, Opus distribution), video (FFV1 master, H.264 distribution), and documents (DOCX or ODT master, PDF/A distribution).

"The integrity of a content pipeline is the integrity of its weakest re-encode. Every lossy step you can move closer to the final output is a step you should move." Bruce Schneier, generalizing from cryptographic key-rotation principles.

The Rate-Distortion Curve

For lossy formats, every encoder is solving an optimization problem: minimize the number of bits subject to a distortion constraint, or equivalently, minimize distortion subject to a bit budget. The Lagrangian formulation is J = D + lambda * R, and the encoder chooses each coding decision (block size, prediction mode, quantization step) to minimize J.
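A minimal sketch of the Lagrangian choice, assuming a uniform quantizer, mean squared error as D, and the empirical entropy of the quantized indices as a stand-in for R. Real encoders estimate rate from their actual entropy coder and search far more decisions than a single step size; the function names here are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(symbols):
    """Empirical entropy in bits per sample, used here as a stand-in for the rate R."""
    total = len(symbols)
    return sum((n / total) * math.log2(total / n) for n in Counter(symbols).values())


def rd_cost(samples, step, lam):
    """Return (J, D, R) for a uniform quantizer with the given step."""
    indices = [s // step for s in samples]
    recon = [i * step + step // 2 for i in indices]  # reconstruct at the bin midpoint
    distortion = sum((s - r) ** 2 for s, r in zip(samples, recon)) / len(samples)
    rate = entropy_bits(indices)
    return distortion + lam * rate, distortion, rate


def choose_step(samples, lam, candidates=(1, 2, 4, 8, 16)):
    """Pick the candidate step that minimizes J = D + lambda * R."""
    return min(candidates, key=lambda step: rd_cost(samples, step, lam)[0])


if __name__ == "__main__":
    samples = [(i * 37) % 256 for i in range(256)]  # every byte value exactly once
    for lam in (0.0, 2.0, 10.0, 1e6):
        step = choose_step(samples, lam)
        _, d, r = rd_cost(samples, step, lam)
        print(f"lambda={lam:<9g} step={step:<2d} D={d:.2f} R={r:.2f} bits/sample")
```

A lambda of zero ignores rate and keeps the finest step; a very large lambda ignores distortion and takes the coarsest. Everything in between is the tradeoff the quality knob exposes.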

The shape of the rate-distortion curve determines what is achievable. On photographs, a typical curve looks like:

| Bitrate (bits/pixel) | JPG SSIM | AVIF SSIM | JPEG XL SSIM |
| --- | --- | --- | --- |
| 0.10 | 0.86 | 0.93 | 0.94 |
| 0.20 | 0.94 | 0.97 | 0.98 |
| 0.40 | 0.98 | 0.992 | 0.994 |
| 0.80 | 0.992 | 0.998 | 0.998 |
| 1.60 | 0.997 | 0.9995 | 0.9996 |

The curves matter for tuning. AVIF and JXL extract more quality at low bitrates because their coding tools (intra prediction, large transforms, and adaptive arithmetic coding in AVIF or ANS in JXL) handle the rate-distortion space better than JPEG's 1992-vintage tooling.

Tuning Compression in a Pipeline

Three knobs matter in practice.

Quality factor or quantization parameter. The most important. JPG quality 82, AVIF cq-level 22, x264 CRF 23, and libopus at 128 kbps are common starting points that come close to perceptual transparency for typical content. Lower the quality for size, raise it for fidelity; both directions have diminishing returns.

Effort or preset. Encoder search depth. Higher effort takes longer to encode but produces a smaller file at the same quality. As a rough guide, svt-av1 preset 4 can come out around 30 percent smaller than preset 8 at the same CRF, though the gain depends on content and encoder version.

Two-pass for video. Two-pass encoding first analyzes the file to allocate bits across complex and simple sections, then encodes with that allocation. Worth it for streaming distribution; not worth it for live capture.

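# Image: tune for archive quality versus web distribution
# (with the libaom codec, cq-level only takes effect when end-usage=q is set)
avifenc --speed 0 -a end-usage=q -a cq-level=18 archive.png archive.avif
avifenc --speed 6 -a end-usage=q -a cq-level=28 web.png web.avif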

# Video: two-pass AV1 for VOD
# (libaom-av1 is used here because FFmpeg's libsvtav1 wrapper may not honor -pass)
ffmpeg -y -i input.mp4 -c:v libaom-av1 -b:v 2M -cpu-used 4 \
  -pass 1 -an -f null /dev/null
ffmpeg -i input.mp4 -c:v libaom-av1 -b:v 2M -cpu-used 4 \
  -pass 2 -c:a libopus -b:a 128k output.mkv

# Audio: VBR vs CBR
opusenc --bitrate 128 --vbr   music.wav music_vbr.opus
opusenc --bitrate 128 --cvbr  music.wav music_cvbr.opus
opusenc --bitrate 128 --hard-cbr music.wav music_cbr.opus
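The effort knob can be measured without any media tooling. The sketch below sweeps zlib's compression levels over deterministic, log-like text as an analogy for encoder presets. zlib is lossless, so there is no quality axis here, only time against size. The helper names and the synthetic data are assumptions for illustration.

```python
import random
import time
import zlib


def make_log_lines(n, seed=0):
    """Deterministic, log-like text so that runs are comparable."""
    rng = random.Random(seed)
    paths = ["/api/v1/items", "/api/v1/users", "/static/app.js", "/healthz"]
    lines = [
        f"GET {rng.choice(paths)}/{rng.randrange(1000)} "
        f"status={rng.choice([200, 200, 200, 404])} ms={rng.randrange(5, 500)}\n"
        for _ in range(n)
    ]
    return "".join(lines).encode()


def sweep_levels(data, levels=(1, 6, 9)):
    """Return {level: (compressed_size, seconds)} for each zlib effort level."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        packed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        if zlib.decompress(packed) != data:  # effort must never change the decoded bytes
            raise ValueError(f"level {level} failed to round-trip")
        results[level] = (len(packed), elapsed)
    return results


if __name__ == "__main__":
    data = make_log_lines(20000)
    for level, (size, seconds) in sweep_levels(data).items():
        print(f"level {level}: {len(data)} -> {size} bytes in {seconds * 1000:.1f} ms")
```

The same harness shape works for media encoders: fix the quality setting, vary the preset, and record size and wall-clock time for a representative sample of your own content.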

Container Compression Is Different from Stream Compression

A common confusion: putting a JPG inside a ZIP does not compress the JPG. The DEFLATE inside ZIP looks at the byte stream and finds no repetition, because JPG bytes are entropy-coded. The same applies to MP3, AAC, MP4, AVIF, and any format whose final stage was an entropy coder.

The right mental model: compress the data, then containerize. Do not compress the container. Zipping a directory of source files yields tight compression because source code has structural redundancy. Zipping a directory of JPGs yields almost nothing.

# Demonstrating: source code compresses, JPGs do not
# (-s prints one total per directory, -r recurses, -q silences zip; sizes are illustrative)
du -sh ./src       # 12 MB
zip -9 -r -q src.zip ./src
du -h src.zip      # 2.1 MB

du -sh ./photos    # 480 MB
zip -9 -r -q photos.zip ./photos
du -h photos.zip   # 478 MB
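The same effect can be reproduced in a few lines by compressing the output of a previous compression pass, which is roughly what zipping a JPG does. This sketch uses zlib on synthetic text; the helper names are hypothetical.

```python
import random
import zlib


def sample_text(n_lines=5000, seed=1):
    """Redundant but non-trivial text: random sentences over a small vocabulary."""
    rng = random.Random(seed)
    words = ["decode", "transform", "quantize", "encode", "frame", "block", "stream"]
    return "\n".join(
        " ".join(rng.choice(words) for _ in range(8)) for _ in range(n_lines)
    ).encode()


def sizes_after_repeated_compression(data, passes=3):
    """Compress the output of the previous pass and record the size each time."""
    sizes = [len(data)]
    for _ in range(passes):
        data = zlib.compress(data, 9)
        sizes.append(len(data))
    return sizes


if __name__ == "__main__":
    sizes = sizes_after_repeated_compression(sample_text())
    for i, size in enumerate(sizes):
        label = "original" if i == 0 else f"after pass {i}"
        print(f"{label:13s} {size} bytes")
```

The first pass removes the redundancy; later passes see bytes that already look random and can only add framing overhead.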

Compression in Generic Pipelines

For non-media data the toolchain is small and well-understood.

| Tool | Algorithm | When to use |
| --- | --- | --- |
| zstd | LZ77 + Huffman + FSE (an ANS variant) | Default for new pipelines |
| gzip | DEFLATE | Compatibility with old tools |
| xz | LZMA2 | Maximum ratio, slow encode |
| brotli | LZ77 + Huffman + static dictionary | Web text content |
| lz4 | LZ77 fast variant | High-throughput, lower ratio |
| snappy | LZ77 fast variant | Hadoop, RPC frame compression |

For storage and transit, zstd is the right default. The encoder is fast, the decoder is faster, and the ratio roughly matches gzip -9 at the default level and exceeds it at high levels. Brotli is the right choice for HTTP text content because its static dictionary is tuned for web content and all shipping browsers decode it.

# Brotli for static web content
brotli -q 11 index.html -o index.html.br

# Configure nginx to serve brotli when available
# (brotli_static comes from the third-party ngx_brotli module)
# location ~* \.(js|css|html)$ {
#     brotli_static on;
#     gzip_static on;
# }

"An algorithm is not the implementation of an idea. The implementation is where the idea meets the cache hierarchy, the branch predictor, and the input distribution. Most algorithm comparisons in textbooks are wrong about real performance because they ignore these." Donald Knuth, paraphrased from The Art of Computer Programming, Volume 1.

Format-Specific Optimization Patterns

A small catalog of optimizations that produce real gains.

JPG: progressive plus mozjpeg. mozjpeg's trellis quantization and progressive scan tables produce 5 to 15 percent smaller files at the same quality.

PNG: zopfli backend. zopflipng is 80 to 100 times slower than optipng but produces 5 to 10 percent smaller files. Worth it for static assets that ship to many users.

AVIF: speed 0 for archives, speed 6 for live. The encoder's search depth dominates encode time. Use the lowest speed that meets your latency budget.

FLAC: --best plus --verify. Maximum effort encode plus a verify pass that decodes and compares. Catches encoder bugs that would otherwise corrupt the archive.

Video: tune the keyframe interval. Long GOPs save bytes; short GOPs improve seek latency. Streaming usually picks 2 to 4 seconds. Archive masters can use much longer.

# Archive AV1 with long GOP
ffmpeg -i master.mkv -c:v libsvtav1 -crf 18 -preset 2 \
  -g 240 -c:a flac archive.mkv

# Streaming AV1 with short GOP
ffmpeg -i master.mkv -c:v libsvtav1 -crf 30 -preset 6 \
  -g 60 -keyint_min 60 -c:a libopus -b:a 96k stream.mkv

Measuring What You Did

Optimization without measurement is wishing. The metrics that matter:

For images, SSIM and the newer Butteraugli or SSIMULACRA2 against the source. PSNR is misleading for perceptual content; do not trust it as a primary metric.

For audio, ABX testing against the source at the encode bitrate. PEAQ scores are useful for automated regression but cannot replace human listening.

For video, VMAF (Netflix's perceptual quality model) at the per-frame and aggregate level.

For generic data, ratio (uncompressed_bytes / compressed_bytes) and decompression throughput.

# SSIM via ffmpeg
ffmpeg -i original.png -i compressed.avif \
  -lavfi ssim -f null -

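# VMAF for video (first input is the distorted file, second is the reference;
# log_fmt=json is needed because the log defaults to XML)
ffmpeg -i compressed.mkv -i original.mkv \
  -lavfi libvmaf=log_fmt=json:log_path=vmaf.json -f null -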
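For generic data, ratio and decompression throughput can be measured with a small harness. The sketch below uses the codecs in Python's standard library (zlib, bz2, lzma) as stand-ins, since zstd, brotli, and lz4 need third-party bindings. The function name and report layout are assumptions.

```python
import bz2
import lzma
import time
import zlib

CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def measure(data, repeats=5):
    """Return {codec: {"ratio": ..., "decode_mb_per_s": ...}} for each codec."""
    report = {}
    for name, (compress, decompress) in CODECS.items():
        packed = compress(data)
        start = time.perf_counter()
        for _ in range(repeats):
            restored = decompress(packed)
        elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
        if restored != data:
            raise ValueError(f"{name} failed to round-trip")
        report[name] = {
            "ratio": len(data) / len(packed),  # uncompressed_bytes / compressed_bytes
            "decode_mb_per_s": len(data) * repeats / elapsed / 1e6,
        }
    return report


if __name__ == "__main__":
    payload = b"2024-01-01T00:00:00Z INFO request served in 12 ms\n" * 20000
    for name, row in measure(payload).items():
        print(f"{name:5s} ratio={row['ratio']:.1f} decode={row['decode_mb_per_s']:.0f} MB/s")
```

Throughput numbers from a harness like this are only meaningful on the hardware and data that production will see, so treat the synthetic payload as a placeholder.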


Practical Recommendations

If you write conversion code, separate the three stages explicitly. Decode is one function. Transform is a second function. Compress is a third. Test each independently. Most production bugs in conversion pipelines come from blurring the boundary, where the encoder is implicitly trusted to fix decoder errors or the decoder is implicitly trusted to fix encoder mismatches.
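One way to keep the boundaries explicit is three small functions that can each be tested alone. The sketch below converts CSV to zlib-compressed JSON Lines; the formats and function names are chosen only for illustration.

```python
import csv
import io
import json
import zlib


def decode(raw: bytes) -> list:
    """Stage 1: parse the source format (CSV here) into plain records."""
    return list(csv.DictReader(io.StringIO(raw.decode("utf-8"))))


def transform(records: list) -> bytes:
    """Stage 2: build the target's working representation (JSON Lines here)."""
    return "".join(json.dumps(r, sort_keys=True) + "\n" for r in records).encode("utf-8")


def compress(payload: bytes, level: int = 6) -> bytes:
    """Stage 3: squeeze the working representation into bytes."""
    return zlib.compress(payload, level)


def convert(raw: bytes) -> bytes:
    """The pipeline is just the composition; no stage reaches into another."""
    return compress(transform(decode(raw)))


if __name__ == "__main__":
    source = b"id,name\n1,alpha\n2,beta\n"
    packed = convert(source)
    print(zlib.decompress(packed).decode("utf-8"), end="")
```

Because each stage takes and returns plain values, a decoder bug shows up in a decoder test instead of surfacing later as a mysterious size or quality regression.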

If you operate conversion pipelines, measure everything: input size, output size, encode time, perceptual quality, decode error rate. The cost of measurement is small; the cost of running the wrong compression for a year because nobody noticed is large.

Compression is not a black box. Every byte saved is the result of a specific decision, and every decision can be tuned, measured, and improved.

Compression and Cache Behavior

A compression decision is also a CPU and memory decision. Decoders work in caches; cache-friendly decoders run faster regardless of bitstream size. The interaction matters for high-throughput services.

| Format | L1 cache pressure | Memory bandwidth | Notes |
| --- | --- | --- | --- |
| zstd -3 decode | Low | Moderate | Fits L2 on most CPUs |
| gzip decode | Low | Moderate | Older but cache-friendly |
| xz decode | High | High | LZMA dictionary forces L3 thrashing |
| brotli decode | Moderate | Moderate | Static dictionary helps cache |
| lz4 decode | Low | Low | Designed for memory speed |

For services that decompress many files per second (web servers, log shippers), the decode performance dominates over the ratio. lz4 and zstd at low levels are the right tools. For one-shot archival where decompression happens rarely, xz or zstd at high levels win on ratio.

Streaming Versus Block Compression

A subtle distinction that affects format choice: some compressors require the entire input before producing output (block mode), while others can produce output as input arrives (streaming). Streaming is mandatory for tape-style archives, log shippers, and network protocols.

# Streaming compression in a pipeline (no temp file)
mysqldump db | zstd -3 | aws s3 cp - s3://backup/db.sql.zst

# Block compression of a single file, tuned for density (level 19, 128 MiB window)
zstd -19 --long=27 huge.bin -o huge.bin.zst

# Concatenated zstd frames (decompresses correctly)
cat part1.zst part2.zst > combined.zst
zstd -d combined.zst -o combined.bin

Concatenated zstd frames are a useful property for log rotation: append new compressed batches to a single file and the result remains a valid zstd file.
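The same two properties, streaming output and valid concatenation, can be sketched with the gzip framing in Python's standard library, used here as a stand-in for zstd. The function name is hypothetical.

```python
import gzip
import zlib


def stream_compress(chunks, level=6):
    """Yield compressed bytes as input chunks arrive; the input is never held whole."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, 31)  # wbits=31 selects gzip framing
    for chunk in chunks:
        out = compressor.compress(chunk)
        if out:  # the compressor may buffer and emit nothing for small chunks
            yield out
    yield compressor.flush()  # finish the member and write the trailer


if __name__ == "__main__":
    batch1 = [b"2024-01-01 line one\n", b"2024-01-01 line two\n"]
    batch2 = [b"2024-01-02 line three\n"]

    # Each batch becomes its own gzip member, as a log rotator might produce.
    member1 = b"".join(stream_compress(batch1))
    member2 = b"".join(stream_compress(batch2))

    # Appending members yields a file that still decompresses as one stream.
    combined = member1 + member2
    print(gzip.decompress(combined).decode(), end="")
```

In a real shipper the yielded pieces would go straight to a socket or file rather than being joined in memory; the join here only keeps the demonstration short.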

Chunking and Random Access

Most lossy media formats support seeking to arbitrary points without decoding from the beginning. JPEG has restart markers (RST). MP3 has frame headers every ~26 ms. Video has keyframes and access units. Generic compressors usually do not support random access by default.

Zstandard's seekable format (an extension maintained in the zstd project's contrib directory) and block-indexed bzip2 (bzip2 compresses independent blocks, which is the property bzip2recover relies on) provide random access at the cost of some compression ratio. For very large archives where partial reads dominate, the seekable variants win.
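The idea behind every seekable variant is the same: compress independent chunks and keep an index of where each one starts. The sketch below is a toy layout built on zlib, not the zstd seekable format or BGZF; the chunk size and function names are assumptions.

```python
import zlib

CHUNK_SIZE = 64 * 1024  # hypothetical block size; smaller blocks seek faster but compress worse


def build_seekable(data, chunk_size=CHUNK_SIZE):
    """Compress fixed-size chunks independently and record where each one starts."""
    blob = bytearray()
    index = []  # one (compressed_offset, compressed_length) pair per chunk
    for start in range(0, len(data), chunk_size):
        packed = zlib.compress(data[start:start + chunk_size])
        index.append((len(blob), len(packed)))
        blob += packed
    return bytes(blob), index


def read_range(blob, index, offset, length, chunk_size=CHUNK_SIZE):
    """Return data[offset:offset + length], decompressing only the overlapping chunks."""
    if length <= 0:
        return b""
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    parts = [zlib.decompress(blob[pos:pos + size]) for pos, size in index[first:last + 1]]
    skip = offset - first * chunk_size  # bytes to drop from the first decoded chunk
    return b"".join(parts)[skip:skip + length]


if __name__ == "__main__":
    data = bytes((i * 7) % 251 for i in range(1_000_000))
    blob, index = build_seekable(data)
    piece = read_range(blob, index, 500_000, 100)
    print(piece == data[500_000:500_100], f"{len(index)} chunks, {len(blob)} bytes")
```

Each chunk starts with an empty dictionary, which is where the ratio penalty comes from: matches cannot reach back across a chunk boundary.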

| Use case | Choice | Reasoning |
| --- | --- | --- |
| One-shot archive | zstd default | Simple, fast, dense |
| Genomic data with random reads | bgzip (block gzip) | Tabix-indexable |
| Log archive with grep needs | zstd seekable | Fast partial reads |
| Database backup | xz or zstd long mode | Density beats access |
| Network protocol payload | brotli or zstd low level | Decode latency dominates |

References

  1. Shannon, Claude E. "A Mathematical Theory of Communication." Bell System Technical Journal, vol. 27, July and October 1948, pp. 379 to 423 and 623 to 656.
  2. Salomon, David, and Giovanni Motta. Handbook of Data Compression. 5th ed., Springer, 2010. ISBN 978-1-84882-902-2.
  3. Sayood, Khalid. Introduction to Data Compression. 5th ed., Morgan Kaufmann, 2017. ISBN 978-0128094747.
  4. Wallace, Gregory K. "The JPEG Still Picture Compression Standard." Communications of the ACM, vol. 34, no. 4, April 1991. DOI: 10.1145/103085.103089.
  5. Collet, Yann, and Murray Kucherawy. RFC 8478, Zstandard Compression. October 2018.
  6. Alakuijala, Jyrki et al. "Brotli: A general-purpose data compressor." ACM Transactions on Information Systems, vol. 37, no. 1, 2019. DOI: 10.1145/3231935.
  7. Li, Zhi et al. "VMAF: The Journey Continues." Netflix Tech Blog, 2018.
  8. Duda, Jarek. "Asymmetric Numeral Systems: Entropy Coding Combining Speed of Huffman Coding with Compression Rate of Arithmetic Coding." arXiv:1311.2540, 2013.