Every file format is a frozen argument about what computers should preserve and what they should discard. JPEG argued that human eyes tolerate chroma subsampling. ZIP argued that random access matters more than ratio. PDF argued that layout was worth more than reflow. Each of these arguments was correct for its decade and is now under pressure from a different set of constraints: machine learning pipelines that read tensors instead of pixels, archival mandates that demand fixity hashes, bandwidth economics that punish every wasted byte on mobile networks, and security postures that assume every container will eventually be parsed by a hostile decoder.

The next ten years of file formats will be shaped less by raw compression breakthroughs and more by how formats interact with these systemic forces. This article maps the formats that will matter, the standards bodies driving them, and the practical implications for engineers, archivists, and product teams choosing what to write to disk today.

What Drives Format Evolution

A file format succeeds when it solves a real cost problem at the moment of its release. JPEG solved storage cost in 1992, when a 100 MB hard drive felt expensive. MP3 solved bandwidth cost in 1995, when a song over a modem took fifteen minutes. AVIF solves bandwidth cost again in the 2020s, when video-dominated mobile traffic squeezes carrier margins. Behind each transition sits a quantifiable economic delta the new format makes obvious.

Three forces are now compounding to drive the current wave of format innovation.

Compute is cheap, bandwidth is metered. Decoders that would have been unthinkable in 1995, such as the context-modeled ANS entropy coding inside JPEG XL or the neural upsamplers shipping in proprietary codecs, are routine on a 2026 phone SoC. This shifts the optimal trade-off toward more complex codecs that produce smaller files.

Archival mandates are now binding. ISO 19005 (PDF/A), ISO 14721 (OAIS), and the FADGI still image guidelines are no longer aspirational. Government records, scientific datasets, and regulated industries treat preservation as a contractual deliverable, which forces format authors to expose hashes, manifests, and reproducible builds inside the container.

Machine learning ate the workflow. A growing share of files are read more often by models than by people. Tensor-first containers, embedding stores, and graph serialization formats now compete with traditional document and media formats for engineering attention.

"A format is a contract. The bytes on disk promise to mean a specific thing to anyone who reads them, this year, next decade, or after the company that wrote them is gone." Tim Bray, co-editor of the XML specification and editor of the JSON RFCs

The Container Landscape in 2026

The table below summarizes the dominant formats per domain and the challengers gaining share. Compression ratios assume comparable visual or semantic quality and are drawn from the public benchmark sets cited in the references section.

| Domain | Incumbent | Challenger | Primary gain | Standardization |
| --- | --- | --- | --- | --- |
| Still image (web) | JPEG | AVIF | 30 to 50 percent smaller at equal SSIM | AOMedia AV1 image profile |
| Still image (archival) | TIFF | JPEG XL | 20 to 60 percent smaller, lossless transcode from JPEG | ISO/IEC 18181 |
| Video (streaming) | H.264 | AV1 | 30 percent smaller at equal VMAF | AOMedia, ITU H.273 metadata |
| General compression | gzip (zlib) | Zstandard | 10 to 20 percent better ratio at higher speed | RFC 8878 |
| Archive container | ZIP | 7z, age | Stronger encryption and ratios | Open specs, community-maintained |
| Document | PDF 1.7 | PDF 2.0 | Native digital signatures, geospatial | ISO 32000-2 |
| ML weights | pickle | safetensors, GGUF | Memory-mapped, hash-verified | de facto, community-maintained |
| Tabular data | CSV, Parquet | Apache Arrow IPC, Lance | Columnar, vectorized scan | Apache Arrow project |

Two patterns recur across this table. First, every challenger ships a stronger integrity story than its incumbent. Second, every challenger explicitly targets a workload (mobile bandwidth, GPU memory mapping, columnar scan) rather than a generic "smaller files" pitch.

JPEG XL and the Slow Image Transition

JPEG XL (ISO/IEC 18181) is the most technically complete still image format released since JPEG 2000, and it occupies an unusual political position. Google removed Chromium's experimental, flag-gated support in early 2023, citing insufficient ecosystem demand, then partially reversed that stance after pushback from publishers and image processing vendors. Apple shipped JPEG XL support in Safari 17 and across the Apple Photos pipeline. The result is a format with strong technical merit and uneven browser coverage.

The case for JPEG XL rests on three properties competitors lack:

  1. Lossless JPEG transcoding. A 20 percent average file size reduction with bit-exact reversibility back to the original JPEG. No other format offers this.
  2. A single codec for lossy and lossless. AVIF's lossless mode is a secondary feature of a video codec and often produces files no smaller than PNG or lossless WebP, while JPEG XL is competitive in both modes.
  3. Progressive decoding with responsive scaling. A single JXL file can serve thumbnail, preview, and full-resolution requests from prefixes of the same byte stream.

"Engineers default to whatever ships in Chrome. That is not a technical evaluation. JPEG XL deserves to be evaluated on the workload, not on the browser politics." Jon Sneyers, JPEG XL co-editor, in a 2024 W3C TPAC discussion

For the next two to three years, JPEG XL will dominate professional and archival pipelines while AVIF dominates the public web. The boundary between the two will shift as Chrome support evolves and as content delivery networks add automatic transcoding at the edge. For a deeper comparison of when to pick each format, see the "understanding JPG vs PNG key differences" discussion on this site.

AVIF, the Web's Default Future

AVIF (AV1 Image File Format) is the format the open web actually adopted. Browser support is now near-universal for static images, and AVIF is the default output for Cloudflare Polish, Akamai Image Manager, and Vercel's image optimizer. The format inherits AV1's intra-frame coding, which produces dense files at the cost of slower encoding and a higher decoder complexity than WebP.

A typical encoder invocation, using libavif's command-line wrapper:

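# High-quality AVIF for hero images
# (--min/--max are the legacy quantizer flags; newer libavif releases
#  prefer the --qcolor quality scale used in the second example)
avifenc --min 0 --max 30 --speed 4 \
        --jobs all --yuv 444 \
        input.png output.avif

# Aggressive web-tier AVIF: lower quality, 4:2:0 chroma, faster encode
avifenc --qcolor 50 --qalpha 60 --speed 6 \
        --yuv 420 input.png output.avif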

The quality-to-size curve on photographic content beats WebP by a meaningful margin. The same curve on flat illustrations and screenshots is closer to a tie, and PNG or lossless WebP often beats AVIF for line art with hard edges.

A correct production HTML pattern serves AVIF first, falls back to WebP for older browsers, and ends with a JPEG that any decoder on earth can read:

<picture>
  <source type="image/avif" srcset="hero.avif" />
  <source type="image/webp" srcset="hero.webp" />
  <!-- A hero image is usually above the fold, so it is fetched eagerly, not lazy-loaded -->
  <img src="hero.jpg" alt="Hero illustration"
       width="1200" height="630" fetchpriority="high" />
</picture>

This three-source pattern will outlive any single format. Even when JPEG XL reaches universal support, the picture element remains the correct delivery primitive.
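
When the choice is made on the server or at a CDN edge rather than in markup, the same AVIF, then WebP, then JPEG order can be applied to the request's Accept header. The following is a minimal sketch of that idea, not a description of how any particular CDN implements it. The names `parse_accept`, `choose_variant`, and the `hero.*` filenames are illustrative assumptions, and the parser handles only the common `type;q=value` shape.

```python
# Sketch: pick an image variant from an HTTP Accept header.
# Assumption: wildcards such as image/* or */* are NOT treated as proof of
# AVIF or WebP support, because old clients send them too.

PREFERENCE = [
    ("image/avif", "hero.avif"),
    ("image/webp", "hero.webp"),
]
FALLBACK = ("image/jpeg", "hero.jpg")


def parse_accept(header):
    """Return {media_type: q} for a (simplified) Accept header."""
    prefs = {}
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # q defaults to 1 when omitted
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # malformed q is treated as "not acceptable"
        prefs[media_type] = q
    return prefs


def choose_variant(accept_header):
    """Return (media_type, filename) for the best explicitly accepted format."""
    prefs = parse_accept(accept_header or "")
    for media_type, filename in PREFERENCE:
        if prefs.get(media_type, 0.0) > 0.0:
            return media_type, filename
    return FALLBACK


if __name__ == "__main__":
    print(choose_variant("image/avif,image/webp,image/*,*/*;q=0.8"))
    print(choose_variant("image/webp,*/*"))
    print(choose_variant("*/*"))
```

A response chosen this way should also carry a `Vary: Accept` header so that shared caches do not hand an AVIF body to a client that only asked for JPEG.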

Zstandard and the Quiet Compression Revolution

Outside of media, the most important format shift of the past decade is the displacement of zlib by Zstandard. Zstd, originally released by Facebook in 2016, was standardized as RFC 8878 and now ships in the Linux kernel (zswap, btrfs, squashfs), Arch Linux packages, RPM, OpenZFS, and most major CDN compression layers.

The performance profile is the reason. Zstd matches gzip's compression ratio at five to ten times the speed, and at higher levels reaches ratios closer to xz with order-of-magnitude faster decompression.

| Tool | Ratio (Silesia corpus) | Compress speed | Decompress speed |
| --- | --- | --- | --- |
| gzip -6 | 2.74 | 60 MB/s | 280 MB/s |
| zstd -3 | 2.88 | 470 MB/s | 1380 MB/s |
| zstd -19 | 3.66 | 7 MB/s | 1310 MB/s |
| xz -6 | 3.72 | 9 MB/s | 80 MB/s |

These numbers, drawn from the lzbench harness on a single Xeon core, explain why Zstd is now the default. Operators who used to choose between "fast and bigger" and "small and slow" can now get fast and small from the same tool.

"Compression is a tax on every byte that moves. Zstandard cut that tax by an order of magnitude in real workloads, which is why it spread without anyone running a campaign for it." Yann Collet, author of LZ4 and Zstandard, in a Linux Foundation talk
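
The benchmark figures above turn into concrete costs with a few lines of arithmetic. The sketch below is illustrative only: it copies the table's numbers as read from this article, assumes the speeds are measured per megabyte of uncompressed data on one core, and uses hypothetical helper names (`BENCH`, `cost`, `speedup`).

```python
# Figures as read from the benchmark table in this article
# (lzbench, Silesia corpus, single core).
BENCH = {
    "gzip -6":  {"ratio": 2.74, "comp_mbps": 60,  "decomp_mbps": 280},
    "zstd -3":  {"ratio": 2.88, "comp_mbps": 470, "decomp_mbps": 1380},
    "zstd -19": {"ratio": 3.66, "comp_mbps": 7,   "decomp_mbps": 1310},
    "xz -6":    {"ratio": 3.72, "comp_mbps": 9,   "decomp_mbps": 80},
}


def cost(tool, input_mb=1000.0):
    """Estimate output size and wall-clock time for input_mb of raw data."""
    row = BENCH[tool]
    return {
        "output_mb": input_mb / row["ratio"],
        "compress_s": input_mb / row["comp_mbps"],
        "decompress_s": input_mb / row["decomp_mbps"],
    }


def speedup(fast, slow, key):
    """How many times faster `fast` is than `slow` for the given speed column."""
    return BENCH[fast][key] / BENCH[slow][key]


if __name__ == "__main__":
    for tool in BENCH:
        c = cost(tool)
        print(f"{tool:9s} {c['output_mb']:6.0f} MB out, "
              f"{c['compress_s']:7.1f} s to compress, "
              f"{c['decompress_s']:5.1f} s to decompress")
    print(f"zstd -3 vs gzip -6 (compress):  {speedup('zstd -3', 'gzip -6', 'comp_mbps'):.1f}x")
    print(f"zstd -19 vs xz -6 (decompress): {speedup('zstd -19', 'xz -6', 'decomp_mbps'):.1f}x")
```

On these figures, zstd -3 compresses roughly eight times faster than gzip -6 at a slightly better ratio, and zstd -19 decompresses more than sixteen times faster than xz -6 at a nearly equal ratio, which matches the ranges quoted in the prose.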

AI-Native Formats: safetensors, GGUF, ONNX

Machine learning workloads have created an entirely new category of file format. Pickle, the Python serialization format behind PyTorch's default torch.save checkpoints, is now widely understood as a security liability: unpickling can invoke arbitrary Python callables by design, so loading an untrusted model file amounts to running untrusted code. The replacement formats prioritize three properties pickle lacks.

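Memory mapping. safetensors stores tensors as an 8-byte header length, a JSON header describing each tensor's dtype, shape, and byte offsets, and then the raw tensor bytes. Because the payload needs no parsing, a loader can mmap the file and hand out zero-copy views in CPU memory, or copy slices directly to the GPU. Loading a 70-billion-parameter model becomes an mmap call rather than a deserialization pass.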

Integrity verification. GGUF, the format used by llama.cpp and most local inference engines, embeds metadata fields that include quantization scheme, original model hash, and license declarations.

Graph portability. ONNX defines a stable operator set across PyTorch, TensorFlow, JAX, and inference runtimes such as ONNX Runtime and TensorRT. It is now the most widely supported portable model format.
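
The pickle risk described at the start of this section is easy to reproduce with the standard library alone. This is a deliberately harmless sketch: the `Payload` class is hypothetical, and `eval("6 * 7")` stands in for whatever callable a hostile file would name instead.

```python
import pickle


class Payload:
    """Any object may tell pickle which callable to invoke at load time."""

    def __reduce__(self):
        # Harmless stand-in: evaluate an arithmetic expression when loaded.
        # A hostile file could reference os.system or any importable callable.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# The reader never asked to run code, but loading calls eval("6 * 7").
# No Payload instance is rebuilt; the callable's return value comes back.
result = pickle.loads(blob)
print(result)
```

Nothing in the byte stream distinguishes this from a legitimate checkpoint until it is loaded, which is why formats that store only typed arrays and metadata are the safer default for files that cross trust boundaries.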

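# Convert a Hugging Face PyTorch checkpoint (a pickled state dict) to safetensors
import torch
from safetensors import safe_open
from safetensors.torch import save_file

# weights_only=True limits unpickling to tensors and plain containers
state_dict = torch.load("model.bin", map_location="cpu", weights_only=True)
# save_file expects a flat dict of contiguous tensors; tied (shared-storage)
# weights must be deduplicated first
state_dict = {name: t.contiguous() for name, t in state_dict.items()}
save_file(state_dict, "model.safetensors")

# Loading is a memory map, not a deserialization; tensors are read on demand
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for key in f.keys():
        tensor = f.get_tensor(key)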
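
The header-plus-raw-bytes layout is simple enough to read without any library, which is part of the format's appeal. The sketch below uses only the standard library and follows the published safetensors layout as understood here: an 8-byte little-endian header length, a JSON header, then the tensor bytes, with `data_offsets` relative to the start of the byte buffer. The helper names are hypothetical, and details such as header padding and the optional `__metadata__` entry are omitted.

```python
import json
import struct


def build_safetensors_like(tensors):
    """Serialize {name: (dtype, shape, raw_bytes)} in a safetensors-style layout."""
    header = {}
    payload = bytearray()
    for name, (dtype, shape, raw) in tensors.items():
        begin = len(payload)
        payload += raw
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [begin, len(payload)],  # relative to the byte buffer
        }
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    # 8-byte little-endian unsigned length, then the JSON header, then raw data.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + bytes(payload)


def read_header(blob):
    """Return (header_dict, offset_where_tensor_data_starts)."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(bytes(blob[8:8 + header_len]).decode("utf-8"))
    return header, 8 + header_len


def tensor_view(blob, name):
    """Return a zero-copy memoryview over one tensor's bytes."""
    header, data_start = read_header(blob)
    begin, end = header[name]["data_offsets"]
    return memoryview(blob)[data_start + begin:data_start + end]


if __name__ == "__main__":
    weights = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
    blob = build_safetensors_like({"w": ("F32", [2, 2], weights)})
    header, _ = read_header(blob)
    print(header["w"])
    print(struct.unpack("<4f", tensor_view(blob, "w")))
```

Because `tensor_view` returns a slice of the underlying buffer rather than a copy, the same function works unchanged when `blob` is an `mmap` object over a file on disk, which is the access pattern the format is designed around.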

The pattern repeats elsewhere. Lance, a columnar format from the LanceDB project, is doing for vector embeddings what Parquet did for analytics tables: a self-describing, version-aware, hash-verified container designed around the actual access pattern of the workload.

PDF 2.0 and the Document Future

PDF as a format has not been static. ISO 32000-2 (PDF 2.0), finalized in 2017 and revised in 2020, ships modern digital signatures (PAdES), associated files for embedded source data, geospatial coordinates, and Unicode-correct text extraction defaults. The relevance of PDF to the future is not innovation but inertia: regulated industries and courts have settled on PDF as the legal record format, which means the standard will be maintained for decades regardless of what the consumer web does.

The complement to PDF 2.0 is PDF/A-4 (ISO 19005-4), the archival profile that constrains PDF 2.0 to long-term-readable subsets. PDF/A-4 explicitly drops constraints from earlier PDF/A profiles that prevented embedding of source files, which means a single PDF/A-4 file can now serve as both the human-readable record and the machine-readable original. For teams writing legal or compliance documents, the right defaults are PDF 2.0 for production and PDF/A-4 for the archived copy.

Encryption, Signing, and Provenance Inside the Container

Future formats assume the file will travel through hostile networks, untrusted intermediaries, and AI scrapers. Three integrity-and-provenance technologies are converging into the container layer itself.

C2PA (Coalition for Content Provenance and Authenticity). A signed manifest standard for media files that records the capture device, edit history, and signing identity. Adobe, Microsoft, Sony, Nikon, and Leica ship C2PA-compliant cameras and tools. The manifest lives inside the file as a JUMBF box, so provenance travels with the file when it is copied or forwarded, although re-encoding or metadata stripping can still remove it.

age and rage. A modern alternative to PGP for at-rest encryption, with a clean spec, no metadata leakage in the header, and small implementations in Go and Rust. Many archive workflows are migrating from gpg-encrypted tarballs to age-encrypted Zstd archives.

Sigstore and in-toto. Software supply-chain projects that bind file hashes to identities through transparency logs. The same pattern is appearing in dataset distribution and model release workflows.

"Provenance is not a feature you add to a format. It is a property of the bytes themselves. Either the file proves where it came from, or it does not." Eric Rescorla, former IETF Security Area Director, in a 2024 IAB workshop on content authenticity

What This Means for Practitioners

The practical guidance for teams choosing formats today divides cleanly by domain.

For web delivery. Serve AVIF first, WebP second, JPEG last, behind a CDN that handles negotiation. Do not try to pick a single format manually. The picture element and Accept headers exist for this reason.

For archives. Use TIFF or JPEG XL for images, PDF/A-4 for documents, FLAC for audio, and Matroska or MXF for video. Pair every archive with a manifest that lists SHA-256 hashes and the originating tool versions.

For data pipelines. Default to Parquet for analytics, Arrow IPC for in-memory exchange, and Zstandard for any general compression. Move off CSV and pickle for anything that will be read more than once.

For ML weights. Use safetensors for training pipelines, GGUF for local inference distribution, and ONNX for portable deployment. Treat pickle as a legacy format requiring justification.

For documents. PDF 2.0 for the production copy, PDF/A-4 for the archive copy. Word documents and HTML are not archival formats and should not be treated as such.
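
The archive guidance above calls for a manifest of SHA-256 hashes and originating tool versions. One way such a manifest might be produced and later checked is sketched below with the standard library. The JSON field names and the `build_manifest` and `verify_manifest` helpers are assumptions for illustration, not a standard manifest schema.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, tool_versions):
    """List every file under root with its size and SHA-256 digest."""
    root = Path(root)
    files = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        files.append({
            "path": path.relative_to(root).as_posix(),
            "bytes": path.stat().st_size,
            "sha256": sha256_file(path),
        })
    return {"algorithm": "sha256", "tools": dict(tool_versions), "files": files}


def verify_manifest(root, manifest):
    """Return the relative paths that are missing or whose digest changed."""
    root = Path(root)
    failures = []
    for entry in manifest["files"]:
        target = root / entry["path"]
        if not target.is_file() or sha256_file(target) != entry["sha256"]:
            failures.append(entry["path"])
    return failures


def manifest_to_json(manifest):
    """Serialize deterministically so the manifest itself can be hashed or signed."""
    return json.dumps(manifest, indent=2, sort_keys=True)
```

A typical flow would call `build_manifest` at ingest with the versions of the encoders used, store the output of `manifest_to_json` next to the archive but outside the hashed directory, and run `verify_manifest` on a schedule as a fixity check.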

For a category-by-category breakdown of practical conversion choices, the why convert to open source formats article on this site walks through migration mechanics in detail.

The Decade Ahead

Three predictions are reasonably safe. JPEG XL will reach Chrome eventually, ending the dual-format era for the open web. C2PA-style provenance will become the default for camera output, then for AI-generated content, then mandated by platform terms of service. And Zstandard will quietly continue eating zlib's remaining footprint until gzip is a legacy compatibility layer rather than a default.

Three predictions are less safe but worth tracking. Neural codecs that ship learned decoders alongside compressed bitstreams will move from research to production for niche workloads. Columnar tensor formats will displace traditional model serialization in distributed training. And archival institutions will publish format risk indexes that drive procurement, similar to how the Library of Congress sustainability scoring drives current preservation choices.

The right mental model for file formats is geological. New formats deposit on top of old ones, and very few formats truly disappear. JPEG, ZIP, and PDF will still be readable in 2050, even as their share of new files declines toward zero. The work of choosing formats today is choosing where on that geology your data should sit.

References

  1. ISO/IEC 18181-1:2022, Information technology, JPEG XL Image Coding System, Part 1: Core coding system. https://www.iso.org/standard/77977.html
  2. RFC 8878, Zstandard Compression and the application/zstd Media Type. IETF, 2021. https://www.rfc-editor.org/rfc/rfc8878
  3. ISO 32000-2:2020, Document management, Portable document format, Part 2: PDF 2.0. https://www.iso.org/standard/75839.html
  4. Library of Congress, Sustainability of Digital Formats: Planning for Library of Congress Collections. https://www.loc.gov/preservation/digital/formats/
  5. Coalition for Content Provenance and Authenticity, Technical Specification 2.1. https://c2pa.org/specifications/specifications/2.1/
  6. Alakuijala, J. et al. JPEG XL next-generation image compression architecture and coding tools. Proceedings of SPIE Applications of Digital Image Processing XLII, 2019. https://doi.org/10.1117/12.2529237
  7. AOMedia, AV1 Image File Format (AVIF) Specification, version 1.1.0. https://aomediacodec.github.io/av1-avif/
  8. Apache Arrow Project, Columnar Format Specification 1.4. https://arrow.apache.org/docs/format/Columnar.html