Metadata is the part of a file that nobody thinks about until it goes missing. A photographer discovers that the wedding album has lost every "shot at f/2.8 on the 85mm" detail because their batch script stripped EXIF. A journalist finds out the GPS coordinates of an anonymous source's home are embedded in the photo they just published. A podcast network gets complaints because every episode shows up as "Track 01 Unknown Artist" in car stereos. Three different industries, three different metadata standards, one shared root cause: the file converter in the middle of the workflow did not preserve what the producer assumed was preserved.
This article is about what metadata actually is, which standards apply where, what survives common conversions, and how to control preservation with tools like exiftool, ffmpeg, and qpdf. The focus is practical: which flag, which command, which verification step. Format theory is in the references at the end.
What metadata is, structurally
A file is two streams of bytes: payload and metadata. The payload is the part the user perceives: pixels, audio samples, document text. Metadata describes the payload: who made it, when, with what equipment, in what color space, with what intended audience. Metadata is small (usually under 1 percent of file size) but carries information that took human labor to produce.
Metadata in a media file lives in one of three places: in the container header (e.g., MP4 box atoms), in a dedicated metadata segment (e.g., JPEG APP1 markers for EXIF), or as an XMP packet that can appear in many container types because it is just XML. A file can have multiple metadata blocks pointing at overlapping fields with conflicting values.
"The fundamental cause of trouble in the modern world is that the stupid are cocksure while the intelligent are full of doubt." Bertrand Russell, The Triumph of Stupidity
Most batch scripts strip metadata because the author was not sure whether to keep it. The right default is preserve, with explicit stripping where you have a documented reason.
The major metadata standards
| Standard | Used in | What it stores | Maintained by |
|---|---|---|---|
| EXIF | JPEG, TIFF, HEIC, RAW | Camera settings, GPS, timestamps, orientation | CIPA (Camera & Imaging Products Association) |
| IPTC | JPEG, TIFF, embedded as legacy | Captions, keywords, copyright, contact info | International Press Telecommunications Council |
| XMP | Almost any format | Anything (extensible RDF/XML) | Adobe (now ISO 16684) |
| ID3 | MP3 | Title, artist, album, art, lyrics | id3.org community |
| iXML | WAV (BWF) | Production audio metadata | Gallery (broadcast audio standard) |
| MXF descriptors | MXF | Broadcast video production data | SMPTE |
| PDF Info dict | PDF | Title, author, creation date | Adobe (ISO 32000) |
| PDF XMP | PDF | Anything Adobe extends | ISO 32000 |
| Vorbis comments | OGG, FLAC | Tags as key-value pairs | Xiph.org |
What survives common conversions
The honest answer is "less than you think, and you should verify every conversion path." The following table summarizes typical behavior with default settings of major tools.
| Conversion | EXIF | IPTC | XMP | Embedded art |
|---|---|---|---|---|
| JPEG to JPEG (ImageMagick mogrify) | Preserved | Preserved | Preserved | N/A |
| JPEG to JPEG (with -strip) | Removed | Removed | Removed | N/A |
| JPEG to WebP (cwebp default) | Lost | Lost | Lost | N/A |
| JPEG to WebP (cwebp -metadata all) | Preserved | Lost | Preserved | N/A |
| JPEG to AVIF (libavif) | Preserved | Preserved | Preserved | N/A |
| JPEG to PNG (most tools) | Lost or partial | Lost | Lost | N/A |
| RAW to JPEG (most converters) | Subset preserved | Lost | Variable | N/A |
| HEIC to JPEG (sips on macOS) | Mostly preserved | Lost | Lost | N/A |
| MP3 to MP3 (ffmpeg default) | N/A | N/A | N/A | Lost (no -map 0:v) |
| MP3 to MP3 (ffmpeg -map 0) | N/A | N/A | N/A | Preserved |
| FLAC to MP3 (with -map_metadata 0) | N/A | N/A | N/A | Preserved if mapped |
| MP4 to MP4 (ffmpeg -c copy) | N/A | N/A | N/A | N/A (movie atoms preserved) |
| PDF to PDF (qpdf) | N/A | N/A | Preserved | N/A |
| PDF to DOCX (LibreOffice) | N/A | N/A | Title and author only | N/A |
| DOCX to PDF (LibreOffice) | N/A | N/A | Title and author preserved | N/A |
EXIF: the camera's story
EXIF is the metadata format for photography. It records aperture, shutter speed, ISO, focal length, lens model, GPS coordinates, capture timestamp, and the camera's color space assumption. A modern phone photo has roughly 60 EXIF tags filled in.
# View all metadata
exiftool photo.jpg
# View only EXIF
exiftool -EXIF:all photo.jpg
# Copy all metadata from one file to another
exiftool -tagsFromFile source.jpg -all:all target.jpg
# Strip GPS but keep everything else
exiftool -gps:all= photo.jpg
The single most important EXIF tag is Orientation. Phones record images in landscape pixel order regardless of how the user held the device, with an Orientation tag indicating "rotate 90 clockwise to display correctly." A converter that ignores this tag produces sideways thumbnails. Always normalize orientation early in any pipeline:
# ImageMagick: bake orientation into pixels then reset the tag
mogrify -auto-orient photo.jpg
# Or with exiftool, just set the tag without rotating pixels
# (only do this if the pixels are already in the right order)
exiftool -Orientation=1 -n photo.jpg
"The most important property of a program is whether it accomplishes the intention of its user." C.A.R. Hoare
A photo whose EXIF Orientation tag claims the image is upright while its pixels are sideways accomplishes nothing useful. Verify orientation on the output; do not assume the metadata describes reality.
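One inexpensive way to verify orientation on output is to compare dimensions. The sketch below is a hedged illustration, not a full check: the function names are hypothetical, the tuples stand in for values you would read with a tool such as exiftool, and a size comparison cannot detect a missed 180-degree rotation or a mirror.

```python
def displayed_size(width: int, height: int, orientation: int) -> tuple[int, int]:
    """Size a viewer should show, given stored pixel size and the EXIF Orientation value."""
    if orientation not in range(1, 9):
        raise ValueError(f"EXIF Orientation must be 1-8, got {orientation}")
    # Values 5-8 involve a 90 or 270 degree rotation, which swaps the axes.
    return (height, width) if orientation >= 5 else (width, height)


def orientation_was_baked(
    source: tuple[int, int, int], output: tuple[int, int, int]
) -> bool:
    """True if the output tag is reset to 1 and its pixels match the source's displayed size.

    Each tuple is (stored_width, stored_height, orientation_tag).
    """
    src_w, src_h, src_orientation = source
    out_w, out_h, out_orientation = output
    return out_orientation == 1 and (out_w, out_h) == displayed_size(
        src_w, src_h, src_orientation
    )


# A phone photo stored as 4032x3024 with Orientation 6 should display as 3024x4032.
print(displayed_size(4032, 3024, 6))
# Pixels rotated and tag reset: the orientation was baked in.
print(orientation_was_baked((4032, 3024, 6), (3024, 4032, 1)))
# Tag reset but pixels untouched: the image will now display sideways.
print(orientation_was_baked((4032, 3024, 6), (4032, 3024, 1)))
```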
IPTC: the editorial layer
IPTC stores information that an editor or photographer adds: caption, headline, keywords, byline, credit, copyright notice, contact email, location name. News agencies use IPTC heavily. Wedding photographers use it to embed model release status. E-commerce uses it to embed product SKUs.
The IPTC IIM standard (the older binary form) is being superseded by IPTC Photo Metadata 2024, which uses XMP as the carrier. Most tools read both and write to both for backward compatibility.
# Add IPTC fields to a photo
# (-sep splits the keyword string into separate list items instead of one long keyword)
exiftool \
  -sep ", " \
  -IPTC:By-line="Jane Doe" \
  -IPTC:CopyrightNotice="(c) 2026 Example Studios" \
  -IPTC:Caption-Abstract="Bride and groom at sunset, Cancun" \
  -IPTC:Keywords="wedding, sunset, Cancun, beach" \
  photo.jpg
For batch operations, a CSV-driven exiftool invocation lets a content manager update fields across thousands of files from a spreadsheet:
exiftool -csv=metadata.csv -overwrite_original *.jpg
The CSV needs a SourceFile column holding each file's path (conventionally the first column); the remaining column headers are tag names in exiftool's syntax, such as IPTC:By-line.
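If the spreadsheet is generated rather than hand-edited, the CSV can be produced programmatically. This is a minimal sketch assuming each record is a dictionary keyed by exiftool-style tag names; the function name and sample records are hypothetical.

```python
import csv
import io


def build_exiftool_csv(rows: list[dict[str, str]]) -> str:
    """Render records as CSV text with SourceFile as the first column."""
    tag_names = sorted({key for row in rows for key in row if key != "SourceFile"})
    fieldnames = ["SourceFile"] + tag_names
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        if not row.get("SourceFile"):
            raise ValueError("every row needs a SourceFile path")
        # Missing tags become empty cells so all rows share the same columns.
        writer.writerow({name: row.get(name, "") for name in fieldnames})
    return buffer.getvalue()


records = [
    {
        "SourceFile": "IMG_0001.jpg",
        "IPTC:By-line": "Jane Doe",
        "IPTC:CopyrightNotice": "(c) 2026 Example Studios",
    },
    {"SourceFile": "IMG_0002.jpg", "IPTC:By-line": "Jane Doe"},
]
print(build_exiftool_csv(records))
```

Using the csv module rather than string joins matters here because captions and keyword lists routinely contain commas and quotes, which the module escapes correctly.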
XMP: the format Adobe gave the world
XMP is Adobe's universal metadata format. It is RDF/XML, which means it is verbose but extensible. Anything you want to record about a file can be expressed as a custom XMP namespace, and competent tools will preserve it through conversions.
XMP packets can be embedded in JPEG, TIFF, PNG, PDF, MP3, MP4, MOV, and many others. The packet is human-readable, which is occasionally useful for debugging:
# Inspect the PDF trailer (it references the Info dictionary and the document catalog)
qpdf --show-object=trailer document.pdf
# Extract the XMP packet from a PDF
exiftool -xmp -b document.pdf > extracted.xmp
For pipelines that process content across multiple media types (a marketing team handling photos, videos, and PDFs), XMP is the most reliable way to carry structured metadata without per-format lookup tables.
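Because an XMP packet is plain RDF/XML, a standard XML parser is enough to read common fields once the packet has been extracted. The sketch below assumes the packet uses the element form for Dublin Core properties (rdf:li items inside rdf:Alt, rdf:Seq, or rdf:Bag); the sample packet and function name are illustrative, and the attribute shorthand that some writers emit is not handled.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample packet for illustration.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:title><rdf:Alt><rdf:li xml:lang="x-default">Annual Report 2026</rdf:li></rdf:Alt></dc:title>
   <dc:creator><rdf:Seq><rdf:li>Jane Doe</rdf:li></rdf:Seq></dc:creator>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""


def read_dublin_core(xmp_text: str) -> dict[str, list[str]]:
    """Collect the list items of a few Dublin Core properties from an XMP packet."""
    root = ET.fromstring(xmp_text)
    result: dict[str, list[str]] = {}
    for field in ("title", "creator", "subject", "rights"):
        items = root.findall(f".//dc:{field}//rdf:li", NAMESPACES)
        if items:
            result[field] = [(item.text or "").strip() for item in items]
    return result


print(read_dublin_core(SAMPLE_XMP))
```

The same function works on a packet pulled from a JPEG, a PDF, or an MP4, which is the practical meaning of "no per-format lookup tables".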
Audio metadata: ID3, Vorbis comments, iXML
Audio metadata fragments more than image metadata. MP3 uses ID3, FLAC and OGG use Vorbis comments, WAV files in production use BWF iXML, AAC uses iTunes-style atoms in MP4. They all store roughly the same information (title, artist, album, year) but the field names and structures differ.
| Field | ID3v2.3 | Vorbis | MP4 atom |
|---|---|---|---|
| Title | TIT2 | TITLE | ©nam |
| Artist | TPE1 | ARTIST | ©ART |
| Album | TALB | ALBUM | ©alb |
| Track number | TRCK | TRACKNUMBER | trkn |
| Year | TYER | DATE | ©day |
| Genre | TCON | GENRE | ©gen |
| Album artist | TPE2 | ALBUMARTIST | aART |
| Composer | TCOM | COMPOSER | ©wrt |
| Lyrics | USLT | LYRICS | ©lyr |
| Cover art | APIC | METADATA_BLOCK_PICTURE | covr |
ffmpeg -i input.flac \
-metadata title="Episode 47" \
-metadata artist="Audio Engineering Show" \
-metadata album="Season 3" \
-metadata date="2026" \
-map_metadata 0 \
-id3v2_version 3 \
-c:a libmp3lame -q:a 2 \
output.mp3
The -map_metadata 0 carries everything from input 0 first, then -metadata flags override or add specific fields.
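When a pipeline handles tags itself rather than leaving the mapping to ffmpeg, the field table can be encoded as data. The following is a small sketch under that assumption: the dictionary covers only a subset of the fields listed earlier, MP4 atom names are written with the copyright-sign prefix used by iTunes-style atoms, and the function name is hypothetical.

```python
# Field names per format; "\u00a9" is the copyright sign that prefixes many MP4 atoms.
FIELD_MAP = {
    "title":        {"id3v2.3": "TIT2", "vorbis": "TITLE",       "mp4": "\u00a9nam"},
    "artist":       {"id3v2.3": "TPE1", "vorbis": "ARTIST",      "mp4": "\u00a9ART"},
    "album":        {"id3v2.3": "TALB", "vorbis": "ALBUM",       "mp4": "\u00a9alb"},
    "track_number": {"id3v2.3": "TRCK", "vorbis": "TRACKNUMBER", "mp4": "trkn"},
    "year":         {"id3v2.3": "TYER", "vorbis": "DATE",        "mp4": "\u00a9day"},
    "genre":        {"id3v2.3": "TCON", "vorbis": "GENRE",       "mp4": "\u00a9gen"},
    "album_artist": {"id3v2.3": "TPE2", "vorbis": "ALBUMARTIST", "mp4": "aART"},
    "composer":     {"id3v2.3": "TCOM", "vorbis": "COMPOSER",    "mp4": "\u00a9wrt"},
}


def translate_tags(
    tags: dict[str, str], source: str, target: str
) -> tuple[dict[str, str], list[str]]:
    """Rename tag keys from one format to another; also return keys with no mapping."""
    reverse = {names[source]: names[target] for names in FIELD_MAP.values()}
    translated: dict[str, str] = {}
    unmapped: list[str] = []
    for name, value in tags.items():
        # Vorbis comment field names are case-insensitive, so normalize them.
        lookup = name.upper() if source == "vorbis" else name
        if lookup in reverse:
            translated[reverse[lookup]] = value
        else:
            unmapped.append(name)
    return translated, unmapped


flac_tags = {
    "title": "Episode 47",
    "ARTIST": "Audio Engineering Show",
    "REPLAYGAIN_TRACK_GAIN": "-6.5 dB",
}
print(translate_tags(flac_tags, "vorbis", "id3v2.3"))
```

Returning the unmapped keys instead of silently discarding them keeps the loss visible, which is the point of the whole exercise.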
PDF metadata: the document layer
PDF stores metadata in two places: the legacy Info dictionary (Title, Author, Subject, Keywords, Producer, Creator, CreationDate, ModDate) and the XMP packet, which can carry anything. ISO 19005 (PDF/A) requires an XMP packet for archival compliance, and ISO 32000-2 (PDF 2.0) deprecates most of the Info dictionary, so it is increasingly a legacy fallback.
# View PDF metadata
exiftool document.pdf
# qpdf does not offer a simple flag for writing Info dictionary fields;
# use it to inspect the trailer, whose /Info entry points at the dictionary
qpdf --show-object=trailer document.pdf
# Set with exiftool
exiftool \
-Title="Annual Report 2026" \
-Author="Jane Doe" \
-Subject="Financial summary" \
-Keywords="2026, annual, financial" \
document.pdf
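For PDF/A compliance (long-term archival), the XMP packet is required, and where Info dictionary entries are also present, PDF/A-1 requires them to match their XMP equivalents. veraPDF can verify this; qpdf --check validates file structure only, not PDF/A conformance.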
Privacy: when stripping is the goal
The default for production batches should be preserve, but there are clear cases where stripping is the right action.
| Scenario | What to strip | Why |
|---|---|---|
| Photos uploaded to a public website | GPS, camera serial number | Geolocation and equipment-tracking risk |
| Documents shared with adversaries (legal, journalism) | Author, modification history, comments | Reveals identity and editorial process |
| Real-estate marketing photos | GPS only | Reveals exact property location |
| Medical imaging shared for second opinion | Patient name, exam ID, hospital | HIPAA and equivalent regulations |
| Photos of children for public sharing | All metadata | Comprehensive privacy |
# Strip everything from a JPEG
exiftool -all= -overwrite_original photo.jpg
# Strip only GPS
exiftool -gps:all= -overwrite_original photo.jpg
# Strip metadata from a PDF
# (--remove-info and --remove-metadata need a recent qpdf; confirm with qpdf --help=all)
qpdf --linearize --object-streams=generate \
  --remove-info \
  --remove-metadata \
  input.pdf clean.pdf
Verification is critical because some tools claim to strip metadata but leave fragments; GPS coordinates, for example, can live in XMP as well as in EXIF. Always run exiftool against the output to confirm.
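That confirmation step can be automated. The sketch below assumes the metadata has already been loaded into a dictionary whose keys look like "Group:Tag", the shape produced by exiftool -j -G1; the function name, the default search terms, and the sample dictionary are illustrative assumptions rather than a complete privacy policy.

```python
def residual_sensitive_tags(
    metadata: dict[str, object], needles: tuple[str, ...] = ("gps", "serial")
) -> list[str]:
    """Return the keys whose tag name still contains a sensitive term after stripping."""
    hits = []
    for key in metadata:
        # Drop the group prefix ("XMP-exif:GPSLatitude" -> "gpslatitude").
        tag_name = key.split(":")[-1].lower()
        if any(needle in tag_name for needle in needles):
            hits.append(key)
    return sorted(hits)


# Hypothetical output after a strip that removed the EXIF GPS block but missed XMP.
after_strip = {
    "SourceFile": "photo.jpg",
    "IFD0:Make": "ExampleCam",
    "XMP-exif:GPSLatitude": "21 deg 9' 0.00\" N",
    "ExifIFD:SerialNumber": "0123456789",
}
print(residual_sensitive_tags(after_strip))
```

An empty list is the pass condition; anything else should fail the batch rather than be logged and ignored.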
Cross-domain consistency in metadata pipelines
A team producing content across multiple sites needs a metadata strategy that survives format conversion: the same field names, the same controlled vocabulary for keywords, and the same copyright notice format everywhere.
The pattern that scales is: single source of truth for metadata in a database, batch-applied to deliverables at conversion time, never edited in the deliverable itself. This avoids drift between platforms.
"The road to hell is paved with broken hyperlinks." Tim Berners-Lee, paraphrased
The same applies to metadata. A pipeline that breaks copyright notices on every conversion gradually loses the ability to prove who made what.
Verification: the QC step nobody runs
Every batch should verify metadata on the output, not trust the tool's defaults. The simplest check:
# After conversion, dump metadata for spot review
exiftool -j -EXIF:all -IPTC:all -XMP:all output.jpg > output.json
# Compare to a reference
diff <(exiftool -j input.jpg) <(exiftool -j output.jpg) | less
For batches, write the comparison into the runner so any drop in field count triggers a warning.
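A minimal sketch of such a runner check follows. It assumes exiftool is on the PATH and that metadata is read with exiftool -j -G1, so keys look like "Group:Tag"; the ignored-group list, the function names, and the threshold parameter are assumptions to adapt, not fixed conventions.

```python
import json
import subprocess

# Groups that describe the file on disk or are derived from other tags,
# so they are expected to differ between input and output.
IGNORED_GROUPS = {"SourceFile", "System", "File", "ExifTool", "Composite"}


def read_metadata(path: str) -> dict:
    """Return exiftool's JSON view of one file (requires exiftool on PATH)."""
    completed = subprocess.run(
        ["exiftool", "-j", "-G1", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)[0]


def dropped_fields(before: dict, after: dict) -> list[str]:
    """Keys present before conversion and absent after, ignoring file-level groups."""
    return sorted(
        key
        for key in before
        if key not in after and key.split(":")[0] not in IGNORED_GROUPS
    )


def conversion_warnings(before: dict, after: dict, max_dropped: int = 0) -> list[str]:
    """Produce a warning when more fields were dropped than the batch allows."""
    missing = dropped_fields(before, after)
    if len(missing) > max_dropped:
        return [f"dropped {len(missing)} field(s): {', '.join(missing)}"]
    return []


# Example with in-memory dictionaries standing in for read_metadata() results.
before = {
    "SourceFile": "in.jpg",
    "System:FileName": "in.jpg",
    "ExifIFD:FNumber": 2.8,
    "IPTC:By-line": "Jane Doe",
}
after = {
    "SourceFile": "out.png",
    "System:FileName": "out.png",
    "ExifIFD:FNumber": 2.8,
}
print(conversion_warnings(before, after))
```

Comparing key sets rather than raw field counts avoids the case where one field is lost and an unrelated one is added, which would leave the count unchanged.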
Metadata in video containers
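Video containers (MP4, MKV, MOV, MXF) store metadata as a tree of typed atoms or elements. The key practical concerns are probing what is already there, carrying it through a remux, and adding or overriding specific fields: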
# Probe video metadata
ffprobe -v error -show_format -show_streams input.mp4
# Carry all metadata through a remux
ffmpeg -i input.mp4 -c copy \
-map_metadata 0 -map_chapters 0 \
-movflags use_metadata_tags \
output.mp4
# Add specific metadata
ffmpeg -i input.mp4 -c copy \
-metadata title="Episode 47" \
-metadata description="In which we discuss compression" \
-metadata creation_time="2026-04-30T12:00:00Z" \
-metadata:s:v:0 language=eng \
-metadata:s:a:0 language=eng \
-metadata:s:s:0 language=fra \
output.mp4
The -movflags use_metadata_tags option matters whenever you carry custom fields. Without it, ffmpeg's MP4/MOV muxer writes only the keys it recognizes and silently drops the rest. With it, arbitrary keys are written, although players (Apple's included) may still read only a known set of atoms.
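In a batch runner, the remux command is easier to get right when it is assembled as an argument list instead of a shell string, because titles and descriptions often contain quotes and spaces. This sketch only builds the list, using the same ffmpeg options as the commands above; the function name and parameters are hypothetical.

```python
def build_remux_args(
    src: str,
    dst: str,
    container_tags: dict[str, str],
    stream_languages: dict[str, str],
) -> list[str]:
    """Assemble an ffmpeg stream-copy command that carries and overrides metadata."""
    args = [
        "ffmpeg", "-i", src,
        "-c", "copy",
        "-map_metadata", "0",
        "-map_chapters", "0",
        "-movflags", "use_metadata_tags",
    ]
    for key, value in container_tags.items():
        args += ["-metadata", f"{key}={value}"]
    # Keys are ffmpeg stream specifiers such as "v:0" or "a:0".
    for specifier, language in stream_languages.items():
        args += [f"-metadata:s:{specifier}", f"language={language}"]
    args.append(dst)
    return args


command = build_remux_args(
    "input.mp4",
    "output.mp4",
    {"title": "Episode 47"},
    {"v:0": "eng", "a:0": "eng"},
)
print(command)
```

The resulting list can be passed directly to subprocess.run, which bypasses the shell and so avoids quoting problems entirely.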
For broadcast workflows using MXF, metadata fields are far more structured (SMPTE descriptive metadata schemes) and require dedicated MXF tooling, for example the BBC's open-source bmx tools or vendor tools such as Avid's. Generic ffmpeg can read most fields but does not understand the full SMPTE schema.
Tooling table
| Need | Tool | One-line example |
|---|---|---|
| Read any metadata | exiftool | exiftool file |
| Write image metadata | exiftool | exiftool -Title=X file.jpg |
| Write video metadata | ffmpeg | ffmpeg -i in -metadata title=X -c copy out |
| Write PDF metadata | exiftool | exiftool -Title=X file.pdf |
| Strip all metadata | exiftool | exiftool -all= file |
| Verify PDF/A metadata compliance | veraPDF | verapdf --flavour 2b file |
| Diff metadata before/after | exiftool with diff | See verification section |
Common mistakes that survive years of practice
Three errors recur. First, batches that strip metadata "to save space" save kilobytes and lose information that took human labor to produce. Second, batches that assume metadata transfers across formats produce silent losses (especially JPEG to PNG, which loses EXIF in most tools). Third, pipelines that never verify output metadata accumulate drift that nobody notices until a customer complains.
A pipeline that avoids these three errors ships files that carry their full provenance to the people who need it.
References
- CIPA DC-008-2023, "Exchangeable image file format for digital still cameras: EXIF Version 3.0." Camera & Imaging Products Association, 2023.
- ISO 16684-1:2019, "Graphic technology - Extensible metadata platform (XMP) - Part 1: Data model, serialization and core properties." International Organization for Standardization.
- ISO 32000-2:2020, "Document management - Portable document format - Part 2: PDF 2.0." International Organization for Standardization.
- IPTC, "IPTC Photo Metadata Standard 2024.1." International Press Telecommunications Council, 2024. Available: https://iptc.org/std/photometadata/specification/
- ID3.org, "ID3 tag version 2.4.0 - Native Frames." Available: https://id3.org/id3v2.4.0-frames
- Xiph.Org Foundation, "Vorbis comment specification." Available: https://www.xiph.org/vorbis/doc/v-comment.html
- ISO 19005-3:2012, "Document management - Electronic document file format for long-term preservation - Part 3: Use of ISO 32000-1 with support for embedded files (PDF/A-3)."
- Harvey, P., "ExifTool documentation." Available: https://exiftool.org/
