A file conversion is a security boundary, and almost no team treats it like one. Files arrive from somewhere (user upload, vendor export, email attachment), pass through a converter, and exit somewhere else (CDN, S3 bucket, archive). Each side of that pipeline has different trust levels, and the converter sits in the middle with the rare ability to read everything. Most data leaks I have investigated happened in this gap: EXIF GPS in a published photo, tracked changes in a leaked memo, macro-laden DOCX in an inbound conversion service, residual text under a PDF redaction. This article is a practical security playbook for the people who run these pipelines.
The Threat Model
Conversion pipelines face four classes of threat.
Information disclosure through metadata. The file format carries information the user did not intend to publish: EXIF GPS, document author names, tracked changes, comments, embedded thumbnails, ICC profiles with system fingerprints. The risk is not the content but the metadata around it.
Code execution through embedded objects. Office macros, PDF JavaScript, SVG event handlers, image format polyglots that double as scripts. The converter is opening untrusted content; if it executes any of it, the converter host is compromised.
Information disclosure through the converter. The conversion service logs, retains, or transmits user content beyond what is necessary. Free online converters are notorious for this. Self-hosted converters can have the same problem if logs are too verbose.
Integrity loss in transit. The converted file is not the file the user uploaded; subtle modifications produce wrong outputs that nobody notices until later.
"Security is a process, not a product. The product helps; the process is what determines whether the product fails closed or fails open." Bruce Schneier, Secrets and Lies, paraphrased and applied to conversion-pipeline operations.
The Metadata Problem in Detail
Almost every modern file format carries metadata, and almost every metadata field can leak something.
| Format | Metadata risk | Mitigation |
|---|---|---|
| JPG / HEIC / TIFF | EXIF GPS, device serial, software version | exiftool -all= |
| PNG | tEXt and iTXt chunks, tIME chunk | exiftool -all= |
| PDF | Author, comments, embedded fonts, JS | qpdf --linearize --remove-metadata |
| DOCX | Author, revision history, comments, custom properties | Inspect Document, then export PDF/A |
| MP3 / MP4 | ID3 tags, embedded artwork with EXIF | ffmpeg -map_metadata -1 |
| SVG | onload handlers, external references, comments | scour or svgo with strip |
| EPUB | Reading device fingerprints, dc:creator | epubcheck plus manual edit |
# Photo pipeline strip on conversion
exiftool -all= -overwrite_original photo.jpg
# Optional: keep ICC profile for color accuracy
exiftool -all= -tagsFromFile @ -ICC_Profile photo.jpg
# PDF metadata wipe
qpdf --linearize --remove-metadata input.pdf clean.pdf
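# DOCX: export to PDF so the editable revision history is left behind
libreoffice --headless --convert-to pdf:writer_pdf_Export \
  --outdir ./out cleaned.docx
# SVG cleanup: scour strips metadata, comments, and unused IDs. It is an
# optimizer, not a sanitizer, so script elements and on* handlers need a
# separate check before the file is trusted.
scour -i unsafe.svg -o safe.svg \
  --remove-metadata --enable-id-stripping --enable-comment-stripping \
  --strip-xml-prolog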
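An optimizer pass is not the same as a sanitizer, so it can help to reject SVGs that still carry active content before they reach the converter. The following is a minimal heuristic sketch, not a parser-grade sanitizer. The function name `svg_has_active_content` and the pattern list are assumptions chosen for illustration.

```shell
# Hypothetical helper, not part of scour or svgo.
# Reads an SVG on stdin. Exits 0 if it finds markup that can run code or
# fetch remote resources, 1 otherwise. Regex matching can be evaded by
# unusual encodings, so a clean result only means "not obviously bad".
svg_has_active_content() {
  grep -Eiq \
    -e '<script' \
    -e '<foreignObject' \
    -e '[[:space:]]on[a-z]+[[:space:]]*=' \
    -e 'javascript:' \
    -e 'href[[:space:]]*=[[:space:]]*.(https?:)?//'
}
```

A pipeline would call it as `svg_has_active_content < upload.svg && echo "reject"`. Rejecting on a match is the conservative choice here: a false positive costs one re-upload, a false negative costs a compromised viewer.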
EXIF GPS: The Canonical Failure
EXIF GPS is the canonical metadata leak in modern publishing. A smartphone photograph taken with location services enabled carries a precise location, often to within a few meters, and every such photograph uploaded without metadata stripping leaks it.
In 2012, anti-virus pioneer John McAfee was located in Guatemala from EXIF GPS in a Vice magazine photo. In 2015, a stalker located a TV personality from Twitter photos. In countless investigations, journalist sources have been deanonymized by the photos they shared. The mitigation is two lines of code, and yet most content management systems still ship without it.
# Detection: extract GPS from a directory of photos
exiftool -gps:all -filename -r ./uploads | grep -B1 -i 'GPS'
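# Removal at upload boundary (-ext avoids failing on an unmatched shell glob)
exiftool -gps:all= -overwrite_original \
  -ext jpg -ext jpeg -ext heic -ext tiff ./uploads
# As a pre-commit hook for a content repo (save as .git/hooks/pre-commit)
#!/bin/bash
# NUL-delimited names keep paths with spaces intact; AM covers added and
# modified files
shopt -s nocasematch
git diff --cached --name-only -z --diff-filter=AM |
while IFS= read -r -d '' f; do
  case "$f" in
    *.jpg|*.jpeg|*.heic|*.tiff|*.png)
      exiftool -gps:all= -overwrite_original "$f" && git add -- "$f" ;;
  esac
done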
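Stripping at the upload boundary is only half of the control. A scheduled check that fails when GPS tags reappear catches the paths that bypass the hook. The sketch below assumes exiftool's tab-separated `-T` output, which prints `-` for a missing tag. The function name `gps_leaks` is hypothetical.

```shell
# Reads exiftool table output (-T) on stdin, one line per file, with
# tab-separated columns: Directory, FileName, GPSLatitude, GPSLongitude.
# Prints each file that still has a coordinate; exits 1 if any were found.
gps_leaks() {
  awk -F'\t' '
    NF >= 4 && ($3 != "-" || $4 != "-") { print $1 "/" $2; found = 1 }
    END { exit found }
  '
}

# Intended use (requires exiftool; left as a comment so the sketch runs
# without it):
#   exiftool -T -r -directory -filename -gpslatitude -gpslongitude ./uploads \
#     | gps_leaks || echo "GPS metadata still present" >&2
```

Because the function only parses text, it can be exercised in CI with canned input, independent of whether exiftool is installed on the build host.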
Document Sanitization
DOCX, PPTX, and XLSX are ZIP containers holding XML. Comments, tracked changes, custom properties, embedded objects, and document-level metadata all live inside that ZIP. A conversion to PDF/A through LibreOffice or Word's "Inspect Document" pass strips most of this, but verification is necessary.
# Unzip a DOCX to inspect what is in it
unzip -l report.docx
unzip -p report.docx docProps/core.xml | xmllint --format -
unzip -p report.docx word/comments.xml >/dev/null 2>&1 && echo "comments present"
# Convert to PDF/A-2b after inspection (JSON filter options need LibreOffice 7.4+)
libreoffice --headless \
  --convert-to 'pdf:writer_pdf_Export:{"SelectPdfVersion":{"type":"long","value":"2"}}' \
  report.docx
# Validate with verapdf
verapdf --flavour 2b report.pdf
A common pattern in regulated industries: the pipeline accepts DOCX, converts to PDF/A, and the PDF/A becomes the canonical record. The original DOCX is held in a separate access-controlled archive. The sanitization happens at the boundary into broader distribution.
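The verification step can be partly automated. In WordprocessingML, tracked insertions and deletions are stored in word/document.xml as `w:ins` and `w:del` elements, so counting them gives a quick signal before a file leaves the boundary. This is a minimal sketch under that assumption. The function name `count_tracked_changes` is hypothetical, and it does not cover moves, comments, or custom properties.

```shell
# Counts tracked-change markers in WordprocessingML read from stdin.
# The trailing [ >] keeps look-alike elements such as <w:insideH> and
# <w:delText> from being counted.
# Prints the count; exits 0 only when the count is zero.
count_tracked_changes() {
  local n
  n=$(grep -oE '<w:(ins|del)[ >]' | wc -l) || true
  n=$((n))   # normalise whitespace padding that some wc builds emit
  echo "$n"
  [ "$n" -eq 0 ]
}

# Intended use (requires unzip):
#   unzip -p report.docx word/document.xml | count_tracked_changes
```

A non-zero count means the document still carries revision history, including the author and timestamp of each edit, even if the changes are hidden in the editor's view.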
"When you accept a document from outside your trust boundary, the only safe thing you can do with it is convert it through software that sees it differently. The conversion is the de facto sandbox." Daniel J. Bernstein, paraphrased from his discussion of email format hardening.
PDF Redaction Done Properly
PDF redaction is the single most error-prone conversion in legal and government work. The failures are famous: black rectangles in PDFs from court filings, intelligence agencies, corporate disclosures, where the text was still extractable. The rectangles obscured the visual rendering but did not remove the underlying content stream.
A real redaction:
# Step 1: open the PDF in a redaction-aware tool
# (Adobe Acrobat Pro Redact, Foxit Redact, or pdf-redact-tools)
# Step 2: apply redaction marks
# (this is a manual step in any GUI tool worth using)
# Step 3: APPLY the redactions (this rewrites the content stream)
# In Acrobat: Tools > Redact > Apply
# Step 4: verify by extracting text
pdftotext redacted.pdf - | grep -i 'sensitive_term'
# Expected output: nothing
# Step 5: rasterize as belt-and-suspenders
# (turns the PDF into an image-only PDF that has no text at all;
# ocrmypdf would add a text layer back, so render and reassemble instead)
pdftoppm -r 300 -png redacted.pdf page
img2pdf page-*.png -o raster_redacted.pdf
For high-stakes redactions (litigation, intelligence, medical records) the rasterize step is mandatory. It reduces the PDF to pixels, eliminating any chance of residual text in the content stream. The cost is loss of selectability and screen reader support.
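The text-extraction check scales better when it takes a list of terms and returns a proper exit status, since a bare `grep` exits non-zero precisely when the redaction succeeded. The following is a sketch under those assumptions. The function name `verify_redaction` and the terms-file layout (one fixed string per line) are hypothetical.

```shell
# verify_redaction TERMS_FILE < extracted_text
# TERMS_FILE holds one sensitive string per line (fixed strings, matched
# case-insensitively). Exits 0 when no term survives, 1 otherwise.
# Only a count is reported, so the terms never reach the log.
verify_redaction() {
  local terms_file="$1" lines
  lines=$(grep -ciF -f "$terms_file") || true
  if [ "${lines:-0}" -gt 0 ]; then
    echo "redaction incomplete: ${lines} line(s) still contain a listed term" >&2
    return 1
  fi
}

# Intended use (requires pdftotext from poppler-utils):
#   pdftotext redacted.pdf - | verify_redaction sensitive_terms.txt
```

A pass here only shows that the terms are absent from the extractable text layer. Text that survives inside images or is split across lines would not be caught, which is one more reason the rasterize step matters.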
Encryption at the Boundaries
A converted file in transit or at rest needs encryption. The right pattern is to encrypt with general-purpose tools (GPG, age, openssl) rather than relying on per-format password protection, which is often weak.
# Encrypt with GPG to a recipient's public key
gpg --output report.pdf.gpg --encrypt --recipient alice@example.com report.pdf
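# Symmetric AES-256 with openssl (use a strong passphrase).
# openssl enc rejects AEAD modes such as GCM, so this is CBC without
# authentication; prefer age or GPG when tampering is a concern.
openssl enc -aes-256-cbc -pbkdf2 -iter 600000 -salt \
  -in report.pdf -out report.pdf.enc
# Decrypt
openssl enc -d -aes-256-cbc -pbkdf2 -iter 600000 \
  -in report.pdf.enc -out report.pdf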
# Modern alternative with age (simpler, safer defaults)
age -r age1ql3z7hjy54... -o report.pdf.age report.pdf
age -d -i ~/.config/age/key.txt -o report.pdf report.pdf.age
PDF password protection (RC4 in older PDFs, AES-128 from PDF 1.6, AES-256 in PDF 2.0) is acceptable for casual confidentiality but should not be the only layer. Combine with transport encryption (TLS) and at-rest encryption (LUKS, BitLocker, S3 SSE).
"Cryptography is the strongest part of any system. The interesting question is always: what is around the cryptography? Where do the unencrypted bytes touch disk, the network, or another process?" Bruce Schneier, Cryptography Engineering, second edition.
Choosing a Conversion Service
A four-question checklist for evaluating a conversion service:
| Question | Why it matters |
|---|---|
| Does the service retain uploaded files? | Retention plus a breach equals a leak |
| Does the service log file content or metadata? | Logs are an exfil vector |
| Is the conversion done client-side or server-side? | With client-side conversion the file never leaves the device |
| Is the service code open and auditable? | You cannot trust what you cannot read |
When server-side conversion is required (large files, batch jobs, integrated workflows), self-host the converter, isolate it in a sandbox (Docker, gVisor, Firejail), and ensure the upload bucket has a tight retention policy.
# Self-hosted conversion in a Docker sandbox
# (no network; HOME on tmpfs so LibreOffice can write its profile)
docker run --rm -v "$(pwd):/work" -w /work \
  --read-only --tmpfs /tmp -e HOME=/tmp \
  --network none \
  --cap-drop all \
  linuxserver/libreoffice \
  libreoffice --headless --convert-to pdf input.docx
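When uploads are staged on a local directory rather than in object storage, the retention policy can be enforced with a scheduled sweep. A minimal sketch, assuming a find implementation that supports `-mmin` and `-delete` (GNU and BSD find both do). The function name `purge_expired` is hypothetical.

```shell
# purge_expired DIR MAX_AGE_MINUTES
# Deletes regular files under DIR whose modification time is older than
# MAX_AGE_MINUTES and prints how many were removed. File names are counted
# but never printed, so the sweep's own log stays free of user data.
purge_expired() {
  local dir="$1" max_age="$2"
  find "$dir" -type f -mmin "+${max_age}" -print -delete | wc -l | tr -d ' '
}

# Intended use from cron, with a one-hour window:
#   purge_expired /srv/quarantine 60
```

For object storage the equivalent control would normally be a bucket lifecycle rule rather than a script, since the store enforces it even when the worker is down.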
Supply Chain Risk
The converter binary itself is in your supply chain. CVE history of imaging libraries (libpng, libjpeg-turbo, libtiff, ImageMagick) is long and dominated by parser bugs in obscure format paths. A malicious TIFF can crash or take over an outdated ImageMagick installation. The mitigations:
| Risk | Mitigation |
|---|---|
| Outdated converter binary | Patch promptly, subscribe to CVE alerts |
| Format-specific parser bugs | Disable unused format support if possible |
| Macro execution in office docs | Use --headless and disable macro execution |
| ImageMagick policy.xml gaps | Audit and tighten the delegate policy |
| ffmpeg input format auto-detect | Force input format with -f flag |
<!-- /etc/ImageMagick-7/policy.xml -->
<policymap>
<policy domain="coder" rights="none" pattern="PS" />
<policy domain="coder" rights="none" pattern="EPI" />
<policy domain="coder" rights="none" pattern="PDF" />
<policy domain="coder" rights="none" pattern="XPS" />
<policy domain="coder" rights="none" pattern="MSL" />
<policy domain="resource" name="memory" value="256MiB" />
<policy domain="resource" name="map" value="512MiB" />
<policy domain="resource" name="time" value="120" />
</policymap>
A Defensible Conversion Pipeline
A reference pipeline for handling untrusted user uploads safely:
- Receive the upload behind TLS.
- Validate file type with libmagic or file -b --mime-type.
- Reject files outside the allowed list (no surprise formats).
- Quarantine in object storage with no public ACL.
- Convert in a sandboxed worker with no network egress.
- Strip metadata after conversion.
- Validate the output (verapdf for PDF/A, pngcheck for PNG, JHOVE for many formats).
- Move clean output to the public bucket.
- Delete the quarantine copy after a defined retention window.
- Log the pipeline operations, not the file content.
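The validate-and-reject steps above can be sketched as a small allowlist check. The MIME types listed are illustrative assumptions, not a recommendation for any particular pipeline, and the function name `is_allowed_mime` is hypothetical.

```shell
# is_allowed_mime MIME
# Exits 0 if MIME is on the allowlist, 1 otherwise. Anything not named
# explicitly is rejected, so new formats must be opted in deliberately.
is_allowed_mime() {
  case "$1" in
    image/jpeg|image/png|image/heic|application/pdf) return 0 ;;
    application/vnd.openxmlformats-officedocument.wordprocessingml.document) return 0 ;;
    *) return 1 ;;
  esac
}

# Intended use, with detection based on content rather than the extension:
#   mime=$(file -b --mime-type -- "$upload")
#   is_allowed_mime "$mime" || { echo "rejected type: $mime" >&2; exit 1; }
```

Keeping the decision separate from the detection makes the policy easy to test and review on its own.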
# Example sandbox container compose snippet
services:
  converter:
    image: ghcr.io/yourorg/converter:latest
    network_mode: none
    read_only: true
    tmpfs:
      - /tmp
    cap_drop:
      - ALL
    security_opt:
      - no-new-privileges:true
    mem_limit: 512m
    cpus: 1.0
Practical Recommendations
Strip metadata at every boundary. Convert in a sandbox. Encrypt at rest and in transit. Verify redactions by re-extracting text. Audit converter dependencies. Log operations, not content.
The most common pattern of failure is not a sophisticated attack. It is a default that nobody overrode: the EXIF that nobody stripped, the comments that nobody removed, the redaction that nobody verified, the converter binary that nobody patched. Defaults matter more than capabilities. Choose them deliberately.
Concrete Threat Scenarios
Three scenarios I have actually investigated, with the failure mode and the fix.
Scenario 1: GPS leak in a real estate listing. A listings site converted iPhone HEIC uploads to JPG via a third-party SaaS. The SaaS preserved EXIF including GPS. The listing photos exposed the seller's previous address (where the photos were processed) in the metadata. Mitigation: strip GPS at upload boundary before any SaaS handoff.
Scenario 2: Tracked changes in a leaked memo. A consulting firm shipped a DOCX deliverable with tracked changes hidden but not removed. The client opened it in a forensic tool and saw the internal back-and-forth, including dismissive comments about their request. Mitigation: convert all DOCX to PDF/A through a sanitization filter before sending.
Scenario 3: ImageMagick RCE through a malicious SVG. An image hosting service ran ImageMagick with default policy.xml. A user uploaded an SVG with an embedded MSL command that exfiltrated /etc/passwd. The ImageTragick CVE family (CVE-2016-3714 and related) covered the underlying class of bug. Mitigation: harden policy.xml to disable the MSL, MVG, EPHEMERAL, URL, HTTPS, HTTP, and FTP coders.
<!-- ImageMagick hardening additions -->
<policy domain="coder" rights="none" pattern="MSL" />
<policy domain="coder" rights="none" pattern="MVG" />
<policy domain="coder" rights="none" pattern="EPHEMERAL" />
<policy domain="coder" rights="none" pattern="URL" />
<policy domain="coder" rights="none" pattern="HTTPS" />
<policy domain="coder" rights="none" pattern="HTTP" />
<policy domain="coder" rights="none" pattern="FTP" />
<policy domain="path" rights="none" pattern="@*" />
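A hardened policy only helps if it is the one actually deployed, so a small audit that runs at deploy time is worth having. The sketch below is line-oriented and deliberately simple. The function name `audit_policy` is hypothetical, and it assumes one coder per `pattern` attribute, as in the snippet above.

```shell
# audit_policy POLICY_XML CODER...
# Reports each named coder that has no active rights="none" coder policy.
# Limitations: lines containing "<!--" are skipped, but multi-line comments
# and brace lists such as {PS,EPS} are not understood.
audit_policy() {
  local policy_file="$1" coder missing=0
  shift
  for coder in "$@"; do
    if ! awk -v coder="$coder" '
        /<!--/ { next }
        /domain="coder"/ && /rights="none"/ &&
          index($0, "pattern=\"" coder "\"") { ok = 1 }
        END { exit !ok }
      ' "$policy_file"; then
      echo "coder not disabled: ${coder}"
      missing=1
    fi
  done
  return "$missing"
}

# Intended use:
#   audit_policy /etc/ImageMagick-7/policy.xml MSL MVG EPHEMERAL URL HTTPS HTTP FTP
```

This checks the file on disk. Running `identify -list policy` on the host additionally shows what the installed binary loaded, which matters when several policy files exist.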
Compliance Mapping
For regulated industries the conversion pipeline often must demonstrate compliance with specific frameworks. A condensed mapping:
| Framework | Conversion-relevant requirement | Implementation |
|---|---|---|
| GDPR Art. 32 | Pseudonymization, encryption | Strip identifiers, encrypt at rest |
| HIPAA 164.312 | Access controls, audit logs | Sandbox, log operations only |
| PCI-DSS 3.4 | Render PAN unreadable | Mask before any PDF generation |
| SOC 2 CC6.1 | Logical access | Authenticated converter access |
| ISO 27001:2013 A.8.2 (A.5.12 in the 2022 edition) | Information classification | Tag inputs, route by classification |
| SEC 17a-4 | Records retention | Immutable WORM storage of outputs |
| eIDAS | Qualified electronic signatures | Sign PDF/A outputs with HSM |
Logging Without Leaking
A nontrivial design problem: how do you log enough to debug failures without logging the content you are trying to protect? The pattern that works:
# Log the operation, not the content
log_event() {
  local sha256
  # First 12 hex characters of the SHA-256 are enough to correlate events
  sha256=$(sha256sum "$1" | cut -d' ' -f1 | cut -c1-12)
  printf '%s op=%s file=%s size=%d sha=%s status=%s\n' \
    "$(date -Iseconds)" "$2" "$(basename "$1")" \
    "$(stat -c %s "$1")" "$sha256" "$3"
}
# Now logs contain a content hash but not content
log_event input.pdf convert success
# Output: 2026-05-02T10:23:45+00:00 op=convert file=input.pdf size=438291 sha=a3f2c1d4e5b6 status=success
The truncated SHA-256 lets you correlate operations across systems without revealing the file. The size lets you spot anomalies. The op and status drive alerting. The actual bytes never enter logs.
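Upload file names can themselves be sensitive (a patient name, a deal code name), so a stricter variant replaces the name with a stable tag. The sketch below is one way to do that, not the only one. The function name `pseudonymize_name` is hypothetical, and salted SHA-256 via sha256sum is an assumption made to keep the example dependency-free.

```shell
# pseudonymize_name KEY NAME
# Prints a 12-hex-character tag derived from KEY and NAME. The same name
# always maps to the same tag, so log lines remain correlatable without
# exposing the name. KEY should be a secret kept outside the log store.
# A real HMAC would be the stronger construction; this is a simplification.
pseudonymize_name() {
  local key="$1" name="$2"
  printf '%s:%s' "$key" "$name" | sha256sum | cut -c1-12
}

# Intended use inside a logger:
#   printf 'file=%s\n' "$(pseudonymize_name "$LOG_KEY" "$(basename "$path")")"
```

The trade-off is debuggability: an operator needs the key and the original name to confirm which file a log line refers to.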
References
- Ferguson, Niels, Bruce Schneier, and Tadayoshi Kohno. Cryptography Engineering: Design Principles and Practical Applications. Wiley, 2010. ISBN 978-0470474242.
- NIST SP 800-88 Rev. 1. Guidelines for Media Sanitization. National Institute of Standards and Technology, December 2014.
- NIST SP 800-175B Rev. 1. Guideline for Using Cryptographic Standards in the Federal Government. March 2020.
- ISO/IEC 27001:2022. Information security management systems, Requirements.
- RFC 9580. OpenPGP. Internet Engineering Task Force, July 2024.
- Adobe Systems. PDF Reference, sixth edition (PDF 1.7), and ISO 32000-2:2020 for PDF 2.0.
- PDF Association. PDF/A and Document Redaction Best Practices, 2021.
- exiftool by Phil Harvey. https://exiftool.org/