How to Safeguard Your Data During File Conversion

A file conversion is a security boundary, and almost no team treats it like one. Files arrive from somewhere (user upload, vendor export, email attachment), pass through a converter, and exit somewhere else (CDN, S3 bucket, archive). Each side of that pipeline has different trust levels, and the converter sits in the middle with the rare ability to read everything. Most data leaks I have investigated happened in this gap: EXIF GPS in a published photo, tracked changes in a leaked memo, macro-laden DOCX in an inbound conversion service, residual text under a PDF redaction. This article is a practical security playbook for the people who run these pipelines.

The Threat Model

Conversion pipelines face four classes of threat.

Information disclosure through metadata. The file format carries information the user did not intend to publish: EXIF GPS, document author names, tracked changes, comments, embedded thumbnails, ICC profiles with system fingerprints. The risk is not the content but the metadata around it.

Code execution through embedded objects. Office macros, PDF JavaScript, SVG event handlers, image format polyglots that double as scripts. The converter is opening untrusted content; if it executes any of it, the converter host is compromised.

Information disclosure through the converter. The conversion service logs, retains, or transmits user content beyond what is necessary. Free online converters are notorious for this. Self-hosted converters can have the same problem if logs are too verbose.

Integrity loss in transit. The converted file is not a faithful copy of what the user uploaded; subtle modifications introduced along the way produce wrong outputs that nobody notices until later.

"Security is a process, not a product. The product helps; the process is what determines whether the product fails closed or fails open." Bruce Schneier, Secrets and Lies, paraphrased and applied to conversion-pipeline operations.

The Metadata Problem in Detail

Almost every modern file format carries metadata, and almost every metadata field can leak something.

Format	Metadata risk	Mitigation
JPG / HEIC / TIFF	EXIF GPS, device serial, software version	exiftool -all=
PNG	tEXt, zTXt, and iTXt chunks, tIME chunk	exiftool -all=
PDF	Author, comments, embedded fonts, JS	qpdf --remove-metadata --remove-info (JS needs a separate pass)
DOCX	Author, revision history, comments, custom properties	Inspect Document, then export PDF/A
MP3 / MP4	ID3 tags, embedded artwork with EXIF	ffmpeg -map_metadata -1
SVG	onload handlers, external references, comments	svgo or a dedicated sanitizer for scripts; scour for comments and metadata
EPUB	Reading device fingerprints, dc:creator	epubcheck plus manual edit

The strip-then-publish pattern is non-negotiable for any pipeline that emits files to a public surface.
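To make the PNG row of the table concrete, here is a minimal Python sketch that lists the metadata-bearing chunks in a PNG before it is published. The function name `list_metadata_chunks` and the chunk set are illustrative choices, not part of any tool mentioned here, and the sketch only walks the chunk headers; it does not validate CRCs or strip anything.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# Ancillary chunks that commonly carry text, timestamps, or EXIF data.
METADATA_CHUNKS = {b"tEXt", b"zTXt", b"iTXt", b"tIME", b"eXIf"}


def list_metadata_chunks(data: bytes) -> list:
    """Return the names of metadata-bearing chunks found in PNG bytes."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    found = []
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk is: 4-byte length, 4-byte type, payload, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        if ctype in METADATA_CHUNKS:
            found.append(ctype.decode("ascii"))
        if ctype == b"IEND":
            break
        pos += 12 + length
    return found


def make_chunk(ctype: bytes, payload: bytes) -> bytes:
    """Build one PNG chunk; used here only to create demo input."""
    crc = zlib.crc32(ctype + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + ctype + payload + struct.pack(">I", crc)
```

A pipeline could run a check like this after the strip step and refuse to publish any file for which the returned list is not empty.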

# Photo pipeline: strip all metadata on conversion
exiftool -all= -overwrite_original photo.jpg
# Alternative: strip everything except the ICC profile (for color accuracy)
exiftool -all= -tagsFromFile @ -ICC_Profile -overwrite_original photo.jpg

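# PDF metadata wipe (qpdf 11.7+): --remove-metadata drops the XMP stream,
# --remove-info drops the document information dictionary
qpdf --linearize --remove-metadata --remove-info input.pdf clean.pdf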

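# DOCX: export to PDF so revision history and custom properties stay behind
# (verify the result; the exported PDF can still carry an author field)
libreoffice --headless --convert-to pdf:writer_pdf_Export \
  --outdir ./out cleaned.docx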

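# SVG cleanup (strips metadata, comments, and unused IDs; scour is an
# optimizer and does not remove <script> or on* handlers, so pair it
# with a real sanitizer for untrusted input)
scour -i unsafe.svg -o safe.svg \
  --remove-metadata --enable-comment-stripping \
  --enable-id-stripping --strip-xml-prolog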
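Because SVG is XML, the active content it can carry is easy to detect even when the cleanup tool leaves it in place. The following is a rough sketch, assuming Python 3.9+ and the standard library only; `find_active_content` is a hypothetical helper name. For untrusted input, a hardened parser is a safer choice than `xml.etree`, and a finding list like this is a gate, not a sanitizer.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def find_active_content(svg_text: str) -> list:
    """List script elements, on* handlers, and non-local references in an SVG."""
    findings = []
    root = ET.fromstring(svg_text)
    for elem in root.iter():
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the namespace part
        if tag in ("script", "foreignObject"):
            findings.append(f"element <{tag}>")
        for name, value in elem.attrib.items():
            local = name.rsplit("}", 1)[-1]
            if local.lower().startswith("on"):
                findings.append(f"handler {local} on <{tag}>")
            elif name in ("href", XLINK_HREF) and not value.startswith("#"):
                # Anything other than a same-document fragment is flagged.
                findings.append(f"external reference {value!r} on <{tag}>")
    return findings
```

Rejecting any upload with a non-empty findings list is a blunt policy, but it fails closed, which is the property a public-facing pipeline needs.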

EXIF GPS: The Canonical Failure

The most embarrassing recurring leak in modern publishing is EXIF GPS. A smartphone photograph taken with location services enabled carries a precise location, often to within a few meters. Every such photograph published without metadata stripping leaks that location.

In 2012, anti-virus pioneer John McAfee was located in Guatemala from EXIF GPS in a Vice magazine photo. In 2015, a stalker located a TV personality from Twitter photos. In countless investigations, journalist sources have been deanonymized by the photos they shared. The mitigation is two lines of code, and yet most content management systems still ship without it.

# Detection: extract GPS from a directory of photos
exiftool -gps:all -filename -r ./uploads | grep -B1 -i 'GPS'

# Removal at upload boundary
# (-ext selects by extension, so no unmatched shell globs reach exiftool)
exiftool -gps:all= -overwrite_original \
  -ext jpg -ext jpeg -ext heic -ext tiff ./uploads

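#!/bin/bash
# Pre-commit hook for a content repo: strip GPS tags from staged images.
# NUL-delimited names keep paths with spaces or newlines intact.
git diff --cached --name-only -z --diff-filter=AM |
while IFS= read -r -d '' f; do
  case "${f,,}" in
    *.jpg|*.jpeg|*.heic|*.tiff|*.png)
      exiftool -gps:all= -overwrite_original "$f" && git add -- "$f" ;;
  esac
done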
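To show why the fix is so small, here is a sketch of what stripping amounts to at the byte level for a JPEG: EXIF (including GPS) and XMP live in APP1 segments near the start of the file, and dropping those segments removes them. This is an illustration under simplifying assumptions (no fill bytes between segments, a single scan header), and `strip_app1_segments` is a hypothetical name; exiftool remains the robust choice for production.

```python
def strip_app1_segments(jpeg: bytes) -> bytes:
    """Return a copy of the JPEG with every APP1 (EXIF/XMP) segment removed."""
    if jpeg[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    out = bytearray(b"\xff\xd8")
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = jpeg[pos + 1]
        if marker == 0xDA:
            # Start of scan: compressed image data follows, copy the rest as-is.
            out += jpeg[pos:]
            return bytes(out)
        # The 2-byte segment length counts itself but not the marker.
        length = int.from_bytes(jpeg[pos + 2:pos + 4], "big")
        if marker != 0xE1:  # keep every segment except APP1
            out += jpeg[pos:pos + 2 + length]
        pos += 2 + length
    raise ValueError("no start-of-scan marker found")
```

Note that this removes all EXIF, including the orientation tag, so images that rely on it for rotation need to be normalized first.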

Document Sanitization

DOCX, PPTX, and XLSX are ZIP containers holding XML. Comments, tracked changes, custom properties, embedded objects, and document-level metadata all live inside that ZIP. A conversion to PDF/A through LibreOffice or Word's "Inspect Document" pass strips most of this, but verification is necessary.

# Unzip a DOCX to inspect what is in it
unzip -l report.docx
unzip -p report.docx docProps/core.xml | xmllint --format -
unzip -p report.docx word/comments.xml >/dev/null 2>&1 && echo "comments present"

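# Convert to PDF/A-2b after inspection
# (JSON filter options need LibreOffice 7.4+; SelectPdfVersion 2 = PDF/A-2b)
libreoffice --headless \
  --convert-to 'pdf:writer_pdf_Export:{"SelectPdfVersion":{"type":"long","value":"2"}}' \
  report.docx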

# Validate with verapdf
verapdf --flavour 2b report.pdf

A common pattern in regulated industries: the pipeline accepts DOCX, converts to PDF/A, and the PDF/A becomes the canonical record. The original DOCX is held in a separate access-controlled archive. The sanitization happens at the boundary into broader distribution.
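The verification step can be automated at the boundary. The sketch below, assuming Python's standard `zipfile` module and the conventional `w:` and `dc:` namespace prefixes that Word writes, reports whether a DOCX still contains comments, tracked insertions or deletions, and an author name. `inspect_docx` is an illustrative helper, not an existing tool.

```python
import re
import zipfile


def inspect_docx(source) -> dict:
    """Report leftover comments, tracked changes, and author metadata in a DOCX."""
    report = {"comments": False, "tracked_changes": False, "creator": None}
    with zipfile.ZipFile(source) as zf:
        names = set(zf.namelist())
        report["comments"] = "word/comments.xml" in names
        if "word/document.xml" in names:
            body = zf.read("word/document.xml").decode("utf-8", "replace")
            # w:ins and w:del elements mark tracked insertions and deletions.
            report["tracked_changes"] = bool(re.search(r"<w:(ins|del)\b", body))
        if "docProps/core.xml" in names:
            core = zf.read("docProps/core.xml").decode("utf-8", "replace")
            match = re.search(r"<dc:creator>(.*?)</dc:creator>", core, re.S)
            report["creator"] = match.group(1) if match else None
    return report
```

A pipeline might block distribution of any DOCX whose report shows comments or tracked changes, and route it through the PDF/A conversion instead.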

"When you accept a document from outside your trust boundary, the only safe thing you can do with it is convert it through software that sees it differently. The conversion is the de facto sandbox." Daniel J. Bernstein, paraphrased from his discussion of email format hardening.

PDF Redaction Done Properly

PDF redaction is the single most error-prone conversion in legal and government work. The failures are famous: black rectangles in PDFs from court filings, intelligence agencies, corporate disclosures, where the text was still extractable. The rectangles obscured the visual rendering but did not remove the underlying content stream.

A real redaction:

# Step 1: open the PDF in a redaction-aware tool
# (Adobe Acrobat Pro Redact, Foxit Redact, or pdfredact-tools)

# Step 2: apply redaction marks
# (this is a manual step in any GUI tool worth using)

# Step 3: APPLY the redactions (this rewrites the content stream)
# In Acrobat: Tools > Redact > Apply

# Step 4: verify by extracting text
pdftotext redacted.pdf - | grep -i 'sensitive_term'
# Expected output: nothing

# Step 5: rasterize as belt-and-suspenders
# (turns the PDF into an image-only PDF that has no text at all;
#  ocrmypdf would add a text layer back, which is not what we want here)
pdftoppm -r 300 -png redacted.pdf page
img2pdf page-*.png -o raster_redacted.pdf

For high-stakes redactions (litigation, intelligence, medical records) the rasterize step is mandatory. It reduces the PDF to pixels, eliminating any chance of residual text in the content stream. The cost is loss of selectability and screen reader support.
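The text-extraction check in step 4 is worth scripting so it runs against a whole list of terms rather than one grep at a time. This is a minimal sketch: `find_leaked_terms` and `check_pdf` are hypothetical names, and `check_pdf` assumes the `pdftotext` binary from Poppler is installed. Matching ignores case and whitespace because extracted PDF text often breaks a phrase across lines; that choice favors false positives over misses.

```python
import re
import subprocess


def normalize(text: str) -> str:
    """Lowercase and drop whitespace so terms split across lines still match."""
    return re.sub(r"\s+", "", text).lower()


def find_leaked_terms(extracted_text: str, terms: list) -> list:
    """Return every sensitive term still present in the extracted text."""
    haystack = normalize(extracted_text)
    return [term for term in terms if normalize(term) in haystack]


def check_pdf(path: str, terms: list) -> list:
    """Extract text with pdftotext (writing to stdout) and scan it."""
    result = subprocess.run(
        ["pdftotext", path, "-"], capture_output=True, text=True, check=True
    )
    return find_leaked_terms(result.stdout, terms)
```

An empty list from `check_pdf("redacted.pdf", terms)` is the expected result; anything else means the redaction was cosmetic.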

Encryption at the Boundaries

A converted file in transit or at rest needs encryption. The right pattern is to encrypt with general-purpose tools (GPG, age, openssl) rather than relying on per-format password protection, which is often weak.

# Encrypt with GPG to a recipient's public key
gpg --output report.pdf.gpg --encrypt --recipient alice@example.com report.pdf

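# Symmetric AES-256 with openssl (use a strong passphrase)
# Note: openssl enc does not support AEAD modes such as GCM, so this uses
# CBC and the output is not authenticated; prefer age or GPG when tamper
# detection matters
openssl enc -aes-256-cbc -pbkdf2 -iter 600000 -salt \
  -in report.pdf -out report.pdf.enc

# Decrypt
openssl enc -d -aes-256-cbc -pbkdf2 -iter 600000 \
  -in report.pdf.enc -out report.pdf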

# Modern alternative with age (simpler, safer defaults)
age -r age1ql3z7hjy54... -o report.pdf.age report.pdf
age -d -i ~/.config/age/key.txt -o report.pdf report.pdf.age

PDF password protection (RC4 in older PDFs, AES-128 since PDF 1.6, AES-256 in PDF 2.0) is acceptable for casual confidentiality but should not be the only layer. Combine with transport encryption (TLS) and at-rest encryption (LUKS, BitLocker, S3 SSE).
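The `-pbkdf2 -iter 600000` flags in the openssl example are what make a passphrase expensive to brute-force. As a sketch of that step only, and assuming PBKDF2-HMAC-SHA256 with a 32-byte key followed by a 16-byte IV (my understanding of what `openssl enc` derives for AES-256-CBC, not something this article's sources confirm), the stretching looks like this in Python. `derive_key_iv` is an illustrative name.

```python
import hashlib


def derive_key_iv(passphrase: str, salt: bytes, iterations: int = 600_000,
                  key_len: int = 32, iv_len: int = 16) -> tuple:
    """Stretch a passphrase into a key and IV with PBKDF2-HMAC-SHA256."""
    material = hashlib.pbkdf2_hmac(
        "sha256", passphrase.encode("utf-8"), salt, iterations, key_len + iv_len
    )
    # The first key_len bytes become the key, the remainder the IV.
    return material[:key_len], material[key_len:]
```

The salt must be random per file and stored alongside the ciphertext; the iteration count is the knob that trades decryption latency for resistance to guessing.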

"Cryptography is the strongest part of any system. The interesting question is always: what is around the cryptography? Where do the unencrypted bytes touch disk, the network, or another process?" Bruce Schneier, Cryptography Engineering, second edition.

Choosing a Conversion Service

The four-question checklist for evaluating a conversion service.

Question	Why it matters
Does the service retain uploaded files?	Retention plus a breach equals a leak
Does the service log file content or metadata?	Logs are an exfil vector
Is the conversion done client-side or server-side?	With client-side conversion the file never leaves the device
Is the service code open and auditable?	You cannot trust what you cannot read

For sensitive content, the answer is almost always to run conversion locally with established open-source tools. LibreOffice, ImageMagick, ffmpeg, exiftool, qpdf, pandoc, and a Linux box do the work of every commercial conversion SaaS for free, with no upload boundary to worry about.

When server-side conversion is required (large files, batch jobs, integrated workflows), self-host the converter, isolate it in a sandbox (Docker, gVisor, Firejail), and ensure the upload bucket has a tight retention policy.

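# Self-hosted conversion in a Docker sandbox
# (no network, read-only root filesystem, scratch space only in /tmp;
#  HOME points at /tmp so LibreOffice can write its profile)
docker run --rm -v "$(pwd):/work" -w /work \
  --network none \
  --read-only --tmpfs /tmp -e HOME=/tmp \
  --cap-drop all \
  linuxserver/libreoffice \
  libreoffice --headless --convert-to pdf input.docx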

Supply Chain Risk

The converter binary itself is in your supply chain. CVE history of imaging libraries (libpng, libjpeg-turbo, libtiff, ImageMagick) is long and dominated by parser bugs in obscure format paths. A malicious TIFF can crash or take over an outdated ImageMagick installation. The mitigations:

Risk	Mitigation
Outdated converter binary	Patch promptly, subscribe to CVE alerts
Format-specific parser bugs	Disable unused format support if possible
Macro execution in office docs	Run with --headless and set macro security to its strictest level
ImageMagick policy.xml gaps	Audit and tighten the delegate policy
ffmpeg input format auto-detect	Force input format with -f flag

ImageMagick's `policy.xml` is the most common configuration mistake. The default policy allows reading and writing many formats including PS, PDF, and SVG, all of which have historical parser bugs. A hardened policy disables formats you do not need.

<!-- /etc/ImageMagick-7/policy.xml -->
<policymap>
  <policy domain="coder" rights="none" pattern="PS" />
  <policy domain="coder" rights="none" pattern="EPI" />
  <policy domain="coder" rights="none" pattern="PDF" />
  <policy domain="coder" rights="none" pattern="XPS" />
  <policy domain="coder" rights="none" pattern="MSL" />
  <policy domain="resource" name="memory" value="256MiB" />
  <policy domain="resource" name="map" value="512MiB" />
  <policy domain="resource" name="time" value="120" />
</policymap>
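A policy file is only useful if someone checks that it still says what it should. The sketch below parses a policy.xml and reports which coders from a watch list are not covered by a `rights="none"` rule. The watch list mirrors the coders named in this article and is an assumption to adapt; the helper name `unblocked_coders` is illustrative. It ignores rule ordering and later rules that re-grant rights, so treat it as a first-pass audit.

```python
import fnmatch
import xml.etree.ElementTree as ET

# Coders this article suggests disabling; adjust to your own needs.
RISKY_CODERS = frozenset(
    {"PS", "EPI", "PDF", "XPS", "MSL", "MVG", "EPHEMERAL", "URL", "HTTPS", "HTTP", "FTP"}
)


def unblocked_coders(policy_xml: str, risky=RISKY_CODERS) -> set:
    """Return the risky coders that no rights="none" coder policy covers."""
    patterns = []
    for policy in ET.fromstring(policy_xml).iter("policy"):
        if policy.get("domain") != "coder":
            continue
        if policy.get("rights", "").lower() != "none":
            continue
        raw = policy.get("pattern", "")
        # Patterns may be a single name, a glob, or a brace list like {PS,EPI}.
        patterns += [p.strip().upper() for p in raw.strip("{}").split(",") if p.strip()]
    return {c for c in risky if not any(fnmatch.fnmatchcase(c, p) for p in patterns)}
```

Run against the policy above, this would still flag the network and scripting coders, which is a reminder that the listed rules are a starting point rather than a complete lockdown.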

A Defensible Conversion Pipeline

A reference pipeline for handling untrusted user uploads safely:

1. Receive the upload behind TLS.
2. Validate file type with libmagic or file -b --mime-type.
3. Reject files outside the allowed list (no surprise formats).
4. Quarantine in object storage with no public ACL.
5. Convert in a sandboxed worker with no network egress.
6. Strip metadata after conversion.
7. Validate the output (verapdf for PDF/A, pngcheck for PNG, JHOVE for many formats).
8. Move clean output to the public bucket.
9. Delete the quarantine copy after a defined retention window.
10. Log the pipeline operations, not the file content.
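The type-validation and allowlist steps can be sketched without any external library by checking leading magic bytes against the claimed extension. The signature table and the names `sniff_type` and `accept_upload` are illustrative assumptions; libmagic knows far more formats, and a ZIP signature alone does not prove a file is a well-formed DOCX.

```python
from typing import Optional

# Leading bytes for the only formats this hypothetical pipeline accepts.
ALLOWED_SIGNATURES = {
    "image/png": b"\x89PNG\r\n\x1a\n",
    "image/jpeg": b"\xff\xd8\xff",
    "application/pdf": b"%PDF-",
    "application/zip": b"PK\x03\x04",  # DOCX, XLSX, PPTX are ZIP containers
}

EXPECTED_TYPE = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".pdf": "application/pdf",
    ".docx": "application/zip",
}


def sniff_type(head: bytes) -> Optional[str]:
    """Return the allowed MIME type whose signature matches, else None."""
    for mime, signature in ALLOWED_SIGNATURES.items():
        if head.startswith(signature):
            return mime
    return None


def accept_upload(head: bytes, claimed_extension: str) -> bool:
    """Accept only when the content matches the allowlist AND the claimed type."""
    expected = EXPECTED_TYPE.get(claimed_extension.lower())
    return expected is not None and sniff_type(head) == expected
```

Requiring both the sniffed type and the claimed extension to agree rejects the common trick of uploading an executable renamed to `.pdf`.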

# Example sandbox container compose snippet
services:
  converter:
    image: ghcr.io/yourorg/converter:latest
    network_mode: none
    read_only: true
    tmpfs:
      - /tmp
    cap_drop:
      - ALL
    security_opt:
      - no-new-privileges:true
    mem_limit: 512m
    cpus: 1.0


Practical Recommendations

Strip metadata at every boundary. Convert in a sandbox. Encrypt at rest and in transit. Verify redactions by re-extracting text. Audit converter dependencies. Log operations, not content.

The most common pattern of failure is not a sophisticated attack. It is a default that nobody overrode: the EXIF that nobody stripped, the comments that nobody removed, the redaction that nobody verified, the converter binary that nobody patched. Defaults matter more than capabilities. Choose them deliberately.

Concrete Threat Scenarios

Three scenarios I have actually investigated, with the failure mode and the fix.

Scenario 1: GPS leak in a real estate listing. A listings site converted iPhone HEIC uploads to JPG via a third-party SaaS. The SaaS preserved EXIF including GPS. The listing photos exposed the seller's previous address (where the photos were processed) in the metadata. Mitigation: strip GPS at upload boundary before any SaaS handoff.

Scenario 2: Tracked changes in a leaked memo. A consulting firm shipped a DOCX deliverable with tracked changes hidden but not removed. The client opened it in a forensic tool and saw the internal back-and-forth, including dismissive comments about their request. Mitigation: convert all DOCX to PDF/A through a sanitization filter before sending.

Scenario 3: ImageMagick RCE through a malicious SVG. An image hosting service ran ImageMagick with default policy.xml. A user uploaded an SVG with an embedded MSL command that exfiltrated /etc/passwd. The ImageTragick CVE family covered the underlying class of bug. Mitigation: harden policy.xml to disable the MSL, MVG, EPHEMERAL, URL, HTTPS, HTTP, and FTP coders, and block indirect `@` path reads.

<!-- ImageMagick hardening additions -->
<policy domain="coder" rights="none" pattern="MSL" />
<policy domain="coder" rights="none" pattern="MVG" />
<policy domain="coder" rights="none" pattern="EPHEMERAL" />
<policy domain="coder" rights="none" pattern="URL" />
<policy domain="coder" rights="none" pattern="HTTPS" />
<policy domain="coder" rights="none" pattern="HTTP" />
<policy domain="coder" rights="none" pattern="FTP" />
<policy domain="path" rights="none" pattern="@*" />

Compliance Mapping

For regulated industries the conversion pipeline often must demonstrate compliance with specific frameworks. A condensed mapping:

Framework	Conversion-relevant requirement	Implementation
GDPR Art. 32	Pseudonymization, encryption	Strip identifiers, encrypt at rest
HIPAA 164.312	Access controls, audit logs	Sandbox, log operations only
PCI-DSS 3.4 (v3.2.1; 3.5.1 in v4.0)	Render PAN unreadable	Mask before any PDF generation
SOC 2 CC6.1	Logical access	Authenticated converter access
ISO 27001 A.5.12 (A.8.2 in the 2013 edition)	Information classification	Tag inputs, route by classification
SEC 17a-4	Records retention	Immutable WORM storage of outputs
eIDAS	Qualified electronic signatures	Sign PDF/A outputs with HSM

Each row implies specific converter behavior. PCI-DSS-compliant conversion of card data, for example, must mask the PAN in any rendered PDF before that PDF leaves the cardholder data environment. The conversion is the masking enforcement point.
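As an illustration of masking at the conversion step, the sketch below finds 13 to 19 digit sequences that pass the Luhn check and replaces all but the last four digits before the text reaches a PDF renderer. The regular expression, the last-four policy, and the name `mask_pans` are assumptions for this example; a real cardholder data environment needs a vetted detection library and a policy agreed with the assessor.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
PAN_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Standard Luhn checksum: double every second digit from the right."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def mask_pans(text: str) -> str:
    """Replace Luhn-valid card numbers with asterisks plus the last four digits."""
    def _mask(match):
        digits = re.sub(r"\D", "", match.group())
        if not luhn_valid(digits):
            return match.group()  # leave order numbers and the like untouched
        return "*" * (len(digits) - 4) + digits[-4:]
    return PAN_CANDIDATE.sub(_mask, text)
```

The Luhn filter keeps the masker from mangling invoice or tracking numbers, at the cost of missing a PAN that is immediately followed by more digits.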

Logging Without Leaking

A nontrivial design problem: how do you log enough to debug failures without logging the content you are trying to protect? The pattern that works:

# Log the operation, not the content
log_event() {
  local sha256
  sha256=$(sha256sum "$1" | cut -d' ' -f1 | cut -c1-12)
  printf '%s op=%s file=%s size=%d sha=%s status=%s\n' \
    "$(date -Iseconds)" "$2" "$(basename "$1")" \
    "$(stat -c %s "$1")" "$sha256" "$3"
}

# Now logs contain a content hash but not content
log_event input.pdf convert success
# Example output (values illustrative; sha is the first 12 hex characters):
# 2026-05-02T10:23:45+00:00 op=convert file=input.pdf size=438291 sha=a3f2c1d4e5b6 status=success

The truncated SHA-256 lets you correlate operations across systems without revealing the file. The size lets you spot anomalies. The op and status drive alerting. The actual bytes never enter logs.
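When the conversion worker is written in Python rather than shell, the same pattern carries over directly. This is a sketch that mirrors the fields of the shell function; `format_event` is an illustrative name, and the file is hashed in blocks so large uploads do not have to fit in memory.

```python
import hashlib
import os
from datetime import datetime, timezone


def format_event(path: str, op: str, status: str) -> str:
    """Build a log line with a truncated content hash but no content."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return "{ts} op={op} file={name} size={size} sha={sha} status={status}".format(
        ts=datetime.now(timezone.utc).isoformat(timespec="seconds"),
        op=op,
        name=os.path.basename(path),
        size=os.path.getsize(path),
        sha=digest.hexdigest()[:12],  # same 12-character prefix as the shell version
        status=status,
    )
```

Passing the returned string to the standard `logging` module keeps the rule intact: the line describes the operation, and the bytes stay out of the log.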

References

Ferguson, Niels, Bruce Schneier, and Tadayoshi Kohno. Cryptography Engineering: Design Principles and Practical Applications. Wiley, 2010. ISBN 978-0470474242.
NIST SP 800-88 Rev. 1. Guidelines for Media Sanitization. National Institute of Standards and Technology, December 2014.
NIST SP 800-175B Rev. 1. Guideline for Using Cryptographic Standards in the Federal Government. March 2020.
ISO/IEC 27001:2022. Information security management systems, Requirements.
RFC 9580. OpenPGP. Internet Engineering Task Force, July 2024.
Adobe Systems. PDF Reference, sixth edition (PDF 1.7), and ISO 32000-2:2020 for PDF 2.0.
PDF Association. PDF/A and Document Redaction Best Practices, 2021.
exiftool by Phil Harvey. https://exiftool.org/
