A 50 MB PDF for a four-page letter is not unusual. A 200 MB scanned book is routine. A government-issued certificate weighing 80 MB because the logo was embedded as a 4000 by 4000 pixel uncompressed bitmap is, somehow, common. PDFs bloat for reasons unrelated to their actual content: scanners default to 600 DPI when 200 is plenty, Word embeds full font files when subsets would do, image-heavy reports duplicate the same logo on every page, and "Save as PDF" routines pick conservative settings to avoid blame for missing pixels.

Optimizing a PDF is rarely about exotic compression. It is about removing waste. This guide walks through the structural sources of PDF bloat, the open-source tools that address each one (qpdf, Ghostscript, pdftk, mutool from MuPDF), and the verification steps that distinguish "smaller and still useful" from "smaller and broken."

Why PDFs are bigger than they need to be

A PDF is a container of objects: pages, fonts, images, vector paths, form fields, scripts, metadata. Most bloat comes from four sources.

Embedded images at higher resolution than needed. A photo printed at 4 inches wide on a page does not need to be more than 600 pixels wide for screen display or 1200 pixels for print. Documents routinely embed 4000-pixel sources.

Whole-font embeds. A document using twelve characters of a font may embed the entire 800 KB font file. Modern PDF generators subset fonts (embedding only the glyphs used), but older tools and some "Save as PDF" routines do not.

Uncompressed streams. PDF supports Flate compression on content streams and other data streams, and since PDF 1.5 it can also pack small objects into compressed object streams, but documents written by some tools leave this data uncompressed for compatibility. This adds 5 to 30 percent to the file for no benefit.

Duplicated content. A logo placed on every page of a 200-page report can appear 200 times in the file unless the generator deduplicates the image object. Most do, but some do not.
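
The image-resolution figures above come from simple arithmetic: printed width in inches times target resolution. A minimal sketch, assuming 150 DPI as the screen target and 300 DPI as the print target (the function name needed_pixels is illustrative, not part of any tool):

```shell
# needed_pixels WIDTH_INCHES DPI
# Prints the pixel width an image needs to reach DPI at the given printed width.
needed_pixels() {
  awk -v w="$1" -v d="$2" 'BEGIN { printf "%d\n", w * d + 0.5 }'
}

needed_pixels 4 150   # screen target: prints 600
needed_pixels 4 300   # print target: prints 1200
```

Comparing these numbers against the widths reported by pdfimages -list shows how much of an embedded image is excess.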

"Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away." Antoine de Saint-Exupery

PDF optimization is the engineering equivalent. Subtract everything that adds size without adding meaning.

The toolkit

| Tool | Strengths | Limitations |
| --- | --- | --- |
| qpdf | Lossless structural optimization, object stream rewriting, linearization | Does not downsample images |
| Ghostscript | Full PDF re-render with image downsampling, font subsetting | Can break form fields, alters PDF version |
| pdftk | Page operations (split, merge, rotate, watermark) | Stalled development; the original depends on the discontinued GCJ toolchain, pdftk-java is the maintained port |
| MuPDF mutool | Fast structural compression, page extraction | Less feature-rich than Ghostscript |
| cpdf | Commercial, very capable, scriptable | Paid for non-academic use |
| ocrmypdf | OCR plus optimization in one pipeline | Requires Tesseract |

For most pipelines, qpdf plus Ghostscript covers the vast majority of needs. The pattern is: Ghostscript for image and font optimization, qpdf for structural cleanup, in that order.

Lossless structural optimization with qpdf

qpdf does not touch the document's visible content. It rewrites the PDF's underlying structure in a more compact form: small objects packed into compressed object streams, unreferenced objects dropped, and stream data recompressed. Typical savings: 5 to 25 percent on documents that have not been previously optimized.

# Basic structural optimization
qpdf --linearize --object-streams=generate \
  --compress-streams=y \
  --recompress-flate \
  --compression-level=9 \
  input.pdf output.pdf

The --linearize flag rearranges the file so web browsers can start displaying page 1 before downloading the rest. This is the format Adobe calls "Fast Web View." For any PDF that will be served over HTTP, linearize.

The --object-streams=generate flag packs many small objects into compressed streams. The --recompress-flate flag re-runs deflate compression with maximum effort, which is slower but yields measurable savings.
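
To see what a structural pass actually saved on a given file, compare byte counts before and after. A small sketch (the helper name percent_saved is an assumption for illustration; the sizes in the demo call are made up):

```shell
# percent_saved BEFORE_BYTES AFTER_BYTES
# Prints the size reduction as a percentage with one decimal place.
percent_saved() {
  awk -v b="$1" -v a="$2" 'BEGIN {
    if (b <= 0) { print "0.0"; exit }
    printf "%.1f\n", (b - a) * 100 / b
  }'
}

# With real files, wc -c supplies the byte counts:
#   percent_saved "$(wc -c < input.pdf)" "$(wc -c < output.pdf)"
percent_saved 1000000 820000   # prints 18.0
```

A negative result means the output grew, which can happen on files that were already well compressed.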

For batches:

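# Pass file names with ::: instead of parsing ls output, so names with
# spaces survive; create the output directory first.
mkdir -p output
parallel -j 6 \
  qpdf --linearize --object-streams=generate \
    --recompress-flate --compression-level=9 \
    {} output/{/.}.opt.pdf ::: input/*.pdf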

Image downsampling with Ghostscript

The largest single optimization for typical documents is downsampling embedded images. Ghostscript's preset profiles handle this with one command:

gs -sDEVICE=pdfwrite \
   -dCompatibilityLevel=1.7 \
   -dPDFSETTINGS=/ebook \
   -dNOPAUSE -dQUIET -dBATCH \
   -sOutputFile=output.pdf input.pdf

The PDFSETTINGS values map to image resolution targets:

| Preset | Color/grayscale DPI | Monochrome DPI | Use case | Typical reduction |
| --- | --- | --- | --- | --- |
| /screen | 72 | 300 | Email, screen-only | 70-90% |
| /ebook | 150 | 300 | E-readers, web preview | 50-80% |
| /printer | 300 | 1200 | Office printing | 20-50% |
| /prepress | 300 | 1200 | Professional print | 10-30% |
| /default | Variable | Variable | Mixed | Variable |

For most office documents that will be read on screens or printed on standard office printers, `/ebook` is the right choice. At typical viewing sizes its output is usually hard to tell apart from the original, while the file size drops dramatically.

For scanned documents, override defaults with explicit parameters:

gs -sDEVICE=pdfwrite \
   -dCompatibilityLevel=1.7 \
   -dDownsampleColorImages=true \
   -dColorImageResolution=200 \
   -dDownsampleGrayImages=true \
   -dGrayImageResolution=200 \
   -dDownsampleMonoImages=true \
   -dMonoImageResolution=300 \
   -dColorImageDownsampleType=/Bicubic \
   -dGrayImageDownsampleType=/Bicubic \
   -dMonoImageDownsampleType=/Subsample \
   -dCompressFonts=true \
   -dSubsetFonts=true \
   -dDetectDuplicateImages=true \
   -dNOPAUSE -dQUIET -dBATCH \
   -sOutputFile=scanned.opt.pdf scanned.pdf

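The -dDetectDuplicateImages=true flag deduplicates repeated images, which is critical for documents with logos on every page. It is already the pdfwrite default; stating it explicitly documents the intent and guards against a changed default.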
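
Setting a target resolution does not mean every image is resampled. Ghostscript follows the Distiller convention of downsampling only when an image's effective resolution exceeds the target times a threshold factor. The sketch below assumes the documented default factor of 1.5 (the ColorImageDownsampleThreshold parameter); the function name will_downsample is hypothetical.

```shell
# will_downsample PIXEL_WIDTH PRINTED_WIDTH_INCHES TARGET_DPI [THRESHOLD]
# Prints "yes" if the image's effective resolution exceeds
# TARGET_DPI * THRESHOLD, "no" otherwise. THRESHOLD defaults to 1.5 (assumed).
will_downsample() {
  awk -v px="$1" -v inches="$2" -v target="$3" -v thr="${4:-1.5}" 'BEGIN {
    effective = px / inches
    if (effective > target * thr) print "yes"; else print "no"
  }'
}

will_downsample 4000 4 200   # 1000 DPI effective: prints yes
will_downsample 1000 4 200   # 250 DPI effective, under the 300 DPI cutoff: prints no
```

This explains why a scan at 250 DPI can pass through a 200 DPI pipeline untouched: it never crosses the cutoff.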

Font subsetting

A font file can be hundreds of kilobytes. A PDF that uses 30 characters of a font ideally embeds only those 30 glyphs. The technique is called subsetting and Ghostscript does it automatically with -dSubsetFonts=true. Most modern PDF generators already subset, but documents from older tools (and some web-to-PDF converters) embed full fonts.

To check whether fonts are subset, use pdffonts:

pdffonts document.pdf

The output column "emb" shows whether the font is embedded, and "sub" shows whether it is subset. Embedded but not subset means there is room for size reduction.
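
For batches, the same check can be scripted by filtering the pdffonts table. A sketch, assuming the column layout printed by current poppler releases, where "emb" and "sub" are the fifth and fourth fields from the right (the function name and the sample font names are made up for illustration):

```shell
# fonts_not_subset: reads pdffonts output on stdin and prints the names of
# fonts that are embedded (emb = yes) but not subset (sub = no).
# Columns are counted from the right because the "type" column can contain spaces.
fonts_not_subset() {
  awk 'NR > 2 && NF >= 7 && $(NF-4) == "yes" && $(NF-3) == "no" { print $1 }'
}

# Real use: pdffonts document.pdf | fonts_not_subset
# Demonstration on sample output:
fonts_not_subset <<'EOF'
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
ABCDEF+Helvetica                     Type 1C           WinAnsi          yes yes no       8  0
ArialMT                              TrueType          WinAnsi          yes no  no      12  0
Times-Roman                          Type 1            Standard         no  no  no      15  0
EOF
```

Only ArialMT is reported here: it is embedded in full, while the Helvetica entry is already a subset and Times-Roman is not embedded at all.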

# Force re-subset all fonts via Ghostscript
# -dNOPAUSE and -dBATCH keep gs from dropping into its interactive prompt.
gs -sDEVICE=pdfwrite \
   -dCompatibilityLevel=1.7 \
   -dEmbedAllFonts=true \
   -dSubsetFonts=true \
   -dCompressFonts=true \
   -dNOPAUSE -dQUIET -dBATCH \
   -sOutputFile=resubset.pdf input.pdf

"There are only two hard things in Computer Science: cache invalidation and naming things." Phil Karlton

For PDFs, font handling is the third hard thing. Subsets, encoding tables, ligatures, and CID mapping interact in subtle ways, and optimizing fonts can break copy-paste or accessibility if done wrong. Always verify text extraction after font optimization.

A complete optimization pipeline

The pipeline below combines Ghostscript image downsampling with qpdf structural optimization, then verifies the result against the source before finishing.

#!/usr/bin/env bash
set -euo pipefail

INPUT="$1"
OUTPUT="$2"
TMP=$(mktemp -d)
trap 'rm -rf "$TMP"' EXIT   # clean up even when a step fails under set -e

# Step 1: Probe source
SRC_SIZE=$(wc -c < "$INPUT")   # portable; stat -c%s is GNU-only
SRC_PAGES=$(pdfinfo "$INPUT" | awk '/^Pages:/ {print $2}')
SRC_TEXT=$(pdftotext "$INPUT" - | wc -c)
echo "Source: $INPUT ($SRC_SIZE bytes, $SRC_PAGES pages, $SRC_TEXT chars text)"

# Step 2: Ghostscript pass for image and font work
gs -sDEVICE=pdfwrite \
   -dCompatibilityLevel=1.7 \
   -dPDFSETTINGS=/ebook \
   -dDetectDuplicateImages=true \
   -dCompressFonts=true \
   -dSubsetFonts=true \
   -dNOPAUSE -dQUIET -dBATCH \
   -sOutputFile="$TMP/gs.pdf" "$INPUT"

# Step 3: qpdf structural pass
qpdf --linearize --object-streams=generate \
     --recompress-flate --compression-level=9 \
     "$TMP/gs.pdf" "$OUTPUT"

# Step 4: Verify
OUT_SIZE=$(wc -c < "$OUTPUT")   # portable; stat -c%s is GNU-only
OUT_PAGES=$(pdfinfo "$OUTPUT" | awk '/^Pages:/ {print $2}')
OUT_TEXT=$(pdftotext "$OUTPUT" - | wc -c)
echo "Output: $OUTPUT ($OUT_SIZE bytes, $OUT_PAGES pages, $OUT_TEXT chars text)"

if [[ "$SRC_PAGES" != "$OUT_PAGES" ]]; then
  echo "FAIL: page count changed"
  exit 1
fi

TEXT_DELTA=$((SRC_TEXT - OUT_TEXT))
TEXT_DELTA=${TEXT_DELTA#-}
if (( TEXT_DELTA > SRC_TEXT / 100 )); then
  echo "WARN: text content changed by more than 1%"
fi

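# Multiply before dividing: with scale=1, OUT/SRC alone would be truncated
# to one decimal (0.37 -> 0.3) and the percentage would be badly off.
REDUCTION=$(echo "scale=1; 100 - 100 * $OUT_SIZE / $SRC_SIZE" | bc)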
echo "Reduction: $REDUCTION%"

rm -rf "$TMP"

The verification step catches the most common optimization failures: pages dropped silently, text rasterized into images, or characters lost in font subsetting.
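
One failure the checks above do not cover is an output that is no smaller than the input, which can happen when the source was already well optimized. A minimal guard, offered as a sketch (the function name keep_smaller is an assumption, not part of the script above):

```shell
# keep_smaller SRC OUT
# If OUT is not smaller than SRC, overwrite OUT with a copy of SRC.
# Prints which version was kept. The body runs in a subshell so its
# variables do not leak into the caller.
keep_smaller() (
  src_size=$(($(wc -c < "$1")))
  out_size=$(($(wc -c < "$2")))
  if [ "$out_size" -ge "$src_size" ]; then
    cp -- "$1" "$2"
    echo "kept original"
  else
    echo "kept optimized"
  fi
)

# Example call at the end of a pipeline (paths are placeholders):
#   keep_smaller "$INPUT" "$OUTPUT"
```

Falling back to the original keeps the pipeline's contract simple: the output is never larger than the input.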

When to optimize, when not to

Not every PDF should be optimized. The decision matrix:

| PDF type | Optimize? | Why |
| --- | --- | --- |
| Office document for email | Yes, /ebook preset | Saves bandwidth, no quality concern |
| Marketing material for web | Yes, /ebook with linearize | Faster page load |
| Print-ready file for press | No, or /prepress only | Press requires high resolution |
| Legal contract or signed PDF | Only before signing | Any rewrite of a signed file invalidates the signature |
| PDF/A archival document | No | Optimization can break PDF/A compliance |
| Scientific paper with figures | Yes, /ebook preset | Figures rarely need print resolution |
| Form with fillable fields | qpdf only | Ghostscript can flatten form fields |
| OCR-generated PDF | Yes, via ocrmypdf --optimize 3 | Specifically designed for this case |

For signed PDFs, do not run Ghostscript or any other tool that rewrites the file. A digital signature covers a hash over fixed byte ranges, so re-rendering with Ghostscript and restructuring with qpdf both invalidate it; only incremental updates appended after the signed bytes leave it intact. Optimize before signing, or keep the signed original untouched and distribute the optimized file as an unsigned copy.
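
A pipeline can route signed files aside before any optimizer touches them. The sketch below is a heuristic, not a validator: it assumes a signature shows up as a literal /ByteRange key in the file's bytes, which is how signature dictionaries record the signed range. The function name is_signed_pdf is hypothetical.

```shell
# is_signed_pdf FILE
# Exit status 0 if FILE contains a /ByteRange key, non-zero otherwise.
# LC_ALL=C and -a make grep treat the PDF as raw bytes.
is_signed_pdf() {
  LC_ALL=C grep -a -q -e '/ByteRange' -- "$1"
}

# Example routing (file name is a placeholder):
#   if is_signed_pdf contract.pdf; then
#     echo "signed: leaving untouched"
#   fi
```

A false positive only means a file is skipped, which is the safe direction for this check.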

OCR-generated PDFs need different handling

A PDF produced by OCR contains both the scanned image and an invisible text layer aligned with it. Optimization that downsamples the image dramatically reduces size; optimization that drops the text layer makes the document unsearchable. ocrmypdf handles this correctly:

ocrmypdf --optimize 3 \
  --jpeg-quality 75 \
  --png-quality 75 \
  --output-type pdfa \
  scanned.pdf optimized.pdf

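The --optimize 3 level applies pngquant and jbig2 compression to the scanned images while preserving the text layer; levels 2 and 3 are lossy, so spot-check image quality. For typical scanned documents, this produces 70 to 90 percent size reduction. If the input already carries a text layer, ocrmypdf stops by default when it finds existing text; the --skip-text option lets it process such files without redoing OCR.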

PDF/A: when archival trumps size

PDF/A is the ISO standard for long-term archival. It mandates embedded fonts, bans JavaScript and external dependencies, and requires structural metadata. The result is a self-contained, format-stable PDF that should remain readable for decades.

PDF/A files are typically larger than equivalent regular PDFs because they cannot rely on system fonts or external resources. Do not try to optimize PDF/A archives below their compliant minimum size; you will break the standard.

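# Convert a PDF to PDF/A-2b
# Depending on the Ghostscript version, a PDFA_def.ps file naming an ICC
# output profile may also need to be passed before input.pdf.
gs -sDEVICE=pdfwrite \
   -dPDFA=2 -dPDFACompatibilityPolicy=1 \
   -sColorConversionStrategy=UseDeviceIndependentColor \
   -dNOPAUSE -dQUIET -dBATCH \
   -sOutputFile=archive.pdf input.pdf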

# Verify with veraPDF
verapdf --flavour 2b archive.pdf

For documents going into long-term archival contexts, including those used by legal-formation document storage at corpy.xyz or structured study materials at pass4-sure.us, PDF/A is the right format and aggressive size optimization is the wrong goal.

Batch processing pipelines

For high-volume environments, the optimization pipeline runs as a queue. The same architectural ideas apply as for image and audio batches: a manifest, parallel workers, verification at each step, and resumability.

#!/usr/bin/env bash
set -euo pipefail

INPUT_DIR="${1:-./incoming}"
OUTPUT_DIR="${2:-./optimized}"
PARALLEL="${3:-4}"

mkdir -p "$OUTPUT_DIR" ./logs

optimize_one() {
  local src="$1"
  local base
  base=$(basename "$src" .pdf)
  local out="$OUTPUT_DIR/$base.pdf"
  local log="./logs/$base.log"

  if [[ -f "$out" && "$out" -nt "$src" ]]; then
    echo "skip $base"
    return
  fi

  ./optimize-pdf.sh "$src" "$out" > "$log" 2>&1 \
    || echo "FAIL: $base" >> ./logs/failures.log
}

export -f optimize_one
export OUTPUT_DIR

# NUL-delimited names survive spaces and newlines in file names.
find "$INPUT_DIR" -type f -name "*.pdf" -print0 \
  | parallel -0 -j "$PARALLEL" --joblog ./logs/batch.log optimize_one

Ghostscript is single-threaded per process, so set PARALLEL to roughly the CPU core count. On a typical server, 100-page PDFs take roughly 5 to 30 seconds each, so a daily batch of 1,000 documents on eight cores finishes in about 10 minutes at the fast end and just over an hour at the slow end.
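
The batch estimate is documents times seconds per document, divided by the number of workers. A small sketch for redoing the arithmetic with local numbers, assuming perfectly even scaling across workers (the function name batch_minutes is illustrative):

```shell
# batch_minutes DOCS SECONDS_PER_DOC WORKERS
# Prints the estimated wall-clock time in minutes, one decimal place.
batch_minutes() {
  awk -v n="$1" -v s="$2" -v w="$3" 'BEGIN { printf "%.1f\n", n * s / w / 60 }'
}

batch_minutes 1000 5 8    # fast case from the text
batch_minutes 1000 30 8   # slow case from the text
```

Real batches scale less cleanly because of disk contention and uneven file sizes, so treat the result as a lower bound.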

"Premature optimization is the root of all evil." Donald Knuth

For PDFs the inverse is also true: postponed optimization is the root of bandwidth bills. Optimize as part of the producer pipeline, not as an afterthought when the file is already in distribution.

Accessibility considerations

A PDF that has been aggressively optimized may have lost its accessibility tree. Tagged PDFs include a logical structure that screen readers use to navigate: paragraph boundaries, heading levels, reading order, alt text on images. Ghostscript with default settings can strip this tree because the optimizer does not understand it.

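To give annotations and marked content the best chance of surviving, pass -dPreserveAnnots=true -dPreserveMarkedContent=true, then confirm the result with a checker, since a full re-render may still drop parts of the tag structure:

gs -sDEVICE=pdfwrite \
   -dCompatibilityLevel=1.7 \
   -dPDFSETTINGS=/ebook \
   -dPreserveAnnots=true \
   -dPreserveMarkedContent=true \
   -dNOPAUSE -dQUIET -dBATCH \
   -sOutputFile=accessible.pdf input.pdf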

Verify with PAC (PDF Accessibility Checker) or veraPDF. A PDF that passes WCAG 2.1 before optimization should still pass after; if not, the optimization stripped something it should not have.
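
A quick first check before running a full accessibility checker is whether the file still reports itself as tagged. The sketch below assumes the "Tagged:" line that poppler's pdfinfo prints; it is a coarse signal only and says nothing about the quality of the tags. The function name tagged_status is hypothetical.

```shell
# tagged_status: reads pdfinfo output on stdin and prints the value of the
# "Tagged:" line ("yes" or "no"), or "unknown" if the line is absent.
tagged_status() {
  awk '$1 == "Tagged:" { print $2; found = 1 } END { if (!found) print "unknown" }'
}

# Real use (file names are placeholders):
#   before=$(pdfinfo input.pdf | tagged_status)
#   after=$(pdfinfo accessible.pdf | tagged_status)
#   [ "$before" = "$after" ] || echo "WARN: tagging changed: $before -> $after"
```

A change from yes to no is a clear regression; a yes that stays yes still needs the full checker.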

For documents subject to accessibility regulations (Section 508, EN 301 549, EAA), accessibility takes priority over file size. Optimize within bounds that preserve the tagged structure.

Image format choices inside the PDF

PDF can embed images as JPEG (DCTDecode), JPEG 2000 (JPXDecode), CCITT Group 4 fax (for monochrome), JBIG2 (for compressed monochrome), or Flate-compressed raw bitmaps. The right format depends on content type.

| Image type | Best embedding format | Typical compression |
| --- | --- | --- |
| Photographs, color | JPEG quality 75-85 | 20:1 |
| Screenshots with sharp edges | Flate (lossless, with a PNG-style predictor) | 5:1 |
| Scanned text in monochrome | JBIG2 lossless | 50:1 |
| Scanned text in grayscale | JPEG 2000 lossless | 10:1 |
| Diagrams with few colors | Flate with palette | 30:1 |
| Vector content | Should be vector, not image | Not applicable |

The biggest savings come from converting full-color scanned text to JBIG2 monochrome where appropriate. ocrmypdf with `--optimize 3` makes that decision automatically.
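
To see which formats a document actually uses, the image inventory from pdfimages -list can be tallied by its "enc" column. A sketch, assuming the column layout of current poppler releases, where the encoding is the ninth field (the function name and the sample rows are made up for illustration):

```shell
# image_encodings: reads `pdfimages -list` output on stdin and prints
# "<encoding> <count>" lines, sorted by encoding name.
image_encodings() {
  awk 'NR > 2 && NF >= 9 { n[$9]++ } END { for (e in n) print e, n[e] }' | sort
}

# Real use: pdfimages -list document.pdf | image_encodings
# Demonstration on sample output:
image_encodings <<'EOF'
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2550  3300  gray    1   8  jpeg   no        10  0   300   300  526K 6.4%
   2     1 image    2550  3300  gray    1   1  jbig2  no        14  0   300   300 12.1K 1.2%
   3     2 image    2550  3300  gray    1   8  jpeg   no        18  0   300   300  498K 6.1%
EOF
```

A scanned text document dominated by jpeg entries at 8 bits per component is a likely candidate for monochrome conversion.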

Verification of accessibility, integrity, and structure

Every optimized PDF should pass four verification checks before distribution.

# Page count, font subsetting, image inventory
pdfinfo optimized.pdf
pdffonts optimized.pdf
pdfimages -list optimized.pdf

# Text extraction comparison
pdftotext source.pdf source.txt
pdftotext optimized.pdf optimized.txt
diff -q source.txt optimized.txt

# Accessibility (if regulated)
verapdf --flavour ua1 optimized.pdf

# Structure and linearization (for web delivery)
qpdf --check optimized.pdf

The four-step verification takes seconds per document and catches the failure modes that silent optimization can introduce.
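
A strict diff of extracted text often reports differences that are only whitespace, because a re-render can shift line breaks and spacing. A more tolerant comparison, as a sketch (the function name same_text is an assumption; it collapses every run of whitespace to one space before comparing):

```shell
# same_text FILE_A FILE_B
# Exit status 0 if the two files match after collapsing all whitespace runs
# to single spaces; non-zero otherwise. The body runs in a subshell so its
# variables do not leak into the caller.
same_text() (
  a=$(tr -s '[:space:]' ' ' < "$1")
  b=$(tr -s '[:space:]' ' ' < "$2")
  [ "$a" = "$b" ]
)

# Real use after the pdftotext calls above:
#   same_text source.txt optimized.txt || echo "WARN: extracted text differs"
```

Differences that survive this comparison are worth reading by eye, since they usually mean dropped or remapped characters.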

Cross-platform consistency

PDF rendering is supposed to be platform-independent. In practice, optimizing a PDF on one platform and rendering it on another can produce subtle differences. Three causes show up most often.

Font substitution: a PDF that references a system font (rather than embedding it) renders with whatever font the viewer has. Optimization that drops embedded fonts because they are present on the build machine produces a file that looks correct locally and wrong everywhere else. Always verify with pdffonts that all used fonts are embedded.

Color space drift: optimization that strips ICC profiles makes the renderer guess. The guess is usually sRGB, which is wrong for documents prepared in print color spaces. Preserve ICC profiles through the optimization unless the destination is known to be sRGB.

Form field flattening: Ghostscript can convert interactive form fields to rendered text, which makes the form smaller but uneditable. Use qpdf for form-bearing documents.

A QC pipeline that opens optimized PDFs in three different viewers (Adobe Reader, browser PDF.js, Apple Preview) and visually compares against the source catches most of these silent regressions.

Optimization for specific industries

Different industries have different optimization sweet spots. Legal documents need exact preservation; marketing materials prioritize file size; archival prefers PDF/A even at larger size; e-learning prioritizes accessibility.

| Industry | Priority | Avoid |
| --- | --- | --- |
| Legal contracts | Preserve signatures and exact appearance | Rewriting signed files with any optimizer |
| Marketing brochures | Aggressive size reduction | Loss of brand color accuracy |
| Scientific papers | Search and accessibility | Stripping bookmarks or alt text |
| Medical records | HIPAA compliance, no metadata leakage | Leaving DICOM tags in cover sheets |
| Government archival | PDF/A compliance, fonts embedded | Excluding required metadata |
| E-commerce catalogs | Web delivery speed | Image quality below 150 DPI for product shots |

A pipeline that asks "what industry is this for" before applying defaults produces better outcomes than a pipeline that runs the same Ghostscript command on everything.

Common mistakes that survive years of practice

Three errors recur. First, rewriting a signed PDF with any optimizer, Ghostscript or qpdf, invalidates the signature; optimize before signing and leave signed files untouched. Second, applying the /screen preset to documents intended for print produces visibly blurry output; match the preset to the destination. Third, skipping the verification step after optimization lets silent failures (page count changes, text loss, accessibility regressions) reach distribution.

A pipeline that respects these three rules ships PDFs that are smaller, faster to download, and hard to tell apart from the source at typical viewing sizes.
