PDF Optimization: Reduce Size Without Quality Loss

A 50 MB PDF for a four-page letter is not unusual. A 200 MB scanned book is routine. A government-issued certificate weighing 80 MB because the logo was embedded as a 4000 by 4000 pixel uncompressed bitmap is, somehow, common. PDFs bloat for reasons unrelated to their actual content: scanners default to 600 DPI when 200 is plenty, Word embeds full font files when subsets would do, image-heavy reports duplicate the same logo on every page, and "Save as PDF" routines pick conservative settings to avoid blame for missing pixels.

Optimizing a PDF is rarely about exotic compression. It is about removing waste. This guide walks through the structural sources of PDF bloat, the open-source tools that address each one (qpdf, Ghostscript, pdftk, mutool from MuPDF), and the verification steps that distinguish "smaller and still useful" from "smaller and broken."

Why PDFs are bigger than they need to be

A PDF is a container of objects: pages, fonts, images, vector paths, form fields, scripts, metadata. Most bloat comes from four sources.

Embedded images at higher resolution than needed. A photo printed at 4 inches wide on a page does not need to be more than 600 pixels wide for screen display or 1200 pixels for print. Documents routinely embed 4000-pixel sources.
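The figures above are just printed width multiplied by target resolution. A minimal sketch of that arithmetic, assuming 150 DPI for screen and 300 DPI for print; the helper name pixels_needed is illustrative and not part of any tool:

```shell
# pixels_needed WIDTH_INCHES DPI
# Prints the pixel width an image needs for a given printed width and
# resolution. Whole inches only, because shell arithmetic is integer-only.
pixels_needed() {
  echo $(( $1 * $2 ))
}

pixels_needed 4 150   # screen target for a 4-inch-wide photo → 600
pixels_needed 4 300   # print target for the same photo → 1200
```

Anything much wider than the print figure is a candidate for downsampling.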

Whole-font embeds. A document using twelve characters of a font may embed the entire 800 KB font file. Modern PDF generators subset fonts (embedding only the glyphs used), but older tools and some "Save as PDF" routines do not.

Uncompressed streams. PDF supports Flate compression of content and metadata streams and, since PDF 1.5, packing small objects into compressed object streams, but some tools leave everything uncompressed for compatibility. This adds 5 to 30 percent to the file for no benefit.

Duplicated content. A logo placed on every page of a 200-page report can appear 200 times in the file unless the generator deduplicates the image object. Most do, but some do not.

"Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away." Antoine de Saint-Exupery

PDF optimization is the engineering equivalent. Subtract everything that adds size without adding meaning.

The toolkit

Tool	Strengths	Limitations
qpdf	Lossless structural optimization, object stream rewriting, linearization	Does not recompress images
Ghostscript	Full PDF re-render with image downsampling, font subsetting	Can break form fields, alters PDF version
pdftk	Page operations (split, merge, rotate, watermark)	Stalled development; the original depends on the discontinued GCJ, and pdftk-java is the maintained port
MuPDF mutool	Fast structural compression, page extraction	Less feature-rich than Ghostscript
cpdf	Commercial, very capable, scriptable	Paid for non-academic use
ocrmypdf	OCR plus optimization in one pipeline	Requires Tesseract

For most pipelines, qpdf plus Ghostscript covers 95 percent of needs. The pattern is: Ghostscript for image and font optimization, qpdf for structural cleanup, in that order.

Lossless structural optimization with qpdf

qpdf does not touch the document's visible content. It rewrites the PDF's underlying structure in a more compact form: small objects packed into compressed object streams, uncompressed streams compressed, and unreferenced objects dropped. Typical savings: 5 to 25 percent on documents that have not been previously optimized.

# Basic structural optimization
qpdf --linearize --object-streams=generate \
  --compress-streams=y \
  --recompress-flate \
  --compression-level=9 \
  input.pdf output.pdf

The --linearize flag rearranges the file so web browsers can start displaying page 1 before downloading the rest. This is the format Adobe calls "Fast Web View." For any PDF that will be served over HTTP, linearize.

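The --object-streams=generate flag packs many small objects into compressed streams. The --recompress-flate flag tells qpdf to decompress and recompress streams that are already Flate-compressed, and --compression-level=9 selects the maximum deflate effort; together they are slower but yield measurable savings.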
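To see where a particular file lands in that savings range, compare sizes before and after the pass. A small sketch using only wc and shell arithmetic; the function name percent_saved is an assumption made for illustration:

```shell
# percent_saved ORIGINAL OPTIMIZED
# Prints the size reduction as a whole-number percentage.
percent_saved() {
  # $(( ... )) also strips the padding some wc implementations add.
  src_bytes=$(( $(wc -c < "$1") ))
  out_bytes=$(( $(wc -c < "$2") ))
  # Guard against an empty source file to avoid division by zero.
  [ "$src_bytes" -gt 0 ] || { echo "empty source file" >&2; return 1; }
  echo $(( (src_bytes - out_bytes) * 100 / src_bytes ))
}

# Example use after a qpdf run:
#   percent_saved input.pdf output.pdf
```

A negative result means the output grew, which can happen when a file was already well compressed and linearization added overhead.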

For batches:

# Pass filenames as arguments instead of parsing ls output
parallel -j 6 \
  qpdf --linearize --object-streams=generate \
    --recompress-flate --compression-level=9 \
    {} 'output/{/.}.opt.pdf' ::: input/*.pdf

Image downsampling with Ghostscript

The largest single optimization for typical documents is downsampling embedded images. Ghostscript's preset profiles handle this with one command:

gs -sDEVICE=pdfwrite \
   -dCompatibilityLevel=1.7 \
   -dPDFSETTINGS=/ebook \
   -dNOPAUSE -dQUIET -dBATCH \
   -sOutputFile=output.pdf input.pdf

The PDFSETTINGS values map to image resolution targets:

Preset	Color/grayscale DPI	Monochrome DPI	Use case	Typical reduction
/screen	72	300	Email, screen-only	70-90%
/ebook	150	300	E-readers, web preview	50-80%
/printer	300	1200	Office printing	20-50%
/prepress	300	1200	Professional print	10-30%
/default	Variable	Variable	Mixed	Variable

For most office documents that will be read on screens or printed on standard office printers, `/ebook` is the right choice. It produces files indistinguishable from the original at typical viewing sizes while cutting size dramatically.
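In a script, the preset choice can be reduced to a lookup on the delivery destination. A sketch that follows the use-case column of the table; the destination keywords (email, web, office-print, press) are made up for this example:

```shell
# preset_for DESTINATION
# Maps a delivery destination to a Ghostscript -dPDFSETTINGS value.
# The destination names are hypothetical labels, not Ghostscript options.
preset_for() {
  case "$1" in
    email|screen)  echo /screen ;;
    web|ereader)   echo /ebook ;;
    office-print)  echo /printer ;;
    press)         echo /prepress ;;
    *)             echo /default ;;
  esac
}

# Example use:
#   gs -sDEVICE=pdfwrite -dPDFSETTINGS="$(preset_for web)" \
#      -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf
```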

For scanned documents, override defaults with explicit parameters:

gs -sDEVICE=pdfwrite \
   -dCompatibilityLevel=1.7 \
   -dDownsampleColorImages=true \
   -dColorImageResolution=200 \
   -dDownsampleGrayImages=true \
   -dGrayImageResolution=200 \
   -dDownsampleMonoImages=true \
   -dMonoImageResolution=300 \
   -dColorImageDownsampleType=/Bicubic \
   -dGrayImageDownsampleType=/Bicubic \
   -dMonoImageDownsampleType=/Subsample \
   -dCompressFonts=true \
   -dSubsetFonts=true \
   -dDetectDuplicateImages=true \
   -dNOPAUSE -dQUIET -dBATCH \
   -sOutputFile=scanned.opt.pdf scanned.pdf

The -dDetectDuplicateImages=true flag deduplicates repeated images, which is critical for documents with logos on every page.
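Before settling on resolution targets, it helps to know which embedded images actually exceed them. A small sketch that filters the output of poppler's `pdfimages -list`; the function name is illustrative, and the column position of x-ppi (field 13) is an assumption based on recent poppler releases that is worth checking against your version:

```shell
# oversampled_images MAX_PPI
# Reads `pdfimages -list` output on stdin and prints "page num x-ppi" for
# every image whose horizontal resolution is above MAX_PPI.
# The first two lines of the listing are a header and a ruler, so skip them.
oversampled_images() {
  awk -v max="$1" 'NR > 2 && $13 + 0 > max { print $1, $2, $13 }'
}

# Example use:
#   pdfimages -list scanned.pdf | oversampled_images 200
```

If the filter prints nothing, downsampling to that target will not shrink the file and the savings have to come from elsewhere.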

Font subsetting

A font file can be hundreds of kilobytes. A PDF that uses 30 characters of a font ideally embeds only those 30 glyphs. The technique is called subsetting and Ghostscript does it automatically with -dSubsetFonts=true. Most modern PDF generators already subset, but documents from older tools (and some web-to-PDF converters) embed full fonts.

To check whether fonts are subset, use pdffonts:

pdffonts document.pdf

The output column "emb" shows whether the font is embedded, and "sub" shows whether it is subset. Embedded but not subset means there is room for size reduction.
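Reading those two columns by eye gets tedious across many files. A sketch of an awk filter that lists fonts which are embedded but not subset; it counts columns from the right because the font type column can contain spaces, and it assumes the usual pdffonts layout ending in emb, sub, uni, object, ID:

```shell
# unsubset_fonts
# Reads `pdffonts` output on stdin and prints the names of fonts that are
# embedded (emb = yes) but not subset (sub = no).
# The first two lines are a header and a ruler, so skip them.
unsubset_fonts() {
  awk 'NR > 2 && $(NF-4) == "yes" && $(NF-3) == "no" { print $1 }'
}

# Example use:
#   pdffonts document.pdf | unsubset_fonts
```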

# Force re-subset all fonts via Ghostscript
gs -sDEVICE=pdfwrite \
   -dCompatibilityLevel=1.7 \
   -dEmbedAllFonts=true \
   -dSubsetFonts=true \
   -dCompressFonts=true \
   -dNOPAUSE -dQUIET -dBATCH \
   -sOutputFile=resubset.pdf input.pdf

"There are only two hard things in Computer Science: cache invalidation and naming things." Phil Karlton

For PDFs, font handling is the third hard thing. Subsets, encoding tables, ligatures, and CID mapping interact in subtle ways, and optimizing fonts can break copy-paste or accessibility if done wrong. Always verify text extraction after font optimization.

A complete optimization pipeline

The pipeline below combines Ghostscript image downsampling with qpdf structural optimization, with verification at each step.

#!/usr/bin/env bash
set -euo pipefail

INPUT="$1"
OUTPUT="$2"
TMP=$(mktemp -d)
trap 'rm -rf "$TMP"' EXIT   # clean up even when a check below exits early

# Step 1: Probe source
SRC_SIZE=$(stat -c%s "$INPUT")
SRC_PAGES=$(pdfinfo "$INPUT" | awk '/^Pages:/ {print $2}')
SRC_TEXT=$(pdftotext "$INPUT" - | wc -c)
echo "Source: $INPUT ($SRC_SIZE bytes, $SRC_PAGES pages, $SRC_TEXT chars text)"

# Step 2: Ghostscript pass for image and font work
gs -sDEVICE=pdfwrite \
   -dCompatibilityLevel=1.7 \
   -dPDFSETTINGS=/ebook \
   -dDetectDuplicateImages=true \
   -dCompressFonts=true \
   -dSubsetFonts=true \
   -dNOPAUSE -dQUIET -dBATCH \
   -sOutputFile="$TMP/gs.pdf" "$INPUT"

# Step 3: qpdf structural pass
qpdf --linearize --object-streams=generate \
     --recompress-flate --compression-level=9 \
     "$TMP/gs.pdf" "$OUTPUT"

# Step 4: Verify
OUT_SIZE=$(stat -c%s "$OUTPUT")
OUT_PAGES=$(pdfinfo "$OUTPUT" | awk '/^Pages:/ {print $2}')
OUT_TEXT=$(pdftotext "$OUTPUT" - | wc -c)
echo "Output: $OUTPUT ($OUT_SIZE bytes, $OUT_PAGES pages, $OUT_TEXT chars text)"

if [[ "$SRC_PAGES" != "$OUT_PAGES" ]]; then
  echo "FAIL: page count changed"
  exit 1
fi

TEXT_DELTA=$((SRC_TEXT - OUT_TEXT))
TEXT_DELTA=${TEXT_DELTA#-}
if (( TEXT_DELTA > SRC_TEXT / 100 )); then
  echo "WARN: text content changed by more than 1%"
fi

# Multiply before dividing so scale=1 does not truncate the ratio to one digit
REDUCTION=$(echo "scale=1; ($SRC_SIZE - $OUT_SIZE) * 100 / $SRC_SIZE" | bc)
echo "Reduction: $REDUCTION%"

rm -rf "$TMP"

The verification step catches the most common optimization failures: pages dropped silently, text rasterized into images, or characters lost in font subsetting.

When to optimize, when not to

Not every PDF should be optimized. The decision matrix:

PDF type	Optimize?	Why
Office document for email	Yes, /ebook preset	Saves bandwidth, no quality concern
Marketing material for web	Yes, /ebook with linearize	Faster page load
Print-ready file for press	No, or /prepress only	Press requires high resolution
Legal contract or signed PDF	Only before signing	Any rewrite changes the signed bytes and invalidates the signature
PDF/A archival document	No	Optimization breaks PDF/A compliance
Scientific paper with figures	Yes, /ebook preset	Figures rarely need print-resolution
Form with fillable fields	qpdf only	Ghostscript can flatten form fields
OCR-generated PDF	Yes via ocrmypdf --optimize 3	Specifically designed for this case

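For signed PDFs, do not run any rewriting optimizer. A digital signature covers exact byte ranges of the file, so Ghostscript's re-render and qpdf's structural rewrite both change the signed bytes and invalidate it. A flag such as `--preserve-unreferenced` keeps objects in the file but does nothing for the signature's validity. Optimize before signing, or optimize a copy and have it signed again.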
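A batch pipeline has to recognize signed files before it decides how to handle them. A rough heuristic, offered as a sketch: signature dictionaries carry a /ByteRange key, so its presence in the raw bytes suggests the file is signed. The function name is illustrative, and the check does not validate anything:

```shell
# looks_signed FILE
# Succeeds (exit status 0) if FILE contains a /ByteRange key.
# Heuristic only: it detects the likely presence of a signature,
# not whether that signature is valid.
looks_signed() {
  LC_ALL=C grep -a -q '/ByteRange' "$1"
}

# Example use:
#   if looks_signed contract.pdf; then echo "signed, leaving untouched"; fi
```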

OCR-generated PDFs need different handling

A PDF produced by OCR contains both the scanned image and an invisible text layer aligned with it. Optimization that downsamples the image dramatically reduces size; optimization that drops the text layer makes the document unsearchable. ocrmypdf handles this correctly:

ocrmypdf --optimize 3 \
  --jpeg-quality 75 \
  --png-quality 75 \
  --output-type pdfa \
  scanned.pdf optimized.pdf

The --optimize 3 level applies pngquant and jbig2 compression to the scanned images while preserving the text layer. For typical scanned documents, this produces 70 to 90 percent size reduction.

PDF/A: when archival trumps size

PDF/A is the ISO standard for long-term archival. It mandates embedded fonts, bans JavaScript and external dependencies, and requires structural metadata. The result is a self-contained, format-stable PDF that should remain readable for decades.

PDF/A files are typically larger than equivalent regular PDFs because they cannot rely on system fonts or external resources. Do not try to optimize PDF/A archives below their compliant minimum size; you will break the standard.

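# Convert a PDF to PDF/A-2b
# (a fully compliant result usually also needs a PDFA_def.ps prologue that
# names an ICC output profile; see the Ghostscript documentation)
gs -sDEVICE=pdfwrite \
   -dPDFA=2 -dPDFACompatibilityPolicy=1 \
   -sColorConversionStrategy=UseDeviceIndependentColor \
   -dNOPAUSE -dQUIET -dBATCH \
   -sOutputFile=archive.pdf input.pdf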

# Verify with veraPDF
verapdf --flavour 2b archive.pdf

For documents going into long-term archival contexts, including those used by legal-formation document storage at corpy.xyz or structured study materials at pass4-sure.us, PDF/A is the right format and aggressive size optimization is the wrong goal.

Batch processing pipelines

For high-volume environments, the optimization pipeline runs as a queue. The same architectural ideas apply as for image and audio batches: a manifest, parallel workers, verification at each step, and resumability.

#!/usr/bin/env bash
set -euo pipefail

INPUT_DIR="${1:-./incoming}"
OUTPUT_DIR="${2:-./optimized}"
PARALLEL="${3:-4}"

mkdir -p "$OUTPUT_DIR" ./logs

optimize_one() {
  local src="$1"
  local base
  base=$(basename "$src" .pdf)
  local out="$OUTPUT_DIR/$base.pdf"
  local log="./logs/$base.log"

  if [[ -f "$out" && "$out" -nt "$src" ]]; then
    echo "skip $base"
    return
  fi

  ./optimize-pdf.sh "$src" "$out" > "$log" 2>&1 \
    || echo "FAIL: $base" >> ./logs/failures.log
}

export -f optimize_one
export OUTPUT_DIR

find "$INPUT_DIR" -name "*.pdf" -print0 \
  | parallel -0 -j "$PARALLEL" --joblog ./logs/batch.log optimize_one

Ghostscript is single-threaded per process, so set PARALLEL to roughly the CPU core count. On a typical server, 100-page PDFs optimize in roughly 5 to 30 seconds each, so a daily batch of 1,000 documents on eight cores takes between about 10 minutes and a little over an hour.
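The batch estimate is simple arithmetic: documents times seconds per document, divided by workers. A sketch that assumes perfectly even scheduling and no I/O contention, so real runs will be somewhat slower; the function name is illustrative:

```shell
# batch_minutes DOCS SECONDS_PER_DOC WORKERS
# Prints the estimated wall-clock time in whole minutes (rounded down).
batch_minutes() {
  echo $(( $1 * $2 / $3 / 60 ))
}

batch_minutes 1000 5 8    # fast case → 10
batch_minutes 1000 30 8   # slow case → 62
```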

"Premature optimization is the root of all evil." Donald Knuth

For PDFs the inverse is also true: postponed optimization is the root of bandwidth bills. Optimize as part of the producer pipeline, not as an afterthought when the file is already in distribution.

Accessibility considerations

A PDF that has been aggressively optimized may have lost its accessibility tree. Tagged PDFs include a logical structure that screen readers use to navigate: paragraph boundaries, heading levels, reading order, alt text on images. Ghostscript with default settings can strip this tree because the optimizer does not understand it.

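To improve the odds of preserving accessibility, pass -dPreserveAnnots=true and -dPreserveMarkedContent=true. How much of the tag structure survives depends on the Ghostscript version, so treat these flags as a starting point and check the result:

gs -sDEVICE=pdfwrite \
   -dCompatibilityLevel=1.7 \
   -dPDFSETTINGS=/ebook \
   -dPreserveAnnots=true \
   -dPreserveMarkedContent=true \
   -dNOPAUSE -dQUIET -dBATCH \
   -sOutputFile=accessible.pdf input.pdf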

Verify with PAC (PDF Accessibility Checker) or veraPDF. A PDF that passes WCAG 2.1 before optimization should still pass after; if not, the optimization stripped something it should not have.

For documents subject to accessibility regulations (Section 508, EN 301 549, EAA), accessibility takes priority over file size. Optimize within bounds that preserve the tagged structure.
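A quick first check, before running a full validator, is whether the optimized file still has a structure tree at all. The sketch below looks for the /StructTreeRoot key in an uncompressed dump of the file. It reads from stdin so it can be fed qpdf's QDF output, since the catalog may otherwise sit inside a compressed object stream. The function name is an assumption, and passing this check says nothing about the quality of the tags:

```shell
# is_tagged
# Reads an uncompressed PDF on stdin and succeeds (exit status 0) if it
# contains a /StructTreeRoot key, i.e. the catalog references a structure tree.
is_tagged() {
  LC_ALL=C grep -a -q '/StructTreeRoot'
}

# Example use (qpdf writes to standard output when the output name is "-"):
#   qpdf --qdf --object-streams=disable optimized.pdf - | is_tagged \
#     || echo "structure tree missing"
```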

Image format choices inside the PDF

PDF can embed images as JPEG (DCTDecode), JPEG 2000 (JPXDecode), CCITT Group 4 fax (for monochrome), JBIG2 (for compressed monochrome), or Flate-compressed raw bitmaps. The right format depends on content type.

Image type	Best embedding format	Typical compression
Photographs, color	JPEG quality 75-85	20:1
Screenshots with sharp edges	Flate (lossless), optionally with a PNG predictor	5:1
Scanned text in monochrome	JBIG2 lossless	50:1
Scanned text in grayscale	JPEG 2000 lossless	10:1
Diagrams with few colors	Flate with palette	30:1
Vector content	Should be vector, not image	Not applicable

The biggest savings come from converting full-color scanned text to JBIG2 monochrome where appropriate. ocrmypdf with `--optimize 3` makes that decision automatically.
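To see which encodings a file currently uses, the listing from poppler's `pdfimages -list` can be tallied by its enc column. A sketch, assuming that column is field 9 as in recent poppler releases; the function name is illustrative:

```shell
# encoding_tally
# Reads `pdfimages -list` output on stdin and prints one "encoding count"
# line per image encoding, sorted by encoding name.
# The first two lines of the listing are a header and a ruler, so skip them.
encoding_tally() {
  awk 'NR > 2 { n[$9]++ } END { for (e in n) print e, n[e] }' | sort
}

# Example use:
#   pdfimages -list optimized.pdf | encoding_tally
```

A scanned text document whose tally is dominated by jpeg rather than jbig2 or ccitt is a likely candidate for the monochrome conversion described above.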

Verification of accessibility, integrity, and structure

Every optimized PDF should pass four verification checks before distribution.

# Page count, font subsetting, image inventory
pdfinfo optimized.pdf
pdffonts optimized.pdf
pdfimages -list optimized.pdf

# Text extraction comparison
pdftotext source.pdf source.txt
pdftotext optimized.pdf optimized.txt
diff -q source.txt optimized.txt

# Accessibility (if regulated)
verapdf --flavour ua1 optimized.pdf

# Structural integrity and linearization tables (for web delivery)
qpdf --check optimized.pdf

The four-step verification takes seconds per document and catches the failure modes that silent optimization can introduce.
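An exact diff of extracted text tends to flag harmless changes in spacing and line breaks after a re-render. A looser comparison, sketched below, reduces each file to its sequence of words before comparing; the function name is an assumption, and the check still fails on reordered or missing words:

```shell
# same_text A.txt B.txt
# Succeeds (exit status 0) if both files contain the same words in the same
# order, ignoring differences in spacing, line breaks, and form feeds.
same_text() {
  # One word per line, with empty lines removed.
  a=$(tr -s '[:space:]' '\n' < "$1" | sed '/^$/d')
  b=$(tr -s '[:space:]' '\n' < "$2" | sed '/^$/d')
  [ "$a" = "$b" ]
}

# Example use:
#   same_text source.txt optimized.txt || echo "text content differs"
```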

Cross-platform consistency

PDF rendering is supposed to be platform-independent. In practice, optimizing a PDF on one platform and rendering it on another can produce subtle differences. Three causes show up most often.

Font substitution: a PDF that references a system font (rather than embedding it) renders with whatever font the viewer has. Optimization that drops embedded fonts because they are present on the build machine produces a file that looks correct locally and wrong everywhere else. Always verify with pdffonts that all used fonts are embedded.

Color space drift: optimization that strips ICC profiles makes the renderer guess. The guess is usually sRGB, which is wrong for documents prepared in print color spaces. Preserve ICC profiles through the optimization unless the destination is known to be sRGB.

Form field flattening: Ghostscript can convert interactive form fields to rendered text, which makes the form smaller but uneditable. Use qpdf for form-bearing documents.

A QC pipeline that opens optimized PDFs in three different viewers (Adobe Reader, browser PDF.js, Apple Preview) and visually compares against the source catches most of these silent regressions.

Optimization for specific industries

Different industries have different optimization sweet spots. Legal documents need exact preservation; marketing materials prioritize file size; archival prefers PDF/A even at larger size; e-learning prioritizes accessibility.

Industry	Priority	Avoid
Legal contracts	Preserve signatures and exact appearance	Rewriting signed files with any tool
Marketing brochures	Aggressive size reduction	Loss of brand color accuracy
Scientific papers	Search and accessibility	Stripping bookmarks or alt text
Medical records	HIPAA compliance, no metadata leakage	Leaving DICOM tags in cover sheets
Government archival	PDF/A compliance, fonts embedded	Excluding required metadata
E-commerce catalogs	Web delivery speed	Image quality below 150 DPI for product shots

A pipeline that asks "what industry is this for" before applying defaults produces better outcomes than a pipeline that runs the same Ghostscript command on everything.

Common mistakes that survive years of practice

Three errors recur. First, optimizing an already signed PDF, whether with Ghostscript or qpdf, breaks the signature; optimize before signing. Second, applying the /screen preset to documents intended for print produces visibly blurry output; match the preset to the destination. Third, skipping the verification step after optimization lets silent failures (page count changes, text loss, accessibility regressions) reach distribution.

A pipeline that respects these three rules ships PDFs that are smaller, faster to download, and indistinguishable from the source at typical viewing.

References

ISO 32000-2:2020, "Document management - Portable document format - Part 2: PDF 2.0." International Organization for Standardization.
ISO 19005-1:2005, "Document management - Electronic document file format for long-term preservation - Part 1: Use of PDF 1.4 (PDF/A-1)." International Organization for Standardization.
ISO 19005-2:2011, "Document management - Electronic document file format for long-term preservation - Part 2: Use of ISO 32000-1 (PDF/A-2)."
Adobe Systems, "PDF Reference, sixth edition, version 1.7." Adobe Systems Inc., 2006.
Berkenbilt, J., "qpdf: A Content-Preserving PDF Transformation System." Available: https://qpdf.sourceforge.io/
Artifex Software, "Ghostscript Documentation." Available: https://www.ghostscript.com/doc/current/Use.htm
Deutsch, P., "RFC 1951: DEFLATE Compressed Data Format Specification version 1.3." Internet Engineering Task Force, 1996. doi:10.17487/RFC1951
ISO 32000-1:2008, "Document management - Portable document format - Part 1: PDF 1.7."
