A 50 MB PDF for a four-page letter is not unusual. A 200 MB scanned book is routine. A government-issued certificate weighing 80 MB because the logo was embedded as a 4000 by 4000 pixel uncompressed bitmap is, somehow, common. PDFs bloat for reasons unrelated to their actual content: scanners default to 600 DPI when 200 is plenty, Word embeds full font files when subsets would do, image-heavy reports duplicate the same logo on every page, and "Save as PDF" routines pick conservative settings to avoid blame for missing pixels.
Optimizing a PDF is rarely about exotic compression. It is about removing waste. This guide walks through the structural sources of PDF bloat, the open-source tools that address each one (qpdf, Ghostscript, pdftk, mutool from MuPDF), and the verification steps that distinguish "smaller and still useful" from "smaller and broken."
Why PDFs are bigger than they need to be
A PDF is a container of objects: pages, fonts, images, vector paths, form fields, scripts, metadata. Most bloat comes from four sources.
Embedded images at higher resolution than needed. A photo printed at 4 inches wide on a page does not need to be more than 600 pixels wide for screen display or 1200 pixels for print. Documents routinely embed 4000-pixel sources.
Whole-font embeds. A document using twelve characters of a font may embed the entire 800 KB font file. Modern PDF generators subset fonts (embedding only the glyphs used), but older tools and some "Save as PDF" routines do not.
Uncompressed streams. PDF supports Flate compression for page content streams and, since PDF 1.5, for the object streams that pack many small objects together, but some tools write them uncompressed for compatibility. This adds 5 to 30 percent to the file for no benefit.
Duplicated content. A logo placed on every page of a 200-page report can appear 200 times in the file unless the generator deduplicates the image object. Most do, but some do not.
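The image-resolution arithmetic in the first item is easy to script. A minimal sketch, assuming 150 DPI for screen and 300 DPI for print (the densities implied by the 600 and 1200 pixel figures for a 4-inch photo); the function names are illustrative and not part of any tool:

```shell
# needed_px WIDTH_INCHES DPI -> pixels required across the placed width
needed_px() {
  LC_ALL=C awk -v w="$1" -v dpi="$2" 'BEGIN { printf "%d\n", w * dpi + 0.5 }'
}

# oversample_factor SOURCE_PX WIDTH_INCHES DPI -> how many times more pixels
# per axis the source carries than the target needs
oversample_factor() {
  LC_ALL=C awk -v px="$1" -v w="$2" -v dpi="$3" 'BEGIN { printf "%.1f\n", px / (w * dpi) }'
}

needed_px 4 150                # 4-inch photo for screen display -> 600
needed_px 4 300                # same photo for print -> 1200
oversample_factor 4000 4 150   # 4000-pixel source shown on screen -> 6.7
```

A factor of 6.7 per axis means the embedded image carries roughly 44 times more pixels than the page can use.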
"Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away." Antoine de Saint-Exupery
PDF optimization is the engineering equivalent. Subtract everything that adds size without adding meaning.
The toolkit
| Tool | Strengths | Limitations |
|---|---|---|
| qpdf | Lossless structural optimization, object stream rewriting, linearization | Does not recompress images |
| Ghostscript | Full PDF re-render with image downsampling, font subsetting | Can break form fields, alters PDF version |
| pdftk | Page operations (split, merge, rotate, watermark) | Original build depends on the discontinued GCJ; the maintained port is pdftk-java |
| MuPDF mutool | Fast structural compression, page extraction | Less feature-rich than Ghostscript |
| cpdf | Very capable, scriptable | Commercial use requires a paid license |
| ocrmypdf | OCR plus optimization in one pipeline | Requires Tesseract |
Lossless structural optimization with qpdf
qpdf does not touch the document's visible content. It rewrites the PDF's underlying structure in a more compact form: small objects are packed into compressed object streams, unreferenced objects are dropped, and existing streams are recompressed. Typical savings: 5 to 25 percent on documents that have not been previously optimized.
# Basic structural optimization
qpdf --linearize --object-streams=generate \
--compress-streams=y \
--recompress-flate \
--compression-level=9 \
input.pdf output.pdf
The --linearize flag rearranges the file so web browsers can start displaying page 1 before downloading the rest. This is the format Adobe calls "Fast Web View." For any PDF that will be served over HTTP, linearize.
The --object-streams=generate flag packs many small objects into compressed streams. The --recompress-flate flag re-runs deflate compression with maximum effort, which is slower but yields measurable savings.
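Whether a file is already linearized can be checked cheaply before re-running qpdf. The sketch below relies on the PDF specification's rule that the linearization parameter dictionary sits within the first 1024 bytes of the file. It only looks for the /Linearized key and does not validate the hint tables, so treat it as a quick filter rather than a substitute for qpdf --check-linearization. The function name is illustrative.

```shell
# looks_linearized FILE -> exit 0 if a /Linearized key appears in the first 1024 bytes
looks_linearized() {
  head -c 1024 "$1" | LC_ALL=C grep -aq '/Linearized'
}

# Usage (the file name is a placeholder):
#   if looks_linearized output.pdf; then echo "linearized"; else echo "not linearized"; fi
```

A file that was linearized and later modified by an incremental update still carries the key, which is one reason the heuristic can report false positives.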
For batches:
# Pass the file list with ::: instead of parsing ls output
mkdir -p output
parallel -j 6 \
    'qpdf --linearize --object-streams=generate \
        --recompress-flate --compression-level=9 \
        {} output/{/.}.opt.pdf' ::: input/*.pdf
Image downsampling with Ghostscript
The largest single optimization for typical documents is downsampling embedded images. Ghostscript's preset profiles handle this with one command:
gs -sDEVICE=pdfwrite \
-dCompatibilityLevel=1.7 \
-dPDFSETTINGS=/ebook \
-dNOPAUSE -dQUIET -dBATCH \
-sOutputFile=output.pdf input.pdf
The PDFSETTINGS values map to image resolution targets:
| Preset | Color/grayscale DPI | Monochrome DPI | Use case | Typical reduction |
|---|---|---|---|---|
| /screen | 72 | 300 | Email, screen-only | 70-90% |
| /ebook | 150 | 300 | E-readers, web preview | 50-80% |
| /printer | 300 | 1200 | Office printing | 20-50% |
| /prepress | 300 | 1200 | Professional print | 10-30% |
| /default | Variable | Variable | Mixed | Variable |
For scanned documents, override defaults with explicit parameters:
gs -sDEVICE=pdfwrite \
-dCompatibilityLevel=1.7 \
-dDownsampleColorImages=true \
-dColorImageResolution=200 \
-dDownsampleGrayImages=true \
-dGrayImageResolution=200 \
-dDownsampleMonoImages=true \
-dMonoImageResolution=300 \
-dColorImageDownsampleType=/Bicubic \
-dGrayImageDownsampleType=/Bicubic \
-dMonoImageDownsampleType=/Subsample \
-dCompressFonts=true \
-dSubsetFonts=true \
-dDetectDuplicateImages=true \
-dNOPAUSE -dQUIET -dBATCH \
-sOutputFile=scanned.opt.pdf scanned.pdf
The -dDetectDuplicateImages=true flag deduplicates repeated images, which is critical for documents with logos on every page.
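One detail worth checking before relying on the resolution flags: Ghostscript only downsamples an image whose effective resolution exceeds the target multiplied by a threshold (-dColorImageDownsampleThreshold and its Gray and Mono counterparts, 1.5 by default as far as the pdfwrite documentation describes it). A rough sketch of that decision follows; the function name and the sample scans are illustrative, and the real decision is made per image inside Ghostscript.

```shell
# will_downsample IMAGE_PX PLACED_INCHES TARGET_DPI [THRESHOLD]
will_downsample() {
  LC_ALL=C awk -v px="$1" -v inches="$2" -v target="$3" -v thr="${4:-1.5}" 'BEGIN {
    dpi = px / inches                      # effective resolution on the page
    if (dpi > target * thr)
      printf "yes (%.0f dpi -> %d dpi)\n", dpi, target
    else
      printf "no (%.0f dpi is within %.1fx of %d dpi)\n", dpi, thr, target
  }'
}

will_downsample 5100 8.5 200       # 600 DPI scan of a letter-width page
will_downsample 2125 8.5 200       # 250 DPI scan: under the 1.5x threshold, left alone
will_downsample 2125 8.5 200 1.0   # same scan with the threshold lowered to 1.0
```

If scans come out no smaller than expected, adding -dColorImageDownsampleThreshold=1.0 (and the Gray and Mono equivalents) to the command is the usual thing to try.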
Font subsetting
A font file can be hundreds of kilobytes. A PDF that uses 30 characters of a font ideally embeds only those 30 glyphs. The technique is called subsetting and Ghostscript does it automatically with -dSubsetFonts=true. Most modern PDF generators already subset, but documents from older tools (and some web-to-PDF converters) embed full fonts.
To check whether fonts are subset, use pdffonts:
pdffonts document.pdf
The output column "emb" shows whether the font is embedded, and "sub" shows whether it is subset. Embedded but not subset means there is room for size reduction.
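For a batch of files, reading the table by eye does not scale. The filter below is a small sketch that assumes the column layout printed by Poppler's pdffonts (two header lines, then rows ending in emb, sub, uni, and a two-part object ID); it counts fields from the right because the type column can contain spaces. The function name is made up for this example, and font names containing spaces would be truncated to their first word.

```shell
# unsubset_fonts: read pdffonts output on stdin and print the names of fonts
# that are embedded (emb = yes) but not subset (sub = no)
unsubset_fonts() {
  awk 'NR > 2 && NF >= 7 && $(NF-4) == "yes" && $(NF-3) == "no" { print $1 }'
}

# Usage (the file name is a placeholder):
#   pdffonts document.pdf | unsubset_fonts
```

An empty result means every embedded font is already subset, so a font pass is unlikely to save much.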
# Force re-subset all fonts via Ghostscript
gs -sDEVICE=pdfwrite \
-dCompatibilityLevel=1.7 \
-dEmbedAllFonts=true \
-dSubsetFonts=true \
-dCompressFonts=true \
-dNOPAUSE -dQUIET -dBATCH \
-sOutputFile=resubset.pdf input.pdf
"There are only two hard things in Computer Science: cache invalidation and naming things." Phil Karlton
For PDFs, font handling is the third hard thing. Subsets, encoding tables, ligatures, and CID mapping interact in subtle ways, and optimizing fonts can break copy-paste or accessibility if done wrong. Always verify text extraction after font optimization.
A complete optimization pipeline
The pipeline below combines Ghostscript image downsampling with qpdf structural optimization, with verification at each step.
#!/usr/bin/env bash
# optimize-pdf.sh INPUT OUTPUT: Ghostscript pass, qpdf pass, then verification
set -euo pipefail
if (( $# != 2 )); then
    echo "usage: $0 input.pdf output.pdf" >&2
    exit 2
fi
INPUT="$1"
OUTPUT="$2"
TMP=$(mktemp -d)
trap 'rm -rf "$TMP"' EXIT    # clean up even when a step fails
bytes() { wc -c | tr -d ' '; }    # portable byte count (stat -c%s is GNU-only)

# Step 1: Probe source
SRC_SIZE=$(bytes < "$INPUT")
SRC_PAGES=$(pdfinfo "$INPUT" | awk '/^Pages:/ {print $2}')
SRC_TEXT=$(pdftotext "$INPUT" - | bytes)
echo "Source: $INPUT ($SRC_SIZE bytes, $SRC_PAGES pages, $SRC_TEXT chars text)"

# Step 2: Ghostscript pass for image and font work
gs -sDEVICE=pdfwrite \
    -dCompatibilityLevel=1.7 \
    -dPDFSETTINGS=/ebook \
    -dDetectDuplicateImages=true \
    -dCompressFonts=true \
    -dSubsetFonts=true \
    -dNOPAUSE -dQUIET -dBATCH \
    -sOutputFile="$TMP/gs.pdf" "$INPUT"

# Step 3: qpdf structural pass (exit code 3 means "succeeded with warnings")
qpdf --linearize --object-streams=generate \
    --recompress-flate --compression-level=9 \
    "$TMP/gs.pdf" "$OUTPUT" || { rc=$?; (( rc == 3 )) || exit "$rc"; }

# Step 4: Verify
OUT_SIZE=$(bytes < "$OUTPUT")
OUT_PAGES=$(pdfinfo "$OUTPUT" | awk '/^Pages:/ {print $2}')
OUT_TEXT=$(pdftotext "$OUTPUT" - | bytes)
echo "Output: $OUTPUT ($OUT_SIZE bytes, $OUT_PAGES pages, $OUT_TEXT chars text)"
if [[ "$SRC_PAGES" != "$OUT_PAGES" ]]; then
    echo "FAIL: page count changed ($SRC_PAGES -> $OUT_PAGES)" >&2
    exit 1
fi
TEXT_DELTA=$((SRC_TEXT - OUT_TEXT))
TEXT_DELTA=${TEXT_DELTA#-}    # absolute value
if (( TEXT_DELTA > SRC_TEXT / 100 )); then
    echo "WARN: text content changed by more than 1%" >&2
fi
# awk keeps full precision; bc with scale=1 truncates the ratio before scaling
REDUCTION=$(awk -v s="$SRC_SIZE" -v o="$OUT_SIZE" \
    'BEGIN { printf "%.1f", (s ? (1 - o / s) * 100 : 0) }')
echo "Reduction: $REDUCTION%"
The verification step catches the most common optimization failures: pages dropped silently, text rasterized into images, or characters lost in font subsetting.
When to optimize, when not to
Not every PDF should be optimized. The decision matrix:
| PDF type | Optimize? | Why |
|---|---|---|
| Office document for email | Yes, /ebook preset | Saves bandwidth, no quality concern |
| Marketing material for web | Yes, /ebook with linearize | Faster page load |
| Print-ready file for press | No, or /prepress only | Press requires high resolution |
| Legal contract or signed PDF | Before signing only | Any full rewrite, including a qpdf pass, invalidates a digital signature |
| PDF/A archival document | No | Optimization breaks PDF/A compliance |
| Scientific paper with figures | Yes, /ebook preset | Figures rarely need print-resolution |
| Form with fillable fields | qpdf only | Ghostscript can flatten form fields |
| OCR-generated PDF | Yes via ocrmypdf --optimize 3 | Specifically designed for this case |
OCR-generated PDFs need different handling
A PDF produced by OCR contains both the scanned image and an invisible text layer aligned with it. Optimization that downsamples the image dramatically reduces size; optimization that drops the text layer makes the document unsearchable. ocrmypdf handles this correctly:
ocrmypdf --optimize 3 \
--jpeg-quality 75 \
--png-quality 75 \
--output-type pdfa \
scanned.pdf optimized.pdf
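The --optimize 3 level applies pngquant and jbig2 compression to the scanned images while preserving the text layer. For typical scanned documents, this produces 70 to 90 percent size reduction. If the input already carries a text layer, add --skip-text (or --redo-ocr) so that ocrmypdf does not abort on pages that already contain text.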
PDF/A: when archival trumps size
PDF/A is the ISO standard for long-term archival. It mandates embedded fonts, bans JavaScript and external dependencies, and requires structural metadata. The result is a self-contained, format-stable PDF that should remain readable for decades.
PDF/A files are typically larger than equivalent regular PDFs because they cannot rely on system fonts or external resources. Do not try to optimize PDF/A archives below their compliant minimum size; you will break the standard.
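# Convert a PDF to PDF/A-2b (strict conformance usually also needs a PDFA_def.ps
# file naming an ICC output profile, as described in the Ghostscript documentation)
gs -sDEVICE=pdfwrite \
-dPDFA=2 -dPDFACompatibilityPolicy=1 \
-sColorConversionStrategy=UseDeviceIndependentColor \
-dNOPAUSE -dQUIET -dBATCH \
-sOutputFile=archive.pdf input.pdf
# Verify with veraPDF
verapdf --flavour 2b archive.pdf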
For documents going into long-term archival contexts, including those used by legal-formation document storage at corpy.xyz or structured study materials at pass4-sure.us, PDF/A is the right format and aggressive size optimization is the wrong goal.
Batch processing pipelines
For high-volume environments, the optimization pipeline runs as a queue. The same architectural ideas apply as for image and audio batches: a manifest, parallel workers, verification at each step, and resumability.
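#!/usr/bin/env bash
set -euo pipefail
INPUT_DIR="${1:-./incoming}"
OUTPUT_DIR="${2:-./optimized}"
PARALLEL="${3:-4}"
mkdir -p "$OUTPUT_DIR" ./logs

optimize_one() {
    local src="$1"
    local base
    base=$(basename "$src" .pdf)
    local out="$OUTPUT_DIR/$base.pdf"
    local log="./logs/$base.log"
    # Resumability: skip anything already optimized and newer than its source
    if [[ -f "$out" && "$out" -nt "$src" ]]; then
        echo "skip $base"
        return
    fi
    # Remove the output on failure so a rerun does not mistake it for a finished job
    if ! ./optimize-pdf.sh "$src" "$out" > "$log" 2>&1; then
        rm -f "$out"
        echo "FAIL: $base" >> ./logs/failures.log
    fi
}
export -f optimize_one
export OUTPUT_DIR

# Exported functions need bash as the job shell; NUL delimiters keep odd file names intact
find "$INPUT_DIR" -name "*.pdf" -print0 \
    | SHELL=$(command -v bash) parallel -0 -j "$PARALLEL" --joblog ./logs/batch.log optimize_one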
Ghostscript runs single-threaded per process, so set PARALLEL to roughly the CPU core count. On a typical server, 100-page PDFs optimize in roughly 5 to 30 seconds each, so a daily batch of 1,000 documents on eight cores finishes in about 10 minutes at the fast end and about an hour at the slow end.
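A back-of-the-envelope check of batch duration, assuming work spreads perfectly evenly across workers (real batches have stragglers and I/O overhead, so read the result as a lower bound). The function name is illustrative:

```shell
# batch_minutes DOCS SECONDS_PER_DOC WORKERS -> estimated wall-clock minutes
batch_minutes() {
  LC_ALL=C awk -v n="$1" -v s="$2" -v w="$3" 'BEGIN { printf "%.1f\n", n * s / w / 60 }'
}

batch_minutes 1000 5 8     # fast documents -> 10.4
batch_minutes 1000 30 8    # slow documents -> 62.5
```

At 30 seconds per document the eight-core batch takes just over an hour, so the slow end of the range is the figure to plan capacity around.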
"Premature optimization is the root of all evil." Donald Knuth
For PDFs the inverse is also true: postponed optimization is the root of bandwidth bills. Optimize as part of the producer pipeline, not as an afterthought when the file is already in distribution.
Accessibility considerations
A PDF that has been aggressively optimized may have lost its accessibility tree. Tagged PDFs include a logical structure that screen readers use to navigate: paragraph boundaries, heading levels, reading order, alt text on images. Ghostscript with default settings can strip this tree because the optimizer does not understand it.
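To preserve accessibility, pass -dPreserveAnnots=true -dPreserveMarkedContent=true, then confirm that the tags actually survived, because support for tagged structure varies between Ghostscript versions:
gs -sDEVICE=pdfwrite \
-dCompatibilityLevel=1.7 \
-dPDFSETTINGS=/ebook \
-dPreserveAnnots=true \
-dPreserveMarkedContent=true \
-dNOPAUSE -dQUIET -dBATCH \
-sOutputFile=accessible.pdf input.pdf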
Verify with PAC (PDF Accessibility Checker) or veraPDF. A PDF that passes WCAG 2.1 before optimization should still pass after; if not, the optimization stripped something it should not have.
For documents subject to accessibility regulations (Section 508, EN 301 549, EAA), accessibility takes priority over file size. Optimize within bounds that preserve the tagged structure.
Image format choices inside the PDF
PDF can embed images as JPEG (DCTDecode), JPEG 2000 (JPXDecode), CCITT Group 4 fax (for monochrome), JBIG2 (for compressed monochrome), or Flate-compressed raw bitmaps. The right format depends on content type.
| Image type | Best embedding format | Typical compression |
|---|---|---|
| Photographs, color | JPEG quality 75-85 | 20:1 |
| Screenshots with sharp edges | Flate (lossless, optionally with PNG predictors) | 5:1 |
| Scanned text in monochrome | JBIG2 lossless | 50:1 |
| Scanned text in grayscale | JPEG 2000 lossless | 10:1 |
| Diagrams with few colors | Flate with palette | 30:1 |
| Vector content | Keep as vector, do not rasterize | Not applicable |
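
The ratios in the table translate into byte counts once the uncompressed size is known. A small sketch of that arithmetic, assuming 8-bit RGB for photographs and 1-bit samples for monochrome scans; the function names are illustrative, and the ratios are the typical figures from the table rather than guarantees:

```shell
# raw_bytes WIDTH HEIGHT COMPONENTS BITS -> uncompressed size in bytes
# (each row is padded to a whole number of bytes, as PDF image data requires)
raw_bytes() {
  echo $(( ( ($1 * $3 * $4 + 7) / 8 ) * $2 ))
}

# est_bytes RAW_BYTES RATIO -> rough embedded size at a RATIO:1 compression ratio
est_bytes() {
  echo $(( $1 / $2 ))
}

raw_bytes 4000 4000 3 8                      # 4000x4000 RGB bitmap -> 48000000
est_bytes "$(raw_bytes 4000 4000 3 8)" 20    # same image as JPEG at about 20:1 -> 2400000
est_bytes "$(raw_bytes 2550 3300 1 1)" 50    # 1-bit letter page at 300 DPI, about 50:1 -> 21054
```

A single 4000 by 4000 RGB bitmap is 48 MB before compression, which is how one uncompressed logo comes to dominate a file.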
Verification of accessibility, integrity, and structure
Every optimized PDF should pass four verification checks before distribution.
# Page count, font subsetting, image inventory
pdfinfo optimized.pdf
pdffonts optimized.pdf
pdfimages -list optimized.pdf
# Text extraction comparison
pdftotext source.pdf source.txt
pdftotext optimized.pdf optimized.txt
diff -q source.txt optimized.txt
# Accessibility (if regulated)
verapdf --flavour ua1 optimized.pdf
# Linearization (for web delivery)
qpdf --check optimized.pdf
The four-step verification takes seconds per document and catches the failure modes that silent optimization can introduce.
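A byte-for-byte diff of extracted text can report a difference when only line breaks or spacing moved, which a re-rendering pass can cause. The helper below is a sketch of a more tolerant comparison that collapses all whitespace before comparing; same_text is a hypothetical name, and it loads both files into memory, so it suits ordinary documents rather than very large extractions.

```shell
# same_text A.txt B.txt -> exit 0 if the files match once whitespace is collapsed
same_text() (
  a=$(tr -s '[:space:]' ' ' < "$1") || exit 2
  b=$(tr -s '[:space:]' ' ' < "$2") || exit 2
  a=${a# }; a=${a% }    # ignore leading and trailing whitespace
  b=${b# }; b=${b% }
  [ "$a" = "$b" ]
)

# Usage (file names follow the pdftotext commands in this section):
#   if same_text source.txt optimized.txt; then echo "text preserved"; else echo "text differs"; fi
```

A mismatch here points at real character loss or reordering and deserves a manual look at the affected pages.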
Cross-platform consistency
PDF rendering is supposed to be platform-independent. In practice, optimizing a PDF on one platform and rendering it on another can produce subtle differences. Three causes show up most often.
Font substitution: a PDF that references a system font (rather than embedding it) renders with whatever font the viewer has. Optimization that drops embedded fonts because they are present on the build machine produces a file that looks correct locally and wrong everywhere else. Always verify with pdffonts that all used fonts are embedded.
Color space drift: optimization that strips ICC profiles makes the renderer guess. The guess is usually sRGB, which is wrong for documents prepared in print color spaces. Preserve ICC profiles through the optimization unless the destination is known to be sRGB.
Form field flattening: Ghostscript can convert interactive form fields to rendered text, which makes the form smaller but uneditable. Use qpdf for form-bearing documents.
A QC pipeline that opens optimized PDFs in three different viewers (Adobe Reader, browser PDF.js, Apple Preview) and visually compares against the source catches most of these silent regressions.
Optimization for specific industries
Different industries have different optimization sweet spots. Legal documents need exact preservation; marketing materials prioritize file size; archival prefers PDF/A even at larger size; e-learning prioritizes accessibility.
| Industry | Priority | Avoid |
|---|---|---|
| Legal contracts | Preserve signatures and exact appearance | Rewriting files after signing (Ghostscript or qpdf) |
| Marketing brochures | Aggressive size reduction | Loss of brand color accuracy |
| Scientific papers | Search and accessibility | Stripping bookmarks or alt text |
| Medical records | HIPAA compliance, no metadata leakage | Leaving DICOM tags in cover sheets |
| Government archival | PDF/A compliance, fonts embedded | Excluding required metadata |
| E-commerce catalogs | Web delivery speed | Image quality below 150 DPI for product shots |
Common mistakes that survive years of practice
Three errors recur. First, rewriting a digitally signed PDF breaks the signature, because the signature covers exact byte ranges of the file; this holds for Ghostscript and for a full qpdf rewrite alike, so optimize before signing and leave signed files untouched. Second, applying the /screen preset to documents intended for print produces visibly blurry output; match the preset to the destination. Third, skipping the verification step after optimization lets silent failures (page count changes, text loss, accessibility regressions) reach distribution.
A pipeline that respects these three rules ships PDFs that are smaller, faster to download, and indistinguishable from the source at typical viewing.