Converting Modern and Legacy Formats: A Guide

Every working archive eventually faces the same problem: the format the original was written in is no longer the format the team uses, and the longer the gap, the more uncertain the conversion. A 1992 WordPerfect document, a 1998 dBase customer database, a 2003 RealAudio interview, a 2007 Photoshop file with effects layers from a discontinued plugin: each is a small archaeological dig. This guide is a practical field manual for the format conversions that actually come up in production, the ones that fail silently, and the toolchains that produce reproducible, faithful migrations.

Why Legacy Conversion Is Hard

Modern formats tend to be specified by ISO or IETF, parsed by widely-tested open libraries, and self-documenting through magic numbers and embedded schemas. Legacy formats often were not. WordPerfect 5.1 stored formatting as inline binary tokens whose meaning depended on the loaded printer driver. AutoCAD DWG was reverse-engineered from binary blobs because Autodesk never published a specification. RealAudio 5 used a perceptual codec whose decoder was kept proprietary. The conversion pain is rarely the data; it is the interpretation of the data.

A second source of pain is metadata loss. A binary DOC contains the author, revision marks, embedded objects, comments, and tracked changes. Convert it through a five-stage pipeline and most of that metadata silently disappears. Forensic and legal contexts care; product teams often do not realize they have lost it until an audit.

"When you migrate a digital archive, you are not moving data. You are translating an interpretation. The original interpretation lived in a piece of software you no longer have, running on a machine that no longer boots." Donald Knuth, paraphrased from a 2005 lecture on TeX font compatibility.

The Categories of Legacy Conversion

Conversion problems fall into recognizable categories.

Document formats. DOC, RTF, WPD, WP5, WRI, SDW, AppleWorks. The common path is to load with LibreOffice (which carries the most permissive set of importers) and save as ODT or DOCX, then export to PDF/A for archive.

Spreadsheet and database. XLS, WK1 (Lotus 1-2-3), DBF (dBase, FoxPro), MDB (Access). Migration target is usually SQLite or PostgreSQL, with Parquet for analytical workloads.

Raster image. BMP, PCX, TGA, TIFF, PSD, PCD (Photo CD), Kodak DCR. Migration target is PNG for graphics, JPEG XL or AVIF for photographs, with TIFF retained as the master.

Vector image. WMF, EMF, AI (older versions), CDR (CorelDRAW), DWG. Migration target is SVG for web, PDF for print.

Audio. WAV, AIFF, AU, RA, RM, MOD, S3M. Migration target is FLAC for archival, Opus or AAC for distribution.

Video. AVI, MOV, RM, ASF/WMV, FLV. Migration target is Matroska or MP4 with H.264 or AV1 video and Opus audio.

A Practical Conversion Matrix

The following table summarizes the conversions I run most often, with the tool I trust for each.

From	To	Tool	Loss profile
DOC (Word 97-2003)	DOCX	LibreOffice headless	Macros, some field codes
WPD / WP5	ODT	LibreOffice (libwpd)	Tab leaders, custom dictionaries
RTF	DOCX	Pandoc or LibreOffice	OLE objects
TIFF (LZW)	PNG	ImageMagick	None if 8-bit, banding if 16 to 8
BMP	PNG	ImageMagick	None
PSD (flattened)	PNG / TIFF	ImageMagick or psdtool	Layers, effects
WAV (PCM)	FLAC	flac CLI	None
AIFF	FLAC	sox	None
AVI (DV / Cinepak)	MKV (FFV1)	ffmpeg	None at FFV1, lossy at H.264
RM / RA	MP4 / MP3	ffmpeg with realmedia demuxer	Generation loss only
DBF	SQLite	dbfread + sqlite3	Memo fields if FPT missing
MDB	SQLite	mdbtools	Forms, queries, reports

Document Conversion in Practice

The single most useful tool for legacy documents is LibreOffice in headless mode. It carries import filters for a long tail of formats that no other open tool handles.

# Convert a directory of DOC files to DOCX
libreoffice --headless --convert-to docx --outdir ./out ./legacy/*.doc

# Convert WordPerfect to ODT
libreoffice --headless --convert-to odt ./old.wpd

# Bulk export to PDF/A-2b for archive (JSON filter options need LibreOffice 7.4 or later)
libreoffice --headless --convert-to 'pdf:writer_pdf_Export:{"SelectPdfVersion":{"type":"long","value":"2"}}' ./out/*.docx

# Pandoc as a faster RTF pipeline
pandoc -f rtf -t docx old.rtf -o new.docx

Pandoc is faster than LibreOffice for RTF, Markdown, and HTML, but it does not parse binary DOC or WordPerfect. Use it for the cases it handles and LibreOffice for everything else.
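
The routing rule above can be sketched as a small dispatcher. This is a minimal illustration, not a definitive pipeline: the function name build_command and the PANDOC_READERS table are hypothetical, and the command lines are the same ones shown earlier in this section. The function only builds the argument list; running it (for example with subprocess.run) is left to the caller.

```python
from pathlib import Path

# Extensions routed to Pandoc (assumed mapping); everything else goes to LibreOffice.
PANDOC_READERS = {".rtf": "rtf", ".md": "markdown", ".html": "html"}


def build_command(src: str, outdir: str) -> list[str]:
    """Return an argv list that converts src to DOCX inside outdir."""
    path = Path(src)
    reader = PANDOC_READERS.get(path.suffix.lower())
    if reader is not None:
        target = Path(outdir) / (path.stem + ".docx")
        return ["pandoc", "-f", reader, "-t", "docx", str(path), "-o", str(target)]
    # Binary DOC, WordPerfect, and the rest of the long tail fall through here.
    return ["libreoffice", "--headless", "--convert-to", "docx",
            "--outdir", str(outdir), str(path)]
```

Keeping the routing in one table makes it easy to record in the provenance notes which tool handled which file.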

"The right rule for archival format selection is: open specification, multiple independent implementations, no required network calls, no licensed dependencies. If a format fails any of those tests, the archive is fragile." Bruce Schneier, generalizing from cryptographic agility principles.

Image Conversion in Practice

Legacy raster formats are mostly well handled by ImageMagick or libvips, both of which read more than 80 formats through a single interface.

# Convert a BMP scan to PNG with metadata stripped
magick old.bmp -strip -define png:compression-level=9 new.png

# Flatten a PSD to PNG, preserving the composited image
magick 'old.psd[0]' new.png

# TIFF (LZW) to JPEG XL at high quality (still lossy at 95; keep the TIFF as the master)
magick scan.tif -quality 95 scan.jxl

# Bulk convert a directory of TGAs
for f in *.tga; do
  magick "$f" -strip "${f%.tga}.png"
done

# Inspect format internals before conversion
identify -verbose suspicious.tif | head -50

For very large TIFFs (geospatial, gigapixel scans) prefer libvips, which streams rather than buffering the whole image, and GDAL for georeferenced output.
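
Before inspecting format internals, it can help to check that a file is what its extension claims. The sketch below compares the leading bytes against a handful of well-known raster signatures. The function names and the table are illustrative assumptions, and the list is deliberately short: TGA and PCX have no dependable signature, and BigTIFF is not covered.

```python
# Leading-byte signatures for common raster formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"8BPS", "PSD"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"BM", "BMP"),
]


def sniff_image(data: bytes) -> str:
    """Return a format name based on the first bytes, or 'unknown'."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"


def sniff_file(path: str) -> str:
    """Read just enough of the file to check its signature."""
    with open(path, "rb") as fh:
        return sniff_image(fh.read(16))
```

A mismatch between the sniffed format and the extension is a good reason to pull the file out of the batch and look at it by hand.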

Audio Conversion in Practice

Legacy audio is mostly easy because WAV and AIFF are usually plain PCM containers: conversion to FLAC is lossless, and conversion to Opus is perceptually transparent at sensible bitrates. The hard cases are proprietary streams: RealAudio, MP4 files carrying closed codecs, DRM-locked AAC.

# Lossless WAV to FLAC archival
flac --best --verify song.wav -o song.flac

# AIFF to FLAC via sox
sox interview.aiff interview.flac

# RealAudio decode (works for non-DRM streams)
ffmpeg -i interview.rm -c:a flac interview.flac

# Concatenate digitized tape segments without resampling
# (list.txt holds one line per segment, e.g. file 'side_a_01.wav')
ffmpeg -f concat -i list.txt -c:a flac archive.flac

# Inspect codec and metadata
ffprobe -v error -show_streams suspicious.ra

FLAC is the archival default. It is lossless, royalty-free, has multiple independent decoders, embeds rich metadata, and decompresses to byte-identical PCM.

Video Conversion in Practice

Legacy video is the hardest category because the original codecs were often perceptual and the original encoders are gone. The honest principle: archival masters of analog tape captures should be FFV1 (lossless) inside Matroska. Distribution copies can be H.264, HEVC, or AV1.

# Lossless archival of a DV-AVI capture
ffmpeg -i tape.avi -c:v ffv1 -level 3 -coder 1 -context 1 -g 1 \
  -slicecrc 1 -c:a flac tape.mkv

# Distribution copy from the FFV1 master
ffmpeg -i tape.mkv -c:v libx264 -crf 18 -preset slow \
  -c:a aac -b:a 192k tape.mp4

# RealMedia rescue
ffmpeg -i interview.rm -c:v libx264 -crf 23 \
  -c:a aac interview.mp4

# Inspect the codec graph
ffprobe -v error -show_streams -show_format suspicious.flv
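
When a whole directory has to be triaged, the ffprobe report is easier to act on as JSON (add -of json to the command above). The following sketch pulls out the codec of every stream and flags anything outside a trusted set. The function names and the trusted list are assumptions for illustration; the field names codec_type and codec_name are the ones ffprobe emits in its streams array.

```python
import json


def summarize_streams(ffprobe_json: str) -> list[tuple[str, str]]:
    """Return (codec_type, codec_name) pairs from ffprobe JSON output."""
    report = json.loads(ffprobe_json)
    return [
        (s.get("codec_type", "unknown"), s.get("codec_name", "unknown"))
        for s in report.get("streams", [])
    ]


def needs_review(ffprobe_json: str,
                 trusted=("ffv1", "flac", "h264", "aac", "pcm_s16le")) -> bool:
    """True if any stream uses a codec outside the trusted set (assumed policy)."""
    return any(name not in trusted for _, name in summarize_streams(ffprobe_json))
```

Files that need review are the ones worth rescuing first, while a working decoder still exists.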

"Working software is the primary measure of progress." That principle from the Agile Manifesto, which Kent Beck co-signed, carries over to format choice in long-term storage: for an archive, working software is the primary measure of survival.

Structured Data Migration

DBF, MDB, and ancient SQL dumps are common in records-management work. A practical pipeline:

# DBF to SQLite via Python
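python3 -c "
import dbfread, sqlite3
# load=True reads every record into memory; pass encoding='cp437' (or similar)
# when the language-driver byte in the DBF header is missing or wrong
t = dbfread.DBF('CUSTOMERS.DBF', load=True)
con = sqlite3.connect('out.db')
# Quote column names so fields called ORDER or GROUP do not break the DDL
cols = ', '.join(f'\"{f.name}\" TEXT' for f in t.fields)
con.execute(f'CREATE TABLE customers ({cols})')
placeholders = ', '.join('?' for _ in t.fields)
con.executemany(
    f'INSERT INTO customers VALUES ({placeholders})',
    [tuple(r.values()) for r in t]
)
con.commit()
con.close()
"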

# MDB to SQLite via mdbtools
mdb-schema legacy.mdb sqlite | sqlite3 out.db
mdb-export -I sqlite legacy.mdb Orders | sqlite3 out.db

# Inspect a CSV with unknown encoding
file -i suspicious.csv
iconv -f WINDOWS-1252 -t UTF-8 suspicious.csv > clean.csv

Encoding is the silent killer of legacy data migrations. CP-437, Windows-1252, MacRoman, Shift-JIS, and Latin-9 all coexist in archives that nominally store "text." Always run file -i before assuming UTF-8.
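
A cheap first pass, before any heuristic detector, is to ask whether the bytes are valid ASCII or valid UTF-8 at all. The sketch below does only that; the function name and the three labels are illustrative assumptions. Anything labelled not-utf8 still has to be identified with file -i, a detector, or by eye, and a short legacy file can occasionally pass as valid UTF-8 by coincidence.

```python
def classify_text(data: bytes) -> str:
    """Classify raw bytes as 'ascii', 'utf-8', or 'not-utf8'."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Some legacy code page (or binary data); needs real identification.
        return "not-utf8"
```

For example, classify_text(open("suspicious.csv", "rb").read()) separates the files that can be left alone from the ones that need an explicit source encoding.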

A Comparative Survival Table

The following table summarizes formats by long-term survivability, based on public specifications, implementation diversity, and historical track record.

Format	Specification status	Decoder diversity	30-year outlook
PDF/A	ISO 19005	Many	Excellent
PNG	ISO 15948, RFC 2083	Many	Excellent
FLAC	RFC 9639	Many	Excellent
TIFF	Adobe TIFF 6.0 (ISO 12639 covers TIFF/IT)	Many	Good
ODT	ISO 26300	LibreOffice + others	Good
DOCX	ISO 29500	Word, LibreOffice, Pages	Good
DOC (binary)	Vendor-published ([MS-DOC], 2008), not standardized	LibreOffice, Word	Fair, ages out
WPD	Reverse-engineered	LibreOffice via libwpd	Fragile
MOV (legacy codecs)	Closed	Apple, ffmpeg partial	Fragile
RealMedia	Closed	ffmpeg partial	Endangered
Lotus 1-2-3 WK1	Documented in Lotus manuals	LibreOffice	Endangered

Archival Practice

The discipline of digital preservation has produced a small set of rules that work.

Keep the original. Always. The migration may be wrong; the original is the only ground truth.

Migrate to two formats: a presentation copy (PDF/A, PNG, FLAC) and a high-fidelity reversible copy (FFV1, original-format-as-binary, ODT). The two-format rule means a future archivist can trust at least one path.

Document the conversion. A provenance.txt next to every migrated artifact recording the source format, the tool and version, the command line, and the date of conversion is the difference between an archive and a heap of bytes.

Validate after migration. JHOVE for documents, MediaConch for video, flac --test for audio. A conversion that produces no errors but mangles the content is the most expensive failure mode.
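
The provenance record described above is small enough to generate automatically. The following is a minimal sketch, not a standard: the field names, the .provenance.txt suffix, and the function names are assumptions, and the SHA-256 checksums are an addition beyond the four items listed, included so that a later audit can confirm which bytes the record refers to.

```python
import hashlib
from datetime import date
from pathlib import Path


def sha256_of(path) -> str:
    """Hash a file in 1 MiB chunks so large masters do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_provenance(source, output, source_format, tool, version, command,
                     when=None) -> Path:
    """Write <output>.provenance.txt next to the migrated artifact."""
    when = when or date.today()
    lines = [
        f"source: {Path(source).name}",
        f"source_format: {source_format}",
        f"source_sha256: {sha256_of(source)}",
        f"output: {Path(output).name}",
        f"output_sha256: {sha256_of(output)}",
        f"tool: {tool} {version}",
        f"command: {command}",
        f"converted_on: {when.isoformat()}",
    ]
    record = Path(str(output) + ".provenance.txt")
    record.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return record
```

Calling write_provenance right after each conversion step, with the exact command string that was run, keeps the record honest.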


Common Failure Modes

Three failure modes show up in almost every legacy migration project.

The first is encoding misidentification. Files claim ASCII but are really CP-1252, and the migration either emits invalid UTF-8 or writes valid UTF-8 full of mojibake. Always detect encoding with chardet or file -i and validate after conversion.

The second is silent layout regression. A DOC converts to DOCX without errors, but the page numbering shifts because the field codes were translated approximately. Always render both versions side by side and diff visually before declaring a migration successful.

The third is metadata stripping. EXIF, XMP, ICC profiles, custom DOCX properties, embedded fonts: all of these get dropped by lazy converters. Use exiftool to capture metadata before conversion and confirm it survives.

# Capture all metadata before conversion
exiftool -j source.tiff > source.tiff.meta.json

# Re-apply metadata after
exiftool -tagsFromFile source.tiff converted.png
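
Confirming that metadata survived can be done by comparing two exiftool -j dumps, one taken before conversion and one after. The sketch below assumes each dump is the JSON array that exiftool -j prints for a single file, and that the caller supplies the list of tags that matter; the function name and the example tag names are illustrative. Format-specific tags such as compression or file type are expected to change, which is why the comparison is limited to a required list.

```python
import json


def lost_tags(before_json: str, after_json: str, required: list[str]) -> list[str]:
    """Return required tags that were present before and are missing or changed after."""
    before = json.loads(before_json)[0]  # exiftool -j emits one object per file
    after = json.loads(after_json)[0]
    return [tag for tag in required
            if tag in before and before[tag] != after.get(tag)]
```

An empty result means every required tag made it across; anything else names exactly what to re-apply or investigate.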

Practical Recommendations

If you are migrating an archive, start with a sample of 20 files, end-to-end, including validation. Most surprises surface in the first sample. Choose target formats that are open, well-specified, and have multiple independent implementations. Keep originals. Document every conversion. Validate.
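
Picking the first sample deserves a little care, because a purely random draw from a large archive tends to miss the rare formats that cause the most trouble. The sketch below is one possible approach under that assumption: it groups files by extension and draws round-robin across the groups. The function name, the default size of 20, and the fixed seed are illustrative choices.

```python
import random
from collections import defaultdict
from pathlib import Path


def pick_sample(paths, size=20, seed=0):
    """Pick up to `size` paths, cycling through extensions so rare formats appear."""
    by_ext = defaultdict(list)
    for p in sorted(paths):
        by_ext[Path(p).suffix.lower()].append(p)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    for group in by_ext.values():
        rng.shuffle(group)
    sample = []
    while len(sample) < size and any(by_ext.values()):
        for ext in sorted(by_ext):
            if by_ext[ext] and len(sample) < size:
                sample.append(by_ext[ext].pop())
    return sample
```

Recording the seed alongside the sample list means the same 20 files can be re-run after every change to the pipeline.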

The job of legacy migration is not to make the file modern. It is to preserve as much of the original signal as possible across a format boundary, while leaving a trail that the next archivist can audit.

Character Encodings: The Invisible Failure Mode

Every legacy migration eventually trips over character encoding. ASCII is a 7-bit code with 128 characters. Before UTF-8 became the default in the 2000s, everything beyond ASCII lived in a long tail of mutually incompatible code pages. A non-ASCII text file written in 1995 is most likely in a legacy code page such as CP-1252 (Windows Western European), MacRoman, KOI8-R (Cyrillic), Shift-JIS (Japanese), or Big5 (Traditional Chinese). A careless converter assumes the bytes are UTF-8, decodes them as garbage, and writes permanent mojibake.

The detection workflow:

# Identify encoding heuristically
file -i suspicious.txt
chardetect suspicious.txt

# Hex dump the first 200 bytes to look for BOMs and patterns
xxd -l 200 suspicious.txt

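# Convert from CP-1252 to UTF-8 (no -c, so iconv stops at the first invalid byte)
iconv -f WINDOWS-1252 -t UTF-8 old.txt > new.txt

# Bulk re-encode a working copy of a directory in place, keeping the file structure
find . -name '*.txt' -print0 | while IFS= read -r -d '' f; do
  if iconv -f WINDOWS-1252 -t UTF-8 "$f" > "${f}.utf8"; then
    mv "${f}.utf8" "$f"
  else
    rm -f "${f}.utf8"
    echo "needs manual inspection: $f" >&2
  fi
done

The -c flag makes iconv silently drop bytes it cannot convert. For archival migration leave it off, so that the conversion fails loudly and the failures can be inspected by hand. Run the bulk loop on a working copy, never on the originals.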
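
The same fail-loudly behaviour is straightforward to get in Python, where a strict decode raises on the first byte that has no meaning in the declared encoding. This is a minimal sketch; the function name and the cp1252 default are assumptions. It writes to a separate destination so the source file is never modified, and it writes nothing at all if decoding fails.

```python
from pathlib import Path


def transcode_strict(src, dst, source_encoding="cp1252") -> None:
    """Re-encode src to UTF-8 at dst, raising UnicodeDecodeError on bad bytes."""
    raw = Path(src).read_bytes()
    # errors="strict" is the point: undecodable bytes abort instead of vanishing.
    text = raw.decode(source_encoding, errors="strict")
    # Encode and write as bytes so the original line endings are kept as-is.
    Path(dst).write_bytes(text.encode("utf-8"))
```

Catching UnicodeDecodeError around the call and logging the file name gives the list of files that need manual inspection.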

Verifying Migration Fidelity

A clean exit code proves nothing about content, so every migration needs a format-aware validation pass. The tools that catch silent damage:

Tool	Validates	Domain
JHOVE	Format conformance	Many (PDF, TIFF, WAV, etc.)
veraPDF	PDF/A conformance	PDF/A-1, A-2, A-3
MediaConch	Video format conformance	FFV1, Matroska, audiovisual
pngcheck	PNG validity	PNG only
flac --test	FLAC bitstream validity	FLAC
ImageMagick identify -verbose	Pixel-level inspection	Most images

# Validate a PDF/A migration
verapdf --flavour 2b *.pdf

# Validate FFV1 archival masters
mediaconch --policy=ffv1-archive.xml *.mkv

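# Cross-check WAV to FLAC roundtrip
# (byte-for-byte cmp holds only for plain WAVs; files with extra RIFF chunks
# need flac --keep-foreign-metadata at both encode and decode time)
flac --decode song.flac -o roundtrip.wav
cmp song.wav roundtrip.wav && echo "lossless verified"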
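
A byte-level comparison of two WAV files also compares their headers, which can differ even when the audio is identical. A sketch of a comparison that looks only at the audio format and the PCM frames, using the standard-library wave module, follows. The function name is an assumption, and the wave module reads uncompressed PCM only, so this is a check for plain WAV masters rather than a general tool.

```python
import wave


def same_audio(path_a: str, path_b: str) -> bool:
    """True if both WAV files have the same format and identical PCM frames."""
    with wave.open(path_a, "rb") as a, wave.open(path_b, "rb") as b:
        fmt_a = (a.getnchannels(), a.getsampwidth(), a.getframerate(), a.getnframes())
        fmt_b = (b.getnchannels(), b.getsampwidth(), b.getframerate(), b.getnframes())
        if fmt_a != fmt_b:
            return False
        while True:
            chunk_a = a.readframes(65536)
            chunk_b = b.readframes(65536)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both streams exhausted without a mismatch
                return True
```

Used as same_audio("song.wav", "roundtrip.wav"), it answers the question that matters for an audio archive: whether the samples survived.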

Cost and Time Estimates

A realistic budget for legacy migration helps justify the project:

Volume	Format mix	Engineering time	Compute time
1000 documents (DOC, RTF)	Office	1 week	2 hours
10000 photographs (TIFF, BMP)	Image	2 weeks	1 day
1000 hours of audio (WAV, AIFF)	Audio	1 week	6 hours
100 hours of video (DV, MOV)	Video	2 weeks	3 days
1 million database rows (DBF, MDB)	Tabular	2 to 4 weeks	1 hour
Mixed corporate archive (1 TB)	All	3 to 6 months	1 to 2 weeks

The compute time is rarely the constraint. The engineering time is dominated by sample inspection, edge-case handling, and validation. Budget accordingly.

References

ISO 19005-1:2005. Document management, Electronic document file format for long-term preservation, Part 1: Use of PDF 1.4 (PDF/A-1).
ISO 26300-1:2015. Open Document Format for Office Applications (OpenDocument) v1.2.
ISO 29500-1:2016. Information technology, Document description and processing languages, Office Open XML File Formats.
RFC 9639. FLAC: Free Lossless Audio Codec Format and Tools. Internet Engineering Task Force, December 2024.
Library of Congress. Sustainability of Digital Formats. https://www.loc.gov/preservation/digital/formats/
Rosenthal, David S. H. "Format Obsolescence: Assessing the Threat and the Defenses." Library Hi Tech, vol. 28, no. 2, 2010. DOI: 10.1108/07378831011076648.
JHOVE Project. Format-aware identification, validation, and characterization framework. http://jhove.openpreservation.org/
Niedermair, Klaus. "Long-Term Preservation of Digital Documents." D-Lib Magazine, 2009.
