Every working archive eventually faces the same problem: the format the original was written in is no longer the format the team uses, and the longer the gap, the more uncertain the conversion. A 1992 WordPerfect document, a 1998 dBase customer database, a 2003 RealAudio interview, a 2007 Photoshop file with effects layers from a discontinued plugin: each is a small archaeological dig. This guide is a practical field manual for the format conversions that actually come up in production, the ones that fail silently, and the toolchains that produce reproducible, faithful migrations.

Why Legacy Conversion Is Hard

Modern formats tend to be specified by ISO or IETF, parsed by widely-tested open libraries, and self-documenting through magic numbers and embedded schemas. Legacy formats often were not. WordPerfect 5.1 stored formatting as inline binary tokens whose meaning depended on the loaded printer driver. AutoCAD DWG was reverse-engineered from binary blobs because Autodesk never published a specification. RealAudio 5 used a perceptual codec whose decoder was kept proprietary. The conversion pain is rarely the data; it is the interpretation of the data.

A second source of pain is metadata loss. A binary DOC contains the author, revision marks, embedded objects, comments, and tracked changes. Convert it through a five-stage pipeline and most of that metadata silently disappears. Forensic and legal contexts care; product teams often do not realize they have lost it until an audit.

"When you migrate a digital archive, you are not moving data. You are translating an interpretation. The original interpretation lived in a piece of software you no longer have, running on a machine that no longer boots." Donald Knuth, paraphrased from a 2005 lecture on TeX font compatibility.

The Categories of Legacy Conversion

Conversion problems fall into recognizable categories.

Document formats. DOC, RTF, WPD, WP5, WRI, SDW, AppleWorks. The common path is to load with LibreOffice (which carries the most permissive set of importers) and save as ODT or DOCX, then export to PDF/A for archive.

Spreadsheet and database. XLS, WK1 (Lotus 1-2-3), DBF (dBase, FoxPro), MDB (Access). Migration target is usually SQLite, PostgreSQL, or Parquet for analytical workloads.

Raster image. BMP, PCX, TGA, TIFF, PSD, PCD (Photo CD), Kodak DCR. Migration target is PNG for graphics, JPEG XL or AVIF for photographs, with TIFF retained as the master.

Vector image. WMF, EMF, AI (older versions), CDR (CorelDRAW), DWG. Migration target is SVG for web, PDF for print.

Audio. WAV, AIFF, AU, RA, RM, MOD, S3M. Migration target is FLAC for archival, Opus or AAC for distribution.

Video. AVI, MOV, RM, ASF/WMV, FLV. Migration target is Matroska or MP4 with H.264 or AV1 video and Opus audio.
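
File extensions in old archives are often wrong or missing, so a first pass at sorting files into the categories above can look at the leading bytes instead. The following is a minimal sketch, not a replacement for file(1) or a format registry: the function names and the short signature list are illustrative, and formats without a stable signature (WordPerfect 4.2, many DBF variants) will come back as "unknown".

```python
# Hypothetical helper: map leading magic bytes to a coarse format family.
# The signature list is deliberately short; extend it for your own archive.
SIGNATURES = [
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 container (DOC, XLS, PPT)"),
    (b"\xffWPC", "WordPerfect 5.x or later"),
    (b"{\\rtf", "RTF"),
    (b"%PDF", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"8BPS", "Photoshop PSD"),
    (b"fLaC", "FLAC"),
    (b".RMF", "RealMedia"),
    (b"RIFF", "RIFF container (WAV, AVI)"),
    (b"FORM", "IFF container (AIFF)"),
]


def sniff(header: bytes) -> str:
    """Return a coarse label for the first bytes of a file."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first 16 bytes of a file and classify them."""
    with open(path, "rb") as fh:
        return sniff(fh.read(16))
```

A container label such as "RIFF" or "OLE2" only narrows the search; the codec or application stream inside still has to be inspected with ffprobe or a document importer.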

A Practical Conversion Matrix

The following table summarizes the conversions I run most often, with the tool I trust for each.

| From | To | Tool | Loss profile |
|---|---|---|---|
| DOC (Word 97-2003) | DOCX | LibreOffice headless | Macros, some field codes |
| WPD / WP5 | ODT | LibreOffice (libwpd) | Tab leaders, custom dictionaries |
| RTF | DOCX | Pandoc or LibreOffice | OLE objects |
| TIFF (LZW) | PNG | ImageMagick | None if 8-bit, banding if 16 to 8 |
| BMP | PNG | ImageMagick | None |
| PSD (flattened) | PNG / TIFF | ImageMagick or psdtool | Layers, effects |
| WAV (PCM) | FLAC | flac CLI | None |
| AIFF | FLAC | sox | None |
| AVI (DV / Cinepak) | MKV (FFV1) | ffmpeg | None at FFV1, lossy at H.264 |
| RM / RA | MP4 / MP3 | ffmpeg with realmedia demuxer | Generation loss only |
| DBF | SQLite | dbfread + sqlite3 | Memo fields if FPT missing |
| MDB | SQLite | mdbtools | Forms, queries, reports |

Document Conversion in Practice

The single most useful tool for legacy documents is LibreOffice in headless mode. It carries import filters for a long tail of formats that no other open tool handles.

# Convert a directory of DOC files to DOCX
libreoffice --headless --convert-to docx --outdir ./out ./legacy/*.doc

# Convert WordPerfect to ODT
libreoffice --headless --convert-to odt ./old.wpd

# Bulk export to PDF/A-2b for archive (JSON filter options need LibreOffice 7.4 or later)
libreoffice --headless --convert-to 'pdf:writer_pdf_Export:{"SelectPdfVersion":{"type":"long","value":"2"}}' --outdir ./pdfa ./out/*.docx

# Pandoc as a faster RTF pipeline
pandoc -f rtf -t docx old.rtf -o new.docx

Pandoc is faster than LibreOffice for RTF, Markdown, and HTML, but it does not parse binary DOC or WordPerfect. Use it for the cases it handles and LibreOffice for everything else.
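
The split between the two tools can be encoded as a small routing function. This is a sketch under stated assumptions: conversion_command and the PANDOC_INPUTS table are hypothetical names, the target is always DOCX, and the function only builds the argument list, leaving execution (for example via subprocess.run) to the caller.

```python
from pathlib import Path

# Extensions routed to Pandoc, following the split described above.
# Keys are file suffixes; values are Pandoc reader names.
PANDOC_INPUTS = {".rtf": "rtf", ".md": "markdown", ".html": "html", ".htm": "html"}


def conversion_command(src: str, outdir: str = "out") -> list[str]:
    """Return an argv list that converts src to DOCX with the appropriate tool."""
    path = Path(src)
    reader = PANDOC_INPUTS.get(path.suffix.lower())
    if reader is not None:
        target = Path(outdir) / (path.stem + ".docx")
        return ["pandoc", "-f", reader, "-t", "docx", str(path), "-o", str(target)]
    # Binary DOC, WordPerfect, and everything else go through LibreOffice.
    return ["libreoffice", "--headless", "--convert-to", "docx",
            "--outdir", outdir, str(path)]
```

Returning a list instead of a shell string avoids quoting problems with the spaces and punctuation that legacy file names tend to contain.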

"The right rule for archival format selection is: open specification, multiple independent implementations, no required network calls, no licensed dependencies. If a format fails any of those tests, the archive is fragile." Bruce Schneier, generalizing from cryptographic agility principles.

Image Conversion in Practice

Legacy raster formats are mostly well handled by ImageMagick or libvips, both of which can read more than 80 formats through a unified API.

# Convert a BMP scan to PNG with metadata stripped
magick old.bmp -strip -define png:compression-level=9 new.png

# Flatten a PSD to PNG, preserving the composited image
magick 'old.psd[0]' new.png

# TIFF (LZW) to JPEG XL, high-quality lossy (keep the TIFF as the master)
magick scan.tif -quality 95 scan.jxl

# Bulk convert a directory of TGAs
for f in *.tga; do
  magick "$f" -strip "${f%.tga}.png"
done

# Inspect format internals before conversion
identify -verbose suspicious.tif | head -50

For very large TIFFs (geospatial, gigapixel scans) prefer libvips, which streams rather than buffering the whole image, and GDAL for georeferenced output.

Audio Conversion in Practice

Legacy audio is mostly easy because WAV and AIFF are PCM containers and the conversion to FLAC or Opus is mechanically lossless or perceptually transparent. The hard cases are proprietary streams: RealAudio, MP4 with closed codecs, DRM-locked AAC.

# Lossless WAV to FLAC archival
flac --best --verify song.wav -o song.flac

# AIFF to FLAC via sox
sox interview.aiff interview.flac

# RealAudio decode (works for non-DRM streams)
ffmpeg -i interview.rm -c:a flac interview.flac

# Concatenate scanned tape segments preserving sample rate
ffmpeg -f concat -i list.txt -c:a flac archive.flac

# Inspect codec and metadata
ffprobe -v error -show_streams suspicious.ra

FLAC is the archival default. It is lossless, royalty-free, has multiple independent decoders, embeds rich metadata, and decompresses to byte-identical PCM.

Video Conversion in Practice

Legacy video is the hardest category because the original codecs were often perceptual and the original encoders are gone. The honest principle: archival masters of analog tape captures should be FFV1 (lossless) inside Matroska. Distribution copies can be H.264, HEVC, or AV1.

# Lossless archival of a DV-AVI capture
ffmpeg -i tape.avi -c:v ffv1 -level 3 -coder 1 -context 1 -g 1 \
  -slicecrc 1 -c:a flac tape.mkv

# Distribution copy from the FFV1 master
ffmpeg -i tape.mkv -c:v libx264 -crf 18 -preset slow \
  -c:a aac -b:a 192k tape.mp4

# RealMedia rescue
ffmpeg -i interview.rm -c:v libx264 -crf 23 \
  -c:a aac interview.mp4

# Inspect the codec graph
ffprobe -v error -show_streams -show_format suspicious.flv

"Working software is the primary measure of progress. For an archive, working software is the primary measure of survival." Adapted from the Agile Manifesto principles (Kent Beck et al., 2001), applied to format choice in long-term storage.

Structured Data Migration

DBF, MDB, and ancient SQL dumps are common in records-management work. A practical pipeline:

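# DBF to SQLite via Python (requires: pip install dbfread)
python3 - <<'EOF'
import sqlite3
from dbfread import DBF

table = DBF('CUSTOMERS.DBF', load=True)
con = sqlite3.connect('out.db')
# Quote field names so reserved words (ORDER, GROUP) survive as column names
cols = ', '.join(f'"{name}" TEXT' for name in table.field_names)
marks = ', '.join('?' for _ in table.field_names)
con.execute(f'CREATE TABLE customers ({cols})')
con.executemany(
    f'INSERT INTO customers VALUES ({marks})',
    # Stringify non-null values: sqlite3 cannot bind Decimal, and dates become ISO text
    [tuple(None if v is None else str(v) for v in r.values()) for r in table],
)
con.commit()
con.close()
EOF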

# MDB to SQLite via mdbtools
mdb-schema legacy.mdb sqlite | sqlite3 out.db
mdb-export -I sqlite legacy.mdb Orders | sqlite3 out.db

# Inspect a CSV with unknown encoding
file -i suspicious.csv
iconv -f WINDOWS-1252 -t UTF-8 suspicious.csv > clean.csv

Encoding is the silent killer of legacy data migrations. CP-437, Windows-1252, MacRoman, Shift-JIS, and Latin-9 all coexist in archives that nominally store "text." Always run file -i before assuming UTF-8.
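
Encoding aside, the DBF pipeline above declares every column as TEXT, which discards the type information the DBF header already carries. A minimal sketch of a typed schema follows. The type codes are the standard dBase and FoxPro ones, but the choice of SQLite target types is an assumption, and sqlite_type and create_table_sql are hypothetical helpers, not part of dbfread.

```python
# Assumed mapping from DBF field type codes to SQLite column types.
DBF_TO_SQLITE = {
    "C": "TEXT",     # character
    "M": "TEXT",     # memo
    "D": "TEXT",     # date, stored as ISO 8601 text
    "T": "TEXT",     # datetime (FoxPro), stored as ISO 8601 text
    "L": "INTEGER",  # logical, stored as 0 or 1
    "I": "INTEGER",  # 32-bit integer (FoxPro)
    "F": "REAL",     # float
    "Y": "NUMERIC",  # currency (FoxPro)
}


def sqlite_type(type_code: str, decimal_count: int = 0) -> str:
    """Pick a SQLite column type for one DBF field."""
    if type_code == "N":
        # Numeric fields with no decimals are whole numbers.
        return "INTEGER" if decimal_count == 0 else "REAL"
    # Unknown codes fall back to TEXT so nothing is silently dropped.
    return DBF_TO_SQLITE.get(type_code, "TEXT")


def create_table_sql(table: str, fields: list[tuple[str, str, int]]) -> str:
    """Build a CREATE TABLE statement from (name, type_code, decimal_count) tuples."""
    cols = ", ".join(f'"{name}" {sqlite_type(code, dec)}' for name, code, dec in fields)
    return f'CREATE TABLE "{table}" ({cols})'
```

With dbfread, the tuples could be built from each field's name, type, and decimal_count attributes. Where exact decimal values matter (prices, balances), NUMERIC or TEXT may be a safer target than REAL.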

A Comparative Survival Table

The following table summarizes formats by long-term survivability, based on public specifications, implementation diversity, and historical track record.

FormatSpecification statusDecoder diversity30-year outlook
PDF/AISO 19005ManyExcellent
PNGISO 15948, RFC 2083ManyExcellent
FLACRFC 9639ManyExcellent
TIFFISO 12639 (Adobe)ManyGood
ODTISO 26300LibreOffice + othersGood
DOCXISO 29500Word, LibreOffice, PagesGood
DOC (binary)Reverse-engineeredLibreOffice, WordFair, ages out
WPDReverse-engineeredLibreOffice via libwpdFragile
MOV (legacy codecs)ClosedApple, ffmpeg partialFragile
RealMediaClosedffmpeg partialEndangered
Lotus 1-2-3 WK1Documented in Lotus manualsLibreOfficeEndangered

Archival Practice

The discipline of digital preservation has produced a small set of rules that work.

Keep the original. Always. The migration may be wrong; the original is the only ground truth.

Migrate to two formats: a presentation copy (PDF/A, PNG, FLAC) and a high-fidelity reversible copy (FFV1, original-format-as-binary, ODT). The two-format rule means a future archivist can trust at least one path.

Document the conversion. A provenance.txt next to every migrated artifact recording the source format, the tool and version, the command line, and the date of conversion is the difference between an archive and a heap of bytes.
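
One way to make that record hard to skip is to generate it in the same script that runs the conversion. The sketch below is illustrative: the field names and the JSON-lines layout are assumptions, and the SHA-256 checksums go beyond the fields listed above, on the theory that a later archivist will want to confirm that the files on disk are the ones the record describes.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of(path: str) -> str:
    """Hash a file in 1 MiB chunks so large masters do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(source, output, source_format, tool, tool_version, command):
    """Collect the facts about one conversion into a plain dict."""
    return {
        "source": source,
        "source_format": source_format,
        "source_sha256": sha256_of(source),
        "output": output,
        "output_sha256": sha256_of(output),
        "tool": tool,
        "tool_version": tool_version,
        "command": command,
        "converted_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }


def write_provenance(record: dict, path: str = "provenance.txt") -> None:
    """Append one JSON object per line so the file stays greppable."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
```

Appending one line per conversion keeps a single provenance.txt usable for a whole directory of migrated files.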

Validate after migration. JHOVE for documents, MediaConch for video, flac --test for audio. A conversion that produces no errors but mangles the content is the most expensive failure mode.

Common Failure Modes

Three failure modes show up in almost every legacy migration project.

The first is encoding misidentification. Files claim ASCII but are actually CP-1252; a migration that decodes them with the wrong code page either fails on the high bytes or writes mojibake into the output. Always detect encoding with chardet or file -i, and spot-check non-ASCII text after conversion.

The second is silent layout regression. A DOC converts to DOCX without errors, but the page numbering shifts because the field codes were translated approximately. Always render both versions side by side and diff visually before declaring a migration successful.

The third is metadata stripping. EXIF, XMP, ICC profiles, custom DOCX properties, embedded fonts: all of these get dropped by lazy converters. Use exiftool to capture metadata before conversion and confirm it survives.

# Capture all metadata before conversion
exiftool -j source.tiff > source.tiff.meta.json

# Re-apply metadata after
exiftool -tagsFromFile source.tiff converted.png
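
To confirm that metadata survived, a second exiftool -j dump of the converted file can be compared against the first. The following is a rough sketch, assuming both dumps are the single-file JSON arrays that exiftool -j writes. metadata_diff and the IGNORED set are illustrative names, and the ignore list covers only file-system tags that are expected to change.

```python
import json

# Tags that describe the file on disk rather than its content; expected to differ.
IGNORED = {"SourceFile", "ExifToolVersion", "FileName", "Directory", "FileSize",
           "FileModifyDate", "FileAccessDate", "FileInodeChangeDate",
           "FilePermissions", "FileType", "FileTypeExtension", "MIMEType"}


def load_tags(path: str) -> dict:
    """Read the first record of an `exiftool -j` dump."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)[0]


def metadata_diff(before: dict, after: dict) -> dict:
    """Report tags from the source that are missing or altered in the output."""
    keys = set(before) - IGNORED
    return {
        "missing": sorted(k for k in keys if k not in after),
        "changed": sorted(k for k in keys if k in after and after[k] != before[k]),
    }
```

Some differences are legitimate (compression and format-specific tags change by design), so the output is a review list for a human, not a pass/fail gate.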

Practical Recommendations

If you are migrating an archive, start with a sample of 20 files, end-to-end, including validation. Most surprises surface in the first sample. Choose target formats that are open, well-specified, and have multiple independent implementations. Keep originals. Document every conversion. Validate.
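
A pilot sample drawn purely at random tends to consist almost entirely of the dominant format, which hides the rare formats that cause most of the surprises. One possible approach, sketched below with the hypothetical helper pilot_sample, is to pick round-robin across file extensions with a fixed seed so that the sample can be reproduced later.

```python
import random
from collections import defaultdict
from pathlib import Path


def pilot_sample(paths, size=20, seed=0):
    """Pick up to `size` files, spreading the picks across extensions."""
    by_ext = defaultdict(list)
    for p in sorted(paths):
        by_ext[Path(p).suffix.lower()].append(p)
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    for group in by_ext.values():
        rng.shuffle(group)
    sample = []
    # Round-robin over extensions so rare formats are not crowded out.
    while len(sample) < size and any(by_ext.values()):
        for ext in sorted(by_ext):
            if by_ext[ext] and len(sample) < size:
                sample.append(by_ext[ext].pop())
    return sample
```

Recording the seed alongside the sample list means the same 20 files can be re-run after a toolchain change.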

The job of legacy migration is not to make the file modern. It is to preserve as much of the original signal as possible across a format boundary, while leaving a trail that the next archivist can audit.

Character Encodings: The Invisible Failure Mode

Every legacy migration eventually trips over character encoding. ASCII is a 7-bit code with 128 characters. Everything beyond it lived in a long tail of mutually incompatible code pages until UTF-8 became the default in the 2000s. Files written in 1995 are almost certainly in CP-1252 (Windows Western European), MacRoman, KOI8-R (Cyrillic), Shift-JIS (Japanese), or Big5 (Traditional Chinese). A careless converter assumes UTF-8 or the wrong code page, decodes the bytes as garbage, and writes that garbage back out as permanent mojibake.

The detection workflow:

# Identify encoding heuristically
file -i suspicious.txt
chardetect suspicious.txt

# Hex dump the first 200 bytes to look for BOMs and patterns
xxd -l 200 suspicious.txt

# Convert from CP-1252 to UTF-8 (no -c, so invalid input aborts the run)
iconv -f WINDOWS-1252 -t UTF-8 old.txt > new.txt

# Bulk re-encode a working copy of a directory while preserving the file structure
find . -name '*.txt' -print0 | while IFS= read -r -d '' f; do
  if iconv -f WINDOWS-1252 -t UTF-8 "$f" > "${f}.utf8"; then
    mv "${f}.utf8" "$f"
  else
    echo "FAILED: $f" >&2
    rm -f "${f}.utf8"
  fi
done

These commands deliberately omit iconv's -c flag, which silently drops invalid sequences. For archival migration, prefer to fail loudly and inspect the failures manually.
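
The same fail-loudly policy can be expressed in Python when the migration is scripted. This is a sketch under assumptions: transcode_strict and to_utf8 are hypothetical helpers, and CP-1252 as the default legacy encoding is only a placeholder for whatever detection suggested. UTF-8 is checked first because it is self-validating, which avoids double-encoding files that were already converted.

```python
def transcode_strict(data: bytes, source_encoding: str = "cp1252") -> str:
    """Decode strictly; raise with offset and context instead of dropping bytes."""
    try:
        return data.decode(source_encoding, errors="strict")
    except UnicodeDecodeError as exc:
        # Show a few bytes on either side of the failure for manual inspection.
        context = data[max(0, exc.start - 8): exc.end + 8]
        raise ValueError(
            f"{source_encoding}: undecodable byte(s) at offset {exc.start}: {context!r}"
        ) from exc


def to_utf8(data: bytes, legacy_encoding: str = "cp1252") -> bytes:
    """Return UTF-8 bytes, leaving input that is already valid UTF-8 untouched."""
    try:
        data.decode("utf-8", errors="strict")
        return data  # already valid UTF-8 (pure ASCII included)
    except UnicodeDecodeError:
        return transcode_strict(data, legacy_encoding).encode("utf-8")
```

A clean strict decode does not prove the encoding guess was right, since most single-byte code pages accept almost any byte. It only guarantees that nothing was silently discarded.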

Verifying Migration Fidelity

Conversions that exit cleanly but mangle content are caught only by validation. The tools that catch them:

| Tool | Validates | Domain |
|---|---|---|
| JHOVE | Format conformance | Many (PDF, TIFF, WAV, etc.) |
| veraPDF | PDF/A conformance | PDF/A-1, A-2, A-3 |
| MediaConch | Video format conformance | FFV1, Matroska, audiovisual |
| pngcheck | PNG validity | PNG only |
| flac --test | FLAC bitstream validity | FLAC |
| ImageMagick identify -verbose | Pixel-level inspection | Most images |

# Validate a PDF/A migration
verapdf --flavour 2b *.pdf

# Validate FFV1 archival masters
mediaconch --policy=ffv1-archive.xml *.mkv

# Cross-check WAV to FLAC roundtrip
flac --decode song.flac -o roundtrip.wav
cmp song.wav roundtrip.wav && echo "lossless verified"
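
A byte-for-byte cmp also compares the WAV headers, so it can report a mismatch when the source file carried extra RIFF chunks even though the audio is identical. As a hedged alternative, the sketch below compares only the audio parameters and PCM frames using Python's standard wave module. The function names are illustrative, the whole file is read into memory, and the wave module handles uncompressed PCM WAV only.

```python
import wave


def pcm_fingerprint(path: str):
    """Return (channels, sample width, frame rate, frame count) and the raw PCM frames."""
    with wave.open(path, "rb") as w:
        params = (w.getnchannels(), w.getsampwidth(), w.getframerate(), w.getnframes())
        frames = w.readframes(w.getnframes())
    return params, frames


def same_audio(a: str, b: str) -> bool:
    """True when both files hold identical PCM audio, regardless of header differences."""
    return pcm_fingerprint(a) == pcm_fingerprint(b)
```

For example, same_audio("song.wav", "roundtrip.wav") would be the audio-only counterpart to the cmp check above.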

Cost and Time Estimates

A realistic budget for legacy migration helps justify the project:

| Volume | Format mix | Engineering time | Compute time |
|---|---|---|---|
| 1000 documents (DOC, RTF) | Office | 1 week | 2 hours |
| 10000 photographs (TIFF, BMP) | Image | 2 weeks | 1 day |
| 1000 hours of audio (WAV, AIFF) | Audio | 1 week | 6 hours |
| 100 hours of video (DV, MOV) | Video | 2 weeks | 3 days |
| 1 million database rows (DBF, MDB) | Tabular | 2 to 4 weeks | 1 hour |
| Mixed corporate archive (1 TB) | All | 3 to 6 months | 1 to 2 weeks |

The compute time is rarely the constraint. The engineering time is dominated by sample inspection, edge-case handling, and validation. Budget accordingly.

References

1. ISO 19005-1:2005. Document management, Electronic document file format for long-term preservation, Part 1: Use of PDF 1.4 (PDF/A-1).
2. ISO 26300-1:2015. Open Document Format for Office Applications (OpenDocument) v1.2.
3. ISO 29500-1:2016. Information technology, Document description and processing languages, Office Open XML File Formats.
4. RFC 9639. Free Lossless Audio Codec (FLAC). Internet Engineering Task Force, December 2024.
5. Library of Congress. Sustainability of Digital Formats. https://www.loc.gov/preservation/digital/formats/
6. Rosenthal, David S. H. "Format Obsolescence: Assessing the Threat and the Defenses." Library Hi Tech, vol. 28, no. 2, 2010. DOI: 10.1108/07378831011076648.
7. JHOVE Project. Format-aware identification, validation, and characterization framework. http://jhove.openpreservation.org/
8. Niedermair, Klaus. "Long-Term Preservation of Digital Documents." D-Lib Magazine, 2009.