Audio conversion is one of those operations that looks trivial in a graphical converter and reveals its complexity the moment you care about the result. Sample rate conversion with a poorly designed filter produces audible aliasing, especially between rates in a non-integer ratio. Bit depth reduction without dither sounds harsh on quiet passages. Channel downmixing without gain compensation clips. Lossy-to-lossy conversion compounds artifacts. Every conversion has a quality consequence, and getting professional results requires knowing which knobs to set and which to leave alone.

This guide walks through the conversions that actually matter for music, podcast, audiobook, and video production work. The reference tool is ffmpeg, which handles every mainstream codec and container, runs on every operating system, and exposes the parameters that affect quality. Graphical converters are often built on ffmpeg internally; learning the command-line invocation lets you reproduce any conversion exactly and script bulk operations.

The Conversion Quality Hierarchy

Not all conversions are equal in their quality cost. Here is a short hierarchy, from cheapest to most expensive in audible terms.

The cheapest conversion is changing the container without re-encoding. Repackaging a FLAC file's metadata, putting an existing AAC stream into a different MP4 wrapper, or extracting raw PCM from a WAV file all happen at the byte level with no quality cost.

# Container change without re-encoding
ffmpeg -i input.m4a -c:a copy output.mp4   # AAC stream copied
ffmpeg -i input.flac -c:a copy output.ogg  # FLAC into OGG container

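The next cheapest is lossless transcoding between lossless formats. Converting WAV to FLAC or FLAC to ALAC produces a file that decodes to exactly the same samples, usually at a smaller size than uncompressed PCM, with zero audible difference. This holds as long as the source is integer PCM at a bit depth the target codec supports. The encoder only changes how the samples are packed.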

# Lossless transcoding
ffmpeg -i input.wav -c:a flac -compression_level 8 output.flac
ffmpeg -i input.flac -c:a alac output.m4a

Sample rate and bit depth conversion within lossless representations is mostly free if done correctly with proper dither and resampling. The arithmetic introduces tiny numerical errors that are below the noise floor of even 24-bit recordings.

Lossless to lossy conversion is the first step that introduces audible loss. The encoder discards information based on psychoacoustic models. Done at sufficient bitrate, the loss is inaudible on consumer playback. Done at low bitrate, the loss is obvious.

Lossy to lossy conversion is the worst common case. Each step compounds artifacts. A file that has been transcoded MP3 to AAC to MP3 sounds audibly worse than either single conversion would produce.

"Every conversion is a tax on the signal. The only conversion that does not pay is the one you do not do." Bob Ludwig, mastering engineer

Sample Rate Conversion

Sample rate conversion (SRC) changes how many samples per second represent the audio. Converting 96 kHz to 48 kHz halves the data; converting 44.1 kHz to 48 kHz changes the time base. The math behind SRC involves anti-aliasing low-pass filtering before downsampling and reconstruction filtering after upsampling.

The quality of the conversion depends on the filter design. A poor filter introduces audible aliasing or muddies the high frequency response. Modern resamplers including ffmpeg's swresample, libsoxr (the SoX resampler), and Apple's CoreAudio Resampler all produce excellent results when invoked with their high-quality settings.

# High-quality resampling with ffmpeg using swresample defaults
ffmpeg -i input-96k.wav -ar 48000 output-48k.wav

# Higher quality with explicit soxr backend
ffmpeg -i input-96k.wav \
       -af "aresample=48000:resampler=soxr:precision=28" output-48k.wav

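The precision parameter sets the bit precision of the soxr resampler's filter. The default of 20 is excellent; 28 is reference quality. The CPU cost of higher precision is small for offline conversion. The soxr backend is available only in ffmpeg builds that include libsoxr. Also note that ffmpeg writes 16-bit PCM to WAV files by default, so add -c:a pcm_s24le to the commands above when the source is 24-bit and should stay that way.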

Sample rate conversions differ in how simple the ratio between the two rates is. 96 kHz to 48 kHz and 88.2 kHz to 44.1 kHz are integer ratios (a factor of 2): after low-pass filtering, the resampler keeps every second sample. 48 kHz to 44.1 kHz requires resampling at a 147:160 ratio, which is fractional and demands a longer filter. Any sample rate conversion is technically lossy in the sense that the output contains different sample values, but with a good resampler the audible difference is below the noise floor.
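The ratio for any pair of rates follows from dividing both by their greatest common divisor. The sketch below is only an illustration of that arithmetic; gcd and rate_ratio are hypothetical helper names, not part of ffmpeg or soxr, and the result is written as source:target.

```shell
# Reduce a sample-rate pair to its smallest integer ratio (source:target).
# gcd and rate_ratio are illustrative helper names, not ffmpeg features.
gcd() {
  a=$1; b=$2
  while [ "$b" -ne 0 ]; do
    t=$((a % b)); a=$b; b=$t
  done
  echo "$a"
}

rate_ratio() {
  g=$(gcd "$1" "$2")
  echo "$(( $1 / g )):$(( $2 / g ))"
}

rate_ratio 96000 48000   # prints 2:1
rate_ratio 48000 44100   # prints 160:147
```

A result of the form N:1 marks an integer-ratio conversion; anything else is a fractional conversion that benefits most from a high-quality resampler.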

Bit Depth Conversion and Dither

Bit depth conversion downward (24-bit to 16-bit) requires dither. Without dither, the truncation produces quantization error that is correlated with the signal and sounds like fine harshness on quiet passages. Dither adds a small random signal before quantization that decorrelates the error and turns it into audibly benign white noise.

# Bit depth reduction with high-pass triangular (TPDF) dither
ffmpeg -i master-24.wav -sample_fmt s16 \
       -af "aresample=osf=s16:dither_method=triangular_hp" \
       output-16.wav

The triangular_hp dither method applies high-pass shaped triangular probability density function dither. This shapes the dither noise above the most sensitive part of the human hearing range, where it is less audible. For mastering, this is the appropriate choice. For neutral conversion, plain triangular dither is the textbook answer.
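The effect of dither can be seen with a toy experiment, sketched here with awk. It assumes a constant signal at 0.3 of one 16-bit step (an arbitrary level chosen for illustration) and plain, unshaped TPDF dither; it does not reproduce ffmpeg's triangular_hp shaping. Without dither the level always rounds to zero and is lost. With dither each output is still a whole step, but the average of many outputs recovers the original level.

```shell
# Quantize a constant level of 0.3 LSB many times, with and without TPDF dither.
# The 0.3 LSB level, the sample count, and the seed are arbitrary choices
# for illustration; dither_demo is a hypothetical helper name.
dither_demo() {
  awk 'BEGIN {
    srand(1); n = 200000; x = 0.3
    plain = int(x + 0.5)                    # round to nearest LSB, no dither
    for (i = 0; i < n; i++) {
      d = (rand() - 0.5) + (rand() - 0.5)   # TPDF dither, range -1..+1 LSB
      v = x + d + 0.5
      q = int(v); if (q > v) q--            # floor(), since int() truncates toward zero
      sum += q
    }
    printf "undithered=%d dithered_mean=%.2f\n", plain, sum / n
  }'
}

dither_demo
```

The undithered value is 0, while the dithered mean should land close to 0.3; the exact digits depend on the awk implementation's random number generator.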

Dither Method | Spectrum | Use Case
None | Correlated quantization noise | Avoid
Rectangular | Flat, correlated | Avoid
Triangular (TPDF) | Flat, decorrelated | Neutral conversion
Triangular HP | High-pass shaped | Mastering for music
Noise-shaped (POW-r 1/2/3) | Aggressively shaped | Final mastering, choose by content

Bit depth conversion upward (16-bit to 24-bit) is loss-free. It pads the existing samples with zero bits in the lower part of the word. The resulting 24-bit file is larger but contains exactly the same audio information.
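The zero padding is easy to check by hand. The sketch below, with the hypothetical helper names widen_16_to_24 and narrow_24_to_16, multiplies a 16-bit sample value by 256 (eight extra zero bits) and divides it back, which should return the original value for every sample.

```shell
# Widen a 16-bit sample to 24 bits by appending eight zero bits, then narrow it back.
# widen_16_to_24 and narrow_24_to_16 are illustrative names.
widen_16_to_24()  { echo $(( $1 * 256 )); }   # same as shifting left by 8 bits
narrow_24_to_16() { echo $(( $1 / 256 )); }   # exact only when the low 8 bits are zero

for s in -32768 -1 0 12345 32767; do
  wide=$(widen_16_to_24 "$s")
  echo "$s -> $wide -> $(narrow_24_to_16 "$wide")"
done
```

The round trip is exact only because the low eight bits are zero; a sample with real 24-bit content in those bits needs dither on the way down.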

Channel Layout Conversion

Channel layout conversion comes up most often as stereo to mono or multichannel to stereo for distribution. Each conversion has correct and incorrect ways to do it.

Stereo to mono should sum the channels with proper gain compensation. The simple sum (L plus R) clips on stereo content with strong center information because the center is now twice as loud. The correct downmix divides by 2 or applies the standard ITU downmix coefficient.

# Stereo to mono with proper gain compensation
ffmpeg -i stereo.wav -ac 1 mono.wav

# Equivalent explicit form
ffmpeg -i stereo.wav -af "pan=mono|c0=0.5*c0+0.5*c1" mono.wav

The -ac 1 flag uses ffmpeg's default downmix, which applies appropriate gain compensation. The explicit form documents what is happening for readers of the script.
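The arithmetic behind the clipping problem can be sketched with integer 16-bit sample values. The helper names clip16, sum_mono, and avg_mono are hypothetical, and the sample value is an arbitrary example of a centered signal at roughly -3 dBFS; this illustrates the gain math only, not ffmpeg's internal implementation.

```shell
# Compare a naive L+R sum with a gain-compensated average for 16-bit samples.
# clip16, sum_mono and avg_mono are illustrative names.
clip16() {
  v=$1
  [ "$v" -gt 32767 ] && v=32767
  [ "$v" -lt -32768 ] && v=-32768
  echo "$v"
}
sum_mono() { clip16 $(( $1 + $2 )); }          # naive sum, saturates at full scale
avg_mono() { clip16 $(( ($1 + $2) / 2 )); }    # 0.5*L + 0.5*R, cannot exceed full scale

# A centered signal at about -3 dBFS in both channels (23170 is roughly 0.707 * 32767)
sum_mono 23170 23170   # prints 32767: the sum 46340 was clipped
avg_mono 23170 23170   # prints 23170: level preserved, no clipping
```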

Multichannel (5.1, 7.1) to stereo uses the ITU-R BS.775 downmix or the more recent Dolby Pro Logic II downmix. The standard ITU formula:

L_out = L + 0.707*C + 0.707*Ls
R_out = R + 0.707*C + 0.707*Rs

ffmpeg implements this automatically with -ac 2 on a multichannel input. The result is conventional and predictable.
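Taken literally, the formula can exceed full scale when L, C, and Ls peak together, so a downmixer has to attenuate the result or leave headroom. The sketch below computes the worst-case peak and the attenuation that would bring it back to full scale. downmix_headroom is a hypothetical helper name, and the worst case assumes all three channels at full scale and in phase.

```shell
# Worst-case peak gain of the unnormalized downmix and the attenuation that removes it.
# downmix_headroom is an illustrative name; 0.707 is the coefficient from the formula above.
downmix_headroom() {
  awk 'BEGIN {
    peak = 1 + 0.707 + 0.707                 # L, C and Ls all at full scale, in phase
    db = 20 * log(peak) / log(10)            # awk log() is the natural logarithm
    printf "peak=%.3f attenuation_db=%.2f\n", peak, -db
  }'
}

downmix_headroom
```

Real program material rarely hits this worst case, but the calculation shows why a downmix is often noticeably quieter than the multichannel original.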

"The downmix is the moment your three-dimensional mix becomes a flat photograph. The choice of downmix is the photographer's choice of lens." Bobby Owsinski, The Mixing Engineer's Handbook

Lossy Encoding: The Settings That Matter

Lossy encoders produce wildly different quality at the same bitrate depending on settings. The two encoders worth knowing in depth are LAME (for MP3) and libopus (for Opus).

LAME's quality is controlled by either bitrate (-b 192 for 192 kbps CBR), variable bitrate quality (-V 0 highest, -V 9 lowest), or average bitrate (--abr 192). For music distribution, -V 0 produces the highest quality at roughly 245 kbps average. For podcast spoken word mono, -V 6 produces excellent results at roughly 96 kbps.

# Music distribution at LAME's highest VBR
lame -V 0 -h master.wav music.mp3

# Lower-bitrate spoken-word mono
lame -V 6 -h -m m master.wav podcast.mp3

# Equivalent through ffmpeg
ffmpeg -i master.wav -codec:a libmp3lame -q:a 0 music.mp3

The -h flag selects high-quality mode, which uses more CPU for marginally better results. The -m m flag forces mono. The -q:a parameter in ffmpeg maps to LAME's -V setting.

libopus quality is controlled primarily by bitrate; the encoder is well tuned enough that few other settings matter. The standalone opusenc tool detects speech and music automatically, and its --speech and --music flags override that detection at low bitrates. The three application profiles (voip for spoken word, audio for music, and lowdelay for live streaming) belong to ffmpeg's libopus encoder, where the option is written -application; opusenc has no such flag.

# Music encoding
opusenc --bitrate 96 --music master.wav music.opus

# Voice encoding for podcast
opusenc --bitrate 48 --speech master.wav podcast.opus

Conversion Recipes for Common Scenarios

The recipes below cover the most common audio conversion needs. Each uses ffmpeg for portability.

# Music: WAV master to FLAC for archive
ffmpeg -i master.wav -codec:a flac -compression_level 8 archive.flac

# Music: WAV master to MP3 V0 (highest VBR)
ffmpeg -i master.wav -codec:a libmp3lame -q:a 0 distribution.mp3

# Music: WAV master to AAC 256 kbps in M4A
ffmpeg -i master.wav -codec:a aac -b:a 256k -movflags +faststart distribution.m4a

# Podcast: WAV master to MP3 96 kbps mono
ffmpeg -i master.wav -ac 1 -codec:a libmp3lame -b:a 96k -id3v2_version 3 \
       -metadata title="Episode 47" -metadata artist="Show Name" \
       -metadata album="Show Name" episode.mp3

# Podcast: WAV master to Opus 48 kbps mono
ffmpeg -i master.wav -ac 1 -codec:a libopus -b:a 48k \
       -application voip episode.opus

# Audiobook: WAV master to AAC 64 kbps mono in M4B
ffmpeg -i master.wav -ac 1 -codec:a aac -b:a 64k -f mp4 audiobook.m4b

# Voice memo: WAV to compact OGG
ffmpeg -i memo.wav -codec:a libopus -b:a 32k -application voip memo.ogg

# Downsample 96/24 master to 44.1/16 with proper dither
ffmpeg -i master-96-24.wav -ar 44100 -sample_fmt s16 \
       -af "aresample=44100:resampler=soxr:precision=28:osf=s16:dither_method=triangular_hp" \
       output-44-16.wav

# Channel extraction: pull only the left channel from a stereo file
ffmpeg -i stereo.wav -af "pan=mono|c0=c0" left-channel.wav

# Loudness normalization to -16 LUFS for podcast
ffmpeg -i master.wav -af "loudnorm=I=-16:TP=-1.0:LRA=11" \
       -codec:a libmp3lame -b:a 96k normalized.mp3

# Trim silence from start and end of recording
ffmpeg -i raw.wav -af "silenceremove=start_periods=1:start_duration=1:start_threshold=-50dB:detection=peak,aformat=dblp,areverse,silenceremove=start_periods=1:start_duration=1:start_threshold=-50dB:detection=peak,aformat=dblp,areverse" \
       trimmed.wav

# Bulk convert all WAVs in a directory to FLAC
for f in *.wav; do
  ffmpeg -i "$f" -codec:a flac -compression_level 8 "${f%.wav}.flac"
done

When Conversions Go Wrong

These are the failure modes that bite real audio pipelines.

The clipping downmix. Stereo content with strong center information clips when the channels are simply summed to mono. Fix: use -ac 1 or an explicit gain-compensated downmix.

The aliased downsampling. 96 kHz to 44.1 kHz with a poor resampler produces audible aliasing in high frequencies. Fix: use the soxr resampler with precision 28 or higher.

The double dither. Dithering to 16-bit twice (once in mastering, then again in a later conversion) adds a second layer of dither noise and raises the noise floor for no benefit. Fix: dither only once, at the final conversion to the lowest bit depth.

The lossy-to-lossless conversion. Converting MP3 back to WAV does not recover the discarded information. The resulting WAV is lossy audio in a lossless wrapper. Fix: keep masters as PCM or lossless and re-encode from those.

The broken metadata. Some converters strip or corrupt embedded metadata. Fix: copy metadata explicitly with -map_metadata 0 in ffmpeg, or use a dedicated tagger after conversion.

The wrong loudness target. Mastering for one platform and delivering to another produces files that play too loud or too soft. Fix: master to the loudness target of the primary platform and accept the others' normalization.

Failure Mode | Symptom | Fix
Sum-to-mono clipping | Distortion on dialogue | Use -ac 1 for gain-compensated downmix
Truncation without dither | Harsh quantization noise on quiet sections | Apply triangular dither at bit depth reduction
Aliased SRC | Strange artifacts in high frequencies | Use soxr resampler with high precision
Lossy chain | Audible artifacts compound | Convert from lossless masters
Wrong sample rate | Pitch shift in playback | Verify rate metadata after conversion
Container without metadata | Tags missing | Use -map_metadata 0 and verify
Wrong loudness | Quiet or loud playback | Master to platform target

Verifying the Conversion

After a non-trivial conversion, verify the result. The minimum check for a music file is to play the first thirty seconds and the last thirty seconds and compare against the source. The minimum check for a batch is to spot-check three files at random.

For automated verification of properties, ffprobe reports the actual sample rate, bit depth, channel count, and codec of the output file:

ffprobe -v error -select_streams a:0 \
        -show_entries stream=codec_name,sample_rate,bits_per_sample,channels \
        -of default=noprint_wrappers=1 output.wav

For loudness verification, ffmpeg's loudnorm filter in print mode reports the actual integrated loudness:

ffmpeg -i output.wav -af loudnorm=print_format=summary -f null -

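For bit-exact verification of lossless conversions, decode both files and compare hashes of the decoded audio:

# The md5 muxer hashes 16-bit PCM by default; request 24-bit so wider sources are compared in full
ffmpeg -i input.wav -map 0:a -c:a pcm_s24le -f md5 -
ffmpeg -i output.flac -map 0:a -c:a pcm_s24le -f md5 -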

The MD5 outputs should match for a lossless conversion that preserved samples. The cognitive science research at What's Your IQ on perception suggests that listeners often misattribute differences to the wrong cause; verification by measurement avoids these errors. The note-keeping discipline at When Notes Fly recommends the same kind of checking habit for any conversion: trust the measurement over the impression.

Bulk and Pipeline Conversions

For shows or libraries with many files, a small wrapper script handles bulk conversion with logging and parallelism.

#!/usr/bin/env bash
set -euo pipefail
input_dir="$1"
output_dir="$2"
mkdir -p "$output_dir"
find "$input_dir" -type f -name "*.wav" -print0 | \
  xargs -0 -P4 -I{} bash -c '
    # $1 is the output directory, $2 is the source file
    src="$2"
    dst="$1/$(basename "${src%.wav}.flac")"
    ffmpeg -nostdin -loglevel error -y -i "$src" -codec:a flac -compression_level 8 "$dst" \
      && echo "Converted $src -> $dst"
  ' _ "$output_dir" {}

The -P4 flag in xargs runs four conversions in parallel. On a typical workstation with four to twelve CPU cores, parallel conversion completes a library in roughly one quarter the wall time of sequential conversion.

Conversion in CI and Build Pipelines

Audio conversion increasingly happens inside continuous integration pipelines for game development, podcast hosting, and music distribution platforms. The patterns that work in a CI context differ from the patterns that work on an engineer's workstation.

Reproducibility matters more in CI. The same input must produce the same output every time, regardless of the build agent. Pin the ffmpeg version, pin the codec library versions, and capture the exact command line in the build log. A pipeline that mysteriously produces different files on different days is a pipeline that has lost trust.

Logging matters more in CI. Failed conversions on an engineer's machine surface immediately; failed conversions in CI often produce zero-byte output files that propagate through the pipeline. Always check exit codes, validate output file properties with ffprobe, and fail loudly when properties do not match expectations.
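One way to fail loudly is a small check that compares ffprobe's key=value output against the properties the pipeline expects. The sketch below is an assumption about how such a step might look, not a standard tool: check_props is a hypothetical helper that reads the probe output on stdin, and the file name and expected values in the usage comment are placeholders.

```shell
# Compare key=value lines (the format ffprobe prints with
# -of default=noprint_wrappers=1) against expected values.
# check_props is an illustrative helper; it reads the probe output on stdin.
check_props() {
  probe=$(cat)
  status=0
  for want in "$@"; do
    if ! printf '%s\n' "$probe" | grep -qxF "$want"; then
      echo "property mismatch: expected $want" >&2
      status=1
    fi
  done
  return "$status"
}

# Intended use in a pipeline step (output.wav is a placeholder path):
#   ffprobe -v error -select_streams a:0 \
#           -show_entries stream=codec_name,sample_rate,channels \
#           -of default=noprint_wrappers=1 output.wav |
#     check_props codec_name=pcm_s16le sample_rate=44100 channels=2 || exit 1
```

Because a zero-byte or unreadable file yields no matching lines, the same check also catches the empty-output failure described above.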

Parallelism matters more in CI because build agent time is metered. The xargs and GNU parallel patterns scale well, but be careful about memory: high-quality LAME and Opus encoding at maximum settings can use 200-400 MB per process, which constrains the parallel count on small build agents.

Cache invalidation matters more in CI. Cache the converted files keyed on the input file's hash plus the conversion parameters. A change to the input or to the encoder settings should invalidate the cache; a change to unrelated files should not. Bare timestamp-based caching is fragile.
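A content-based cache key might be derived as in the sketch below. cache_key is a hypothetical helper, the cache directory layout in the usage comment is a placeholder, and sha256sum is assumed to be available on the build agent (it ships with GNU coreutils).

```shell
# Derive a cache key from the input file's content hash plus the encoder settings.
# cache_key is an illustrative name; sha256sum is assumed to be available (GNU coreutils).
cache_key() {
  file=$1; shift
  file_hash=$(sha256sum < "$file" | cut -d' ' -f1)
  printf '%s\n%s\n' "$file_hash" "$*" | sha256sum | cut -d' ' -f1
}

# Example (paths and settings are placeholders):
#   key=$(cache_key master.wav libmp3lame -q:a 0)
#   [ -f "cache/$key.mp3" ] || ffmpeg -i master.wav -codec:a libmp3lame -q:a 0 "cache/$key.mp3"
```

Including the pinned ffmpeg version string among the settings would also invalidate the cache when the encoder itself changes.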

For related guidance, see audio formats explained choose right format project and understanding mp3 vs flac which audio format to choose.

References

  1. International Telecommunication Union. ITU-R BS.775-3 Multichannel stereophonic sound system with and without accompanying picture. https://www.itu.int/rec/R-REC-BS.775
  2. Internet Engineering Task Force. Definition of the Opus Audio Codec. RFC 6716. https://www.rfc-editor.org/rfc/rfc6716
  3. Xiph.Org Foundation. FLAC Format Specification. https://xiph.org/flac/format.html
  4. ISO/IEC 11172-3:1993 MPEG-1 Audio. https://www.iso.org/standard/22411.html
  5. Lipshitz, S. P., Wannamaker, R. A., and Vanderkooy, J. (1992). Quantization and Dither: A Theoretical Survey. Journal of the Audio Engineering Society. https://www.aes.org/e-lib/browse.cfm?elib=7047
  6. ffmpeg documentation. https://ffmpeg.org/documentation.html
  7. SoX Resampler library (libsoxr). https://sourceforge.net/projects/soxr/
  8. AES Technical Council. AES17-2020 standard method for digital audio engineering measurement. https://www.aes.org/publications/standards/