A podcast network processing a hundred episodes a week, a music library digitizing decades of vinyl, a language-learning company producing graded readings in fourteen languages: these are wildly different operations with the same underlying problem. Many audio files in, consistent audio files out, no human ear involved in the routine cases. The skill is not knowing one audio editor well; it is composing a pipeline that handles loudness normalization, format conversion, metadata preservation, and quality control as a single repeatable process.
This guide focuses on the open-source toolchain that powers most production audio pipelines: ffmpeg, sox, lame, and a handful of measurement utilities. Commercial DAWs are excellent for creative work but are the wrong tool for batch conversion because they are GUI-first and license-bound. A scripted pipeline costs nothing to run on every machine in a render farm.
Why audio batches need different thinking from video
Audio files are small. A 60-minute podcast at 128 kbps AAC is roughly 60 MB. A thousand of them fit on a laptop SSD. This makes audio batches I/O-cheap and CPU-cheap, which means the engineering effort should focus on the parts that are easy to get wrong: loudness, metadata, and format compatibility.
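The size figure above is plain arithmetic: bitrate times duration, divided by eight bits per byte. A minimal sketch, assuming decimal megabytes and using two hypothetical helper names (`est_size_mb`, `batch_size_gb`) that are not part of any tool:

```bash
# Estimated encoded size in MB (10^6 bytes): kbps * 1000 / 8 bytes per second.
est_size_mb() {
  local kbps="$1" minutes="$2"
  awk -v k="$kbps" -v m="$minutes" \
    'BEGIN { printf "%.1f\n", k * 1000 / 8 * m * 60 / 1e6 }'
}

# Total size of a batch of equal-length files, in GB (10^9 bytes).
batch_size_gb() {
  local count="$1" kbps="$2" minutes="$3"
  awk -v n="$count" -v k="$kbps" -v m="$minutes" \
    'BEGIN { printf "%.1f\n", n * k * 1000 / 8 * m * 60 / 1e9 }'
}

est_size_mb 128 60         # one 60-minute episode at 128 kbps
batch_size_gb 1000 128 60  # a thousand such episodes
```

Both calls print 57.6: just under 58 MB per episode and just under 58 GB for a thousand of them, before container and tag overhead.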
The other distinction is that human ears notice errors video viewers tolerate. A single dropped video frame usually goes unnoticed. A click or pop in audio is jarring. A loudness mismatch of 2 or 3 dB between episodes makes listeners reach for the volume knob. Audio batches must hit their quality targets precisely, and the only way to verify that across hundreds of files is automated measurement.
"It is far better to grasp the universe as it really is than to persist in delusion." Carl Sagan, The Demon-Haunted World
Most audio batch failures come from assuming the inputs are uniform when they are not. Probe every input, branch on what you find, and let the script decide.
The four operations every batch must handle
A production audio pipeline performs four distinct operations, in order. Conflating them is the primary source of bugs.
| Stage | What it does | Tool of choice | Failure mode if skipped |
|---|---|---|---|
| Probe | Detect format, channels, sample rate, duration | ffprobe | Pipeline assumes uniformity it does not have |
| Normalize | Bring loudness to a defined target | ffmpeg loudnorm or lufs-meter | Episodes vary in apparent volume |
| Convert | Transcode to delivery format | ffmpeg or lame | Wrong codec, wrong sample rate, wrong channel count |
| Tag | Write or carry metadata | ffmpeg -metadata or eyeD3 | Players show "Track 01 Unknown Artist" |
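The Probe stage can be sketched as a thin wrapper around ffprobe plus a small decision function. This is an illustration rather than a fixed recipe: `probe_audio`, `probe_field`, and `plan_conversion` are hypothetical names, and the 44.1 kHz stereo target is an assumed default.

```bash
# Print key=value lines for the first audio stream. Requires ffprobe on PATH.
probe_audio() {
  ffprobe -v error -select_streams a:0 \
    -show_entries stream=codec_name,sample_rate,channels:format=duration \
    -of default=noprint_wrappers=1 "$1"
}

# Read probe output on stdin and print the value of one key.
probe_field() {
  awk -F= -v k="$1" '$1 == k { print $2; exit }'
}

# Decide what the Convert stage has to do for this source.
# Prints "none", "resample", "remix", or "resample+remix".
plan_conversion() {
  local rate="$1" channels="$2"
  local target_rate="${3:-44100}" target_channels="${4:-2}"
  local plan=""
  if [ "$rate" != "$target_rate" ]; then plan="resample"; fi
  if [ "$channels" != "$target_channels" ]; then plan="${plan:+$plan+}remix"; fi
  echo "${plan:-none}"
}

# Typical use (commented out so the sketch has no side effects):
#   info=$(probe_audio raw/episode.wav)
#   rate=$(printf '%s\n' "$info" | probe_field sample_rate)
#   channels=$(printf '%s\n' "$info" | probe_field channels)
#   plan_conversion "$rate" "$channels"
```

Keeping the decision in a pure function means the branching logic can be exercised without any audio files on disk.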
A minimum viable batch with ffmpeg
Here is the simplest batch that still respects all four stages, written for podcast distribution.
#!/usr/bin/env bash
set -euo pipefail
INPUT_DIR="${1:-./raw}"
OUTPUT_DIR="${2:-./distribute}"
TARGET_LUFS=-16
TARGET_TP=-1
TARGET_LRA=11
mkdir -p "$OUTPUT_DIR"
# Without nullglob, an empty input directory leaves the pattern unexpanded and ffmpeg fails on it
shopt -s nullglob
for src in "$INPUT_DIR"/*.wav; do
  base=$(basename "$src" .wav)
  out="$OUTPUT_DIR/$base.mp3"
  # Single-pass loudnorm with conservative targets
  ffmpeg -hide_banner -y -i "$src" \
    -af "loudnorm=I=$TARGET_LUFS:TP=$TARGET_TP:LRA=$TARGET_LRA" \
    -ar 44100 -ac 2 \
    -c:a libmp3lame -b:a 128k \
    -map_metadata 0 \
    "$out"
done
This works, but it processes one file at a time and uses single-pass loudnorm, which can drift up to 1 LU off target on dynamic content. For production, the sections on two-pass loudnorm and parallelism below fix both issues.
Two-pass loudnorm: the right way to hit a target
The ffmpeg loudnorm filter implements EBU R128 loudness normalization. In single-pass mode it estimates and adjusts in one go. In two-pass mode it measures first, then re-runs with the measurements passed back into the filter, typically landing within about 0.1 LU of the target.
# Pass 1: measure
stats=$(ffmpeg -hide_banner -i "$src" \
-af "loudnorm=I=-16:TP=-1:LRA=11:print_format=json" \
-f null - 2>&1 | sed -n '/^{/,/^}/p')
input_i=$(echo "$stats" | jq -r '.input_i')
input_tp=$(echo "$stats" | jq -r '.input_tp')
input_lra=$(echo "$stats" | jq -r '.input_lra')
input_thresh=$(echo "$stats" | jq -r '.input_thresh')
target_offset=$(echo "$stats" | jq -r '.target_offset')
# Pass 2: encode with measured values
ffmpeg -hide_banner -y -i "$src" \
-af "loudnorm=I=-16:TP=-1:LRA=11:measured_I=$input_i:measured_TP=$input_tp:measured_LRA=$input_lra:measured_thresh=$input_thresh:offset=$target_offset:linear=true" \
-ar 44100 -ac 2 \
-c:a libmp3lame -b:a 128k \
-map_metadata 0 \
"$out"
The linear=true flag in the second pass asks for linear gain rather than dynamic processing when the measured values allow it, which preserves the source's microdynamics and avoids the "pumping" that single-pass loudnorm can introduce on speech with long pauses.
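One caveat worth checking in a script: per the ffmpeg documentation, loudnorm only stays in linear mode when the target LRA is not lower than the measured LRA and the required gain does not push the true peak above the TP target. Otherwise it reverts to dynamic mode. A minimal sketch of that check, using a hypothetical helper `can_stay_linear` fed with the pass-1 measurements:

```bash
# Exit status 0 if linear normalization is feasible, 1 if loudnorm is
# expected to fall back to dynamic mode.
# Args: measured_I measured_TP measured_LRA [target_I target_TP target_LRA]
can_stay_linear() {
  local measured_i="$1" measured_tp="$2" measured_lra="$3"
  local target_i="${4:--16}" target_tp="${5:--1}" target_lra="${6:-11}"
  awk -v mi="$measured_i" -v mtp="$measured_tp" -v mlra="$measured_lra" \
      -v ti="$target_i" -v ttp="$target_tp" -v tlra="$target_lra" '
    BEGIN {
      gain = ti - mi            # linear gain needed to reach the target, in LU
      peak_after = mtp + gain   # where the true peak lands after that gain
      ok = (mlra <= tlra) && (peak_after <= ttp)
      exit (ok ? 0 : 1)
    }'
}

# Example: -23 LUFS source with a -9 dBTP peak needs +7 LU, peak lands at -2 dBTP.
if can_stay_linear -23 -9 6; then
  echo "linear"
else
  echo "dynamic fallback expected"
fi
```

A pipeline could log the result next to each file so that episodes processed dynamically are the first candidates for spot-listening.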
Codec selection by destination
Audio codec choice is more constrained than video. The handful of codecs in the table below covers nearly every real-world batch.
| Destination | Codec | Bitrate | Container | Why |
|---|---|---|---|---|
| Podcast distribution | AAC-LC or MP3 | 96-128 kbps stereo, 64 kbps mono | MP4 or MP3 | Apple, Spotify, RSS readers expect these |
| Music streaming masters | FLAC | Lossless | FLAC | Streaming services derive their lossy outputs from FLAC |
| Voice notes and audiobooks | Opus | 32-64 kbps | OGG or WebM | Best speech quality at low bitrate |
| Broadcast delivery | PCM 24-bit | 2.3 Mbps stereo at 48 kHz | WAV or BWF | Required by broadcast ingest |
| Web background audio | Opus or AAC | 64 kbps | WebM or MP4 | Universal browser support |
| Archival masters | FLAC or WAV | Lossless | FLAC or BWF | Bit-exact preservation |
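In a script, a table like this usually becomes a lookup function. The sketch below is one possible encoding: the destination keys (`podcast`, `voice`, and so on) and the function names are hypothetical, and the bitrates are single picks from the ranges in the table.

```bash
# Map a destination key to ffmpeg audio encoder arguments.
codec_args() {
  case "$1" in
    podcast)        echo "-c:a libmp3lame -b:a 128k -ar 44100 -ac 2" ;;
    podcast-mono)   echo "-c:a libmp3lame -b:a 64k -ar 44100 -ac 1" ;;
    master|archive) echo "-c:a flac" ;;
    voice)          echo "-c:a libopus -b:a 48k" ;;
    broadcast)      echo "-c:a pcm_s24le -ar 48000" ;;
    web)            echo "-c:a libopus -b:a 64k" ;;
    *) echo "unknown destination: $1" >&2; return 1 ;;
  esac
}

# Map the same key to an output file extension.
output_ext() {
  case "$1" in
    podcast|podcast-mono) echo "mp3" ;;
    master|archive)       echo "flac" ;;
    voice)                echo "ogg" ;;
    broadcast)            echo "wav" ;;
    web)                  echo "webm" ;;
    *) echo "unknown destination: $1" >&2; return 1 ;;
  esac
}

# Typical use (commented out so the sketch has no side effects):
#   ffmpeg -i in.wav $(codec_args voice) "out.$(output_ext voice)"
```

Failing loudly on an unknown key is deliberate: a typo in a destination name should stop the batch rather than fall through to a default codec.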
Sample rate and channel handling
Resampling is not free. A poor resampler introduces audible aliasing or pre-ringing that no amount of EQ can remove. ffmpeg's default resampler (the native swresample engine) is acceptable but not great; the aresample filter with the soxr engine produces SoX-quality results at minimal speed cost, provided your ffmpeg build includes libsoxr.
ffmpeg -i input.wav \
-af "aresample=resampler=soxr:precision=28:dither_method=triangular_hp" \
-ar 44100 -ac 2 \
output.flac
The precision=28 setting selects SoX's "Very High Quality" mode (the default of 20 corresponds to "High Quality"). The dither_method=triangular_hp setting adds high-pass triangular dither during bit-depth reduction, which replaces signal-correlated truncation distortion with a faint, benign noise floor.
For channel handling, do not blindly downmix. A surround source downmixed without proper coefficients can lose dialogue or duplicate it. Use ffmpeg's explicit pan filter for known cases:
# 5.1 to stereo downmix preserving center channel intelligibility.
# "<" instead of "=" renormalizes the gains so the summed channels avoid clipping.
ffmpeg -i surround.wav \
-af "pan=stereo|FL<FC+0.30*FL+0.30*BL|FR<FC+0.30*FR+0.30*BR" \
-c:a flac stereo.flac
"Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowcharts; they'll be obvious." Fred Brooks, The Mythical Man-Month
The data structure that matters in an audio pipeline is the manifest: a row per input file with probe results, target settings, output paths, and a status field. Pipelines that lack this manifest cannot recover from interruption and cannot prove what they did to which file.
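A manifest does not need a database to be useful. The sketch below keeps it as a tab-separated file with one row per input. The column set, the `MANIFEST` variable, and the `manifest_*` function names are all assumptions made for illustration, and paths are assumed to contain no tabs or backslashes.

```bash
MANIFEST="${MANIFEST:-./manifest.tsv}"

# Create the manifest with a header row if it does not exist yet.
manifest_init() {
  [ -f "$MANIFEST" ] || printf 'src\tout\tsample_rate\tchannels\tstatus\n' > "$MANIFEST"
}

# Append one input as a pending row. Args: src out sample_rate channels
manifest_add() {
  printf '%s\t%s\t%s\t%s\tpending\n' "$1" "$2" "$3" "$4" >> "$MANIFEST"
}

# Update the status column for one source. Args: src status
manifest_set_status() {
  local tmp
  tmp=$(mktemp)
  awk -F'\t' -v OFS='\t' -v s="$1" -v st="$2" \
    'NR > 1 && $1 == s { $5 = st } { print }' "$MANIFEST" > "$tmp" \
    && mv "$tmp" "$MANIFEST"
}

# List sources that still need work, which is what a resumed run iterates over.
manifest_pending() {
  awk -F'\t' 'NR > 1 && $5 != "done" { print $1 }' "$MANIFEST"
}
```

After an interruption, a rerun loops over `manifest_pending` instead of the input directory, so finished files are never encoded twice and the file itself documents what happened to each input.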
Metadata: the difference between a podcast and a file
A podcast episode without ID3 tags is a file. A podcast episode with the right tags is a discoverable, searchable, attributable piece of content. The minimum viable tag set:
# All inputs come first: output options such as -metadata must follow the last -i
ffmpeg -i normalized.wav -i cover.jpg \
-map 0:a -map 1:v \
-metadata title="Episode 47: The Loudness Wars" \
-metadata artist="The Audio Engineering Show" \
-metadata album="Season 3" \
-metadata album_artist="The Audio Engineering Show" \
-metadata track="47/52" \
-metadata date="2026" \
-metadata genre="Podcast" \
-metadata comment="https://example.com/show/47" \
-metadata description="In this episode we discuss..." \
-metadata:s:v title="Album cover" \
-metadata:s:v comment="Cover (front)" \
-c:v copy -id3v2_version 3 \
-c:a libmp3lame -b:a 128k \
episode-47.mp3
The -id3v2_version 3 flag is important. ID3v2.4 has better Unicode handling but is poorly supported by older players, including some car stereos and podcast apps that have not been updated since 2018. ID3v2.3 is the safer default for distribution.
For chapters in podcast episodes, write an ffmetadata sidecar and pass it explicitly:
cat > chapters.txt <<'EOF'
;FFMETADATA1
[CHAPTER]
TIMEBASE=1/1000
START=0
END=125000
title=Introduction
[CHAPTER]
TIMEBASE=1/1000
START=125000
END=890000
title=The history of the loudness war
[CHAPTER]
TIMEBASE=1/1000
START=890000
END=2400000
title=Modern streaming targets
EOF
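# Keep the episode's own tags (input 0) and take only the chapters from the sidecar (input 1)
ffmpeg -i input.mp3 -i chapters.txt -map_metadata 0 -map_chapters 1 -id3v2_version 3 -c copy output.mp3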
Apple Podcasts, Overcast, Pocket Casts, and most modern listening apps respect in-file chapters (ID3 chapter frames in MP3, chapter atoms in MP4). RSS feed shownotes are still the more universal option, but in-file chapters give listeners scrubbable navigation.
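Writing the sidecar by hand gets tedious once a show has more than a few chapters. The sketch below generates it from a two-column list of start times in seconds and titles. The function name `make_chapters` and the tab-separated input format are assumptions, and titles are assumed not to contain characters that need escaping in ffmetadata files (`=`, `;`, `#`, `\`).

```bash
# Read "start_seconds<TAB>title" lines on stdin and print an ffmetadata file.
# Each chapter ends where the next begins; the last one ends at TOTAL_SECONDS.
# Usage: make_chapters TOTAL_SECONDS < chapters.tsv > chapters.txt
make_chapters() {
  awk -F'\t' -v total="$1" '
    { start[NR] = $1; title[NR] = $2 }
    END {
      print ";FFMETADATA1"
      for (i = 1; i <= NR; i++) {
        stop = (i < NR) ? start[i + 1] : total
        print "[CHAPTER]"
        print "TIMEBASE=1/1000"
        printf "START=%d\n", start[i] * 1000
        printf "END=%d\n", stop * 1000
        print "title=" title[i]
      }
    }'
}

printf '0\tIntroduction\n125\tThe history of the loudness war\n890\tModern streaming targets\n' \
  | make_chapters 2400
```

Deriving each END from the next START keeps the chapters contiguous, which avoids the gaps and overlaps that hand-edited timestamps tend to accumulate.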
Parallelism: when it helps and when it does not
Audio batches need less parallelism than video batches because each encode is short and single-threaded: a typical 60-minute podcast episode encodes to MP3 in 8 to 15 seconds on a single core. Running four encodes in parallel gives roughly a 3x speedup, not 4x, because the jobs compete for disk I/O and other shared resources.
The right pattern is moderate parallelism (4 to 8 jobs) with a manifest-driven runner.
# Pass files as arguments rather than parsing ls output
parallel -j 6 --joblog audio.log \
'./encode-one.sh {} distribute/{/.}.mp3' ::: raw/*.wav
For very large batches (tens of thousands of files), distribute across machines using a queue. The same pattern that drives content-distribution workflows works for audio: a Redis or RabbitMQ queue, a worker pool, and a manifest table.
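The job log written by `--joblog` is what makes retries cheap. GNU parallel writes it as a tab-separated file whose columns are Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval, Signal, and Command. A small sketch for pulling out the failures, with `failed_jobs` and `failed_count` as hypothetical helper names:

```bash
# Print the command line of every job that exited non-zero or died on a signal.
# Assumes the logged commands contain no tab characters.
failed_jobs() {
  awk -F'\t' 'NR > 1 && ($7 != 0 || $8 != 0) { print $9 }' "$1"
}

# Count failures, for use in a final exit status or an alert.
failed_count() {
  failed_jobs "$1" | grep -c . || true
}

# Typical use (commented out so the sketch has no side effects):
#   failed_jobs audio.log
#   parallel --retry-failed --joblog audio.log
```

GNU parallel can also rerun the failures on its own with `--retry-failed`, as in the last comment; parsing the log yourself is mainly useful for reporting and for feeding a manifest.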
Quality control without listening to every file
Nobody can listen to a 200-episode batch end to end. Automated checks cover 95 percent of failure modes; spot-listening covers the rest.
# Verify integrated loudness landed within 0.5 LU of target
out="output.mp3"
measured=$(ffmpeg -hide_banner -i "$out" \
-af "ebur128=peak=true" -f null - 2>&1 \
| grep "I:" | tail -1 | awk '{print $2}')
if (( $(echo "$measured < -16.5 || $measured > -15.5" | bc -l) )); then
echo "FAIL: $out measured $measured LUFS (target -16)"
fi
Combine with checks for true peak, clipping, silence at start/end (which usually indicates a bad cut), and unexpected duration changes between input and output (which usually indicates a sample-rate confusion).
| Check | Tool | Threshold | Common cause of failure |
|---|---|---|---|
| Integrated loudness | ffmpeg ebur128 | Within 0.5 LU of target | Skipped loudnorm pass |
| True peak | ffmpeg ebur128 | At or below -1.0 dBTP | Aggressive limiter, no headroom |
| Duration delta | ffprobe | Within 100 ms of source | Sample rate mismatch |
| Leading silence | ffmpeg silencedetect | Below 2 seconds | Bad edit, intro cut wrong |
| Trailing silence | ffmpeg silencedetect | Below 5 seconds | Outro not trimmed |
| Channel layout | ffprobe | Matches expected | Mono source forced to stereo or vice versa |
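The numeric rows of the table reduce to two comparisons: "within a tolerance of a target" and "at or below a limit". A sketch of a per-file report built on those two, assuming the measurements have already been extracted from ffmpeg and ffprobe. The helper names (`within`, `at_most`, `qc_report`) are hypothetical, and the -16 LUFS target is the podcast figure used throughout this guide.

```bash
# Exit 0 if |value - target| <= tolerance.
within() {
  awk -v v="$1" -v t="$2" -v tol="$3" \
    'BEGIN { d = v - t; if (d < 0) d = -d; exit ((d <= tol) ? 0 : 1) }'
}

# Exit 0 if value <= limit.
at_most() {
  awk -v v="$1" -v lim="$2" 'BEGIN { exit ((v <= lim) ? 0 : 1) }'
}

# Print PASS or FAIL with the offending measurements.
# Args: file lufs true_peak_dbtp source_seconds output_seconds
qc_report() {
  local file="$1" lufs="$2" tp="$3" src_dur="$4" out_dur="$5"
  local fails=""
  within "$lufs" -16 0.5 || fails="$fails loudness=$lufs"
  at_most "$tp" -1.0 || fails="$fails true_peak=$tp"
  within "$out_dur" "$src_dur" 0.1 || fails="$fails duration=$out_dur"
  if [ -n "$fails" ]; then
    echo "FAIL $file$fails"
    return 1
  fi
  echo "PASS $file"
}

# Typical use (commented out so the sketch has no side effects):
#   qc_report distribute/ep47.mp3 "$measured_lufs" "$measured_tp" "$src_seconds" "$out_seconds"
```

Reporting every failed check on one line, rather than stopping at the first, makes the output easy to grep across a whole batch.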
A real-world podcast pipeline
The pipeline below handles a typical podcast network's daily output: 5 to 20 episodes, each from a different host with different recording setups, all targeting -16 LUFS and 128 kbps MP3 distribution.
#!/usr/bin/env bash
set -euo pipefail
INPUT_DIR="${1:-./incoming}"
OUTPUT_DIR="${2:-./distribute}"
PARALLEL="${3:-6}"
mkdir -p "$OUTPUT_DIR" ./manifest ./logs
process_one() {
  local src="$1"
  local base
  base=$(basename "$src" | sed 's/\.[^.]*$//')
  local out="$OUTPUT_DIR/$base.mp3"
  local manifest="./manifest/$base.json"
  # Probe
  ffprobe -v error -show_streams -show_format -of json "$src" > "$manifest"
  # Pass 1: measure
  local stats
  stats=$(ffmpeg -hide_banner -i "$src" \
    -af "loudnorm=I=-16:TP=-1:LRA=11:print_format=json" \
    -f null - 2>&1 | sed -n '/^{/,/^}/p')
  local i tp lra thresh offset
  i=$(echo "$stats" | jq -r '.input_i')
  tp=$(echo "$stats" | jq -r '.input_tp')
  lra=$(echo "$stats" | jq -r '.input_lra')
  thresh=$(echo "$stats" | jq -r '.input_thresh')
  offset=$(echo "$stats" | jq -r '.target_offset')
  # Pass 2: encode
  ffmpeg -hide_banner -y -i "$src" \
    -af "loudnorm=I=-16:TP=-1:LRA=11:measured_I=$i:measured_TP=$tp:measured_LRA=$lra:measured_thresh=$thresh:offset=$offset:linear=true,aresample=44100" \
    -ac 2 -c:a libmp3lame -b:a 128k \
    -map_metadata 0 -id3v2_version 3 \
    "$out" 2> "./logs/$base.log"
  # Verify
  local measured
  measured=$(ffmpeg -hide_banner -i "$out" -af ebur128 -f null - 2>&1 \
    | grep "I:" | tail -1 | awk '{print $2}')
  echo "{\"file\":\"$out\",\"measured_lufs\":$measured}" > "./manifest/$base-result.json"
}
export -f process_one
export OUTPUT_DIR
find "$INPUT_DIR" -type f \( -iname "*.wav" -o -iname "*.flac" -o -iname "*.m4a" \) \
| parallel -j "$PARALLEL" --joblog ./logs/batch.log process_one
Run this and you get a folder of distribution-ready MP3s, a per-episode manifest with probe data and measured loudness, and a job log for retry tooling. Total engineering time after the initial setup: zero per episode.
Cross-domain consistency
Audio pipelines often serve several properties at once. A network producing study-aid audio for practice-test platforms, educational fillers for music-theory training at When Notes Fly, and ambient backgrounds for cafe content at Down Under Cafe can use a single pipeline with per-tenant config. The encoding work is identical; the targets, tags, and output destinations differ.
"If you optimize everything, you will always be unhappy." Donald Knuth, The Art of Computer Programming
Resist the temptation to tune every batch to its tenant's exact preferences if those preferences are within the perceptually transparent range. A unified pipeline that produces good-enough output for everyone is more maintainable than seven custom pipelines.
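Per-tenant config can be as small as one lookup function that feeds a shared encode command. In the sketch below the tenant keys, the settings attached to them, and the function names are all invented for illustration; values are assumed to contain no spaces.

```bash
# Per-tenant settings as key=value lines. Everything here is a placeholder.
tenant_config() {
  case "$1" in
    study-aids)   printf 'lufs=-16\nbitrate=96k\nchannels=1\ngenre=Education\n' ;;
    music-theory) printf 'lufs=-16\nbitrate=128k\nchannels=2\ngenre=Education\n' ;;
    cafe-ambient) printf 'lufs=-18\nbitrate=128k\nchannels=2\ngenre=Ambient\n' ;;
    *) echo "unknown tenant: $1" >&2; return 1 ;;
  esac
}

# Look up one setting. Args: tenant key
tenant_get() {
  tenant_config "$1" | awk -F= -v k="$2" '$1 == k { print $2 }'
}

# Build the shared ffmpeg arguments with the tenant's values filled in.
encode_args() {
  local t="$1"
  tenant_config "$t" > /dev/null || return 1
  echo "-af loudnorm=I=$(tenant_get "$t" lufs):TP=-1:LRA=11 -ar 44100" \
    "-ac $(tenant_get "$t" channels) -c:a libmp3lame" \
    "-b:a $(tenant_get "$t" bitrate) -metadata genre=$(tenant_get "$t" genre)"
}

# Typical use (commented out so the sketch has no side effects):
#   ffmpeg -i in.wav $(encode_args cafe-ambient) out.mp3
```

Only the values differ per tenant; the command shape stays the same, which is the property that keeps one pipeline maintainable.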
Sample-rate routing in mixed batches
A batch that handles podcast interviews recorded on different equipment will see 44.1, 48, and occasionally 96 kHz sources in the same folder. A naive resample to 44.1 for everything wastes CPU on already-44.1 sources and risks subtle quality loss.
# Probe and route by sample rate
mkdir -p out
for src in incoming/*; do
  rate=$(ffprobe -v error -select_streams a:0 \
    -show_entries stream=sample_rate -of csv=p=0 "$src")
  # Strip directory and extension so the output is not named foo.wav.mp3
  base=$(basename "${src%.*}")
  if [[ "$rate" == "44100" ]]; then
    # No explicit resample stage. loudnorm upsamples internally for true-peak
    # detection, so -ar still pins the output rate to 44.1 kHz.
    ffmpeg -i "$src" -af "loudnorm=I=-16:TP=-1:LRA=11" \
      -ar 44100 \
      -c:a libmp3lame -q:a 2 -id3v2_version 3 \
      "out/$base.mp3"
  else
    # Resample with the high-quality soxr resampler
    ffmpeg -i "$src" \
      -af "aresample=resampler=soxr:precision=28:dither_method=triangular_hp,loudnorm=I=-16:TP=-1:LRA=11" \
      -ar 44100 \
      -c:a libmp3lame -q:a 2 -id3v2_version 3 \
      "out/$base.mp3"
  fi
done
The probe-and-route pattern saves 10 to 20 percent CPU on typical mixed batches and avoids unnecessary quality risk.
Long-form audiobook batches
Audiobook production is its own genre with strict spec requirements. ACX, Audible's submission portal, requires RMS between -23 dB and -18 dB, peaks no higher than -3 dB, a noise floor no higher than -60 dB RMS, and 192 kbps or higher CBR MP3 at 44.1 kHz, with every file in a title using the same channel format (mono in this example). A batch that handles audiobook chapters must verify each chapter against these specs before submission.
# Filter first, then normalize, so the filters do not shift the level loudnorm has set
ffmpeg -i chapter01.wav \
-af "highpass=f=80,lowpass=f=18000,loudnorm=I=-20:TP=-3:LRA=7" \
-ar 44100 -ac 1 \
-c:a libmp3lame -b:a 192k \
-id3v2_version 3 \
-metadata title="Chapter 01" \
-metadata artist="Author Name" \
-metadata album="Book Title" \
-metadata track="1/24" \
ch01.mp3
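The high-pass filter at 80 Hz removes mic stand rumble; the low-pass at 18 kHz removes any residual hiss above the speech-relevant band. ACX rejects submissions where the noise floor between dialogue is too high, and these filters help keep that floor below the -60 dB threshold. Note that loudnorm targets LUFS, which only approximates the RMS figure ACX measures, so check each chapter's RMS directly rather than trusting the loudnorm target.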
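The spec check itself is three comparisons once the numbers are in hand. The sketch below assumes the RMS level, peak level, and noise floor (all in dB) have already been measured, for example with ffmpeg's astats or volumedetect filters. The function name `acx_check` is hypothetical, and the thresholds are the ones quoted above.

```bash
# Check one chapter against the ACX level requirements quoted in this guide.
# Args: rms_db peak_db noise_floor_db
# Prints one line per violation and exits non-zero if there is any.
acx_check() {
  awk -v rms="$1" -v peak="$2" -v nf="$3" '
    BEGIN {
      n = 0
      if (rms < -23 || rms > -18) { print "rms out of range: " rms; n++ }
      if (peak > -3)              { print "peak too high: " peak; n++ }
      if (nf > -60)               { print "noise floor too high: " nf; n++ }
      exit (n ? 1 : 0)
    }'
}

# Typical use (commented out so the sketch has no side effects):
#   acx_check "$rms" "$peak" "$floor" || echo "ch01.mp3 needs rework"
```

Running this over every chapter before upload turns a rejection that would arrive days later into a failure the batch reports immediately.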
Common mistakes that survive years of practice
Three errors recur. First, encoding to MP3 as the master and then re-encoding for each platform compounds quality loss; always keep a lossless master. Second, single-pass loudnorm leaves episodes within 1 LU of each other but not within 0.1 LU; if your show targets a specific platform, do two-pass. Third, batches that strip metadata save bytes but lose searchability; always carry tags through unless you have a documented reason to strip.
A pipeline that respects these three rules ages gracefully through codec changes, platform changes, and team changes.
References
- EBU Recommendation R 128, "Loudness normalisation and permitted maximum level of audio signals." European Broadcasting Union, 2020.
- ITU-R BS.1770-4, "Algorithms to measure audio programme loudness and true-peak audio level." International Telecommunication Union, 2015.
- ISO/IEC 14496-3:2019, "Information technology - Coding of audio-visual objects - Part 3: Audio." International Organization for Standardization (AAC specification).
- RFC 6716, "Definition of the Opus Audio Codec." Internet Engineering Task Force, 2012. doi:10.17487/RFC6716
- Brandenburg, K., "MP3 and AAC explained." Proceedings of the AES 17th International Conference on High-Quality Audio Coding, 1999.
- Coalson, J., "FLAC - Free Lossless Audio Codec format specification." Available: https://xiph.org/flac/format.html
- ID3.org, "ID3 tag version 2.4.0 - Main Structure." Available: https://id3.org/id3v2.4.0-structure
- Smith, J. O., "Digital Audio Resampling Home Page." Center for Computer Research in Music and Acoustics, Stanford University. Available: https://ccrma.stanford.edu/~jos/resample/