A podcast network processing a hundred episodes a week, a music library digitizing decades of vinyl, a language-learning company producing graded readings in fourteen languages: these are wildly different operations with the same underlying problem. Many audio files in, consistent audio files out, no human ear involved in the routine cases. The skill is not knowing one audio editor well; it is composing a pipeline that handles loudness normalization, format conversion, metadata preservation, and quality control as a single repeatable process.

This guide focuses on the open-source toolchain that powers most production audio pipelines: ffmpeg, sox, lame, and a handful of measurement utilities. Commercial DAWs are excellent for creative work but are the wrong tool for batch conversion because they are GUI-first and license-bound. A scripted pipeline costs nothing to run on every machine in a render farm.

Why audio batches need different thinking from video

Audio files are small. A 60-minute podcast at 128 kbps AAC is roughly 60 MB. A thousand of them fit on a laptop SSD. This makes audio batches I/O-cheap and CPU-cheap, which means the engineering effort should focus on the parts that are easy to get wrong: loudness, metadata, and format compatibility.

The other distinction is that listeners notice errors that video viewers tolerate. A single dropped video frame often goes unnoticed. A click or pop in audio is jarring. A loudness mismatch of a couple of dB between episodes makes listeners reach for the volume knob. Audio batches must hit their quality targets precisely, and the only way to verify that across hundreds of files is automated measurement.

"It is far better to grasp the universe as it really is than to persist in delusion." Carl Sagan, The Demon-Haunted World

Most audio batch failures come from assuming the inputs are uniform when they are not. Probe every input, branch on what you find, and let the script decide.

The four operations every batch must handle

A production audio pipeline performs four distinct operations, in order. Conflating them is the primary source of bugs.

| Stage | What it does | Tool of choice | Failure mode if skipped |
| --- | --- | --- | --- |
| Probe | Detect format, channels, sample rate, duration | ffprobe | Pipeline assumes uniformity it does not have |
| Normalize | Bring loudness to a defined target | ffmpeg loudnorm or lufs-meter | Episodes vary in apparent volume |
| Convert | Transcode to delivery format | ffmpeg or lame | Wrong codec, wrong sample rate, wrong channel count |
| Tag | Write or carry metadata | ffmpeg -metadata or eyeD3 | Players show "Track 01 Unknown Artist" |

Skipping the probe stage is the single most common batch mistake. A script that assumes every input is 44.1 kHz stereo will silently mangle the one mono 32 kHz lecture recording in the folder.
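
One way to make the probe stage actionable is to turn ffprobe's output into an explicit plan per file. The sketch below is illustrative only: the function name plan_from_probe, the step labels, and the two TARGET_ variables are assumptions, not part of any tool. It reads key=value lines such as those printed by `ffprobe -v error -select_streams a:0 -show_entries stream=sample_rate,channels -of default=nw=1 "$src"`.

```shell
#!/usr/bin/env bash
# Sketch: derive a conversion plan from ffprobe key=value output on stdin.
# TARGET_RATE and TARGET_CHANNELS are assumed delivery settings.
TARGET_RATE=44100
TARGET_CHANNELS=2

plan_from_probe() {
  local rate="" channels="" key value
  # Pick out only the two fields the plan depends on.
  while IFS='=' read -r key value; do
    case "$key" in
      sample_rate) rate="$value" ;;
      channels)    channels="$value" ;;
    esac
  done
  # No audio stream found: refuse rather than guess.
  if [[ -z "$rate" || -z "$channels" ]]; then
    echo "reject:no-audio-stream"
    return 1
  fi
  local steps=()
  [[ "$rate" != "$TARGET_RATE" ]] && steps+=("resample:${rate}->${TARGET_RATE}")
  if (( channels == 1 && TARGET_CHANNELS == 2 )); then
    steps+=("upmix:mono->stereo")
  elif (( channels > TARGET_CHANNELS )); then
    steps+=("downmix:${channels}->${TARGET_CHANNELS}")
  fi
  (( ${#steps[@]} == 0 )) && steps=("passthrough")
  local IFS=','
  echo "${steps[*]}"
}
```

Fed the mono 32 kHz lecture from the example above, the function reports both a resample and an upmix step instead of letting the encode proceed on a wrong assumption.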

A minimum viable batch with ffmpeg

Here is the simplest batch that still respects all four stages, written for podcast distribution.

#!/usr/bin/env bash
set -euo pipefail

INPUT_DIR="${1:-./raw}"
OUTPUT_DIR="${2:-./distribute}"
TARGET_LUFS=-16
TARGET_TP=-1
TARGET_LRA=11

mkdir -p "$OUTPUT_DIR"
shopt -s nullglob   # an empty input folder yields zero iterations, not a literal "*.wav"

for src in "$INPUT_DIR"/*.wav; do
  base=$(basename "$src" .wav)
  out="$OUTPUT_DIR/$base.mp3"

  # Single-pass loudnorm with conservative targets
  ffmpeg -hide_banner -y -i "$src" \
    -af "loudnorm=I=$TARGET_LUFS:TP=$TARGET_TP:LRA=$TARGET_LRA" \
    -ar 44100 -ac 2 \
    -c:a libmp3lame -b:a 128k \
    -map_metadata 0 \
    "$out"
done

This works, but it processes one file at a time and uses single-pass loudnorm, which can land as much as 1 LU off target on dynamic content. For production, the two-pass and parallelism sections below fix both issues.

Two-pass loudnorm: the right way to hit a target

The ffmpeg loudnorm filter implements EBU R128 loudness normalization. In single-pass mode it measures and adjusts on the fly. In two-pass mode it measures the whole file first, then re-runs with the measurements passed to the filter as parameters, which typically lands within about 0.1 LU of the target.

# Pass 1: measure
stats=$(ffmpeg -hide_banner -i "$src" \
  -af "loudnorm=I=-16:TP=-1:LRA=11:print_format=json" \
  -f null - 2>&1 | sed -n '/^{/,/^}/p')

input_i=$(echo "$stats" | jq -r '.input_i')
input_tp=$(echo "$stats" | jq -r '.input_tp')
input_lra=$(echo "$stats" | jq -r '.input_lra')
input_thresh=$(echo "$stats" | jq -r '.input_thresh')
target_offset=$(echo "$stats" | jq -r '.target_offset')

# Pass 2: encode with measured values
ffmpeg -hide_banner -y -i "$src" \
  -af "loudnorm=I=-16:TP=-1:LRA=11:measured_I=$input_i:measured_TP=$input_tp:measured_LRA=$input_lra:measured_thresh=$input_thresh:offset=$target_offset:linear=true" \
  -ar 44100 -ac 2 \
  -c:a libmp3lame -b:a 128k \
  -map_metadata 0 \
  "$out"

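The linear=true flag in the second pass requests a single linear gain rather than dynamic processing, which preserves the source's microdynamics and avoids the "pumping" that single-pass loudnorm can introduce on speech with long pauses. The filter falls back to dynamic mode when linear gain cannot meet the targets, for example when the measured loudness range exceeds the LRA target or the required gain would push the true peak above TP.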
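
Whether the second pass can stay linear follows from the pass-one measurements, so it can be predicted before encoding. The sketch below is an assumption about the condition the filter applies (gain needed is target I minus measured I; the shifted true peak and the measured LRA must both sit within their targets), and the function name linear_mode_ok is hypothetical. Confirm against the filter's own summary output before relying on it.

```shell
# Sketch: predict whether loudnorm's second pass can use linear gain.
# Arguments: measured_I measured_TP measured_LRA target_I target_TP target_LRA
linear_mode_ok() {
  awk -v mi="$1" -v mtp="$2" -v mlra="$3" \
      -v ti="$4" -v ttp="$5" -v tlra="$6" 'BEGIN {
    gain = ti - mi              # dB of gain needed to reach the loudness target
    peak_after = mtp + gain     # where the true peak lands after that gain
    if (peak_after <= ttp && mlra <= tlra) { print "linear"; exit 0 }
    print "dynamic"; exit 1
  }'
}
```

A file measured at -18 LUFS with a -4 dBTP peak needs 2 dB of gain and ends at -2 dBTP, so it stays linear. A file at -23 LUFS with a -6 dBTP peak would end at +1 dBTP, so dynamic processing is to be expected.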

Codec selection by destination

Audio codec choice is more constrained than video. A handful of formats (MP3, AAC, Opus, FLAC, and uncompressed PCM) cover nearly all real-world batches.

| Destination | Codec | Bitrate | Container | Why |
| --- | --- | --- | --- | --- |
| Podcast distribution | AAC-LC or MP3 | 96-128 kbps stereo, 64 kbps mono | MP4 or MP3 | Apple, Spotify, RSS readers expect these |
| Music streaming masters | FLAC | Lossless | FLAC | Streaming services derive their lossy outputs from FLAC |
| Voice notes and audiobooks | Opus | 32-64 kbps | OGG or WebM | Best speech quality at low bitrate |
| Broadcast delivery | PCM 24-bit | About 2.3 Mbps stereo at 48 kHz | WAV or BWF | Required by broadcast ingest |
| Web background audio | Opus or AAC | 64 kbps | WebM or MP4 | Universal browser support |
| Archival masters | FLAC or WAV | Lossless | FLAC or BWF | Bit-exact preservation |

The single biggest mistake is encoding masters as MP3. Decoding an MP3 back to PCM does not restore what the encoder discarded, and every further lossy conversion compounds the loss. Always keep a FLAC or WAV master and derive lossy deliverables from it.
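
Storage planning for masters and deliverables follows directly from bitrate. The two helpers below are a minimal sketch (the names pcm_kbps and size_mb are made up for illustration): uncompressed PCM bitrate is sample rate times bit depth times channels, and file size is bitrate times duration divided by eight.

```shell
# Sketch: bitrate and size arithmetic for planning storage.

# PCM bitrate in kbps: sample_rate_hz * bit_depth * channels / 1000
pcm_kbps() {
  echo $(( $1 * $2 * $3 / 1000 ))
}

# Approximate size in MB (10^6 bytes) for a bitrate in kbps and a length in minutes
size_mb() {
  awk -v kbps="$1" -v min="$2" \
    'BEGIN { printf "%.1f\n", kbps * 1000 * min * 60 / 8 / 1000000 }'
}
```

For example, `size_mb 128 60` gives 57.6 for an hour of 128 kbps audio, while `size_mb "$(pcm_kbps 48000 24 2)" 60` gives 1036.8 for the same hour as a 24-bit 48 kHz stereo master.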

Sample rate and channel handling

Resampling is not free. A poor resampler introduces aliasing or ringing that no amount of EQ can remove. ffmpeg's default resampling engine is acceptable but not great; the aresample filter with the soxr engine produces SoX-quality results at minimal speed cost, provided your ffmpeg build includes libsoxr.

ffmpeg -i input.wav \
  -af "aresample=resampler=soxr:precision=28:dither_method=triangular_hp" \
  -ar 44100 -ac 2 \
  output.flac

precision=28 selects SoX's "Very High Quality" setting (the default of 20 corresponds to "High Quality"). dither_method=triangular_hp applies high-pass triangular dither when the bit depth is reduced. Dither adds a very low level of noise in exchange for removing the signal-correlated distortion that plain truncation produces near the noise floor.

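For channel handling, do not blindly downmix. A surround source downmixed without proper coefficients can lose dialogue or duplicate it. Use ffmpeg's explicit pan filter for known cases. With = the coefficients are applied exactly as written, so the center-heavy mix below can clip on loud passages; writing < in place of = makes pan renormalize the gains so that they sum to 1:

# Center-weighted 5.1 to stereo downmix that keeps dialogue intelligible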
ffmpeg -i surround.wav \
  -af "pan=stereo|FL=FC+0.30*FL+0.30*BL|FR=FC+0.30*FR+0.30*BR" \
  -c:a flac stereo.flac

"Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowcharts; they'll be obvious." Fred Brooks, The Mythical Man-Month

The data structure that matters in an audio pipeline is the manifest: a row per input file with probe results, target settings, output paths, and a status field. Pipelines that lack this manifest cannot recover from interruption and cannot prove what they did to which file.
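
A manifest does not need a database to be useful. The sketch below keeps one tab-separated row per input with a status column; the file location, column order, and function names are all assumptions made for illustration, and it presumes paths contain no tabs or backslashes.

```shell
# Sketch: a TSV manifest with columns
#   source  sample_rate  channels  output  status
MANIFEST="${MANIFEST:-./manifest.tsv}"   # hypothetical location

# Register an input as pending. Args: src rate channels out
manifest_add() {
  printf '%s\t%s\t%s\t%s\tpending\n' "$1" "$2" "$3" "$4" >> "$MANIFEST"
}

# Update the status column for one source. Args: src status
manifest_set_status() {
  local tmp
  tmp=$(mktemp)
  awk -F'\t' -v OFS='\t' -v src="$1" -v st="$2" \
    '$1 == src { $5 = st } { print }' "$MANIFEST" > "$tmp" \
    && mv "$tmp" "$MANIFEST"
}

# List sources that still need work, so an interrupted batch can resume.
manifest_pending() {
  awk -F'\t' '$5 != "done" { print $1 }' "$MANIFEST"
}
```

After an interruption, feeding the output of manifest_pending back into the encoder resumes the batch without redoing finished files, and the status column doubles as the record of what was done to which file.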

Metadata: the difference between a podcast and a file

A podcast episode without ID3 tags is a file. A podcast episode with the right tags is a discoverable, searchable, attributable piece of content. The minimum viable tag set:

# Both inputs come first: output options such as -metadata must not sit
# in front of a later -i, or ffmpeg tries to apply them to that input.
ffmpeg -i normalized.wav -i cover.jpg \
  -map 0:a -map 1:v \
  -metadata title="Episode 47: The Loudness Wars" \
  -metadata artist="The Audio Engineering Show" \
  -metadata album="Season 3" \
  -metadata album_artist="The Audio Engineering Show" \
  -metadata track="47/52" \
  -metadata date="2026" \
  -metadata genre="Podcast" \
  -metadata comment="https://example.com/show/47" \
  -metadata description="In this episode we discuss..." \
  -metadata:s:v title="Album cover" \
  -metadata:s:v comment="Cover (front)" \
  -c:v copy -id3v2_version 3 \
  -c:a libmp3lame -b:a 128k \
  episode-47.mp3

The -id3v2_version 3 flag is important because ffmpeg writes ID3v2.4 by default. ID3v2.4 has better Unicode handling but is poorly supported by older players, including some car stereos and podcast apps that have not been updated in years. ID3v2.3 is the safer default for distribution.

For chapters in podcast episodes, write an ffmetadata sidecar and pass it explicitly:

cat > chapters.txt <<'EOF'
;FFMETADATA1

[CHAPTER]
TIMEBASE=1/1000
START=0
END=125000
title=Introduction

[CHAPTER]
TIMEBASE=1/1000
START=125000
END=890000
title=The history of the loudness wars

[CHAPTER]
TIMEBASE=1/1000
START=890000
END=2400000
title=Modern streaming targets
EOF

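# Keep the episode's own tags (input 0) and take only chapters from the sidecar (input 1)
ffmpeg -i input.mp3 -i chapters.txt -map_metadata 0 -map_chapters 1 -c copy -id3v2_version 3 output.mp3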

Apple Podcasts, Overcast, Pocket Casts, and most modern listening apps read in-file chapters, stored as ID3 chapter frames in MP3 and as chapter atoms in MP4/M4A. RSS feed show notes are still the more universal option, but in-file chapters give listeners scrubbable navigation.
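
Hand-writing chapter blocks gets error-prone once a show has more than a few chapters, mostly because each END has to match the next START. The sketch below generates the sidecar from a cue list; the function name make_chapters and the cue format (start in milliseconds, a tab, then the title) are assumptions for illustration. It also backslash-escapes the characters that the ffmetadata format treats as special (=, ;, # and backslash).

```shell
# Sketch: build an ffmetadata chapter file from a cue list.
# Usage: make_chapters TOTAL_MS < cues.tsv > chapters.txt
# cues.tsv lines: start_ms<TAB>title, sorted by start time.
make_chapters() {
  local total_ms="$1" start title next i
  local -a starts=() titles=()
  while IFS=$'\t' read -r start title; do
    starts+=("$start")
    titles+=("$title")
  done
  echo ";FFMETADATA1"
  for (( i = 0; i < ${#starts[@]}; i++ )); do
    # Each chapter ends where the next begins; the last ends at TOTAL_MS.
    if (( i + 1 < ${#starts[@]} )); then
      next=${starts[i+1]}
    else
      next=$total_ms
    fi
    # Escape characters that are special in ffmetadata values.
    title=$(printf '%s' "${titles[i]}" | sed 's/[\\=;#]/\\&/g')
    printf '\n[CHAPTER]\nTIMEBASE=1/1000\nSTART=%s\nEND=%s\ntitle=%s\n' \
      "${starts[i]}" "$next" "$title"
  done
}
```

The total duration can come from the probe stage, converted to milliseconds, so the final chapter ends exactly at the end of the episode.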

Parallelism: when it helps and when it does not

Audio batches benefit less from parallelism than video batches because each job is already short: a typical 60-minute podcast episode encodes to MP3 in 8 to 15 seconds on a single core. LAME is single-threaded, so running several files at once is the only way to use more cores, but the return is sublinear. Running four encodes in parallel gives roughly a 3x speedup, not 4x, because the jobs compete for disk I/O and memory bandwidth.

The right pattern is moderate parallelism (4 to 8 jobs) with a manifest-driven runner.

# Pass the files as arguments instead of parsing ls output
parallel -j 6 --joblog audio.log \
  ./encode-one.sh {} 'distribute/{/.}.mp3' ::: raw/*.wav

For very large batches (tens of thousands of files), distribute across machines using a queue. The same pattern that drives content-distribution workflows works for audio: a Redis or RabbitMQ queue, a worker pool, and a manifest table.

Quality control without listening to every file

Nobody can listen to a 200-episode batch end to end. Automated checks catch most failure modes; spot-listening covers the rest.

# Verify integrated loudness landed within 0.5 LU of target
output="output.mp3"
measured=$(ffmpeg -hide_banner -i "$output" \
  -af "ebur128=peak=true" -f null - 2>&1 \
  | grep "I:" | tail -1 | awk '{print $2}')   # last "I:" line is the summary value

if (( $(echo "$measured < -16.5 || $measured > -15.5" | bc -l) )); then
  echo "FAIL: $output measured $measured LUFS (target -16)"
fi

Combine with checks for true peak, clipping, silence at start/end (which usually indicates a bad cut), and unexpected duration changes between input and output (which usually indicates a sample-rate confusion).

| Check | Tool | Threshold | Common cause of failure |
| --- | --- | --- | --- |
| Integrated loudness | ffmpeg ebur128 | Within 0.5 LU of target | Skipped loudnorm pass |
| True peak | ffmpeg ebur128 | At or below -1.0 dBTP | Aggressive limiter, no headroom |
| Duration delta | ffprobe | Within 100 ms of source | Sample rate mismatch |
| Leading silence | ffmpeg silencedetect | Below 2 seconds | Bad edit, intro cut wrong |
| Trailing silence | ffmpeg silencedetect | Below 5 seconds | Outro not trimmed |
| Channel layout | ffprobe | Matches expected | Mono source forced to stereo or vice versa |

A batch that passes all six checks is almost always shippable. Listeners may still find creative complaints, but the technical bar is met.
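
Several of these thresholds can be folded into one verdict per file once the measurements have been extracted. The sketch below covers three of the six checks (loudness, true peak, duration delta) and takes the already-measured numbers as arguments; the function names and the hard-coded -16 LUFS target are assumptions for illustration.

```shell
# Sketch: combine measured values into a single PASS/FAIL line.

# Succeeds when lo <= value <= hi (floating-point comparison via awk).
in_range() {
  awk -v v="$1" -v lo="$2" -v hi="$3" 'BEGIN { exit !(v >= lo && v <= hi) }'
}

# Args: measured_lufs true_peak_dbtp source_seconds output_seconds
qc_episode() {
  local fails=() delta
  in_range "$1" -16.5 -15.5 || fails+=("loudness=$1")
  in_range "$2" -99 -1.0    || fails+=("true_peak=$2")
  # Absolute difference between source and output duration, in seconds.
  delta=$(awk -v a="$3" -v b="$4" \
    'BEGIN { d = a - b; if (d < 0) d = -d; print d }')
  in_range "$delta" 0 0.1   || fails+=("duration_delta=${delta}s")
  if (( ${#fails[@]} )); then
    echo "FAIL ${fails[*]}"
    return 1
  fi
  echo "PASS"
}
```

Because the function returns nonzero on failure, it can gate a publish step directly, and the FAIL line names every check that missed rather than only the first one.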

A real-world podcast pipeline

The pipeline below handles a typical podcast network's daily output: 5 to 20 episodes, each from a different host with different recording setups, all targeting -16 LUFS and 128 kbps MP3 distribution.

#!/usr/bin/env bash
set -euo pipefail

INPUT_DIR="${1:-./incoming}"
OUTPUT_DIR="${2:-./distribute}"
PARALLEL="${3:-6}"

mkdir -p "$OUTPUT_DIR" ./manifest ./logs

process_one() {
  local src="$1"
  local base
  base=$(basename "$src" | sed 's/\.[^.]*$//')
  local out="$OUTPUT_DIR/$base.mp3"
  local manifest="./manifest/$base.json"

  # Probe
  ffprobe -v error -show_streams -show_format -of json "$src" > "$manifest"

  # Pass 1: measure
  local stats
  stats=$(ffmpeg -hide_banner -i "$src" \
    -af "loudnorm=I=-16:TP=-1:LRA=11:print_format=json" \
    -f null - 2>&1 | sed -n '/^{/,/^}/p')

  local i tp lra thresh offset
  i=$(echo "$stats" | jq -r '.input_i')
  tp=$(echo "$stats" | jq -r '.input_tp')
  lra=$(echo "$stats" | jq -r '.input_lra')
  thresh=$(echo "$stats" | jq -r '.input_thresh')
  offset=$(echo "$stats" | jq -r '.target_offset')

  # Pass 2: encode
  ffmpeg -hide_banner -y -i "$src" \
    -af "loudnorm=I=-16:TP=-1:LRA=11:measured_I=$i:measured_TP=$tp:measured_LRA=$lra:measured_thresh=$thresh:offset=$offset:linear=true,aresample=44100" \
    -ac 2 -c:a libmp3lame -b:a 128k \
    -map_metadata 0 -id3v2_version 3 \
    "$out" 2> "./logs/$base.log"

  # Verify
  local measured
  measured=$(ffmpeg -hide_banner -i "$out" -af ebur128 -f null - 2>&1 \
    | grep "I:" | tail -1 | awk '{print $2}')
  echo "{\"file\":\"$out\",\"measured_lufs\":$measured}" > "./manifest/$base-result.json"
}

export -f process_one
export OUTPUT_DIR

find "$INPUT_DIR" -type f \( -iname "*.wav" -o -iname "*.flac" -o -iname "*.m4a" \) \
  | parallel -j "$PARALLEL" --joblog ./logs/batch.log process_one

Run this and you get a folder of distribution-ready MP3s, a per-episode manifest with probe data and measured loudness, and a job log for retry tooling. Engineering time after the initial setup is close to zero per episode.
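
The job log is what makes retries cheap. GNU parallel writes it as tab-separated columns (Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval, Signal, Command), so failed jobs can be listed with a one-line filter. The function name failed_jobs below is hypothetical, and the sketch assumes commands contain no tab characters.

```shell
# Sketch: list the commands of jobs that exited nonzero or died on a signal.
# Usage: failed_jobs ./logs/batch.log
failed_jobs() {
  awk -F'\t' 'NR > 1 && ($7 != 0 || $8 != 0) { print $9 }' "$1"
}
```

GNU parallel can also re-run the failures itself when invoked with --retry-failed and the same --joblog file; the filter above is mainly useful for reporting or for feeding a different retry mechanism.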

Cross-domain consistency

Audio pipelines often serve several properties at once. A network producing study-aid audio for practice-test platforms, educational fillers for music-theory training at When Notes Fly, and ambient backgrounds for cafe content at Down Under Cafe can use a single pipeline with per-tenant config. The encoding work is identical; the targets, tags, and output destinations differ.

"If you optimize everything, you will always be unhappy." Donald Knuth, The Art of Computer Programming

Resist the temptation to tune every batch to its tenant's exact preferences if those preferences are within the perceptually transparent range. A unified pipeline that produces good-enough output for everyone is more maintainable than a separate custom pipeline per tenant.
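
Per-tenant config can be as small as one lookup function in front of a shared encoder. In the sketch below the tenant keys and every value are placeholders invented for illustration, not recommendations; only the shape (one place that maps a tenant to its targets) is the point.

```shell
# Sketch: per-tenant settings as key=value words. All values are placeholders.
tenant_config() {
  case "$1" in
    study-aids)   echo "lufs=-16 tp=-1 bitrate=96k channels=1" ;;
    music-theory) echo "lufs=-16 tp=-1 bitrate=128k channels=2" ;;
    ambient)      echo "lufs=-18 tp=-1 bitrate=128k channels=2" ;;
    *) echo "unknown tenant: $1" >&2; return 1 ;;
  esac
}

# Build the loudnorm filter string for a tenant from its config.
loudnorm_filter_for() {
  local cfg kv lufs="" tp=""
  cfg=$(tenant_config "$1") || return 1
  for kv in $cfg; do            # intentional word splitting on spaces
    case "$kv" in
      lufs=*) lufs=${kv#lufs=} ;;
      tp=*)   tp=${kv#tp=} ;;
    esac
  done
  echo "loudnorm=I=${lufs}:TP=${tp}:LRA=11"
}
```

The shared encode script then calls `loudnorm_filter_for "$tenant"` and stays identical across tenants, which keeps differences confined to one table.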

Sample-rate routing in mixed batches

A batch that handles podcast interviews recorded on different equipment will see 44.1, 48, and occasionally 96 kHz sources in the same folder. A naive resample to 44.1 for everything wastes CPU on already-44.1 sources and risks subtle quality loss.

# Probe and route by sample rate
mkdir -p out
for src in incoming/*; do
  rate=$(ffprobe -v error -select_streams a:0 \
    -show_entries stream=sample_rate -of csv=p=0 "$src")
  base=$(basename "$src")
  base="${base%.*}"   # drop the extension so a.wav becomes out/a.mp3
  if [[ "$rate" == "44100" ]]; then
    # Already at the delivery rate: no explicit resampler in the chain.
    # -ar 44100 is still needed because single-pass loudnorm upsamples
    # internally to 192 kHz.
    ffmpeg -i "$src" -af "loudnorm=I=-16:TP=-1:LRA=11" \
      -ar 44100 \
      -c:a libmp3lame -q:a 2 -id3v2_version 3 \
      "out/$base.mp3"
  else
    # Let soxr do the final conversion down to the delivery rate
    ffmpeg -i "$src" \
      -af "loudnorm=I=-16:TP=-1:LRA=11,aresample=44100:resampler=soxr:precision=28:dither_method=triangular_hp" \
      -c:a libmp3lame -q:a 2 -id3v2_version 3 \
      "out/$base.mp3"
  fi
done

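The probe-and-route pattern keeps the explicit resampler out of the chain for sources that are already at the delivery rate and avoids unnecessary quality risk. How much CPU it saves depends on the mix of inputs, so measure it on a representative batch rather than assuming a fixed figure.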

Long-form audiobook batches

Audiobook production is its own genre with strict spec requirements. ACX, Audible's submission portal, requires an RMS level between -23 dB and -18 dB, peaks no higher than -3 dB, a noise floor no higher than -60 dB RMS, and 192 kbps or higher CBR MP3 at 44.1 kHz, with every file in a title consistently mono or consistently stereo. A batch that handles audiobook chapters must verify each chapter against these specs before submission.

# Filter first, normalize last, so the level targets apply to the final signal
ffmpeg -i chapter01.wav \
  -af "highpass=f=80,lowpass=f=18000,loudnorm=I=-20:TP=-3:LRA=7" \
  -ar 44100 -ac 1 \
  -c:a libmp3lame -b:a 192k \
  -id3v2_version 3 \
  -metadata title="Chapter 01" \
  -metadata artist="Author Name" \
  -metadata album="Book Title" \
  -metadata track="1/24" \
  ch01.mp3

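The high-pass filter at 80 Hz removes mic stand rumble; the low-pass at 18 kHz removes any residual hiss above the speech-relevant band. ACX rejects submissions where the noise floor between dialogue is too high, and these filters help keep that floor below the -60 dB threshold. Note that loudnorm targets integrated loudness in LUFS while ACX measures plain RMS; the two usually land close for narration but are not identical, so measure RMS on the encoded chapter rather than assuming I=-20 passes.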
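
The RMS and peak requirements can be checked from ffmpeg's volumedetect filter, whose log reports mean_volume (an RMS figure) and max_volume (sample peak, not true peak). The sketch below parses that log from stdin; the function name acx_verdict is hypothetical, and the noise-floor requirement is not covered because it needs a measurement of the pauses alone.

```shell
# Sketch: check RMS and peak against the ACX ranges.
# Usage: ffmpeg -hide_banner -nostats -i ch01.mp3 -af volumedetect \
#          -f null - 2>&1 | acx_verdict
acx_verdict() {
  awk '
    {
      # Find the values that follow the mean_volume: and max_volume: labels.
      for (i = 1; i < NF; i++) {
        if ($i == "mean_volume:") rms = $(i + 1)
        if ($i == "max_volume:")  peak = $(i + 1)
      }
    }
    END {
      if (rms == "" || peak == "") { print "FAIL no volumedetect output"; exit 1 }
      ok = (rms + 0 >= -23 && rms + 0 <= -18 && peak + 0 <= -3)
      printf "%s rms=%s peak=%s\n", (ok ? "PASS" : "FAIL"), rms, peak
      exit !ok
    }'
}
```

Running this over every encoded chapter before upload turns a rejection that would arrive days later into a failure that shows up in the batch log.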

Common mistakes that survive years of practice

Three errors recur. First, encoding to MP3 as the master and then re-encoding for each platform compounds quality loss; always keep a lossless master. Second, single-pass loudnorm leaves episodes within 1 LU of each other but not within 0.1 LU; if your show targets a specific platform, do two-pass. Third, batches that strip metadata save bytes but lose searchability; always carry tags through unless you have a documented reason to strip.

A pipeline that respects these three rules ages gracefully through codec changes, platform changes, and team changes.

References

  1. EBU Recommendation R 128, "Loudness normalisation and permitted maximum level of audio signals." European Broadcasting Union, 2020.
  2. ITU-R BS.1770-4, "Algorithms to measure audio programme loudness and true-peak audio level." International Telecommunication Union, 2015.
  3. ISO/IEC 14496-3:2019, "Information technology - Coding of audio-visual objects - Part 3: Audio." International Organization for Standardization (AAC specification).
  4. RFC 6716, "Definition of the Opus Audio Codec." Internet Engineering Task Force, 2012. doi:10.17487/RFC6716
  5. Brandenburg, K., "MP3 and AAC explained." Proceedings of the AES 17th International Conference on High-Quality Audio Coding, 1999.
  6. Coalson, J., "FLAC - Free Lossless Audio Codec format specification." Available: https://xiph.org/flac/format.html
  7. ID3.org, "ID3 tag version 2.4.0 - Main Structure." Available: https://id3.org/id3v2.4.0-structure
  8. Smith, J. O., "Digital Audio Resampling Home Page." Center for Computer Research in Music and Acoustics, Stanford University. Available: https://ccrma.stanford.edu/~jos/resample/