Most developer time on file workflows is wasted on the same five problems: rebuilding things that did not change, processing files in serial that could run in parallel, scripts that only work on one machine, transforms that fail halfway and leave the workspace in an undefined state, and pipelines that nobody can rerun a year later because the tool versions drifted. The fix for each of these is well-understood and decades old. This article is a working developer's guide to applying the techniques that make file workflows fast, parallel, reproducible, and resumable, with concrete code in Make, shell, and Python.
The Principles That Matter
Five principles underlie every robust file workflow.
Idempotence. A transform that depends only on its inputs and produces the same output every time. Running it twice is harmless. Running it after a crash is safe. The opposite (in-place mutation, hidden state) is the cause of most pipeline rot.
Incremental builds. Only rebuild outputs whose inputs changed. The cost of rebuilding a 10000-file static site is enormous; the cost of rebuilding the 12 files that changed is trivial. Make has done this correctly since 1976.
Parallelism. File transforms are usually embarrassingly parallel. Saturate the available cores. A four-hour serial conversion becomes roughly a one-hour conversion on a four-core machine if you let it run that way.
Reproducibility. Pinned tool versions, deterministic outputs, no dependence on environment that is not declared. The same input plus the same pipeline equals the same output, today and a year from now.
Observability. Logs, timings, exit codes, output verification. A pipeline that succeeds silently is indistinguishable from a pipeline that failed silently.
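The incremental-build principle above comes down to one timestamp comparison. The following is a minimal Python sketch of that check, not Make's actual implementation; the function name `needs_rebuild` and the file names in the demo are made up for illustration.

```python
import os
import tempfile
from pathlib import Path


def needs_rebuild(output: Path, inputs: list[Path]) -> bool:
    """Return True when the output is missing or older than any input."""
    if not output.exists():
        return True
    out_mtime = output.stat().st_mtime_ns
    return any(src.stat().st_mtime_ns > out_mtime for src in inputs)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        src, out = Path(d) / "page.md", Path(d) / "page.html"
        src.write_text("# hello\n")
        print("before first build:", needs_rebuild(out, [src]))
        out.write_text("<h1>hello</h1>\n")
        # Force the source to look older than the output
        os.utime(src, ns=(1_000_000_000, 1_000_000_000))
        print("after build:", needs_rebuild(out, [src]))
```

A timestamp check is cheap but can be fooled by clock skew or by tools that rewrite a file without changing it; content hashing, covered later, trades speed for robustness.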
"First, solve the problem. Then, write the code. The mistake most engineers make with workflow code is treating it like throwaway scripts when it is in fact the most-run code they own." Donald Knuth, The Art of Computer Programming, applied to build pipelines.
Make Is Still the Right Tool
For file-transform pipelines, GNU Make remains one of the simplest, fastest, and most thoroughly debugged tools available. It tracks file timestamps, handles dependency graphs, parallelizes with -j, supports pattern rules, and is preinstalled or one package away on virtually every Unix-like system.
A representative Makefile for a content pipeline:
SRC_DIR := src
BUILD   := build
SRCS    := $(wildcard $(SRC_DIR)/*.md)
HTMLS   := $(patsubst $(SRC_DIR)/%.md,$(BUILD)/%.html,$(SRCS))
PDFS    := $(patsubst $(SRC_DIR)/%.md,$(BUILD)/%.pdf,$(SRCS))

.PHONY: all clean
all: $(HTMLS) $(PDFS)

# Recipe lines must start with a tab character, not spaces
$(BUILD)/%.html: $(SRC_DIR)/%.md template.html | $(BUILD)
	pandoc -s --template=template.html -o $@ $<

$(BUILD)/%.pdf: $(SRC_DIR)/%.md | $(BUILD)
	pandoc --pdf-engine=xelatex -o $@ $<

# Order-only prerequisite: create the directory once, never rebuild because of it
$(BUILD):
	mkdir -p $(BUILD)

clean:
	rm -rf $(BUILD)
Run with make -j8 and Make runs up to eight recipes at once. Run again and only the files whose source (or, for HTML, the template) changed get rebuilt. Make handles all of it without further configuration.
Shell Pipelines Done Right
For ad-hoc transforms, bash pipelines remain unmatched in expressive density. The discipline that distinguishes sustainable shell scripts from throwaway ones:
#!/bin/bash
# Always start every workflow script with these
set -euo pipefail
IFS=$'\n\t'
# -e           exit on error
# -u           undefined variables fail
# -o pipefail  catch failures inside pipelines
# IFS          controls word splitting

# Always quote variables
input_dir="${1:?usage: $0 <input_dir>}"
output_dir="${2:-./out}"
mkdir -p "$output_dir"

# Process files null-delimited to handle spaces and newlines in names
find "$input_dir" -name '*.png' -print0 | \
  while IFS= read -r -d '' f; do
    name=$(basename "$f" .png)
    avifenc --speed 4 -a cq-level=22 "$f" "$output_dir/$name.avif"
  done
The -print0 plus read -d '' pattern passes filenames containing spaces, tabs, newlines, and glob characters through intact. Unquoted loops over the output of ls or find break on every one of these. One caveat in the script above: basename flattens subdirectories, so two inputs with the same name in different directories write to the same output file.
Parallelism Patterns
Three tools cover the parallel-conversion case.
# GNU parallel: simplest for ad-hoc work
find ./photos -name '*.jpg' -print0 | \
parallel -0 -j 8 \
'avifenc --speed 4 {} ./avif/{/.}.avif'
# xargs: portable, sufficient for simple cases
find ./photos -name '*.jpg' -print0 | \
xargs -0 -P 8 -I{} sh -c \
'avifenc --speed 4 "$1" "./avif/$(basename "$1" .jpg).avif"' _ {}
# make -j: declarative and incremental
# (see Makefile above; -j8 parallelizes automatically)
The choice depends on whether the work is interactive (parallel for one-shot), reproducible (Make for repeated builds), or scriptable across systems (xargs for portability where parallel is not installed).
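When the pipeline is already written in Python, the same fan-out can be sketched with the standard library. This is an illustrative sketch, not a replacement for parallel or xargs; `run_parallel` is a hypothetical helper name, and the demo commands just start the Python interpreter so the example runs without any converter installed.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Optional


def run_parallel(commands: dict[Path, list[str]],
                 jobs: Optional[int] = None) -> dict[Path, int]:
    """Run one external command per file, at most `jobs` at a time.

    Returns the exit code for each file. Threads are enough here because
    the real work happens in the child processes, not in Python.
    """
    jobs = jobs or os.cpu_count() or 1
    results: dict[Path, int] = {}
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = {
            pool.submit(subprocess.run, cmd, capture_output=True): path
            for path, cmd in commands.items()
        }
        for fut in as_completed(futures):
            # A missing executable raises here instead of returning a code
            results[futures[fut]] = fut.result().returncode
    return results


if __name__ == "__main__":
    demo = {
        Path("ok.jpg"): [sys.executable, "-c", "pass"],
        Path("bad.jpg"): [sys.executable, "-c", "raise SystemExit(3)"],
    }
    for path, code in sorted(run_parallel(demo, jobs=2).items()):
        print(path, "exit", code)
```

Collecting exit codes rather than raising on the first failure keeps one bad file from stopping the rest of the batch.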
"The bottleneck of any pipeline is not where the engineer thinks it is. Profile before parallelizing. The cost of parallelizing the wrong stage is rewrites, not speedups." Kent Beck, Extreme Programming Explained, applied to build optimization.
Content-Addressable Caching
A pattern that scales well past Make: hash the inputs of a transform, use the hash as the cache key, store the output keyed by the hash. Identical inputs anywhere in the pipeline produce a cache hit. Bazel, Nix, and Docker layer caching all use this pattern.
A minimal implementation in shell:
#!/bin/bash
set -euo pipefail
CACHE_DIR="${CACHE_DIR:-./.cache}"
mkdir -p "$CACHE_DIR"

# Usage: cached_convert <input> <output> <command> [args...]
# The command is run as: <command> [args...] <input> <output>
cached_convert() {
  local input="$1" output="$2"
  shift 2
  # Hash the command (without file paths) and the input contents, so the
  # same content converted the same way is a cache hit wherever it lives
  local cmd_hash input_hash
  cmd_hash=$(printf '%s\0' "$@" | sha256sum | cut -d' ' -f1)
  input_hash=$(sha256sum < "$input" | cut -d' ' -f1)
  local cache_file="$CACHE_DIR/$cmd_hash-$input_hash"
  if [[ -f "$cache_file" ]]; then
    cp "$cache_file" "$output"
    return 0
  fi
  # Cache miss: run the command, then store the result atomically
  "$@" "$input" "$output"
  cp "$output" "$cache_file.tmp.$$"
  mv "$cache_file.tmp.$$" "$cache_file"
}

# Usage
cached_convert ./photo.jpg ./photo.avif avifenc --speed 4
In production, Bazel and Nix do this with sandboxing, garbage collection, and remote cache servers. The pattern is the same; the implementation is more rigorous.
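For comparison, the same content-addressed lookup can be sketched in Python. This is a minimal illustration under stated assumptions: `cache_key` and `cached_run` are hypothetical names, the cache is a plain local directory, and the tool version is not part of the key (a real cache would include it).

```python
import hashlib
import shutil
from pathlib import Path
from typing import Callable, Sequence


def cache_key(input_path: Path, command: Sequence[str]) -> str:
    """Key = hash of the command line + hash of the input bytes."""
    cmd_hash = hashlib.sha256("\0".join(command).encode()).hexdigest()
    data_hash = hashlib.sha256(input_path.read_bytes()).hexdigest()
    return f"{cmd_hash}-{data_hash}"


def cached_run(input_path: Path, output_path: Path, command: Sequence[str],
               run: Callable[[Path, Path], None],
               cache_dir: Path = Path(".cache")) -> bool:
    """Return True on a cache hit; otherwise call run() and store its output."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = cache_dir / cache_key(input_path, command)
    if entry.exists():
        shutil.copyfile(entry, output_path)
        return True
    run(input_path, output_path)
    # Publish the cache entry atomically so readers never see a partial file
    tmp = entry.with_name(entry.name + ".tmp")
    shutil.copyfile(output_path, tmp)
    tmp.replace(entry)
    return False
```

Because the key depends on content rather than on the path, two copies of the same file in different directories share one cache entry.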
A Comparative Table of Workflow Tools
| Tool | Strengths | Weaknesses | Use when |
|---|---|---|---|
| GNU Make | Universal, fast, parallel, incremental | Cryptic syntax, no cross-platform paths | File transforms, build pipelines |
| Bazel | Hermetic builds, remote cache, polyglot | Heavyweight, learning curve | Multi-language monorepos |
| just | Like Make but cleaner syntax | No incremental builds | Task running, not file builds |
| Taskfile | YAML-defined tasks | Less expressive | Cross-platform task running |
| Nix | Reproducible to the bit, sandboxed | Steep learning curve | Reproducible system builds |
| Snakemake | Python integration, rule-based | Bioinformatics-flavored | Scientific data pipelines |
| Airflow | DAG scheduling, observability | Heavy, infrastructure required | ETL with scheduling |
Reproducibility: The Pinning Discipline
Workflows that work today and break next month almost always fail because some dependency moved. The mitigations:
Pin tool versions explicitly. A tools.txt or flake.nix file declares the exact ffmpeg, ImageMagick, and pandoc versions the pipeline expects. CI installs from that file.
Containerize the toolchain. A Dockerfile that builds the converter image is the strongest version pin. The built image's digest is the pipeline's reproducibility token, so reference the image by digest rather than by a mutable tag.
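FROM debian:12-slim AS converter
# Version strings are illustrative: confirm what the archive ships with
# `apt-cache policy <package>`; a pin on a superseded version fails to install.
# libimage-exiftool-perl is the package that provides the exiftool binary.
RUN apt-get update && apt-get install -y --no-install-recommends \
    imagemagick=8:6.9.11.60+dfsg-1.6+deb12u1 \
    ffmpeg=7:5.1.6-0+deb12u1 \
    pandoc=2.17.1.1-2 \
    libreoffice=4:7.4.7-1 \
    qpdf=11.3.0-1+deb12u1 \
    libimage-exiftool-perl=12.40+dfsg-1 \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /work
ENTRYPOINT ["/bin/bash"]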
Lock locale and timezone. Set LC_ALL=C and TZ=UTC in the workflow environment. Different locales sort differently and format dates and numbers differently, so the same pipeline can produce different output on different machines.
Sort directory listings. Use find ... | sort, not bare find ...; the order in which a filesystem returns entries is not guaranteed.
# Reproducible directory iteration
find ./input -name '*.md' -print0 | sort -z | \
  while IFS= read -r -d '' f; do
    process "$f"
  done
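The locale, timezone, and ordering rules above carry over to pipelines driven from Python. The sketch below shows one way to apply them; `pinned_env` and `sorted_inputs` are hypothetical helper names, not part of any library.

```python
import os
from pathlib import Path


def pinned_env() -> dict[str, str]:
    """Copy the current environment with locale and timezone pinned."""
    env = dict(os.environ)
    env.update({"LC_ALL": "C", "TZ": "UTC"})
    return env


def sorted_inputs(root: Path, pattern: str) -> list[Path]:
    """List matching files in a stable order.

    Sorting on the forward-slash relative path compares by code point,
    so the result does not depend on the filesystem or the locale.
    """
    return sorted(
        (p for p in root.rglob(pattern) if p.is_file()),
        key=lambda p: p.relative_to(root).as_posix(),
    )
```

Pass the pinned environment to child processes with `subprocess.run(cmd, env=pinned_env())` so every external tool sees the same settings.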
"Reproducibility is the property that distinguishes science from anecdote, and engineering from cargo cult. A build pipeline that is not reproducible is a wish, not an artifact." Donald Knuth, paraphrased from his correspondence on TeX font determinism.
Error Handling and Resumability
A 10-hour batch conversion that fails on file 9847 of 10000 should not have to start over. The patterns that make this work:
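Per-file outputs. One output file per input file, written atomically (write to a temporary file in the same directory, rename on success). Reruns skip files whose output already exists. An existence check alone does not notice a changed input, so compare timestamps or hashes when inputs can change between runs.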
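process_one() {
  local input="$1" output="$2"
  if [[ -f "$output" ]]; then
    return 0  # already done, skip
  fi
  # Keep the real extension on the temp file: tools such as ImageMagick
  # pick the output format from it
  local tmp
  tmp="$(dirname "$output")/.tmp.$$.$(basename "$output")"
  if convert "$input" "$tmp"; then
    mv "$tmp" "$output"
  else
    rm -f "$tmp"
    return 1
  fi
}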
Checkpoint files. A done.list file appended after each success. Reruns read it and skip completed inputs. Make's timestamp tracking does this automatically; for non-Make pipelines you build it explicitly.
Independent failure isolation. A failure in one file should not crash the whole pipeline. With GNU parallel, --halt now,fail=10 keeps going until ten jobs have failed and only then aborts; in a hand-written loop, catch the error inside the per-file function.
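The checkpoint-file and failure-isolation patterns combine naturally. Here is a minimal Python sketch, assuming input names contain no newlines and that `process` raises on failure; `run_with_checkpoint` is a hypothetical name and `done.list` is simply the file name used above.

```python
from pathlib import Path
from typing import Callable, Iterable


def run_with_checkpoint(inputs: Iterable[str],
                        process: Callable[[str], None],
                        done_list: Path) -> list[str]:
    """Process each input at most once across runs; return the failures."""
    done = set(done_list.read_text().splitlines()) if done_list.exists() else set()
    failed: list[str] = []
    with done_list.open("a") as log:
        for name in inputs:
            if name in done:
                continue  # finished in an earlier run
            try:
                process(name)
            except Exception:
                failed.append(name)  # isolate the failure and keep going
                continue
            log.write(name + "\n")
            log.flush()  # record the success before moving on
    return failed
```

A rerun after a crash retries only the inputs that are not yet listed, including any that failed the first time.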
Logging and Observability
A pipeline that runs for an hour and prints "done" tells you nothing about what happened. Useful logging structure:
log() {
  printf '%s [%s] %s\n' "$(date -Iseconds)" "$1" "$2" >&2
}

process_with_logging() {
  local input="$1" output="$2"
  local start end ms
  start=$(date +%s%N)  # nanoseconds; %N requires GNU date
  log INFO "start $input"
  if process_one "$input" "$output"; then
    end=$(date +%s%N)
    ms=$(( (end - start) / 1000000 ))
    log INFO "done $input in ${ms}ms"
  else
    log ERROR "fail $input"
    return 1
  fi
}
For long-running pipelines, structured logging (JSON lines) plus a log shipper (Vector, Fluent Bit) and a viewer (Grafana, Datadog) is worth the setup. The cost of building observability is paid back the first time you need to debug a multi-hour failure at 3 AM.
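A JSON-lines logger needs very little code. The sketch below is one possible shape, not a prescribed schema: the field names (`ts`, `level`, `event`, `input`, `ms`) and the helper names `log_event` and `timed` are assumptions made for illustration.

```python
import json
import sys
import time
from datetime import datetime, timezone
from typing import IO, Any, Callable, Optional


def log_event(level: str, event: str, stream: Optional[IO[str]] = None,
              **fields: Any) -> None:
    """Write one JSON object per line; log shippers can parse this directly."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "level": level,
        "event": event,
        **fields,
    }
    (stream or sys.stderr).write(json.dumps(record, sort_keys=True) + "\n")


def timed(name: str, func: Callable[..., Any], *args: Any,
          stream: Optional[IO[str]] = None) -> Any:
    """Run func(*args), logging duration on success and the error on failure."""
    start = time.monotonic()
    try:
        result = func(*args)
    except Exception as exc:
        log_event("ERROR", "fail", stream, input=name, error=str(exc))
        raise
    ms = round((time.monotonic() - start) * 1000)
    log_event("INFO", "done", stream, input=name, ms=ms)
    return result
```

Using a monotonic clock for durations avoids negative or inflated timings when the wall clock is adjusted mid-run.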
A Reference Pipeline
A complete file-conversion pipeline that demonstrates all the principles:
# Image conversion pipeline: incremental, parallel, atomic writes
SHELL := /bin/bash
.SHELLFLAGS := -euo pipefail -c
export LC_ALL := C
export TZ := UTC

SRC_DIR := src
OUT_DIR := dist
SRCS  := $(shell find $(SRC_DIR) -name '*.png' | sort)
AVIFS := $(patsubst $(SRC_DIR)/%.png,$(OUT_DIR)/%.avif,$(SRCS))
JPGS  := $(patsubst $(SRC_DIR)/%.png,$(OUT_DIR)/%.jpg,$(SRCS))
WEBPS := $(patsubst $(SRC_DIR)/%.png,$(OUT_DIR)/%.webp,$(SRCS))

.PHONY: all clean stats
all: $(AVIFS) $(JPGS) $(WEBPS)

$(OUT_DIR)/%.avif: $(SRC_DIR)/%.png | $(OUT_DIR)
	@mkdir -p $(dir $@)
	@avifenc --speed 4 -a cq-level=22 "$<" "$@.tmp" \
	  && mv "$@.tmp" "$@"

# The jpg: prefix forces the format; ImageMagick cannot infer it from ".tmp"
$(OUT_DIR)/%.jpg: $(SRC_DIR)/%.png | $(OUT_DIR)
	@mkdir -p $(dir $@)
	@magick "$<" -strip -sampling-factor 4:2:0 -quality 82 "jpg:$@.tmp" \
	  && mv "$@.tmp" "$@"

$(OUT_DIR)/%.webp: $(SRC_DIR)/%.png | $(OUT_DIR)
	@mkdir -p $(dir $@)
	@cwebp -quiet -q 82 "$<" -o "$@.tmp" \
	  && mv "$@.tmp" "$@"

$(OUT_DIR):
	mkdir -p $@

# du -sb is GNU du; the output figure covers all formats in $(OUT_DIR)
stats:
	@echo "Source PNG count: $$(echo $(SRCS) | wc -w)"
	@echo "Source bytes: $$(du -sb $(SRC_DIR) | cut -f1)"
	@echo "Output bytes (all formats): $$(du -sb $(OUT_DIR) 2>/dev/null | cut -f1)"

clean:
	rm -rf $(OUT_DIR)
Invoke with make -j$(nproc) for full parallelism. Inputs that have not changed do not get reconverted. Failed conversions leave no half-written output. Running it again resumes where it stopped.
Practical Recommendations
Use Make for file pipelines. Use containers for reproducibility. Pin tool versions. Set LC_ALL=C and TZ=UTC. Write atomically. Hash inputs for caching. Parallelize with -j. Use structured logs. Sort directory listings. Test with the empty input case and the one-file case before scaling up.
The difference between a workflow that lasts five years and one that breaks in five weeks is rarely about what the workflow does. It is about whether the engineer applied these principles or skipped them. Most of the techniques cost a few extra minutes of engineering up front and save days of debugging later.
Handling Failure Modes Gracefully
A file workflow encounters three failure types that demand different responses.
Transient failures. Network blips, busy filesystems, temporary tool unavailability. Retry with exponential backoff.
Permanent failures. Malformed input, missing tool, permission denied. Log loudly and move on; do not block the rest of the pipeline.
Catastrophic failures. Out of disk, out of memory, kernel panic. Alert and halt; the pipeline cannot continue safely.
# Retry wrapper with exponential backoff
retry() {
  local max=5 delay=1 i
  for (( i = 1; i <= max; i++ )); do
    if "$@"; then return 0; fi
    # Do not sleep after the final attempt
    if (( i < max )); then
      sleep "$delay"
      delay=$((delay * 2))
    fi
  done
  return 1
}

# Use it for flaky operations
retry curl -fsSL https://api.example.com/data -o data.json
retry rsync -av src/ remote:dst/
Categorize failures by their error class, not by the specific error message. A nonzero exit code from convert could mean "file is corrupt" (permanent) or "ImageMagick policy denied a delegate" (configuration). Treat them differently.
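Classifying by error class is easier to see in code. The following Python sketch maps exception types to the three categories above and retries only the transient ones. The mapping is an assumption chosen for illustration, and `FailureClass`, `classify`, and `run_with_policy` are hypothetical names; a real pipeline would tune the mapping to its own tools.

```python
import errno
import subprocess
import time
from enum import Enum
from typing import Any, Callable


class FailureClass(Enum):
    TRANSIENT = "retry with backoff"
    PERMANENT = "log and skip"
    CATASTROPHIC = "alert and halt"


def classify(exc: BaseException) -> FailureClass:
    """Map an exception to a failure class (illustrative mapping)."""
    if isinstance(exc, MemoryError):
        return FailureClass.CATASTROPHIC
    if isinstance(exc, OSError) and exc.errno == errno.ENOSPC:
        return FailureClass.CATASTROPHIC  # out of disk
    if isinstance(exc, (TimeoutError, ConnectionError, subprocess.TimeoutExpired)):
        return FailureClass.TRANSIENT
    # Malformed input, missing tool, permission denied, nonzero exit code
    return FailureClass.PERMANENT


def run_with_policy(task: Callable[[], Any], retries: int = 3,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> Any:
    """Retry transient failures with exponential backoff; re-raise the rest."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception as exc:
            if classify(exc) is not FailureClass.TRANSIENT or attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists so the backoff schedule can be observed or skipped in tests without waiting.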
Cross-Platform Path Handling
Workflows that must run on Linux, macOS, and Windows hit path separators, line endings, and case sensitivity. The robust patterns:
# Use forward slashes; Git Bash on Windows handles them
path="$HOME/work/data/input.csv"
# Normalize line endings explicitly
dos2unix data.csv
# or, with GNU sed (BSD/macOS sed requires -i '')
sed -i 's/\r$//' data.csv
# Keep globbing case-sensitive, even on case-insensitive filesystems
shopt -u nocaseglob    # the default, recommended
# shopt -s nocaseglob  # only if you really want case-insensitive matching
For workflows that must be truly cross-platform, prefer Python or Go over bash. The path handling, environment management, and error handling are easier to get right.
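from pathlib import Path
import subprocess


def convert(src: Path, dst: Path) -> None:
    dst.parent.mkdir(parents=True, exist_ok=True)
    # Write next to the destination so the final rename is atomic
    tmp = dst.with_suffix(dst.suffix + '.tmp')
    try:
        subprocess.run(
            ['avifenc', '--speed', '4', str(src), str(tmp)],
            check=True, capture_output=True, timeout=120
        )
        tmp.replace(dst)
    except subprocess.CalledProcessError as e:
        tmp.unlink(missing_ok=True)
        detail = e.stderr.decode(errors='replace')
        raise RuntimeError(f"avifenc failed for {src}: {detail}") from e
    except subprocess.TimeoutExpired as e:
        # The timeout raises a different exception; clean up here too
        tmp.unlink(missing_ok=True)
        raise RuntimeError(f"avifenc timed out for {src}") from e


if __name__ == '__main__':
    # sorted() keeps the processing order stable across filesystems
    for src in sorted(Path('src').rglob('*.png')):
        dst = Path('out') / src.relative_to('src').with_suffix('.avif')
        convert(src, dst)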
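The line-ending and path-separator fixes shown earlier in shell have direct Python equivalents. This is a small sketch with hypothetical helper names (`normalize_newlines`, `portable_key`); it assumes relative paths and that no POSIX filename contains a literal backslash.

```python
from pathlib import Path, PureWindowsPath


def normalize_newlines(path: Path) -> bool:
    """Rewrite CRLF as LF, atomically; return True if the file changed."""
    data = path.read_bytes()
    fixed = data.replace(b"\r\n", b"\n")
    if fixed == data:
        return False
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_bytes(fixed)
    tmp.replace(path)
    return True


def portable_key(path_str: str) -> str:
    """Normalize a relative path to forward slashes on any platform.

    PureWindowsPath accepts both separators, so it parses either style.
    """
    return PureWindowsPath(path_str).as_posix()
```

Working on bytes rather than text keeps the newline fix independent of the file's encoding.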
CI Integration
Workflows that run in CI need extra care because the CI environment differs from local development in subtle ways.
| CI gotcha | Symptom | Fix |
|---|---|---|
| Missing tool | Command not found | Pin in container or install step |
| Different locale | Sort order changes | LC_ALL=C in env |
| Different timezone | Timestamp comparisons fail | TZ=UTC in env |
| Slow disk | I/O-bound jobs run slowly | Use ramdisk for intermediates |
| Limited cores | Parallelism saturates | Detect with nproc, scale with -j |
| Network restrictions | External tools fail | Vendor or cache dependencies |
| Cache invalidation | Stale outputs ship | Hash-based cache keys |
# GitHub Actions example with proper environment
jobs:
  build:
    runs-on: ubuntu-latest
    env:
      LC_ALL: C
      TZ: UTC
    steps:
      - uses: actions/checkout@v4
      - name: Cache build outputs
        uses: actions/cache@v4
        with:
          # dist/ is the output directory of the reference Makefile
          path: dist/
          # Hash everything that affects the output, not just the sources
          key: build-${{ hashFiles('src/**', 'Makefile') }}
      - name: Build
        run: make -j$(nproc) all
      - name: Validate
        run: make stats
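The hash-based cache key invalidates the cache when sources change and reuses outputs across CI runs that share inputs. Many CI failures come from caches that did not invalidate when they should have, so include every file that affects the output (sources, Makefile, templates) in the hash. One caveat: a fresh checkout stamps source files with the current time, so Make may still treat restored outputs as stale and rebuild them unless timestamps are handled explicitly.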
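Outside GitHub Actions, the same idea of a key derived from the source tree can be sketched in a few lines of Python. This is not the algorithm behind `hashFiles`; it is an illustrative stand-in, and `tree_hash` is a hypothetical name.

```python
import hashlib
from pathlib import Path


def tree_hash(root: Path, pattern: str = "*") -> str:
    """Hash relative paths and contents of all matching files under root."""
    h = hashlib.sha256()
    files = sorted(
        (p for p in root.rglob(pattern) if p.is_file()),
        key=lambda p: p.relative_to(root).as_posix(),
    )
    for p in files:
        rel = p.relative_to(root).as_posix().encode()
        data = p.read_bytes()
        # Length prefixes keep (path, content) boundaries unambiguous
        h.update(len(rel).to_bytes(8, "big") + rel)
        h.update(len(data).to_bytes(8, "big") + data)
    return h.hexdigest()
```

Sorting before hashing makes the key independent of directory listing order, and hashing the relative path means a rename also invalidates the cache.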
References
- Feldman, Stuart I. "Make: A Program for Maintaining Computer Programs." Software: Practice and Experience, vol. 9, no. 4, April 1979, pp. 255-265. DOI: 10.1002/spe.4380090402.
- GNU Project. GNU Make Manual. https://www.gnu.org/software/make/manual/
- Beck, Kent. Extreme Programming Explained: Embrace Change. 2nd ed., Addison-Wesley, 2004. ISBN 978-0321278654.
- Knuth, Donald E. The Art of Computer Programming, Volume 1: Fundamental Algorithms. 3rd ed., Addison-Wesley, 1997.
- Dolstra, Eelco. The Purely Functional Software Deployment Model. PhD thesis, Utrecht University, 2006.
- Tange, Ole. "GNU Parallel: The Command-Line Power Tool." USENIX ;login:, vol. 36, no. 1, February 2011.
- Mecklenburg, Robert. Managing Projects with GNU Make. 3rd ed., O'Reilly, 2004. ISBN 978-0596006105.
- Bazel Authors. Bazel Build System Documentation. https://bazel.build/docs