TTS Docker stack crash-looping after disk full — Redis AOF corruption
Environment
- The Things Stack (lorawan-stack) via Docker Compose
- Redis: redis:latest (with ReJSON/RediSearch modules)
- Host: Ubuntu 24 on VirtualBox
- Deployment: self-hosted
What happened
My host disk hit 98% capacity, and shortly after, two containers started crash-looping:
- thingsstack-stack-1: restarting every ~30 seconds
- thingsstack-redis-1: restarting every ~10 seconds
The lorawan-stack logs showed what looked like a DNS issue:
error:pkg/errors:net_dns (lookup redis on 127.0.0.11:53: no such host)
But digging into the Redis container logs revealed the real culprit:
# Bad file format reading the append only file appendonly.aof.103.incr.aof:
make a backup of your AOF file, then use ./redis-check-aof --fix <filename.manifest>
So the disk filling up left a corrupt write mid-AOF, Redis refused to start, and that took down the whole stack — the DNS error was just a symptom of Redis being down.
I was able to recover by running redis-check-aof --fix against the manifest using a separate docker run (since docker exec doesn’t work when the container is crash-looping), which truncated the corrupt tail and got everything back up.
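For anyone landing here with the same symptoms, a recovery along those lines can be sketched as below. This is a sketch, not the exact commands I ran. The host data directory is a placeholder, the `run` and `fix_aof` helper names are made up for illustration, the compose service name `redis` is inferred from the container name, and the manifest path assumes the image's default `/data` directory and the default `appendonly.aof` file name.

```shell
#!/bin/sh
# Sketch only: paths, image tag, and helper names are assumptions.

# Print the command instead of running it when DRY_RUN is set.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# fix_aof HOST_DATA_DIR [IMAGE]
# HOST_DATA_DIR is the host directory mounted at /data in the Redis container.
fix_aof() {
  data_dir=$1
  image=${2:-redis:latest}

  # Stop the crash-looping service so nothing touches the AOF during repair.
  run docker compose stop redis

  # Back up the whole AOF directory first, as the Redis log message advises.
  # This needs free disk space, so clear some room before running it.
  run cp -a "$data_dir/appendonlydir" "$data_dir/appendonlydir.bak"

  # One-off container with the same data mounted; -it because
  # redis-check-aof asks for confirmation before truncating.
  run docker run --rm -it -v "$data_dir:/data" "$image" \
    redis-check-aof --fix /data/appendonlydir/appendonly.aof.manifest

  # Bring the stack back up.
  run docker compose up -d
}
```

Calling `DRY_RUN=1 fix_aof /path/to/redis-data` prints the four commands without executing them, which is a cheap way to check the paths before touching the data.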
What I’m trying to understand
- Why exactly does a full disk cause AOF corruption? Is it that Redis is mid-write when the OS rejects further writes, leaving a partial record? Or is something else going on at the filesystem level?
- Why does Redis refuse to start rather than just ignoring the corrupt tail? I’d expect a warning, not a hard failure. Is this configurable?
- What’s the right way to prevent this? I’m thinking disk usage alerts, a maxmemory policy, and maybe enabling BGSAVE snapshots as a fallback, but I’d love to hear what others are actually doing in production self-hosted TTS setups.
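To make the disk usage alert idea concrete, this is roughly the kind of check I have in mind. It is a minimal sketch: the function names and the 85% threshold in the usage note are placeholders I picked, and it only warns on stderr rather than hooking into any real alerting system.

```shell
#!/bin/sh
# Sketch only: function names and thresholds are placeholders.

# disk_usage_pct PATH: print the used percentage of the filesystem holding PATH.
disk_usage_pct() {
  # -P forces POSIX single-line output; column 5 is the capacity, e.g. "98%".
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# check_disk PATH THRESHOLD
# Returns 0 if usage is below THRESHOLD, 1 if at or above it, 2 if df failed.
check_disk() {
  pct=$(disk_usage_pct "$1")
  [ -n "$pct" ] || return 2
  if [ "$pct" -ge "$2" ]; then
    echo "WARNING: $1 is at ${pct}% (threshold $2%)" >&2
    return 1
  fi
  return 0
}
```

Something like `check_disk /var/lib/docker 85` run from cron every few minutes would have flagged the problem well before the disk reached 98%, assuming the Redis data lives on that filesystem.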
Any advice appreciated — especially around Redis config tuning for this kind of deployment.
Running Redis 7+ multi-part AOF format (files under appendonlydir/)