Today, lemmy.amxl.com suffered an outage because the rootful Lemmy podman container crashed out, and wouldn’t restart.
Fixing it turned out to be more complicated than I expected, so I’m documenting the steps here in case anyone else has a similar issue with a podman container.
I tried restarting it, but got an unexpected error: the internal IP address (which I assign by hand to containers) was already in use, even though the container wasn’t running.
I create my Lemmy services with podman-compose, so I deleted them with podman-compose down and then re-created them with podman-compose up - that usually fixes things when they are really broken. But this time, I got a message like:
level=error msg="IPAM error: requested ip address 172.19.10.11 is already allocated to container ID 36e1a622f261862d592b7ceb05db776051003a4422d6502ea483f275b5c390f2"
The only problem was that the referenced container didn’t exist at all in the output of podman ps -a - in other words, podman thought the IP address was in use by a container that it didn’t know anything about! The IP address had effectively been ‘leaked’.
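If you want to script that check rather than eyeball it, here is a rough sketch. The function names extract_container_id and id_is_known are my own, not part of podman, and I'm assuming the error line carries the full 64-character hex ID, as it did in my case.

```shell
# Pull the first 64-character hex container ID out of text read from stdin
# (for example, a podman IPAM error line).
extract_container_id() {
  grep -oE '[0-9a-f]{64}' | head -n 1
}

# Succeed if the full ID given as $1 matches a whole line read from stdin.
id_is_known() {
  grep -qxF -- "$1"
}

# Example (hypothetical usage; rootful containers need root):
#   id=$(printf '%s\n' "$error_line" | extract_container_id)
#   if sudo podman ps -a --no-trunc --format '{{.ID}}' | id_is_known "$id"; then
#     echo "container still exists"
#   else
#     echo "podman does not know this container: the IP looks leaked"
#   fi
```

The --no-trunc flag matters here, because the error message contains the full ID while podman ps shortens IDs by default.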
After digging into the internals, and a few false starts trying to track down where the leaked info was kept, I found it was kept in a BoltDB file at /run/containers/networks/ipam.db - that’s apparently the ‘IP allocation’ database. Now, the good thing about /run is that it is wiped on system restart - although I didn’t really want to restart all my containers just to fix Lemmy.
BoltDB doesn’t come with a lot of tools, but you can install a TUI editor called boltbrowser with go install github.com/br0xen/boltbrowser@latest (this needs a Go toolchain).
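On my machine the binary ended up in ~/go/bin, but that depends on your Go settings. As a hedged sketch, the hypothetical helper below guesses the install location from the GOBIN and GOPATH environment variables, falling back to Go's default of $HOME/go. It assumes those settings live in the environment; values stored with go env -w won't be seen by it.

```shell
# Print where `go install` is expected to place the boltbrowser binary,
# based only on environment variables (GOBIN first, then GOPATH, then
# the default GOPATH of $HOME/go).
boltbrowser_path() {
  if [ -n "${GOBIN:-}" ]; then
    printf '%s\n' "$GOBIN/boltbrowser"
  else
    gopath="${GOPATH:-$HOME/go}"
    # GOPATH may be a colon-separated list; binaries go under the first entry.
    printf '%s\n' "${gopath%%:*}/bin/boltbrowser"
  fi
}

# Example:
#   ls -l "$(boltbrowser_path)"
```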
I made a backup of /run/containers/networks/ipam.db just in case I screwed it up.
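A plain copy is enough for the backup. The sketch below wraps it in a small helper; backup_file is a name I made up for illustration, and the destination path in the example is an assumption (pick somewhere outside /run if you want the copy to survive a reboot).

```shell
# Copy a file to the given destination, preserving mode and timestamps,
# and print the destination path on success.
backup_file() {
  src=$1
  dest=$2
  cp -p -- "$src" "$dest" && printf '%s\n' "$dest"
}

# Example (run from a root shell, since the rootful DB is owned by root):
#   backup_file /run/containers/networks/ipam.db /root/ipam.db.bak
```

If the edit goes wrong, copying the backup over the original path restores the previous state.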
Then I ran sudo ~/go/bin/boltbrowser /run/containers/networks/ipam.db to open the DB (this will lock the DB and stop any containers starting or otherwise changing IP statuses until you exit).
I found the networks that were impacted and expanded the bucket for each one (BoltDB has a hierarchy of buckets, and eventually you get to key/value pairs), and then the bucket for the CIDR range the leaked IP was in. In that list, I found a record whose value was the ID of the container that didn’t actually exist, and used D to tell boltbrowser to delete that key/value pair. I also cleaned up under ids, where this time the key was the ID of the container that no longer existed, and repeated this for both networks my container was in.
I then exited out of boltbrowser with q.
After that, I brought my Lemmy containers back up with podman-compose up -d - and everything then worked cleanly.
It’s a custom nginx proxy to the kube api. Too long to get into it. I was hired to move this giant cluster that started as a lab and make it production ready.
Thanks for the feedback
Ah, ok, yeah seems very custom. I guess it must predate Ingress.
No problem, good luck!