Reversibility Is the MVP: How to Build Infra You Can Kill in 60 Seconds

Every deploy you can undo in one command is a deploy you'll actually try.

At 10:47 this morning I disabled an entire agent platform with three systemctl commands and one database flag. Everything went quiet: no agents listening, no web UI serving, no tile in the portal rail. Reversing it is another three commands.

This is the most underrated infrastructure property I know of. Reversibility. The ability to kill anything in 60 seconds, and bring it back in 60 seconds, without losing state.

Every deploy you can undo in one command is a deploy you'll actually try. The moment rollback costs more than five minutes, you stop experimenting. This post is the pattern set I use to keep that cost low.

Pattern 1: soft-hide flags over row deletion

When I wanted to stop showing the Hermes tile in my portal rail, I did not DELETE FROM services WHERE id = 'svc-hermes'. I did:

UPDATE _plugin_storage
SET data = jsonb_set(data::jsonb, '{hidden}', 'true'::jsonb)::text
WHERE plugin_id = 'service-portal' AND id = 'svc-hermes';

The row is still there. All the configured metadata - label, icon, color, path, ACL bindings, notes - is preserved. To bring the tile back: same UPDATE, flip to 'false'.
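Hide and unhide can share one code path, which keeps the two directions from drifting apart. What follows is a minimal sketch, assuming PostgreSQL (for jsonb_set) and the table, column, and plugin_id from the UPDATE above. The function name hidden_sql and its argument order are hypothetical. It only prints the statement, so it can be reviewed before anything touches the database.

```shell
#!/bin/sh
# hidden_sql ID true|false
# Prints (does not run) the UPDATE that sets the soft-hide flag on one row.
# Table, column, and plugin_id are copied from the statement above;
# the function name and argument order are illustrative assumptions.
hidden_sql() {
    id=$1
    value=$2

    # Only the two flag values are accepted.
    case "$value" in
        true|false) ;;
        *) echo "usage: hidden_sql ID true|false" >&2; return 2 ;;
    esac

    # The id is interpolated into SQL text, so restrict it to safe characters.
    case "$id" in
        ''|*[!A-Za-z0-9_-]*) echo "hidden_sql: unsafe id: $id" >&2; return 2 ;;
    esac

    cat <<EOF
UPDATE _plugin_storage
SET data = jsonb_set(data::jsonb, '{hidden}', '$value'::jsonb)::text
WHERE plugin_id = 'service-portal' AND id = '$id';
EOF
}
```

Usage would look like hidden_sql svc-hermes true | psql "$DATABASE_URL", where DATABASE_URL is an assumed connection string, and the same call with false brings the tile back.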

Compare with DELETE + re-INSERT. Deletion loses:

- The exact timestamps (createdAt, updatedAt) that let me correlate with other events.

- Any drift from defaults (custom color, custom icon, custom order).

- Referential relationships - ACL bindings, notes, injectSidebar flag.

To reverse a deletion you reconstruct from memory. To reverse a hidden flag you flip a bit.

This pattern applies everywhere: cron job 'enabled=false' over removed-from-jobs-json, feature flag 'rolled out=0%' over feature-flag-doesn't-exist, systemd 'disabled' over service-file-deleted.

Pattern 2: systemd enable --now / disable --now symmetry

systemd's enable --now and disable --now are symmetric operations. enable --now fully activates a unit (enable for boot persistence, plus start); disable --now fully deactivates it (disable, plus stop). One command, one logical operation.

# Off
systemctl --user disable --now anna-gateway yui-gateway hermes-web

# Back on
systemctl --user enable --now anna-gateway yui-gateway hermes-web

The unit files stay on disk either way. The binaries, the venv, the config - all untouched. Only the 'is this currently running' and 'should this start at boot' flags change.
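The on/off pair is small enough to wrap in one function, with a status check that reads back both flags. This is a sketch rather than the script I actually run: the unit names come from the commands above, while the function name platform and the status output format are assumptions.

```shell
#!/bin/sh
# Units taken from the commands above; everything else here is illustrative.
UNITS="anna-gateway yui-gateway hermes-web"

# platform on|off|status
platform() {
    case "$1" in
        # $UNITS is left unquoted on purpose so it splits into one argument per unit.
        on)  systemctl --user enable --now $UNITS ;;
        off) systemctl --user disable --now $UNITS ;;
        status)
            for unit in $UNITS; do
                # is-enabled and is-active each print one word and exit
                # non-zero when the answer is "no", so only the text is used.
                printf '%s: %s / %s\n' "$unit" \
                    "$(systemctl --user is-enabled "$unit" 2>/dev/null)" \
                    "$(systemctl --user is-active "$unit" 2>/dev/null)"
            done
            ;;
        *) echo "usage: platform on|off|status" >&2; return 2 ;;
    esac
}
```

platform off followed by platform status should report every unit as disabled / inactive; platform on reverses both flags in one step.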

If you ever find yourself writing a script that mv's unit files into a disabled/ directory, or that deletes units to 'cleanly' remove a service, you're fighting systemd. The enabled/disabled state is the abstraction.

Anti-pattern: 'let's rm -rf the service directory to uninstall.' Great, now to reinstall you need to redownload, reconfigure, re-authenticate. Sixty seconds of fake cleanliness, an afternoon of recovery when you change your mind.

Pattern 3: Traefik file-provider keeps routes alive when backends die

Traefik's file provider loads routers and services from YAML. Those definitions stay loaded whether or not the backends are healthy. When I stop the hermes-web service, Traefik's hermes-web@file router is still enabled - it just fails on request with a 502 because the upstream at 127.0.0.1:9119 is dead.

This is what you want. To re-enable, I start the backend. Traefik's passive health check picks it up on the next request. No Traefik reload, no route reconstruction, no certificate re-provisioning.

Compare with docker-provider labels. If your router definition lives as a label on the container, stopping the container removes the router. Starting the container re-provisions it. Certificates are fine (Traefik caches), but the router goes through a full 'new route discovered' cycle. That cycle has historically been a source of transient 404s in high-churn environments.

File-provider for stable routes, docker-provider for ephemeral workloads. Know which you're using and why.
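The status code coming back from the proxy is a quick way to tell "route alive, backend dead" apart from "route gone". Here is a rough probe, assuming curl is installed; the function names and the URL in the usage note are hypothetical. Treat the mapping as a heuristic, since a backend can also return 404, 502, or 503 on its own.

```shell
#!/bin/sh
# route_state HTTP_CODE
# Maps a status code seen at the proxy to a guess about route vs. backend.
route_state() {
    case "$1" in
        000) echo "proxy-unreachable" ;;  # curl got no HTTP response at all
        404) echo "no-router" ;;          # no router matched the request
        502) echo "backend-down" ;;       # router matched, upstream connection failed
        503) echo "no-healthy-server" ;;  # router matched, no server available
        *)   echo "backend-responded" ;;  # anything else came from the backend
    esac
}

# probe URL
# Fetches only the status code and prints it next to the classification.
probe() {
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1")
    printf '%s %s\n' "$code" "$(route_state "$code")"
}
```

With hermes-web stopped, probe https://hermes.example.com/ (a placeholder hostname) should print 502 backend-down under the file provider. A 404 there would suggest the router itself has disappeared.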

Pattern 4: git stash as infrastructure snapshot

Before a risky upstream pull on Hermes - 689 commits ahead of my local, with 7 uncommitted local patches - my instinct was branch-and-rebase. But the faster reversible pattern is stash:

cd ~/.hermes/hermes-agent
git stash push -u -m 'calegix-local-mods-2026-04-19'
git fetch origin
git checkout main
git reset --hard v2026.4.16
# ... test the new version ...
# If it works: leave the stash, optionally pop later.
# If it breaks: git reset --hard 722331a5  (pre-update commit)
#               git stash pop

The stash is a snapshot. Named, retrievable, survives across branches. git stash list shows me exactly what's in there.

git stash list right now shows:

stash@{0}: On main: calegix-local-mods-2026-04-19
stash@{1}: On main: local-patches-pre-v0.9.0-update
stash@{2}: WIP on main: fad3f338 fix: patch...

A log of my last three 'before doing something risky' states. Any of them is restorable with git stash pop stash@{N}, or with git stash apply stash@{N} if I want the entry to stay in the list afterwards.

Compare with a feature branch. Great for code changes you intend to keep. Overkill for 'I want to try something and maybe throw it away.' Branches demand a cleanup step later. Stashes don't.
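One caveat with stash@{N}: the index shifts every time a new stash is pushed, so the message is the stable handle. The helper below is a sketch that looks a stash up by its message and restores it with apply rather than pop, so the snapshot stays in the list. Both function names are hypothetical, and it assumes the default git stash list output format shown above.

```shell
#!/bin/sh
# stash_ref MESSAGE
# Prints the stash@{N} ref of the newest stash whose list line contains MESSAGE.
stash_ref() {
    git stash list | grep -F -- "$1" | head -n 1 | cut -d: -f1
}

# restore_snapshot MESSAGE
# Applies the matching stash but keeps it in the list (apply, not pop).
restore_snapshot() {
    ref=$(stash_ref "$1")
    if [ -z "$ref" ]; then
        echo "no stash matching: $1" >&2
        return 1
    fi
    git stash apply "$ref"
}
```

restore_snapshot calegix-local-mods-2026-04-19 then works no matter how many stashes have been pushed since, and git stash drop can remove the entry once it is no longer needed.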

Pattern 5: configuration lives in files, not arguments

Every knob I turn by SSHing in and modifying a live process is a knob I cannot reverse cheaply. Every knob I turn by editing a config file and sending a SIGHUP is fully reversible: the old config is in my git history.

For Hermes config, this means:

- Agent behavior: jobs.json in git

- Credentials: .env files with a .env.example template in git

- Skills: a directory of MD files in git

- Nothing consequential configured via kubectl edit or hermes config set where the state lives only in the running binary's memory

When I want to turn off the daily-summary cron job, I set enabled to false on its entry in jobs.json (strict JSON has no comment syntax, so a flag is safer than commenting lines out), git commit, restart the gateway. The off-state is a commit. The on-state is a commit. Which state I'm in is whatever HEAD points at.
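The commit-then-restart loop, and its undo, can be sketched as two small functions. The assumptions are that jobs.json is tracked in the current repository and that the unit reading it is passed in by name; apply_config and undo_config are hypothetical names, not part of Hermes.

```shell
#!/bin/sh
# apply_config UNIT MESSAGE
# Commits the already-edited jobs.json (and only that file), then restarts
# the unit that reads it. The restart is skipped if the commit fails.
apply_config() {
    git commit -m "$2" -- jobs.json && systemctl --user restart "$1"
}

# undo_config UNIT
# Reverts the most recent commit as a new commit, so history keeps both
# states, then restarts the unit. Only correct when the config change is
# still the commit at HEAD.
undo_config() {
    git revert --no-edit HEAD && systemctl --user restart "$1"
}
```

For example, apply_config anna-gateway 'disable daily-summary' after editing the file, and undo_config anna-gateway to flip it back. Which gateway unit actually reads jobs.json is an assumption here.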

The principle

The common thread across all five: never put yourself in a state where the recovery procedure depends on human memory, ad-hoc typing, or 'figure it out then.'

The off state is a diff. The on state is the diff reversed. If the cost of applying the diff is measured in seconds, you will experiment. If the cost is measured in tickets-to-ops and change-advisory-board-meetings, you will avoid change - and that's where the real bugs compound, not in the experiments you couldn't try.

Reversibility is permission to iterate. It's the single most important quality of a platform you enjoy working on.

What to audit in your own stack

For every service you operate, ask:

- What's the one-command off? What's the one-command on?

- When I 'delete' a config object in the UI, does the system actually delete it, or flag it hidden? What happens on undelete?

- If I lose the running process's memory state right now, is there anything in the system I can't reconstruct from files in git?

If any of those answers require more than one command, one flag, or one commit - that's your next refactor target.
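The third question has a cheap first-pass check: ask git what in a config directory it could not give back. A sketch, assuming each directory passed in is a git working tree; drift and audit are hypothetical names. Ignored files such as .env will show up, which is the point, since those need a recovery story outside git.

```shell
#!/bin/sh
# drift DIR
# Lists everything in DIR that differs from HEAD: modified and staged files,
# untracked files (??), and ignored files (!!).
drift() {
    git -C "$1" status --porcelain --ignored
}

# audit DIR...
# Prints one summary line per directory.
audit() {
    for dir in "$@"; do
        # tr strips the padding some wc implementations add.
        count=$(drift "$dir" | wc -l | tr -d ' ')
        printf '%s: %s path(s) not recoverable from HEAD\n' "$dir" "$count"
    done
}
```

Running audit ~/.hermes/hermes-agent gives a number to drive toward zero, or at least toward a short list of files with a documented backup.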