Restart-on-Crash for Long-Running AI Agents
Cap respawns at 5 in any 60-second window. Beyond that, mark the agent crashed-out and stop. This catches transient model 500s and network hiccups while preventing infinite hot-loop spam when a config bug means the agent dies on startup.
The first time you leave Claude Code running overnight you discover something: it sometimes dies. Anthropic's API 500s. Your network drops. The agent tries to git push against a remote that's down for maintenance and the wrapper script exits non-zero. You wake up to a terminated process and a half-done task.
The naive fix — "respawn whenever it dies" — has the opposite failure: a typo in the command means the process dies in 50ms, gets respawned, dies, gets respawned, and now you've burned 6,000 spawns and used your token quota for the day on initialization.
The 5-in-60s ceiling
The simplest workable rule: respawn until you hit 5 deaths in any rolling 60-second window. After that, stop. This handles both ends:
- Transient errors. Model 500 → respawn → success. Counter increments to 1, decays to 0 over the next minute.
- Hard failures. Bad config → instant death → respawn → instant death. Counter hits 5 within seconds. Stop. Mark crashed-out.
The rolling window is the trick. Count deaths within the last 60 seconds, not since the start of time. An agent that runs for an hour, dies, runs another hour, dies — that's not a crash loop, even if it's the 100th death of its life.
Implementation sketch
```go
type Restarter struct {
	deaths []time.Time   // timestamps of recent deaths
	max    int           // 5
	window time.Duration // 60s
}

func (r *Restarter) ShouldRespawn() bool {
	now := time.Now()
	cutoff := now.Add(-r.window)
	// Drop deaths that have aged out of the rolling window,
	// reusing the backing array to avoid an allocation.
	fresh := r.deaths[:0]
	for _, t := range r.deaths {
		if t.After(cutoff) {
			fresh = append(fresh, t)
		}
	}
	r.deaths = fresh
	return len(r.deaths) < r.max
}

func (r *Restarter) RecordDeath() { r.deaths = append(r.deaths, time.Now()) }
```
That's it. Three fields of state and one pruning loop.
What gets preserved
Across a respawn:
- Working directory (don't change it)
- Environment variables (don't change them)
- The command itself (verbatim — if the user changed it, that's a new spawn)
- Sandbox profile (regenerated from the same capability tokens)
Not preserved:
- The PTY's scrollback buffer (it's a new PTY)
- The agent's in-memory state (it's a new process)
- Any uncommitted work the agent had in flight
The last bullet is why this is "restart-on-crash" not "snapshot and restore." For agents whose work is durable (writes to disk, commits to git, sends emails), restart is fine. For agents whose work is in-memory only, restart loses progress — you need to design the agent to checkpoint.
After the ceiling: crash out, loudly
When the agent hits the 5-in-60s ceiling, don't respawn, but don't fail silently either. Mark the agent crashed-out in the dashboard. Send a notification. Log every death's exit code so the user can see "this is what kept happening."
The user can manually re-enable from the UI. That's the right escape hatch — it's an explicit human gesture saying "I think the underlying issue is fixed, try again."
What about exponential backoff?
Tempting but wrong for this case. Exponential backoff is for "the resource will probably come back, give it time" — like a database reconnect. A crashing agent is more like "either it works in the first second or it doesn't." If it doesn't work in 5 quick respawns, it's not a timing issue; it's a configuration issue. Stop and let the human look.
Edge case: long-running then dies cleanly
An agent that runs successfully for 47 minutes and then exits with code 0 — was that a crash? Probably not, but you don't know. Default policy: don't respawn on exit code 0. That's the agent saying "I'm done." Respawn on non-zero exits and on signal-induced deaths. Make this configurable per-agent for the cases where exit-code-0 means "I want to be reborn."
Edge case: machine reboots
If the daemon is running under systemd / launchd / Windows service manager, the daemon comes back after reboot. Agents marked autoRestart get respawned with their original command. The 60-second window resets. So a reboot is a free respawn — the agent gets up to 5 deaths in the new window before stopping.
What we ship
Celistra applies this exact rule to every agent spawned with autoRestart: true. The 30-day SQLite history records every death, every respawn, every final crashed-out state. The dashboard shows respawn count next to runtime — at a glance you can tell which agents are flapping.
FAQ
Can I tune the 5/60s ceiling?
Yes — per-agent config. Some workloads (cron-like) want 1/24h, some (websocket reconnectors) want 30/60s. The default of 5/60s was chosen to fit typical AI-agent workloads.
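As a rough sketch of what that per-agent tuning could look like (the field names here are hypothetical, not Celistra's actual schema):

```json
{
  "agents": {
    "overnight-coder": { "autoRestart": true, "maxDeaths": 5,  "window": "60s" },
    "nightly-report":  { "autoRestart": true, "maxDeaths": 1,  "window": "24h" },
    "ws-reconnector":  { "autoRestart": true, "maxDeaths": 30, "window": "60s" }
  }
}
```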
What about graceful shutdown?
On manual stop (kill button or SIGTERM via celistrad ctl), the daemon doesn't respawn. Auto-restart only kicks in on unexpected exits.
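One way to keep a deliberate stop from being counted as a crash is an intent flag that the stop path sets before signaling the process, and the supervisor checks before respawning. A minimal sketch; the names are assumptions:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// stopRequested is set by the kill-button / ctl path *before* the
// process is signaled, so the exit that follows is expected.
var stopRequested atomic.Bool

// onExit decides whether a death should trigger the respawn logic.
func onExit() (respawn bool) {
	if stopRequested.Load() {
		return false // the human asked for this exit: stay down
	}
	return true // unexpected death: auto-restart logic applies
}

func main() {
	fmt.Println(onExit()) // no stop requested yet: respawn
	stopRequested.Store(true)
	fmt.Println(onExit()) // manual stop: don't respawn
}
```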
Does this replay stdin?
No. The new PTY starts fresh. If your agent expected piped input, capture it in the spawn command (cat input.txt | claude ...) so the new spawn re-receives it.
How does this interact with sandbox restart-to-apply?
Sandbox profile regen happens on every spawn (it's nearly free). If you grant a new capability mid-life, the auto-restart respawns under the new profile next time the agent dies — or you can manually restart it to pick up the change.