Playbook: debug-a-failed-run
Intent
Take a non-zero agentspec exit, classify it by exit code + error fragment, choose the remediation.
Preconditions
- Access to the user's command (exact argv), stderr line, ACLI envelope on stdout, and — for exit 3 — the execution record under
<workdir>/.agentspec/records/.
- The manifest and (if applicable) lockfile the run used.
Steps
- Classify by exit code first. This narrows the failure class before you read any message text.
| Exit |
Class |
Happened at |
0 |
Success |
— |
1 |
Parse / IO / resolution |
Before the runner spawned anything |
2 |
Validation / policy |
Post-resolve, pre-spawn |
3 |
Runtime |
Inside the spawned CLI |
127 |
Binary not found at exec time |
Inside sandbox; runtime CLI missing from PATH |
- Parse the ACLI envelope. agentspec failures are structured:
{ "ok": false, "command": "<cmd>", "error": { "code": "<CODE>", "message": "…", "hint": "…" }, "meta": {…} }
Match on error.code, not on error.message (messages can change; codes are the contract).
- Apply the table below.
Failure modes
Exit 1 — pre-resolution
error.code / fragment |
Cause |
Remedy |
FILE_NOT_FOUND / failed to read <path> |
Missing manifest |
Check the path; confirm working dir |
INVALID_MANIFEST / validation error for AgentManifest |
Schema violation |
Run agentspec validate — it prints the field-level errors |
NO_MODEL_RESOLVED / No model could be resolved |
No preferred entry matches runtime + auth |
agentspec resolve --verbose shows the chain; install a CLI or set the required env key |
AMBIGUOUS_AGENT_PATH / Unknown agent format |
File doesn't end in .agent and isn't a dir |
Rename, or point at the directory containing a manifest.agent |
Exit 2 — policy rejection
error.code / fragment |
Cause |
Remedy |
ISOLATION_UNAVAILABLE / refusing to run without isolation |
Tight trust + bwrap missing, no --unsafe-no-isolation |
Install bwrap; or --via=none --unsafe-no-isolation if the trust really is permissive |
MANIFEST_LOCK_MISMATCH / manifest hash … does not match lock's |
Manifest edited after lock written |
Re-lock if intentional; revert edit otherwise |
LOCK_NOT_SIGNED |
--require-signed passed but lock has no signature envelope |
Re-lock with --sign-key-env <VAR> |
LOCK_SIG_INVALID |
Signature verifies as malformed or doesn't match pubkey |
Wrong pubkey or lock was tampered. Do not proceed without investigating |
PUBKEY_MALFORMED |
The --pubkey <hex> arg isn't 32 bytes of valid hex |
Distinct from signature check; fix the key argument |
COST_BUDGET_EXCEEDED (pre-flight estimate) |
observability.cost_limit lower than estimated usage |
Raise the limit or narrow the plan |
Exit 3 — runtime failure
The runtime CLI itself exited non-zero. Read the line following Launching <runtime>... for the provider's own error.
| Fragment |
Cause |
Remedy |
401 Unauthorized / 403 Forbidden |
Auth token expired or wrong |
Check which env var the resolver picked (Auth: env.<NAME> in the run banner); rotate |
rate_limit / 429 |
Provider throttling |
Back off, reduce concurrency, or swap to next preferred entry |
timeout with sandbox network: none |
Stage declared no network; provider call blocked |
Widen trust.network to allowed or confirm offline mode is intended |
bwrap: execvp <binary>: No such file or directory |
On host, binary exists; inside sandbox, ELF interpreter unreachable |
v0.5.1+ fixes the /lib64 symlink case. Earlier: update, or widen trust.filesystem |
| Runtime-specific stack trace |
Bug in the runtime CLI |
Reproduce outside agentspec (claude-code -p "…") to confirm |
Exit 127 — exec not found inside sandbox
| Fragment |
Cause |
Remedy |
failed to spawn command |
Runtime CLI missing from PATH inside bwrap's view |
Confirm CLI is under a bound system path (/usr/bin, /nix/store). Under very tight trust, include it in trust.scope: |
Diagnostic signals from the record
Even a failed run produces an execution record. Two fields are load-bearing:
{ "exit_code": 3, "outcome": "runtime_error", "duration_ms": 42 }
duration_ms range |
Likely story |
< 100 |
Sandbox or CLI startup failed before the provider call; check stderr |
100 – 2000 |
Auth validation or early rejection |
> 2000 |
Reached the provider; read the runtime's own error |
When the record says success but output is wrong
Records track exit code + duration, not LLM output quality. Enable observability.trace: true in the manifest and pair with the runtime's own logging (e.g. claude-code --verbose, Gemini CLI traces). AgentSpec's job is "did the process run under the declared policy;" it does not judge the LLM's reasoning.
Verification
# Did the run finish? Exit code is the fastest signal.
echo $?
# Structured envelope if JSON output was requested.
agentspec run … --output json | python3 -c "import sys,json; e=json.load(sys.stdin); print(e.get('ok'), e.get('error', {}).get('code'))"
# Record for any run, regardless of exit:
agentspec records show <run-id>
See also