Resumed slash dispatch was still dropping back to prose for several JSON-capable local commands, which forced automation to special-case direct CLI invocations versus --resume flows. This routes resumed local-command handlers through the same structured JSON payloads used by direct status, sandbox, inventory, version, and init commands, and records the inventory parity audit result in the roadmap.

Constraint: Text-mode resumed output must stay unchanged for existing shell users
Rejected: Teach callers to scrape resumed text output | brittle and defeats the JSON contract
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: When a direct local command has a JSON renderer, keep resumed slash dispatch on the same serializer instead of adding one-off format branches
Tested: cargo fmt --check; cargo test --workspace; cargo clippy --workspace --all-targets -- -D warnings
Not-tested: Live provider-backed REPL resume flows outside the local test harness
ROADMAP.md
Clawable Coding Harness Roadmap
Goal
Turn claw-code into the most clawable coding harness:
- no human-first terminal assumptions
- no fragile prompt injection timing
- no opaque session state
- no hidden plugin or MCP failures
- no manual babysitting for routine recovery
This roadmap assumes the primary users are claws wired through hooks, plugins, sessions, and channel events.
Definition of "clawable"
A clawable harness is:
- deterministic to start
- machine-readable in state and failure modes
- recoverable without a human watching the terminal
- branch/test/worktree aware
- plugin/MCP lifecycle aware
- event-first, not log-first
- capable of autonomous next-step execution
Current Pain Points
1. Session boot is fragile
- trust prompts can block TUI startup
- prompts can land in the shell instead of the coding agent
- "session exists" does not mean "session is ready"
2. Truth is split across layers
- tmux state
- clawhip event stream
- git/worktree state
- test state
- gateway/plugin/MCP runtime state
3. Events are too log-shaped
- claws currently infer too much from noisy text
- important states are not normalized into machine-readable events
4. Recovery loops are too manual
- restart worker
- accept trust prompt
- re-inject prompt
- detect stale branch
- retry failed startup
- classify infra vs code failures manually
5. Branch freshness is not enforced enough
- side branches can miss already-landed main fixes
- broad test failures can be stale-branch noise instead of real regressions
6. Plugin/MCP failures are under-classified
- startup failures, handshake failures, config errors, partial startup, and degraded mode are not exposed cleanly enough
7. Human UX still leaks into claw workflows
- too much depends on terminal/TUI behavior instead of explicit agent state transitions and control APIs
Product Principles
- State machine first — every worker has explicit lifecycle states.
- Events over scraped prose — channel output should be derived from typed events.
- Recovery before escalation — known failure modes should auto-heal once before asking for help.
- Branch freshness before blame — detect stale branches before treating red tests as new regressions.
- Partial success is first-class — e.g. MCP startup can succeed for some servers and fail for others, with structured degraded-mode reporting.
- Terminal is transport, not truth — tmux/TUI may remain implementation details, but orchestration state must live above them.
- Policy is executable — merge, retry, rebase, stale cleanup, and escalation rules should be machine-enforced.
Roadmap
Phase 1 — Reliable Worker Boot
1. Ready-handshake lifecycle for coding workers
Add explicit states:
- spawning
- trust_required
- ready_for_prompt
- prompt_accepted
- running
- blocked
- finished
- failed
Acceptance:
- prompts are never sent before ready_for_prompt
- trust prompt state is detectable and emitted
- shell misdelivery becomes detectable as a first-class failure state
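The first acceptance criterion can be sketched as a guard on the send path. This is a minimal illustration, not the project's implementation: `WorkerState`, `Worker`, `SendError`, and `send_prompt` are hypothetical names that simply mirror the states listed above.

```rust
/// Hypothetical lifecycle states, named after the roadmap's state list.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WorkerState {
    Spawning,
    TrustRequired,
    ReadyForPrompt,
    PromptAccepted,
    Running,
    Blocked,
    Finished,
    Failed,
}

/// Hypothetical error returned when a prompt is sent too early.
#[derive(Debug, PartialEq, Eq)]
pub enum SendError {
    /// The worker was in this state instead of `ReadyForPrompt`.
    NotReady(WorkerState),
}

pub struct Worker {
    pub state: WorkerState,
}

impl Worker {
    pub fn new() -> Self {
        Worker { state: WorkerState::Spawning }
    }

    /// Refuse to deliver a prompt unless the worker is ready for one,
    /// so a prompt can never land in a shell or a trust dialog.
    pub fn send_prompt(&mut self, _prompt: &str) -> Result<(), SendError> {
        if self.state != WorkerState::ReadyForPrompt {
            return Err(SendError::NotReady(self.state));
        }
        self.state = WorkerState::PromptAccepted;
        Ok(())
    }
}
```

A caller that receives `SendError::NotReady(WorkerState::TrustRequired)` knows the trust prompt is the blocker, without scraping the terminal.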
2. Trust prompt resolver
Add allowlisted auto-trust behavior for known repos/worktrees.
Acceptance:
- trusted repos auto-clear trust prompts
- events emitted for trust_required and trust_resolved
- non-allowlisted repos remain gated
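One way to picture the allowlist behavior is a resolver that always emits a trust-required event and only adds a resolved event for paths under an allowlisted root. The type and method names below are hypothetical, and matching by path prefix is an assumption of this sketch rather than something the roadmap specifies.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical events mirroring trust_required / trust_resolved.
#[derive(Debug, PartialEq, Eq)]
pub enum TrustEvent {
    TrustRequired { repo: PathBuf },
    TrustResolved { repo: PathBuf },
}

/// Hypothetical resolver holding allowlisted repo/worktree roots.
pub struct TrustResolver {
    allowlist: Vec<PathBuf>,
}

impl TrustResolver {
    pub fn new(allowlist: Vec<PathBuf>) -> Self {
        TrustResolver { allowlist }
    }

    /// Returns the events produced when a trust prompt appears for `repo`.
    /// `Path::starts_with` compares whole path components, so
    /// "/work/trusted-evil" does not match the root "/work/trusted".
    pub fn resolve(&self, repo: &Path) -> Vec<TrustEvent> {
        let mut events = vec![TrustEvent::TrustRequired { repo: repo.to_path_buf() }];
        if self.allowlist.iter().any(|root| repo.starts_with(root)) {
            events.push(TrustEvent::TrustResolved { repo: repo.to_path_buf() });
        }
        events
    }
}
```

A non-allowlisted repo yields only the `TrustRequired` event, which leaves it gated for a human or a higher-level policy to decide.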
3. Structured session control API
Provide machine control above tmux:
- create worker
- await ready
- send task
- fetch state
- fetch last error
- restart worker
- terminate worker
Acceptance:
- a claw can operate a coding worker without raw send-keys as the primary control plane
Phase 2 — Event-Native Clawhip Integration
4. Canonical lane event schema
Define typed events such as:
- lane.started
- lane.ready
- lane.prompt_misdelivery
- lane.blocked
- lane.red
- lane.green
- lane.commit.created
- lane.pr.opened
- lane.merge.ready
- lane.finished
- lane.failed
- branch.stale_against_main
Acceptance:
- clawhip consumes typed lane events
- Discord summaries are rendered from structured events instead of pane scraping alone
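As an illustration of rendering summaries from typed events, the sketch below maps each event kind to the dotted names listed above and builds a status line from the variant rather than from pane text. `LaneEventKind`, `wire_name`, and `render_summary` are hypothetical names and are not claimed to match the types in the codebase.

```rust
/// Hypothetical typed event kinds; one variant per roadmap event name.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LaneEventKind {
    Started,
    Ready,
    PromptMisdelivery,
    Blocked,
    Red,
    Green,
    CommitCreated,
    PrOpened,
    MergeReady,
    Finished,
    Failed,
    BranchStaleAgainstMain,
}

impl LaneEventKind {
    /// Stable machine-readable name for the event.
    pub fn wire_name(self) -> &'static str {
        match self {
            LaneEventKind::Started => "lane.started",
            LaneEventKind::Ready => "lane.ready",
            LaneEventKind::PromptMisdelivery => "lane.prompt_misdelivery",
            LaneEventKind::Blocked => "lane.blocked",
            LaneEventKind::Red => "lane.red",
            LaneEventKind::Green => "lane.green",
            LaneEventKind::CommitCreated => "lane.commit.created",
            LaneEventKind::PrOpened => "lane.pr.opened",
            LaneEventKind::MergeReady => "lane.merge.ready",
            LaneEventKind::Finished => "lane.finished",
            LaneEventKind::Failed => "lane.failed",
            LaneEventKind::BranchStaleAgainstMain => "branch.stale_against_main",
        }
    }
}

/// Render a short channel line from the typed event, not from scraped text.
pub fn render_summary(lane: &str, kind: LaneEventKind) -> String {
    format!("[{}] {}", lane, kind.wire_name())
}
```

Because the match is exhaustive, adding a new event kind forces the wire name to be chosen at compile time.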
5. Failure taxonomy
Normalize failure classes:
- prompt_delivery
- trust_gate
- branch_divergence
- compile
- test
- plugin_startup
- mcp_startup
- mcp_handshake
- gateway_routing
- tool_runtime
- infra
Acceptance:
- blockers are machine-classified
- dashboards and retry policies can branch on failure type
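To show what "branch on failure type" could look like, here is a small sketch with one enum variant per failure class and a retry policy keyed on it. The names are hypothetical, and the split between retryable and escalated classes is purely an assumption made for illustration; the roadmap does not define it.

```rust
/// Hypothetical failure classes, one per entry in the taxonomy above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FailureClass {
    PromptDelivery,
    TrustGate,
    BranchDivergence,
    Compile,
    Test,
    PluginStartup,
    McpStartup,
    McpHandshake,
    GatewayRouting,
    ToolRuntime,
    Infra,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RetryDecision {
    RetryOnce,
    Escalate,
}

/// Assumed policy: environment-shaped failures get one retry,
/// code-shaped failures go straight to escalation.
pub fn retry_policy(class: FailureClass) -> RetryDecision {
    match class {
        FailureClass::PromptDelivery
        | FailureClass::TrustGate
        | FailureClass::BranchDivergence
        | FailureClass::PluginStartup
        | FailureClass::McpStartup
        | FailureClass::McpHandshake
        | FailureClass::GatewayRouting
        | FailureClass::Infra => RetryDecision::RetryOnce,
        FailureClass::Compile | FailureClass::Test | FailureClass::ToolRuntime => {
            RetryDecision::Escalate
        }
    }
}
```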
6. Actionable summary compression
Collapse noisy event streams into:
- current phase
- last successful checkpoint
- current blocker
- recommended next recovery action
Acceptance:
- channel status updates stay short and machine-grounded
- claws stop inferring state from raw build spam
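The compression step can be thought of as a fold over the event stream that keeps only the four fields listed above. The sketch below assumes a hypothetical `RawEvent` input type; it is not the project's SummaryCompressor.

```rust
/// Hypothetical input events; `Noise` stands in for raw build output.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RawEvent {
    PhaseEntered(String),
    Checkpoint(String),
    Blocked { reason: String, next_action: String },
    Unblocked,
    Noise(String),
}

/// The four fields a status update needs.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct LaneSummary {
    pub current_phase: Option<String>,
    pub last_checkpoint: Option<String>,
    pub current_blocker: Option<String>,
    pub next_action: Option<String>,
}

/// Fold the stream, letting later events overwrite earlier ones
/// and dropping noise entirely.
pub fn compress(events: &[RawEvent]) -> LaneSummary {
    let mut summary = LaneSummary::default();
    for event in events {
        match event {
            RawEvent::PhaseEntered(phase) => summary.current_phase = Some(phase.clone()),
            RawEvent::Checkpoint(name) => summary.last_checkpoint = Some(name.clone()),
            RawEvent::Blocked { reason, next_action } => {
                summary.current_blocker = Some(reason.clone());
                summary.next_action = Some(next_action.clone());
            }
            RawEvent::Unblocked => {
                summary.current_blocker = None;
                summary.next_action = None;
            }
            RawEvent::Noise(_) => {}
        }
    }
    summary
}
```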
Phase 3 — Branch/Test Awareness and Auto-Recovery
7. Stale-branch detection before broad verification
Before broad test runs, compare current branch to main and detect if known fixes are missing.
Acceptance:
- emit branch.stale_against_main
- suggest or auto-run rebase/merge-forward according to policy
- avoid misclassifying stale-branch failures as new regressions
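A coarse freshness check can be built on the output of `git rev-list --left-right --count main...HEAD`, which prints two counts: commits only on main (behind) and commits only on the current branch (ahead). The sketch below only parses that output; `Freshness` and `parse_rev_list_counts` are hypothetical names, and treating "behind > 0" as stale is an assumption that approximates, rather than implements, detection of specific missing fixes.

```rust
/// Hypothetical freshness metrics relative to main.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Freshness {
    /// Commits on main that the branch does not have.
    pub behind: u32,
    /// Commits on the branch that main does not have.
    pub ahead: u32,
}

impl Freshness {
    /// Assumed rule: any missing main commit makes the branch stale.
    pub fn is_stale(&self) -> bool {
        self.behind > 0
    }
}

/// Parse the "<behind>\t<ahead>" output of
/// `git rev-list --left-right --count main...HEAD`.
pub fn parse_rev_list_counts(output: &str) -> Option<Freshness> {
    let mut parts = output.split_whitespace();
    let behind = parts.next()?.parse().ok()?;
    let ahead = parts.next()?.parse().ok()?;
    Some(Freshness { behind, ahead })
}
```

Running this before a broad test pass lets the harness emit the stale-branch event first and attribute later red tests accordingly.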
8. Recovery recipes for common failures
Encode known automatic recoveries for:
- trust prompt unresolved
- prompt delivered to shell
- stale branch
- compile red after cross-crate refactor
- MCP startup handshake failure
- partial plugin startup
Acceptance:
- one automatic recovery attempt occurs before escalation
- the attempted recovery is itself emitted as structured event data
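The "one attempt, then escalate" rule can be sketched as a helper that runs a recipe exactly once and reports what happened as event data. `RecoveryEvent` and `recover_once` are hypothetical names, and the recipe is modelled as a closure returning success or failure.

```rust
/// Hypothetical structured record of a recovery attempt.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RecoveryEvent {
    Attempted { scenario: String, succeeded: bool },
    Escalated { scenario: String },
}

/// Run `recipe` exactly once. The attempt is always recorded;
/// an escalation event follows only if the attempt failed.
pub fn recover_once<F>(scenario: &str, recipe: F) -> Vec<RecoveryEvent>
where
    F: FnOnce() -> bool,
{
    let succeeded = recipe();
    let mut events = vec![RecoveryEvent::Attempted {
        scenario: scenario.to_string(),
        succeeded,
    }];
    if !succeeded {
        events.push(RecoveryEvent::Escalated {
            scenario: scenario.to_string(),
        });
    }
    events
}
```

Taking the recipe as `FnOnce` makes the single-attempt limit part of the type signature: the helper cannot call it twice.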
9. Green-ness contract
Workers should distinguish:
- targeted tests green
- package green
- workspace green
- merge-ready green
Acceptance:
- no more ambiguous "tests passed" messaging
- merge policy can require the correct green level for the lane type
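If the four green levels are treated as strictly nested (an assumption of this sketch, where each level implies the ones before it), they can be modelled as an ordered enum so that merge policy becomes a comparison. `GreenLevel` and `satisfies` are hypothetical names.

```rust
/// Hypothetical green levels. Derived `Ord` follows declaration order,
/// so TargetedTests < Package < Workspace < MergeReady.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum GreenLevel {
    TargetedTests,
    Package,
    Workspace,
    MergeReady,
}

/// True when the observed level meets or exceeds what the lane requires.
pub fn satisfies(observed: GreenLevel, required: GreenLevel) -> bool {
    observed >= required
}
```

A lane that requires `Workspace` would then reject a report of `Package` green instead of accepting an ambiguous "tests passed".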
Phase 4 — Claws-First Task Execution
10. Typed task packet format
Define a structured task packet with fields like:
- objective
- scope
- repo/worktree
- branch policy
- acceptance tests
- commit policy
- reporting contract
- escalation policy
Acceptance:
- claws can dispatch work without relying on long natural-language prompt blobs alone
- task packets can be logged, retried, and transformed safely
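A task packet along these lines might be a plain struct with a validation step that reports missing fields by name. The sketch below is illustrative only: `TaskPacketSketch` is a hypothetical type, all fields are simplified to strings, and the choice of which fields are mandatory is an assumption.

```rust
/// Hypothetical packet with one field per item in the list above.
#[derive(Debug, Clone, Default)]
pub struct TaskPacketSketch {
    pub objective: String,
    pub scope: String,
    pub worktree: String,
    pub branch_policy: String,
    pub acceptance_tests: Vec<String>,
    pub commit_policy: String,
    pub reporting_contract: String,
    pub escalation_policy: String,
}

impl TaskPacketSketch {
    /// Assumed minimum: objective, scope, worktree, and at least one
    /// acceptance test. Returns the names of all missing fields.
    pub fn validate(&self) -> Result<(), Vec<&'static str>> {
        let mut missing = Vec::new();
        if self.objective.trim().is_empty() {
            missing.push("objective");
        }
        if self.scope.trim().is_empty() {
            missing.push("scope");
        }
        if self.worktree.trim().is_empty() {
            missing.push("worktree");
        }
        if self.acceptance_tests.is_empty() {
            missing.push("acceptance_tests");
        }
        if missing.is_empty() {
            Ok(())
        } else {
            Err(missing)
        }
    }
}
```

Returning every missing field at once, rather than the first one, gives a dispatching claw enough information to repair the packet in a single retry.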
11. Policy engine for autonomous coding
Encode automation rules such as:
- if green + scoped diff + review passed -> merge to dev
- if stale branch -> merge-forward before broad tests
- if startup blocked -> recover once, then escalate
- if lane completed -> emit closeout and cleanup session
Acceptance:
- doctrine moves from chat instructions into executable rules
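The four rules above could be encoded as a pure function from observed lane facts to an action. This is a sketch under assumptions: `LaneFacts`, `Action`, and `decide` are hypothetical names, and the precedence between rules (startup blockage first, then closeout, then staleness, then merge) is chosen here for illustration and is not stated in the roadmap.

```rust
/// Hypothetical snapshot of what is known about a lane.
#[derive(Debug, Clone, Default)]
pub struct LaneFacts {
    pub green: bool,
    pub scoped_diff: bool,
    pub review_passed: bool,
    pub stale_branch: bool,
    pub startup_blocked: bool,
    pub recovery_attempts: u32,
    pub lane_completed: bool,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    RecoverStartup,
    Escalate,
    CloseoutAndCleanup,
    MergeForward,
    MergeToDev,
    Wait,
}

/// Evaluate the rules in an assumed fixed order; the first match wins.
pub fn decide(facts: &LaneFacts) -> Action {
    if facts.startup_blocked {
        // Recover once, then escalate.
        return if facts.recovery_attempts == 0 {
            Action::RecoverStartup
        } else {
            Action::Escalate
        };
    }
    if facts.lane_completed {
        return Action::CloseoutAndCleanup;
    }
    if facts.stale_branch {
        // Merge forward before trusting any broad test result.
        return Action::MergeForward;
    }
    if facts.green && facts.scoped_diff && facts.review_passed {
        return Action::MergeToDev;
    }
    Action::Wait
}
```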
12. Claw-native dashboards / lane board
Expose a machine-readable board of:
- repos
- active claws
- worktrees
- branch freshness
- red/green state
- current blocker
- merge readiness
- last meaningful event
Acceptance:
- claws can query status directly
- human-facing views become a rendering layer, not the source of truth
Phase 5 — Plugin and MCP Lifecycle Maturity
13. First-class plugin/MCP lifecycle contract
Each plugin/MCP integration should expose:
- config validation contract
- startup healthcheck
- discovery result
- degraded-mode behavior
- shutdown/cleanup contract
Acceptance:
- partial-startup and per-server failures are reported structurally
- successful servers remain usable even when one server fails
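Partial startup can be reported by collecting per-server outcomes into one structure instead of failing on the first error. The sketch below uses hypothetical names (`StartupReport`, `FailedServer`, `collect_startup`) and takes already-computed results as input, so it shows only the reporting shape, not how servers are actually spawned.

```rust
/// Hypothetical record of one server that failed to start.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FailedServer {
    pub name: String,
    pub error: String,
}

/// Hypothetical structured startup result.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct StartupReport {
    pub healthy: Vec<String>,
    pub failed: Vec<FailedServer>,
}

impl StartupReport {
    /// Degraded means some servers are usable and some are not.
    pub fn is_degraded(&self) -> bool {
        !self.healthy.is_empty() && !self.failed.is_empty()
    }
}

/// Keep every healthy server even when others fail.
pub fn collect_startup(results: &[(&str, Result<(), String>)]) -> StartupReport {
    let mut report = StartupReport::default();
    for (name, outcome) in results {
        match outcome {
            Ok(()) => report.healthy.push(name.to_string()),
            Err(error) => report.failed.push(FailedServer {
                name: name.to_string(),
                error: error.clone(),
            }),
        }
    }
    report
}
```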
14. MCP end-to-end lifecycle parity
Close gaps from:
- config load
- server registration
- spawn/connect
- initialize handshake
- tool/resource discovery
- invocation path
- error surfacing
- shutdown/cleanup
Acceptance:
- parity harness and runtime tests cover healthy and degraded startup cases
- broken servers are surfaced as structured failures, not opaque warnings
Immediate Backlog (from current real pain)
Priority order: P0 = blocks CI/green state, P1 = blocks integration wiring, P2 = clawability hardening, P3 = swarm-efficiency improvements.
P0 — Fix first (CI reliability)
- Isolate render_diff_report tests into tmpdir — done: render_diff_report_for() tests run in temp git repos instead of the live working tree, and targeted cargo test -p rusty-claude-cli render_diff_report -- --nocapture now stays green during branch/worktree activity
- Expand GitHub CI from single-crate coverage to workspace-grade verification — done: .github/workflows/rust-ci.yml now runs cargo test --workspace plus fmt/clippy at the workspace level
- Add release-grade binary workflow — done: .github/workflows/release.yml now builds tagged Rust release artifacts for the CLI
- Add container-first test/run docs — done: Containerfile + docs/container.md document the canonical Docker/Podman workflow for build, bind-mount, and cargo test --workspace usage
- Surface doctor / preflight diagnostics in onboarding docs and help — done: README + USAGE now put claw doctor / /doctor in the first-run path and point at the built-in preflight report
- Automate branding/source-of-truth residue checks in CI — done: .github/scripts/check_doc_source_of_truth.py and the doc-source-of-truth CI job now block stale repo/org/invite residue in tracked docs and metadata
- Eliminate warning spam from first-run help/build path — done: current cargo run -q -p rusty-claude-cli -- --help renders clean help output without a warning wall before the product surface
- Promote doctor from slash-only to top-level CLI entrypoint — done: claw doctor is now a local shell entrypoint with regression coverage for direct help and health-report output
- Make machine-readable status commands actually machine-readable — done: claw --output-format json status and claw --output-format json sandbox now emit structured JSON snapshots instead of prose tables
- Unify legacy config/skill namespaces in user-facing output — done: skills/help JSON/text output now present .claw as the canonical namespace and collapse legacy roots behind .claw-shaped source ids/labels
- Honor JSON output on inventory commands like skills and mcp — done: direct CLI inventory commands now honor --output-format json with structured payloads for both skills and MCP inventory
- Audit --output-format contract across the whole CLI surface — done: direct CLI commands now honor deterministic JSON/text handling across help/version/status/sandbox/agents/mcp/skills/bootstrap-plan/system-prompt/init/doctor, with regression coverage in output_format_contract.rs and resumed /status JSON coverage
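The output-format contract is easiest to keep when direct and resumed entry points call the same renderer. The sketch below illustrates that shape with hypothetical names (`OutputFormat`, `SandboxStatus`, `render_sandbox`, `run_direct`, `run_resumed_slash`) and a made-up two-field payload; it does not reproduce the CLI's real JSON schema.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OutputFormat {
    Text,
    Json,
}

/// Hypothetical snapshot; the real payload has different fields.
pub struct SandboxStatus {
    pub enabled: bool,
    pub active: bool,
}

/// The single renderer that every entry point goes through.
/// Only booleans are interpolated, so no JSON escaping is needed here.
pub fn render_sandbox(status: &SandboxStatus, format: OutputFormat) -> String {
    match format {
        OutputFormat::Text => format!(
            "Sandbox\n  enabled: {}\n  active: {}",
            status.enabled, status.active
        ),
        OutputFormat::Json => format!(
            "{{\"kind\":\"sandbox\",\"enabled\":{},\"active\":{}}}",
            status.enabled, status.active
        ),
    }
}

/// Direct CLI path.
pub fn run_direct(status: &SandboxStatus, format: OutputFormat) -> String {
    render_sandbox(status, format)
}

/// Resumed slash-command path: dispatches by name, then reuses the
/// same renderer instead of adding a separate format branch.
pub fn run_resumed_slash(
    command: &str,
    status: &SandboxStatus,
    format: OutputFormat,
) -> Option<String> {
    match command {
        "/sandbox" => Some(render_sandbox(status, format)),
        _ => None,
    }
}
```

With this layout a parity regression test reduces to asserting that both paths return identical strings for each format.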
P1 — Next (integration wiring, unblocks verification)
2. Add cross-module integration tests — done: 12 integration tests covering worker→recovery→policy, stale_branch→policy, green_contract→policy, reconciliation flows
3. Wire lane-completion emitter — done: lane_completion module with detect_lane_completion() auto-sets LaneContext::completed from session-finished + tests-green + push-complete → policy closeout
4. Wire SummaryCompressor into the lane event pipeline — done: compress_summary_text() feeds into LaneEvent::Finished detail field in tools/src/lib.rs
P2 — Clawability hardening (original backlog)
5. Worker readiness handshake + trust resolution — done: WorkerStatus state machine with Spawning → TrustRequired → ReadyForPrompt → PromptAccepted → Running lifecycle, trust_auto_resolve + trust_gate_cleared gating
6. Prompt misdelivery detection and recovery — done: prompt_delivery_attempts counter, PromptMisdelivery event detection, auto_recover_prompt_misdelivery + replay_prompt recovery arm
7. Canonical lane event schema in clawhip — done: LaneEvent enum with Started/Blocked/Failed/Finished variants, LaneEvent::new() typed constructor, tools/src/lib.rs integration
8. Failure taxonomy + blocker normalization — done: WorkerFailureKind enum (TrustGate/PromptDelivery/Protocol/Provider), FailureScenario::from_worker_failure_kind() bridge to recovery recipes
9. Stale-branch detection before workspace tests — done: stale_branch.rs module with freshness detection, behind/ahead metrics, policy integration
10. MCP structured degraded-startup reporting — done: McpManager degraded-startup reporting (+183 lines in mcp_stdio.rs), failed server classification (startup/handshake/config/partial), structured failed_servers + recovery_recommendations in tool output
11. Structured task packet format — done: task_packet.rs module with TaskPacket struct, validation, serialization, TaskScope resolution (workspace/module/single-file/custom), integrated into tools/src/lib.rs
12. Lane board / machine-readable status API — done: Lane completion hardening + LaneContext::completed auto-detection + MCP degraded reporting surface machine-readable state
13. Session completion failure classification — done: WorkerFailureKind::Provider + observe_completion() + recovery recipe bridge landed
14. Config merge validation gap — done: config.rs hook validation before deep-merge (+56 lines), malformed entries fail with source-path context instead of merged parse errors
15. MCP manager discovery flaky test — done: manager_discovery_report_keeps_healthy_servers_when_one_server_fails now runs as a normal workspace test again after repeated stable passes, so degraded-startup coverage is no longer hidden behind #[ignore]
- Commit provenance / worktree-aware push events — done: LaneCommitProvenance now carries branch/worktree/canonical-commit/supersession metadata in lane events, and dedupe_superseded_commit_events() is applied before agent manifests are written so superseded commit events collapse to the latest canonical lineage
- Orphaned module integration audit — done: runtime now keeps session_control and trust_resolver behind #[cfg(test)] until they are wired into a real non-test execution path, so normal builds no longer advertise dead clawability surface area.
- Context-window preflight gap — done: provider request sizing now emits context_window_blocked before oversized requests leave the process, using a model-context registry instead of the old naive max-token heuristic.
- Subcommand help falls through into runtime/API path — done: claw doctor --help, claw status --help, claw sandbox --help, and nested mcp/skills help are now intercepted locally without runtime/provider startup, with regression tests covering the direct CLI paths.
- Session state classification gap (working vs blocked vs finished vs truly stale) — done: agent manifests now derive machine states such as working, blocked_background_job, blocked_merge_conflict, degraded_mcp, interrupted_transport, finished_pending_report, and finished_cleanable, and terminal-state persistence records commit provenance plus derived state so downstream monitoring can distinguish quiet progress from truly idle sessions.
- Resumed /status JSON parity gap — dogfooding shows fresh claw status --output-format json now emits structured JSON, but resumed slash-command status still leaks through a text-shaped path in at least one dispatch path. Local CI-equivalent repro fails rust/crates/rusty-claude-cli/tests/resume_slash_commands.rs::resumed_status_command_emits_structured_json_when_requested with expected value at line 1 column 1, so resumed automation can receive text where JSON was explicitly requested. Action: unify fresh vs resumed /status rendering through one output-format contract and add regression coverage so resumed JSON output is guaranteed valid.
- Opaque failure surface for session/runtime crashes — repeated dogfood-facing failures can currently collapse to generic wrappers like "Something went wrong while processing your request. Please try again, or use /new to start a fresh session." without exposing whether the fault was provider auth, session corruption, slash-command dispatch, render failure, or transport/runtime panic. This blocks fast self-recovery and turns actionable clawability bugs into blind retries. Action: preserve a short user-safe failure class (provider_auth, session_load, command_dispatch, render, runtime_panic, etc.), attach a local trace/session id, and ensure operators can jump from the chat-visible error to the exact failure log quickly.
- doctor --output-format json check-level structure gap — done: claw doctor --output-format json now keeps the human-readable message/report while also emitting structured per-check diagnostics (name, status, summary, details, plus typed fields like workspace paths and sandbox fallback data), with regression coverage in output_format_contract.rs.
- Plugin lifecycle init/shutdown test flakes under workspace-parallel execution — dogfooding surfaced that build_runtime_runs_plugin_lifecycle_init_and_shutdown can fail under cargo test --workspace while passing in isolation because sibling tests race on tempdir-backed shell init script paths. This is test brittleness rather than a code-path regression, but it still destabilizes CI confidence and wastes diagnosis cycles. Action: isolate temp resources per test robustly (unique dirs + no shared cwd assumptions), audit cleanup timing, and add a regression guard so the plugin lifecycle test remains stable under parallel workspace execution.
- Resumed /sandbox JSON parity gap — done: direct claw sandbox --output-format json already emitted structured JSON, but resumed claw --output-format json --resume <session> /sandbox still fell back to prose because resumed slash dispatch only emitted JSON for /status. The resumed /sandbox path now reuses the same JSON envelope as the direct CLI command, with regression coverage in rust/crates/rusty-claude-cli/tests/resume_slash_commands.rs.
- Resumed inventory JSON parity gap for /mcp and /skills — done: resumed slash-command inventory calls now honor --output-format json via the same structured renderers as direct claw mcp / claw skills, with regression coverage for resumed list output under an isolated config home.
P3 — Swarm efficiency
- Swarm branch-lock protocol — done: branch_lock::detect_branch_lock_collisions() now detects same-branch/same-scope and nested-module collisions before parallel lanes drift into duplicate implementation
- Commit provenance / worktree-aware push events — done: lane event provenance now includes branch/worktree/superseded/canonical lineage metadata, and manifest persistence de-dupes superseded commit events before downstream consumers render them
Suggested Session Split
Session A — worker boot protocol
Focus:
- trust prompt detection
- ready-for-prompt handshake
- prompt misdelivery detection
Session B — clawhip lane events
Focus:
- canonical lane event schema
- failure taxonomy
- summary compression
Session C — branch/test intelligence
Focus:
- stale-branch detection
- green-level contract
- recovery recipes
Session D — MCP lifecycle hardening
Focus:
- startup/handshake reliability
- structured failed server reporting
- degraded-mode runtime behavior
- lifecycle tests/harness coverage
Session E — typed task packets + policy engine
Focus:
- structured task format
- retry/merge/escalation rules
- autonomous lane closure behavior
MVP Success Criteria
We should consider claw-code materially more clawable when:
- a claw can start a worker and know with certainty when it is ready
- claws no longer accidentally type tasks into the shell
- stale-branch failures are identified before they waste debugging time
- clawhip reports machine states, not just tmux prose
- MCP/plugin startup failures are classified and surfaced cleanly
- a coding lane can self-recover from common startup and branch issues without human babysitting
Short Version
claw-code should evolve from:
- a CLI a human can also drive
to:
- a claw-native execution runtime
- an event-native orchestration substrate
- a plugin/hook-first autonomous coding harness