From 64e058f72091aa4487555b66d1640a6c3155e81d Mon Sep 17 00:00:00 2001 From: Yeachan-Heo Date: Thu, 16 Apr 2026 02:50:54 +0000 Subject: [PATCH] refresh --- ROADMAP.md | 647 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 647 insertions(+) diff --git a/ROADMAP.md b/ROADMAP.md index b07efdc..51e834b 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -88,6 +88,25 @@ Acceptance: - trust prompt state is detectable and emitted - shell misdelivery becomes detectable as a first-class failure state +### 1.5. First-prompt acceptance SLA +After `ready_for_prompt`, expose whether the first task was actually accepted within a bounded window instead of leaving claws in a silent limbo. + +Emit typed signals for: +- `prompt.sent` +- `prompt.accepted` +- `prompt.acceptance_delayed` +- `prompt.acceptance_timeout` + +Track at least: +- time from `ready_for_prompt` -> first prompt send +- time from first prompt send -> `prompt_accepted` +- whether acceptance required retry or recovery + +Acceptance: +- clawhip can distinguish `worker is ready but idle` from `prompt was sent but not actually accepted` +- long silent gaps between ready-state and first-task execution become machine-visible +- recovery can trigger on acceptance timeout before humans start scraping panes + ### 2. Trust prompt resolver Add allowlisted auto-trust behavior for known repos/worktrees. @@ -109,6 +128,23 @@ Provide machine control above tmux: Acceptance: - a claw can operate a coding worker without raw send-keys as the primary control plane +### 3.5. Boot preflight / doctor contract +Before spawning or prompting a worker, run a machine-readable preflight that reports whether the lane is actually safe to start. 
+ +Preflight should check and emit typed results for: +- repo/worktree existence and expected branch +- branch freshness vs base branch +- trust-gate likelihood / allowlist status +- required binaries and control sockets +- plugin discovery / allowlist / startup eligibility +- MCP config presence and server reachability expectations +- last-known failed boot reason, if any + +Acceptance: +- claws can fail fast before launching a doomed worker +- a blocked start returns a short structured diagnosis instead of forcing pane-scrape triage +- clawhip can summarize `why this lane did not even start` without inferring from terminal noise + ## Phase 2 — Event-Native Clawhip Integration ### 4. Canonical lane event schema @@ -130,6 +166,551 @@ Acceptance: - clawhip consumes typed lane events - Discord summaries are rendered from structured events instead of pane scraping alone +### 4.5. Session event ordering + terminal-state reconciliation +When the same session emits contradictory lifecycle events (`idle`, `error`, `completed`, transport/server-down) in close succession, claw-code must expose a deterministic final truth instead of making downstream claws guess. + +Required behavior: +- attach monotonic sequence / causal ordering metadata to session lifecycle events +- classify which events are terminal vs advisory +- reconcile duplicate or out-of-order terminal events into one canonical lane outcome +- distinguish `session terminal state unknown because transport died` from a real `completed` + +Acceptance: +- clawhip can survive `completed -> idle -> error -> completed` noise without double-reporting or trusting the wrong final state +- server-down after a session event burst surfaces as a typed uncertainty state rather than silently rewriting history +- downstream automation has one canonical terminal outcome per lane/session + +### 4.6. 
Event provenance / environment labeling +Every emitted event should say whether it came from a live lane, synthetic test, healthcheck, replay, or system transport layer so claws do not mistake test noise for production truth. + +Required fields: +- event source kind (`live_lane`, `test`, `healthcheck`, `replay`, `transport`) +- environment / channel label +- emitter identity +- confidence / trust level for downstream automation + +Acceptance: +- clawhip can ignore or down-rank test pings without heuristic text matching +- synthetic/system events do not contaminate lane status or trigger false follow-up automation +- event streams remain machine-trustworthy even when test traffic shares the same channel + +### 4.7. Session identity completeness at creation time +A newly created session should not surface as `(untitled)` or `(unknown)` for fields that orchestrators need immediately. + +Required behavior: +- emit stable title, workspace/worktree path, and lane/session purpose at creation time +- if any field is not yet known, emit an explicit typed placeholder reason rather than a bare unknown string +- reconcile later-enriched metadata back onto the same session identity without creating ambiguity + +Acceptance: +- clawhip can route/triage a brand-new session without waiting for follow-up chatter +- `(untitled)` / `(unknown)` creation events no longer force humans or bots to guess scope +- session creation events are immediately actionable for monitoring and ownership decisions + +### 4.8. Duplicate terminal-event suppression +When the same session emits repeated `completed`, `failed`, or other terminal notifications, claw-code should collapse duplicates before they trigger repeated downstream reactions. 
+ +Required behavior: +- attach a canonical terminal-event fingerprint per lane/session outcome +- suppress or coalesce repeated terminal notifications within a reconciliation window +- preserve raw event history for audit while exposing only one actionable terminal outcome downstream +- surface when a later duplicate materially differs from the original terminal payload + +Acceptance: +- clawhip does not double-report or double-close based on repeated terminal notifications +- duplicate `completed` bursts become one actionable finish event, not repeated noise +- downstream automation stays idempotent even when the upstream emitter is chatty + +### 4.9. Lane ownership / scope binding +Each session and lane event should declare who owns it and what workflow scope it belongs to, so unrelated external/system work does not pollute claw-code follow-up loops. + +Required behavior: +- attach owner/assignee identity when known +- attach workflow scope (e.g. `claw-code-dogfood`, `external-git-maintenance`, `infra-health`, `manual-operator`) +- mark whether the current watcher is expected to act, observe only, or ignore +- preserve scope through session restarts, resumes, and late terminal events + +Acceptance: +- clawhip can say `out-of-scope external session` without humans adding a prose disclaimer +- unrelated session churn does not trigger false claw-code follow-up or blocker reporting +- monitoring views can filter to `actionable for this claw` instead of mixing every session on the host + +### 4.10. Nudge acknowledgment / dedupe contract +Periodic clawhip nudges should carry enough state for claws to know whether the current prompt is new work, a retry, or an already-acknowledged heartbeat. 
+ +Required behavior: +- attach nudge id / cycle id and delivery timestamp +- expose whether the current claw has already acknowledged or responded for that cycle +- distinguish `new nudge`, `retry nudge`, and `stale duplicate` +- allow downstream summaries to bind a reported pinpoint back to the triggering nudge id + +Acceptance: +- claws do not keep manufacturing fresh follow-ups just because the same periodic nudge reappeared +- clawhip can tell whether silence means `not yet handled` or `already acknowledged in this cycle` +- recurring dogfood prompts become idempotent and auditable across retries + +### 4.11. Stable roadmap-id assignment for newly filed pinpoints +When a claw records a new pinpoint/follow-up, the roadmap surface should assign or expose a stable tracking id immediately instead of leaving the item as anonymous prose. + +Required behavior: +- assign a canonical roadmap id at filing time +- expose that id in the structured event/report payload +- preserve the same id across later edits, reorderings, and summary compression +- distinguish `new roadmap filing` from `update to existing roadmap item` + +Acceptance: +- channel updates can reference a newly filed pinpoint by stable id in the same turn +- downstream claws do not need heuristic text matching to figure out whether a follow-up is new or already tracked +- roadmap-driven dogfood loops stay auditable even as the document is edited repeatedly + +### 4.12. Roadmap item lifecycle state contract +Each roadmap pinpoint should carry a machine-readable lifecycle state so claws do not keep rediscovering or re-reporting items that are already active, resolved, or superseded. 
+ +Required behavior: +- expose lifecycle state (`filed`, `acknowledged`, `in_progress`, `blocked`, `done`, `superseded`) +- attach last state-change timestamp +- allow a new report to declare whether it is a first filing, status update, or closure +- preserve lineage when one pinpoint supersedes or merges into another + +Acceptance: +- clawhip can tell `new gap` from `existing gap still active` without prose interpretation +- completed or superseded items stop reappearing as if they were fresh discoveries +- roadmap-driven follow-up loops become stateful instead of repeatedly stateless + +### 4.13. Multi-message report atomicity +A single dogfood/lane update should be representable as one structured report payload, even if the chat surface ends up rendering it across multiple messages. + +Required behavior: +- assign one report id for the whole update +- bind `active_sessions`, `exact_pinpoint`, `concrete_delta`, and `blocker` fields to that same report id +- expose message-part ordering when the chat transport splits the report +- allow downstream consumers to reconstruct one canonical update without scraping adjacent chat messages heuristically + +Acceptance: +- clawhip and other claws can parse one logical update even when Discord delivery fragments it into several posts +- partial/misordered message bursts do not scramble `pinpoint` vs `delta` vs `blocker` +- dogfood reports become machine-reliable summaries instead of fragile chat archaeology + +### 4.14. Cross-claw pinpoint dedupe / merge contract +When multiple claws file near-identical pinpoints from the same underlying failure, the roadmap surface should merge or relate them instead of letting duplicate follow-ups accumulate as separate discoveries. 
+ +Required behavior: +- compute or expose a similarity/dedupe key for newly filed pinpoints +- allow a new filing to link to an existing roadmap item as `same_root_cause`, `related`, or `supersedes` +- preserve reporter-specific evidence while collapsing the canonical tracked issue +- surface when a later filing is genuinely distinct despite similar wording + +Acceptance: +- two claws reporting the same gap do not automatically create two independent roadmap items +- roadmap growth reflects real new findings instead of duplicate observer churn +- downstream monitoring can see both the canonical item and the supporting duplicate evidence without losing auditability + +### 4.15. Pinpoint evidence attachment contract +Each filed pinpoint should carry structured supporting evidence so later implementers do not have to reconstruct why the gap was believed to exist. + +Required behavior: +- attach evidence references such as session ids, message ids, commits, logs, stack traces, or file paths +- label each attachment by evidence role (`repro`, `symptom`, `root_cause_hint`, `verification`) +- preserve bounded previews for human scanning while keeping a canonical reference for machines +- allow evidence to be added after filing without changing the pinpoint identity + +Acceptance: +- roadmap items stay actionable after chat scrollback or session context is gone +- implementation lanes can start from structured evidence instead of rediscovering the original failure +- prioritization can weigh pinpoints by evidence quality, not just prose confidence + +### 4.16. Pinpoint priority / severity contract +Each filed pinpoint should expose a machine-readable urgency/severity signal so claws can separate immediate execution blockers from lower-priority clawability hardening. 
+ +Required behavior: +- attach priority/severity fields (for example `p0`/`p1`/`p2` or `critical`/`high`/`medium`/`low`) +- distinguish user-facing breakage, operator-only friction, observability debt, and long-tail hardening +- allow priority to change as new evidence lands without changing the pinpoint identity +- surface why the priority was assigned (blast radius, reproducibility, automation breakage, merge risk) + +Acceptance: +- clawhip can rank fresh pinpoints without relying on prose urgency vibes +- implementation queues can pull true blockers ahead of reporting-only niceties +- roadmap dogfood stays focused on the most damaging clawability gaps first + +### 4.17. Pinpoint-to-implementation handoff contract +A filed pinpoint should be able to turn into an execution lane without a human re-translating the same context by hand. + +Required behavior: +- expose a structured handoff packet containing objective, suspected scope, evidence refs, priority, and suggested verification +- mark whether the pinpoint is `implementation_ready`, `needs_repro`, or `needs_triage` +- preserve the link between the roadmap item and any spawned execution lane/worktree/PR +- allow later execution results to update the original pinpoint state instead of forking separate unlinked narratives + +Acceptance: +- a claw can pick up a filed pinpoint and start implementation with minimal re-interpretation +- roadmap items stop being dead prose and become executable handoff units +- follow-up loops can see which pinpoints have already turned into real execution lanes + +### 4.18. Report backpressure / repetitive-summary collapse +Periodic dogfood reporting should avoid re-broadcasting the full known gap inventory every cycle when only a small delta changed. 
+ +Required behavior: +- distinguish `new since last report` from `still active but unchanged` +- emit compact delta-first summaries with an optional expandable full state +- track per-channel/reporting cursor so repeated unchanged items collapse automatically +- preserve one canonical full snapshot elsewhere for audit/debug without flooding the live channel + +Acceptance: +- new signal does not get buried under the same repeated backlog list every cycle +- claws and humans can scan the latest update for actual change instead of re-reading the whole inventory +- recurring dogfood loops become low-noise without losing auditability + +### 4.19. No-change / no-op acknowledgment contract +When a dogfood cycle produces no new pinpoint, no new delta, and no new blocker, claws should be able to acknowledge that cycle explicitly without pretending a fresh finding exists. + +Required behavior: +- expose a structured `no_change` / `noop` outcome for a reporting cycle +- bind that outcome to the triggering nudge/report id +- distinguish `checked and unchanged` from `not yet checked` +- preserve the last meaningful pinpoint/delta reference without re-filing it as new work + +Acceptance: +- recurring nudges do not force synthetic novelty when the real answer is `nothing changed` +- clawhip can tell `handled, no delta` apart from silence or missed handling +- dogfood loops become honest and low-noise when the system is stable + +### 4.20. Observation freshness / staleness-age contract +Every reported status, pinpoint, or blocker should carry an explicit observation timestamp/age so downstream claws can tell fresh state from stale carry-forward. 
+ +Required behavior: +- attach observed-at timestamp and derived age to active-session state, pinpoints, and blockers +- distinguish freshly observed facts from carried-forward prior-cycle state +- allow freshness TTLs so old observations degrade from `current` to `stale` automatically +- surface when a report contains mixed freshness windows across its fields + +Acceptance: +- claws do not mistake a 2-hour-old observation for current truth just because it reappeared in the latest report +- stale carried-forward state is visible and can be down-ranked or revalidated +- dogfood summaries remain trustworthy even when some fields are unchanged across many cycles + +### 4.21. Fact / hypothesis / confidence labeling +Dogfood reports should distinguish confirmed observations from inferred root-cause guesses so downstream claws do not treat speculation as settled truth. + +Required behavior: +- label each reported claim as `observed_fact`, `inference`, `hypothesis`, or `recommendation` +- attach a confidence score or confidence bucket to non-fact claims +- preserve which evidence supports each claim +- allow a later report to promote a hypothesis into confirmed fact without changing the underlying pinpoint identity + +Acceptance: +- claws can tell `we saw X happen` from `we think Y caused it` +- speculative root-cause text does not get mistaken for machine-trustworthy state +- dogfood summaries stay honest about uncertainty while remaining actionable + +### 4.22. Negative-evidence / searched-and-not-found contract +When a dogfood cycle reports that something was not found (no active sessions, no new delta, no repro, no blocker), the report should also say what was checked so absence is machine-meaningful rather than empty prose. + +Required behavior: +- attach the checked surfaces/sources for negative findings (sessions, logs, roadmap, state file, channel window, etc.) 
+- distinguish `not observed in checked scope` from `unknown / not checked` +- preserve the query/window used for the negative observation when relevant +- allow later reports to invalidate an earlier negative finding if the search scope was incomplete + +Acceptance: +- `no blocker` and `no new delta` become auditable conclusions rather than unverifiable vibes +- downstream claws can tell whether absence means `looked and clean` or `did not inspect` +- stable dogfood periods stay trustworthy without overclaiming certainty + +### 4.23. Field-level delta attribution +Even in delta-first reporting, claws still need to know exactly which structured fields changed between cycles instead of inferring change from prose. + +Required behavior: +- emit field-level change markers for core report fields (`active_sessions`, `pinpoint`, `delta`, `blocker`, lifecycle state, priority, freshness) +- distinguish `changed`, `unchanged`, `cleared`, and `carried_forward` +- preserve previous value references or hashes when useful for machine comparison +- allow one report to contain both changed and unchanged fields without losing per-field status + +Acceptance: +- downstream claws can tell precisely what changed this cycle without diffing entire message bodies +- delta-first summaries remain compact while still being machine-comparable +- recurring reports stop forcing text-level reparse just to answer `what actually changed?` + +### 4.24. Report schema versioning / compatibility contract +As structured dogfood reports evolve, the reporting surface needs explicit schema versioning so downstream claws can parse new fields safely without silent breakage. 
+ +Required behavior: +- attach schema version to each structured report payload +- define additive vs breaking field changes +- expose compatibility guidance for consumers that only understand older schemas +- preserve a minimal stable core so basic parsing survives partial upgrades + +Acceptance: +- downstream claws can reject, warn on, or gracefully degrade unknown schema versions instead of misparsing silently +- adding new reporting fields does not randomly break existing automation +- dogfood reporting can evolve quickly without losing machine trust + +### 4.25. Consumer capability negotiation for structured reports +Schema versioning alone is not enough if different claws consume different subsets of the reporting surface. The producer should know what the consumer can actually understand. + +Required behavior: +- let downstream consumers advertise supported schema versions and optional field families/capabilities +- allow producers to emit a reduced-compatible payload when a consumer cannot handle richer report fields +- surface when a report was downgraded for compatibility vs emitted in full fidelity +- preserve one canonical full-fidelity representation for audit/debug even when a downgraded view is delivered + +Acceptance: +- claws with older parsers can still consume useful reports without silent field loss being mistaken for absence +- richer report evolution does not force every consumer to upgrade in lockstep +- reporting remains machine-trustworthy across mixed-version claw fleets + +### 4.26. Self-describing report schema surface +Even with versioning and capability negotiation, downstream claws still need a machine-readable way to discover what fields and semantics a report version actually contains. 
+ +Required behavior: +- expose a machine-readable schema/field registry for structured report payloads +- document field meanings, enums, optionality, and deprecation status in a consumable format +- let consumers fetch the schema for a referenced report version/capability set +- preserve stable identifiers for fields so docs, code, and live payloads point at the same schema truth + +Acceptance: +- new consumers can integrate without reverse-engineering example payloads from chat logs +- schema drift becomes detectable against a declared source of truth +- structured report evolution stays fast without turning every integration into brittle archaeology + +### 4.27. Audience-specific report projection +The same canonical dogfood report should be projectable into different consumer views (clawhip, Jobdori, human operator) without each consumer re-summarizing the full payload from scratch. + +Required behavior: +- preserve one canonical structured report payload +- support consumer-specific projections/views (for example `delta_brief`, `ops_audit`, `human_readable`, `roadmap_sync`) +- let consumers declare preferred projection shape and verbosity +- make the projection lineage explicit so a terse view still points back to the canonical report + +Acceptance: +- Jobdori/clawhip/humans do not keep rebroadcasting the same full inventory in slightly different prose +- each consumer gets the right level of detail without inventing its own lossy summary layer +- reporting noise drops while the underlying truth stays shared and auditable + +### 4.28. Canonical report identity / content-hash anchor +Once multiple projections and summaries exist, the system needs a stable identity anchor proving they all came from the same underlying report state. 
+ +Required behavior: +- assign a canonical report id plus content hash/fingerprint to the full structured payload +- include projection-specific metadata without changing the canonical identity of unchanged underlying content +- surface when two projections differ because the source report changed vs because only the rendering changed +- allow downstream consumers to detect accidental duplicate sends of the exact same report payload + +Acceptance: +- claws can verify that different audience views refer to the same underlying report truth +- duplicate projections of identical content do not look like new state changes +- report lineage remains auditable even as the same canonical payload is rendered many ways + +### 4.29. Projection invalidation / stale-view cache contract +If the canonical report changes, previously emitted audience-specific projections must be identifiable as stale so downstream claws do not keep acting on an old rendered view. + +Required behavior: +- bind each projection to the canonical report id + content hash/version it was derived from +- mark projections as superseded when the underlying canonical payload changes +- expose whether a consumer is viewing the latest compatible projection or a stale cached one +- allow cheap regeneration of projections without minting fake new report identities + +Acceptance: +- claws do not mistake an old `delta_brief` view for current truth after the canonical report was updated +- projection caching reduces noise/compute without increasing stale-action risk +- audience-specific views stay safely linked to the freshness of the underlying report + +### 4.30. Projection-time redaction / sensitivity labeling +As canonical reports accumulate richer evidence, projections need an explicit policy for what can be shown to which audience without losing machine trust. 
+ +Required behavior: +- label report fields/evidence with sensitivity classes (for example `public`, `internal`, `operator_only`, `secret`) +- let projections redact, summarize, or hash sensitive fields according to audience policy while preserving the canonical report intact +- expose when a projection omitted or transformed data for sensitivity reasons +- preserve enough stable identity/provenance that redacted projections can still be correlated with the canonical report + +Acceptance: +- richer canonical reports do not force all audience views to leak the same detail level +- consumers can tell `field absent because redacted` from `field absent because nonexistent` +- audience-specific projections stay safe without turning into unverifiable black boxes + +### 4.31. Redaction provenance / policy traceability +When a projection redacts or transforms data, downstream consumers should be able to tell which policy/rule caused it rather than treating redaction as unexplained disappearance. + +Required behavior: +- attach redaction reason/policy id to transformed or omitted fields +- distinguish policy-based redaction from size truncation, compatibility downgrade, and source absence +- preserve auditable linkage from the projection back to the canonical field classification +- allow operators to review which projection policy version produced the visible output + +Acceptance: +- claws can tell *why* a field was hidden, not just that it vanished +- redacted projections remain operationally debuggable instead of opaque +- sensitivity controls stay auditable as reporting/projection policy evolves + +### 4.32. Deterministic projection / redaction reproducibility +Given the same canonical report, schema version, consumer capability set, and projection policy, the emitted projection should be reproducible byte-for-byte (or canonically equivalent) so audits and diffing do not drift on re-render. 
+ +Required behavior: +- make projection/redaction output deterministic for the same inputs +- surface which inputs participate in projection identity (schema version, capability set, policy version, canonical content hash) +- distinguish content changes from nondeterministic rendering noise +- allow canonical equivalence checks even when transport formatting differs + +Acceptance: +- re-rendering the same report for the same audience does not create fake deltas +- audit/debug workflows can reproduce why a prior projection looked the way it did +- projection pipelines stay machine-trustworthy under repeated regeneration + +### 4.33. Projection golden-fixture / regression lock +Once structured projections become deterministic, claw-code still needs regression fixtures that lock expected outputs so report rendering changes cannot slip in unnoticed. + +Required behavior: +- maintain canonical fixture inputs covering core report shapes, redaction classes, and capability downgrades +- snapshot or equivalence-test expected projections for supported audience views +- make intentional rendering/schema changes update fixtures explicitly rather than drifting silently +- surface which fixture set/version validated a projection pipeline change + +Acceptance: +- projection regressions get caught before downstream claws notice broken or drifting output +- deterministic rendering claims stay continuously verified, not assumed +- report/projection evolution remains fast without sacrificing machine-trustworthy stability + +### 4.34. Downstream consumer conformance test contract +Producer-side fixture coverage is not enough if real downstream claws still parse or interpret the reporting contract incorrectly. The ecosystem needs a way to verify consumer behavior against the declared report schema/projection rules. 
+ +Required behavior: +- define conformance cases for consumers across schema versions, capability downgrades, redaction states, and no-op cycles +- provide a machine-runnable consumer test kit or fixture bundle +- distinguish parse success from semantic correctness (for example: correctly handling `redacted` vs `missing`, `stale` vs `current`) +- surface which consumer/version last passed the conformance suite + +Acceptance: +- report-contract drift is caught at the producer/consumer boundary, not only inside the producer +- downstream claws can prove they understand the structured reporting surface they claim to support +- mixed claw fleets stay interoperable without relying on optimism or manual spot checks + +### 4.35. Provisional-status dedupe / in-flight acknowledgment suppression +When a claw emits temporary status such as `working on it`, `please wait`, or `adding a roadmap gap`, repeated provisional notices should not flood the channel unless something materially changed. + +Required behavior: +- fingerprint provisional/in-flight status updates separately from terminal or delta-bearing reports +- suppress repeated provisional messages with unchanged meaning inside a short reconciliation window +- allow a new provisional update through only when progress state, owner, blocker, or ETA meaningfully changes +- preserve raw repeats for audit/debug without exposing each one as a fresh channel event + +Acceptance: +- monitoring feeds do not churn on duplicate `please wait` / `working on it` messages +- consumers can tell the difference between `still in progress, unchanged` and `new actionable update` +- in-flight acknowledgments remain useful without drowning out real state transitions + +### 4.36. Provisional-status escalation timeout +If a provisional/in-flight status remains unchanged for too long, the system should stop treating it as harmless noise and promote it back into an actionable stale signal. 
+ +Required behavior: +- attach timeout/TTL policy to provisional states +- escalate prolonged unchanged provisional status into a typed stale/blocker signal +- distinguish `deduped because still fresh` from `deduped too long and now suspicious` +- surface which timeout policy triggered the escalation + +Acceptance: +- `working on it` does not suppress visibility forever when real progress stalled +- consumers can trust provisional dedupe without losing long-stuck work +- low-noise monitoring still resurfaces stale in-flight states at the right time + +### 4.37. Policy-blocked action handoff +When a requested action is disallowed by branch/merge/release policy (for example direct `main` push), the system should expose a structured refusal plus the next safe execution path instead of leaving only freeform prose. + +Required behavior: +- classify policy-blocked requests with a typed reason (`main_push_forbidden`, `release_requires_owner`, etc.) +- attach the governing policy source and actor scope when available +- emit a safe fallback path (`create branch`, `open PR`, `request owner approval`, etc.) +- allow downstream claws/operators to distinguish `blocked by policy` from `blocked by technical failure` + +Acceptance: +- policy refusals become machine-actionable instead of dead-end chat text +- claws can pivot directly to the safe alternative workflow without re-triaging the same request +- monitoring/reporting can separate governance blocks from actual product/runtime defects + +### 4.38. Policy exception / owner-approval token contract +For actions that are normally blocked by policy but can be allowed with explicit owner approval, the approval path should be machine-readable instead of relying on ambiguous prose interpretation. 
+ +Required behavior: +- represent policy exceptions as typed approval grants or tokens scoped to action/repo/branch/time window +- bind the approval to the approving actor identity and policy being overridden +- distinguish `no approval`, `approval pending`, `approval granted`, and `approval expired/revoked` +- let downstream claws verify an approval artifact before executing the otherwise-blocked action + +Acceptance: +- exceptional approvals stop depending on fuzzy chat interpretation +- claws can safely execute policy-exception flows without confusing them with ordinary blocked requests +- governance stays auditable even when owner-authorized exceptions occur + +### 4.39. Approval-token replay / one-time-use enforcement +If policy-exception approvals become machine-readable tokens, they also need replay protection so one explicit exception cannot be silently reused beyond its intended scope. + +Required behavior: +- support one-time-use or bounded-use approval grants where appropriate +- record token consumption against the exact action/repo/branch/commit scope it authorized +- reject replay, scope expansion, or post-expiry reuse with typed policy errors +- surface whether an approval was unused, consumed, partially consumed, expired, or revoked + +Acceptance: +- one owner-approved exception cannot quietly authorize repeated or broader dangerous actions +- claws can distinguish `valid approval present` from `approval already spent` +- governance exceptions remain auditable and non-replayable under automation + +### 4.40. Approval-token delegation / execution chain traceability +If one actor approves an exception and another claw/bot/session executes it, the system should preserve the delegation chain so policy exceptions remain attributable end-to-end. 
+ +Required behavior: +- record approver identity, requesting actor, executing actor, and any intermediate relay/orchestrator hop +- preserve the delegation chain on approval verification and token consumption events +- distinguish direct self-use from delegated execution +- surface when execution occurs through an unexpected or unauthorized delegate + +Acceptance: +- policy-exception execution stays attributable even across bot/session hops +- audits can answer `who approved`, `who requested`, and `who actually used it` +- delegated exception flows remain governable instead of collapsing into generic bot activity + +### 4.41. Token-optimization / repo-scope guidance contract +New users hit token burn and context bloat immediately, but the product surface does not clearly explain how repo scope, ignored paths, and working-directory choice affect clawability. + +Required behavior: +- explicitly document whether `.clawignore` / `.claudeignore` / `.gitignore` are honored, and how +- surface a simple recommendation to start from the smallest useful subdirectory instead of the whole monorepo when possible +- provide first-run guidance for excluding heavy/generated directories (`node_modules`, `dist`, `build`, `.next`, coverage, logs, dumps, generated reports) +- make token-saving repo-scope guidance visible in onboarding/help rather than buried in external chat advice + +Acceptance: +- new users can answer `how do I stop dragging junk into context?` from product docs/help alone +- first-run confusion about ignore files and repo scope drops sharply +- clawability improves before users burn tokens on obviously-avoidable junk + +### 4.42. Workspace-scope weight preview / token-risk preflight +Before a user starts a session in a repo, claw-code should surface a lightweight estimate of how heavy the current workspace is and why it may be costly. 
+ +Required behavior: +- inspect the current working tree for high-risk token sinks (huge directories, generated artifacts, vendored deps, logs, dumps) +- summarize likely context-bloat sources before deep indexing or first large prompt flow +- recommend safer scope choices (e.g. narrower subdirectory, ignore patterns, cleanup targets) +- distinguish `workspace looks clean` from `workspace is likely to burn tokens fast` + +Acceptance: +- users get an early warning before accidentally dogfooding the entire junkyard +- token-saving guidance becomes situational and concrete, not just generic docs +- onboarding catches avoidable repo-scope mistakes before they turn into cost/perf complaints + +### 4.43. Safer-scope quick-apply action +After warning that the current workspace is too heavy, claw-code should offer a direct way to adopt the safer scope instead of leaving the user to manually reinterpret the advice. + +Required behavior: +- turn scope recommendations into actionable choices (e.g. switch to subdirectory, generate ignore stub, exclude detected heavy paths) +- preview what would be included/excluded before applying the change +- preserve an easy path back to the original broader scope +- distinguish advisory suggestions from user-confirmed scope changes + +Acceptance: +- users can go from `this workspace is too heavy` to `use this safer scope` in one step +- token-risk preflight becomes operational guidance, not just warning text +- first-run users stop getting stuck between diagnosis and manual cleanup + ### 5. Failure taxonomy Normalize failure classes: - `prompt_delivery` @@ -148,6 +729,20 @@ Acceptance: - blockers are machine-classified - dashboards and retry policies can branch on failure type +### 5.5. Transport outage vs lane failure boundary +When the control server or transport goes down, claw-code should distinguish host-level outage from lane-local failure instead of letting all active lanes look broken in the same vague way. 
+ +Required behavior: +- emit typed transport outage events separate from lane failure events +- annotate impacted lanes with dependency status (`blocked_by_transport`) rather than rewriting them as ordinary lane errors +- preserve the last known good lane state before transport loss +- surface outage scope (`single session`, `single worker host`, `shared control server`) + +Acceptance: +- clawhip can say `server down blocked 3 lanes` instead of pretending 3 independent lane failures happened +- recovery policies can restart transport separately from lane-local recovery recipes +- postmortems can separate infra blast radius from actual code-lane defects + ### 6. Actionable summary compression Collapse noisy event streams into: - current phase @@ -159,6 +754,23 @@ Acceptance: - channel status updates stay short and machine-grounded - claws stop inferring state from raw build spam +### 6.5. Blocked-state subphase contract +When a lane is `blocked`, also expose the exact subphase where progress stopped, rather than forcing claws to infer from logs. + +Subphases should include at least: +- `blocked.trust_prompt` +- `blocked.prompt_delivery` +- `blocked.plugin_init` +- `blocked.mcp_handshake` +- `blocked.branch_freshness` +- `blocked.test_hang` +- `blocked.report_pending` + +Acceptance: +- `lane.blocked` carries a stable subphase enum + short human summary +- clawhip can say "blocked at MCP handshake" or "blocked waiting for trust clear" without pane scraping +- retries can target the correct recovery recipe instead of treating all blocked states the same + ## Phase 3 — Branch/Test Awareness and Auto-Recovery ### 7. Stale-branch detection before broad verification @@ -182,6 +794,22 @@ Acceptance: - one automatic recovery attempt occurs before escalation - the attempted recovery is itself emitted as structured event data +### 8.5. 
Recovery attempt ledger +Expose machine-readable recovery progress so claws can see what automatic recovery has already tried, what is still running, and why escalation happened. + +Ledger should include at least: +- recovery recipe id +- attempt count +- current recovery state (`queued`, `running`, `succeeded`, `failed`, `exhausted`) +- started/finished timestamps +- last failure summary +- escalation reason when retries stop + +Acceptance: +- clawhip can report `auto-recover tried prompt replay twice, then escalated` without log archaeology +- operators can distinguish `no recovery attempted` from `recovery already exhausted` +- repeated silent retry loops become visible and auditable + ### 9. Green-ness contract Workers should distinguish: - targeted tests green @@ -249,6 +877,21 @@ Acceptance: - claws can query status directly - human-facing views become a rendering layer, not the source of truth +### 12.5. Running-state liveness heartbeat +When a lane is marked `working` or otherwise in-progress, emit a lightweight liveness heartbeat so claws can tell quiet progress from silent stall. + +Heartbeat should include at least: +- current phase/subphase +- seconds since last meaningful progress +- seconds since last heartbeat +- current active step label +- whether background work is expected + +Acceptance: +- clawhip can distinguish `quiet but alive` from `working state went stale` +- stale detection stops depending on raw pane churn alone +- long-running compile/test/background steps stay machine-visible without log scraping + ## Phase 5 — Plugin and MCP Lifecycle Maturity ### 13. First-class plugin/MCP lifecycle contract @@ -428,6 +1071,8 @@ Model name prefix now wins unconditionally over env-var presence. Regression tes 32. 
**OpenAI-compatible provider/model-id passthrough is not fully literal** — **verified no-bug on 2026-04-09**: `resolve_model_alias()` only matches bare shorthand aliases (`opus`/`sonnet`/`haiku`) and passes everything else through unchanged, so `openai/gpt-4` reaches the dispatch layer unmodified. `strip_routing_prefix()` at `openai_compat.rs:732` then strips only recognised routing prefixes (`openai`, `xai`, `grok`, `qwen`) so the wire model is the bare backend id. No fix needed. **Original filing below.** +42. **Hook JSON failure opacity: invalid hook output does not surface the offending payload/context** — dogfooding on 2026-04-13 in the live `clawcode-human` lane repeatedly hit `PreToolUse/PostToolUse/Stop hook returned invalid ... JSON output` while the operator had no immediate visibility into which hook emitted malformed JSON, what raw stdout/stderr came back, or whether the failure was hook-formatting breakage vs prompt-misdelivery fallout. This turns a recoverable hook/schema bug into generic lane fog. **Impact.** Lanes look blocked/noisy, but the event surface is too lossy to classify whether the next action is fix the hook serializer, retry prompt delivery, or ignore a harmless hook-side warning. **Concrete delta landed now.** Recorded as an Immediate Backlog item so the failure is tracked explicitly instead of disappearing into channel scrollback. **Recommended fix shape:** when hook JSON parse fails, emit a typed hook failure event carrying hook phase/name, command/path, exit status, and a redacted raw stdout/stderr preview (bounded + safe), plus a machine class like `hook_invalid_json`. Add regression coverage for malformed-but-nonempty hook output so the surfaced error includes the preview instead of only `invalid ... JSON output`. + 32. 
**OpenAI-compatible provider/model-id passthrough is not fully literal** — dogfooded 2026-04-08 via live user in #claw-code who confirmed the exact backend model id works outside claw but fails through claw for an OpenAI-compatible endpoint. The gap: `openai/` prefix is correctly used for **transport selection** (pick the OpenAI-compat client) but the **wire model id** — the string placed in `"model": "..."` in the JSON request body — may not be the literal backend model string the user supplied. Two candidate failure modes: **(a)** `resolve_model_alias()` is called on the model string before it reaches the wire — alias expansion designed for Anthropic/known models corrupts a user-supplied backend-specific id; **(b)** the `openai/` routing prefix may not be stripped before `build_chat_completion_request` packages the body, so backends receive `openai/gpt-4` instead of `gpt-4`. **Fix shape:** cleanly separate transport selection from wire model id. Transport selection uses the prefix; wire model id is the user-supplied string minus only the routing prefix — no alias expansion, no prefix leakage. **Trace path for next session:** (1) find where `resolve_model_alias()` is called relative to the OpenAI-compat dispatch path; (2) inspect what `build_chat_completion_request` puts in `"model"` for an `openai/some-backend-id` input. **Source:** live user in #claw-code 2026-04-08, confirmed exact model id works outside claw, fails through claw for OpenAI-compat backend. 33. **OpenAI `/responses` endpoint rejects claw's tool schema: `object schema missing properties` / `invalid_function_parameters`** — **done at `e7e0fd2` on 2026-04-09**. Added `normalize_object_schema()` in `openai_compat.rs` which recursively walks JSON Schema trees and injects `"properties": {}` and `"additionalProperties": false` on every object-type node (without overwriting existing values). 
Called from `openai_tool_definition()` so both `/chat/completions` and `/responses` receive strict-validator-safe schemas. 3 unit tests added. All api tests pass. **Original filing below.** @@ -500,6 +1145,8 @@ Model name prefix now wins unconditionally over env-var presence. Regression tes 63. **Droid session completion semantics broken: code arrives after "status: completed"** — dogfooded 2026-04-12. Ultraclaw droid sessions (use-droid via acpx) report `session.status: completed` before file writes are fully flushed/synced to the working tree. Discovered +410 lines of "late-arriving" droid output that appeared after I had already assessed 8 sessions as "no code produced." This creates false-negative assessments and duplicate work. **Fix shape:** (a) droid agent should only report completion after explicit file-write confirmation (fsync or existence check); (b) or, claw-code should expose a `pending_writes` status that indicates "agent responded, disk flush pending"; (c) lane orchestrators should poll for file changes for N seconds after completion before final assessment. **Blocker:** none. Source: Jobdori ultraclaw dogfood 2026-04-12. +64. **ACP/Zed editor integration entrypoint is too implicit** — dogfooded 2026-04-13 from a user request for a `-acp` parameter to support ACP protocol integration in editor-first workflows such as Zed. The gap is not generic "please add another integration" churn; it is a **discoverability and launch-contract problem**. Right now the product surface does not make it obvious whether ACP is already supported, how an editor should invoke claw-code, or whether a dedicated flag/mode exists at all. That forces evaluators into repo archaeology instead of giving them a crisp editor-facing invocation contract. 
**Fix shape:** either (a) add an explicit ACP/editor entrypoint such as `--acp` / `acp serve` with help text that states the contract, or (b) if the protocol path already exists, surface it prominently in CLI help/README with a concrete Zed/editor integration example so users do not have to guess. **Acceptance bar:** an editor-first user can answer "how do I launch claw-code for ACP/Zed?" from `claw --help` or the first screen of docs without reading source. **Blocker:** none; currently recorded as a roadmap follow-up because the repo-local entrypoint was not obvious during dogfood. + 64. **Artifact provenance is post-hoc narration, not structured events** — **done (verified 2026-04-12):** completed lane persistence in `rust/crates/tools/src/lib.rs` now attaches structured `artifactProvenance` metadata to `lane.finished`, including `sourceLanes`, `roadmapIds`, `files`, `diffStat`, `verification`, and `commitSha`, while keeping the existing `lane.commit.created` provenance event intact. Regression coverage locks a successful completion payload that carries roadmap ids, file paths, diff stat, verification states, and commit sha without relying on prose re-parsing. **Original filing below.** 65. **Backlog-scanning team lanes emit opaque stops, not structured selection outcomes** — **done (verified 2026-04-12):** completed lane persistence in `rust/crates/tools/src/lib.rs` now recognizes backlog-scan selection summaries and records structured `selectionOutcome` metadata on `lane.finished`, including `chosenItems`, `skippedItems`, `action`, and optional `rationale`, while preserving existing non-selection and review-lane behavior. Regression coverage locks the structured backlog-scan payload alongside the earlier quality-floor and review-verdict paths. **Original filing below.**