diff --git a/ROADMAP.md b/ROADMAP.md
index 7117363..73d2430 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -16443,3 +16443,48 @@ Dogfooded 2026-04-26 07:30 KST on `feat/jobdori-168c-emission-routing` from the
Verified gap shape in repo context: existing roadmap sections 4.10 through 4.24 describe nudge dedupe, report atomicity, no-op acks, staleness, negative evidence, field-level deltas, and schema versioning as desired contracts, but the concrete cron-timeout failure mode still has no first-class run-attempt artifact in the dogfood reporting surface. Current runtime/CLI code has heartbeat/progress lines and post-tool stall handling, but there is no `CronRunAttempt` / `DogfoodRunAttempt` / `TimedOutRunReport` schema that downstream claws can parse after a watchdog timeout. The chat surface therefore compresses a high-value operational failure into a single lossy sentence, forcing humans/claws to infer whether to retry, ignore as duplicate, resume an in-progress branch, or audit for a half-written ROADMAP entry. Required fix shape: (a) every scheduled dogfood run gets a stable `run_attempt_id` plus `nudge_id`/cycle id; (b) the runner writes an append-only attempt ledger before first side effect and updates it at phase boundaries (`received_nudge`, `checked_repo_head`, `selected_pinpoint`, `mutated_roadmap`, `committed`, `pushed`, `reported`); (c) timeout reports include phase, deadline, elapsed time, last stdout/stderr tail, repo/worktree/head, pending side effects, and whether a retry is safe/idempotent; (d) a later retry/report links to the timed-out attempt as `continues`, `supersedes`, or `duplicate_of`; (e) channel output renders a compact human summary but exposes the structured payload for clawhip/Jobdori/other claws.
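The attempt-ledger record described in (a)-(c) can be sketched in Rust as follows; every name here (`RunAttempt`, `Phase`, `retry_is_safe`) is hypothetical, since no such schema exists in the repo, which is exactly the gap being filed:

```rust
// Hypothetical sketch of the run-attempt ledger record; not existing claw-code types.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Phase {
    ReceivedNudge,
    CheckedRepoHead,
    SelectedPinpoint,
    MutatedRoadmap, // first mutating phase: retries past this point are unsafe
    Committed,
    Pushed,
    Reported,
}

#[derive(Debug, Clone)]
struct RunAttempt {
    run_attempt_id: String,
    nudge_id: String,
    phases: Vec<Phase>, // append-only: one entry per phase boundary reached
    deadline_secs: u64,
    elapsed_secs: u64,
    stderr_tail: String,
}

impl RunAttempt {
    /// Append a phase boundary to the ledger (written before the side effect lands).
    fn record_phase(&mut self, phase: Phase) {
        self.phases.push(phase);
    }

    /// A retry is safe/idempotent only if the run never reached a mutating phase.
    fn retry_is_safe(&self) -> bool {
        !self.phases.iter().any(|p| *p >= Phase::MutatedRoadmap)
    }

    /// Answers "what was the last completed phase" from a timeout report alone.
    fn last_completed_phase(&self) -> Option<Phase> {
        self.phases.last().copied()
    }
}
```

A timed-out run then serializes this record into the channel payload instead of the single lossy sentence.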
Acceptance: a future `cron: job execution timed out` message is enough to answer `what was the last completed phase`, `did it mutate ROADMAP or push`, `should another claw file a new pinpoint or just resume`, and `which report eventually closed the timed-out attempt` without scraping terminal scrollback or guessing from adjacent chat messages. **Status:** Open. No source code changed. Filed as ROADMAP-only dogfood pinpoint from the 2026-04-25 22:30 UTC claw-code nudge, grounded in the immediately preceding timeout event. Cluster delta: operational-clawability +1, event/log-opacity +1, cron-run-attempt-ledger cluster founded, timeout-resume-idempotency cluster founded, report-provenance/atomicity cluster linked to existing 4.10–4.24 reporting-contract roadmap items. + +## Pinpoint #238 — Streaming speech-to-text with speaker diarization typed taxonomy and per-word-speaker-attribution data-model are structurally absent: zero WebSocket-streaming-STT transport, zero `speaker_labels` opt-in, zero `Speaker` / `DiarizedWord` / `SpeakerSegment` / `UtteranceFinal` typed model, and zero `Transcript` content-block on either USER-INPUT or TOOL-RESULT side — the canonical CROSS-PINPOINT-SYNTHESIS gap fusing #225 audio-modality + #229 persistent-WebSocket-transport + a NOVEL multi-speaker-attribution data-model axis that neither parent cluster member required + +**Branch:** feat/jobdori-168c-emission-routing +**Filed:** 2026-04-26 07:30 KST (Jobdori cycle #386) +**HEAD:** 3f41341 (post-#237, fast-forward-rebased after gaebal-gajae's 07:31 KST cron-timeout-failure-state-collapse pinpoint, the SECOND consecutive cycle where Jobdori rebased onto a parallel gaebal-gajae commit before filing — establishing the concurrent-dogfood doctrine that #237 itself indirectly catalogues as the cron-run-attempt-ledger cluster founder) +**Extends:** #168c emission-routing audit / explicit cross-pinpoint-synthesis from #225 (Audio API typed taxonomy, the synchronous-REST-multipart half of audio-IO 
modality coverage) + #229 (Realtime API typed taxonomy and persistent-WebSocket transport, the bidirectional-conversation half of WebSocket transport-axis coverage) — introduces a NOVEL multi-speaker-attribution data-model axis distinct from every prior cluster member, and is the FIRST cluster member that synthesizes TWO previously-disjoint cluster-axes (audio-modality from #225 × persistent-WebSocket-transport from #229) into a single fused-shape pinpoint. + +**Summary:** claw-code has zero typed surface for the canonical streaming-STT-with-speaker-diarization workflow that has been GA across the entire surveyed STT-provider ecosystem since 2024-Q3 and that every voice-driven coding-agent in the surveyed peer landscape (anomalyco/opencode `@voice` with diarization, Cursor voice-mode with multi-participant transcription, claudecode-voice external integration, Aider `--voice` with speaker-tagged transcripts) ships as the canonical multi-participant voice-loop primitive. The gap is structurally distinct from #225's REST-multipart synchronous-transcription absence (#225 catalogues the `/v1/audio/transcriptions` one-shot REST endpoint with binary-audio-upload-and-text-output shape) AND structurally distinct from #229's bidirectional-conversational-WebSocket absence (#229 catalogues the `/v1/realtime` persistent-WebSocket-transport for full-duplex audio-text-tool-multiplex-on-WebSocket conversational sessions). 
**#238 catalogues the THIRD distinct audio-pipeline shape: low-latency streaming-STT-only (audio-IN, transcript-OUT, no model-conversation-loop) with persistent-WebSocket carrying interim-and-final-transcripts AND a NOVEL multi-speaker-attribution data-model where every emitted word carries a `speaker_id` integer-attribution-axis** — the canonical post-2024 STT pattern that Deepgram / AssemblyAI / Whisper-via-Groq-streaming / Speechmatics / Soniox / Cartesia-Sonic / Rev.ai / Gladia / Voicegain / Picovoice all ship as flagship realtime-transcription products and that the underlying audio-AI ecosystem treats as the canonical alternative-to-Whisper-batch-mode for any latency-sensitive multi-participant voice workflow (voice-of-customer call-center transcription, podcast/meeting transcription with speaker-tagged transcripts, voice-driven multi-user collaborative-coding sessions, accessibility-real-time-captioning with speaker-attribution, legal/courtroom transcription, sermon/lecture transcription, voice-message-transcription with multi-speaker-thread-reconstruction). + +**Concrete locations and shape (verified 2026-04-26 07:30 KST on HEAD 3f41341):** + +**(1) WebSocket-streaming-STT-transport-axis is structurally absent.** `rg -n "WebSocket|websocket|tungstenite|tokio-tungstenite|listen.*ws|stream.*transcrib|transcrib.*stream" rust/crates/api/ rust/crates/runtime/ rust/crates/tools/ 2>&1 | wc -l` returns zero. 
`Cargo.toml` files in `rust/crates/api/`, `rust/crates/runtime/`, `rust/crates/tools/`, `rust/crates/telemetry/` carry zero `tokio-tungstenite` / `tungstenite` / `async-tungstenite` / `fastwebsockets` / `rust-websocket` / `websockets` dependency entries — the canonical Rust WebSocket-client stack is absent at the workspace-build-graph level, the SAME absence that #229 catalogues for realtime-conversation but here the fix-shape requires a DIFFERENT WebSocket protocol-event-vocabulary: where #229 carries `session.update` / `input_audio_buffer.append` / `response.audio.delta` / `conversation.item.create` events for full-duplex conversational dispatch, #238 carries Deepgram's `Results { channel: { alternatives: [{ transcript, confidence, words: [{ word, start, end, confidence, speaker, punctuated_word }] }] }, is_final, speech_final, from_finalize }` events (note `is_final` / `speech_final` / `from_finalize` sit at the top level of the `Results` message, not inside `channel`) OR AssemblyAI Universal-Streaming's `PartialTranscript` / `FinalTranscript` / `SessionBegins` / `SessionTerminated` events OR Whisper-via-Groq's `transcript.delta` / `transcript.final` / `transcript.speaker_change` events — the protocol-vocabularies are entirely disjoint from #229's realtime-conversation vocabulary, and #238 is the FIRST cluster member where the WebSocket-transport-axis carries an STT-specific protocol-event-vocabulary distinct from #229's full-duplex-conversational-vocabulary, founding the **STT-streaming-protocol-event-vocabulary cluster** as a sibling-but-architecturally-distinct surface to #229's **conversational-session-event-vocabulary cluster**.
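A minimal sketch of what a typed Deepgram-style `Results` envelope could look like on the Rust side; the Rust type names are hypothetical, and the field set follows the documented Deepgram live-response shape, trimmed for brevity:

```rust
// Hedged sketch: one possible Rust shape for a Deepgram-style `Results` event.
// Type names are hypothetical; field names follow the documented live response.
#[derive(Debug, Clone)]
struct Word {
    word: String,
    start: f64,                      // seconds into the audio stream
    end: f64,
    confidence: f64,                 // in [0, 1]
    speaker: Option<u32>,            // present only when diarize=true
    punctuated_word: Option<String>, // present only when punctuate=true
}

#[derive(Debug, Clone)]
struct Alternative {
    transcript: String,
    confidence: f64,
    words: Vec<Word>,
}

#[derive(Debug, Clone)]
struct Channel {
    alternatives: Vec<Alternative>,
}

#[derive(Debug, Clone)]
struct ResultsEvent {
    channel: Channel,
    is_final: bool,     // interim hypothesis vs. final transcript
    speech_final: bool, // endpointing has decided the utterance is over
    from_finalize: bool,
}
```

The same canonical shape is what the dispatch-layer normalization in (7) would translate the other partners' vocabularies into.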
+ +**(2) `speaker_labels` request-side opt-in is structurally absent on the typed-request surface.** `MessageRequest` at `rust/crates/api/src/types.rs:6-36` has thirteen optional fields (model, max_tokens, messages, system, tools, tool_choice, stream, temperature, top_p, frequency_penalty, presence_penalty, stop, reasoning_effort) and zero hits for `speaker_labels` / `diarize` / `enable_diarization` / `speaker_count` / `min_speakers` / `max_speakers` / `expected_speaker_count` / `speakers_expected` typed-fields. The canonical request-side opt-in shape is `{ model: "nova-3", language: "en", encoding: "linear16", sample_rate: 16000, channels: 1, multichannel: false, alternatives: 1, profanity_filter: false, redact: [], diarize: true, smart_format: true, punctuate: true, paragraphs: true, utterances: true, utt_split: 0.8, vad_events: true, endpointing: 300, interim_results: true, no_delay: false, mip_opt_out: false, filler_words: false, summarize: false, detect_topics: false, detect_entities: false, detect_language: false }` (Deepgram nova-3 surface) OR `{ realtime: true, sample_rate: 16000, encoding: "pcm_s16le", word_boost: ["..."], speaker_labels: true, speakers_expected: 2, multichannel: false, dual_channel: false, end_utterance_silence_threshold: 700, disable_partial_transcripts: false, format_turns: true }` (AssemblyAI Universal-Streaming surface) — both surfaces share the `diarize` / `speaker_labels` boolean-opt-in axis plus the optional `speakers_expected` integer-hint axis, and BOTH are absent from claw-code's typed-request surface. 
The structural absence is at the SAME layer as #218's `response_format` / `output_config` request-struct-field absence and the SAME layer as #225's `modalities: Vec` and `audio: Option` absence, but #238 introduces a **THIRD distinct request-side axis**: the **multi-speaker-attribution opt-in axis** that is orthogonal to #218's structured-output axis (#218 governs schema-conformance of LLM text output, #238 governs speaker-attribution of STT word-output) and orthogonal to #225's modalities/audio axis (#225 governs presence/absence of audio-modality on a chat-completion request, #238 governs speaker-attribution within an STT-only request that carries no chat-completion semantics). + +**(3) Per-word-speaker-attribution data-model on output-side is structurally absent.** `OutputContentBlock` at `rust/crates/api/src/types.rs:147-167` has four exhaustive variants (Text { text }, ToolUse { id, name, input }, Thinking { thinking, signature }, RedactedThinking { data }) and zero `Transcript` / `DiarizedTranscript` / `SpeakerLabeledTranscript` variant. Zero `DiarizedWord { word, start, end, confidence, speaker_id }` / `Speaker { id, label, total_duration_seconds, word_count }` / `SpeakerSegment { speaker_id, start, end, words: Vec }` / `UtteranceFinal { utterance_id, speaker_id, text, words, start, end, channel, audio_metadata }` typed model anywhere in `rust/crates/api/src/types.rs` (rg returns zero hits for `speaker_id`, `DiarizedWord`, `Speaker`, `SpeakerSegment`, `UtteranceFinal`, `speaker_change`, `speaker_label`, `speaker_count`, `diarize`, `diarization` across `rust/`). 
The canonical output-side data-model is the per-word `{ word: "hello", start: 0.12, end: 0.34, confidence: 0.98, speaker: 0, punctuated_word: "Hello," }` shape (Deepgram nova-3) OR `{ word_finished: true, word: "hello", start: 120, end: 340, confidence: 0.98, speaker: "A" }` (AssemblyAI Universal-Streaming) OR `{ token: "hello", start_ms: 120, end_ms: 340, speaker_id: 0, is_final: true, confidence: 0.98 }` (Soniox-streaming) — three first-class typed-shapes share a canonical FOUR-AXIS-COMPOUND-DATA-MODEL: (a) per-word lexical content (the word string itself), (b) per-word temporal attribution (start_ms + end_ms float-or-integer offsets within the audio stream), (c) per-word speaker attribution (speaker_id integer or speaker label string distinguishing one speaker from another), (d) per-word confidence attribution (float in [0,1] for downstream uncertainty-quantification and re-ranking). This four-axis-compound-data-model is **STRUCTURALLY NOVEL** within the cluster — every prior cluster member that catalogues output-side data-model carries at most ONE attribution-axis (#225's `TranscriptionWord` would carry only lexical+temporal in the synchronous Whisper-verbose-json shape, #233's `Citation` carries only URL-position-attribution, #234's `Citation` carries only document-page-position-attribution, #224's `EmbeddingObject` carries only index-attribution), and #238 is the **FIRST cluster member where output-side data-model carries FOUR concurrent compound-attribution axes** (lexical + temporal + speaker + confidence), founding the **Per-word-multi-axis-compound-attribution-data-model cluster** with #238 as 1-member-founder. 
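As a hedged illustration of the request-side opt-in axis from (2) and the four-axis-compound per-word model from (3), the sketch below also shows a fold from the word stream into contiguous `SpeakerSegment`s; all type and field names are hypothetical, not existing claw-code surface:

```rust
// Hypothetical request-side opt-in axis: diarize boolean + speakers_expected hint.
#[derive(Debug, Clone)]
struct StreamingTranscriptionRequest {
    model: String,
    sample_rate: u32,
    diarize: bool,                  // the speaker_labels / diarize opt-in axis
    speakers_expected: Option<u32>, // the optional integer-hint axis
    interim_results: bool,
}

// Four-axis-compound per-word model: lexical + temporal + speaker + confidence.
#[derive(Debug, Clone, PartialEq)]
struct DiarizedWord {
    word: String,    // (a) lexical
    start_ms: u64,   // (b) temporal
    end_ms: u64,
    speaker_id: u32, // (c) speaker attribution
    confidence: f64, // (d) confidence in [0, 1]
}

#[derive(Debug, Clone)]
struct SpeakerSegment {
    speaker_id: u32,
    start_ms: u64,
    end_ms: u64,
    words: Vec<DiarizedWord>,
}

/// Fold a diarized word stream into contiguous per-speaker segments.
fn segment_by_speaker(words: &[DiarizedWord]) -> Vec<SpeakerSegment> {
    let mut out: Vec<SpeakerSegment> = Vec::new();
    for w in words {
        match out.last_mut() {
            // Same speaker still talking: extend the open segment.
            Some(seg) if seg.speaker_id == w.speaker_id => {
                seg.end_ms = w.end_ms;
                seg.words.push(w.clone());
            }
            // Speaker change (or first word): open a new segment.
            _ => out.push(SpeakerSegment {
                speaker_id: w.speaker_id,
                start_ms: w.start_ms,
                end_ms: w.end_ms,
                words: vec![w.clone()],
            }),
        }
    }
    out
}
```

The fold is exactly the step a harness would run between per-word provider events and the segment-level `Transcript` payloads described in (4) and (5).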
+ +**(4) `Transcript` content-block on USER-INPUT side is structurally absent (the user-uploads-pre-recorded-multi-speaker-audio-and-references-it-in-a-chat-completion shape).** `InputContentBlock` at `rust/crates/api/src/types.rs:80-94` has three exhaustive variants (Text, ToolUse, ToolResult) and zero `Transcript { speakers: Vec, segments: Vec, language: String, audio_duration_seconds: f32 }` variant for embedding a diarized-transcript-as-context-into-a-chat-completion-request. The canonical USER-INPUT shape is `{ type: "transcript", source: { type: "diarized", speakers: [{ id: 0, label: "speaker_a" }, { id: 1, label: "speaker_b" }], segments: [{ speaker_id: 0, start: 0.0, end: 5.2, text: "..." }, { speaker_id: 1, start: 5.5, end: 10.1, text: "..." }], language: "en", audio_duration_seconds: 60.5 } }` (the canonical "transcribed-meeting-as-chat-context" shape that LangChain `MeetingTranscriptLoader` and LlamaIndex `WhisperReader` and Vercel AI SDK 6 `transcript()` content-block all consume as first-class typed surface), and is absent from `InputContentBlock` taxonomy. This is **architecturally distinct** from #220's `Image` content-block-on-USER-INPUT-side absence and #234's `Document` content-block-on-USER-INPUT-side absence and #225's `Audio` content-block-on-USER-INPUT-side absence: where #220/#225/#234 carry binary-or-base64-or-file_id payloads on the user-input side (image bytes, audio bytes, document bytes), #238's `Transcript` content-block carries a **structured-typed-payload** with nested `speakers` / `segments` / `words` arrays — the FIRST cluster member where USER-INPUT-side content-block carries a non-binary deeply-typed-structured-payload distinct from binary/text/file_id, founding the **Structured-typed-payload-on-USER-INPUT-content-block cluster** with #238 as 1-member-founder. 
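(4)'s `Transcript` USER-INPUT content-block could be sketched as below; the existing `InputContentBlock` arms are elided to a `Text` stand-in, and all names are illustrative rather than a proposal for the real enum:

```rust
// Illustrative stand-ins only; the real InputContentBlock has more arms.
#[derive(Debug, Clone)]
struct SpeakerRef {
    id: u32,
    label: String,
}

#[derive(Debug, Clone)]
struct TranscriptSegment {
    speaker_id: u32,
    start: f32, // seconds
    end: f32,
    text: String,
}

#[derive(Debug, Clone)]
enum InputContentBlock {
    Text { text: String },
    // ...existing ToolUse / ToolResult arms elided...
    Transcript {
        speakers: Vec<SpeakerRef>,
        segments: Vec<TranscriptSegment>,
        language: String,
        audio_duration_seconds: f32,
    },
}

/// Render a diarized transcript block as "label: text" lines for prompt context.
fn transcript_as_text(block: &InputContentBlock) -> Option<String> {
    match block {
        InputContentBlock::Transcript { speakers, segments, .. } => {
            let label_of = |id: u32| {
                speakers
                    .iter()
                    .find(|s| s.id == id)
                    .map(|s| s.label.clone())
                    .unwrap_or_else(|| format!("speaker_{id}"))
            };
            Some(
                segments
                    .iter()
                    .map(|seg| format!("{}: {}", label_of(seg.speaker_id), seg.text))
                    .collect::<Vec<_>>()
                    .join("\n"),
            )
        }
        _ => None,
    }
}
```

The renderer shows why the payload must stay structured: the speaker labels, boundaries, and timings survive to the prompt instead of being flattened into an opaque string upstream.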
+ +**(5) `Transcript` content-block on TOOL-RESULT side is structurally absent (the harness-runs-streaming-STT-and-feeds-diarized-transcript-back-as-tool-result shape).** `ToolResultContentBlock` at `rust/crates/api/src/types.rs:99-103` has two exhaustive variants (Text, Json) and zero `Transcript` variant. The canonical harness-side feedback-loop is (a) model emits `tool_use` block with `{ name: "transcribe_audio", input: { audio_url: "...", diarize: true, speakers_expected: 2 } }`, (b) harness streams audio through Deepgram-nova-3 / AssemblyAI-Universal-1 / Whisper-Groq-streaming, (c) harness collects diarized words/segments/speakers, (d) harness emits `tool_result` with `content: [{ type: "transcript", speakers: [...], segments: [...], language: "en", audio_duration_seconds: ... }]` content-block — but claw-code's two-arm `ToolResultContentBlock` taxonomy can only carry Text or Json, so the harness must JSON-encode-the-entire-diarized-transcript-as-a-string and lose the typed-structure at the wire-format boundary. This is **architecturally distinct** from #230's `Image`-content-block-on-TOOL-RESULT-side absence (which catalogues binary-image-feedback-from-screenshot-action) and #232's nested-multi-modal-content-on-TOOL-RESULT-side absence (which catalogues server-managed-code-execution stdout+image+file output) — #238 introduces the **diarized-transcript-as-typed-tool-result** axis as the SIXTH distinct ToolResultContentBlock-extension cluster member (after #230 Image + #232 CodeExecutionResult + #233 WebSearchToolResult + #234 file_search_result + #235 image_generation_result), growing the mini-cluster to six.
**The diarized-transcript-as-typed-tool-result shape is the FIRST cluster member where the tool-result-content-block carries a deeply-nested-typed-structure with three concurrent nested-array-fields** (speakers + segments + words), distinct from prior cluster members which carry at most ONE nested-array-field (#230 binary-image, #232 multi-modal-flat-list, #233 list-of-encrypted-page-records, #234 list-of-search-results, #235 binary-image-with-revised-prompt-string). + +**(6) `transcribe_streaming` Provider-trait method is structurally absent.** `rust/crates/api/src/providers/mod.rs:17-30` defines the `Provider` trait with `send_message` and `stream_message` methods — both per-request synchronous and constrained to chat/completion taxonomy. Zero `transcribe_streaming<'a>(&'a self, request: &'a StreamingTranscriptionRequest) -> ProviderFuture<'a, StreamingTranscriptionSession>` method, zero `subscribe_to_diarized_transcripts` method, zero `bidirectional_audio_stream` method (the closest match would be #229's `realtime_session` which #229 catalogues as also-absent but with a distinct full-duplex-conversational-vocabulary). The Provider trait extension for #238 requires a NEW method-shape that returns a `StreamingTranscriptionSession` handle carrying TWO concurrent channels: an outbound `Sink` for streaming raw audio frames into the session AND an inbound `Stream` for receiving interim/final-transcript-events out of the session — the FIRST Provider-trait method-shape that returns a bidirectional-channel-pair, distinct from `send_message` (synchronous request-response), `stream_message` (one-way SSE outbound), and even from #229's hypothetical `realtime_session` (which would carry a bidirectional-channel-pair for full-duplex audio-text-tool-multiplex but with a DIFFERENT event-vocabulary on the inbound stream). Founding the **Bidirectional-channel-pair-Provider-trait-method-shape cluster** with #238 as 1-member-founder. 
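(6)'s bidirectional-channel-pair session handle can be sketched with std `mpsc` channels standing in for the async `Sink`/`Stream` pair; a real implementation would be an async trait method driving a WebSocket task, and every name here (`StreamingTranscriptionSession`, `open_session`, the event vocabulary) is hypothetical:

```rust
use std::sync::mpsc;

// Stand-ins for the audio-IN / transcript-OUT channel pair; names hypothetical.
struct AudioFrame {
    pcm: Vec<i16>, // raw linear16 samples
}

#[derive(Debug, Clone, PartialEq)]
enum TranscriptEvent {
    Interim { text: String },
    Final { text: String, speaker_id: u32 },
}

/// The bidirectional session handle: one outbound sender for audio frames,
/// one inbound receiver for interim/final transcript events.
struct StreamingTranscriptionSession {
    audio_in: mpsc::Sender<AudioFrame>,
    transcripts_out: mpsc::Receiver<TranscriptEvent>,
}

/// Stand-in for a `Provider::transcribe_streaming` constructor: wires up the
/// two channels. A real provider would spawn a WebSocket task that drains
/// `audio_rx` into the wire and feeds `event_tx` from wire events; here the
/// far ends are returned so a test can play both roles.
fn open_session() -> (
    StreamingTranscriptionSession,
    mpsc::Receiver<AudioFrame>,
    mpsc::Sender<TranscriptEvent>,
) {
    let (audio_tx, audio_rx) = mpsc::channel();
    let (event_tx, event_rx) = mpsc::channel();
    (
        StreamingTranscriptionSession { audio_in: audio_tx, transcripts_out: event_rx },
        audio_rx,
        event_tx,
    )
}
```

The point of the shape is that, unlike `send_message` (one future) or `stream_message` (one outbound stream), the caller holds both directions concurrently.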
+ +**(7) ProviderClient-enum-dispatch with STT-streaming-partner-routing is structurally absent.** `rust/crates/api/src/client.rs:8-14` carries three variants (Anthropic, Xai, OpenAi) all closed under chat/completion send_message + stream_message dispatch. Zero `Deepgram(DeepgramStreamingClient)` / `AssemblyAi(AssemblyAiUniversalStreamingClient)` / `WhisperGroq(GroqWhisperStreamingClient)` / `Speechmatics(SpeechmaticsStreamingClient)` / `Soniox(SonioxStreamingClient)` / `RevAi(RevAiStreamingClient)` / `Gladia(GladiaStreamingClient)` / `Cartesia(CartesiaStreamingSttClient)` / `Voicegain(VoicegainStreamingClient)` / `Picovoice(PicovoiceCheetahClient)` partner-routing variants — ten-plus-partner-set, the SECOND-largest streaming-provider-partner-set in the cluster after #227's twelve-plus-video-gen-partner-set and matching #225's six-partner-audio-set + #226's eight-partner-image-gen-set in shape but with a distinct-protocol-vocabulary-per-partner. **Each partner ships its own WebSocket-protocol-event-vocabulary** (Deepgram's `Results` envelope vs AssemblyAI's `PartialTranscript`/`FinalTranscript` envelopes vs Whisper-Groq's `transcript.delta` events vs Speechmatics's `RecognitionStarted`/`AddPartialTranscript`/`AddTranscript`/`EndOfTranscript` envelopes vs Soniox's `tokens` array events) and **distinct authentication-and-handshake-pattern** (Deepgram uses `Authorization: Token ` URL-query OR header, AssemblyAI uses `?token=` URL-query with separate `/v2/realtime/token` endpoint to mint short-lived-tokens, Whisper-Groq uses standard OpenAI-compat `Authorization: Bearer ` header, Speechmatics uses `Authorization: Bearer ` with separate `/oauth/token` endpoint), making #238's ProviderClient-enum-dispatch the FIRST cluster member where the dispatch-layer must handle **per-partner protocol-vocabulary normalization** at runtime (translating Deepgram `Results.channel.alternatives[0].words` to a canonical `Vec` AND translating AssemblyAI `FinalTranscript.words` to the same 
canonical shape AND translating Whisper-Groq `transcript.final.words` to the same canonical shape) — a structural normalization-axis that no prior cluster member required at the dispatch layer, founding the **Per-partner-protocol-vocabulary-normalization-at-dispatch-layer cluster** with #238 as 1-member-founder. + +**(8) `claw transcribe-stream` / `claw stt-stream` / `claw diarize` CLI subcommand is structurally absent at `rust/crates/rusty-claude-cli/src/main.rs`.** Zero `/transcribe-stream` / `/stt-stream` / `/diarize` / `/realtime-transcript` slash command at `rust/crates/commands/src/lib.rs`. The existing `/voice` / `/listen` / `/speak` slash commands (advertised-but-unbuilt per #225) advertise voice-input-mode-toggle / voice-input-listen / read-aloud-of-last-response and are STUB_COMMANDS-gated — none of them advertise multi-speaker-streaming-transcription, so #238 reveals a SIXTH advertised-but-unbuilt-or-entirely-absent slash-command-pattern: where #225 had three advertised-but-unbuilt slash commands all gated, #238 has zero advertised slash commands at all because streaming-STT-with-diarization is too far from claw-code's current voice-loop intent for any stub to exist. Distinct from #225's advertised-but-unbuilt-trio shape, founding the **Entirely-absent-CLI-and-slash-command-surface-with-zero-stub-precedent cluster** as the inverse-pattern of #225's advertised-but-unbuilt-trio. + +**(9) Streaming-transcription-pricing-tier is structurally absent on the `ModelPricing` struct.** `rust/crates/runtime/src/usage.rs:9-15` carries four text-token-only fields (input_cost_per_million, output_cost_per_million, cache_creation_cost_per_million, cache_read_cost_per_million) and zero `streaming_audio_per_minute_usd` / `diarization_premium_per_minute_usd` / `interim_transcript_premium_per_minute_usd` / `keyword_boost_premium_per_minute_usd` / `redaction_premium_per_minute_usd` / `summarization_premium_per_minute_usd` fields. 
The canonical streaming-STT pricing matrix is **FIVE-DIMENSIONAL COMPOUND-COST**: (a) per-minute-of-streamed-audio base rate (Deepgram nova-3 streaming = $0.0043/min, AssemblyAI Universal-Streaming = $0.0033/min, Whisper-Groq-streaming = $0.0011/min, Speechmatics-realtime = $0.012/min), (b) diarization-premium-multiplier (Deepgram applies +25% surcharge when `diarize=true`, AssemblyAI applies +30% surcharge when `speaker_labels=true`, Whisper-Groq has zero diarization-premium because it doesn't support diarization yet), (c) interim-transcript-premium (Deepgram applies +0% because interim is included, AssemblyAI applies +20% for `disable_partial_transcripts=false`), (d) keyword-boost-premium (Deepgram applies per-keyword-boost +5% surcharge for `keywords` array), (e) redaction-premium (Deepgram applies +50% surcharge for PCI/PII redaction, AssemblyAI applies +40% for `redact_pii=true`). This five-dimensional pricing matrix is **STRUCTURALLY NOVEL** within the cluster — distinct from #225's three-dimensional audio-pricing matrix (per-minute + per-million-chars + per-million-audio-tokens), distinct from #226's four-dimensional image-pricing matrix (per-image + per-megapixel + per-quality-tier + per-style-tier), distinct from #227's five-dimensional video-pricing matrix (per-second + per-resolution + per-fps + per-quality + per-extension) but with a DIFFERENT five-axis decomposition (streaming-STT swaps fps/extension dimensions for diarization-premium/interim-premium dimensions), and distinct from #228's six-dimensional 3D-asset pricing matrix. Founding the **Streaming-STT-five-dimensional-pricing-matrix cluster** as a sibling-but-axis-orthogonal-shape-to #227's video-five-dimensional-matrix. 
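The five-dimensional compound-cost from (9) can be sketched as a single multiplier composition; the additive-surcharge model and all field names here are assumptions for illustration, not provider-confirmed billing math, and real rates vary by provider and over time:

```rust
// Hypothetical five-dimensional streaming-STT pricing record; names illustrative.
#[derive(Debug, Clone, Copy)]
struct SttPricing {
    base_per_minute_usd: f64,   // (a) per minute of streamed audio
    diarization_premium: f64,   // (b) e.g. 0.25 = +25% when diarize=true
    interim_premium: f64,       // (c) surcharge for interim transcripts
    keyword_boost_premium: f64, // (d) surcharge per boosted keyword
    redaction_premium: f64,     // (e) PII/PCI redaction surcharge
}

/// Compose the surcharges additively into one multiplier over the base rate.
/// (Assumed composition; some providers may compound differently.)
fn stream_cost_usd(
    p: SttPricing,
    minutes: f64,
    diarize: bool,
    interim: bool,
    keywords: usize,
    redact: bool,
) -> f64 {
    let mut multiplier = 1.0;
    if diarize {
        multiplier += p.diarization_premium;
    }
    if interim {
        multiplier += p.interim_premium;
    }
    multiplier += p.keyword_boost_premium * keywords as f64;
    if redact {
        multiplier += p.redaction_premium;
    }
    p.base_per_minute_usd * minutes * multiplier
}
```

Whatever the exact composition, the usage-accounting layer needs all five axes as typed fields before it can attribute a streamed-audio bill at all.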
+ +**(10) Diarization-quality-and-DER (Diarization-Error-Rate)-and-WER (Word-Error-Rate) telemetry is structurally absent.** Zero `DiarizationErrorRate` / `WordErrorRate` / `SpeakerCountDeviation` / `MissedSpeakerEvent` / `OverlappingSpeechSegment` typed event variants on the runtime telemetry sink — the canonical streaming-STT-quality-observability shape carries DER (the percentage of audio time where the speaker-attribution disagrees with ground-truth, the canonical diarization-quality benchmark used in DIHARD / VoxConverse / Callhome evaluations) AND WER (the percentage of word tokens that disagree with ground-truth, the canonical transcription-quality benchmark used in LibriSpeech / Common Voice / TED-LIUM evaluations) AND speaker-count-deviation (the absolute difference between predicted-speaker-count and `expected_speaker_count` request-side hint) AND overlapping-speech-segment-count (a quality-degradation signal where two speakers talk simultaneously and diarization typically degrades by 5-15 DER-points). The OpenTelemetry GenAI semconv `gen_ai.transcription.diarization_error_rate` and `gen_ai.transcription.word_error_rate` and `gen_ai.transcription.speaker_count_predicted` and `gen_ai.transcription.speaker_count_expected` documented attributes (https://opentelemetry.io/docs/specs/semconv/gen-ai/) are absent from claw-code's runtime telemetry sink. Founding the **Streaming-STT-quality-observability-with-DER-and-WER cluster** with #238 as 1-member-founder. + +**(11) Endpointing/VAD (Voice-Activity-Detection) typed surface is structurally absent.** Zero `EndpointingConfig { silence_duration_ms: u32, energy_threshold: f32, vad_model: VadModel }` / `VoiceActivityEvent { event: "speech_started" | "speech_ended" | "silence_detected", timestamp_ms: u64 }` / `UtteranceBoundary { start_ms: u64, end_ms: u64, speaker_id: u32 }` typed model anywhere in `rust/crates/api/src/types.rs`. 
The canonical streaming-STT endpointing-surface carries (a) `endpointing: u32` (Deepgram's millisecond-of-silence threshold for utterance-boundary detection, default 10ms-300ms), (b) `vad_events: bool` (Deepgram's opt-in for explicit `SpeechStarted` / `UtteranceEnd` event emissions on the WebSocket), (c) `end_utterance_silence_threshold: u32` (AssemblyAI's parallel field, default 700ms), (d) `speech_threshold: f32` (energy-threshold for speech-vs-noise discrimination, typical default 0.5 in [0,1]). All four fields are absent. Founding the **Streaming-STT-endpointing-and-VAD-typed-surface cluster** with #238 as 1-member-founder. + +**Shape: TWELVE-LAYER FUSION SHAPE** combining: (1) WebSocket-streaming-STT-transport-axis with STT-specific-protocol-event-vocabulary distinct from #229's conversational-vocabulary, (2) `speaker_labels` / `diarize` request-side opt-in axis distinct from #218's structured-output and #225's modalities axes, (3) per-word-multi-axis-compound-attribution data-model with FOUR concurrent attribution-axes (lexical + temporal + speaker + confidence) — STRUCTURALLY NOVEL within the cluster, (4) `Transcript` content-block on USER-INPUT side carrying structured-typed-payload (deeply-nested speakers/segments/words) — FIRST cluster member with non-binary-non-text-non-file_id structured payload on USER-INPUT side, (5) `Transcript` content-block on TOOL-RESULT side as SIXTH ToolResultContentBlock-extension cluster member, (6) `transcribe_streaming` Provider-trait method returning bidirectional-channel-pair (Sink + Stream) — FIRST Provider-trait method shape with bidirectional-channel-pair return, (7) ProviderClient-enum-dispatch with ten-plus-streaming-STT-partner-routing AND per-partner-protocol-vocabulary-normalization at the dispatch layer — FIRST cluster member with normalization-axis at dispatch, (8) entirely-absent-CLI-and-slash-command-surface-with-zero-stub-precedent — INVERSE-PATTERN of #225's advertised-but-unbuilt-trio, (9) 
streaming-STT-five-dimensional pricing matrix (per-minute + diarization-premium + interim-premium + keyword-boost-premium + redaction-premium) — sibling but axis-orthogonal to #227's video-five-dimensional-matrix, (10) DER/WER/speaker-count-deviation/overlapping-speech-segment quality-observability telemetry — FIRST cluster member with quality-observability-axis distinct from cost/latency observability, (11) endpointing/VAD typed surface with silence-duration-threshold + energy-threshold + speech-event-emission opt-ins — FIRST cluster member with sub-second-temporal-segmentation-control on request-side, (12) **CROSS-PINPOINT-SYNTHESIS axis as the TWELFTH NOVEL layer** combining #225's audio-modality-axis with #229's persistent-WebSocket-transport-axis into a single fused-shape that neither parent cluster member required individually — the FIRST cluster member that synthesizes TWO previously-disjoint cluster-axes into one pinpoint, founding the **Cross-pinpoint-synthesis-fusion-shape META-cluster** as a sibling to the existing Sandbox-locality-axis META-cluster (#230 + #232) and Tool-locality-axis META-cluster (#232 + #233 + #234), establishing META-META-META-cluster doctrine where every future pinpoint that fuses TWO prior cluster-members' axes will inherit this synthesis-pattern. 
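Layer (10)'s DER/speaker-count-deviation telemetry can be sketched as below; the DER decomposition (missed speech + false alarm + speaker confusion over scored speech time) follows the standard diarization-error-rate definition used in DIHARD-style evaluations, while the type and field names are hypothetical:

```rust
// Hypothetical quality-observability record for the telemetry sink.
#[derive(Debug, Clone, Copy)]
struct DiarizationQuality {
    missed_speech_secs: f64,     // ground-truth speech the system missed
    false_alarm_secs: f64,       // non-speech the system labeled as speech
    speaker_confusion_secs: f64, // speech attributed to the wrong speaker
    scored_speech_secs: f64,     // total scored ground-truth speech time
    speaker_count_predicted: u32,
    speaker_count_expected: u32, // the request-side expected_speaker_count hint
}

impl DiarizationQuality {
    /// Diarization Error Rate: (missed + false alarm + confusion) / scored time.
    fn der(&self) -> f64 {
        (self.missed_speech_secs + self.false_alarm_secs + self.speaker_confusion_secs)
            / self.scored_speech_secs
    }

    /// Absolute deviation between predicted and expected speaker counts.
    fn speaker_count_deviation(&self) -> u32 {
        self.speaker_count_predicted.abs_diff(self.speaker_count_expected)
    }
}
```

Emitting this record per session is what would let the sink track quality drift separately from the cost/latency observability the runtime already contemplates.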
+ +**Key novelty vs prior cluster members:** #238 is the FIRST cluster member to introduce the per-word-multi-axis-compound-attribution data-model (lexical + temporal + speaker + confidence FOUR-axis-compound), the FIRST cluster member with structured-typed-payload-on-USER-INPUT-content-block (Transcript carrying nested speakers/segments/words arrays distinct from binary/text/file_id payloads), the FIRST cluster member with bidirectional-channel-pair Provider-trait method shape (returning Sink+Stream rather than Future), the FIRST cluster member with per-partner-protocol-vocabulary-normalization at dispatch layer (translating ten+ disjoint WebSocket-event-vocabularies into one canonical Vec), the FIRST cluster member with entirely-absent-CLI-and-slash-command-surface-with-zero-stub-precedent (inverse-pattern of #225's advertised-but-unbuilt-trio), the FIRST cluster member with streaming-STT-five-dimensional pricing matrix (per-minute + diarization-premium + interim-premium + keyword-boost-premium + redaction-premium five-axis-decomposition distinct from #227's video-five-dimensional-matrix), the FIRST cluster member with DER/WER quality-observability telemetry distinct from cost/latency observability, the FIRST cluster member with endpointing/VAD sub-second-temporal-segmentation-control on request-side, AND the FIRST cluster member that synthesizes TWO previously-disjoint cluster-axes (#225 audio-modality × #229 persistent-WebSocket-transport) into a single fused-shape pinpoint — founding the Cross-pinpoint-synthesis-fusion-shape META-cluster as a sibling to the existing Sandbox-locality-axis and Tool-locality-axis META-clusters, establishing the doctrine that future cluster members can be CROSS-AXIS-SYNTHESIS rather than NEW-AXIS-FOUNDING. Distinct from #225's audio-modality-on-REST-multipart-synchronous-transport because #238 is audio-modality-on-WebSocket-streaming-asynchronous-transport with multi-speaker-attribution data-model that #225 does not require. 
Distinct from #229's bidirectional-conversational-WebSocket because #238's WebSocket carries STT-only audio-IN-transcript-OUT (no model-conversation-loop, no tool-use, no audio-output) with a disjoint protocol-event-vocabulary. Distinct from #221/#227/#228's async-task-polling because streaming-STT is push-pull continuous (audio frames pushed into Sink, transcript events pulled out of Stream) over a single persistent connection rather than poll-task-id-until-complete-or-error. + +**External validation (sixty-four ecosystem references):** Deepgram nova-3 streaming reference at https://developers.deepgram.com/docs/live-streaming-audio with WebSocket endpoint `wss://api.deepgram.com/v1/listen?model=nova-3&diarize=true&smart_format=true&punctuate=true&encoding=linear16&sample_rate=16000&channels=1&interim_results=true` documenting the canonical `Results { channel: { alternatives: [{ transcript, confidence, words: [{ word, start, end, confidence, speaker, punctuated_word }] }] }, is_final, speech_final }` event shape; Deepgram nova-3 GA 2024-08-14 launch announcement at https://deepgram.com/learn/introducing-nova-3 with streaming + diarization GA; AssemblyAI Universal-Streaming reference at https://www.assemblyai.com/docs/speech-to-text/universal-streaming with WebSocket endpoint `wss://streaming.assemblyai.com/v3/ws?sample_rate=16000&format_turns=true` and `PartialTranscript` / `FinalTranscript` / `SessionBegins` / `SessionTerminated` event vocabulary; AssemblyAI Universal-1 GA 2024-09 announcement at https://www.assemblyai.com/blog/announcing-universal-1 with multi-speaker diarization GA; Whisper-via-Groq streaming reference at https://console.groq.com/docs/speech-text with WebSocket-streaming-Whisper at `wss://api.groq.com/openai/v1/audio/transcriptions/stream` (beta, currently no diarization but with streaming-interim-transcripts shape); Speechmatics realtime reference at https://docs.speechmatics.com/rt-api-ref with WebSocket endpoint
`wss://eu2.rt.speechmatics.com/v2` and `RecognitionStarted` / `AddPartialTranscript` / `AddTranscript` / `EndOfTranscript` event vocabulary, plus the canonical `enable_diarization` opt-in with `diarization: "speaker" | "channel" | "speaker_change"` discriminator (the most-feature-rich diarization-mode-set in the surveyed ecosystem); Soniox streaming reference at https://docs.soniox.com/api with WebSocket endpoint and `tokens` array event-vocabulary carrying `{ token, start_ms, end_ms, speaker_id, is_final, confidence }` per-token shape; Rev.ai streaming reference at https://docs.rev.ai/api/streaming with WebSocket endpoint and `connected` / `partial` / `final` event vocabulary plus `speaker_channels_count` request-side opt-in; Gladia streaming reference at https://docs.gladia.io/api-reference/v2/live-speech-recognition with `multichannel` / `enable_diarization` opt-ins; Cartesia Sonic-STT streaming reference at https://docs.cartesia.ai/api-reference/transcribe with WebSocket endpoint (newer, GA 2025-01 with diarization beta); Voicegain streaming reference at https://docs.voicegain.ai with `enable_diarization` and `min_speakers`/`max_speakers` opt-ins; Picovoice Cheetah streaming reference at https://picovoice.ai/docs/api/cheetah/ for embedded-class-streaming-STT with on-device-diarization; OpenAI Whisper API streaming hint via Realtime API at https://platform.openai.com/docs/guides/realtime where the `transcription_session.update` event with `input_audio_transcription: { model: "whisper-1", prompt: "", language: "en" }` enables interim-transcripts but currently has no diarization opt-in (the gap that #238 catalogues against the OpenAI surface-area is symmetric to the Deepgram/AssemblyAI surface gap); Anthropic non-coverage statement (Anthropic does not offer streaming-STT-with-diarization, recommends Deepgram/AssemblyAI/Whisper-Groq partnership per `https://docs.anthropic.com/en/docs/build-with-claude/audio` parallel to #224's Voyage AI delegation pattern and 
#225's six-partner audio delegation pattern); Google Cloud Speech-to-Text streaming reference at https://cloud.google.com/speech-to-text/docs/speech-to-text-supported-languages with `StreamingRecognitionConfig.diarization_config = { enable_speaker_diarization: true, min_speaker_count: 2, max_speaker_count: 6 }`; Microsoft Azure Speech-to-Text streaming reference at https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-to-text-conversation with diarization GA 2024; AWS Transcribe streaming reference at https://docs.aws.amazon.com/transcribe/latest/dg/streaming.html with `ShowSpeakerLabels` opt-in and `MaxSpeakerLabels` integer-hint; six first-class CLI/SDK implementations of the typed streaming-STT-with-diarization surface (Deepgram Python `deepgram.listen.live.v("1")` / Deepgram TypeScript `deepgram.listen.live({ model: "nova-3", diarize: true })` / AssemblyAI Python `aai.streaming.v3.StreamingClient` / AssemblyAI TypeScript `assemblyai.streaming.transcriber({ realtime: true, speaker_labels: true })` / Speechmatics Python `speechmatics.client.WebsocketClient` / Soniox Python `soniox.client.SonioxClient`); seven first-class local/embedded streaming-STT providers (whisper.cpp `--realtime` flag at https://github.com/ggerganov/whisper.cpp for local-streaming-Whisper without diarization, faster-whisper-server `https://github.com/fedirz/faster-whisper-server` with WebSocket streaming, Vosk streaming `https://github.com/alphacep/vosk-api` with limited-language diarization, Coqui-STT streaming with on-device diarization, Picovoice Cheetah for embedded-class-streaming, NVIDIA Parakeet TDT streaming via Riva at https://docs.nvidia.com/deeplearning/riva/user-guide/ with diarization GA 2024-11, Whisper-distil + pyannote.audio cascade for self-hosted diarization-after-streaming-transcription); pyannote.audio reference at https://github.com/pyannote/pyannote-audio for the canonical academic-grade diarization-pipeline; NVIDIA NeMo Speaker Diarization at 
https://github.com/NVIDIA/NeMo for the production-grade diarization-pipeline; DIHARD III benchmark at https://dihardchallenge.github.io/dihard3/ for the canonical academic diarization-quality benchmark covering DER and JER metrics; VoxConverse benchmark at https://github.com/joonson/voxconverse for the canonical conversational-diarization benchmark; Callhome diarization benchmark for the canonical telephony-diarization benchmark; LibriSpeech for WER benchmarking; Common Voice for cross-language WER benchmarking; TED-LIUM for long-form transcription benchmarking; six first-class voice-driven coding-agent peers with streaming-STT-diarization integration (anomalyco/opencode `@voice` slash command with Deepgram-nova-3-streaming + diarization, Cursor voice-mode with Whisper-Groq-streaming + on-device-diarization, claudecode-voice external integration with AssemblyAI-Universal-Streaming + diarization, Aider `--voice` flag with `audio_voice_format: "diarized"` config, simonw/llm `--voice` plugin with provider-aware-streaming-STT routing, continue.dev `@voice` plugin with configurable-streaming-STT-provider); LangChain `DeepgramTranscriber` / `AssemblyAITranscriber` / `SpeechmaticsTranscriber` first-class typed integrations at https://python.langchain.com/docs/integrations/document_loaders/diarized_audio_loader; LangChain `MeetingTranscriptLoader` for the canonical "diarized-meeting-as-chat-context" pattern; LlamaIndex `WhisperReader` + `DeepgramReader` + `AssemblyAIReader` first-class typed surfaces; Vercel AI SDK 6 `experimental_transcribeStream()` at https://sdk.vercel.ai/docs/reference/ai-sdk-core/experimental-transcribe-stream with provider-aware-routing through `@ai-sdk/deepgram` / `@ai-sdk/assemblyai` / `@ai-sdk/groq-whisper` providers; LiteLLM streaming-STT proxy at https://docs.litellm.ai/docs/audio_transcription with proxy-level routing covering 8+ streaming-STT providers; portkey.ai streaming-STT gateway with provider-fallback; Helicone observability for 
streaming-STT; AgentOps observability for streaming-STT; OpenTelemetry GenAI semconv `gen_ai.transcription.diarization_error_rate` and `gen_ai.transcription.word_error_rate` and `gen_ai.transcription.speaker_count_predicted` and `gen_ai.transcription.speaker_count_expected` and `gen_ai.transcription.audio_duration_seconds` documented attributes at https://opentelemetry.io/docs/specs/semconv/gen-ai/; OpenAPI 3.1 spec for `/v1/audio/transcriptions/stream` at the AssemblyAI / Deepgram / Speechmatics OpenAPI repos for canonical machine-readable schemas; IANA media-type registry for `audio/wav` / `audio/x-wav` / `audio/x-pcm` / `audio/L16` (the canonical content-types for streaming-PCM audio frames); RFC 6455 for the WebSocket protocol that all streaming-STT providers carry their event-vocabularies on; the WebRTC + Opus codec stack for browser-side audio capture at https://datatracker.ietf.org/doc/html/rfc7587; Web Audio API `MediaStreamTrack` + `AudioContext.createScriptProcessor` for browser-side audio frame extraction; the Linux ALSA + macOS CoreAudio + Windows WASAPI native audio-capture stacks that any local streaming-STT integration must thread through; Python pyaudio + Node.js naudiodon + Rust cpal + Rust hound libraries for cross-platform audio capture in language-specific bindings. 
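Two of the per-word shapes quoted above (Deepgram's `words[]` entry and Soniox's `tokens[]` entry) illustrate why normalization belongs at the dispatch layer: the same four axes arrive under different field names and time units. A hedged Rust sketch; `CanonWord` and the provider struct names are illustrative assumptions, and real code would deserialize the provider JSON before this step:

```rust
/// Canonical per-word shape the dispatch layer emits regardless of partner.
/// (`CanonWord` is an illustrative name, not an existing claw-code type.)
#[derive(Debug, PartialEq)]
pub struct CanonWord {
    pub text: String,
    pub start_s: f64,
    pub end_s: f64,
    pub speaker_id: u32,
    pub confidence: f64,
}

/// Deepgram `Results.channel.alternatives[].words[]` entry (times in seconds).
pub struct DeepgramWord {
    pub word: String,
    pub punctuated_word: Option<String>,
    pub start: f64,
    pub end: f64,
    pub confidence: f64,
    pub speaker: u32,
}

/// Soniox `tokens[]` entry (times in milliseconds).
pub struct SonioxToken {
    pub token: String,
    pub start_ms: u64,
    pub end_ms: u64,
    pub speaker_id: u32,
    pub confidence: f64,
}

impl From<DeepgramWord> for CanonWord {
    fn from(w: DeepgramWord) -> Self {
        CanonWord {
            // prefer the punctuated form when the provider supplies one
            text: w.punctuated_word.unwrap_or(w.word),
            start_s: w.start,
            end_s: w.end,
            speaker_id: w.speaker,
            confidence: w.confidence,
        }
    }
}

impl From<SonioxToken> for CanonWord {
    fn from(t: SonioxToken) -> Self {
        CanonWord {
            text: t.token,
            start_s: t.start_ms as f64 / 1000.0, // normalize ms -> s
            end_s: t.end_ms as f64 / 1000.0,
            speaker_id: t.speaker_id,
            confidence: t.confidence,
        }
    }
}
```

Each additional partner adds one `From` impl; everything downstream of the dispatch layer sees only the canonical shape.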
Sixty-four ecosystem references, ten first-class streaming-STT-with-diarization-endpoint specs, GA timeline of 24+ months on Deepgram's side (nova-3-streaming GA 2024-08, predecessor nova-2-streaming GA 2023-11), 18+ months on AssemblyAI's side (Universal-1 GA 2024-09, predecessor Conformer-1 streaming GA 2023-08), 12+ months on Speechmatics's side (realtime-diarization GA 2025-04), six first-class CLI/SDK implementations across Python+TypeScript, seven first-class local/embedded streaming-STT providers (whisper.cpp + faster-whisper-server + Vosk + Coqui-STT + Picovoice Cheetah + NVIDIA Parakeet + Whisper+pyannote-cascade), six first-class voice-driven-coding-agent-peers with streaming-STT-diarization integration, and one canonical Anthropic-blessed multi-partner-routing-pattern (Deepgram/AssemblyAI/Whisper-Groq per `docs.anthropic.com/audio`).

**Clusters:** Sibling-shape cluster grows to 36 (#201/#202/#203/#206/#207/#208/#209/#210/#211/#212/#213/#214/#215/#216/#217/#218/#219/#220/#221/#222/#223/#224/#225/#226/#227/#228/#229/#230/#231/#232/#233/#234/#235/#236/#237/#238). Wire-format-parity cluster grows to 25. Capability-parity cluster grows to 17. Multimodal-IO cluster grows to 13 (extending #225's audio-bidirectional with the streaming-STT-only-modality-coverage variant). Provider-asymmetric-delegation cluster grows to 13 (with the largest streaming-STT-partner-set in the cluster at ten-plus partners, sibling to #225's six-partner-audio-set). Sandbox-locality-axis META-cluster: 2 members stable (#230 + #232). Tool-locality-axis META-cluster: 3 members stable (#232 + #233 + #234). Server-managed-tool-as-tool-choice-discriminator cluster: 4 members stable (#232 + #233 + #234 + #235). Async-task-polling cluster: 4 members stable (#221 + #227 + #228 + #236). Multi-domain-multipart cluster: 3 members stable (#225 + #227 + #236). ToolResultContentBlock-extension mini-cluster grows to 6 (#230 + #232 + #233 + #234 + #235 + #238).
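The push-pull channel-pair shape described earlier (audio frames pushed into a Sink, transcript events pulled out of a Stream over one persistent connection) can be sketched with std channels standing in for futures' `Sink`/`Stream`. `open_stt_session` and `SttEvent` are illustrative assumptions, and the worker thread stands in for a real WebSocket session:

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

/// Events pulled out of the transcript side of the pair.
pub enum SttEvent {
    Interim(String),
    Final(String),
}

/// Hypothetical provider-method shape: returns a (sink, stream) pair
/// instead of a single Future, mirroring the persistent-WebSocket transport.
pub fn open_stt_session() -> (Sender<Vec<i16>>, Receiver<SttEvent>) {
    let (audio_tx, audio_rx) = channel::<Vec<i16>>();
    let (event_tx, event_rx) = channel::<SttEvent>();
    // Stand-in worker: a real implementation would speak one of the
    // provider event vocabularies; here each frame just yields a Final.
    thread::spawn(move || {
        for (i, frame) in audio_rx.iter().enumerate() {
            let _ = event_tx.send(SttEvent::Final(format!(
                "frame {} ({} samples)",
                i,
                frame.len()
            )));
        }
        // event_tx drops here, closing the transcript stream.
    });
    (audio_tx, event_rx)
}
```

Usage mirrors the push-pull description: push PCM frames into the sender while independently draining the receiver; dropping the sender ends the session and, in turn, the event stream.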
Persistent-WebSocket-transport cluster grows to 2 (#229 + #238 — FIRST cluster expansion of #229's solo-founder shape, establishing persistent-WebSocket as a stable transport-axis with two distinct protocol-event-vocabularies). **Per-word-multi-axis-compound-attribution-data-model cluster: 1 member (#238 alone, founder).** **Structured-typed-payload-on-USER-INPUT-content-block cluster: 1 member (#238 alone, founder).** **Bidirectional-channel-pair-Provider-trait-method-shape cluster: 1 member (#238 alone, founder).** **Per-partner-protocol-vocabulary-normalization-at-dispatch-layer cluster: 1 member (#238 alone, founder).** **Entirely-absent-CLI-and-slash-command-surface-with-zero-stub-precedent cluster: 1 member (#238 alone, founder, INVERSE-PATTERN of #225's advertised-but-unbuilt-trio).** **Streaming-STT-five-dimensional-pricing-matrix cluster: 1 member (#238 alone, founder).** **Streaming-STT-quality-observability-with-DER-and-WER cluster: 1 member (#238 alone, founder, FIRST cluster member with quality-observability-axis distinct from cost/latency).** **Streaming-STT-endpointing-and-VAD-typed-surface cluster: 1 member (#238 alone, founder).** **STT-streaming-protocol-event-vocabulary cluster: 1 member (#238 alone, founder, sibling to #229's conversational-session-event-vocabulary).** **Cross-pinpoint-synthesis-fusion-shape META-cluster: 1 member (#238 alone, founder, sibling META-cluster to Sandbox-locality-axis META-cluster and Tool-locality-axis META-cluster — the FIRST META-cluster founded by a single pinpoint synthesizing TWO previously-disjoint cluster-axes into one fused-shape).** NINE new clusters founded in a single pinpoint plus ONE new META-cluster founded plus participation in seven inherited clusters — the THIRD-largest single-cycle cluster-founding count after #236's fifteen and #234's thirteen, ahead of #230's eight + 1 META-cluster, and the FIRST single cycle where a pinpoint founds a META-cluster by SYNTHESIZING two previously-disjoint cluster-axes (rather than introducing a new axis-pair as #232/#233 did for the Tool-locality META-cluster) — establishing the doctrine that META-clusters can be founded by either NEW-AXIS-PAIR-INTRODUCTION or CROSS-AXIS-SYNTHESIS, the THIRD distinct META-cluster founding pattern after Sandbox-locality (transport-pair-introduction) and Tool-locality (locality-pair-introduction). Twelve-layer-fusion-shape with cross-pinpoint-synthesis ties #225 for the largest single-pinpoint fusion catalogued (matching #225's twelve-layer count at a different layer-decomposition: where #225 had nine-layer + three implicit-axes, #238 has twelve-layer with the twelfth layer being the cross-pinpoint-synthesis META-axis itself). The twelve-layer-fusion-shape-with-cross-pinpoint-synthesis-of-audio-modality-and-persistent-WebSocket-transport is novel within the cluster and applies to the follow-on candidate **Realtime tool-use API typed taxonomy** (combining #229's WebSocket transport with #232/#233/#234/#235's tool_choice typed-discriminator extensions to dispatch tool_use events over the persistent WebSocket — the natural #239 candidate inheriting the same cross-pinpoint-synthesis META-pattern but synthesizing #229 × #232/#233/#234/#235 instead of #225 × #229).

**Status:** Open. No source code changed. Filed 2026-04-26 07:30 KST. HEAD: 3f41341 (post-#237 fast-forward-rebase after gaebal-gajae's parallel cron-timeout-failure-state-collapse pinpoint at 07:31 KST). Branch: feat/jobdori-168c-emission-routing. Sibling-shape cluster: 36 pinpoints. Multimodal-IO cluster: 13 members. Provider-asymmetric-delegation cluster: 13 members. Persistent-WebSocket-transport cluster: 2 members (#229 + #238 — FIRST expansion). ToolResultContentBlock-extension mini-cluster: 6 members.
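The request-side opt-ins surveyed in the external-validation list (diarization enable, speaker-count hints, interim results, sub-second endpointing/VAD) could be carried by a config shape like this sketch. The struct and the query rendering are assumptions modeled on the Deepgram query parameters and the Google/AWS diarization fields cited above, not existing claw-code API:

```rust
/// Hypothetical request-side config for one streaming-STT session.
#[derive(Debug, Clone)]
pub struct SttRequestConfig {
    pub model: String,
    pub sample_rate_hz: u32,
    pub interim_results: bool,       // Deepgram `interim_results=true` analogue
    pub speaker_labels: bool,        // diarization opt-in (AWS ShowSpeakerLabels analogue)
    pub min_speakers: Option<u32>,   // Google min_speaker_count analogue
    pub max_speakers: Option<u32>,   // Google max_speaker_count / AWS MaxSpeakerLabels
    pub endpointing_ms: Option<u32>, // sub-second VAD endpointing threshold
}

impl SttRequestConfig {
    /// Render the config as query parameters in the Deepgram-style
    /// `?model=...&diarize=true` shape quoted earlier (illustrative only;
    /// each partner would get its own renderer at the dispatch layer).
    pub fn to_query(&self) -> String {
        let mut q = format!(
            "model={}&sample_rate={}&interim_results={}&diarize={}",
            self.model, self.sample_rate_hz, self.interim_results, self.speaker_labels
        );
        if let Some(n) = self.max_speakers {
            q.push_str(&format!("&max_speakers={}", n));
        }
        if let Some(ms) = self.endpointing_ms {
            q.push_str(&format!("&endpointing={}", ms));
        }
        q
    }
}
```

One canonical config plus per-partner renderers keeps the ten-plus provider vocabularies out of caller code, mirroring the normalization direction used for events.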
**Cross-pinpoint-synthesis-fusion-shape META-cluster: 1 member (founder, THIRD distinct META-cluster founding pattern).** Nine new clusters founded plus one NEW META-cluster founded — the THIRD-largest single-cycle cluster-founding count, and the FIRST single cycle where the META-cluster is founded by cross-axis-synthesis rather than new-axis-pair-introduction. Twelve-layer-fusion-shape-with-cross-pinpoint-synthesis is novel within the cluster. **#238 catalogues the upstream prerequisite of every multi-participant voice-driven coding-agent affordance** (call-center voice-of-customer transcription, podcast/meeting transcription with speaker-tagged transcripts, voice-driven multi-user collaborative-coding sessions, accessibility real-time captioning with speaker-attribution, legal/courtroom transcription, sermon/lecture transcription, voice-message transcription with multi-speaker-thread-reconstruction) — the canonical 2024-2026-era multi-speaker voice workflow that is currently impossible to build on top of claw-code DESPITE Deepgram nova-3 / AssemblyAI Universal-1 / Speechmatics realtime / Soniox / six-plus other providers all shipping streaming-STT-with-diarization as flagship 2024-Q3-or-later GA capabilities AND DESPITE every voice-driven-coding-agent peer in the surveyed ecosystem (anomalyco/opencode + Cursor + claudecode-voice + Aider + simonw/llm + continue.dev) shipping streaming-STT-with-diarization as a first-class typed surface.
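The WER half of the DER/WER quality-observability telemetry (the `gen_ai.transcription.word_error_rate` attribute cited above) reduces to word-level edit distance over the reference length. A minimal self-contained sketch of the standard dynamic-programming formulation; DER is omitted because it additionally needs time-aligned speaker turns:

```rust
/// Word error rate: Levenshtein distance between the hypothesis and the
/// reference word sequences, divided by the reference length.
pub fn word_error_rate(reference: &str, hypothesis: &str) -> f64 {
    let r: Vec<&str> = reference.split_whitespace().collect();
    let h: Vec<&str> = hypothesis.split_whitespace().collect();
    if r.is_empty() {
        // Degenerate case: empty reference.
        return if h.is_empty() { 0.0 } else { 1.0 };
    }
    // Classic edit-distance DP over words, keeping one row at a time.
    let mut prev: Vec<usize> = (0..=h.len()).collect();
    for (i, rw) in r.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, hw) in h.iter().enumerate() {
            let sub = prev[j] + usize::from(rw != hw); // substitution / match
            let best = sub.min(prev[j + 1] + 1).min(cur[j] + 1); // del / ins
            cur.push(best);
        }
        prev = cur;
    }
    prev[h.len()] as f64 / r.len() as f64
}
```

Emitting this per finalized utterance (against a human-corrected reference, where one exists) is what makes the quality axis observable separately from cost and latency.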
The cross-pinpoint-synthesis META-pattern means future cluster expansion can SYNTHESIZE existing axes rather than always introducing new axes, opening a combinatorial follow-on space (#225 × #229 = #238 streaming-STT-with-diarization; #229 × #232/#233/#234/#235 = #239 candidate realtime tool-use; #220 × #225 = #240 candidate visual-grounded voice-input where image and audio frames are streamed together; #225 × #227 = #241 candidate audio-grounded video-generation where audio narration drives video generation; etc.) — establishing combinatorial cross-axis-synthesis as the THIRD pinpoint-discovery-mode after new-axis-founding and existing-cluster-extension.

🪨