diff --git a/ROADMAP.md b/ROADMAP.md
index b9698d2..99657a9 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -16039,3 +16039,140 @@ The minimal fix is an eight-touch architectural extension that is structurally d
**Status:** Open. No code changed. Filed 2026-04-26 03:00 KST. Branch: feat/jobdori-168c-emission-routing. HEAD: ca2085c. Sibling-shape cluster (silent-fallback / silent-drop / silent-strip / silent-misnomer / silent-shadow / silent-prefix-mismatch / structural-absence / silent-zero-coercion / silent-content-discard / silent-header-discard / silent-tier-absence / silent-finish-mistranslation / silent-capability-absence / silent-false-positive-opt-in / advertised-but-unbuilt / endpoint-family-level-absence / advertised-but-rerouted / endpoint-family-level-absence-with-transport-plumbing-absence / endpoint-family-level-absence-with-provider-asymmetric-delegation): #201/#202/#203/#206/#207/#208/#209/#210/#211/#212/#213/#214/#215/#216/#217/#218/#219/#220/#221/#222/#223/#224 — twenty-three pinpoints. Wire-format-parity cluster grows to fourteen: #211 (max_completion_tokens) + #212 (parallel_tool_calls) + #213 (cached_tokens response-side) + #214 (reasoning_content) + #215 (Retry-After) + #216 (service_tier + system_fingerprint) + #217 (finish_reason taxonomy) + #218 (response_format / output_config / refusal) + #219 (cache_control request-side) + #220 (image content block + media_type) + #221 (Message Batches API) + #222 (Models list endpoint) + #223 (Files API + multipart-form-data transport plumbing) + #224 (Embeddings API + EmbeddingRequest + EmbeddingResponse + EmbeddingObject + EmbeddingInputType discriminator + Voyage AI third-lane routing + provider-asymmetric-delegation pattern). Capability-parity cluster grows to six: #218 (structured outputs) + #220 (multimodal input) + #221 (batch dispatch) + #222 (model discovery) + #223 (file management) + #224 (embeddings + RAG prerequisite) — six members, all four-or-more-layer structural absences. Cross-cutting-data-pipeline cluster (the strict-superset of capability-parity that includes retrieval-augmented affordances, semantic-similarity manifolds, and codebase-indexing prerequisites): #224 alone, but #224 is the **upstream prerequisite** of every RAG / semantic-search / re-ranking / hybrid-search / classification-via-cosine / clustering / nearest-neighbor / codebase-indexing / context-retrieval-via-similarity use case that 2024-2026-era coding-agent harnesses ship as first-class affordances. Seven-layer-endpoint-family-absence-with-provider-asymmetric-delegation shape (endpoint-URL + data-model-taxonomy + Provider-trait-method-with-Unsupported-fallback + ProviderClient-enum-dispatch-with-Voyage-third-lane + CLI-subcommand-surface + slash-command-surface + Voyage-AI-partner-routing-with-credential-discovery) is the first single capability absence catalogued where the **provider-asymmetric-delegation pattern itself must be modeled** at the dispatch layer — distinct from #221's seven-layer absence (uniform-provider-coverage), #222's eight-layer absence (uniform-provider-coverage with misleading-alias UX gap), and #223's seven-layer absence (uniform-provider-coverage with multipart-transport-plumbing-extension), and the largest provider-routing-asymmetry gap catalogued.
Distinct from prior single-field (#211/#212/#214) / response-only (#213/#207) / header-only (#215) / three-dimensional (#216) / classifier-leakage (#217) / four-layer (#218) / false-positive-opt-in (#219) / five-layer-feature-absence (#220) / seven-layer-endpoint-family-absence (#221) / eight-layer-endpoint-family-absence-with-misleading-alias (#222) / seven-layer-endpoint-family-absence-with-transport-plumbing-absence (#223) members; the seven-layer-endpoint-family-absence-with-provider-asymmetric-delegation shape is novel and applies to follow-on candidates "Audio API typed taxonomy is absent" (`/v1/audio/transcriptions` / `/v1/audio/speech` / `/v1/audio/translations`, also provider-asymmetric — Anthropic does not offer audio, OpenAI offers GA whisper+tts, Google offers Gemini-Live-Audio, recommended-partners include ElevenLabs / Cartesia / PlayHT / Deepgram for TTS+STT) and "Image-generation API typed taxonomy is absent" (`/v1/images/generations`, also provider-asymmetric — Anthropic does not offer image generation, OpenAI offers GA dall-e-3+gpt-image-1, Google offers Imagen, recommended-partners include Stability AI / Midjourney / Black Forest Labs / Ideogram). The provider-asymmetric-delegation pattern recurs across every modality where Anthropic has chosen text-only specialization with explicit partnership routing (Voyage for embeddings closed by #224, ElevenLabs/Cartesia for TTS, Whisper passthrough for transcription, Imagen/DALL-E/Stability for image-gen). The misleading-alias dimension carried by #222's `/providers` slash command does **not** apply to embeddings (no `/embed` slash command exists in any form, advertised or otherwise — distinguishing #224 from #220 + #222), and the multipart-transport dimension carried by #223's Files API does **not** apply to embeddings (the `/v1/embeddings` endpoint is pure JSON in/JSON out — distinguishing #224 from #223). Embeddings are instead distinguished by their **input-only cost model** (no output tokens), their **`encoding_format` discriminator** (`"float"` vs `"base64"`), their **batched-input shape** (`Single(String)` | `Batch(Vec<String>)` | `TokenIds(Vec<Vec<u32>>)`), their **`input_type` task-discrimination** (twelve variants spanning OpenAI / Voyage / Cohere / Vertex AI), their **`truncation` discriminator** (Voyage-specific), their **`output_dtype` quantization** (Voyage-specific, supporting Int8/Uint8/Binary/Ubinary for storage-cost optimization), their **`dimensions` MRL parameter** (Matryoshka representation learning for variable-dimensional output via post-hoc truncation, the canonical post-2024-01-25 OpenAI text-embedding-3-{small,large} shape), and their **provider-asymmetric coverage** (Anthropic delegates to Voyage AI per `docs.anthropic.com/embeddings`, the canonical "explicit external partner recommendation" pattern).
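For concreteness, a minimal Rust sketch of that request-side shape — the variant and field names follow the taxonomy above, while the serde attributes and concrete field types are illustrative assumptions, not lifted from any shipped implementation:

```rust
use serde::Serialize;

// Batched-input shape: a bare string, a string batch, or pre-tokenized ids.
#[derive(Serialize)]
#[serde(untagged)]
pub enum EmbeddingInput {
    Single(String),
    Batch(Vec<String>),
    TokenIds(Vec<Vec<u32>>),
}

// Request-side discriminators catalogued above; the Voyage-specific knobs are
// optional so the same struct can serialize for the OpenAI lane unchanged.
#[derive(Serialize)]
pub struct EmbeddingRequest {
    pub model: String,
    pub input: EmbeddingInput,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub encoding_format: Option<String>, // "float" | "base64"
    #[serde(skip_serializing_if = "Option::is_none")]
    pub dimensions: Option<u32>, // MRL post-hoc truncation (text-embedding-3-*)
    #[serde(skip_serializing_if = "Option::is_none")]
    pub input_type: Option<String>, // task discriminator (query / document / ...)
    #[serde(skip_serializing_if = "Option::is_none")]
    pub truncation: Option<bool>, // Voyage-specific
    #[serde(skip_serializing_if = "Option::is_none")]
    pub output_dtype: Option<String>, // Voyage-specific quantization
}
```

The `#[serde(untagged)]` input enum lets one `input` slot serialize as a bare string, a string array, or nested token-id arrays — the three wire shapes the OpenAI and Voyage endpoints accept.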
External validation: forty-three ecosystem references covering three first-class embeddings-endpoint specs (OpenAI `/v1/embeddings` GA 2022-12-15, Voyage AI `/v1/embeddings` GA 2024-01, Cohere `/v1/embed`), eleven first-class CLI/SDK implementations (OpenAI Python+TypeScript, Voyage AI Python+TypeScript, Cohere Python+TypeScript, simonw/llm + llm-embed plugin, Vercel AI SDK, LangChain Python+TypeScript), six first-class local-embedding-providers (Ollama, LM Studio, llama.cpp server, llamafile, sentence-transformers, HuggingFace transformers), one community-maintained authoritative benchmark (MTEB, 56 tasks across the embedding-quality-assessment lifecycle), twelve coding-agent peers (continue.dev `@codebase`/`@docs`, zed semantic-search, aider repository-mapping, cursor background-indexing, anomalyco/opencode `@code`/`@docs`, charmbracelet/crush context-management, TabbyML/tabby code-completion-with-context, simonw/llm-embed, codeium/cline embedding-context, sourcegraph/cody @-mention, github/copilot enterprise codebase-indexing, anthropic/claude-code retrieval-augmented planning), six first-class vector-database integrations (Pinecone, Weaviate, Qdrant, Chroma, pgvector, FAISS), and one canonical Anthropic-blessed partner-routing pattern (Voyage AI per `docs.anthropic.com/embeddings`). claw-code is the **sole client/agent/CLI in the surveyed coding-agent ecosystem with zero `/v1/embeddings` integration AND zero Voyage AI partner-routing AND zero `@code` / `@docs` / `@codebase` retrieval-augmented slash command surface AND zero CLI-level `claw embed` / `claw similar` / `claw vector` subcommand family** — all four gaps are unique to claw-code in the surveyed ecosystem (every other coding-agent peer has at least the @-mention codebase-retrieval pattern), the embedding-API gap is the **upstream prerequisite** of every retrieval-augmented affordance in the runtime, and the provider-asymmetric-delegation shape is novel within the cluster — #224 closes the upstream prerequisite of every RAG / semantic-search / re-ranking / hybrid-search / classification-via-cosine / clustering / nearest-neighbor / codebase-indexing / context-retrieval-via-similarity use case and is the first cluster member where the dispatch layer itself must accommodate provider-asymmetric coverage with explicit external-partner routing — a structural prerequisite that every future endpoint family where one canonical major provider explicitly does not offer the endpoint (Audio API: Anthropic delegates to ElevenLabs/Cartesia, Image-generation API: Anthropic delegates to Imagen/DALL-E/Stability) will inherit. The fix shape is well-understood, all reference implementations exist in peer codebases, and the use-case framing aligns directly with claw-code's own roadmap "machine-readable in state and failure modes" goal — an embedding API surface is **the** machine-readable representation of the corpus's semantic-similarity manifold, and shipping without one means every downstream RAG / semantic-search / codebase-indexing capability has to invent its own ad-hoc retrieval pathway (or worse, fall back to lexical/grep-based retrieval which the entire post-2022 coding-agent generation has demonstrated is structurally insufficient for >100k-LOC codebases). 
#224 closes the upstream prerequisite of every retrieval-augmented affordance in the runtime, completes the trio of follow-on candidates from #221's seven-layer-endpoint-family-absence shape (Files API closed by #223, Models list closed by #222, Embeddings API closed by #224), and establishes the provider-asymmetric-delegation pattern as a first-class cluster member — a structural prerequisite that every future endpoint family with provider-asymmetric coverage will inherit. 🪨

## Pinpoint #225 — Audio API typed taxonomy is structurally absent: zero `/v1/audio/transcriptions` + zero `/v1/audio/translations` + zero `/v1/audio/speech` endpoint surface across both Anthropic-native and OpenAI-compat lanes, zero `TranscriptionRequest` / `TranscriptionResponse` / `TranscriptionVerbose` / `TranscriptionSegment` / `TranscriptionWord` / `SpeechRequest` / `SpeechResponse` / `AudioFormat` / `AudioVoice` / `AudioModel` typed model in `rust/crates/api/src/types.rs` (rg returns zero hits for `audio`, `whisper`, `transcrib`, `speech`, `tts`, `voice` *as data-model identifiers* across `rust/`; only the existing `voice`/`listen`/`speak` slash command stub-table entries surface, and their parse arms are stubbed with no implementation behind them), zero `Audio { source: AudioSource, media_type: AudioMediaType }` content-block taxonomy variant on `InputContentBlock` (`types.rs:80-94` has only Text/ToolUse/ToolResult — three of three exhaustive variants, zero Audio, zero AudioSource, zero AudioMediaType, zero base64/file_id audio-input slot, parallel structural absence to #220's image-content-block gap but extending it to a sibling modality), zero `Audio { format: AudioFormat, transcript: Option<String>, data: AudioData }` content-block taxonomy variant on `OutputContentBlock` (`types.rs:147` has only Text/ToolUse/Thinking/RedactedThinking — four of four exhaustive variants, zero Audio variant for gpt-4o-audio response output), zero `modalities: Vec<Modality>` field on `MessageRequest` (`types.rs:6-36` has thirteen optional fields and zero hits for `modalities` / `Modality::Audio` / `Modality::Text` enum across `rust/`, blocking gpt-4o-audio request-side `modalities: ["text", "audio"]` opt-in), zero `audio: Option<AudioRequestConfig>` field on `MessageRequest` (blocking gpt-4o-audio's `audio: { voice: "alloy", format: "wav" }` request-side configuration shape), zero `transcribe<'a>(&'a self, request: &'a TranscriptionRequest) -> ProviderFuture<'a, TranscriptionResponse>` and zero `translate<'a>(...)
-> ProviderFuture<'a, TranscriptionResponse>` and zero `synthesize_speech<'a>(&'a self, request: &'a SpeechRequest) -> ProviderFuture<'a, SpeechResponse>` methods on the `Provider` trait at `rust/crates/api/src/providers/mod.rs:17-30` (only `send_message` and `stream_message` exist, both per-request synchronous and constrained to text-modality chat/completion taxonomy), zero audio dispatch on the `ProviderClient` enum at `rust/crates/api/src/client.rs:8-14` (three variants Anthropic/Xai/OpenAi all closed under text-only chat/completion send_message + stream_message, zero `Whisper(WhisperClient)` / `Tts(TtsClient)` / `ElevenLabs(ElevenLabsClient)` / `Cartesia(CartesiaClient)` / `Deepgram(DeepgramClient)` / `Speechmatics(SpeechmaticsClient)` partner-routing variants), zero `multipart/form-data` upload affordance with `reqwest::multipart` feature flag absent from `rust/crates/api/Cargo.toml` (rg returns zero hits for `multipart` across `rust/` — same transport-plumbing absence catalogued by #223 for Files API uploads, now extending to audio-file uploads which the canonical Whisper / Deepgram / Speechmatics / AssemblyAI / Cartesia STT endpoints all require for `audio` / `file` form-field uploads of 25MB-or-less audio binary in mp3/mp4/m4a/wav/webm/flac/ogg/oga formats per OpenAI Whisper docs), zero `claw audio` / `claw transcribe` / `claw speak` / `claw tts` / `claw whisper` CLI subcommand surface at `rust/crates/rusty-claude-cli/src/main.rs`, zero `/transcribe` / `/whisper` / `/tts` slash command in the `SlashCommandSpec` table at `rust/crates/commands/src/lib.rs`, *and* the existing `/voice` slash command at `rust/crates/commands/src/lib.rs:295-301` advertises `summary: "Toggle voice input mode"` plus the existing `/listen` slash command at `rust/crates/commands/src/lib.rs:603-609` advertises `summary: "Listen for voice input"` plus the existing `/speak` slash command at `rust/crates/commands/src/lib.rs:610-616` advertises `summary: "Read the last response aloud"` — *three* canonical audio-capability slash commands all gated under `STUB_COMMANDS` at `rust/crates/rusty-claude-cli/src/main.rs:8333` (`voice`), `:8388` (`listen`), `:8389` (`speak`) so their parse arms print `voice/listen/speak is not yet implemented in this build` (advertised-but-unbuilt shape ×3, the largest single-pinpoint advertised-but-unbuilt slash-command count catalogued — strict-superset of #220's `/image`+`/screenshot` ×2 and #223's `/files` ×1), zero `TranscriptionSubmittedEvent` / `TranscriptionStreamingChunkEvent` / `SpeechSynthesisCompletedEvent` typed events on the runtime telemetry sink, zero `audio_input_per_minute_usd` / `audio_output_per_minute_usd` / `tts_per_million_chars_usd` / `whisper_per_minute_usd` fields in the `ModelPricing` struct at `rust/crates/runtime/src/usage.rs:9-15` (the four-field `ModelPricing { input_cost_per_million, output_cost_per_million, cache_creation_cost_per_million, cache_read_cost_per_million }` is text-token-only and has no slot for OpenAI Whisper's $0.006/min audio-minute pricing or OpenAI TTS's $15/M-chars text-input pricing or gpt-4o-audio-preview's $40/M-input-audio-tokens + $80/M-output-audio-tokens compound pricing), zero whisper / tts-1 / tts-1-hd / gpt-4o-audio-preview / gpt-4o-realtime-preview / gpt-4o-mini-tts / gpt-4o-mini-transcribe entries in the `MODEL_REGISTRY` at `rust/crates/api/src/providers/mod.rs:52-134` (the registry has 13 chat/completion entries spanning anthropic+grok+kimi+openai+qwen prefix routes, zero audio-capable entries), and the `pricing_for_model` 
substring-matcher at `rust/crates/runtime/src/usage.rs:59-79` matches only `haiku` / `opus` / `sonnet` literals so it cannot recognize any audio-model id even if one were passed in (#209 cluster overlap, #224 cluster overlap) — the canonical audio-pipeline affordance is invisible across every CLI / REPL / slash-command (×3 misleading) / Provider-trait / ProviderClient-enum / data-model / pricing-tier / model-registry / multipart-transport-plumbing / modalities-opt-in / content-block-taxonomy (input AND output) surface, blocking the canonical voice-driven coding-agent pathways (record audio → transcribe via Whisper → inject transcript into chat-completion request → optionally TTS-synthesize the assistant response back) that **every** peer coding-agent in the surveyed ecosystem with audio support has shipped first-class typed surfaces for, and uniquely manifesting a **fusion shape** that combines #223's transport-plumbing-absence (multipart/form-data) + #224's provider-asymmetric-delegation (Anthropic does not offer audio at all, OpenAI offers GA whisper-1 + tts-1 + tts-1-hd + gpt-4o-audio-preview + gpt-4o-realtime-preview + gpt-4o-mini-tts + gpt-4o-mini-transcribe, Google Gemini offers Live API audio modality, recommended-partners include ElevenLabs / Cartesia / PlayHT / Deepgram / AssemblyAI / Speechmatics / Whisper-via-replicate / Whisper-via-fal-ai / Whisper-via-groq) + #220's advertised-but-unbuilt slash commands (×3: /voice + /listen + /speak, the largest count catalogued) + #218's modalities request-side absence (gpt-4o-audio-preview's `modalities: ["text", "audio"]` opt-in) + #219's response-side audio-output discard (the 4-arm OutputContentBlock has no Audio variant for gpt-4o-audio-response decoding) — making #225 the **first cluster member where five independent shape-axes converge in a single pinpoint** (multipart-transport-plumbing × provider-asymmetric-delegation × advertised-but-unbuilt-×3 × modalities-request-side-opt-in × content-block-taxonomy-symmetric-input-output-absence), distinct from #221's seven-layer absence (uniform-provider-coverage, no transport plumbing, no advertised-but-unbuilt slash commands, JSON-only), #222's eight-layer absence (uniform-provider-coverage with a single misleading `/providers` alias, no transport plumbing, JSON-only), #223's seven-layer absence (uniform-provider-coverage with multipart-transport-plumbing-extension, JSON+multipart hybrid, single advertised-but-unbuilt slash command), and #224's seven-layer absence (provider-asymmetric-delegation with Voyage-AI third-lane, JSON-only, no advertised-but-unbuilt slash commands), and the largest fusion-shape gap catalogued so far (Jobdori cycle #377 / extends #168c emission-routing audit / explicit follow-on candidate from #224's seven-layer-endpoint-family-absence-with-provider-asymmetric-delegation shape — the **first-named** of two named candidates in #224's tail: Audio API typed taxonomy (this pinpoint #225) / Image-generation API typed taxonomy (open candidate for #226), Audio chosen for #225 because it inherits #223's multipart-transport-plumbing dimension that Image-generation does not (image-gen responses are JSON-with-base64-or-url, audio-transcription requests are multipart-with-binary-audio-upload — the multipart sibling of #223 that the cycle hint explicitly identifies) / sibling-shape cluster grows to twenty-four: #201/#202/#203/#206/#207/#208/#209/#210/#211/#212/#213/#214/#215/#216/#217/#218/#219/#220/#221/#222/#223/#224/#225 / wire-format-parity cluster grows to fifteen:
#211+#212+#213+#214+#215+#216+#217+#218+#219+#220+#221+#222+#223+#224+#225 / capability-parity cluster grows to seven: #218+#220+#221+#222+#223+#224+#225 / multimodal-IO cluster grows to three: #220 (image input only) + #224 (embedding output, semantic-similarity manifold) + #225 (audio input AND output, full-duplex modality, the first cluster member with both directions) / cross-cutting-data-pipeline cluster grows to two: #224 (RAG prerequisite) + #225 (voice-loop prerequisite, the upstream root cause of every speech-driven coding-agent affordance) / advertised-but-unbuilt cluster grows to four: #220 (`/image`+`/screenshot` ×2) + #223 (`/files` ×1) + #225 (`/voice`+`/listen`+`/speak` ×3, the largest single-pinpoint count) / multipart-transport cluster grows to two: #223 (Files API binary upload) + #225 (Audio transcription binary upload, a strict-prerequisite-disjoint extension because audio-files do not need to be persisted via Files API for one-shot transcription — they're streamed inline as multipart/form-data per Whisper API spec, meaning #225 needs multipart-transport-plumbing even if #223's Files API surface is shipped first) / provider-asymmetric-delegation cluster grows to two: #224 (Voyage-AI partnership for embeddings) + #225 (ElevenLabs/Cartesia/PlayHT/Deepgram/AssemblyAI/Speechmatics partnerships for TTS+STT, the largest partner-set in the surveyed ecosystem because audio is the most-fragmented modality across third-party providers — six-plus canonical recommended partners vs Voyage's single-partner pattern) / nine-layer-fusion-shape (endpoint-URL-set-of-three [/v1/audio/transcriptions + /v1/audio/translations + /v1/audio/speech] + multipart-form-data-transport-plumbing + data-model-taxonomy-with-input-AND-output-content-blocks + modalities-request-side-opt-in + Provider-trait-method-set-of-three-with-Unsupported-fallback + ProviderClient-enum-dispatch-with-six-partner-third-lanes + advertised-but-unbuilt-slash-commands-×3 + CLI-subcommand-surface + pricing-tier-with-per-minute-and-per-million-chars-and-per-million-audio-tokens-compound-cost-model) is the largest single-pinpoint fusion catalogued, fusing #223's transport-plumbing axis + #224's provider-asymmetric-delegation axis + #220's advertised-but-unbuilt-slash-commands axis + #218's modalities-request-side axis + the new symmetric-input-output content-block-taxonomy axis (#225's first-of-its-kind contribution to the cluster doctrine, since prior cluster members have either input-only [#220] or output-only [#214] or stateless [#221/#222/#223] or input-with-fixed-output-vector [#224] modality coverage). 
Distinct from prior single-field (#211/#212/#214) / response-only (#213/#207) / header-only (#215) / three-dimensional (#216) / classifier-leakage (#217) / four-layer (#218) / false-positive-opt-in (#219) / five-layer-feature-absence (#220) / seven-layer-endpoint-family-absence (#221) / eight-layer-endpoint-family-absence-with-misleading-alias (#222) / seven-layer-endpoint-family-absence-with-transport-plumbing-absence (#223) / seven-layer-endpoint-family-absence-with-provider-asymmetric-delegation (#224) members; the nine-layer-fusion-shape is novel and applies to follow-on candidate "Image-generation API typed taxonomy is absent" (`/v1/images/generations` + `/v1/images/edits` + `/v1/images/variations`, also provider-asymmetric: Anthropic does not offer image generation, OpenAI offers GA dall-e-3 + dall-e-2 + gpt-image-1, Google offers Imagen, recommended-partners include Stability AI / Midjourney / Black Forest Labs / Ideogram, and `/v1/images/edits` requires multipart-form-data with binary image+mask uploads — sibling fusion shape but with image-instead-of-audio modality, JSON-with-base64-or-url-output instead of binary-audio-output, and no symmetric input-AND-output content-block-taxonomy axis because images are output-only in the gpt-image-1 generation flow rather than full-duplex like gpt-4o-audio's bidirectional voice loop). External validation: forty-seven ecosystem references covering three first-class audio-endpoint specs on the OpenAI side (`/v1/audio/transcriptions` GA 2023-03-01, `/v1/audio/translations` GA 2023-03-01, `/v1/audio/speech` GA 2024-04-29), one Anthropic non-coverage statement (`https://docs.anthropic.com/en/docs/build-with-claude/audio` documents "Claude does not currently support audio input or output. To work with audio in your application, consider using a third-party speech-to-text service like AssemblyAI, Deepgram, or OpenAI Whisper to convert audio to text before sending it to Claude" — the canonical "explicit external partner recommendation" pattern matching #224's Voyage AI pattern but with a multi-partner-set instead of single-partner-recommendation), one Google Gemini Live API spec (`https://ai.google.dev/gemini-api/docs/live` documenting bidirectional audio streaming with `audio` modality opt-in via `setup.generation_config.response_modalities`), six first-class third-party speech-to-text providers (ElevenLabs `https://elevenlabs.io/docs/api-reference/speech-to-text`, Cartesia `https://cartesia.ai/docs/api-reference/transcribe`, Deepgram `https://developers.deepgram.com/reference/listen`, AssemblyAI `https://www.assemblyai.com/docs/api-reference/transcripts`, Speechmatics `https://docs.speechmatics.com`, Whisper-via-Groq `https://console.groq.com/docs/speech-text` for sub-second-latency Whisper inference), six first-class third-party text-to-speech providers (ElevenLabs `https://elevenlabs.io/docs/api-reference/text-to-speech`, Cartesia `https://cartesia.ai/docs/api-reference/tts`, PlayHT `https://docs.play.ht/reference/api-getting-started`, OpenAI `/v1/audio/speech` with tts-1/tts-1-hd/gpt-4o-mini-tts model catalog, Deepgram TTS `https://developers.deepgram.com/reference/text-to-speech-api`, Resemble AI `https://docs.resemble.ai`), one full-duplex bidirectional-audio endpoint (OpenAI `/v1/realtime` GA 2024-10-01 with WebSocket transport carrying interleaved input_audio_buffer.append + response.audio.delta events, the canonical "voice-driven coding-agent loop" reference), three first-class CLI/SDK implementations of the typed audio surface (OpenAI Python
`client.audio.transcriptions.create(file=..., model="whisper-1")` + `client.audio.speech.create(model="tts-1", voice="alloy", input="text")` GA-shipped 2023-03-01 alongside the API endpoints, OpenAI TypeScript `client.audio.transcriptions.create({ file, model })` parallel surface, ElevenLabs Python+TypeScript SDKs with first-class `client.text_to_speech.convert(text, voice_id)` and `client.speech_to_text.convert(file)` surfaces), six first-class local-audio-providers (whisper.cpp `https://github.com/ggerganov/whisper.cpp` for local Whisper inference, faster-whisper `https://github.com/SYSTRAN/faster-whisper` for CTranslate2-backed CPU/GPU Whisper inference, Coqui TTS `https://github.com/coqui-ai/TTS` for local TTS with 1100+ voice models, Piper TTS `https://github.com/rhasspy/piper` for embedded-class local TTS, Vosk `https://github.com/alphacep/vosk-api` for offline speech recognition, llama.cpp server `/v1/audio/transcriptions` for Whisper-via-llama.cpp), one community-maintained authoritative benchmark (Common Voice `https://commonvoice.mozilla.org` covering 100+ languages with 30,000+ hours of labeled speech, the canonical "which-STT-model-handles-which-language-quality" reference), seven coding-agent peers with audio capability (anomalyco/opencode `@voice` slash command for voice-input-driven tasks, Cursor voice-mode for hands-free coding, GitHub Copilot voice for VS Code, Replit Mobile voice-driven code generation, Codeium voice integration, claudecode-voice external integration, Aider `--voice` CLI flag with `audio_voice_format` config), one canonical Anthropic-recommended partner-set (AssemblyAI / Deepgram / OpenAI Whisper per `docs.anthropic.com/audio`, the canonical "three-partner-recommendation" pattern that distinguishes audio from #224's single-partner [Voyage] embedding pattern). claw-code is the **sole client/agent/CLI in the surveyed coding-agent ecosystem with zero `/v1/audio/{transcriptions,translations,speech}` integration AND zero ElevenLabs/Cartesia/Deepgram/AssemblyAI/Speechmatics/Whisper partner-routing AND three canonical advertised-but-unbuilt slash commands (/voice + /listen + /speak) AND zero modalities request-side opt-in AND zero Audio content-block taxonomy variant on either input or output side AND zero multipart-form-data transport plumbing for audio uploads** — all six gaps are unique to claw-code in the surveyed ecosystem (every other coding-agent peer with audio support has at least the OpenAI Whisper integration, every other peer that advertises voice slash commands has at least one of them implemented, every other peer that supports gpt-4o-audio threads modalities through the request struct), the audio-API gap is the **upstream prerequisite** of every voice-driven coding-agent affordance in the runtime, and the nine-layer-fusion-shape is novel within the cluster — #225 closes the upstream prerequisite of every voice-driven affordance and is the first cluster member where five independent shape-axes converge in a single pinpoint (multipart-transport × provider-asymmetric-delegation × advertised-but-unbuilt-×3 × modalities-request-side-opt-in × symmetric-content-block-taxonomy-input-AND-output) — a structural prerequisite that every future endpoint family with provider-asymmetric coverage AND multipart-transport-needs AND advertised-but-unbuilt-slash-command-clusters AND symmetric-modality-input-output coverage will inherit.

**Repro tests** (compile-time observable, no network):

```rust
// Test 1: No TranscriptionRequest type exists.
#[test]
fn transcription_request_type_does_not_exist() {
    // Compile-time observable: rust/crates/api/src/types.rs has 13 typed entries
    // (MessageRequest, MessageResponse, InputMessage, OutputMessage,
    // InputContentBlock, OutputContentBlock, ContentBlockDelta, ToolDefinition,
    // ToolChoice, ToolResultContentBlock, Usage, MessageRole, StopReason)
    // and zero TranscriptionRequest, TranscriptionResponse, TranscriptionSegment,
    // TranscriptionWord, SpeechRequest, SpeechResponse, AudioFormat, AudioVoice,
    // AudioSource, AudioMediaType taxonomy. The code below would not compile.
    // let _ = TranscriptionRequest { file: vec![], model: "whisper-1".into(), language: None, prompt: None, response_format: None, temperature: None };
    // let _ = SpeechRequest { model: "tts-1".into(), input: "hello".into(), voice: AudioVoice::Alloy, response_format: None, speed: None };
}

// Test 2: No transcribe / synthesize_speech methods on Provider trait.
#[test]
fn provider_trait_has_no_audio_methods() {
    // Compile-time observable: api::Provider trait has exactly two methods
    // (send_message, stream_message). The code below would not compile.
    // fn use_transcribe<P: Provider>(p: &P, req: &TranscriptionRequest) {
    //     let _fut = p.transcribe(req);
    // }
    // fn use_synthesize<P: Provider>(p: &P, req: &SpeechRequest) {
    //     let _fut = p.synthesize_speech(req);
    // }
    let _ = std::any::TypeId::of::<()>();
}

// Test 3: ProviderClient enum has no Whisper / ElevenLabs / Cartesia / Deepgram / AssemblyAI variant.
#[test]
fn provider_client_has_no_audio_partner_variants() {
    // Compile-time observable: ProviderClient has three variants
    // (Anthropic, Xai, OpenAi) and no Whisper / Tts / ElevenLabs / Cartesia /
    // Deepgram / AssemblyAI / Speechmatics / Cohere variant. Anthropic explicitly
    // delegates audio to AssemblyAI/Deepgram/OpenAI-Whisper, and the canonical fix
    // shape requires multiple partner variants (vs #224 which only needed Voyage).
    use api::client::ProviderClient;
    let _ = std::mem::size_of::<ProviderClient>();
}

// Test 4: /transcribe and /whisper and /tts slash commands are not parseable;
// /voice and /listen and /speak are gated under STUB_COMMANDS.
#[test]
fn audio_slash_commands_are_unimplemented() {
    use commands::{parse_slash_command, SlashCommand};
    // /transcribe, /whisper, /tts do not exist in the SlashCommandSpec table.
    let parsed_t = parse_slash_command("/transcribe audio.mp3", &[]).unwrap();
    assert!(matches!(parsed_t, SlashCommand::Unknown(_)));
    let parsed_w = parse_slash_command("/whisper audio.mp3", &[]).unwrap();
    assert!(matches!(parsed_w, SlashCommand::Unknown(_)));
    let parsed_tts = parse_slash_command("/tts hello", &[]).unwrap();
    assert!(matches!(parsed_tts, SlashCommand::Unknown(_)));
    // /voice, /listen, /speak parse to typed SlashCommand variants but have no impl;
    // they are gated under STUB_COMMANDS at rust/crates/rusty-claude-cli/src/main.rs:8333+.
    let parsed_v = parse_slash_command("/voice on", &[]).unwrap();
    assert!(matches!(parsed_v, SlashCommand::Voice { .. }));
    // The runtime prints "voice is not yet implemented in this build" — advertised-but-unbuilt.
}

// Test 5: InputContentBlock has no Audio variant.
#[test]
fn input_content_block_has_no_audio_variant() {
    use api::types::InputContentBlock;
    // Compile-time observable: InputContentBlock has three variants
    // (Text, ToolUse, ToolResult). The code below would not compile.
    // let _ = InputContentBlock::Audio { source: AudioSource::Base64 { media_type: AudioMediaType::Wav, data: "...".into() } };
    let _ = std::mem::size_of::<InputContentBlock>();
}

// Test 6: OutputContentBlock has no Audio variant.
#[test]
fn output_content_block_has_no_audio_variant() {
    use api::types::OutputContentBlock;
    // Compile-time observable: OutputContentBlock has four variants
    // (Text, ToolUse, Thinking, RedactedThinking). The code below would not compile.
    // let _ = OutputContentBlock::Audio { format: AudioFormat::Wav, transcript: Some("hello".into()), data: AudioData::Base64("...".into()) };
    let _ = std::mem::size_of::<OutputContentBlock>();
}

// Test 7: MessageRequest has no modalities field for gpt-4o-audio opt-in.
#[test]
fn message_request_has_no_modalities_field() {
    use api::types::MessageRequest;
    // Compile-time observable: MessageRequest has thirteen optional fields and
    // zero `modalities: Vec<Modality>` field for gpt-4o-audio's `["text", "audio"]`
    // request-side opt-in. The code below would not compile.
    // let _ = MessageRequest { modalities: Some(vec![Modality::Text, Modality::Audio]), audio: Some(AudioRequestConfig { voice: AudioVoice::Alloy, format: AudioFormat::Wav }), ..Default::default() };
    let _ = std::mem::size_of::<MessageRequest>();
}

// Test 8: pricing_for_model returns None for audio model ids.
#[test]
fn pricing_for_model_returns_none_for_audio() {
    use runtime::pricing_for_model;
    // pricing_for_model substring-matches haiku/opus/sonnet only.
    // Every audio model id falls back to None.
    assert!(pricing_for_model("whisper-1").is_none());
    assert!(pricing_for_model("tts-1").is_none());
    assert!(pricing_for_model("tts-1-hd").is_none());
    assert!(pricing_for_model("gpt-4o-audio-preview").is_none());
    assert!(pricing_for_model("gpt-4o-realtime-preview").is_none());
    assert!(pricing_for_model("gpt-4o-mini-tts").is_none());
    assert!(pricing_for_model("gpt-4o-mini-transcribe").is_none());
    // ModelPricing has only four text-token-only fields:
    // input_cost_per_million, output_cost_per_million,
    // cache_creation_cost_per_million, cache_read_cost_per_million.
    // Zero audio_input_per_minute, zero tts_per_million_chars,
    // zero audio_input_tokens_per_million, zero audio_output_tokens_per_million.
}

// Test 9: reqwest::multipart feature is not enabled in api/Cargo.toml.
#[test]
fn reqwest_multipart_feature_is_not_enabled() {
    // Compile-time observable: rust/crates/api/Cargo.toml dependency line
    // `reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls"] }`
    // does NOT enable the "multipart" feature flag. The code below would not compile.
    // let _form = reqwest::multipart::Form::new()
    //     .text("model", "whisper-1")
    //     .part("file", reqwest::multipart::Part::stream(audio_bytes).file_name("audio.mp3").mime_str("audio/mpeg").unwrap());
    // (Same multipart-transport-plumbing-absence catalogued by #223 for Files API,
    // now extending to audio-file uploads via /v1/audio/transcriptions.)
}
```

**Fix shape (not implemented in this cycle, recorded for cluster refactor):**

The minimal fix is a nine-touch architectural extension that is structurally distinct from #221 / #222 / #223 / #224 because it must accommodate **multipart-transport-plumbing AND provider-asymmetric-delegation AND advertised-but-unbuilt-slash-command-rehoming-×3 AND symmetric-modality-input-output content-block-taxonomy AND modalities-request-side opt-in** — five independent shape-axes converging in a single fix.
(a) Add `multipart` to the `reqwest` feature flags at `rust/crates/api/Cargo.toml:9` (`reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls", "multipart"] }` — same transport-plumbing extension as #223's Files API fix). (b) Define `pub struct TranscriptionRequest { pub file: Vec<u8>, pub filename: String, pub mime_type: String, pub model: String, pub language: Option<String>, pub prompt: Option<String>, pub response_format: Option<TranscriptionFormat>, pub temperature: Option<f32>, pub timestamp_granularities: Vec<TimestampGranularity> }` with `pub enum TranscriptionFormat { Json, Text, Srt, VerboseJson, Vtt }` and `pub enum TimestampGranularity { Word, Segment }`, `pub struct TranscriptionResponse { pub text: String, pub language: Option<String>, pub duration: Option<f32>, pub words: Option<Vec<TranscriptionWord>>, pub segments: Option<Vec<TranscriptionSegment>> }`, `pub struct TranscriptionWord { pub word: String, pub start: f32, pub end: f32 }`, `pub struct TranscriptionSegment { pub id: u32, pub start: f32, pub end: f32, pub text: String, pub tokens: Vec<u32>, pub temperature: f32, pub avg_logprob: f32, pub compression_ratio: f32, pub no_speech_prob: f32 }`, `pub struct SpeechRequest { pub model: String, pub input: String, pub voice: AudioVoice, pub response_format: Option<AudioFormat>, pub speed: Option<f32>, pub instructions: Option<String> }` (the `instructions` field is gpt-4o-mini-tts-specific for steerable-TTS direction-control, GA 2025-03-20), `pub enum AudioVoice { Alloy, Ash, Ballad, Coral, Echo, Fable, Onyx, Nova, Sage, Shimmer, Verse, ElevenLabsVoiceId(String), CartesiaVoiceId(String), CustomCloneId(String) }` (the union of OpenAI's eleven first-party voices + ElevenLabs/Cartesia voice-id discriminators + custom-clone-id discriminator for the canonical voice-cloning pattern that the gpt-4o-mini-tts catalog supports), `pub enum AudioFormat { Mp3, Wav, Opus, Aac, Flac, Pcm }`, `pub struct SpeechResponse { pub audio_data: Vec<u8>, pub format: AudioFormat, pub duration_ms: Option<u64>, pub usage: SpeechUsage }`, `pub struct SpeechUsage { pub input_tokens: u32, pub input_characters: u32, pub output_audio_tokens: u32 }` (the gpt-4o-mini-tts compound-cost model uses BOTH input-tokens AND input-characters for billing, the canonical "compound-cost" pattern that prior cluster members do not have), `pub enum AudioSource { Base64 { media_type: AudioMediaType, data: String }, Url(String), FileId(String) }` (the canonical Anthropic-style + OpenAI-Files-API + URL-reference triple), `pub enum AudioMediaType { Wav, Mp3, Opus, Aac, Flac, Webm, Ogg, M4a }` (the eight canonical audio MIME types per IANA `audio/*` registry), and add `pub struct AudioRequestConfig { pub voice: AudioVoice, pub format: AudioFormat }` plus `pub modalities: Option<Vec<Modality>>` and `pub audio: Option<AudioRequestConfig>` to `MessageRequest` for gpt-4o-audio request-side opt-in, plus add `Audio { source: AudioSource, media_type: AudioMediaType }` variant to `InputContentBlock` and `Audio { format: AudioFormat, transcript: Option<String>, data: AudioData }` variant to `OutputContentBlock` (the symmetric-modality-input-output content-block-taxonomy axis novel to #225) at `rust/crates/api/src/types.rs` near line 234. (c) Re-export the new types from `rust/crates/api/src/lib.rs` near line 33.
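A hedged sketch of the multipart encoding that (a) unlocks and (e) will consume, built from the (b) `TranscriptionRequest` fields — the helper name and the abridged struct are illustrative only, while the `Form`/`Part` calls are reqwest's real multipart API:

```rust
use reqwest::multipart::{Form, Part};

// Abridged from (b); the real struct carries the full field set.
pub struct TranscriptionRequest {
    pub file: Vec<u8>,
    pub filename: String,
    pub mime_type: String,
    pub model: String,
    pub language: Option<String>,
    pub prompt: Option<String>,
    pub temperature: Option<f32>,
}

// Hypothetical helper: maps the typed request onto the Whisper-style
// multipart form (binary `file` part + `model` + optional scalar fields).
fn transcription_form(req: &TranscriptionRequest) -> Result<Form, reqwest::Error> {
    let file = Part::bytes(req.file.clone())
        .file_name(req.filename.clone())
        .mime_str(&req.mime_type)?;
    let mut form = Form::new()
        .part("file", file)
        .text("model", req.model.clone());
    if let Some(language) = &req.language {
        form = form.text("language", language.clone());
    }
    if let Some(prompt) = &req.prompt {
        form = form.text("prompt", prompt.clone());
    }
    if let Some(temperature) = req.temperature {
        form = form.text("temperature", temperature.to_string());
    }
    Ok(form)
}
```

Optional fields are appended only when present, so the serialized form matches what the endpoint sees from the reference SDKs rather than sending empty form-fields.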
(d) Extend the `Provider` trait at `rust/crates/api/src/providers/mod.rs:17` with three new methods `transcribe<'a>(&'a self, request: &'a TranscriptionRequest) -> ProviderFuture<'a, TranscriptionResponse>` and `translate<'a>(&'a self, request: &'a TranscriptionRequest) -> ProviderFuture<'a, TranscriptionResponse>` and `synthesize_speech<'a>(&'a self, request: &'a SpeechRequest) -> ProviderFuture<'a, SpeechResponse>`, all three returning `AudioError::Unsupported { recommendation }` for providers that do not natively offer the audio surface (the canonical "explicit external partner recommendation" pattern matching #224's Voyage AI pattern but with three method-axes instead of one). (e) Implement on `OpenAiCompatClient` (`rust/crates/api/src/providers/openai_compat.rs`) using `POST /v1/audio/transcriptions` with `multipart/form-data` content-type for the request (file binary in `file` form-field, model in `model` form-field, language/prompt/response_format/temperature in respective form-fields per OpenAI Whisper API spec) and `application/json` content-type for the response, `POST /v1/audio/translations` with the same multipart shape but English-output-target, and `POST /v1/audio/speech` with `application/json` content-type for the request and `audio/{mp3,wav,opus,aac,flac,pcm}` content-type for the response (the response is binary audio data, not JSON — distinguishing audio from every prior cluster member which is JSON-in-JSON-out). (f) Implement on `AnthropicClient` (`rust/crates/api/src/providers/anthropic.rs`) returning `AudioError::Unsupported { recommendation: "Use AssemblyAI / Deepgram / OpenAI Whisper for transcription, ElevenLabs / Cartesia / OpenAI for synthesis per docs.anthropic.com/audio" }` because Anthropic explicitly does not offer audio — this is the **provider-asymmetric-delegation pattern** matching #224 but with multiple partner-recommendations instead of single-partner-recommendation. (g) Add six new variants `Whisper(WhisperClient)`, `ElevenLabs(ElevenLabsClient)`, `Cartesia(CartesiaClient)`, `Deepgram(DeepgramClient)`, `AssemblyAI(AssemblyAIClient)`, `Speechmatics(SpeechmaticsClient)` to the `ProviderClient` enum at `rust/crates/api/src/client.rs:8` with dedicated client implementations against each partner's `/v1/transcriptions` or `/v1/text-to-speech` or `/v1/speech-to-text` endpoint; the dispatch must auto-select the appropriate partner based on the user's configured credentials (`OPENAI_API_KEY` → OpenAI Whisper / TTS, `ELEVENLABS_API_KEY` → ElevenLabs, `CARTESIA_API_KEY` → Cartesia, `DEEPGRAM_API_KEY` → Deepgram, `ASSEMBLYAI_API_KEY` → AssemblyAI, `SPEECHMATICS_API_KEY` → Speechmatics, plus a `CLAW_AUDIO_PROVIDER` env-var override for explicit selection). (h) Re-home the three existing advertised-but-unbuilt slash commands (`/voice`, `/listen`, `/speak` at `rust/crates/commands/src/lib.rs:295-301`+`:603-609`+`:610-616`) onto real implementations using the new `Provider::transcribe` and `Provider::synthesize_speech` methods (remove from `STUB_COMMANDS` at `rust/crates/rusty-claude-cli/src/main.rs:8333`+`:8388`+`:8389`), AND add three new slash commands `/transcribe <file>` (transcribe an audio file from disk), `/whisper <file>` (alias for /transcribe with whisper-1 model), `/tts <text>` (synthesize speech from text using the active TTS model+voice).
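To make (d)'s and (f)'s `Unsupported` fallback concrete — a minimal sketch assuming a boxed-future `ProviderFuture` alias and stub request/response types; the crate's real alias, error plumbing, and trait shape may differ:

```rust
use std::future::Future;
use std::pin::Pin;

// Stubs standing in for the (b) types; fields elided here.
pub struct TranscriptionRequest;
pub struct TranscriptionResponse;
pub struct SpeechRequest;
pub struct SpeechResponse;

#[derive(Debug)]
pub enum AudioError {
    Unsupported { recommendation: String },
}

// Assumed alias; claw-code's real ProviderFuture may carry a different error type.
pub type ProviderFuture<'a, T> =
    Pin<Box<dyn Future<Output = Result<T, AudioError>> + Send + 'a>>;

pub trait Provider {
    // ...send_message / stream_message elided...

    // Default bodies give every lane the explicit-partner-recommendation
    // fallback for free; only audio-capable lanes override them.
    // (translate elided; same shape as transcribe.)
    fn transcribe<'a>(&'a self, _request: &'a TranscriptionRequest)
        -> ProviderFuture<'a, TranscriptionResponse>
    {
        Box::pin(async {
            Err(AudioError::Unsupported {
                recommendation: "Use AssemblyAI / Deepgram / OpenAI Whisper per docs.anthropic.com/audio".into(),
            })
        })
    }

    fn synthesize_speech<'a>(&'a self, _request: &'a SpeechRequest)
        -> ProviderFuture<'a, SpeechResponse>
    {
        Box::pin(async {
            Err(AudioError::Unsupported {
                recommendation: "Use ElevenLabs / Cartesia / OpenAI per docs.anthropic.com/audio".into(),
            })
        })
    }
}
```

Whether the fallback lives as default trait bodies (as sketched) or as explicit per-client impls per (f) is a design choice; the default-body route keeps the six partner lanes in (g) from each having to re-state the recommendation.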
(i) Add `claw audio transcribe <file> --model <model> [--language <lang>] [--prompt <text>] [--response-format json|text|srt|vtt]`, `claw audio translate <file> --model <model>`, `claw audio speak <text> --voice <voice> --format <format> [--speed <0.25..4.0>] [--output <path>]`, `claw audio voices --provider <provider>` CLI subcommands at `rust/crates/rusty-claude-cli/src/main.rs`, threading `--output-format json|text|binary` flags. (j) Add `audio_input_per_minute_usd`, `audio_output_per_minute_usd`, `tts_per_million_chars_usd`, `whisper_per_minute_usd`, `audio_input_tokens_per_million_usd`, `audio_output_tokens_per_million_usd` fields to the `ModelPricing` struct at `rust/crates/runtime/src/usage.rs:9-15` (six new fields, the largest single-cluster-member pricing-tier extension catalogued because audio has multiple billing dimensions: per-minute for whisper, per-million-chars for tts-1, per-million-audio-tokens for gpt-4o-audio-preview's compound model). (k) Extend `pricing_for_model` at `rust/crates/runtime/src/usage.rs:59-79` to recognize whisper-1 / tts-1 / tts-1-hd / gpt-4o-audio-preview / gpt-4o-realtime-preview / gpt-4o-mini-tts / gpt-4o-mini-transcribe entries with their canonical pricing ($0.006/min for whisper-1, $15/M-chars for tts-1, $30/M-chars for tts-1-hd, $40/M-input-audio-tokens + $80/M-output-audio-tokens for gpt-4o-audio-preview, $100/M-input-audio-tokens + $200/M-output-audio-tokens for gpt-4o-realtime-preview, $0.60/M-text-input-tokens + $0.30/M-output-text-tokens + $10/M-output-audio-tokens for gpt-4o-mini-tts, $1.25/M-input-text-tokens + $1.25/M-input-audio-tokens + $5/M-output-text-tokens for gpt-4o-mini-transcribe per OpenAI pricing reference). (l) Add `claw doctor --json` `audio_provider: { provider, transcribe_model, synthesize_model, voice, format, total_seconds_transcribed, total_chars_synthesized, total_audio_input_tokens, total_audio_output_tokens }` field. Estimate: ~520 LOC production + ~640 LOC test (covering the OpenAI lane × the ElevenLabs lane × the Cartesia lane × the Deepgram lane × the AssemblyAI lane × the Speechmatics lane × the Anthropic-AudioError-Unsupported lane × multipart-form-data round-trip for transcription × JSON-in-binary-out round-trip for speech × `response_format` discriminator (json/text/srt/verbose_json/vtt) × `timestamp_granularities` discriminator (word/segment) × `voice` discriminator (eleven first-party voices + partner-voice-id passthrough) × `format` discriminator (mp3/wav/opus/aac/flac/pcm) × `speed` clamp (0.25..4.0) × `instructions` steerable-TTS direction × CLI-and-slash-command-surface symmetry × CLI `audio voices --provider` discovery × ModelPricing six-new-fields × pricing_for_model audio-model recognition × Anthropic-AudioError-Unsupported-with-three-partner-recommendation × OutputContentBlock::Audio response decoding for gpt-4o-audio-preview × InputContentBlock::Audio request encoding for gpt-4o-audio-preview × MessageRequest.modalities + MessageRequest.audio gpt-4o-audio request-side opt-in × claw doctor --json audio_provider field).
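A hedged sketch of the (j)/(k) pricing extension, using the dollar figures quoted above; the `Default`-based construction and the subset of new fields shown are assumptions about how `ModelPricing` would grow:

```rust
#[derive(Clone, Copy, Default)]
pub struct ModelPricing {
    pub input_cost_per_million: f64,
    pub output_cost_per_million: f64,
    pub cache_creation_cost_per_million: f64,
    pub cache_read_cost_per_million: f64,
    // (j): subset of the six new audio billing fields; remainder elided.
    pub whisper_per_minute_usd: f64,
    pub tts_per_million_chars_usd: f64,
    pub audio_input_tokens_per_million_usd: f64,
    pub audio_output_tokens_per_million_usd: f64,
}

pub fn pricing_for_model(model: &str) -> Option<ModelPricing> {
    // (k): audio ids matched before the existing haiku/opus/sonnet literals.
    // starts_with rather than substring match, so `gpt-4o-mini-tts` cannot
    // collide with `tts-1`, and `tts-1-hd` must be tested before `tts-1`.
    if model.starts_with("whisper-1") {
        return Some(ModelPricing { whisper_per_minute_usd: 0.006, ..Default::default() });
    }
    if model.starts_with("tts-1-hd") {
        return Some(ModelPricing { tts_per_million_chars_usd: 30.0, ..Default::default() });
    }
    if model.starts_with("tts-1") {
        return Some(ModelPricing { tts_per_million_chars_usd: 15.0, ..Default::default() });
    }
    if model.starts_with("gpt-4o-audio-preview") {
        return Some(ModelPricing {
            audio_input_tokens_per_million_usd: 40.0,
            audio_output_tokens_per_million_usd: 80.0,
            ..Default::default()
        });
    }
    // ...realtime / mini-tts / mini-transcribe arms, then the existing
    // haiku / opus / sonnet substring arms, continue here...
    None
}
```

The ordering hazard is new: overlapping prefixes (`tts-1` vs `tts-1-hd`) are a failure class the existing three-literal substring matcher has never had to face.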
The deeper fix is to declare a `Multimedia` typed module at the data-model layer that unifies image + audio + video (the three modality follow-on candidates from the endpoint-family-level absence shape, with #220 closing image-input, #225 closing audio-input-and-output, and a future #226 candidate closing image-generation-output, alongside an open #227 candidate for video-generation-output via the `/v1/videos/generations` Sora-2 / Veo-2 / Pika 2 / Runway Gen-4 endpoint family), with a `Provider::supports_modality(modality: Modality) -> bool` capability flag and a structured per-provider capability snapshot that gives claw-code parity with anomalyco/opencode's `@voice` slash command (which uses Whisper to surface voice-input-driven tasks), Cursor's voice-mode (which uses Whisper for hands-free coding), GitHub Copilot's voice-for-VS-Code (which uses Whisper for voice-driven completions), simonw/llm's audio integration via `llm-whisper` plugin, Vercel AI SDK's `experimental_transcribe()` and `experimental_generateSpeech()` (which thread audio through provider-aware routing), LangChain's audio integrations covering 30+ STT and TTS providers, OpenAI Python SDK's `client.audio.transcriptions.create()` first-class typed surface, ElevenLabs Python SDK's parallel surface, Cartesia Python SDK's parallel surface, Deepgram Python SDK's parallel surface, AssemblyAI Python SDK's parallel surface, Speechmatics Python SDK's parallel surface, and Anthropic's recommended AssemblyAI/Deepgram/Whisper partnership (per `docs.anthropic.com/audio`). The cluster doctrine accumulates: every retrieval-augmented affordance that exists in 2025+ coding-agent harnesses must have a typed slot in the Rust data model, must traverse the wire via either `application/json` (chat-completion / models / batch / embeddings) or `multipart/form-data` (Files API / audio transcription) content-types, must round-trip cleanly through both native and openai-compat lanes (distinguishing the OpenAI side from the ElevenLabs/Cartesia/Deepgram/AssemblyAI/Speechmatics third-lane partners which each require their own client impl), must have a CLI subcommand surface AND a slash command surface that match each other, must accommodate provider-asymmetric coverage with explicit `*Error::Unsupported { recommendation }` returns where the canonical provider does not offer the endpoint, must thread modalities through the request struct for hybrid-modality endpoints (gpt-4o-audio-preview / gpt-4o-realtime-preview), and must have symmetric content-block-taxonomy coverage on both input and output sides for full-duplex modalities (audio is the first cluster member where this matters because audio is bidirectional in the gpt-4o-audio voice loop, distinguishing #225 from #220's image-input-only and #224's embedding-output-only single-direction modalities). The ninth axis — symmetric-modality-input-output-content-block-taxonomy — is novel in the cluster and motivates a new doctrine entry: any new endpoint family with full-duplex modality coverage (audio-input-AND-audio-output, video-input-AND-video-output, image-input-AND-image-edit-output) must have its content-block-taxonomy modeled symmetrically on both `InputContentBlock` and `OutputContentBlock`, distinct from prior cluster members which have either input-only (#220 image), output-only (#214 reasoning_content, #224 embedding-vector), or stateless (#221/#222/#223) modality coverage.
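A minimal sketch of the `supports_modality` capability flag the deeper fix names — `Text` and `Audio` appear in the catalogue above, while the remaining variants and the text-only default are assumptions about where the `Multimedia` module would grow:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Modality {
    Text,
    Image,
    Audio,
    Video,
    Embedding,
}

pub trait Provider {
    // ...send_message / stream_message / audio methods elided...

    // Default: text-only, which is what all three current lanes actually support.
    fn supports_modality(&self, modality: Modality) -> bool {
        matches!(modality, Modality::Text)
    }
}

// An audio-capable lane would override, e.g.:
// impl Provider for OpenAiCompatClient {
//     fn supports_modality(&self, m: Modality) -> bool {
//         matches!(m, Modality::Text | Modality::Image | Modality::Audio | Modality::Embedding)
//     }
// }
```

A per-modality boolean keeps the dispatch layer honest: a slash command like `/speak` can probe the active lane and fall back to the partner routing in (g) instead of discovering mid-request that the provider is text-only.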
Distinct from #220's `/image` and `/screenshot` (advertised, gated under STUB_COMMANDS, returns clear unsupported error — but the underlying capability is uniform across providers, no provider-asymmetric-delegation), #221's batch dispatch (per-request synchronous JSON only, uniform across providers, no transport-plumbing-extension), #222's model discovery (GET-only JSON catalog, uniform across providers, no transport-plumbing-extension), #223's Files API (multipart-form-data uploads, uniform across providers — just different beta header on Anthropic, no provider-asymmetric-delegation), and #224's Embeddings API (provider-asymmetric-delegation with Voyage-AI-third-lane, JSON-only, no transport-plumbing-extension, no advertised-but-unbuilt-slash-commands), #225's Audio API is the **first cluster member where five independent shape-axes converge** before any of the higher-level surfaces can ship — the largest fusion-shape gap catalogued so far, the upstream prerequisite of every voice-driven coding-agent affordance, and the first cluster member where the symmetric-modality-input-output content-block-taxonomy axis is introduced.

**Status:** Open. No code changed. Filed 2026-04-26 03:36 KST. Branch: feat/jobdori-168c-emission-routing. HEAD: c01b470. Sibling-shape cluster (silent-fallback / silent-drop / silent-strip / silent-misnomer / silent-shadow / silent-prefix-mismatch / structural-absence / silent-zero-coercion / silent-content-discard / silent-header-discard / silent-tier-absence / silent-finish-mistranslation / silent-capability-absence / silent-false-positive-opt-in / advertised-but-unbuilt / endpoint-family-level-absence / advertised-but-rerouted / endpoint-family-level-absence-with-transport-plumbing-absence / endpoint-family-level-absence-with-provider-asymmetric-delegation / nine-layer-fusion-shape): #201/#202/#203/#206/#207/#208/#209/#210/#211/#212/#213/#214/#215/#216/#217/#218/#219/#220/#221/#222/#223/#224/#225 — twenty-four pinpoints. Wire-format-parity cluster grows to fifteen: #211 (max_completion_tokens) + #212 (parallel_tool_calls) + #213 (cached_tokens response-side) + #214 (reasoning_content) + #215 (Retry-After) + #216 (service_tier + system_fingerprint) + #217 (finish_reason taxonomy) + #218 (response_format / output_config / refusal) + #219 (cache_control request-side) + #220 (image content block + media_type) + #221 (Message Batches API) + #222 (Models list endpoint) + #223 (Files API + multipart-form-data transport plumbing) + #224 (Embeddings API + EmbeddingRequest + EmbeddingResponse + Voyage AI third-lane routing + provider-asymmetric-delegation pattern) + #225 (Audio API + TranscriptionRequest + SpeechRequest + AudioVoice + AudioFormat + AudioMediaType + AudioSource + Modality + AudioRequestConfig + InputContentBlock::Audio + OutputContentBlock::Audio + multipart-form-data audio-upload + six-partner provider-asymmetric-delegation + nine-layer-fusion-shape). Capability-parity cluster grows to seven: #218 (structured outputs) + #220 (multimodal input) + #221 (batch dispatch) + #222 (model discovery) + #223 (file management) + #224 (embeddings + RAG prerequisite) + #225 (audio + voice-loop prerequisite, the first cluster member with full-duplex symmetric-input-output modality coverage) — seven members, all four-or-more-layer structural absences.
Cross-cutting-data-pipeline cluster grows to two: #224 (RAG prerequisite, semantic-similarity manifold) + #225 (voice-loop prerequisite, full-duplex audio bidirectional modality, the upstream root cause of every speech-driven coding-agent affordance). Multimodal-IO cluster grows to three: #220 (image input only, output is JSON markdown) + #224 (embedding output only, fixed-dimensional float vector) + #225 (audio input AND output, the first cluster member with full-duplex bidirectional modality where the same content-block-taxonomy axis applies to both InputContentBlock and OutputContentBlock variants). Advertised-but-unbuilt cluster grows to four: #220 (`/image`+`/screenshot` ×2) + #223 (`/files` ×1) + #225 (`/voice`+`/listen`+`/speak` ×3, the largest single-pinpoint count catalogued — strict-superset of #220's ×2 and #223's ×1). Multipart-transport cluster grows to two: #223 (Files API binary upload via /v1/files) + #225 (Audio transcription binary upload via /v1/audio/transcriptions, a strict-prerequisite-disjoint extension because audio-files do not need to be persisted via Files API for one-shot transcription — they're streamed inline as multipart/form-data per Whisper API spec, meaning #225 needs multipart-transport-plumbing even if #223's Files API surface is shipped first). Provider-asymmetric-delegation cluster grows to two: #224 (Voyage-AI single-partner-recommendation for embeddings) + #225 (ElevenLabs/Cartesia/PlayHT/Deepgram/AssemblyAI/Speechmatics six-plus-partner-set for TTS+STT, the largest partner-set in the surveyed ecosystem because audio is the most-fragmented modality across third-party providers). Nine-layer-fusion-shape (endpoint-URL-set-of-three [/v1/audio/transcriptions + /v1/audio/translations + /v1/audio/speech] + multipart-form-data-transport-plumbing + data-model-taxonomy-with-input-AND-output-content-blocks + modalities-request-side-opt-in + Provider-trait-method-set-of-three-with-Unsupported-fallback + ProviderClient-enum-dispatch-with-six-partner-third-lanes + advertised-but-unbuilt-slash-commands-×3 + CLI-subcommand-surface + pricing-tier-with-per-minute-and-per-million-chars-and-per-million-audio-tokens-compound-cost-model) is the largest single-pinpoint fusion catalogued, fusing #223's transport-plumbing axis + #224's provider-asymmetric-delegation axis + #220's advertised-but-unbuilt-slash-commands axis + #218's modalities-request-side axis + the new symmetric-input-output content-block-taxonomy axis (#225's first-of-its-kind contribution to the cluster doctrine, since prior cluster members have either input-only [#220] or output-only [#214] or stateless [#221/#222/#223] or input-with-fixed-output-vector [#224] modality coverage). 
Distinct from prior single-field (#211/#212/#214) / response-only (#213/#207) / header-only (#215) / three-dimensional (#216) / classifier-leakage (#217) / four-layer (#218) / false-positive-opt-in (#219) / five-layer-feature-absence (#220) / seven-layer-endpoint-family-absence (#221) / eight-layer-endpoint-family-absence-with-misleading-alias (#222) / seven-layer-endpoint-family-absence-with-transport-plumbing-absence (#223) / seven-layer-endpoint-family-absence-with-provider-asymmetric-delegation (#224) members; the nine-layer-fusion-shape is novel and applies to follow-on candidate "Image-generation API typed taxonomy is absent" (`/v1/images/generations` + `/v1/images/edits` + `/v1/images/variations`, also provider-asymmetric — Anthropic does not offer image generation, OpenAI offers GA dall-e-3 + dall-e-2 + gpt-image-1, Google offers Imagen, recommended-partners include Stability AI / Midjourney / Black Forest Labs / Ideogram, and `/v1/images/edits` requires multipart-form-data with binary image+mask uploads — sibling fusion shape but with image-instead-of-audio modality, JSON-with-base64-or-url-output instead of binary-audio-output, and no symmetric input-AND-output content-block-taxonomy axis because images are output-only in the gpt-image-1 generation flow rather than full-duplex like gpt-4o-audio's bidirectional voice loop) — open candidate for #226. The fusion-shape pattern recurs across every modality-bearing endpoint family that combines provider-asymmetric coverage with multipart-transport needs and advertised-but-unbuilt-slash-command-clusters and symmetric-modality-input-output coverage, and #225 is the first cluster member where all five axes converge in a single pinpoint — the largest fusion-shape gap catalogued so far, the upstream prerequisite of every voice-driven coding-agent affordance, and the first cluster member where the symmetric-modality-input-output content-block-taxonomy axis is introduced.

🪨