Voice and multimodal

Voice is a first-class peer to text in the GECX Headless Chat SDK. The same ChatMessage envelope carries voice parts, the same IdentityManager owns the session identity, and the same ChatGovernance exports and forgets transcripts and (optionally) audio bytes. Vision (image input + output) ships as a sibling part type so commerce and support copilots can show and accept images natively.

What voice adds

Five new ChatMessagePart variants live alongside the existing 16:

  • audio-input — user-captured audio. Status flows capturing → completed | cancelled | failed. retained: true only when the host has granted voice_recording consent at capture time; otherwise the audio bytes are session-ephemeral.
  • audio-output — model-generated audio. Status flows streaming → completed | interrupted. playedUpToMs is sampled from the playback element on barge-in and lets the provider truncate model history at the audio offset the user actually heard.
  • transcript — STT output. interim: true parts are replaced in-place and flip to false when the provider finalizes. role: 'user' | 'assistant' distinguishes who spoke.
  • audio-cue — side-channel signals: end-of-turn, barge-in, silence-threshold-hit, speech-started, session-ready, session-ended. The default renderer is a no-op; consumers can subscribe to drive overlays or analytics.
  • vision — first-class image input/output, separate from generic FilePart. kind: 'input' carries user-uploaded images; kind: 'output' carries model-returned images. altText is required for accessibility.

All five get predicates (isAudioInputPart, etc.), builders, JSON schema entries, drift-test coverage, and default renderers in <MessagePart>. They live alongside the rest of the part-types catalog — sentiment and intent signal parts and the computer-use surface part flow through the same ChatMessage.parts[] array.
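
For orientation, here is how a consumer might narrow parts with those predicates. This is a sketch: the import subpath and the exact fields on each part (text, role, interim, status, retained) are abbreviations of the shapes described above, not the full published types.

// Sketch only: assumes the parts carry at least the fields documented above.
import { isAudioInputPart, isTranscriptPart } from 'gecx-chat';
import type { ChatMessagePart } from 'gecx-chat';

function describePart(part: ChatMessagePart): string {
  if (isTranscriptPart(part)) {
    // interim transcripts are replaced in-place on finalization
    return `${part.role}: ${part.text}${part.interim ? ' (interim)' : ''}`;
  }
  if (isAudioInputPart(part)) {
    return `audio-input [${part.status}] retained=${part.retained === true}`;
  }
  return part.type;
}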

The voice subsystem

VoiceSession is the framework-neutral facade. It owns:

  • The VoiceProvider lifecycle (open / close).
  • The mic MediaStream in the browser.
  • The barge-in arbiter: when the provider signals speech-started during a streaming AudioOutputPart, the arbiter aborts the in-flight chat-side AbortController, emits audio-cue: barge-in, marks the in-flight output as interrupted, and stamps playedUpToMs for upstream truncation. Target latency: <300 ms from cue to abort. (The sequence is sketched after this list.)
  • Identity inheritance: VoiceSession borrows chatSession.identity verbatim. No new identity primitives.
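
Condensed, the arbiter's reaction to speech-started looks roughly like this. The types are abbreviations and the real arbiter lives inside VoiceSession; only the sequence of steps mirrors the documented behavior.

// Illustrative only; the real arbiter is internal to VoiceSession.
type StreamingAudioOutput = {
  status: 'streaming' | 'completed' | 'interrupted';
  playedUpToMs?: number;
};

function onSpeechStarted(
  output: StreamingAudioOutput | null,
  chatAbort: AbortController,
  playback: HTMLAudioElement,
  emitCue: (cue: 'barge-in') => void,
): void {
  if (output?.status !== 'streaming') return; // nothing to interrupt
  chatAbort.abort();                          // kill the in-flight chat turn
  emitCue('barge-in');                        // audio-cue part for subscribers
  output.status = 'interrupted';
  // offset the user actually heard, for upstream history truncation
  output.playedUpToMs = Math.round(playback.currentTime * 1000);
}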

VoiceProvider is the BYO seam. Capabilities drive dispatch (sketched after this list):

  • realtime: true → openRealtimeSession() is used; one bidi connection carries everything.
  • realtime: false + stt + tts: true → pipeline path (transcribe → chat → synthesize). The pipeline path is a contract-only interface in v1; customers wire it themselves.
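
The dispatch decision can be pictured like this. The capability flags follow the list above, but the provider shape is abbreviated, not the published VoiceProvider interface.

// Abbreviated provider shape; capability flags as documented above.
interface VoiceProviderLike {
  capabilities: { realtime: boolean; stt: boolean; tts: boolean };
  openRealtimeSession(): Promise<unknown>;
}

function openVoicePath(provider: VoiceProviderLike): Promise<unknown> {
  if (provider.capabilities.realtime) {
    // one bidirectional connection carries audio, transcripts, and cues
    return provider.openRealtimeSession();
  }
  // transcribe → chat → synthesize is contract-only in v1: the SDK defines
  // the interface, and hosts wire the three stages themselves
  return Promise.reject(new Error('pipeline path: bring your own wiring'));
}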

The bundled providers are:

  Provider                          Subpath              Use
  createMockVoiceProvider           gecx-chat/testing    Scripted timer replay. Zero browser deps. Used in vitest.
  createWebAudioMockVoiceProvider   gecx-chat/voice      Same scenarios, real AudioContext, synthetic tones. Used in the showcase.
  createGeminiLiveProvider          gecx-chat/voice      Real WSS to Gemini Live. Default production provider.

One-line configuration

The simplest way to opt into voice is the voice field on ChatClientConfig:

const client = createChatClient({
  // ...
  voice: 'auto',
});

'auto' picks a provider by environment: mock in test, web-audio mock in dev, Gemini Live in prod/staging. An explicit VoiceConfig gives full control.
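
A sketch of an explicit configuration: voice and tokenEndpoint are documented on this page, while the surrounding field layout is an assumption to check against the published VoiceConfig type.

const client = createChatClient({
  // ...
  voice: {
    provider: 'gemini-live',            // skip 'auto' and pin the provider
    tokenEndpoint: '/api/voice-token',  // required in production/staging
  },
});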

chat.voice is a lazy getter. Configuring voice does not request the microphone, open a WebSocket, or run the provider factory. The factory only runs when something first reads chat.voice, so hosts that gate voice behind a button can defer permission requests until the user clicks. This is why the <VoiceToggle voice={chat.voice} /> React component takes the session through a prop: the toggle's first read is what triggers setup.
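
In practice the deferral looks like this, assuming client is the ChatClient from the configuration example above:

// Nothing voice-related has run yet: no mic prompt, no WebSocket.
document.querySelector('#talk')!.addEventListener('click', async () => {
  const voice = client.voice; // first read: the provider factory runs now
  await voice.start();        // permission prompt and session open happen here
});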

validateConfig() surfaces a VOICE_TOKEN_ENDPOINT_MISSING error in production/staging when an explicit gemini-live provider lacks a tokenEndpoint. gecx doctor checks the voice token endpoint when --token-endpoint is supplied; gecx scaffold --with-voice generates the page, voice config, and app/api/voice-token/route.ts from the proxy reference.
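
The scaffolded route is essentially a thin server-side proxy so the real API key never reaches the browser. A hand-rolled sketch follows; mintEphemeralToken and the GEMINI_API_KEY variable name are hypothetical stand-ins for whatever the proxy reference actually does.

// app/api/voice-token/route.ts (sketch; the scaffold generates the real one)
// mintEphemeralToken is hypothetical: it stands in for the proxy-reference
// exchange that turns a server-held API key into a short-lived client token.
declare function mintEphemeralToken(apiKey: string): Promise<string>;

export async function POST(): Promise<Response> {
  const token = await mintEphemeralToken(process.env.GEMINI_API_KEY!);
  return Response.json({ token });
}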

Latency instrumentation

VoiceSession emits a metrics event with firstAudioMs — the latency from start() to the first audible frame. The showcase /voice page renders this as a live overlay against a sub-400 ms target. Pipe the metric into your own analytics sink if you need to track voice latency in production.
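
A sketch of piping the metric into a sink: the on('metrics') subscription shape is an assumption; firstAudioMs is the documented field.

// Assumed listener API; firstAudioMs is the documented field.
declare const chat: {
  voice: { on(ev: 'metrics', cb: (m: { firstAudioMs: number }) => void): void };
};
declare const analytics: { track(name: string, value: number): void };

chat.voice.on('metrics', ({ firstAudioMs }) => {
  if (firstAudioMs > 400) {
    console.warn(`first audio took ${firstAudioMs}ms, over the 400ms target`);
  }
  analytics.track('voice.first_audio_ms', firstAudioMs); // your own sink
});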

Permissions

Microphone access is gated by PermissionManager. VoiceSession.start() calls permissions.ensure('microphone') before opening the realtime session and passes the captured MediaStream through as RealtimeOpenOpts.inputStream. If the user denies access (or HTTPS is not available), ensure() throws the appropriate PERMISSION_* error and the session never opens.
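
A host can branch on the failure. The PERMISSION_* prefix is documented; reading it from a code property on the thrown error is an assumption.

// PERMISSION_* is documented; err.code is an assumed error shape.
declare const chat: { voice: { start(): Promise<void> } };

try {
  await chat.voice.start(); // ensure('microphone') runs before any socket opens
} catch (err) {
  const code = (err as { code?: string }).code ?? '';
  if (code.startsWith('PERMISSION_')) {
    // mic denied or no HTTPS: show remediation instead of retry-looping
    console.warn('voice unavailable:', code);
  } else {
    throw err;
  }
}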

<VoiceToggle> renders a permission probe and inline remediation when access is blocked. See Permissions.

Transports

Two voice transports ship in PR 1:

  • createVoiceWebSocketTransport — extends the existing WebSocket transport patterns (cursor/resume, sequence dedup, reconnect) with audio frame envelopes. Wire protocol is JSON-only in v1; binary frames are a follow-up. Declares capabilities.audio = { input: true, output: true, bargeIn: true }. This transport is for customer-owned voice backends; the bundled Gemini Live adapter bypasses it. (Usage is sketched after this list.)
  • createWebRTCVoiceTransport — interface-only stub. The SDP-exchange + data-channel skeleton compiles but openVoice() throws VOICE_PROVIDER_UNAVAILABLE. Real implementation lands when the SDK adds an OpenAI Realtime adapter or a LiveKit-partner binding.
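
For a customer-owned backend, wiring the WebSocket transport might look like the sketch below. The factory's option names are assumptions to verify against the real signature; only the capability declaration and JSON-only protocol are documented.

// Option names are assumptions; verify against the published factory.
import { createVoiceWebSocketTransport } from 'gecx-chat';

const transport = createVoiceWebSocketTransport({
  url: 'wss://voice.example.com/session', // your backend, not Gemini Live
});

// The transport advertises what the UI may rely on:
//   transport.capabilities.audio → { input: true, output: true, bargeIn: true }
// Cursor/resume, sequence dedup, and reconnect come from the base transport;
// the wire protocol is JSON-only in v1.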

Identity continuity

Voice does not have its own identity. A user who starts a text chat and switches to voice mid-session keeps the same sessionId, the same identityId, and the same X-Ceai-Identity-* headers on the wire. Cross-tab SyncChannel is unchanged. Switching back to text is a single voiceSession.stop() call; the chat session never changed.
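
The invariant in code form; the property paths are abbreviations, the guarantee is the point.

// Property paths abbreviated; same identity before, during, and after voice.
declare const chat: {
  identity: { identityId: string };
  voice: { start(): Promise<void>; stop(): Promise<void> };
};

const before = chat.identity.identityId;
await chat.voice.start(); // voice borrows the chat identity verbatim
await chat.voice.stop();  // back to text
console.assert(chat.identity.identityId === before);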

Governance

Audio is treated as more sensitive than transcripts:

  • Transcripts persist by default. TranscriptPart is text-equivalent and inherits the same retention as TextPart.
  • Audio bytes are session-ephemeral by default. RetentionPolicy.audio defaults to 'ephemeral' — audio is held in session-scoped URL.createObjectURL references and revoked on session end.
  • Opt-in retention requires both the voice_recording consent flag (orthogonal to the existing none|functional|analytics|all posture) AND a non-ephemeral retention.audio setting.
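
Both switches together, in a sketch: voice_recording and retention.audio are the documented names, while the surrounding config layout and the 'persistent' value are assumptions.

const client = createChatClient({
  // ...
  governance: {
    retention: { audio: 'persistent' }, // assumed value; anything non-'ephemeral'
    consent: { voice_recording: true }, // orthogonal to none|functional|analytics|all
  },
});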

Audit events:

  • governance.voice_session_started
  • governance.voice_session_ended
  • governance.voice_recording_consent_granted
  • governance.voice_recording_consent_withdrawn

exportConversation always includes transcripts, and includes audio bytes only when the part carries retained: true. forgetMe always clears both.

Where to go next

Source: docs/concepts/voice.md