# Voice and multimodal
Voice is a first-class peer to text in the GECX Headless Chat SDK. The same `ChatMessage` envelope carries voice parts, the same `IdentityManager` owns the session identity, and the same `ChatGovernance` exports and forgets transcripts and (optionally) audio bytes. Vision (image input + output) ships as a sibling part type so commerce and support copilots can show and accept images natively.
## What voice adds
Five new `ChatMessagePart` variants live alongside the existing 16:

- `audio-input` — user-captured audio. Status flows `capturing → completed | cancelled | failed`. `retained: true` only when the host has granted `voice_recording` consent at capture time; otherwise the audio bytes are session-ephemeral.
- `audio-output` — model-generated audio. Status flows `streaming → completed | interrupted`. `playedUpToMs` is sampled from the playback element on barge-in and lets the provider truncate model history at the audio offset the user actually heard.
- `transcript` — STT output. `interim: true` parts are replaced in place and flip to `false` when the provider finalizes. `role: 'user' | 'assistant'` distinguishes who spoke.
- `audio-cue` — side-channel signals: `end-of-turn`, `barge-in`, `silence-threshold-hit`, `speech-started`, `session-ready`, `session-ended`. The default renderer is a no-op; consumers can subscribe to drive overlays or analytics.
- `vision` — first-class image input/output, separate from the generic `FilePart`. `kind: 'input'` carries user-uploaded images; `kind: 'output'` carries model-returned images. `altText` is required for accessibility.
All five get predicates (`isAudioInputPart`, etc.), builders, JSON schema entries, drift-test coverage, and default renderers in `<MessagePart>`. They live alongside the rest of the part-types catalog — sentiment and intent signal parts and the computer-use surface part flow through the same `ChatMessage.parts[]` array.
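As a rough sketch of what these variants carry (the field names below are illustrative, not the published types; see the message parts reference for the real property tables):

```ts
// Illustrative shapes only; consult the message parts reference for the real types.
type AudioInputPart = {
  type: 'audio-input';
  status: 'capturing' | 'completed' | 'cancelled' | 'failed';
  retained: boolean; // true only with voice_recording consent at capture time
};

type AudioOutputPart = {
  type: 'audio-output';
  status: 'streaming' | 'completed' | 'interrupted';
  playedUpToMs?: number; // stamped on barge-in for upstream history truncation
};

type TranscriptPart = {
  type: 'transcript';
  text: string;
  interim: boolean; // replaced in place; flips to false on finalization
  role: 'user' | 'assistant';
};

type AudioCuePart = {
  type: 'audio-cue';
  cue:
    | 'end-of-turn'
    | 'barge-in'
    | 'silence-threshold-hit'
    | 'speech-started'
    | 'session-ready'
    | 'session-ended';
};

type VisionPart = {
  type: 'vision';
  kind: 'input' | 'output';
  altText: string; // required for accessibility
};
```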
## The voice subsystem
`VoiceSession` is the framework-neutral facade. It owns:
- The `VoiceProvider` lifecycle (open / close).
- The mic `MediaStream` in the browser.
- The barge-in arbiter: when the provider signals `speech-started` during a streaming `AudioOutputPart`, the arbiter aborts the in-flight chat-side `AbortController`, emits `audio-cue: barge-in`, marks the in-flight output as `interrupted`, and stamps `playedUpToMs` for upstream truncation (see the sketch after this list). Target latency: <300 ms from cue to abort.
- Identity inheritance: `VoiceSession` borrows `chatSession.identity` verbatim. No new identity primitives.
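A minimal sketch of the arbiter's reaction to `speech-started`, assuming emitter-style hooks (the `on(...)` surface and the session shape here are assumptions, not the SDK API):

```ts
// Illustrative barge-in handling; hook and field names are assumptions.
function wireBargeIn(
  provider: { on(evt: 'speech-started', cb: () => void): void },
  session: {
    inflightAbort: AbortController | null;
    currentOutput: { status: string; playedUpToMs?: number } | null;
    playbackEl: HTMLAudioElement;
    emitCue(cue: 'barge-in'): void;
  },
) {
  provider.on('speech-started', () => {
    const out = session.currentOutput;
    if (!out || out.status !== 'streaming') return; // nothing to interrupt

    session.inflightAbort?.abort(); // cancel the in-flight chat-side request
    session.emitCue('barge-in');    // surface audio-cue: barge-in to consumers
    out.status = 'interrupted';
    // Sample the playback element so the provider can truncate model history
    // at the offset the user actually heard.
    out.playedUpToMs = Math.round(session.playbackEl.currentTime * 1000);
  });
}
```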
`VoiceProvider` is the BYO seam. Capabilities drive dispatch (see the sketch after this list):
- `realtime: true` → `openRealtimeSession()` is used; one bidi connection carries everything.
- `realtime: false` + `stt: true` + `tts: true` → pipeline path (transcribe → chat → synthesize). The pipeline path is a contract-only interface in v1; customers wire it themselves.
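In code, the dispatch might look like this (a sketch; only `openRealtimeSession()` and the capability flags are named on this page):

```ts
// Illustrative capability-driven dispatch; the surrounding shapes are assumptions.
type VoiceCapabilities = { realtime: boolean; stt?: boolean; tts?: boolean };

function openVoice(provider: VoiceCapabilities & {
  openRealtimeSession?: () => Promise<unknown>;
}) {
  if (provider.realtime && provider.openRealtimeSession) {
    // One bidirectional connection carries audio in/out, transcripts, and cues.
    return provider.openRealtimeSession();
  }
  // realtime: false + stt/tts: the pipeline path (transcribe → chat → synthesize)
  // is a contract-only interface in v1; the host wires it up itself.
  throw new Error('pipeline path is contract-only in v1; wire STT/TTS yourself');
}
```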
The bundled providers are:
| Provider | Subpath | Use |
|---|---|---|
| `createMockVoiceProvider` | `gecx-chat/testing` | Scripted timer replay. Zero browser deps. Used in vitest. |
| `createWebAudioMockVoiceProvider` | `gecx-chat/voice` | Same scenarios, real `AudioContext`, synthetic tones. Used in the showcase. |
| `createGeminiLiveProvider` | `gecx-chat/voice` | Real WSS to Gemini Live. Default production provider. |
## One-line configuration
The simplest way to opt into voice is the `voice` field on `ChatClientConfig`:
```ts
const client = createChatClient({
  // ...
  voice: 'auto',
});
```
`'auto'` picks a provider by environment: mock in test, web-audio mock in dev, Gemini Live in prod/staging. An explicit `VoiceConfig` gives full control.
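For example, wiring the Gemini Live provider explicitly (a sketch; the `provider` key and any factory options beyond `tokenEndpoint` are assumptions):

```ts
import { createChatClient } from 'gecx-chat';
import { createGeminiLiveProvider } from 'gecx-chat/voice';

const client = createChatClient({
  // ...
  voice: {
    // The 'provider' key is illustrative; tokenEndpoint is the option that
    // validateConfig() checks for in production/staging.
    provider: createGeminiLiveProvider({
      tokenEndpoint: '/api/voice-token',
    }),
  },
});
```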
`chat.voice` is a lazy getter. Configuring voice does not request the microphone, open a WebSocket, or run the provider factory. The factory only runs when something first reads `session.voice`. Hosts that gate voice behind a button can defer permission requests until the user clicks. This is why the `<VoiceToggle voice={chat.voice} />` React component takes the session through a prop and the toggle's first read is what triggers setup.
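A sketch of a host-side gate (only `start()` and the lazy getter are from this page; the component shape is illustrative):

```tsx
// Nothing voice-related runs at render time; the first use inside onClick
// runs the provider factory, and start() then requests the microphone.
function VoiceButton({ chat }: { chat: { voice: { start(): Promise<void> } } }) {
  return (
    <button onClick={() => void chat.voice.start()}>
      Start voice
    </button>
  );
}
```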
`validateConfig()` surfaces a `VOICE_TOKEN_ENDPOINT_MISSING` error in production/staging when an explicit `gemini-live` provider lacks a `tokenEndpoint`. `gecx doctor` checks the voice token endpoint when `--token-endpoint` is supplied; `gecx scaffold --with-voice` generates the page, voice config, and `app/api/voice-token/route.ts` from the proxy reference.
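A minimal sketch of what the token route does, assuming it exchanges a server-held credential for a short-lived client token (the upstream URL, env var, and payload shape are assumptions; the file generated by `gecx scaffold --with-voice` is authoritative):

```ts
// app/api/voice-token/route.ts (illustrative only)
export async function POST(): Promise<Response> {
  // Mint a short-lived token server-side so the browser never sees the
  // long-lived API key. The minting endpoint and payload are assumptions.
  const upstream = await fetch('https://token-minter.example.internal/mint', {
    method: 'POST',
    headers: { authorization: `Bearer ${process.env.VOICE_API_KEY}` },
  });
  const { token, expiresAt } = await upstream.json();
  return Response.json({ token, expiresAt });
}
```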
## Latency instrumentation
`VoiceSession` emits a metrics event with `firstAudioMs` — the latency from `start()` to the first audible frame. The showcase `/voice` page renders this as a live overlay against a sub-400 ms target. Pipe the metric into your own analytics sink if you need to track voice latency in production.
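For example, forwarding the metric to an analytics sink (a sketch; the `on('metrics', …)` subscription surface is an assumption, while `firstAudioMs` is the documented field):

```ts
// Both declarations stand in for your real objects.
declare const voiceSession: {
  on(evt: 'metrics', cb: (m: { firstAudioMs: number }) => void): void;
};
declare const analytics: { track(name: string, props: Record<string, unknown>): void };

voiceSession.on('metrics', ({ firstAudioMs }) => {
  // Flag sessions that miss the sub-400 ms target the showcase overlay tracks.
  if (firstAudioMs > 400) console.warn(`slow first audio: ${firstAudioMs} ms`);
  analytics.track('voice_first_audio_ms', { value: firstAudioMs });
});
```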
## Permissions
Microphone access is gated by `PermissionManager`. `VoiceSession.start()` calls `permissions.ensure('microphone')` before opening the realtime session and passes the captured `MediaStream` through as `RealtimeOpenOpts.inputStream`. If the user denies access (or HTTPS is not available), `ensure()` throws the appropriate `PERMISSION_*` error and the session never opens.
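Host code can catch the denial and show remediation (a sketch; the `code` property and `showMicRemediation` helper are assumptions, and the real `PERMISSION_*` names live in the error codes reference):

```ts
declare const chat: { voice: { start(): Promise<void> } };
declare function showMicRemediation(): void; // hypothetical host-side UI hook

try {
  await chat.voice.start(); // internally runs permissions.ensure('microphone') first
} catch (err) {
  const code = (err as { code?: string }).code;
  if (code?.startsWith('PERMISSION_')) {
    showMicRemediation(); // denied mic, or the page is not served over HTTPS
  } else {
    throw err;
  }
}
```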
`<VoiceToggle>` renders a permission probe and inline remediation when access is blocked. See Permissions.
## Transports
Two voice transports ship in PR 1:
- `createVoiceWebSocketTransport` — extends the existing WebSocket transport patterns (cursor/resume, sequence dedup, reconnect) with audio frame envelopes. Wire protocol is JSON-only in v1; binary frames are a follow-up. Declares `capabilities.audio = { input: true, output: true, bargeIn: true }`. This transport is for customer-owned voice backends; the bundled Gemini Live adapter bypasses it.
- `createWebRTCVoiceTransport` — interface-only stub. The SDP-exchange + data-channel skeleton compiles, but `openVoice()` throws `VOICE_PROVIDER_UNAVAILABLE`. The real implementation lands when the SDK adds an OpenAI Realtime adapter or a LiveKit-partner binding.
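For a customer-owned backend, a JSON audio frame might look like this (the envelope fields are assumptions; only the JSON-only v1 wire format and the `capabilities.audio` declaration above come from this page):

```ts
// Illustrative wire envelope for createVoiceWebSocketTransport backends.
interface AudioFrameEnvelope {
  type: 'audio-frame';
  seq: number;           // sequence number, deduped on reconnect
  cursor: string;        // resume cursor, matching the text transport pattern
  direction: 'input' | 'output';
  payloadBase64: string; // audio bytes, base64-encoded until binary frames land
}
```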
## Identity continuity
Voice does not have its own identity. A user who starts a text chat and switches to voice mid-session keeps the same `sessionId`, the same `identityId`, and the same `X-Ceai-Identity-*` headers on the wire. Cross-tab `SyncChannel` is unchanged. Switching back to text is `voiceSession.stop()`; the chat session never changed.
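A sketch of a mid-session modality switch (`start()`/`stop()` and the identity inheritance are from this page; the declared shapes are stand-ins):

```ts
declare const chatSession: { identity: { sessionId: string; identityId: string } };
declare const voiceSession: { start(): Promise<void>; stop(): Promise<void> };

const before = chatSession.identity; // VoiceSession borrows this verbatim

await voiceSession.start();
// ... spoken turns flow through the same ChatMessage envelope ...
await voiceSession.stop();

// Same sessionId, same identityId, same X-Ceai-Identity-* headers throughout.
console.assert(chatSession.identity === before);
```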
## Governance
Audio is treated as more sensitive than transcripts:
- Transcripts persist by default. `TranscriptPart` is text-equivalent and inherits the same retention as `TextPart`.
- Audio bytes are session-ephemeral by default. `RetentionPolicy.audio` defaults to `'ephemeral'` — audio is held in session-scoped `URL.createObjectURL` references and revoked on session end.
- Opt-in retention requires both the `voice_recording` consent flag (orthogonal to the existing `none|functional|analytics|all` posture) AND a non-ephemeral `retention.audio` setting (see the sketch after this list).
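Both conditions together might look like this (a sketch; `RetentionPolicy.audio` and the `voice_recording` flag are documented, but the config keys and `grantConsent` method are assumptions):

```ts
import { createChatClient } from 'gecx-chat';

const client = createChatClient({
  // ...
  governance: {
    retention: { audio: 'retained' }, // any non-'ephemeral' value; default is 'ephemeral'
  },
});

// Audio bytes still stay ephemeral until the user grants the orthogonal flag:
client.governance.grantConsent('voice_recording'); // method name is an assumption
```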
Audit events:
- `governance.voice_session_started`
- `governance.voice_session_ended`
- `governance.voice_recording_consent_granted`
- `governance.voice_recording_consent_withdrawn`
`exportConversation` always includes transcripts, and includes audio bytes only when `retained: true` on the part. `forgetMe` always clears both.
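In practice (a sketch; the `client.governance` surface and argument shapes are assumptions around the two documented methods):

```ts
declare const client: {
  governance: {
    exportConversation(sessionId: string): Promise<unknown>;
    forgetMe(identityId: string): Promise<void>;
  };
};

// Transcripts are always in the export; audio bytes appear only for parts
// captured with retained: true.
const exported = await client.governance.exportConversation('session-123');
console.log(exported);

// forgetMe clears transcripts and audio alike.
await client.governance.forgetMe('identity-456');
```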
## Where to go next
- Voice integration guide — the one-line config and React surface.
- Permissions — how microphone access is acquired before voice opens.
- Message parts reference — full property tables for each new part.
- Error codes reference — voice error codes and remediation.