Real-Time Streaming Support
This document explains how real‑time audio streaming works in iX Hello vNext and clarifies the differences between the two AI model options: GPT‑4o Mini and GPT-Real‑Time. Understanding the streaming behavior—especially for input audio—is crucial for designing conversational experiences with minimal latency.
1. Overview
Here we showcase the difference between non‑streaming and streaming system output. While iX Hello vNext’s streaming behavior may not exactly match that demo, it delivers clear improvements in user experience—especially with the GPT-Real‑Time model.
The key distinction between the two supported models:
| Model | Input audio streaming | Output streaming |
| --- | --- | --- |
| GPT‑4o Mini | No | Yes (SSE-based text streaming) |
| GPT-Real‑Time | Yes | Yes (WebSocket-based audio/text streaming) |

2. Input Audio Flow with GPT‑4o Mini (Non‑Streaming Input)
GPT‑4o Mini is the “regular AI model” used in iX Hello vNext. It is smart, fast, and cheap — but it does NOT listen to audio in real time.
It works like this:
Caller speaks.
The system listens to the whole sentence and converts it to text.
That text is sent to GPT‑4o Mini, and the model replies back with text.
Then the system reads the reply aloud to the caller.
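The four steps above can be sketched as a single turn-based function. This is only an illustration: `handle_turn` and the three stand-in callables are hypothetical names, not part of iX Hello vNext, and the real transcription, model, and text-to-speech calls are replaced by trivial stubs.

```python
from typing import Callable, Iterable


def handle_turn(
    audio_chunks: Iterable[bytes],
    transcribe: Callable[[bytes], str],
    generate_reply: Callable[[str], str],
    speak: Callable[[str], bytes],
) -> bytes:
    """Run one caller turn: buffer all audio, then transcribe, reply, and synthesize."""
    # 1. Buffer the whole utterance; the model sees nothing while the caller speaks.
    utterance = b"".join(audio_chunks)
    # 2. Convert the complete utterance to text.
    text = transcribe(utterance)
    # 3. Send the text to the model and receive a text reply.
    reply = generate_reply(text)
    # 4. Turn the reply into audio for playback.
    return speak(reply)


# Hypothetical stand-ins so the sketch runs without any external service.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_generate_reply(text: str) -> str:
    return f"You said: {text}"


def fake_speak(text: str) -> bytes:
    return text.encode("utf-8")


if __name__ == "__main__":
    chunks = [b"I'd like to ", b"check my balance"]
    print(handle_turn(chunks, fake_transcribe, fake_generate_reply, fake_speak))
```

The point of the sketch is the ordering: each stage starts only after the previous one has finished, which is where the extra latency of this flow comes from.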
Below is the simplified architecture for calls using GPT‑4o Mini, focusing only on the input audio flow.


2.1 Sequence
Voice Activity Detection (VAD): Before playing the prompt, the Media Server activates VAD to detect when the caller begins speaking.
Audio collection via REST: Once VAD detects speech, the caller’s audio is streamed as chunked audio over a REST API to the gpt-4o-transcribe model.
Transcription (non-streaming): The transcription model waits until the caller finishes speaking. Only after the caller stops does it return one complete sentence.
Example: If the caller says, “I’d like to check my balance,” the model sends back the entire sentence at once.
Text Delivery to Runtime: Channel Gateway receives the full text and forwards it via WebSocket to the Runtime Engine.
LLM Response Generation (SSE Streaming): The Runtime sends the user’s text to GPT‑4o Mini using SSE (Server-Sent Events). GPT‑4o Mini starts sending the reply bit by bit, so the system does not need to wait for the whole answer.
Text-to-Speech (11 Labs): The full final text is forwarded to 11 Labs for audio generation and playback to the caller.
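To make the SSE step more concrete, here is a minimal sketch of how a client could assemble a reply from a Server-Sent Events stream. It assumes the payload shape of OpenAI-style chat completion chunks (`choices[0].delta.content`, terminated by `data: [DONE]`); the function names and the sample lines are hypothetical, and the actual Runtime implementation may differ.

```python
import json
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event from an iterable of text lines."""
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips one optional leading space after the colon.
            if value.startswith(" "):
                value = value[1:]
            buffer.append(value)
        # Other fields (event:, id:, retry:) and comments are ignored here.


def collect_reply(lines: Iterable[str]) -> str:
    """Concatenate streamed text deltas until the [DONE] marker arrives."""
    parts: List[str] = []
    for payload in iter_sse_data(lines):
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            parts.append(delta)  # each delta could be handled as soon as it arrives
    return "".join(parts)


# Hypothetical sample stream, used only to exercise the parser.
SAMPLE_STREAM = [
    'data: {"choices": [{"delta": {"content": "Your balance "}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "is available."}}]}',
    "",
    "data: [DONE]",
    "",
]

if __name__ == "__main__":
    print(collect_reply(SAMPLE_STREAM))
```

Because each delta is available as soon as its event is parsed, the system does not have to wait for the whole answer before it starts working with the text.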
2.2 Key Characteristics of GPT‑4o Mini Input Flow
Input is NOT streamed to the LLM.
Transcription occurs first, blocking until completion.
Only the output is streamed.
All red-marked components in the diagram correspond to non‑streaming behavior.
Demo on GPT‑4o Mini (Non‑Streaming) Audio:
3. Input Audio Flow with GPT-Real‑Time (Fully Streaming)
With GPT-Real‑Time, both input and output are streamed over WebSockets.


3.1 Sequence
Voice Activity Detection: Media Server still uses VAD to detect when the caller starts speaking.
Real-Time Audio Streaming to Channel Gateway: As soon as audio packets arrive, they are streamed immediately to Channel Gateway.
WebSocket Audio Streaming to Runtime: Channel Gateway forwards the live audio packets to the Runtime Engine.
WebSocket Streaming to Azure OpenAI: Runtime streams input audio directly to the GPT-Real‑Time model.
Immediate Real-Time Response: The model streams output (text + optional synthesized audio hints) back over WebSocket.
Final Audio Playback: Runtime sends the final text to 11 Labs for TTS, and the audio is played back to the caller.
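The hop-by-hop forwarding in this sequence can be illustrated with a small in-process simulation. In this sketch `asyncio.Queue` objects stand in for the WebSocket connections, and every name (`media_server`, `relay`, `runtime`, `run_call`) is hypothetical; it shows only the forwarding pattern, not the actual services.

```python
import asyncio
from typing import List, Optional

END: Optional[bytes] = None  # sentinel marking the end of the caller's audio


async def media_server(chunks: List[bytes], out_q: asyncio.Queue) -> None:
    """Push each audio packet downstream as soon as it is available."""
    for chunk in chunks:
        await out_q.put(chunk)
        await asyncio.sleep(0)  # yield so downstream hops can run between packets
    await out_q.put(END)


async def relay(in_q: asyncio.Queue, out_q: asyncio.Queue) -> None:
    """Channel Gateway stand-in: forward packets without buffering or decoding."""
    while True:
        chunk = await in_q.get()
        await out_q.put(chunk)
        if chunk is END:
            return


async def runtime(in_q: asyncio.Queue, received: List[bytes]) -> None:
    """Runtime stand-in: collect packets where they would be streamed to the model."""
    while True:
        chunk = await in_q.get()
        if chunk is END:
            return
        received.append(chunk)


async def run_call(chunks: List[bytes]) -> List[bytes]:
    to_gateway: asyncio.Queue = asyncio.Queue()
    to_runtime: asyncio.Queue = asyncio.Queue()
    received: List[bytes] = []
    # All three hops run concurrently, so packets flow while the caller is still speaking.
    await asyncio.gather(
        media_server(chunks, to_gateway),
        relay(to_gateway, to_runtime),
        runtime(to_runtime, received),
    )
    return received


if __name__ == "__main__":
    print(asyncio.run(run_call([b"pkt-1", b"pkt-2", b"pkt-3"])))
```

Unlike the turn-based flow, no hop waits for the full utterance: each packet is passed on as it arrives, and the intermediate hops only ever handle audio bytes.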
3.2 Key Characteristics
Input audio is streamed end-to-end.
Media Server and Channel Gateway never see the transcribed text—they only handle audio bytes.
Only Runtime receives text, because the model outputs it.
Demo on GPT-Real‑Time (Fully Streaming) Audio:
4. Summary of the Difference
GPT‑4o Mini (Non‑Streaming Input)
Has to wait for the caller to finish speaking → then transcribe → then process → then respond. ➡️ Takes around 2–3 seconds after the user stops talking.
GPT-Real‑Time (Streaming Input)
Starts processing audio while the caller is still speaking. ➡️ Starts giving a response about 1 second after the user stops talking.
Overall:
Real‑time is roughly 1–2 seconds faster, more consistent, and doesn’t slow down even with long user inputs.
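The "1–2 seconds faster" figure follows directly from the two latencies quoted above. As a quick check, using only those approximate numbers (the function name is hypothetical):

```python
from typing import Tuple


def latency_saving(mini_range_s: Tuple[float, float], realtime_s: float) -> Tuple[float, float]:
    """Return the (min, max) seconds saved by the real-time flow."""
    low, high = mini_range_s
    return (low - realtime_s, high - realtime_s)


if __name__ == "__main__":
    # 2-3 s for GPT-4o Mini versus about 1 s for GPT-Real-Time, as stated above.
    print(latency_saving((2.0, 3.0), 1.0))
```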
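| Feature | GPT‑4o Mini | GPT-Real‑Time |
| --- | --- | --- |
| Input Audio Streaming | ❌ No | ✔️ Yes |
| Transcription | Required before LLM call | Optional but recommended |
| Output Streaming | ✔️ SSE | ✔️ WebSocket |
| Latency | Higher | Lowest |
| Real-Time Responsiveness | Limited | Near-instant |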
5. Conclusion
The decision between GPT‑4o Mini and GPT-Real‑Time comes down to the required level of responsiveness:
GPT‑4o Mini is best for non-streaming input where cost matters. It's designed for basic text replies, suitable for simple interactions. Performance varies; it can be accurate or prone to errors. It struggles with transcription, especially language-switching. Its behavior is unpredictable, which undermines production reliability. Although budget-friendly, quality sacrifices are evident. Ideal only when cost is more important than conversation quality.
GPT-Real-Time is optimized for ultra-low latency, seamless conversational exchanges, and instantaneous system responses. It delivers superior accuracy, consistency, and reliability. It efficiently manages language switching and transcription and generates stable, high-quality responses for both brief and extended inputs. It's ideal for customer-facing and production environments where performance is critical. While it's more expensive, the enhancements in user experience and precision often justify the cost. Best suited for applications requiring exceptional performance, natural dialog, and dependability.