Real-Time Streaming Support

This document explains how real‑time audio streaming works in iX Hello vNext and clarifies the differences between the two AI model options: GPT‑4o Mini and GPT-Real‑Time. Understanding the streaming behavior—especially for input audio—is crucial for designing conversational experiences with minimal latency.

Overview

The demo in this section shows the difference between non‑streaming and streaming system output. iX Hello vNext’s streaming behavior may not exactly match that demo, but it delivers clear improvements in user experience, especially with the GPT-Real‑Time model.

The key distinction between the two supported models:

| Model | Supports Input Audio Streaming? | Supports Real-Time Output Streaming? |
| --- | --- | --- |
| GPT‑4o Mini | No | Yes (SSE-based text streaming) |
| GPT-Real‑Time | Yes | Yes (WebSocket-based audio/text streaming) |
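The comparison above can also be written down as a small lookup structure, which is sometimes handy when routing calls by capability. This is only an illustrative sketch: the class name, field names, and dictionary keys are hypothetical and are not part of iX Hello vNext; only the values come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCapabilities:
    """Streaming capabilities of one model option, as listed in the table."""
    name: str
    input_audio_streaming: bool
    output_transport: str  # "SSE" or "WebSocket"
    output_payload: str    # what is streamed back to the Runtime


# Values copied from the table; the dictionary keys are hypothetical identifiers.
CAPABILITIES = {
    "gpt-4o-mini": ModelCapabilities("GPT-4o Mini", False, "SSE", "text"),
    "gpt-realtime": ModelCapabilities("GPT-Real-Time", True, "WebSocket", "audio/text"),
}


def models_with_input_streaming() -> list:
    """Return the display names of models that accept streamed input audio."""
    return [c.name for c in CAPABILITIES.values() if c.input_audio_streaming]


if __name__ == "__main__":
    print(models_with_input_streaming())
```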

2. Input Audio Flow with GPT‑4o Mini (Non‑Streaming Input)

GPT‑4o Mini is the “regular AI model” used in iX Hello vNext. It is smart, fast, and cheap, but it does NOT listen to audio in real time.

It works like this:

  1. The caller speaks.

  2. The system listens to the whole sentence and converts it to text.

  3. That text is sent to GPT‑4o Mini, and the model replies with text.

  4. The system then reads the reply aloud to the caller.

Below is the simplified architecture for calls using GPT‑4o Mini, focusing only on the input audio flow.

2.1 Sequence

  1. Voice Activity Detection (VAD): Before playing the prompt, the Media Server activates VAD to detect when the caller begins speaking.

  2. Audio collection via REST: Once VAD detects speech, the caller’s audio is streamed as chunked audio over a REST API to the gpt-4o-transcribe model.

  3. Transcription (non-streaming): The transcription model waits until the caller finishes speaking. Only after the caller stops does it return one complete sentence.

    Example: If the caller says, “I’d like to check my balance,” the model sends back the entire sentence at once.

  4. Text delivery to Runtime: The Channel Gateway receives the full text and forwards it via WebSocket to the Runtime Engine.

  5. LLM response generation (SSE streaming): The Runtime sends the user’s text to GPT‑4o Mini and receives the reply as an SSE (Server-Sent Events) stream. GPT‑4o Mini starts sending the reply bit by bit, so the system does not need to wait for the whole answer.

  6. Text-to-Speech (11 Labs): The full final text is forwarded to 11 Labs for audio generation and playback to the caller.
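The sequence above can be sketched as a single turn handler. This is a minimal illustration, not the iX Hello vNext implementation: `handle_turn`, `fake_transcribe`, and `fake_stream_reply` are hypothetical names, and the stand-in functions replace the real transcription, LLM, and text-to-speech services so the sketch runs on its own.

```python
from typing import Callable, Iterable, Iterator


def handle_turn(
    audio_chunks: Iterable[bytes],
    transcribe: Callable[[Iterable[bytes]], str],
    stream_reply: Callable[[str], Iterator[str]],
    speak: Callable[[str], None],
) -> str:
    """Run one caller turn through the non-streaming input flow."""
    # Steps 2-3: transcription consumes every chunk, then returns one sentence.
    user_text = transcribe(audio_chunks)
    # Step 5: the reply arrives piece by piece (SSE in the real system).
    reply = "".join(stream_reply(user_text))
    # Step 6: only the full final text goes to text-to-speech.
    speak(reply)
    return reply


# Hypothetical stand-ins so the sketch runs without any external service.
def fake_transcribe(audio_chunks: Iterable[bytes]) -> str:
    for _ in audio_chunks:  # nothing is returned until the audio is exhausted
        pass
    return "I'd like to check my balance"


def fake_stream_reply(user_text: str) -> Iterator[str]:
    yield from ("Sure, ", "checking ", "your balance.")


if __name__ == "__main__":
    spoken = []
    handle_turn([b"\x00\x01", b"\x02\x03"], fake_transcribe, fake_stream_reply, spoken.append)
    print(spoken)
```

Note that the LLM is not called until `transcribe` returns, which is the blocking step this flow is characterized by.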

2.2 Key Characteristics of GPT‑4o Mini Input Flow

  • Input is NOT streamed to the LLM.

  • Transcription occurs first, blocking until completion.

  • Only the output is streamed.

  • In the architecture diagram, all components marked in red correspond to non‑streaming behavior.
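Since the output side of this flow is delivered over SSE, it may help to see how an SSE stream is split into events. The sketch below is a small, generic parser for the `data:` field of the Server-Sent Events format. The function name and the sample payloads are made up for illustration, and the actual payload format used between the Runtime and GPT‑4o Mini is not specified here.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each event in a Server-Sent Events stream."""
    data_lines = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if data_lines:
                yield "\n".join(data_lines)
                data_lines = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # the format strips one leading space
        if field == "data":
            data_lines.append(value)
    # An event not terminated by a blank line is discarded.


if __name__ == "__main__":
    sample = ["data: Your", "", ": keep-alive", "data: balance", ""]
    for payload in iter_sse_data(sample):
        print(repr(payload))
```

Each yielded payload can be handled as soon as it arrives, which is why the system does not have to wait for the whole answer.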

Demo on GPT‑4o Mini (non-streaming audio input):


3. Input Audio Flow with GPT-Real‑Time (Fully Streaming)

With GPT-Real‑Time, both input and output are streamed over WebSockets.

3.1 Sequence

  1. Voice Activity Detection: The Media Server still uses VAD to detect when the caller starts speaking.

  2. Real-time audio streaming to Channel Gateway: As soon as audio packets arrive, they are streamed immediately to the Channel Gateway.

  3. WebSocket audio streaming to Runtime: The Channel Gateway forwards the live audio packets to the Runtime Engine.

  4. WebSocket streaming to Azure OpenAI: The Runtime streams input audio directly to the GPT-Real‑Time model.

  5. Immediate real-time response: The model streams output (text + optional synthesized audio hints) back over WebSocket.

  6. Final audio playback: The Runtime sends the final text to 11 Labs for TTS, and the audio is played back to the caller.
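The hop-by-hop forwarding in the sequence above can be sketched with asyncio queues standing in for the WebSocket connections. This is only a simulation under stated assumptions: the function names, the `END` sentinel, and the use of in-process queues are hypothetical and do not reflect the actual iX Hello vNext components or the Azure OpenAI API.

```python
import asyncio

END = None  # hypothetical sentinel marking the end of the caller's audio


async def media_server(packets, out: asyncio.Queue, log: list) -> None:
    """Emit audio packets one at a time, as a live call would."""
    for packet in packets:
        log.append(("media_server", packet))
        await out.put(packet)
        await asyncio.sleep(0)  # yield so downstream hops can run
    await out.put(END)


async def forward(source: asyncio.Queue, sink: asyncio.Queue, log: list, name: str) -> None:
    """Relay each packet as soon as it arrives, without buffering the utterance."""
    while True:
        packet = await source.get()
        await sink.put(packet)
        if packet is END:
            return
        log.append((name, packet))


async def model(source: asyncio.Queue, log: list) -> list:
    """Stand-in for the model: consumes audio while the caller is still speaking."""
    received = []
    while (packet := await source.get()) is not END:
        log.append(("model", packet))
        received.append(packet)
    return received


async def run_call(packets):
    log = []
    q1, q2, q3 = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    results = await asyncio.gather(
        media_server(packets, q1, log),
        forward(q1, q2, log, "channel_gateway"),
        forward(q2, q3, log, "runtime"),
        model(q3, log),
    )
    return results[3], log


if __name__ == "__main__":
    received, log = asyncio.run(run_call([b"pkt1", b"pkt2", b"pkt3"]))
    for entry in log:
        print(entry)
```

The log shows the model receiving early packets before the Media Server has sent the last one, which is the property that distinguishes this flow from the buffered GPT‑4o Mini input path.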

3.2 Key Characteristics

  • Input audio is streamed end-to-end.

  • Media Server and Channel Gateway never see the transcribed text—they only handle audio bytes.

  • Only Runtime receives text, because the model outputs it.

Demo on GPT-Real‑Time (fully streaming audio):


4. Summary of the Difference

GPT‑4o Mini (Non‑Streaming Input)

Has to wait for the caller to finish speaking → then transcribe → then process → then respond. ➡️ Takes around 2–3 seconds after the caller stops talking.

GPT-Real‑Time (Streaming Input)

Starts processing audio while the caller is still speaking. ➡️ Starts giving a response about 1 second after the caller stops talking.

Overall:

GPT-Real‑Time is roughly 1–2 seconds faster, more consistent, and doesn’t slow down even with long user inputs.
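The "1–2 seconds faster" figure follows directly from the two response delays quoted above. The short sketch below just makes that subtraction explicit. The constant and function names are hypothetical, and the inputs are the approximate figures from this section rather than measured values.

```python
# Approximate delays after the caller stops talking, taken from this section.
NON_STREAMING_DELAY_S = (2.0, 3.0)  # GPT-4o Mini: around 2-3 seconds
STREAMING_DELAY_S = 1.0             # GPT-Real-Time: about 1 second


def saving_range(non_streaming: tuple, streaming: float) -> tuple:
    """Return the (min, max) seconds saved by the streaming flow."""
    low, high = non_streaming
    return (low - streaming, high - streaming)


if __name__ == "__main__":
    print(saving_range(NON_STREAMING_DELAY_S, STREAMING_DELAY_S))
```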

| Aspect | GPT‑4o Mini | GPT-Real‑Time |
| --- | --- | --- |
| Input Audio Streaming | ❌ No | ✔️ Yes |
| Transcription | Required before LLM call | Optional but recommended |
| Output Streaming | ✔️ SSE | ✔️ WebSocket |
| Latency | Higher | Lower |
| Real-Time Responsiveness | Limited | Near-instant |


5. Conclusion

The decision between GPT‑4o Mini and GPT-Real‑Time comes down to the required level of responsiveness:

  • GPT‑4o Mini is best for non-streaming input where cost matters. It is designed for basic text replies and suits simple interactions. Its accuracy varies from one interaction to the next, and it struggles with transcription, especially when the caller switches languages. This unpredictability undermines production reliability. It is budget-friendly, but the quality trade-off is evident, so it is ideal only when cost is more important than conversation quality.

  • GPT-Real-Time is optimized for ultra-low latency, seamless conversational exchanges, and instantaneous system responses. It delivers superior accuracy, consistency, and reliability. It efficiently manages language switching and transcription and generates stable, high-quality responses for both brief and extended inputs. It's ideal for customer-facing and production environments where performance is critical. While it's more expensive, the enhancements in user experience and precision often justify the cost. Best suited for applications requiring exceptional performance, natural dialog, and dependability.
