# Real-Time Streaming Support

This document explains how **real‑time audio streaming** works in iX Hello vNext and clarifies the differences between the two AI model options: **GPT‑4o Mini** and **GPT-Real‑Time**. Understanding the streaming behavior—especially for **input audio**—is crucial for designing conversational experiences with minimal latency.

### **Overview**

The demos on this page contrast *non‑streaming* and *streaming* system output. iX Hello vNext’s streaming behavior may not exactly match the demos, but it delivers clear improvements in user experience, especially with the **GPT-Real‑Time** model.

The key distinction between the two supported models:

| Model             | Supports Input Audio Streaming? | Supports Real-Time Output Streaming?       |
| ----------------- | ------------------------------- | ------------------------------------------ |
| **GPT‑4o Mini**   | No                              | Yes (SSE-based text streaming)             |
| **GPT-Real‑Time** | Yes                             | Yes (WebSocket-based audio/text streaming) |

<figure><img src="/files/sA6601cOFDu4njYXsXa3" alt="" width="563"><figcaption></figcaption></figure>

### **2. Input Audio Flow with GPT‑4o Mini (Non‑Streaming Input)**

**GPT‑4o Mini is the “regular AI model” used in iX Hello vNext.**\
It is smart, fast, and cheap, but it **does NOT listen to audio in real time**.

It works like this:

1. **Caller speaks.**
2. The system **listens to the whole sentence** and converts it to **text**.
3. That text is sent to GPT‑4o Mini, and the model **replies with text**.
4. The system then reads the reply aloud to the caller.
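The turn-based flow above can be sketched in a few lines. This is a minimal illustration, not the iX Hello vNext implementation: `handle_turn` and the `transcribe`, `ask_llm`, and `speak` callables are hypothetical stand-ins for the transcription service, GPT‑4o Mini, and text-to-speech.

```python
from typing import Callable, Iterable, List


def handle_turn(
    audio_chunks: Iterable[bytes],
    transcribe: Callable[[bytes], str],
    ask_llm: Callable[[str], str],
    speak: Callable[[str], None],
) -> str:
    """Run one caller turn the non-streaming way (hypothetical helper)."""
    # Buffer the whole utterance; nothing reaches the LLM while the caller speaks.
    utterance = b"".join(audio_chunks)
    # Convert the complete utterance to text in a single step.
    text = transcribe(utterance)
    # Only now does the LLM see the caller's input, and only as text.
    reply = ask_llm(text)
    # Read the reply aloud to the caller.
    speak(reply)
    return reply


# Stand-in stubs so the sketch runs without any external service.
spoken: List[str] = []
reply = handle_turn(
    audio_chunks=[b"I'd like to ", b"check my balance"],
    transcribe=lambda audio: audio.decode("utf-8"),
    ask_llm=lambda text: f"You said: {text}",
    speak=spoken.append,
)
```

Because every stage waits for the previous one to finish, the caller's perceived delay is the sum of all stages after they stop talking.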

Below is the simplified architecture for calls using **GPT‑4o Mini**, focusing only on the **input audio** flow.

<figure><img src="/files/AOTgwiq6n99JkfsLdu1q" alt=""><figcaption></figcaption></figure>

<figure><img src="/files/NIp7B7hAJh8ZhU8o74J0" alt=""><figcaption></figcaption></figure>

#### **2.1 Sequence**

1. **Voice Activity Detection (VAD)**\
   Before playing the prompt, the Media Server activates VAD to detect when the caller begins speaking.
2. **Audio collection via REST**\
   Once VAD detects speech, the caller’s audio is streamed **as chunked audio** over a REST API to the **`gpt-4o-transcribe`** model.
3. **Transcription (non-streaming)**\
   The transcription model **waits until the caller finishes speaking**.

   Only after the user stops does it return **one complete sentence**.

   Example: If the caller says, “I’d like to check my balance,”\
   → the model sends back the entire sentence at once.
4. **Text Delivery to Runtime**\
   Channel Gateway receives the full text and forwards it via WebSocket to the Runtime Engine.
5. **LLM Response Generation (SSE-Streaming)**

   The Runtime sends the user’s text to **GPT‑4o Mini** using SSE (Server-Sent Events).

   GPT‑4o Mini starts sending the reply **bit by bit**, so the system does not need to wait for the whole answer.
6. **Text-to-Speech (11 Labs)**\
   The full final text is forwarded to 11 Labs for audio generation and playback to the caller.
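To illustrate step 5, here is a minimal sketch of consuming an SSE stream and joining the fragments into the final text handed to TTS in step 6. The `data:` field and blank-line event separator come from the Server-Sent Events format itself; the `{"delta": ...}` payload shape and the `[DONE]` end marker are assumptions made for this example, not the documented wire format of iX Hello vNext.

```python
import json
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each Server-Sent Event in a stream of lines."""
    data_lines: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                yield "\n".join(data_lines)
                data_lines = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format drops one optional space after the colon.
            data_lines.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) and comments are ignored here.


def stream_reply(lines: Iterable[str]) -> Iterator[str]:
    """Yield reply fragments as they arrive (payload shape is hypothetical)."""
    for payload in iter_sse_data(lines):
        if payload == "[DONE]":  # assumed end-of-stream marker
            break
        yield json.loads(payload)["delta"]


sample = [
    'data: {"delta": "Your balance "}\n',
    "\n",
    'data: {"delta": "is $42."}\n',
    "\n",
    "data: [DONE]\n",
    "\n",
]
fragments = list(stream_reply(sample))
full_text = "".join(fragments)  # the complete reply that would go to TTS
```

Each fragment is available as soon as its event arrives, so downstream processing does not have to wait for the whole answer.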

#### **2.2 Key Characteristics of GPT‑4o Mini Input Flow**

* Input is **NOT streamed** to the LLM.
* Transcription occurs **first**, blocking until completion.
* Only the **output** is streamed.
* All red-marked components in the diagram above correspond to non‑streaming behavior.

#### Demo on **GPT‑4o Mini (Non‑Streaming Input)** Audio:

{% file src="/files/501Pdo6UMmh45rH28imS" %}

***

### **3. Input Audio Flow with GPT-Real‑Time (Fully Streaming)**

With **GPT-Real‑Time**, both **input and output** are streamed over WebSockets.

<figure><img src="/files/WC6RWE304SdXALV5YbmG" alt=""><figcaption></figcaption></figure>

<figure><img src="/files/8YPMBIJzuJpjzQez0R6y" alt=""><figcaption></figcaption></figure>

#### **3.1 Sequence**

1. **Voice Activity Detection**\
   Media Server still uses VAD to detect when the caller starts speaking.
2. **Real-Time Audio Streaming to Channel Gateway**\
   As soon as audio packets arrive, they are streamed *immediately* to Channel Gateway.
3. **WebSocket Audio Streaming to Runtime**\
   Channel Gateway forwards the live audio packets to the Runtime Engine.
4. **WebSocket Streaming to Azure OpenAI**\
   Runtime streams input audio directly to the **GPT-Real‑Time** model.
5. **Immediate Real-Time Response**\
   The model streams output (text + optional synthesized audio hints) back over WebSocket.
6. **Final Audio Playback**\
   Runtime sends the final text to 11 Labs for TTS, and the audio is played back to the caller.
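The streaming hand-off in steps 2 to 4 can be approximated with `asyncio` queues. This is a sketch under stated assumptions: the real components communicate over WebSockets, whereas the in-process queues, the function names, and the 10 ms packet gap below are illustrative only.

```python
import asyncio
from typing import List

END = None  # sentinel marking the end of the caller's utterance


async def media_server(chunks: List[bytes], out: asyncio.Queue, log: List[str]) -> None:
    """Emit audio packets one by one, as they would arrive from the caller."""
    for chunk in chunks:
        log.append(f"caller sent {chunk!r}")
        await out.put(chunk)
        await asyncio.sleep(0.01)  # simulated gap between audio packets
    log.append("caller finished")
    await out.put(END)


async def channel_gateway(inbox: asyncio.Queue, out: asyncio.Queue) -> None:
    """Forward raw audio bytes; this stage never sees any text."""
    while True:
        chunk = await inbox.get()
        await out.put(chunk)
        if chunk is END:
            return


async def runtime(inbox: asyncio.Queue, log: List[str]) -> None:
    """Stand-in for the Runtime streaming audio on to the model."""
    while True:
        chunk = await inbox.get()
        if chunk is END:
            return
        log.append(f"model received {chunk!r}")


async def run_call(chunks: List[bytes]) -> List[str]:
    log: List[str] = []
    to_gateway: asyncio.Queue = asyncio.Queue()
    to_runtime: asyncio.Queue = asyncio.Queue()
    await asyncio.gather(
        media_server(chunks, to_gateway, log),
        channel_gateway(to_gateway, to_runtime),
        runtime(to_runtime, log),
    )
    return log


log = asyncio.run(run_call([b"pkt-1", b"pkt-2", b"pkt-3"]))
```

In the resulting log, the model receives the first packet before the caller has finished speaking, which is the property that separates this flow from the GPT‑4o Mini input path.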

#### **3.2 Key Characteristics**

* Input audio is streamed end-to-end.
* Media Server and Channel Gateway **never see the transcribed text**—they only handle audio bytes.
* Only Runtime receives text, because the model outputs it.

#### Demo on **GPT-Real‑Time (Fully Streaming)** Audio:

{% file src="/files/PHKmJZj1H9ogTiqkwARf" %}

***

### **4. Summary of the Difference**

#### **GPT‑4o Mini (Non‑Streaming Input)**

Has to wait for the caller to finish speaking → then transcribe → then process → then respond.\
➡️ Takes around **2–3 seconds** after the user stops talking.

#### **GPT-Real‑Time (Streaming Input)**

Starts processing audio *while* the caller is still speaking.\
➡️ Starts giving a response in about **1 second** after the user stops talking.

#### **Overall:**

**Real‑time is roughly 1–2 seconds faster**, more consistent, and doesn’t slow down even with long user inputs.

| Aspect                   | GPT‑4o Mini              | GPT-Real‑Time            |
| ------------------------ | ------------------------ | ------------------------ |
| Input Audio Streaming    | ❌ No                     | ✔️ Yes                   |
| Transcription            | Required before LLM call | Optional but recommended |
| Output Streaming         | ✔️ SSE                   | ✔️ WebSocket             |
| Latency                  | Higher                   | Lowest                   |
| Real-Time Responsiveness | Limited                  | Near-instant             |
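The "1–2 seconds faster" figure follows directly from the two approximate delays quoted in this summary. A small worked example, using only those numbers (the variable and function names are illustrative):

```python
from typing import Tuple

# Approximate delays after the caller stops talking, in seconds,
# as quoted in this summary. These are rough figures, not measurements.
MINI_DELAY_RANGE = (2.0, 3.0)  # GPT-4o Mini
REALTIME_DELAY = 1.0           # GPT-Real-Time


def savings_range(slow_range: Tuple[float, float], fast: float) -> Tuple[float, float]:
    """Return the (min, max) time saved by the faster option."""
    low, high = slow_range
    return (low - fast, high - fast)


saved = savings_range(MINI_DELAY_RANGE, REALTIME_DELAY)
print(f"Real-time responds about {saved[0]:.0f}-{saved[1]:.0f} s sooner")
```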

***

### **5. Conclusion**

The decision between **GPT‑4o Mini** and **GPT-Real‑Time** comes down to the required level of responsiveness:

* **GPT‑4o Mini** is best for non-streaming input where cost matters. It is designed for basic text replies and suits simple interactions. Its accuracy is inconsistent, and it struggles with transcription, especially when the caller switches languages. This unpredictability undermines production reliability. It is budget-friendly, but the quality trade-offs are evident, so it is ideal only when cost matters more than conversation quality.
* **GPT-Real-Time** is optimized for ultra-low latency, seamless conversational exchanges, and instantaneous system responses. It delivers superior accuracy, consistency, and reliability. It efficiently manages language switching and transcription and generates stable, high-quality responses for both brief and extended inputs. It's ideal for customer-facing and production environments where performance is critical. While it's more expensive, the enhancements in user experience and precision often justify the cost. Best suited for applications requiring exceptional performance, natural dialog, and dependability.


