# Large language model: streaming
In the context of large language model (LLM) APIs (OpenAI, Anthropic, Groq, Gemini, Mistral, xAI’s API, etc.), the "stream" option means the model sends the response back piece by piece, as it is being generated, instead of waiting until the entire output is complete.
## Two ways APIs can return text
| Mode | Non-streaming (stream=false or omitted) | Streaming (stream=true) |
|---|---|---|
| How it works | The server generates the full answer internally, then sends it all at once when it’s finished. | The server starts sending tiny chunks (often individual tokens or a few words) immediately as they’re generated. |
| What the client receives | One single big response after seconds (or minutes for long outputs). | A continuous flow of small chunks (usually via Server-Sent Events / SSE or similar). |
| User experience in apps | You see a spinning loader → then the whole answer appears at once. | Text appears progressively, word-by-word or sentence-by-sentence, like ChatGPT does in the browser. |
| Latency perception | Feels slower because nothing shows up until the end. | Feels much faster because the first token arrives almost instantly (time-to-first-token is low). |
| Typical use case | Simple scripts, when you just need the final text. | Chat interfaces, real-time typing effect, anything interactive. |
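On the wire, the streamed chunks typically arrive as Server-Sent Events: each event is a `data:` line containing a small JSON payload with just the new text delta. As a rough illustration (fields abbreviated), an OpenAI chat-completions stream looks something like this:

```text
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hel"},"finish_reason":null}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"lo"},"finish_reason":null}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```

The client SDKs parse these events for you and expose them as an iterator of chunk objects, which is what the examples below use.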
## Example with OpenAI's API (the most widely used)
### Non-streaming

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],  # your conversation messages
    stream=False,    # the default; can be omitted
)
print(response.choices[0].message.content)  # whole answer arrives at once
```
### Streaming

```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],  # your conversation messages
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks (e.g. the final one) carry no text
        print(delta, end="", flush=True)
# → text prints out in real time as the model generates it
```
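In a real chat app you usually want both the live chunks and the final assembled text (for example, to append the reply to the conversation history). A minimal sketch of that pattern, reusing the `client` object from the example above:

```python
def stream_and_collect(client, messages, model="gpt-4o"):
    """Print each delta as it arrives and return the fully assembled reply."""
    stream = client.chat.completions.create(
        model=model, messages=messages, stream=True
    )
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
            print(delta, end="", flush=True)
    print()  # newline once the stream ends
    return "".join(parts)
```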
Almost every major provider now supports this pattern (sometimes with slight differences in the payload format):
- OpenAI → `stream=True` (`"stream": true` in the raw HTTP API)
- Anthropic → `stream=True` (its Python SDK also offers a `messages.stream()` helper; see the sketch after this list)
- Groq → `stream=True`
- Google Gemini → the `streamGenerateContent` endpoint (streaming variants in its SDKs)
- Mistral, Cohere, Together.ai, Fireworks, etc. → all have a stream parameter
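For comparison, here is the same idea with Anthropic's Python SDK, using its `messages.stream()` helper. This is a minimal sketch; the model name is just an illustrative placeholder.

```python
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# messages.stream() is a context-manager wrapper around streaming requests
with client.messages.stream(
    model="claude-3-5-sonnet-latest",  # placeholder model name
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain streaming in one sentence."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```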
## Why streaming is so popular now
- Dramatically better perceived speed (users see something within ~200–500 ms instead of waiting 5–30 seconds).
- Allows “typing indicators” and progressive rendering in UIs.
- Reduces timeout issues on very long responses.
So in short: “stream” = get the answer token-by-token in real time instead of all at the end. That’s why basically every chat app you use today feels like the AI is typing live — they all use streaming under the hood.