# Large language model: streaming
In the context of large language model (LLM) APIs (OpenAI, Anthropic, Groq, Gemini, Mistral, xAI’s API, etc.), the "stream" option means the model sends the response back piece by piece, as it is being generated, instead of waiting until the entire output is complete.
## Two ways APIs can return text
| Mode | Non-streaming (stream=false or omitted) | Streaming (stream=true) |
|---|---|---|
| How it works | The server generates the full answer internally, then sends it all at once when it’s finished. | The server starts sending tiny chunks (often individual tokens or a few words) immediately as they’re generated. |
| What the client receives | One single big response after seconds (or minutes for long outputs). | A continuous flow of small chunks (usually via Server-Sent Events / SSE or similar). |
| User experience in apps | You see a spinning loader → then the whole answer appears at once. | Text appears progressively, word-by-word or sentence-by-sentence, like ChatGPT does in the browser. |
| Latency perception | Feels slower because nothing shows up until the end. | Feels much faster because the first token arrives almost instantly (time-to-first-token is low). |
| Typical use case | Simple scripts, when you just need the final text. | Chat interfaces, real-time typing effect, anything interactive. |
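On the wire, the streamed chunks typically arrive as Server-Sent Events: each event is a `data:` line containing a small JSON payload with just the new text delta. As a rough illustration (fields abbreviated), an OpenAI chat-completions stream looks something like this:

```text
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hel"},"finish_reason":null}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"lo"},"finish_reason":null}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```

The client SDKs parse these events for you and expose them as an iterator of chunk objects, which is what the examples below use.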
## Example with OpenAI's API (the most widely used)
### Non-streaming

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],  # your conversation messages
    stream=False,    # the default; can be omitted
)
print(response.choices[0].message.content)  # whole answer arrives at once
```
### Streaming

```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],  # your conversation messages
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks (e.g. the final one) carry no text
        print(delta, end="", flush=True)
# → text prints out in real time as the model generates it
```
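In a real chat app you usually want both the live chunks and the final assembled text (for example, to append the reply to the conversation history). A minimal sketch of that pattern, reusing the `client` object from the example above:

```python
def stream_and_collect(client, messages, model="gpt-4o"):
    """Print each delta as it arrives and return the fully assembled reply."""
    stream = client.chat.completions.create(
        model=model, messages=messages, stream=True
    )
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
            print(delta, end="", flush=True)
    print()  # newline once the stream ends
    return "".join(parts)
```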
Almost every major provider now supports this pattern (sometimes with slight differences in the payload format):
- OpenAI → `stream=True` (`"stream": true` in the raw HTTP API)
- Anthropic → `stream=True` (its Python SDK also offers a `messages.stream()` helper; see the sketch after this list)
- Groq → `stream=True`
- Google Gemini → the `streamGenerateContent` endpoint (streaming variants in its SDKs)
- Mistral, Cohere, Together.ai, Fireworks, etc. → all have a stream parameter
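For comparison, here is the same idea with Anthropic's Python SDK, using its `messages.stream()` helper. This is a minimal sketch; the model name is just an illustrative placeholder.

```python
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# messages.stream() is a context-manager wrapper around streaming requests
with client.messages.stream(
    model="claude-3-5-sonnet-latest",  # placeholder model name
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain streaming in one sentence."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```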
## Why streaming is so popular now
- Dramatically better perceived speed (users see something within ~200–500 ms instead of waiting 5–30 seconds).
- Allows “typing indicators” and progressive rendering in UIs.
- Reduces timeout issues on very long responses.
So in short: “stream” = get the answer token-by-token in real time instead of all at the end. That’s why basically every chat app you use today feels like the AI is typing live — they all use streaming under the hood.