# large language model:streaming

In the context of large language model (LLM) APIs (OpenAI, Anthropic, Groq, Google Gemini, Mistral, xAI, etc.), the `stream` option means the model sends the response back piece by piece as it is generated, instead of waiting until the entire output is complete.

Two ways APIs can return text:

| Mode | Non-streaming (`stream=false` or omitted) | Streaming (`stream=true`) |
|---|---|---|
| How it works | The server generates the full answer internally, then sends it all at once when it is finished. | The server starts sending tiny chunks (often individual tokens or a few words) immediately as they are generated. |
| What the client receives | One single big response after seconds (or minutes for long outputs). | A continuous flow of small chunks, usually via Server-Sent Events (SSE) or a similar protocol (see the sketch below the table). |
| User experience in apps | A spinning loader, then the whole answer appears at once. | Text appears progressively, word by word or sentence by sentence, like ChatGPT does in the browser. |
| Latency perception | Feels slower because nothing shows up until the end. | Feels much faster because the first token arrives almost instantly (time-to-first-token is low). |
| Typical use case | Simple scripts, when you just need the final text. | Chat interfaces, real-time typing effect, anything interactive. |
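
The "continuous flow of small chunks" is, for most providers, a plain SSE stream over HTTP. As a rough illustration, here is a minimal sketch of reading that stream directly with `requests`; the endpoint and payload follow OpenAI's public chat-completions format, and the exact field names are an assumption to verify against your provider's docs:

```python
# Sketch: consume the raw SSE stream from a chat-completions endpoint.
# Assumes OPENAI_API_KEY is set and the `requests` package is installed.
import json
import os
import requests

resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": "Say hello."}],
        "stream": True,
    },
    stream=True,  # tell requests not to buffer the whole body
)

for line in resp.iter_lines():
    # Each SSE event arrives as a line of the form: data: {...json...}
    if not line or not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":  # sentinel marking the end of the stream
        break
    delta = json.loads(payload)["choices"][0]["delta"]
    print(delta.get("content") or "", end="", flush=True)
```

In practice you rarely parse SSE by hand; the provider SDKs shown below do it for you.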

## Example with OpenAI's API (the most common one)

### Non-streaming

```python
import openai  # pip install openai; reads OPENAI_API_KEY from the environment

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[...],    # your conversation history goes here
    stream=False       # default: wait for the complete answer
)
print(response.choices[0].message.content)   # whole answer at once
```

### Streaming

```python
import openai  # same client setup as above

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[...],    # your conversation history goes here
    stream=True
)

for chunk in response:
    # Some chunks carry no text (e.g., the first one only sets the role),
    # so check the delta before printing.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
# → text prints out in real time as the model generates it
```
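
If you also need the complete answer once streaming finishes, just concatenate the chunk contents into a string as they arrive; streaming does not prevent you from keeping the full text.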

Almost every major provider now supports this pattern (sometimes with slight differences in the payload format):

- OpenAI → `stream=True` (or `"stream": true` in raw JSON requests)
- Anthropic → `stream=True`, or the SDK's streaming helper (see the sketch after this list)
- Groq → `stream=True`
- Google Gemini → the `streamGenerateContent` method, or `stream=True` in the Python SDK
- Mistral, Cohere, Together.ai, Fireworks, etc. → all have a `stream` parameter
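
As one example beyond OpenAI, Anthropic's Python SDK exposes streaming through a context manager. A minimal sketch, assuming the `anthropic` package is installed and `ANTHROPIC_API_KEY` is set; the model id and prompt are illustrative:

```python
# Sketch: streaming with the Anthropic Python SDK.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with client.messages.stream(
    model="claude-3-5-sonnet-20241022",  # illustrative; substitute a current model id
    max_tokens=512,
    messages=[{"role": "user", "content": "Explain streaming in one paragraph."}],
) as stream:
    for text in stream.text_stream:  # yields text deltas as they arrive
        print(text, end="", flush=True)
```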

## Why streaming is so popular now

- Dramatically better perceived speed: users see something within roughly 200–500 ms instead of waiting 5–30 seconds for the full answer.
- Allows "typing indicators" and progressive rendering in UIs (see the sketch after this list).
- Reduces timeout issues on very long responses, since the connection carries data continuously instead of sitting idle until the answer is done.
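
To make the "progressive rendering" point concrete, here is a hypothetical sketch of relaying an OpenAI stream through a FastAPI endpoint so a browser can render tokens as they arrive. The endpoint path and model choice are illustrative, not from the original page:

```python
# Sketch: relay a streamed LLM response to the client with FastAPI.
# Assumes `fastapi`, `uvicorn`, and `openai` are installed and OPENAI_API_KEY is set.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()  # reads OPENAI_API_KEY from the environment


@app.get("/chat")
def chat(prompt: str):
    def token_generator():
        stream = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                # Each piece is flushed to the client as soon as it arrives.
                yield chunk.choices[0].delta.content

    return StreamingResponse(token_generator(), media_type="text/plain")
```

Because the response body is produced by a generator, the web server forwards each chunk immediately instead of buffering the whole answer, which is exactly what gives chat UIs their "live typing" feel.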

So in short: “stream” = get the answer token-by-token in real time instead of all at the end. That’s why basically every chat app you use today feels like the AI is typing live — they all use streaming under the hood.