ollama:streaming mode
Streaming mode in the Ollama API context refers to how the generated text is delivered from the language model to your application. Let me explain the key aspects of this approach:
What is Streaming Mode?
When using streaming mode with the Ollama API (`"stream": true` in your request), the API:
- Returns the generated text token by token, as each one is produced by the model
- Sends each token as a separate JSON object in a continuous HTTP response stream
- Allows your application to process and display tokens as they arrive, rather than waiting for the full response
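As a minimal sketch (the model name and prompt are placeholders), the request payload that turns streaming on looks like this:

```python
payload = {
    "model": "llama2",                 # placeholder model name
    "prompt": "Why is the sky blue?",  # placeholder prompt
    "stream": True,                    # ask Ollama to stream tokens as they are generated
}
```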
How Streaming Works in the Script
In the Python script I provided, streaming mode works this way:
```python
import json
import time

import requests

# url, headers, payload, token_count, start_time, and results are defined
# earlier in the script
response = requests.post(url, headers=headers, json=payload, stream=True)

for line in response.iter_lines():
    if line:
        data = json.loads(line)
        if "response" in data:
            token_count += 1                   # each streamed chunk carries one token
        if data.get("done", False):            # final chunk signals completion
            total_time = time.time() - start_time
            results.append((token_count, total_time))
            break
```
Each line from `response.iter_lines()` contains a JSON object that looks something like:

```json
{"model":"llama2","created_at":"2025-05-07T12:34:56.789Z","response":" the","done":false}
```
The script:
- Processes each JSON object as it arrives
- Counts each object containing a "response" field as one token
- Stops when it receives an object with `"done": true`
Advantages of Streaming
- Real-time experience: Users see text appear progressively, similar to watching someone type
- Accurate token counting: Each chunk from the stream represents a token, allowing precise measurement
- Faster perceived response time: Users see initial output immediately instead of waiting for the complete response
- Ability to interrupt generation: Since you're processing the response incrementally, you can stop generation early (see the sketch after this list)
- More accurate timing measurements: You can measure exactly when each token arrives
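A minimal sketch of that interruption point, reusing the same loop from the script (the 50-token cutoff is an arbitrary illustration, and whether the server also aborts generation when the client disconnects depends on the server):

```python
MAX_TOKENS = 50   # arbitrary cutoff, purely for illustration

token_count = 0
for line in response.iter_lines():
    if not line:
        continue
    data = json.loads(line)
    if "response" in data:
        token_count += 1
    if token_count >= MAX_TOKENS or data.get("done", False):
        response.close()   # stop reading the stream and tear down the connection
        break
```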
Non-Streaming Alternative
In non-streaming mode (`"stream": false`), the API:
- Waits until the model completes the entire response
- Returns the full text all at once in a single JSON response
- Provides less visibility into the token-by-token generation process
This makes token counting and timing less precise in the script, which is why it uses an approximation based on word count.
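A minimal sketch of the non-streaming path, assuming Ollama's default endpoint at `http://localhost:11434/api/generate` and a placeholder model name; the word-count approximation mirrors what the script falls back to:

```python
import requests

payload = {
    "model": "llama2",                 # placeholder model name
    "prompt": "Why is the sky blue?",
    "stream": False,                   # wait for the complete response
}

resp = requests.post("http://localhost:11434/api/generate", json=payload)
data = resp.json()

full_text = data["response"]               # the entire completion in one JSON object
approx_tokens = len(full_text.split())     # rough token count based on word count
print(f"~{approx_tokens} tokens (word-count approximation)")
```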
Streaming mode is generally preferred for performance measurement as it gives you direct access to the token stream, allowing for more accurate metrics about real-time model performance.
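For example, the `(token_count, total_time)` pairs collected by the script above translate directly into a tokens-per-second figure:

```python
for token_count, total_time in results:
    print(f"{token_count} tokens in {total_time:.2f}s -> {token_count / total_time:.1f} tokens/sec")
```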
Based on Ollama's API documentation, streaming mode is enabled by default (`"stream": true`) for most response-generating endpoints, meaning responses are delivered token by token in real time as they're generated. This differs from some client libraries, which may default to non-streaming mode. There are several key reasons why streaming is the default behavior:
Why Streaming is the Default in the Ollama API
- User Experience: Streaming provides a more responsive experience by showing text as it's generated rather than making users wait for the complete response. This creates the feeling of a more natural conversation.
- Lower Perceived Latency: Users see immediate feedback instead of waiting for the entire response to be generated, which significantly improves the perceived responsiveness of the application.
- Real-time Interactivity: Developers can build interfaces that display tokens as they arrive, similar to how popular commercial AI interfaces work, creating a more engaging user experience.
- Resource Management: Streaming allows the client application to process and display content incrementally, which can be more efficient for memory management, especially with very long responses.
- API Design Philosophy: Ollama's API documentation explicitly states that "Certain endpoints stream responses as JSON objects" and that "Streaming can be disabled by providing `{"stream": false}`", showing this is their intended default behavior.
When you want to disable streaming in the Ollama API, you need to explicitly set `"stream": false` in your request payload. This is important for cases where you need the complete response at once, such as when processing structured data like JSON responses or when measuring overall response time.
In your token-per-second measurement script, supporting both streaming and non-streaming modes allows you to accurately measure different aspects of model performance depending on your specific needs.