ollama:streaming mode
Streaming mode in the Ollama API context refers to how the generated text is delivered from the language model to your application. Let me explain the key aspects of this approach:
What is Streaming Mode?
When using streaming mode with the Ollama API (`"stream": true` in your request), the API:
- Returns the generated text token by token, as each one is produced by the model
- Sends each token as a separate JSON object in a continuous HTTP response stream
- Allows your application to process and display tokens as they arrive, rather than waiting for the full response
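As a minimal sketch (the model name and prompt are placeholders), the request payload that turns streaming on looks like this:

```python
payload = {
    "model": "llama2",                 # placeholder model name
    "prompt": "Why is the sky blue?",  # placeholder prompt
    "stream": True,                    # ask Ollama to stream tokens as they are generated
}
```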
How Streaming Works in the Script
In the Python script I provided, streaming mode works this way:
```python
import json
import time

import requests

# url, headers, payload, token_count, start_time, and results are defined
# earlier in the script
response = requests.post(url, headers=headers, json=payload, stream=True)

for line in response.iter_lines():
    if line:
        data = json.loads(line)
        if "response" in data:
            token_count += 1                   # each streamed chunk carries one token
        if data.get("done", False):            # final chunk signals completion
            total_time = time.time() - start_time
            results.append((token_count, total_time))
            break
```
Each line from `response.iter_lines()` contains a JSON object that looks something like:

```json
{"model":"llama2","created_at":"2025-05-07T12:34:56.789Z","response":" the","done":false}
```
The script:
- Processes each JSON object as it arrives
- Counts each object containing a "response" field as one token
- Stops when it receives an object with `"done": true`
Advantages of Streaming
- Real-time experience: Users see text appear progressively, similar to watching someone type
- Accurate token counting: Each chunk from the stream represents a token, allowing precise measurement
- Faster perceived response time: Users see initial output immediately instead of waiting for the complete response
- Ability to interrupt generation: Since you're processing the response incrementally, you can stop generation early (see the sketch after this list)
- More accurate timing measurements: You can measure exactly when each token arrives
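A minimal sketch of that interruption point, reusing the same loop from the script (the 50-token cutoff is an arbitrary illustration, and whether the server also aborts generation when the client disconnects depends on the server):

```python
MAX_TOKENS = 50   # arbitrary cutoff, purely for illustration

token_count = 0
for line in response.iter_lines():
    if not line:
        continue
    data = json.loads(line)
    if "response" in data:
        token_count += 1
    if token_count >= MAX_TOKENS or data.get("done", False):
        response.close()   # stop reading the stream and tear down the connection
        break
```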
Non-Streaming Alternative
In non-streaming mode (`"stream": false`), the API:
- Waits until the model completes the entire response
- Returns the full text all at once in a single JSON response
- Provides less visibility into the token-by-token generation process
This makes token counting and timing less precise in the script, which is why it uses an approximation based on word count.
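A minimal sketch of the non-streaming path, assuming Ollama's default endpoint at `http://localhost:11434/api/generate` and a placeholder model name; the word-count approximation mirrors what the script falls back to:

```python
import requests

payload = {
    "model": "llama2",                 # placeholder model name
    "prompt": "Why is the sky blue?",
    "stream": False,                   # wait for the complete response
}

resp = requests.post("http://localhost:11434/api/generate", json=payload)
data = resp.json()

full_text = data["response"]               # the entire completion in one JSON object
approx_tokens = len(full_text.split())     # rough token count based on word count
print(f"~{approx_tokens} tokens (word-count approximation)")
```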
Streaming mode is generally preferred for performance measurement as it gives you direct access to the token stream, allowing for more accurate metrics about real-time model performance.
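For example, the `(token_count, total_time)` pairs collected by the script above translate directly into a tokens-per-second figure:

```python
for token_count, total_time in results:
    print(f"{token_count} tokens in {total_time:.2f}s -> {token_count / total_time:.1f} tokens/sec")
```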
Based on Ollama's API documentation, streaming mode is enabled by default (`"stream": true`) for most response-generating endpoints, meaning responses are delivered token by token in real time as they're generated. This differs from some client libraries, which may default to non-streaming mode. There are several key reasons why streaming is the default behavior:
Why Streaming is the Default in the Ollama API
- User Experience: Streaming provides a more responsive experience by showing text as it's generated rather than making users wait for the complete response. This creates the feeling of a more natural conversation.
- Lower Perceived Latency: Users see immediate feedback instead of waiting for the entire response to be generated, which significantly improves the perceived responsiveness of the application.
- Real-time Interactivity: Developers can build interfaces that display tokens as they arrive, similar to how popular commercial AI interfaces work, creating a more engaging user experience.
- Resource Management: Streaming allows the client application to process and display content incrementally, which can be more efficient for memory management, especially with very long responses.
- API Design Philosophy: Ollama's API documentation explicitly states that "Certain endpoints stream responses as JSON objects" and that "Streaming can be disabled by providing `{"stream": false}`", showing this is their intended default behavior.
When you want to disable streaming in the Ollama API, you need to explicitly set `"stream": false` in your request payload. This is important for cases where you need the complete response at once, such as when processing structured data like JSON responses or when measuring overall response time.
In your token-per-second measurement script, supporting both streaming and non-streaming modes allows you to accurately measure different aspects of model performance depending on your specific needs.