This document explains how to use the streaming functionality in the xAI Grok API.
Streaming allows you to receive responses from the API as they are generated, rather than waiting for the entire response to be completed. This is particularly useful for chat applications where you want to show the response to the user as it's being generated.
Streaming is implemented using Server-Sent Events (SSE), which allows the server to push content to clients as it becomes available.
Currently, streaming is supported for the following endpoints:
/api/v1/chat/completions- For chat completions
Streaming is not supported for:
/api/v1/images/generate- Image generation inherently produces a complete result/api/v1/vision/analyze- Vision requests currently don't support streaming
While the xAI documentation suggests streaming is supported for "Chat, Image Understanding, etc.", our current implementation does not support streaming for vision requests. When a vision request is made with stream=true:
- The API will automatically convert it to a non-streaming request
- A header
X-Stream-Fallbackwill be added to the response to indicate this conversion - The response will be returned as a complete (non-streaming) response
This affects both the dedicated vision endpoint (/api/v1/vision/analyze) and vision requests made through the chat completions endpoint (with image content).
The simplest way to use streaming is with the OpenAI SDK, which will handle the SSE protocol for you:
import os
from openai import OpenAI
# Set up client
client = OpenAI(
api_key="your_api_key",
base_url="https://your-api-domain.com/api/v1" # Point to your API server
)
# Create a streaming completion
stream = client.chat.completions.create(
model="grok-4-1-fast-non-reasoning",
messages=[
{"role": "user", "content": "Write a short story about a space adventure"}
],
stream=True # Enable streaming
)
# Print each chunk as it arrives
for chunk in stream:
if chunk.choices[0].delta.content is not None:
print(chunk.choices[0].delta.content, end="", flush=True)You can also make direct HTTP requests to use streaming:
import requests
import json
API_KEY = "your_api_key"
API_URL = "https://your-api-domain.com/api/v1/chat/completions"
headers = {
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json",
"Accept": "text/event-stream"
}
data = {
"model": "grok-4-1-fast-non-reasoning",
"messages": [
{"role": "user", "content": "Write a short story about a space adventure"}
],
"stream": True
}
# Make the request with streaming enabled
response = requests.post(API_URL, headers=headers, json=data, stream=True)
# Process the streaming response
for line in response.iter_lines():
if line:
line = line.decode('utf-8')
# SSE format is "data: {...}"
if line.startswith('data: '):
if line.strip() == 'data: [DONE]':
break
json_str = line[6:] # Remove 'data: ' prefix
try:
chunk = json.loads(json_str)
content = chunk['choices'][0]['delta'].get('content', '')
if content:
print(content, end='', flush=True)
except json.JSONDecodeError:
continueYou can test streaming with curl:
curl -X POST https://your-api-domain.com/api/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your_api_key" \
-d '{
"model": "grok-4-1-fast-non-reasoning",
"messages": [
{"role": "user", "content": "Write a short story about a space adventure"}
],
"stream": true
}'The streaming response follows the SSE format:
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1692469985,"model":"grok-4-1-fast-non-reasoning","choices":[{"delta":{"role":"assistant","content":"Once"},"index":0,"finish_reason":null}]}
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1692469985,"model":"grok-4-1-fast-non-reasoning","choices":[{"delta":{"content":" upon"},"index":0,"finish_reason":null}]}
data: [DONE]
Each line starts with data: followed by a JSON object or [DONE] to indicate the end of the stream.
When using streaming, keep in mind that:
- Each streamed response counts as one request against your rate limit
- Streaming requests may consume more server resources for longer periods
- If the connection is interrupted, you'll need to start a new request
If an error occurs during streaming, the API will return an error message in the SSE format:
data: {"error":"API error: Invalid authentication token"}
You should handle these errors in your client application.