Set stream=True when calling the chat completions or completions endpoints to stream completions. This returns an object that streams back the response as data-only server-sent events.
Streaming completion is an essential functionality offered by OpenAI, particularly useful in the implementation of real-time applications like live text generation or interactive conversational bots.
Traditionally, this feature has been more straightforward in JavaScript, but with Python’s growing popularity in AI and data science, there has been an increasing demand for implementing OpenAI’s streaming functionality in Python.
What is Streaming Completion?
In the context of OpenAI, streaming completions refer to the ability to receive a stream of tokens generated by OpenAI’s models, such as GPT-4, as they are produced, rather than waiting for the entire response to be generated before it is received. This real-time token generation provides an interactive user experience and has many potential applications in AI-driven solutions.
OpenAI typically generates the full text before sending it back to you in a single response. This process can take some time, especially if the text is long.
Streaming completions allow you to get responses faster. You can start to see or use the initial part of the generated text even before the entire text is finished. To do this, you just need to set stream=True when you’re requesting completions. You’ll then receive an object that sends back the response in small parts as it’s being generated.
But, remember, there are a couple of challenges with this method. First, it can be harder to check and moderate the content of the completions, as you’re dealing with incomplete text. Second, you won’t get information on how many tokens were used in the response. However, you can calculate this on your own using a tool like tiktoken once you’ve received the full text.
Simple Example
Consider the following example that shows how a streaming ChatCompletion request can be implemented using a generator:
# Simple Streaming ChatCompletion Request
import openai

response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[
        {'role': 'user', 'content': "What's 1+1? Answer in one word."}
    ],
    temperature=0,
    stream=True
)

for chunk in response:
    print(chunk)
Because the call returns a generator, the output is created dynamically: the result is “streamed” back to you piece by piece instead of arriving as one finished batch.
Recommended: Python Generator Expressions
If you build applications against the OpenAI API, this lets you create a more interactive, seamless experience for the user.
Here’s some example output:
{
"choices": [
{
"delta": {
"role": "assistant"
},
"finish_reason": null,
"index": 0
}
],
"created": 1677825464,
"id": "chatcmpl-6ptKyqKOGXZT6iQnqiXAH8adNLUzD",
"model": "gpt-3.5-turbo-0301",
"object": "chat.completion.chunk"
}
{
"choices": [
{
"delta": {
"content": "\n\n"
},
"finish_reason": null,
"index": 0
}
],
"created": 1677825464,
"id": "chatcmpl-6ptKyqKOGXZT6iQnqiXAH8adNLUzD",
"model": "gpt-3.5-turbo-0301",
"object": "chat.completion.chunk"
}
{
"choices": [
{
"delta": {
"content": "2"
},
"finish_reason": null,
"index": 0
}
],
"created": 1677825464,
"id": "chatcmpl-6ptKyqKOGXZT6iQnqiXAH8adNLUzD",
"model": "gpt-3.5-turbo-0301",
"object": "chat.completion.chunk"
}
{
"choices": [
{
"delta": {},
"finish_reason": "stop",
"index": 0
}
],
"created": 1677825464,
"id": "chatcmpl-6ptKyqKOGXZT6iQnqiXAH8adNLUzD",
"model": "gpt-3.5-turbo-0301",
"object": "chat.completion.chunk"
}
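In practice, you rarely want to print the raw chunks. Each chunk’s delta carries at most a small piece of the message, so you typically extract the content fields and concatenate them client-side. Here’s a minimal sketch of that pattern, written against the same legacy openai library interface as the example above (note that the first and last chunks carry no 'content' key, hence the .get() with a default):

import openai

response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[{'role': 'user', 'content': "What's 1+1? Answer in one word."}],
    temperature=0,
    stream=True
)

collected = []
for chunk in response:
    delta = chunk.choices[0].delta        # partial update, e.g. {'content': '2'}
    text = delta.get('content', '')       # role-only and final chunks have no content
    collected.append(text)
    print(text, end='', flush=True)       # render each fragment as it arrives

full_reply = ''.join(collected)           # the complete message, assembled client-side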
Python Implementation
Implementing streaming completions in Python with OpenAI involves passing the stream=True parameter to the openai.Completion.create function.
Here’s an illustrative example:
import sys
import openai

# Stream a legacy Completion request and print each token as it arrives
for resp in openai.Completion.create(
    model='code-davinci-002',
    prompt='def hello():',
    max_tokens=512,
    stream=True
):
    sys.stdout.write(resp.choices[0].text)
    sys.stdout.flush()
In this example, streaming mode is enabled with stream=True, and each token generated by the model is output as soon as it’s received.
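To reuse this pattern across an application, you can wrap the loop in a generator function that yields text fragments as they arrive, which pairs naturally with the generator expressions recommended above. The helper name stream_completion below is hypothetical, chosen just for this sketch:

import openai

def stream_completion(prompt, model='code-davinci-002', max_tokens=512):
    """Yield text fragments as the model produces them (legacy Completion API)."""
    for resp in openai.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=max_tokens,
        stream=True
    ):
        yield resp.choices[0].text

# Consume the stream however the application needs:
for fragment in stream_completion('def hello():'):
    print(fragment, end='', flush=True)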
Handling Token Cost
When working with OpenAI’s API, understanding and managing token cost is critical. A regular (non-streamed) response includes a usage field with exact token counts, but streamed responses don’t report usage in their server-sent events.
A practical workaround is to estimate token usage with OpenAI’s tokenizer, tiktoken: the prompt tokens can be counted before the request is even sent, while the completion tokens can only be counted after the stream has finished and you’ve assembled the full text.
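The sketch below shows one way to do both. It assumes tiktoken is installed (pip install tiktoken); estimate_tokens is a hypothetical helper, and the per-message overhead constants are approximations based on OpenAI’s cookbook guidance for gpt-3.5-turbo, not exact billing figures:

import tiktoken

enc = tiktoken.encoding_for_model('gpt-3.5-turbo')

def estimate_tokens(messages, completion_text=''):
    """Approximate prompt + completion token count for a chat request."""
    n = 0
    for message in messages:
        n += 4                                # rough per-message formatting overhead
        n += len(enc.encode(message['content']))
    n += 3                                    # the reply is primed with the assistant role
    n += len(enc.encode(completion_text))     # completion tokens, once fully streamed
    return n

messages = [{'role': 'user', 'content': "What's 1+1? Answer in one word."}]
print(estimate_tokens(messages))              # prompt-side estimate, before the request
print(estimate_tokens(messages, '2'))         # total estimate, after the stream ends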
