Set stream=True when calling the chat completions or completions endpoints to stream completions. This returns an object that streams back the response as data-only server-sent events.
Streaming completion is an essential functionality offered by OpenAI, particularly useful in the implementation of real-time applications like live text generation or interactive conversational bots.
Traditionally, this feature has been more straightforward in JavaScript, but with Python’s growing popularity in AI and data science, there has been an increasing demand for implementing OpenAI’s streaming functionality in Python.
What is Streaming Completion?
In the context of OpenAI, streaming completions refer to the ability to receive a stream of tokens generated by OpenAI’s models, such as GPT-4, as they are produced, rather than waiting for the entire response to be generated before it is received. This real-time token generation provides an interactive user experience and has many potential applications in AI-driven solutions.
OpenAI typically generates the full text before sending it back to you in a single response. This process can take some time, especially if the text is long.
Streaming completions allow you to get responses faster. You can start to see or use the initial part of the generated text even before the entire text is finished. To do this, you just need to set stream=True when you’re requesting completions. You’ll then receive an object that sends back the response in small parts as it’s being generated.
But, remember, there are a couple of challenges with this method. First, it can be harder to check and moderate the content of the completions, as you’re dealing with incomplete text. Second, you won’t get information on how many tokens were used in the response. However, you can calculate this on your own using a tool like tiktoken once you’ve received the full text.
Simple Example
Consider the following example that shows how a streaming ChatCompletion request can be implemented using a generator:
# Simple Streaming ChatCompletion Request
import openai

response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[
        {'role': 'user', 'content': "What's 1+1? Answer in one word."}
    ],
    temperature=0,
    stream=True
)

for chunk in response:
    print(chunk)
Because the call returns a generator, the output is created dynamically: the result is “streamed” back to you piece by piece instead of arriving as one finished batch.
Recommended: Python Generator Expressions
If you build applications against the OpenAI API, this lets you create a more interactive, seamless experience for the user.
Here’s some example output:
{
"choices": [
{
"delta": {
"role": "assistant"
},
"finish_reason": null,
"index": 0
}
],
"created": 1677825464,
"id": "chatcmpl-6ptKyqKOGXZT6iQnqiXAH8adNLUzD",
"model": "gpt-3.5-turbo-0301",
"object": "chat.completion.chunk"
}
{
"choices": [
{
"delta": {
"content": "\n\n"
},
"finish_reason": null,
"index": 0
}
],
"created": 1677825464,
"id": "chatcmpl-6ptKyqKOGXZT6iQnqiXAH8adNLUzD",
"model": "gpt-3.5-turbo-0301",
"object": "chat.completion.chunk"
}
{
"choices": [
{
"delta": {
"content": "2"
},
"finish_reason": null,
"index": 0
}
],
"created": 1677825464,
"id": "chatcmpl-6ptKyqKOGXZT6iQnqiXAH8adNLUzD",
"model": "gpt-3.5-turbo-0301",
"object": "chat.completion.chunk"
}
{
"choices": [
{
"delta": {},
"finish_reason": "stop",
"index": 0
}
],
"created": 1677825464,
"id": "chatcmpl-6ptKyqKOGXZT6iQnqiXAH8adNLUzD",
"model": "gpt-3.5-turbo-0301",
"object": "chat.completion.chunk"
}
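In practice, you rarely want to print the raw chunks. Each chunk’s delta carries at most a small piece of the message, so you typically extract the content fields and concatenate them client-side. Here’s a minimal sketch of that pattern, written against the same legacy openai library interface as the example above (note that the first and last chunks carry no 'content' key, hence the .get() with a default):

import openai

response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[{'role': 'user', 'content': "What's 1+1? Answer in one word."}],
    temperature=0,
    stream=True
)

collected = []
for chunk in response:
    delta = chunk.choices[0].delta        # partial update, e.g. {'content': '2'}
    text = delta.get('content', '')       # role-only and final chunks have no content
    collected.append(text)
    print(text, end='', flush=True)       # render each fragment as it arrives

full_reply = ''.join(collected)           # the complete message, assembled client-side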
Python Implementation
Implementing streaming completions in Python with OpenAI involves passing the stream=True parameter to the openai.Completion.create function.
Here’s an illustrative example:
import sys
import openai

# Stream a legacy Completion request and print each token as it arrives
for resp in openai.Completion.create(
    model='code-davinci-002',
    prompt='def hello():',
    max_tokens=512,
    stream=True
):
    sys.stdout.write(resp.choices[0].text)
    sys.stdout.flush()
In this example, streaming mode is enabled with stream=True, and each token generated by the model is output as soon as it’s received.
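To reuse this pattern across an application, you can wrap the loop in a generator function that yields text fragments as they arrive, which pairs naturally with the generator expressions recommended above. The helper name stream_completion below is hypothetical, chosen just for this sketch:

import openai

def stream_completion(prompt, model='code-davinci-002', max_tokens=512):
    """Yield text fragments as the model produces them (legacy Completion API)."""
    for resp in openai.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=max_tokens,
        stream=True
    ):
        yield resp.choices[0].text

# Consume the stream however the application needs:
for fragment in stream_completion('def hello():'):
    print(fragment, end='', flush=True)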
Handling Token Cost
When working with OpenAI’s API, understanding and managing token cost is critical. A regular (non-streamed) response includes a usage field with exact token counts, but streamed responses don’t report usage in their server-sent events.
A practical workaround is to estimate token usage with OpenAI’s tokenizer, tiktoken: the prompt tokens can be counted before the request is even sent, while the completion tokens can only be counted after the stream has finished and you’ve assembled the full text.
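The sketch below shows one way to do both. It assumes tiktoken is installed (pip install tiktoken); estimate_tokens is a hypothetical helper, and the per-message overhead constants are approximations based on OpenAI’s cookbook guidance for gpt-3.5-turbo, not exact billing figures:

import tiktoken

enc = tiktoken.encoding_for_model('gpt-3.5-turbo')

def estimate_tokens(messages, completion_text=''):
    """Approximate prompt + completion token count for a chat request."""
    n = 0
    for message in messages:
        n += 4                                # rough per-message formatting overhead
        n += len(enc.encode(message['content']))
    n += 3                                    # the reply is primed with the assistant role
    n += len(enc.encode(completion_text))     # completion tokens, once fully streamed
    return n

messages = [{'role': 'user', 'content': "What's 1+1? Answer in one word."}]
print(estimate_tokens(messages))              # prompt-side estimate, before the request
print(estimate_tokens(messages, '2'))         # total estimate, after the stream ends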
