Dirk van Meerveld, Author at Be on the Right Side of Change

Building Complex Multi-Agent Teams and Setups with LangGraph

Dirk van Meerveld — Thu, 18 Apr 2024 20:28:42 +0000

Info: This course is a complete text tutorial. It’s based on our academy course. If you’re interested in video explainers, check out the course here.

Hi and welcome to this course on building complex multi-agent teams and setups using LangGraph, LangChain, and LangSmith. In this course we’ll start from the ground up using LangChain, and then build and build, adding more complexity and tools as we go along. We will learn how to build a graph with paths, conditional paths, teams, team managers, and more, all stringing our agents together in powerful ways.

In part 1, we’ll get started with the basics of LangChain, learning how to create prompt templates and Chains, working with the LangChain syntax to easily string together our LLM calls.
In the next part we’ll learn how to write tools so that we can make our future agents powerful by giving them functions they can call. We will use the newest LangChain syntax for this and create both an image generation tool and a weather tool.
Part 3 is where we will learn the basics of LangGraph, covering the underlying concepts and exactly how it works. We will learn by setting up our first agent and graph which can return a visual representation of the current weather in any city you name.
In part 4 we’ll look at how we can take this all to yet the next level, discussing how we can use all of this to create a whole team of agents working together for us. We’ll also write a tool that can output PDF files in preparation for our multi-agent setup.
Part 5 is where the rubber really hits the road and we will create a powerful multi-agent setup in LangGraph using a team, team manager, many agents, conditional paths, and more. We will create a team that can work together independently and create travel itineraries for us, providing them in PDF format with an inserted image and a full travel plan.
In the final part we’ll have a look at writing asynchronous tools for our agents and then create a web research and article writing graph that can visit many web pages at the same time and then write an article about our desired topic for us.

I hope you’re as excited as I am to get started. Let’s dive in!

LangChain, LangSmith and LangGraph

Hi and welcome to this course on LangGraph, LangChain, and LangSmith. My name is Dirk van Meerveld and I will be your host and guide as we go on this exploration together.

So what is up with all these Lang-words? Well, in short:

LangChain is a basic framework that will allow us to work with LLMs.
LangGraph will allow us to make more complex combinations using LangChain by introducing graph structures, where we can have multiple nodes or even teams of LLM agents working together.
LangSmith is a tool that helps us see exactly what is going on while we work with the above two, to help us debug and improve our code in a more convenient way.

LangChain

Let’s get started with LangChain first. LangChain is a framework designed to make it easier to build applications that use large language models (LLMs). Think of it as a set of tools that helps bridge the gap between LLMs and the applications you might want to build with them.

LangChain helps us:

Provide a unified interface: Any code you write can be used with different LLMs with little modification, and you can use the same code to write prompts or tools for different LLMs.
Prebuilt tools for common tasks: LangChain includes tools for common tasks you might want to do with LLMs, such as building chatbots, summarizing documents, or analyzing code. Besides building our own tools and functions, we can also import pre-built tools from the community.
Memory and Context: LangChain makes it easy to incorporate memory and context into our LLM applications. This means our application can remember past interactions and use that information to inform future responses.

So let’s get started! First go ahead and create a new project folder and name it whatever you like, I’ll call mine FINX_LANGGRAPH:

 FINX_LANGGRAPH

Create a venv in the root project folder

We’ll be running this project inside a virtual environment. A virtual environment is a self-contained directory that will allow us to install specific versions of packages inside the virtual environment without affecting the global Python installation.

We will use this as I will be using specific versions for the libraries we install as we go along, and I want to make sure that you have the exact same experience as I do.

For example, when we use pydantic we’ll be using the older V1 for this project, as it plays nicely with LangChain. You’ll probably have V2 installed on your system-wide Python installation, and then your imports will be different from mine, causing confusion. We also don’t want to mess with your system-wide Python installation.

The virtual environment will make it easy for you to install my exact versions without worrying about affecting any of your other projects and is a good practice to follow in general.

To create a new virtual environment we’ll use a tool called pipenv. If you don’t have pipenv installed, you can install it using pip, which is Python’s package manager. Run the following command in your terminal:

pip install pipenv

Make sure the terminal is inside your root project folder, e.g. /c/Coding_Vault/FINX_LANGGRAPH, and then run the following command to create a new virtual environment:

pipenv shell

This will create a new virtual environment and also a Pipfile in your project directory. Any packages you install using pipenv install will be added to the Pipfile.

To generate a Pipfile.lock, which is used to produce deterministic builds, run:

pipenv lock

This will create a Pipfile.lock in your project directory, which contains the exact version of each dependency to ensure that future installs are able to replicate the same environment.

We don’t need to install a library first to create a Pipfile.lock. From now on, when we install a library in this virtual environment with pipenv install library_name, it will be added to the Pipfile and Pipfile.lock, which are basically just text files keeping track of our exact project dependencies.

For reference, I’m using Python 3.10 for this project, but you should be fine with any recent version. Consider upgrading if you’re using an older version.

Basic project setup

Before we get started, we need to make sure we have our OpenAI API key ready to load in a convenient way, as we should never hardcode it in our source code. Go to https://platform.openai.com/api-keys and copy your API key, or make a new one. You’ll only pay for what you use, which will be cents if you just play around with it casually. Then create a new file called .env in the root folder of your project:

 FINX_LANGGRAPH
     .env             New file
     Pipfile
     Pipfile.lock

And paste your API key in the .env file like this, making sure not to use any spaces or quotes:

OPENAI_API_KEY=your_api_key_here

Then go ahead and save and close this file. If you are using Git, make sure to add this file to your .gitignore file so you don’t accidentally commit your API key to your repository. If you’re not using Git, just make sure you exclude the .env file if you share your code with anyone.

We’ll be using several API keys and settings across our project, adding more as we go, so let’s create a simple and reusable way to load them to stop us from writing the same code over and over again.

Run the following command in your terminal to add the python-decouple package inside your pipenv environment:

pipenv install python-decouple==3.7

We will use this package to read the .env file and get the API key from it. Now create a new file named setup_environment.py in the root folder of your project:

 FINX_LANGGRAPH
     .env
     Pipfile
     Pipfile.lock
     setup_environment.py  New file

Then inside this new setup_environment.py file, write the following code:

import os

from decouple import config


def set_environment_variables() -> None:
    os.environ["OPENAI_API_KEY"] = str(config("OPENAI_API_KEY"))

We import os from the standard library and config from the decouple package we installed a minute ago. We then create a function we can import from our other code files.

The config("OPENAI_API_KEY") function reads the .env file and gets the value of the OPENAI_API_KEY variable we set in there, so make sure you have used the exact same name in there. The str() cast just makes sure it’s a string value. We then set this value to the OPENAI_API_KEY environment variable using os.environ.

This way we can just use LangChain freely without having to worry about our API key as both LangChain and OpenAI are set up to read our API keys from the environment variables automatically.
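To make the mechanics concrete, here is a minimal standard-library sketch of roughly what happens when a KEY=value line is read from a .env file and copied into the process environment. It is an illustration only and not how python-decouple is implemented; the load_env_file helper and the DEMO_API_KEY variable are hypothetical names.

```python
import os
import tempfile
from pathlib import Path


def load_env_file(path: Path) -> dict[str, str]:
    """Hypothetical helper: parse KEY=value lines and export them to os.environ."""
    loaded: dict[str, str] = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything without an '=' sign.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        loaded[key.strip()] = value.strip()
    os.environ.update(loaded)
    return loaded


# Demo with a throwaway file so no real key is involved.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text("DEMO_API_KEY=not_a_real_key\n", encoding="utf-8")
    values = load_env_file(env_path)

print(values)
print(os.environ["DEMO_API_KEY"])
```

Once a value is in os.environ, any library that looks up that variable name in the same process can find it, which is why the exact spelling of the name matters.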

LangChain basics

Ok, time to get started with LangChain! Let’s cover the basics first so we understand the building blocks. We’ll start with some installs. Make sure you run all of these even if you have some of these libraries installed already as we’re not using the global Python installation but our virtual environment. Run the following command in your terminal:

pipenv install openai==1.14.2 langchain==0.1.13 langchain-openai==0.1.0

The openai library talks to the OpenAI API behind the scenes, langchain is the framework itself, and langchain-openai is the integration package that connects the two (it provides the ChatOpenAI class we import below).

Now create a new file named langchain_basics.py in the root folder of your project:

 FINX_LANGGRAPH
     .env
     langchain_basics.py  New file
     Pipfile
     Pipfile.lock
     setup_environment.py

Inside this new langchain_basics.py file, let’s get started with the following imports:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from setup_environment import set_environment_variables

Before we explain the imports, I want to cover a potential problem you may have here. Your code editor may not recognize the imports and show red squiggly lines under them, even though you just installed these libraries.

So what is going on here? Well, the virtual environment we created comes with its own Python interpreter, and the Python interpreter in your code editor is probably set to the system-wide Python interpreter. This means that the code editor doesn’t know where to find the libraries we just installed in the virtual environment.

To fix this, press Ctrl+Shift+P in VS Code to open the command palette, then type Python: Select Interpreter and select the Python interpreter from the virtual environment you created. You can find the correct one easily by comparing your root project name with the interpreter name. My root folder is FINX_LANGGRAPH, so I can find mine in the list under this name.

Once you select it, the red squiggly lines should go away and you’re now using the correct Python interpreter.

With that out of the way, let’s look at the imports here:

StrOutputParser is a class that will help us parse the output from the LLMs into a string format. Normally, when we get a response back from the OpenAI API directly, we have to index into response.choices[0].message.content to get the text. Just think of this as a convenience class that does this for us.
ChatPromptTemplate is a class that will help us create a template for our chat prompts. This will make it easier to create prompts for the LLMs.
ChatOpenAI is a class that will basically just allow us to create an instance of OpenAI and use it with LangChain.

The value here of these output parsers and prompt templates is that they are a unified interface that we can use in the same manner without changes even if we change the LLM we are using halfway through our project or in the future.

Prompt templates

We then import the set_environment_variables function from the setup_environment file we created earlier. Now let’s continue our code by creating a prompt template:

set_environment_variables()


french_german_prompt = ChatPromptTemplate.from_template(
    "Please tell me the french and german words for {word} with an example sentence for each."
)

First, we make sure to call our set_environment_variables function to set our API key. As a simple example prompt, I’ll create an example that asks for the French and German words for a given word, along with an example sentence for each. This is just a simple example to show the parts of LangChain before we get into more complex examples.

The {word} part is the template variable that we can replace with any word we want to ask about. We then create a ChatPromptTemplate instance using the from_template method and pass in our prompt string. The ChatPromptTemplate class will help us create prompts for the LLMs in a more convenient way and basically deals with formatting message history like this:

## Example of a ChatPromptTemplate
template = ChatPromptTemplate.from_messages([
            ("system", "You are a helpful AI bot. Your name is {name}."),
            ("human", "Hello, how are you doing?"),
            ("ai", "I'm doing well, thanks!"),
            ("human", "{user_input}"),
        ])

We need only a single message here though, which is why we use the from_template method. In this case, LangChain will assume this to be a human message so this will result in:

template = ChatPromptTemplate.from_messages([
            ("human", "Please tell me the french and german words for {word} with an example sentence for each.")
        ])
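As a rough mental model, filling a template is similar to Python’s own str.format. The sketch below is plain Python rather than LangChain code: the fill_template helper is hypothetical and only mimics the idea of turning (role, template) pairs plus variables into finished messages.

```python
def fill_template(
    messages: list[tuple[str, str]], **variables: str
) -> list[tuple[str, str]]:
    """Hypothetical stand-in: substitute {placeholders} in every message."""
    return [(role, text.format(**variables)) for role, text in messages]


prompt = [
    (
        "human",
        "Please tell me the french and german words for {word} "
        "with an example sentence for each.",
    ),
]

filled = fill_template(prompt, word="polar bear")
print(filled[0][1])
```

The role stays untouched and only the placeholder is replaced, so the same template can be reused for any word.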

Creating a chain

Now that we have a prompt template to create our prompts, let’s continue:

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")
output_parser = StrOutputParser()

french_german_chain = french_german_prompt | llm | output_parser

First, we define our LLM instance using the ChatOpenAI class and pass in the model we want to use. I’ll be using gpt-3.5-turbo-0125 as it is more than enough for the simple test we’re doing here. If at any point in the course you want to use GPT-4-turbo instead then feel free to do so.

We’ve already set the API key to the environment variable so we don’t need to worry about it. We then create an instance of the StrOutputParser class to parse the output from the LLMs into a string response as discussed earlier.

Now that we have three building blocks, it is time for one of LangChain’s important concepts, “chains”. We can simply use the | operator to chain these building blocks together. This operator is taken from the pipe operator in Unix, which is used to chain commands together.

In this case, we take the french_german_prompt as the entry point of our chain, and we pipe the resulting prompt into our llm, making an LLM call. We then pipe the output into our output_parser to get the string response. Notice how easy and readable the chain is. We use chains to build stuff with large language models, hence the name LangChain. This piping style of syntax above is often referred to as LCEL or LangChain Expression Language.
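If the | syntax looks like magic, it is ordinary Python operator overloading. The toy class below is a sketch under stated assumptions and not LangChain’s implementation: Step and its invoke method are hypothetical names, and the three lambda steps only imitate a prompt, a model, and a parser.

```python
from typing import Any, Callable


class Step:
    """Toy building block that can be chained with the | operator."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # a | b returns a new Step that runs a first and feeds its result to b.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value: Any) -> Any:
        return self.func(value)


# Stand-ins for the prompt, the LLM call, and the output parser.
fake_prompt = Step(lambda inputs: f"Translate {inputs['word']}")
fake_llm = Step(lambda prompt: {"content": prompt.upper()})
fake_parser = Step(lambda message: message["content"])

toy_chain = fake_prompt | fake_llm | fake_parser
print(toy_chain.invoke({"word": "polar bear"}))
```

Each | produces a new object that is itself chainable, which is why any number of steps can be strung together in one readable line.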

Running the chain

Now let’s actually try and run this chain. To do this we can simply use the invoke method on our chain:

result = french_german_chain.invoke({"word": "polar bear"})
print(result)

We can technically also just pass in the string "polar bear" as we only have a single variable, but it’s better practice to use a dictionary like this as you may have multiple variables in your prompt. So go ahead and run this Python file and you should get something like the following:

French: ours polaire
German: Eisbär

Example sentence in French: L'ours polaire est un animal emblématique de l'Arctique.
Example sentence in German: Der Eisbär ist das größte an Land lebende Raubtier der Welt.

The order or structure may be slightly different as we didn’t specify any specific desired output structure, but that’s not the point here, it works! You’ll notice LangChain is very easy to read and understand, and this exact same code can be used with other LLMs with little modification.

We can also very easily stream the response instead. Edit your code like this, commenting out the previous invoke call and calling stream instead:

# result = french_german_chain.invoke({"word": "polar bear"})
# print(result)

for chunk in french_german_chain.stream({"word": "polar bear"}):
    print(chunk, end="", flush=True)

So for every chunk in the stream that results from calling french_german_chain.stream with the word “polar bear”, we print the chunk to the console. The end="" and flush=True are just to make sure there are no line breaks in between print messages and that the output is printed immediately to the console.

Now if you run it again, you’ll see the tokens being streamed and written to your console in real time.
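To see the effect of end="" and flush=True without making an API call, here is a small sketch that uses a plain generator. The fake_stream function is a hypothetical stand-in for a chain’s stream method and simply yields a sentence word by word.

```python
import time
from typing import Iterator


def fake_stream(text: str, delay: float = 0.0) -> Iterator[str]:
    """Hypothetical stand-in for chain.stream(): yield the text word by word."""
    for word in text.split(" "):
        time.sleep(delay)  # Raise the delay to watch the words appear one by one.
        yield word + " "


chunks = []
for chunk in fake_stream("Polar bears live in the Arctic."):
    chunks.append(chunk)
    # end="" keeps the chunks on one line, flush=True shows each one immediately.
    print(chunk, end="", flush=True)
print()
```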

Another useful method provided for us is batch, so let’s give that a spin as well:

# for chunk in french_german_chain.stream({"word": "polar bear"}):
#     print(chunk, end="", flush=True)

print(
    french_german_chain.batch(
        [{"word": "computer"}, {"word": "elephant"}, {"word": "carrot"}]
    )
)

This time we pass in a list of dictionaries with one entry for each run in the batch. Running this will give the responses in a list, one for each entry in the batch:

["French: \nComputer - Ordinateur \nExample sentence: J'utilise mon ordinateur pour travailler et regarder des films.\n\nGerman:\nComputer - Computer \nExample sentence: Mein Computer ist schon ein paar Jahre alt, aber er funktioniert immer noch einwandfrei.", "French: éléphant\nExample sentence: J'ai vu un éléphant au zoo.\n\nGerman: Elefant\nExample sentence: Der Elefant im Zoo war sehr groß.", "French: carotte\nExample sentence: J'ai acheté des carottes pour faire une soupe.\n\nGerman: Karotte\nExample sentence: Ich esse gerne Karotten als Snack."]

Now go ahead and comment that one out as well and let’s check the properties of our chain:

# print(
#     french_german_chain.batch(
#         [{"word": "computer"}, {"word": "elephant"}, {"word": "carrot"}]
#     )
# )

print("input_schema", french_german_chain.input_schema.schema())
print("output_schema", french_german_chain.output_schema.schema())

And if we run that we get a JSON schema that shows the inputs and outputs of our chain:

input_schema {'title': 'PromptInput', 'type': 'object', 'properties': {'word': {'title': 'Word', 'type': 'string'}}}
output_schema {'title': 'StrOutputParserOutput', 'type': 'string'}

We can see that the input is a single object that needs to have a key named word with a string value. If we add more variables to our prompt, we’ll see them in the schema as well. The output schema is a simple string because we used the StrOutputParser to parse the output into a string in the end.

Adding complexity

That is the basics of an extremely simple chain in LangChain. So let’s make it a bit more complex here. In this same file let’s declare a second chain and let’s say for the sake of a simple demonstration that this second chain is supposed to check if the output of the first chain is correct or not. (We’re just using simple examples here to save time and get to the good stuff faster).

So down below the other stuff in the langchain_basics.py file, let’s define the prompt template for our second chain:

# print("input_schema", french_german_chain.input_schema.schema())
# print("output_schema", french_german_chain.output_schema.schema())


check_if_correct_prompt = ChatPromptTemplate.from_template(
    """
    You are a helpful assistant that looks at a question and its given answer. You will find out what is wrong with the answer and improve it. You will return the improved version of the answer.
    Question:\n{question}\nAnswer Given:\n{initial_answer}\nReview the answer and give me an improved version instead.
    Improved answer:
    """
)

This time we have two variables in our prompt, question and initial_answer. We ask it to give an improved version of the first answer. The first answer is likely to be perfect already but again this is just for the sake of a quick demonstration.

We can reuse the llm and output_parser instances we created earlier, so let’s just create a new chain with the new prompt:

check_answer_chain = check_if_correct_prompt | llm | output_parser

Now we will need to run the input through the first chain, and then we need to keep both the original prompt from the first chain and the answer we get back from the first chain to pass them into the second one. So let’s do that:

def run_chain(word: str) -> str:
    initial_answer = french_german_chain.invoke({"word": word})
    print("initial answer:", initial_answer, end="\n\n")
    answer = check_answer_chain.invoke(
        {
            "question": f"Please tell me the french and german words for {word} with an example sentence for each.",
            "initial_answer": initial_answer,
        }
    )
    print("improved answer:", answer)
    return answer

So we define a function run_chain that takes a word as string input and will return a string. The initial answer is our return after we invoke the french_german_chain with the word.

We then print this answer and pass it into the check_answer_chain along with the original prompt, by passing both through a dictionary with the appropriate keys matching our prompt template. We print the improved answer and return it.

Now let’s run this function with a word:

run_chain("strawberries")

I apologize if I suddenly gave you a craving for strawberries! Run it and your output will be something like this:

initial answer: French: fraises
Example sentence: J'adore manger des fraises en été.

German: Erdbeeren
Example sentence: Im Sommer esse ich gerne Erdbeeren mit Sahne.

improved answer: French: fraises
Example sentence: J'adore manger des fraises en été.

German: Erdbeeren
Example sentence: Im Sommer esse ich gerne Erdbeeren.

Now of course both of them are fine and there wasn’t really anything to improve as the question is very simple, but we successfully ran a chain through another chain.

So that works fine, but you can see passing the values around to the second chain is a bit cumbersome. Now imagine we want to add a 3rd step to the chains above or even a 4th one. A conditional split path perhaps? If x then call chain a and else call chain b.

Using the above method would be a bit of a mess, so we’d have to create some kind of state object instead that has all the data in a single object so that we can pass this around between chains, with each chain adding or modifying the state object as needed.
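Here is a minimal sketch of that state-object idea, using plain functions in place of real chains. Everything in it is hypothetical (the ChainState fields and the step functions are placeholders, and this is not the LangGraph API); it only shows how a single state object lets each step read what it needs and add its own result.

```python
from typing import TypedDict


class ChainState(TypedDict, total=False):
    word: str
    initial_answer: str
    improved_answer: str


def translate_step(state: ChainState) -> ChainState:
    # Placeholder for french_german_chain.invoke(...)
    return {**state, "initial_answer": f"translation of {state['word']}"}


def review_step(state: ChainState) -> ChainState:
    # Placeholder for check_answer_chain.invoke(...)
    return {**state, "improved_answer": state["initial_answer"] + " (reviewed)"}


state: ChainState = {"word": "strawberries"}
for step in (translate_step, review_step):
    state = step(state)

print(state)
```

Adding a third or fourth step now only means adding another function to the tuple, because every step has the same signature.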

This is actually a pretty good solution to the problem and as it happens, this is pretty much what LangGraph will do for us. Before we get there though, we need to take a short detour to LangSmith and also learn how to write our own tools in LangChain so we can use the power of function calling and agents to fully leverage the power of LangGraph and create some really cool stuff. That’s it for part 1 of this course, I hope you enjoyed it and I’ll see you in the next one!

LangSmith and Writing Tools

Hi and welcome back to part 2 of the tutorial series where we will be having a look at LangSmith which will help us debug our LLM creations and also write tools that our powerful agents will be able to execute from part 3 onwards.

LangSmith setup

So what is LangSmith? LangSmith is another part of the LangChain ecosystem that will help us during the development and debugging of our LLM applications:

LLM Debugging and Testing: It will make it easier to identify and fix errors and test our applications to ensure they work as expected.
Monitoring and Evaluation: It also provides tools to monitor performance and effectiveness, especially helpful if your project needs fast response times.
Easy integration: LangSmith integrates seamlessly with LangChain and is very easy to set up as you will see.

First we’ll need to get an API key for LangSmith, so it can keep track of our traces for us using our unique identifier. This is free for single-user accounts with up to 3000 traces per month, which is more than enough for general development and testing. You shouldn’t have to provide any payment details unless you want to switch to a heavier plan later on.

Go to https://smith.langchain.com/ and sign up using your GitHub, Google, or email address:

After you have made your account and logged in at smith.langchain.com find the gear icon in the bottom left corner and click it, then find the Create Api Key button to generate your API key:

Copy your API key and then let’s open our existing .env file in the root of our project and edit it by adding the LangSmith API key (no spaces or quotation marks):

OPENAI_API_KEY=your_api_key_here
LANGCHAIN_API_KEY=your_api_key_here

Save and close your .env file. We don’t need to install LangSmith as it is already included in the LangChain package. Let’s move on to our existing setup_environment.py file to add the LangSmith setup to our reusable setup script.

In order to enable LangSmith tracing, we need to do three things.

Provide the LangSmith API key
Set the tracing environment variable to true
Set the project name so we can distinguish between different projects in our LangSmith dashboard

Replace all the code so far in the setup_environment.py file with the following:

import os
from datetime import date

from decouple import config


def set_environment_variables(project_name: str = "") -> None:
    if not project_name:
        project_name = f"Test_{date.today()}"

    os.environ["OPENAI_API_KEY"] = str(config("OPENAI_API_KEY"))

    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_API_KEY"] = str(config("LANGCHAIN_API_KEY"))
    os.environ["LANGCHAIN_PROJECT"] = project_name

    print("API Keys loaded and tracing set with project name: ", project_name)

We added the date import from datetime so we can use the date as the project name. Then we added an argument project_name to the function so we can set a custom project name for the LangSmith dashboard. If no project name is provided, it will default to Test_{date.today()} so we still have something to distinguish it by even if we forget to set the name.

The OPENAI_API_KEY environment variable was already there, but now we have added three more environment variables for LangSmith. LANGCHAIN_TRACING_V2 enables LangSmith tracing when set to true, and then we have the LANGCHAIN_API_KEY and LANGCHAIN_PROJECT environment variables which LangSmith will read to know who we are and group the traces per project in our dashboard.
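A misspelled variable name can be easy to miss, so a quick sanity check may help. This is an optional sketch: the missing_tracing_variables helper is hypothetical and uses only the three variable names mentioned above, with placeholder values for the demo.

```python
import os

REQUIRED_TRACING_VARIABLES = (
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_API_KEY",
    "LANGCHAIN_PROJECT",
)


def missing_tracing_variables() -> list[str]:
    """Return the names of tracing variables that are unset or empty."""
    return [name for name in REQUIRED_TRACING_VARIABLES if not os.environ.get(name)]


# Demo with placeholder values only: the API key is deliberately left out.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "Demo project"
os.environ.pop("LANGCHAIN_API_KEY", None)

print(missing_tracing_variables())
```

In a real project you would call such a check right after set_environment_variables and expect an empty list.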

Make sure you use the exact same names for the environment variables. Save and close the file. Now let’s see what LangSmith will do for us by giving it a test run. Open the langchain_basics.py file that we created in part 1 and change only the following line:

set_environment_variables()

to add a project name:

set_environment_variables("Simple LangChain test")

Now go ahead and run the langchain_basics.py file from part 1 again without changing anything about the code. LangSmith will now trace the execution of the code as we are using the updated set_environment_variables script.

After running the script, go to the LangSmith dashboard at https://smith.langchain.com/ and make sure you’re logged in. In your dashboard you will see the project name you set in the overview:

We can see that our Simple LangChain test project has been run a total of 2 times (1 run for each chain), with an error rate of 0%. We can see how many of the responses were streamed and how many tokens have been used in total for this project name.

Scrolling to the right reveals additional details:

We can see that our total cost for all runs on this project so far is $0.000237 and we have a latency of around 3 seconds per run. We also have the most recent run for reference. Go ahead and click the project for more details:

We have two entries, one for the french_german_chain and one for the check_answer_chain. When we use graphs later these will no longer be separate but combined into a single trace. Go ahead and click the lower one with an input of strawberries to see the details:

We can see the RunnableSequence, which is the overall chain, and then the three sub-elements that we had in our chain: the ChatPromptTemplate, the LLM, and the StrOutputParser. On this page we see the input and output for the entire chain, and if you click on any of the steps like ChatOpenAI you will see the input and output for that specific step:

Now our trace here is not that helpful as it is both very simple and broken up into two separate parts for each chain we ran, but this will be very helpful for easy feedback and debugging when we get to our graphs, which will combine complex systems into a single trace.

Tools – Image generator

Now let’s continue on and take a look at tools. If we want to have powerful multi AI-agent teams working away for us we need to be able to give them tools or functions to call. Naturally LangChain also comes with a handy integration for writing tools using a somewhat more pleasant syntax than the vanilla OpenAI tools.

We will be writing two tools, both of which we will use in our LangGraph graph in the next part. One of the tools will use Dall-e to generate an image (using our OpenAI key we already have) and download and save the image to disk. The other tool is going to get the current weather in a certain location. There are multiple ways in which tools can be defined in LangChain, but we will be using the latest convenient syntax here using the @tool decorator.

First let’s create a new folder called images and another one called tools in the root of our project, and then inside the tools folder create a new file named image.py:

 FINX_LANGGRAPH
     images          New empty folder
     tools           New folder
         image.py    New file
     .env
     langchain_basics.py
     Pipfile
     Pipfile.lock
     setup_environment.py

In the image.py file we will define our first tool and see how this works. Let’s get started with our imports:

import uuid
from pathlib import Path

import requests
from decouple import config
from langchain.tools import tool
from openai import OpenAI
from pydantic import BaseModel, Field

As we will also download the image, we import uuid to create a unique filename so we don’t get clashes. We will use pathlib to define the path where we will save the image and requests to send an HTTP request to download the generated image from the internet.

We also import config from decouple to read our .env file, tool from langchain.tools to define our tool, OpenAI from openai to make a request to Dall-e, and BaseModel and Field from pydantic to define the input of our tool.

requests is already installed as a dependency of LangChain itself, and we already installed openai. Let’s make sure we install pydantic as well by running:

pipenv install pydantic==1.10.13

Make sure you use this version as it plays nicely with the current LangChain versions. If you install V2 instead you will have to use different imports from mine.

As this is the only place where we will use the vanilla OpenAI client, we’ll just declare it here instead of integrating it into the setup_environment.py script. Add the following:

IMAGE_DIRECTORY = Path(__file__).parent.parent / "images"
CLIENT = OpenAI(api_key=str(config("OPENAI_API_KEY")))

To get a path to the images folder in the root of our project we first use Path(__file__) to get the path to the current file, then parent to go up one level to the tools folder, and then another parent to go up to the root of our project. We then add /images to get the path to the images folder.
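The path arithmetic can be tried on its own. The sketch below uses a made-up file location (the /projects prefix is purely an assumption for illustration) and PurePosixPath so it behaves the same on every operating system.

```python
from pathlib import PurePosixPath

# Hypothetical location of tools/image.py, used purely for illustration.
current_file = PurePosixPath("/projects/FINX_LANGGRAPH/tools/image.py")

tools_folder = current_file.parent          # .../FINX_LANGGRAPH/tools
project_root = current_file.parent.parent   # .../FINX_LANGGRAPH
image_directory = project_root / "images"   # .../FINX_LANGGRAPH/images

print(image_directory)
```

Because the path is built relative to the file itself, it keeps working no matter which directory the script is started from.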

We also create a CLIENT object using the OpenAI class and our API key from the .env file.

Image downloader

Let’s first create a helper function that takes an image URL and downloads and saves that image to our /images folder. This is not our tool but just a quick helper we can call from inside our tool later on. Continuing in image.py, add the following:

def image_downloader(image_url: str | None) -> str:
    if image_url is None:
        return "No image URL returned from API."
    response = requests.get(image_url)
    if response.status_code != 200:
        return "Could not download image from URL."
    unique_id: uuid.UUID = uuid.uuid4()
    image_path = IMAGE_DIRECTORY / f"{unique_id}.png"
    with open(image_path, "wb") as file:
        file.write(response.content)
    return str(image_path)

We define a function image_downloader that takes an image URL as input and returns a string with the path to the downloaded image. If the image URL is None we return a message saying that no image URL was returned from the API. We then use requests.get to download the image from the URL and check if the status code is 200 which means the request was successful, again sending a message if it was not successful.

We then create a unique ID by instantiating a new UUID object using uuid.uuid4(). We then create a path to the image using the IMAGE_DIRECTORY we defined earlier and the unique ID with a .png extension. Finally, we open the file in write binary mode (wb) and write the content of the response to the file, returning the path to the image as a string.

The reason we do not raise an error but send a string if the download fails is that an error will blow up our LLM application, but if we return a string instead the LLM agent will see that something went wrong and it can try to fix it or try calling the function again.
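The same idea can be applied to the file-saving step. The sketch below is not part of the tutorial code: save_image_bytes is a hypothetical helper that writes bytes to a uniquely named file, creates the target folder if it is missing, and returns a message instead of raising when the write fails.

```python
import tempfile
import uuid
from pathlib import Path


def save_image_bytes(content: bytes, directory: Path) -> str:
    """Write bytes to a uniquely named .png file and return its path.

    Hypothetical helper: on failure it returns a message instead of raising,
    so a calling agent would see readable text rather than a crash.
    """
    try:
        directory.mkdir(parents=True, exist_ok=True)
        image_path = directory / f"{uuid.uuid4()}.png"
        image_path.write_bytes(content)
    except OSError as error:
        return f"Could not save image: {error}"
    return str(image_path)


with tempfile.TemporaryDirectory() as tmp:
    # Success case: the images folder does not exist yet and gets created.
    saved = save_image_bytes(b"fake image bytes", Path(tmp) / "images")
    saved_ok = Path(saved).exists()

    # Failure case: a regular file sits where a folder would be needed.
    blocker = Path(tmp) / "blocker"
    blocker.write_text("not a directory")
    failed = save_image_bytes(b"data", blocker / "images")

print(saved_ok)
print(failed)
```

Whether you want the folder to be created automatically is a design choice; the tutorial code assumes the images folder already exists.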

Input interface

Before defining our tool itself, we’re going to define the exact input interface that our tool will accept. Behind the scenes LangChain will use this to generate the JSON schema that the OpenAI API requires for function and tool calling. Add the following:

class GenerateImageInput(BaseModel):
    image_description: str = Field(
        description="A detailed description of the desired image."
    )

We use pydantic to define a GenerateImageInput class which inherits from BaseModel. This will allow us to clearly define the input arguments our tool will need in order to run, as the LLM will need this information when calling a tool or deciding whether to call a tool or not.

We define a single field image_description which is a string and we use Field to add a description to the field. So we want an input argument of image_description which is a string that describes the image we want to generate. If you need multiple arguments you can define these here as well in the same fashion. For our uses, this one argument will do here.

Tool definition

Now it’s time to write our actual tool using the @tool decorator. Add the following:

@tool("generate_image", args_schema=GenerateImageInput)
def generate_image(image_description: str) -> str:
    """Generate an image based on a detailed description."""
    response = CLIENT.images.generate(
        model="dall-e-3",
        prompt=image_description,
        size="1024x1024",
        quality="standard",  # standard or hd
        n=1,
    )
    image_url = response.data[0].url
    return image_downloader(image_url)

We start with the @tool decorator which takes the name of the tool as the first argument and the schema of the input arguments as the second argument, passing in our GenerateImageInput class we defined earlier.

After that, we declare the function itself, which takes a string as input with the image description and will return an image path in string format. Note that we included a docstring that describes what the tool does: """Generate an image based on a detailed description.""".

This docstring is required when defining tools using the @tool decorator and is the description that will be used for the OpenAI tool schema generated behind the scenes that helps the LLM agent choose which function(s) to call. For this reason you must make sure it is an adequate description of what the tool does and what its purpose is.

After that we simply make a vanilla Dall-e image generation API request using CLIENT.images.generate with the model set to dall-e-3, the prompt set to the image_description we received as input, the size set to 1024x1024, the quality set to standard, and the number of images to generate set to 1. You can of course call on any image generation API you want, but as we already have an OpenAI key set we will use Dall-e here to keep things simple.

We then extract the URL by accessing response.data[0].url and return the result of calling the image_downloader function we defined earlier with the image URL as input. As image_downloader saves the image to a file and returns the path to it in string form, this fulfills our promise that generate_image returns a string file path to the requested image.
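To get a feel for what the tool name, docstring, and input model turn into, here is a hand-written approximation of an OpenAI function schema for this tool. This is only a sketch of the general shape; the schema LangChain actually generates may differ in its details.

```python
import json

# Approximate shape of an OpenAI function schema for generate_image.
# Hand-written for illustration; what LangChain generates may differ in detail.
generate_image_schema = {
    "name": "generate_image",  # from the @tool name
    "description": "Generate an image based on a detailed description.",  # docstring
    "parameters": {
        "type": "object",
        "properties": {
            "image_description": {  # from the GenerateImageInput field
                "type": "string",
                "description": "A detailed description of the desired image.",
            }
        },
        "required": ["image_description"],
    },
}

print(json.dumps(generate_image_schema, indent=2))
```

Seeing it laid out like this makes it clear why the descriptions matter: they are the only text the LLM gets to decide when and how to call the tool.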

Test run

Tools are just functions except we clearly defined the input arguments, name, and the purpose of the function using a docstring. Now let’s give our tool a test run by adding the following to the bottom of the file:

if __name__ == "__main__":
    print(generate_image.run("A picture of sharks eating pizza in space."))

If this file is the main file being run, the generate_image function will be called for a quick test. If we import the tool from elsewhere this code block will not be triggered. Note that we call the run method on a tool in order to run it; this is part of the defined interface for LangChain tools.

So go ahead and run this file and you should see an image appear in the images folder in the root of your project, indicating that it worked. Make sure you didn’t forget to create the empty images folder in the root of your project.

My image here is pretty epic, I must say:

It is interesting to see that Dall-e chose pepperoni pizza as the default pizza. Sorry if I made you hungry yet again.

Weather tool

Ok with that settled, save and close up this file, and let’s move on to our second tool which will get the current weather in a certain location. We’ll go through this one quickly as the process is very similar to the first tool.

First, sign up for a free account at https://www.weatherapi.com/. They will give you the pro plan free for 14 days, after which it automatically switches back to the free plan. You don’t have to provide any payment or credit card information, so the sign-up is fast and totally free.

Sign up and then get yourself an API key:

Now add your new API key to your .env file:

OPENAI_API_KEY=your_api_key_here
LANGCHAIN_API_KEY=your_api_key_here
WEATHER_API_KEY=your_api_key_here

Save and close that and now let’s create a new file in the tools folder called weather.py:

 FINX_LANGGRAPH
     images
     tools
         image.py
         weather.py    New file
     .env
     langchain_basics.py
     Pipfile
     Pipfile.lock
     setup_environment.py

In the weather.py file we will define our second tool. Let’s get started with our imports:

from json import dumps

import requests
from decouple import config
from langchain.tools import tool
from pydantic import BaseModel, Field

We import dumps from json too which will allow us to convert a dictionary to string format, as LLMs can only handle strings. The rest of the imports are familiar from the generate_image tool we made. Let’s define the input interface for our weather tool using a pydantic model:

class WeatherInput(BaseModel):
    location: str = Field(description="Must be a valid location in city format.")

This is the same as the other tool. Again, make sure the description is a good one as the LLM agent will make use of this. Let’s define our function that will call the weather API and return the response. Add the following:

@tool("get_weather", args_schema=WeatherInput)
def get_weather(location: str) -> str:
    """Get the current weather for a specified location."""
    if not location:
        return (
            "Please provide a location and call the get_weather tool again."
        )
    API_params = {
        "key": config("WEATHER_API_KEY"),
        "q": location,
        "aqi": "no",
        "alerts": "no",
    }
    response: requests.models.Response = requests.get(
        "http://api.weatherapi.com/v1/current.json", params=API_params
    )
    str_response: str = dumps(response.json())
    return str_response

We start with the @tool decorator with the name of the tool and the input schema as before. We then define the function itself which takes a string as input with the location and will return a string with the weather data. We include a docstring that describes what the tool does and is for so the LLM agent can make use of this.

If the location is not provided we return a message asking the LLM to provide a location and call the function again. We then define the API parameters as a dictionary with the API key which we read from the .env file using config, the location (q), and two optional parameters aqi (air quality index) and alerts set to no.

We then make a request to the weather API using requests.get with the URL http://api.weatherapi.com/v1/current.json and the API parameters. This will return a Response object from requests.models which we can convert to a dictionary using its .json() method. We then convert the dictionary to a string using the dumps (dump string) function we imported and return the string with the weather data.
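If you are wondering what the params dictionary does, it ends up as the query string of the URL. The sketch below uses the standard library’s urlencode to show roughly what the final request URL should look like; the key value is a placeholder and no request is actually sent.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.weatherapi.com/v1/current.json"

api_params = {
    "key": "your_api_key_here",  # placeholder, not a real key
    "q": "New York",
    "aqi": "no",
    "alerts": "no",
}

# Join the base URL and the encoded parameters with a question mark.
full_url = f"{BASE_URL}?{urlencode(api_params)}"
print(full_url)
```

Note how the space in New York gets encoded for us, which is one reason to pass a params dictionary instead of building the URL by hand.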

Let’s add a quick test just like with the other tool:

if __name__ == "__main__":
    print(get_weather.run("New York"))

Now go ahead and give it a test run and you should see something like the following:

{"location": {"name": "New York", "region": "New York", "country": "United States of America", "lat": 40.71, "lon": -74.01, "tz_id": "America/New_York", "localtime_epoch": 1711278898, "localtime": "2024-03-24 7:14"}, "current": {"last_updated_epoch": 1711278000, "last_updated": "2024-03-24 07:00", "temp_c": -0.6, "temp_f": 30.9, "is_day": 1, "condition": {"text": "Sunny", "icon": "//cdn.weatherapi.com/weather/64x64/day/113.png", "code": 1000}, "wind_mph": 2.2, "wind_kph": 3.6, "wind_degree": 2, "wind_dir": "N", "pressure_mb": 1020.0, "pressure_in": 30.13, "precip_mm": 0.0, "precip_in": 0.0, "humidity": 49, "cloud": 0, "feelslike_c": -5.9, "feelslike_f": 21.5, "vis_km": 16.0, "vis_miles": 9.0, "uv": 2.0, "gust_mph": 15.8, "gust_kph": 25.4}}

Excellent! We now have some functions for our agents to play around with while we explore building more complex systems using graphs.
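The raw response is fairly long. We will keep returning the full string in this tutorial, but if you ever want to hand the agent less text, something like the sketch below could trim it down. summarize_weather is a hypothetical helper and the sample data is a shortened version of the response shown above.

```python
import json


def summarize_weather(raw_response: str) -> str:
    """Reduce a weather JSON string to a short summary (hypothetical helper)."""
    data = json.loads(raw_response)
    location = data.get("location", {})
    current = data.get("current", {})
    if not location or not current:
        return "The weather response did not contain the expected fields."
    condition = current.get("condition", {}).get("text", "unknown")
    return (
        f"{location.get('name')}: {condition}, "
        f"{current.get('temp_c')} C ({current.get('temp_f')} F), "
        f"humidity {current.get('humidity')}%"
    )


# Shortened sample based on the example response above.
sample = json.dumps(
    {
        "location": {"name": "New York"},
        "current": {
            "temp_c": -0.6,
            "temp_f": 30.9,
            "condition": {"text": "Sunny"},
            "humidity": 49,
        },
    }
)

print(summarize_weather(sample))
```

Like our tools, the helper returns a readable message rather than raising when the expected fields are missing.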

Simplifying tool imports

There is one quick thing left to do before we move on to the next part. The way our tools folder is set up right now we would have to import the tools from the tools folder in a kind of awkward way:

# Example, no need to copy - we will not use this code
from tools import weather, image

weather.get_weather("Alabama")
image.generate_image(
    "A T-rex made from kentucky fried chicken is attacking the white house."
)

This weather.get_weather is kind of awkward so let’s create an __init__.py file in the tools folder to make it easier to import the tools. Create a new file called __init__.py in the tools folder:

 FINX_LANGGRAPH
     images
     tools
         __init__.py    New file
         image.py
         weather.py
     .env
     langchain_basics.py
     Pipfile
     Pipfile.lock
     setup_environment.py

In the __init__.py file add the following:

from .image import generate_image
from .weather import get_weather

This will import the generate_image and get_weather tools from their respective files and make them available when importing the tools folder. It has effectively made the tools folder a package that can be imported from as a single entity.

Now the above example can be changed to this:

# Example, no need to copy - we will not use this code
from tools import get_weather, generate_image

get_weather("Alabama")
generate_image("A T-rex made from kentucky fried chicken is attacking the white house.")

This is a lot more sensible. Save and close the __init__.py file and we are done with this part. In the next part, it is time to dive into LangGraph and start building some more complex systems using agents and tool calls to interlink them into a graph that can do some cool stuff. See you there!

P.S. I know you are secretly curious what the T-rex made from KFC attacking the white house looks like. Here it is:

Kentucky Fried T-rex, anyone?

LangGraph Introduction

Hello and welcome back to part 3 of this tutorial series. In this part, we’ll be getting started with LangGraph. Instead of having a lot of explanation before we start, we’ll see how stuff works as we go along. So without further ado, let’s just jump right in.

Let’s start by actually installing LangGraph, as it doesn’t get installed by default with LangChain. To install LangGraph, you can use the following command in your terminal:

pipenv install langgraph==0.0.30 langchainhub==0.1.15

Once you’ve installed LangGraph, let’s start by creating a new file called simple_langgraph.py:

 FINX_LANGGRAPH
     images
     tools
     .env
     langchain_basics.py
     Pipfile
     Pipfile.lock
     setup_environment.py
     simple_langgraph.py    New file

Over the next three parts, we’ll be looking at different ways in which you can use LangGraph to chain LLMs and tools together. In this first part we’ll be looking at a simple classic LLM –> goes to a tool executor –> and then back to LLM type setup.

Open up simple_langgraph.py and let’s start by importing the necessary modules:

import operator
from typing import Annotated, TypedDict, Union

from colorama import Fore, Style
from langchain import hub
from langchain.agents import create_openai_functions_agent
from langchain_core.agents import AgentAction, AgentActionMessageLog, AgentFinish
from langchain_core.messages import BaseMessage
from langchain_core.runnables.base import Runnable
from langchain_openai.chat_models import ChatOpenAI
from langgraph.graph import END, StateGraph
from langgraph.prebuilt.tool_executor import ToolExecutor

from setup_environment import set_environment_variables
from tools import generate_image, get_weather

That is a lot of stuff! Don’t worry, most of it is actually not as complex as it seems. Usually, I’ll go over all the imports before we get started, but as there are quite a few to go through, I’ll cover each import when we get to the part where it’s used instead. For now, just have them copied.

Next, we’ll set the environment variables and define a couple of constants:

set_environment_variables("LangGraph Basics")

LLM = ChatOpenAI(model="gpt-3.5-turbo-0125", streaming=True)
TOOLS = [get_weather, generate_image]
PROMPT = hub.pull("hwchase17/openai-functions-agent")

We reused our set_environment_variables function from the previous part to set the environment variables and set the name for the LangSmith traces to LangGraph Basics. We then define our LLM just like we did in part 1, also setting the streaming parameter to True. We then define a list of tools which is literally just a list containing the two tools that we wrote.

The LangChain Hub

For the prompt template, we pull it from the LangChain Hub this time, mostly because I want to show you that it exists! The LangChain Hub is kind of like a mini-GitHub for storing LangChain ChatPromptTemplates just like the simple ones we wrote in part 1. You can push new commits to your templates and pull them like we just did here, kind of like GitHub.

You can go to https://smith.langchain.com/ and scroll down to find the Hub button:

Click it to visually browse the prompts available on the hub:

You can use this as a convenient place to store your prompts. You can also set them to private if you don’t want to share them with the world and you can even fork other public prompts that you like to your own repositories. It’s a handy tool for development. For production or highly sensitive company data, you might want to store your prompts in a more secure location.

If we look up the prompt we just pulled, we can see that it is a fairly simple prompt:

It has an extremely basic system message of "You are a helpful assistant" and we can see that it has placeholders for chat_history, human input and an agent_scratchpad. The chat_history and input are kind of self-explanatory in that they hold the chat history so far and the human input, but what about this agent_scratchpad?

The agent_scratchpad is kind of like a place where the agent can take notes while going through its reasoning process of what action should be taken next and what functions should be called. Think of it as a notepad where the LLM can jot down its thoughts, kind of like the following:

user:
    "Can you recommend me a zombie game from the year 2022?"

    > Entering new AgentExecutor chain...
    Thought: Oh, I love zombie games! There are so many great ones out there. Let me think about the best zombie game from 2022.
    Action: use_search_engine
    Action Input: "best zombie game 2022"

    Observation:[{list of search result objects for query "best zombie game 2022"}]
    There are three great zombie games from 2022 that I found: Zombie Cure Lab, Zombie Survivors, and SurrounDead. Let me think about which one to recommend.
    Action: use_search_engine
    Action Input: "Zombie Cure Lab"

    Observation:[{list of search result objects for query "Zombie Cure Lab"}]
    Zombie Cure Lab is a game where you manage a lab and try to cure the zombie virus. (Bunch more info here yadayada...) I recommend Zombie Cure Lab as the best zombie game from 2022.

    Final Answer: The best zombie game from 2022 is Zombie Cure Lab.

This is just a conceptual example here to describe the idea, but the agent takes reasoning steps and makes observations along the way, first deciding to call a search engine tool to better answer the user question, then deciding to call the search engine tool to get more information on one of the games in particular, and then finally deciding that it has enough information to answer the user question.

So the agent_scratchpad is used to store these intermediate observations on what action to take next, but also to decide when the agent is done, so that it doesn’t just keep looping indefinitely. We’ll get back to how we can see when the agent is done in a moment.

The State Object

Ok, we have an LLM, some tools, and a prompt template. The next thing we need is a state object to keep track of the state for each step along our graph. So a LangGraph is kind of like a state machine, and it is going to take this state object and pass it along each node of the graph. Let’s look at a simplified example:

# Simplified example
StateObject():
    user_input = "please do a for me"
    chat_history = [list of previous chat messages for context...]
    am_i_done = False
    steps_taken = []

So say we have this state object above. We have received the user input question, and whatever chat history has come before if we have decided to implement memory. We have a flag am_i_done which is obviously set to False at the start, and we have a list of steps_taken which is empty at the start. Now we hand this state object to node A in our graph ->

# Simplified example Node A
StateObject():
    user_input = "please do a for me"
    chat_history = [list of previous chat messages for context...]
    am_i_done = False
    steps_taken = ["action_a was taken"]

It does some action we will just call action_a, which has taken it a step closer to answering the user question but it is not quite done yet so the am_i_done flag is still set to false. Now node A passes this state object to node B in our graph ->

# Simplified example Node B
StateObject():
    user_input = "please do a for me"
    chat_history = [list of previous chat messages for context...]
    am_i_done = True
    steps_taken = ["action_a was taken", "action_b was taken"]

This node does some action_b stuff and now has the final answer it needs to give to the user. It sets the am_i_done flag to True because it is done. We can use this am_i_done flag to test if the graph is completed yet (e.g. the user question or request has been fully answered).

So as the graph traverses over the nodes we define, each node will receive the state object, update it where needed, and then pass it along to the next node, or perhaps back to the previous one, or sideways to node D if a certain condition is met. So let’s define the real state object that we will be using:
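Before we do, here is the simplified example from above as a few lines of plain Python. This is only a sketch of the idea with made-up node functions, not how LangGraph is implemented: each node receives the state, returns an updated copy, and passes it along.

```python
from typing import Any

State = dict[str, Any]


def node_a(state: State) -> State:
    """Take a first step and record it; the work is not finished yet."""
    return {**state, "steps_taken": state["steps_taken"] + ["action_a was taken"]}


def node_b(state: State) -> State:
    """Take the final step, record it, and flag the state as done."""
    return {
        **state,
        "steps_taken": state["steps_taken"] + ["action_b was taken"],
        "am_i_done": True,
    }


state: State = {
    "user_input": "please do a for me",
    "chat_history": [],
    "am_i_done": False,
    "steps_taken": [],
}

# Hand the state object from node to node, just like in the example above.
for node in (node_a, node_b):
    state = node(state)

print(state["am_i_done"])
print(state["steps_taken"])
```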

class AgentState(TypedDict):
    input: str
    chat_history: list[BaseMessage]
    agent_outcome: Union[AgentAction, AgentFinish, None]
    intermediate_steps: Annotated[list[tuple[AgentAction, str]], operator.add]

We use a TypedDict to define a specific dictionary structure, defining the keys that this dictionary will have and the types of values that will be stored for each of those keys. The first entry is simply the user input, which is a str string value.

The second entry is the chat history, which is a list of BaseMessage objects. A BaseMessage is a single message together with its originator, such as “system”, “human”, or “ai”. Conceptually, each line below corresponds to one message:

# Example BaseMessages
("system", "You are a helpful AI bot. Your name is {name}."),
("human", "Hello, how are you doing?"),
("ai", "I'm doing well, thanks!"),
("human", "{user_input}"),

The third item in the state object will be agent_outcome. The agent here will do its thing and then either return an AgentAction object or an AgentFinish object to us.

AgentAction: An AgentAction object simply contains the name of the tool the agent wants to call and the input arguments for that tool call, maybe like get_weather and {"location": "New York"}.
AgentFinish: An AgentFinish object simply means that the agent considers its task finished and holds the final return_values inside.

Using this agent_outcome object we can see what the next step is or if it is done.

The fourth and last entry in the AgentState object is a bit easier to read from the inside. We have a list of tuples where each tuple contains an AgentAction object and a str string. The AgentAction here is the same object that we described in the step above, containing a tool to be called and its input arguments. The difference here is that the step is already taken and the string which is the second item in the tuple is the tool output after it was called. So something like this:

## Fictional example object
[
    (
        AgentAction(tool="get_weather", input={"location": "New York"}),
        "{API response JSON object...}",
    ),
    (
        AgentAction(tool="generate_image", input={"image_description": "cat"}),
        "Path/to/image.png",
    ),
]

The Annotated type hint attaches metadata to a type. Here the metadata is the operator.add function, which LangGraph uses as a reducer for this key: when a node returns a value for intermediate_steps, that value is added to the existing list instead of replacing it. This is why a node only needs to return the new steps, and the list keeps growing like the example above.
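To see how metadata like this can be used to merge updates into a state, here is a small sketch in plain Python. DemoState and apply_update are hypothetical names, and this is a simplified illustration of the idea rather than LangGraph’s actual implementation.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class DemoState(TypedDict):
    agent_outcome: str
    intermediate_steps: Annotated[list[tuple[str, str]], operator.add]


def apply_update(state: dict, update: dict) -> dict:
    """Merge a node's update into the state (hypothetical helper).

    Keys annotated with a function are combined with the old value,
    all other keys are simply overwritten.
    """
    hints = get_type_hints(DemoState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        metadata = getattr(hints[key], "__metadata__", ())
        if metadata and key in merged:
            merged[key] = metadata[0](merged[key], value)
        else:
            merged[key] = value
    return merged


old_state = {
    "agent_outcome": "call get_weather",
    "intermediate_steps": [("get_weather", "Sunny")],
}
update = {
    "agent_outcome": "call generate_image",
    "intermediate_steps": [("generate_image", "path/to/image.png")],
}

new_state = apply_update(old_state, update)
print(new_state["agent_outcome"])
print(new_state["intermediate_steps"])
```

The agent_outcome key has no function attached so the new value replaces the old one, while the two intermediate_steps lists are joined using operator.add.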

The Agent

Now that we have our state object defined, we will define our agent that will have access to both the generate_image and get_weather tools:

runnable_agent: Runnable = create_openai_functions_agent(LLM, TOOLS, PROMPT)

We use the create_openai_functions_agent function we imported from LangChain to create an agent that has access to the LLM, the tools, and the prompt we defined so far. LangChain will make this into an OpenAI compatible agent by combining them for us into a Runnable type object. We have seen this Runnable object before in part 1 in the form of our chains. All Runnable type objects have the invoke, stream, and batch methods just like the chains we used in part 1.

Before we move on with the nodes and graph let’s test the agent we have so far. We’ll manually create a quick input here (as we haven’t built our graph yet) and then call invoke on the agent:

inputs = {
    "input": "give me the weather for New York please.",
    "chat_history": [],
    "intermediate_steps": [],
}

agent_outcome = runnable_agent.invoke(inputs)
print(agent_outcome)

Now go ahead and run this to test the agent so far and you should see something like this:

API Keys loaded and tracing set with project name:  LangGraph Basics
tool='get_weather' tool_input={'location': 'New York'} log="\nInvoking: `get_weather` with `{'location': 'New York'}`\n\n\n" message_log=[AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{"location":"New York"}', 'name': 'get_weather'}}, response_metadata={'finish_reason': 'function_call'})]

We can see the agent wants to call the get_weather tool with the input {"location": "New York"}, so it’s asking us to call this function with these input arguments. Of course, it stopped running there as we haven’t linked up any other nodes yet, but we know that the agent is working so far.

Go ahead and remove the test inputs and agent_outcome code. Just for clarity, here is what you should have so far:

import operator
from typing import Annotated, TypedDict, Union

from colorama import Fore, Style
from langchain import hub
from langchain.agents import create_openai_functions_agent
from langchain_core.agents import AgentAction, AgentActionMessageLog, AgentFinish
from langchain_core.messages import BaseMessage
from langchain_core.runnables.base import Runnable
from langchain_openai.chat_models import ChatOpenAI
from langgraph.graph import END, StateGraph
from langgraph.prebuilt.tool_executor import ToolExecutor

from setup_environment import set_environment_variables
from tools import generate_image, get_weather


set_environment_variables("LangGraph Basics")

LLM = ChatOpenAI(model="gpt-3.5-turbo-0125", streaming=True)
TOOLS = [get_weather, generate_image]
PROMPT = hub.pull("hwchase17/openai-functions-agent")


class AgentState(TypedDict):
    input: str
    chat_history: list[BaseMessage]
    agent_outcome: Union[AgentAction, AgentFinish, None]
    intermediate_steps: Annotated[list[tuple[AgentAction, str]], operator.add]


runnable_agent: Runnable = create_openai_functions_agent(LLM, TOOLS, PROMPT)

The Nodes

So now the first thing we need to do is to create some nodes here so we can string them together into a graph. Let’s start with the Agent Node:

def agent_node(input: AgentState):
    agent_outcome: AgentActionMessageLog = runnable_agent.invoke(input)
    return {"agent_outcome": agent_outcome}

We define the node as a simple function that takes input which will be the AgentState object for all nodes. It then calls the invoke method on the agent with the input and catches the return in a variable named agent_outcome which is of type AgentActionMessageLog. This agent_outcome will have either the AgentAction object or the AgentFinish object that we talked about earlier, indicating what the next step is or if the agent is done. Whatever is in the agent_outcome, this function simply returns it in a dictionary.

Now that we have an agent node we need another node to execute the tools that the agent wants to call. Let’s define the Tool Executor Node:

tool_executor = ToolExecutor(TOOLS)

def tool_executor_node(input: AgentState):
    agent_action = input["agent_outcome"]
    output = tool_executor.invoke(agent_action)
    print(f"Executed {agent_action} with output: {output}")
    return {"intermediate_steps": [(agent_action, output)]}

First, we create a new instance of the ToolExecutor class that we imported from LangGraph. This ToolExecutor is initialized by giving it our list of tools which includes two tools in this case. The ToolExecutor provides a prebuilt interface that will extract the function and arguments the agent wants to call from the AgentAction object and then call the function with the arguments so we don’t have to do this manually.

Then we define the tool_executor_node function which again is just a simple function with input (which will be the state object). We extract the agent_action from the input dictionary and then call the invoke method on the tool_executor object which will run whatever tool the agent wants to call for us.

We have a print statement just for our own visual feedback here, and then we return the intermediate_steps list with the agent_action and the output of the tool call. Notice that this is the intermediate steps list that we defined in the AgentState object and talked about earlier and will be added to whatever steps were already there.

Now that we have these two functions for the nodes, we need a way to test if we want to finish the graph because the Agent Node has arrived at the final answer or if we need to continue on to the Executor node because it needs to execute a tool call. We can do this by defining a function that will check if the agent is done:

def continue_or_end_test(data: AgentState):
    if isinstance(data["agent_outcome"], AgentFinish):
        return "END"
    else:
        return "continue"

This function takes the AgentState object as input. Then it simply indexes into the agent_outcome. We said earlier that the agent_outcome will either be an AgentAction object (if still working) or an AgentFinish object if the agent is done. So if the agent_outcome is an instance of AgentFinish we return "END" to signal that the graph is done, otherwise, we return "continue" to signal that the graph should continue.

Creating our Graph

Now that we have two nodes and a test to see if we need to continue (this is just a very simple first example to explain the concepts), we can define our graph. The main type of graph in LangGraph is called a StateGraph, which passes a state object around as we discussed. Each node then returns some kind of update to that state, either setting specific attributes or adding to the existing attribute like the intermediate_steps list.

Setting up our graph is easy:

workflow = StateGraph(AgentState)

workflow.add_node("agent", agent_node)
workflow.add_node("tool_executor", tool_executor_node)

workflow.set_entry_point("agent")

First, we instantiate a new StateGraph passing in our AgentState object that we defined. We then simply add our two nodes, giving them a string name as the first argument and passing in the functions we wrote as the second. Lastly, we set the entry point to the agent node, which is the first node that will be called when we start the graph.

Now we have a graph with an entry point. The next step is to define the connections called edges between the nodes. This is also very easy:

workflow.add_edge("tool_executor", "agent")

workflow.add_conditional_edges(
    "agent", continue_or_end_test, {"continue": "tool_executor", "END": END}
)

First, we add an edge from the tool_executor node back to the agent node. After we execute a tool call, we always want to feed the result back into the agent node.

Then we add a conditional edge from the agent node. We pass in our continue_or_end_test function that will determine where this edge will lead. If the function returns "continue" we will go to the tool_executor node, and if it returns "END" we will go to the END node. The END node is a special pre-built node that was part of our imports when we started this file.
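If the routing feels abstract, the sketch below hand-rolls the same loop in plain Python. Everything in it is a stand-in made up for illustration (FakeAction, FakeFinish, the fake agent, the END sentinel, and the run function); it mimics the flow of our graph and is not how LangGraph works internally.

```python
from dataclasses import dataclass


@dataclass
class FakeAction:  # stand-in for AgentAction
    tool: str
    tool_input: dict


@dataclass
class FakeFinish:  # stand-in for AgentFinish
    output: str


END = "__end__"  # hypothetical sentinel for this sketch


def agent_node(state: dict) -> dict:
    # Fake agent: ask for the weather tool once, then finish.
    if not state["intermediate_steps"]:
        return {"agent_outcome": FakeAction("get_weather", {"location": "New York"})}
    return {"agent_outcome": FakeFinish("It is sunny in New York.")}


def tool_executor_node(state: dict) -> dict:
    # Fake tool call; returns the full list as this sketch has no reducer.
    action = state["agent_outcome"]
    return {"intermediate_steps": state["intermediate_steps"] + [(action, "Sunny")]}


def continue_or_end_test(state: dict) -> str:
    return "END" if isinstance(state["agent_outcome"], FakeFinish) else "continue"


NODES = {"agent": agent_node, "tool_executor": tool_executor_node}
EDGES = {"tool_executor": "agent"}
CONDITIONAL_EDGES = {
    "agent": (continue_or_end_test, {"continue": "tool_executor", "END": END})
}


def run(state: dict, entry_point: str = "agent") -> tuple[dict, list[str]]:
    current, visited = entry_point, []
    while current != END:
        visited.append(current)
        state = {**state, **NODES[current](state)}  # merge the node's update
        if current in CONDITIONAL_EDGES:
            test, mapping = CONDITIONAL_EDGES[current]
            current = mapping[test(state)]  # the test result picks the next node
        else:
            current = EDGES[current]  # a normal edge always leads to one node
    return state, visited


final_state, visited = run(
    {"input": "weather?", "agent_outcome": None, "intermediate_steps": []}
)
print(visited)
print(final_state["agent_outcome"].output)
```

The visited list shows the same path our real graph takes for a single tool call: agent, then tool_executor, then back to agent before ending.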

Our simple graph in visual form now looks like this:

Now that we have our graph defined, we need to take the final step which is to compile the graph before we can use it:

weather_app = workflow.compile()

Testing our Graph

Now let’s whip up a quick function to test our graph:

def call_weather_app(query: str):
    inputs = {"input": query, "chat_history": []}
    output = weather_app.invoke(inputs)
    result = output.get("agent_outcome").return_values["output"]  # type: ignore
    steps = output.get("intermediate_steps")

    print(f"{Fore.BLUE}Result: {result}{Style.RESET_ALL}")
    print(f"{Fore.YELLOW}Steps: {steps}{Style.RESET_ALL}")

    return result

The function will take a string query. As input, we need to define the input key with the query and an empty chat_history list as we don’t have a previous history for now. We then call invoke on the weather_app graph object and catch the output in a variable named output. The agent_outcome will have an AgentFinish which has the return_values attribute that holds the final answer as we discussed.

# type: ignore is just for the type checker here as it doesn’t know that agent_outcome will always be an AgentFinish object and I don’t want to go too far into type hinting in this tutorial. If you don’t use type checking you won’t need the comment. We also extract the intermediate_steps list from the output into a variable named steps.

When we started the file we imported Fore and Style from the colorama library. This library has already been installed as a dependency of something else, so we didn’t have to install it. The Fore.BLUE sets the text foreground color to blue and the Style.RESET_ALL resets the color back to the default, repeating the pattern on the next line with yellow for easy readability.
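For the curious, these colorama constants are just ANSI escape sequences under the hood. The sketch below reproduces the same blue and yellow output without colorama, assuming your terminal understands ANSI codes. The names BLUE, YELLOW, RESET, and colorize are my own for illustration.

```python
# The escape sequences behind colorama's Fore.BLUE, Fore.YELLOW and Style.RESET_ALL.
BLUE = "\033[34m"
YELLOW = "\033[33m"
RESET = "\033[0m"


def colorize(text: str, color: str) -> str:
    """Wrap text in a color code and reset the terminal color afterwards."""
    return f"{color}{text}{RESET}"


print(colorize("Result: sunny", BLUE))
print(colorize("Steps: ...", YELLOW))
```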

Now we can test our graph by calling the function with a query:

call_weather_app("What is the weather in New York?")

Go ahead and run this and you should see the final answer in blue:

Result: The current weather in New York is sunny with a temperature of 35.1°F (1.7°C). The wind is coming from the north at 11.2 km/h. The humidity is at 52%, and
the visibility is 16.0 km.
Steps: All the steps here in yellow...

Good! That worked. The steps are a bit hard to read, but that is what we have LangSmith for. Head over to https://smith.langchain.com/ and check out your trace under the project name of LangGraph Basics. Take the one named LangGraph as the RunnableSequence one is from when we did the partial test before we built our graph:

We can see that the graph started with our agent, then went to the tool_executor, back to the agent, and then ended. Click on any of the steps to see more detail. Nice and readable right?

Something a bit cooler!

So let’s give our simple graph test here a bit of a bigger challenge! Comment out the old query and let’s ask something a bit harder:

# call_weather_app("What is the weather in New York?")

call_weather_app("Give me a visual image displaying the current weather in Seoul, South Korea.")

Let’s run this and see what we get (it should auto-save an image in the project’s images folder):

Result: Here is the visual image displaying the current weather in Seoul, South Korea:

![Seoul, South Korea Weather](c:\Coding_Vault\FINX_LANGGRAPH\images\152cf0e0-c50e-483b-be63-50ef40ea3255.png)

That’s pretty good! It has the temperature and the rain. I can confirm that it is currently dark and rainy over here and this also corresponds to the weather data the API sent back. Pretty dang cool right!?

If we look at the LangSmith trace we’ll see exactly what we expect:

The agent calls the weather function, it comes back to the agent which calls the image function, and then it ends by giving us the image. I’ll leave you to click on any of the steps if you want to see the in and outputs at each step.

Of course, we could put the request for a visual image into the prompt so the user doesn’t have to type it, and we could improve on this in many other ways, like displaying the image directly to the end user. That is not the point here though. This is just a simple demonstration of how the edges and nodes come together to create a simple graph.

In the next part we’ll take this up a step. Where we basically have a single agent now, we’ll look at having a whole team of agents working together! I’ll see you in the next part!

P.S. I generated another one just for fun and it’s pretty good:

Multi-Agent LangGraph Teams Preparation

Hi and welcome back to part 4 of this tutorial series where we’ll once again be taking it up a step. We’ll basically compress the Agent and the Executor into a single node and then have multiple of these ‘agent and executor’ nodes inside of a team working together. First, we’ll cover the basic idea and do some short work to prepare the extra functions we will need, and then we’ll continue into the next part where we’ll put it all together into a multi-agent team that does the work for us while we sit back and relax!

Advantages of multi-agent teams

So why is this multi-agent thing useful in the first place? We can simply give one agent multiple tools, right? Well, up to a point. If you give a single agent a prompt to first do thing A by calling function_a, then do thing B by calling function_b, followed by either function_c or function_d depending on the output of function_b, then the prompt of this agent is going to become a mess and the agent will also be fairly unreliable. The main advantages of multi-agent teams for more complex setups are:

Grouping responsibilities gives better results as agents will tend to perform better when they have a more focused task rather than a dozen tools and responsibilities to choose from.
Separate prompts will give better results as each prompt can have its own examples of exactly what we want it to do and how. We can even have a specific agent run on a fine-tuned version of ChatGPT that is specifically trained and optimized for that node’s task.
Easier development as you can work on, test, and evaluate each agent in isolation without it being connected to and breaking stuff elsewhere in the chain when you make improvements. It’s also easier to conceptually wrap your brain around the system as a whole.

There are many possible slight variations for how this could be implemented. You could have a shared scratchpad for example so that all of the agents can see what thought processes and work the other agents have done. The downside is that this is very verbose though and the amount of information exchanged may be pointlessly large.

Alternatively, you could have them be isolated as single LLM calls without a strong interconnection that basically operate independently but they are merely strung together in a chain. This may be a bit too isolated though.

The example we’ll be looking at here lies somewhere in the middle where we will have independent fully-fledged agents that have their own scratchpad and ability to call tools if needed but the result of each agent doing its independent work gets stored in a shared state object like we had in the previous part.

This will be overseen by a ‘team supervisor’ node, a kind of ‘agent supervisor’, that will use this overall state object with the work done so far to decide what happens next and who to call. The basic idea looks like this:

The user sends a query to the Team Supervisor. The Team Supervisor then has a team of agents and it decides who it should call on next to complete some work, it can choose any of the agents at any point. Every agent points back to the Team Supervisor so that the Team Supervisor gets to decide again after each step which agent is next or if the work has been completed, in which case it will return to the end user.

Ours will look slightly different but we’ll build a diagram for it as we go along.
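The supervisor loop described above can be sketched in a few lines of plain Python. This is only a conceptual illustration with no LLM involved: the scripted_supervisor function and the two lambda workers are hypothetical stand-ins for the real supervisor and agents.

```python
# Conceptual sketch of a supervisor loop; no LLMs or LangGraph involved.
from typing import Callable

# Hypothetical workers: each receives the query and the work so far.
workers: dict[str, Callable[[str, list[str]], str]] = {
    "researcher": lambda query, work: f"notes for '{query}'",
    "writer": lambda query, work: f"draft using {len(work)} earlier result(s)",
}


def scripted_supervisor(work: list[str]) -> str:
    """Pick the next worker; a real supervisor would ask an LLM instead."""
    order = ["researcher", "writer"]
    return order[len(work)] if len(work) < len(order) else "FINISH"


def run_team(query: str) -> list[str]:
    work: list[str] = []
    while True:
        choice = scripted_supervisor(work)
        if choice == "FINISH":
            return work  # hand the finished work back to the user
        # Every worker reports back, then the supervisor decides again.
        work.append(workers[choice](query, work))


team_output = run_team("Plan a trip")
print(team_output)
```

The important part is the shape of the loop: control always returns to the supervisor after a worker finishes, and only the supervisor can decide to stop.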

Tavily API

Before we jump in we’ll need to add another API key to our .env and setup_environment.py files. We will be using the Tavily API lightly during this part and again in the next part of the series. Go to https://app.tavily.com/ and sign up for a free API key.

Tavily is a search engine optimized for AI agents and we can use it to have an agent search the internet. One of the reasons I chose Tavily here is that LangChain comes with pre-built tools for Tavily that we can just import and use as is, allowing us to focus more on learning about LangGraph as we have one less tool to write. You can just use your Google account for quick and easy sign up and it will cost you nothing for the first 1000 or so queries which is way more than we’ll use. Get your API key and copy it to the clipboard. Then open your .env file and add it like so:

OPENAI_API_KEY=your_api_key_here
LANGCHAIN_API_KEY=your_api_key_here
WEATHER_API_KEY=your_api_key_here
TAVILY_API_KEY=your_api_key_here

Make sure not to use any spaces or quotation marks as usual. Then go ahead and save and close the .env file. Now open the setup_environment.py file and add a single line to load the TAVILY_API_KEY into an environment variable like so:

import os
from datetime import date

from decouple import config


def set_environment_variables(project_name: str = "") -> None:
    if not project_name:
        project_name = f"Test_{date.today()}"

    os.environ["OPENAI_API_KEY"] = str(config("OPENAI_API_KEY"))

    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_API_KEY"] = str(config("LANGCHAIN_API_KEY"))
    os.environ["LANGCHAIN_PROJECT"] = project_name

    ##### Add only this line #####
    os.environ["TAVILY_API_KEY"] = str(config("TAVILY_API_KEY"))
    ##############################

    print("API Keys loaded and tracing set with project name: ", project_name)

Now save and close the setup_environment.py file.

Prep for our multi-agent team

For this example over the next two parts, we will be creating a multi-agent team that will generate travel itineraries for us in PDF format, with us simply inputting a query and getting a fully formed PDF travel itinerary out the other end including an image. We will have three different tools that we will need for the overall setup:

An image generator: We already made one in the last part, so we can just import and reuse it, which is one of the nice things about LangChain tools.
An internet search tool: In case the agent wants to search for more information. LangChain comes with some pre-built tools one of which is for Tavily Search, which is why we got the API key. We can just use this prebuilt here to save some time.
A PDF generator: We will need a tool for our agents to be able to write PDF files and save them to disk. We will have to write this one ourselves before we can get started on our travel itinerary multi-agent team setup.

PDF writing tool

So let’s write up a quick PDF writing tool for our agents before we move on. Inside your tools folder make a new file named pdf.py:

 FINX_LANGGRAPH
     images
     tools
         __init__.py
         image.py
         pdf.py    New file
         weather.py
     .env
     langchain_basics.py
     Pipfile
     Pipfile.lock
     setup_environment.py
     simple_langgraph.py

Inside this new pdf.py file get started with our imports:

import os
import uuid
from pathlib import Path

import pdfkit
from langchain.tools import tool
from markdown import markdown
from pydantic import BaseModel, Field

We import os to work with the operating system, uuid to generate unique filenames again, and Path to create a path towards an output folder to save our PDF files. The tool decorator from LangChain is the same one that we used last time and the BaseModel and Field imports from pydantic are for defining the input arguments interface for our function just like we did before.

The pdfkit library is going to let us save HTML to real output PDF files, but the downside is that it needs HTML as input to do the conversion. HTML is more complex for our LLM agents to write, which introduces more variables, and I want to keep this example simple. So we will be using the markdown library to convert markdown to HTML for us. That way we can just tell our agents to write in markdown formatting (which is very simple) and our function will do markdown -> HTML -> PDF.

Both pdfkit and markdown are not installed by default so we will have to install them in our virtual environment. Open your terminal and run:

pipenv install markdown==3.6 pdfkit==1.0.0

That will take care of the basic Python library installs, but pdfkit needs an additional step, as it actually uses something called wkhtmltopdf under the hood to achieve the conversion. Head over to https://wkhtmltopdf.org/downloads.html and click the appropriate download for your platform. I am on Windows so I’ll select the Windows 64-bit download option:

Run the installer and select an install location. I’ll simply use the default C:\Program Files\wkhtmltopdf myself. Whichever install location you choose, take note of it and copy it somewhere as you will need it in a moment:

Let that run the install and when it’s done we can get back to the code! Below our imports in pdf.py we’ll add some quick setup:

PATH_WKHTMLTOPDF = r"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe"
PDFKIT_CONFIG = pdfkit.configuration(wkhtmltopdf=PATH_WKHTMLTOPDF)

OUTPUT_DIRECTORY = Path(__file__).parent.parent / "output"

First of all, we do some setup for pdfkit by pointing it to the location of the wkhtmltopdf executable. This is the path I used on my Windows machine. You have to adjust this path to wherever you installed wkhtmltopdf on your machine, so be sure that you use the correct path for you! After defining the path we can simply call pdfkit.configuration with the wkhtmltopdf argument set to the path we just defined. Later in the code when we actually write the PDF files, we can pass in this PDFKIT_CONFIG as an argument to use this configuration.

We then use the same trick as last time to get a path to a folder named output in our project root. This is where we will save our PDF files, but the folder doesn’t exist yet. Make sure you create it right now or the code will fail when it tries to save the PDF files later and you’ll be stuck debugging why it doesn’t work:

 FINX_LANGGRAPH
     images
     output    New empty folder
     tools
         __init__.py
         image.py
         pdf.py
         weather.py
     .env
     langchain_basics.py
     Pipfile
     Pipfile.lock
     setup_environment.py
     simple_langgraph.py
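If you would rather not rely on remembering to create the folder by hand, a small optional guard can create it for you. This is a sketch and not part of the tool as written in this tutorial. It only uses Path.mkdir from the standard library, the ensure_directory helper name is my own, and the demo points at a temporary directory so it is safe to run anywhere.

```python
import tempfile
from pathlib import Path


def ensure_directory(directory: Path) -> Path:
    """Create the directory (and any missing parents) if it does not exist yet."""
    directory.mkdir(parents=True, exist_ok=True)
    return directory


# Demo against a throwaway location; in pdf.py you would pass OUTPUT_DIRECTORY.
with tempfile.TemporaryDirectory() as tmp:
    output_dir = ensure_directory(Path(tmp) / "output")
    created = output_dir.is_dir()
    # Calling it a second time is harmless thanks to exist_ok=True.
    ensure_directory(output_dir)

print(created)
```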

Good! Now back to our pdf.py file. Below the setup we’ll define our input arguments interface just like we did with our other tools so far:

class MarkdownToPDFInput(BaseModel):
    markdown_text: str = Field(
        description="Markdown text to convert to PDF, provided in valid markdown format."
    )

We simply define the input arguments as a single string that has to be in a valid markdown format. Once again make sure your description is a good one as the LLM will use it, it is not just for our own reference.

HTML generation

Let’s make the problem we need to solve smaller by first writing a separate function to generate the HTML from the markdown text so we can just feed HTML into pdfkit:

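def generate_html_text(markdown_text: str) -> str:
    """Convert markdown text to HTML text."""
    markdown_text = markdown_text.replace("file:///", "").replace("file://", "")
    html_text = markdown(markdown_text)
    # Minimal HTML wrapper; the style values are example values.
    # Literal CSS braces are doubled because this is an f-string.
    html_text = f"""
    <html>
    <head>
        <style>
            @import url('https://fonts.googleapis.com/css2?family=Roboto&display=swap');
            body {{
                font-family: 'Roboto', sans-serif;
                line-height: 150%;
            }}
        </style>
    </head>
    <body>
    {html_text}
    </body>
    </html>
    """
    return html_text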

This function takes a markdown_text as string input. First, we’ll search the markdown text for any file:/// or file:// protocol declarations sometimes used when the model inserts our image in markdown. These are not needed so we simply replace them with an empty string "" as these would cause our image to not show up in the final generated PDF file. This kind of thing is something you just discover during your development work.
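Here is a quick sketch of what that replacement does to an image link. The sample path is made up for illustration and strip_file_protocol is just a helper name I chose. Note that the order matters: file:/// has to be handled before file://, otherwise a stray slash is left behind.

```python
def strip_file_protocol(markdown_text: str) -> str:
    """Remove file:/// and file:// prefixes, longest first."""
    return markdown_text.replace("file:///", "").replace("file://", "")


# Hypothetical example of what a model might write for a local image.
sample = "![Seoul](file:///C:/project/images/seoul.png)"
cleaned = strip_file_protocol(sample)
print(cleaned)  # ![Seoul](C:/project/images/seoul.png)

# Swapping the order leaves a leading slash behind.
wrong_order = sample.replace("file://", "").replace("file:///", "")
print(wrong_order)  # ![Seoul](/C:/project/images/seoul.png)
```

One thing to be aware of: on Linux or macOS, where absolute paths start with a slash, stripping file:/// also removes that leading slash, so you may need to adjust this step for your platform.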

Now we can simply call the markdown function we imported on our markdown to get valid HTML based on the markdown. As I felt like doing some light styling I then wrapped the html_text in some basic HTML tags: html, head, and body. In the head we can then include a style tag which allows us to load the Roboto font from Google using the CSS @import url rule, set it as the font, and give some extra line height to our document to make the text more readable. This is the final html_text that will be returned, with the HTML converted from our markdown in the body portion. If you happen to be less familiar with HTML just copy what I have, it’s not really important for the course.

Finishing up the tool

Now it’s time to define the actual tool itself. Continue below:

@tool("markdown_to_pdf_file", args_schema=MarkdownToPDFInput)
def markdown_to_pdf_file(markdown_text: str) -> str:
    """Convert markdown text to a PDF file. Takes valid markdown as a string as input and will return a string file-path to the generated PDF."""
    html_text = generate_html_text(markdown_text)
    unique_id: uuid.UUID = uuid.uuid4()
    pdf_path = OUTPUT_DIRECTORY / f"{unique_id}.pdf"

    options = {
        "no-stop-slow-scripts": True,
        "print-media-type": True,
        "encoding": "UTF-8",
        "enable-local-file-access": "",
    }

    pdfkit.from_string(
        html_text, str(pdf_path), configuration=PDFKIT_CONFIG, options=options
    )

    if os.path.exists(pdf_path):
        return str(pdf_path)
    else:
        return "Could not generate PDF, please check your input and try again."

We start with the @tool decorator, once again providing a string name for our function and then the input argument interface we defined. The function itself takes a markdown_text as input and returns a string file path to the generated PDF file. We have a docstring that explains what the function does and what it expects as input as the LLM is going to use this.

We then call our generate_html_text function on the markdown_text to get the html_text we need and generate a unique ID for the PDF file name, creating a path to the PDF file in our OUTPUT_DIRECTORY folder. We then define some options for pdfkit to use when generating the PDF. These are just some basic options that I found to work ok for our example, we don’t want to get sidetracked here by spending too much time on this as it is not the focus of this tutorial.

Finally, we call pdfkit.from_string with the html_text, the path to the PDF file in str format instead of a Path object, the configuration we set up atop this file, and the options we just defined. If the PDF file is successfully generated, which we can check with the os.path.exists function to see if the file exists or not, we return the path to the PDF file. If it does not exist we return a message saying that the PDF could not be generated. We purposely do not raise an error but send a string response as the agent can receive this, try to find the error, fix it, and try again.
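The same "tell the agent instead of crashing" idea can be extended to exceptions raised inside a tool. Below is a minimal sketch of a decorator that turns any exception into a readable string. The errors_as_text decorator and the fake_pdf_writer function are hypothetical names for illustration and are not part of the tool above.

```python
import functools
from typing import Callable


def errors_as_text(func: Callable[..., str]) -> Callable[..., str]:
    """Return the error message as a string instead of raising."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs) -> str:
        try:
            return func(*args, **kwargs)
        except Exception as error:  # broad on purpose: the agent reads the text
            return f"Tool failed: {error}. Please check your input and try again."

    return wrapper


@errors_as_text
def fake_pdf_writer(markdown_text: str) -> str:
    """Hypothetical stand-in for a tool body that can fail."""
    if not markdown_text.strip():
        raise ValueError("markdown_text is empty")
    return "output/example.pdf"


ok_result = fake_pdf_writer("# Title")
error_result = fake_pdf_writer("   ")
print(ok_result)
print(error_result)
```

Either way the agent gets a string back, so it has a chance to read what went wrong and try again.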

PDF tool test run

Now let’s add a quick test at the bottom of our file:

markdown_dummy_text = """
# Title
This is a test of the markdown to PDF function.
## Subtitle
This is a test of the markdown to PDF function.
### Sub-subtitle
This is a test of the markdown to PDF function. This is a paragraph with random text in it nunc nunc tincidunt nunc, nec.
S'il vous plaît.
"""

if __name__ == "__main__":
    print(markdown_to_pdf_file(markdown_dummy_text))

There are a couple of headings here and some French with non-standard characters like in “plaît” to make sure it also works with special characters. Now go ahead and run your file (Reminder: make sure you created the output folder!). Close the printer message popup if you get one, we’ll just ignore it for now. You should see a new PDF file in your output folder. Go ahead and open it:

It’s not perfect by any means, but it works well enough for our LangGraph example purposes. As LangGraph is the focus here we will not spend any more time perfecting the details of this particular tool.

One last step though to fix the imports. Open up the tools/__init__.py file and fix the code to:

from .image import generate_image
from .weather import get_weather
from .pdf import markdown_to_pdf_file

Save and close that so we can have the nicer imports in our main code. That’s it for the preparation, this part is slightly shorter by design as the next one will be extra long. It is finally time to set up and run our multi-agent team! So let’s get to the fun stuff, I’ll see you there!

Multi-Agent LangGraph: Setting Up Our Multi-Agent Team

Welcome back to part 5, where we’ll set up our multi-agent team. So buckle up and let’s jump right in. Create a new file named multi_agent.py in your project root:

 FINX_LANGGRAPH
     images
     output
     tools
         __init__.py
         image.py
         pdf.py
         weather.py
     .env
     langchain_basics.py
     multi_agent.py    New file
     Pipfile
     Pipfile.lock
     setup_environment.py
     simple_langgraph.py

Open up the multi_agent.py file and start with the imports:

import functools
import operator
from typing import Annotated, Sequence, TypedDict

from colorama import Fore, Style
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain.output_parsers.openai_functions import JsonOutputFunctionsParser
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_core.messages import BaseMessage, HumanMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.language_models.chat_models import BaseChatModel
from langchain_openai import ChatOpenAI
from langgraph.graph import END, StateGraph

from setup_environment import set_environment_variables
from tools import generate_image, markdown_to_pdf_file

We have a lot of imports again, many of which will be familiar. We import our own two tools from the tools folder and also the TavilySearchResults from the langchain_community tools. There are some new imports like functools and the AgentExecutor but we’ll cover each one and how they are used as we go along.

Environment variables and constants

Let’s load up our environment variables and create a bunch of constants we’ll need:

set_environment_variables("Multi_Agent_Team")

TRAVEL_AGENT_NAME = "travel_agent"
LANGUAGE_ASSISTANT_NAME = "language_assistant"
VISUALIZER_NAME = "visualizer"
DESIGNER_NAME = "designer"

TEAM_SUPERVISOR_NAME = "team_supervisor"
MEMBERS = [TRAVEL_AGENT_NAME, LANGUAGE_ASSISTANT_NAME, VISUALIZER_NAME]
OPTIONS = ["FINISH"] + MEMBERS

We load our environment variables and set the project name to Multi_Agent_Team. We then define a bunch of constants for the names of our agents and the team supervisor. These are just strings but as we’ll have to type each of these strings multiple times it will be very annoying if we change or mistype one, hence storing these in a single place up top is the way to go.

Note that we have the travel_agent, language_assistant, and visualizer inside a list called MEMBERS, while the designer and team_supervisor sit outside of it. We also imported the END node we used last time. That leaves us with a situation like this:

The list named OPTIONS is going to be the potential options the team_supervisor can choose from each step along the way, so it has all three members in the team + the "FINISH" option to indicate this particular team has finished its work.

Add two more final constants below:

TAVILY_TOOL = TavilySearchResults()
LLM = ChatOpenAI(model="gpt-3.5-turbo-0125")

We have the TAVILY_TOOL which is the Tavily search tool we imported from the langchain_community tools and the LLM which is gpt-3.5-turbo-0125 here but feel free to use GPT-4-turbo instead if you want.

Agent creator function

We’re going to be creating a lot of agents here, so let’s create a function to handle the repetitive work of creating an agent:

def create_agent(llm: BaseChatModel, tools: list, system_prompt: str):
    prompt_template = ChatPromptTemplate.from_messages(
        [
            ("system", system_prompt),
            MessagesPlaceholder(variable_name="messages"),
            MessagesPlaceholder(variable_name="agent_scratchpad"),
        ]
    )
    agent = create_openai_tools_agent(llm, tools, prompt_template)
    agent_executor = AgentExecutor(agent=agent, tools=tools)  # type: ignore
    return agent_executor

We define a function named create_agent which takes an llm of the type BaseChatModel. This is just a type hint but it was part of our imports for clarity. BaseChatModel is the base class for all chat models in LangChain, including the ChatOpenAI variation we use here. You can pass any LLM you want and have different nodes of the same graph run on completely different LLMs. The other arguments are a list of tools and a system_prompt string.

We then declare a prompt_template using the ChatPromptTemplate.from_messages method that we used all the way back in part 1, but this time we use multiple messages. We have a "system" message that is the system prompt string passed into the function and then we have two placeholders for the messages and agent_scratchpad variables that we have seen before. The MessagesPlaceholder, as the name suggests, is just a placeholder for both of these so we can insert them later using the names we have defined under variable_name.

We then use the create_openai_tools_agent just like we did in part 3, but this time we go one step further and create an AgentExecutor in the step below. This AgentExecutor comes with LangChain and will basically combine the agent and the executor nodes we had in the previous part into a single node, handling the function call logic we did in the previous part for us! It takes an agent and a list of tools for that agent to use as arguments.

The # type: ignore comment is in case you use a type checker as it will complain here, and this series is not about type checking so we won’t go too deep into it as it’s no big deal here. We then return the agent_executor we created.

Agent state object

Now let’s declare the state object that we will be passing around in this particular graph:

class AgentState(TypedDict):
    messages: Annotated[Sequence[BaseMessage], operator.add]
    next: str

This time we need two entries. The first is the messages which is a sequence of BaseMessage objects which again are just messages like ("human", "Hello, how are you doing?") or ("ai", "I'm doing well, thanks!"). We define it as a Sequence, so like a list or a tuple of these messages, and the operator.add again indicates that we will add to this sequence of messages with each step. Annotated is just used as it allows us to add the annotation of operator.add.

The second entry is the next which is a string that will be the name of the next agent to call. This is the agent that the team_supervisor will decide to call next based on the state object it receives and then we can use this field to see which agent to route to next. This field can just be overwritten as we don’t need the history, so a single string without any fancy annotations will do fine here.
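The difference between the two entries can be sketched in plain Python. This is a conceptual illustration of the merge behavior described above, not LangGraph's actual implementation. The REDUCERS dictionary and the apply_update helper are hypothetical, and plain strings stand in for BaseMessage objects.

```python
import operator

# Keys with a reducer get combined; keys without one are simply overwritten.
REDUCERS = {"messages": operator.add}


def apply_update(state: dict, update: dict) -> dict:
    """Merge a node's return value into the current state."""
    merged = dict(state)
    for key, value in update.items():
        reducer = REDUCERS.get(key)
        if reducer is not None and key in merged:
            merged[key] = reducer(merged[key], value)  # list + list concatenates
        else:
            merged[key] = value
    return merged


state = {"messages": ["human: plan a trip"], "next": "travel_agent"}
update = {"messages": ["travel_agent: here is an itinerary"], "next": "visualizer"}
new_state = apply_update(state, update)
print(new_state["messages"])
print(new_state["next"])
```

The messages list grows with every step, while next only ever holds the most recent choice.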

Agent node function

Now let’s define a function that represents one of these agent nodes in general:

def agent_node(state, agent, name):
    result = agent.invoke(state)
    return {"messages": [HumanMessage(content=result["output"], name=name)]}

The function takes the state object, an agent, and the string name for the agent (the ones we defined up top as constants). We simply invoke the agent with the state. Then, in keeping with the promise we made in the AgentState object we defined, the node needs to return a messages entry with a message in it. We will simply use a HumanMessage, as it doesn’t really matter who the message comes from, and take its content from result["output"], which is the output of the agent’s call.
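To see the shape of what goes in and what comes out, here is a small sketch that runs without LangChain. StubAgent and FakeMessage are hypothetical stand-ins for an AgentExecutor and a HumanMessage. Binding the extra arguments with functools.partial is one plausible way to turn this three-argument function into a callable that only needs the state, and I am assuming that is the kind of use the functools import is for.

```python
import functools
from dataclasses import dataclass


@dataclass
class FakeMessage:
    """Hypothetical stand-in for langchain_core's HumanMessage."""

    content: str
    name: str


class StubAgent:
    """Hypothetical stand-in for an AgentExecutor; only mimics invoke()."""

    def invoke(self, state: dict) -> dict:
        last = state["messages"][-1]
        return {"output": f"Handled: {last}"}


def agent_node(state, agent, name):
    result = agent.invoke(state)
    return {"messages": [FakeMessage(content=result["output"], name=name)]}


# Pre-fill agent and name so the node only needs the state.
travel_agent_node = functools.partial(
    agent_node, agent=StubAgent(), name="travel_agent"
)

update = travel_agent_node({"messages": ["Plan a trip to Seoul"]})
print(update["messages"][0])
```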

Team supervisor’s next member choice

Next, we’re going to need a way to have the team_supervisor choose which agent to invoke next. The easiest way to do this reliably is to pretend this is a function that the team supervisor has to call for us. The only possible input arguments are the names of our agents and we tell the team_supervisor that it must call nonexistent_function(agent_name) to invoke the agent.

This is a bit of a hack, but it makes it very easy for us to extract the agent_name consistently and easily to see which agent node needs to run next. We will also include one extra option of “FINISH” so the team_supervisor can tell us when it’s done and needs to break out of the team. Doing this will also let us use the JsonOutputFunctionsParser later on in our code, as the function call will be sent in a correct JSON format, making the parsing of the output easier.

For this function that doesn’t actually exist, we’re going to define an old-school vanilla OpenAI function description that describes how the function works to the LLM team supervisor. Add the following variable:

router_function_def = {
    "name": "route",
    "description": "Select the next role.",
    "parameters": {
        "title": "routeSchema",
        "type": "object",
        "properties": {
            "next": {
                "title": "next",
                "anyOf": [
                    {"enum": OPTIONS},
                ],
            }
        },
        "required": ["next"],
    },
}

This is actually JSON Schema vocabulary, but it is quite readable. We define the name of the function as route and give it a description of what the function does. We then define the parameters that the function takes, giving the parameter object a title of routeSchema and defining that it is an object. Then we define the properties of this object, which is just a single property named next. This property has a title of next and the options available are anyOf the enum (a fixed list of allowed values) of OPTIONS we defined up top. We then define that the next property is required.

This JSON Schema style is what the OpenAI API normally uses for function/tool calls, but LangChain has done this under the hood for the functions we have used so far. Again, this function will not actually exist, but that doesn’t stop us from feeding it to the LLM and extracting the next property from the arguments the LLM provides for us.
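As a rough sketch of what extracting the next property boils down to: function-call arguments come back from the model as a JSON string, so we parse it and read one key. The extract_next helper and the example arguments string are made up for illustration. In the tutorial code a LangChain output parser handles this step.

```python
import json

OPTIONS = ["FINISH", "travel_agent", "language_assistant", "visualizer"]


def extract_next(arguments_json: str) -> str:
    """Parse function-call arguments and validate the 'next' choice."""
    arguments = json.loads(arguments_json)
    choice = arguments["next"]
    if choice not in OPTIONS:
        raise ValueError(f"Unexpected choice: {choice!r}")
    return choice


# Hypothetical arguments string, shaped like a call to the route function.
example_arguments = '{"next": "visualizer"}'
chosen = extract_next(example_arguments)
print(chosen)  # visualizer
```

Because the schema restricts next to the values in OPTIONS, the result maps directly onto a node name or the FINISH signal.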

Team supervisor system prompt

Now let’s create a secondary file to store our prompt system setup messages as we’re going to be using quite a lot of them here. Create a new file named multi_agent_prompts.py in your project root:

 FINX_LANGGRAPH
     images
     output
     tools
         __init__.py
         image.py
         pdf.py
         weather.py
     .env
     langchain_basics.py
     multi_agent.py
     multi_agent_prompts.py    New file
     Pipfile
     Pipfile.lock
     setup_environment.py
     simple_langgraph.py

We’ll use this file to store the prompt string variables for the system messages our agents will use. If you’re watching the video tutorial version of this please be advised that there is a written blog version of this tutorial where you can copy these prompts so you don’t have to type them all over again, as we have a lot more of them coming. Let’s start with the team supervisor. Inside the multi_agent_prompts.py file add:

TEAM_SUPERVISOR_SYSTEM_PROMPT = """
You are a supervisor tasked with managing a conversation between the following workers: {members}. Given the following user request, respond with the worker to act next. Each worker will perform a task and respond with their results and status. The end goal is to provide a good travel itinerary for the user, with things to see and do, practical tips on how to deal with language difficulties, and a nice visualization that goes with the travel plan (in the form of an image path, the visualizer will save the image for you and you only need the path).

Make sure you call on each team member ({members}) at least once. Do not call the visualizer again if you've already received an image file path. Do not call any team member a second time unless they didn't provide enough details or a valid response and you need them to redo their work. When finished, respond with FINISH, but before you do, make sure you have a travel itinerary, language tips for the location, and an image file-path. If you don't have all of these, call the appropriate team member to get the missing information.
"""

So we have some basic instructions for the team supervisor on how to manage the team here. We have the placeholder {members} in there twice which will be replaced with the actual list of members. We tell it we want a travel itinerary with things to do and sightseeing, language tips, and a visualization for the itinerary. The prompt here is far from perfect and you can tweak it further if you like.

Save the multi_agent_prompts.py file and let’s get back to the multi_agent.py file. First of all, add an extra import up top with the other imports:

#... all the other imports ...

from multi_agent_prompts import TEAM_SUPERVISOR_SYSTEM_PROMPT

Note that we could just use from multi_agent_prompts import * as the * will simply import everything from the file, even the variables we add later, but this is a bad practice as it makes it hard to see where the variables come from and leads to namespace pollution. It’s better to explicitly define and keep track of what you’re importing or sooner or later you’re going to have multiple variables with the same name and you won’t know where they come from.

Team supervisor prompt template

Now scroll all the way back down past the router_function_def and add the following code to define our team supervisor’s prompt template manually as it will be different from all the other agents:

team_supervisor_prompt_template = ChatPromptTemplate.from_messages(
    [
        ("system", TEAM_SUPERVISOR_SYSTEM_PROMPT),
        MessagesPlaceholder(variable_name="messages"),
        (
            "system",
            "Given the conversation above, who should act next?"
            " Or should we FINISH? Select one of: {options}",
        ),
    ]
).partial(options=", ".join(OPTIONS), members=", ".join(MEMBERS))

We use the same ChatPromptTemplate.from_messages method we used before, but this time we have three messages. The first is the TEAM_SUPERVISOR_SYSTEM_PROMPT we defined in the multi_agent_prompts.py file. The second is a MessagesPlaceholder for the messages variable and the third is a short system message that reminds the team supervisor what its task is and what options it has available to choose from.

This team supervisor prompt template will need 3 variables to be filled in and used properly.

The first is inside the TEAM_SUPERVISOR_SYSTEM_PROMPT where we used the members placeholder twice.
The second one is the messages for the MessagesPlaceholder in the middle.
The third is the options for the options placeholder in the last message.

We have two of these available, namely the options and the members, but we don’t have the messages yet. The chained .partial method lets us fill in the two parts that we have and leave the messages part to be added later, so we can go ahead and pass our OPTIONS to the options placeholder and the MEMBERS to the members placeholder ahead of time.

Note that we use the join method on the OPTIONS and MEMBERS lists to turn them into a single string with the members separated by a comma and a space as we cannot pass list variables to LLMs.
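As a rough plain-Python analogy for this partial filling (this is not how LangChain implements it, and the member names below are placeholder values rather than the constants from multi_agent.py), we can fix the values we know now and supply the messages later:

```python
# Placeholder values for illustration only; the real MEMBERS and OPTIONS live in multi_agent.py.
EXAMPLE_MEMBERS = ["travel_agent", "language_assistant", "visualizer"]
EXAMPLE_OPTIONS = ["FINISH"] + EXAMPLE_MEMBERS

template = "Workers: {members}\n{messages}\nSelect one of: {options}"

# The values we already know: lists are joined into comma-separated strings.
known_values = {
    "members": ", ".join(EXAMPLE_MEMBERS),
    "options": ", ".join(EXAMPLE_OPTIONS),
}


def fill_in_later(messages: str) -> str:
    """Supply the one missing variable at 'runtime'."""
    return template.format(messages=messages, **known_values)


prompt_text = fill_in_later("Human: I want to go to Paris for three days")
print(prompt_text)
```

The template only becomes a complete prompt once the last variable arrives, which is the same idea as leaving messages open in the real prompt template.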

Team supervisor node

So the team supervisor is basically going to act like a router between our agents, deciding who is up next. Remember in part 1 where we used LCEL with the | pipe operator to create chains by piping a prompt into an LLM and then into an output parser? These simple vanilla LangChain chains can also be used as nodes in LangGraph. As the team supervisor node is going to be special we will use our part 1 vanilla LangChain knowledge to simply chain it together manually:

team_supervisor_chain = (
    team_supervisor_prompt_template
    | LLM.bind_functions(functions=[router_function_def], function_call="route")
    | JsonOutputFunctionsParser()
)

So we simply define the team_supervisor_chain as the prompt template we just made for it, then we pipe that into the LLM, and pipe that into a JsonOutputFunctionsParser. As we’re using a function here we can use the JSON output parser to extract the next property from the arguments the LLM provides for us.

The LLM here uses the bind_functions method to bind the router_function_def JSON Schema we defined as the available functions for this LLM call, and by passing in the second optional argument function_call="route" we tell the LLM that it MUST call the route function we defined earlier, meaning we are actually forcing it to call this function and not do anything else as this is its only purpose. Remember we added an entry in the AgentState to store the next parameter.
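To make the parsing step a bit more concrete, here is a minimal sketch of what happens to a function call reply. The reply below is a made-up mock, not real model output; the only thing assumed is that OpenAI-style function calls carry their arguments as a JSON string, which the output parser decodes into a dictionary for us:

```python
import json

# Mock of the function-call part of a model reply (values invented for illustration).
mock_function_call = {"name": "route", "arguments": '{"next": "visualizer"}'}

# Roughly what the JSON output parser does: decode the arguments string into a dict.
parsed_arguments = json.loads(mock_function_call["arguments"])

print(parsed_arguments)          # {'next': 'visualizer'}
print(parsed_arguments["next"])  # visualizer
```

The resulting dictionary has a next key, which lines up with the next entry we added to the AgentState.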

The system prompts for our other agents

Ok, now we need to create the agents that will make up the rest of our graph. These are going to be a lot easier as we’ll be able to use the create_agent function we wrote earlier. But first, we need some system prompts, which are going to be unique for each agent. Let’s move back over to the multi_agent_prompts.py file and add the following below the existing TEAM_SUPERVISOR_SYSTEM_PROMPT, starting with the travel agent:

TRAVEL_AGENT_SYSTEM_PROMPT = """
You are a helpful assistant that can suggest and review travel itinerary plans, providing critical feedback on how the trip can be enriched for enjoyment of the local culture. If the plan already includes local experiences, you can mention that the plan is satisfactory, with rationale.

Assume a general interest in popular tourist destinations and local culture, do not ask the user any follow-up questions.

You have access to a web search function for additional or up-to-date research if needed. You are not required to use this if you already have sufficient information to answer the question.
"""

So we just have some basic instructions here, and notice how we say that if the plan already includes local experiences the agent can mention that the plan is satisfactory already, to make sure we’re not forcing it to do pointless work. The second paragraph is there to stop it from asking questions and expecting an answer from the user; it should just help us without asking anything.

Finally, we tell it that it has access to a web search function to do more research if it needs to, but it won’t use this much as the LLM already has most travel info built in. (We’ll use search functions more extensively in the last part.) I’ve taken some inspiration for these agents and prompts from the Autogen demo agents, but this is just a starting point, and these can be tweaked much further.

Now for the language assistant:

LANGUAGE_ASSISTANT_SYSTEM_PROMPT = """
You are a helpful assistant that can review travel plans, providing feedback on important/critical tips about how best to address language or communication challenges for the given destination. If the plan already includes language tips, you can mention that the plan is satisfactory, with rationale.

You have access to a web search function for additional or up-to-date research if needed. You are not required to use this if you already have sufficient information to answer the question.
"""

This is basically the same but with a focus on language tips instead of travel itinerary plans. Let’s move on to the visualizer:

VISUALIZER_SYSTEM_PROMPT = """
You are a helpful assistant that can generate images based on a detailed description. You are part of a travel agent team and your job is to look at the location and travel itinerary and then generate an appropriate image to go with the travel plan. You have access to a function that will generate the image as long as you provide a good description including the location and visual characteristics of the image you want to generate. This function will download the image and return the path of the image file to you.

Make sure you provide the image, and then communicate back as your response only the path to the image file you generated. You do not need to give any other textual feedback, just the path to the image file.
"""

This one is a bit different as it’s going to generate an image for us. We tell it that it should only provide the path to the image file and not any other feedback. This is of course because the image generation tool that we wrote ourselves will save the image to disk and return the path to the image file, so we don’t need any other feedback from the agent other than the path which means that the image generation was successful.

Now we have one last agent’s system prompt to define, the designer, which is going to exist outside of our team of three agents above. We will also need the path to the images folder in our project to insert into this prompt. First scroll all the way back up to the top of the multi_agent_prompts.py file, and add the following import:

from tools.image import IMAGE_DIRECTORY

Now scroll all the way back down again and add the designer’s system prompt, this time using a multi-line f string:

DESIGNER_SYSTEM_PROMPT = f"""
You are a helpful assistant that will receive a travel itinerary in parts. Some parts will be about the travel itinerary and some will be the language tips, and you will also be given the file path to an image. Your job is to call the markdown_to_pdf_file function you have been given, with the following argument:

markdown_text: A summary of the travel itinerary and language tips, with the image inserted, all in valid markdown format and without any duplicate information.

Make sure to use the following structure when inserting the image:
![Alt text]({str(IMAGE_DIRECTORY)}/image_name_here.png) using the correct file path. Make sure you don't add any stuff like 'file://'.

Start with the image and itinerary first and the language tips after, creating a neat and organized final travel itinerary with the appropriate markdown headings, bold words and other formatting.
"""

We explain that its function is to call the markdown_to_pdf_file function we wrote, passing in a full markdown summary with the image inserted as well. We give it specific instructions on how to format the image link in the markdown so it will work with our converter, and finally give it some last instructions on the structure we want.
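As a small aside on the f string: it is evaluated once, at the moment the constant is defined, so the directory path ends up baked into the prompt text. A minimal sketch, using a hypothetical stand-in path instead of the real IMAGE_DIRECTORY constant:

```python
from pathlib import Path

# Hypothetical stand-in for the IMAGE_DIRECTORY constant imported from tools.image.
EXAMPLE_IMAGE_DIRECTORY = Path("images")

# f string: the expression in braces is replaced immediately.
example_image_line = f"![Alt text]({str(EXAMPLE_IMAGE_DIRECTORY)}/image_name_here.png)"
print(example_image_line)  # ![Alt text](images/image_name_here.png)

# Regular string: the placeholder survives until something formats it later.
plain_template = "Workers: {members}"
print(plain_template)  # Workers: {members}
```

This is the difference between the designer prompt, which is complete as soon as the file is imported, and the team supervisor prompt, whose {members} placeholder is filled in later by the prompt template.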

Inside your multi_agent_prompts.py file you now have the following constants:

TEAM_SUPERVISOR_SYSTEM_PROMPT = ...
TRAVEL_AGENT_SYSTEM_PROMPT = ...
LANGUAGE_ASSISTANT_SYSTEM_PROMPT = ...
VISUALIZER_SYSTEM_PROMPT = ...
DESIGNER_SYSTEM_PROMPT = ...

Creating our agents and nodes

Go ahead and save and close the multi_agent_prompts.py file and let’s get back to the multi_agent.py file. First let’s update our import up top with the other imports, changing it like this:

#... all the other imports ...

from multi_agent_prompts import (
    TEAM_SUPERVISOR_SYSTEM_PROMPT,
    TRAVEL_AGENT_SYSTEM_PROMPT,
    LANGUAGE_ASSISTANT_SYSTEM_PROMPT,
    VISUALIZER_SYSTEM_PROMPT,
    DESIGNER_SYSTEM_PROMPT,
)

Then go ahead and scroll all the way back down to the bottom of the file and let’s start creating some agents and nodes! First up is the travel agent:

travel_agent = create_agent(LLM, [TAVILY_TOOL], TRAVEL_AGENT_SYSTEM_PROMPT)
travel_agent_node = functools.partial(
    agent_node, agent=travel_agent, name=TRAVEL_AGENT_NAME
)

First we create the travel_agent by calling our create_agent function and passing in the LLM, a list with the TAVILY_TOOL in it as our list of tools, as we promised it an internet tool if it needed one, and the TRAVEL_AGENT_SYSTEM_PROMPT. We now have our travel agent / executor.

To get the travel agent’s node we need to use the agent_node function we defined before, which needs three arguments, the agent, the state and the name of the agent in string format. We have the agent and the name already, but the state will only be available at runtime. To solve this problem we can use the functools.partial function to create a new function that has the agent and name already filled in, and then we can pass in the state at runtime.

If you’re unfamiliar with functools.partial, it basically works like this:

########### Example, not part of the code ############
import functools

# Original function
def multiply(x, y):
    return x * y

# Create a new function that multiplies by 2 (y is already filled in)
multiply_by_two = functools.partial(multiply, y=2)

# Only x still needs to be passed in
result = multiply_by_two(3)
print(result)  # Output: 6

So it takes a function and creates a new function based on the original with a portion of the arguments already filled in, reducing the number of arguments the function takes in its new form. This is very useful as we now have our complete travel_agent_node that needs only the state object to be passed in for it to work.
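The same idea applied to a node can be sketched with a stand-in function. Note that fake_agent_node below is a hypothetical simplification (the real agent_node invokes an agent executor), and it assumes state is the first parameter so that it can be passed positionally:

```python
import functools


# Hypothetical stand-in for agent_node; the real one calls the agent executor.
def fake_agent_node(state, agent, name):
    last_message = state["messages"][-1]
    return {"messages": [f"{name} handled '{last_message}' using {agent}"]}


# Fill in agent and name ahead of time, just like we did for travel_agent_node.
fake_node = functools.partial(fake_agent_node, agent="stub-agent", name="travel_agent")

# At runtime only the state is supplied.
node_result = fake_node({"messages": ["I want to go to Paris"]})
print(node_result)
```

Because agent and name are filled in by keyword, the single remaining positional argument is the state, which is the only thing a graph node receives when it runs.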

Now in exactly the same manner we can create our language_assistant, visualizer, and designer agents and nodes:

language_assistant = create_agent(LLM, [TAVILY_TOOL], LANGUAGE_ASSISTANT_SYSTEM_PROMPT)
language_assistant_node = functools.partial(
    agent_node, agent=language_assistant, name=LANGUAGE_ASSISTANT_NAME
)

visualizer = create_agent(LLM, [generate_image], VISUALIZER_SYSTEM_PROMPT)
visualizer_node = functools.partial(agent_node, agent=visualizer, name=VISUALIZER_NAME)

designer = create_agent(LLM, [markdown_to_pdf_file], DESIGNER_SYSTEM_PROMPT)
designer_node = functools.partial(agent_node, agent=designer, name=DESIGNER_NAME)

The language assistant takes the TAVILY_TOOL, while our visualizer needs the generate_image and the designer the markdown_to_pdf_file tool. We then create the nodes for each of these agents in the same way we did for the travel agent above, passing in their respective names using the ...NAME constants we defined up top.

Creating the graph

Time to create our graph and the nodes:

workflow = StateGraph(AgentState)
workflow.add_node(TRAVEL_AGENT_NAME, travel_agent_node)
workflow.add_node(LANGUAGE_ASSISTANT_NAME, language_assistant_node)
workflow.add_node(VISUALIZER_NAME, visualizer_node)
workflow.add_node(DESIGNER_NAME, designer_node)
workflow.add_node(TEAM_SUPERVISOR_NAME, team_supervisor_chain)

We initialize the StateGraph passing in our AgentState format we defined. Then we simply create a node for each agent passing in the name first, and the actual node second. Note that we’ve used these ...NAME variables several times now, which is why we defined them up top as constants to give them only a single point of definition instead of repeating strings all over the place.

Now that we have the nodes let’s start building some connections:

for member in MEMBERS:
    workflow.add_edge(member, TEAM_SUPERVISOR_NAME)

workflow.add_edge(DESIGNER_NAME, END)

For every member in the list of team MEMBERS we add an edge back to the team supervisor, as it will decide where to go next between each step. We also add an edge from the designer to the END node as the designer is the last step in our graph and will exist outside of the team.

So far these are all hard edges with no conditions. Now it is time for us to add some conditional edges:

conditional_map = {name: name for name in MEMBERS}
conditional_map["FINISH"] = DESIGNER_NAME
workflow.add_conditional_edges(
    TEAM_SUPERVISOR_NAME, lambda x: x["next"], conditional_map
)

We create a conditional_map dictionary that maps each member to itself, and then we add a key "FINISH" that maps to the DESIGNER_NAME. So if the team supervisor calls on the "visualizer" this will simply map like {"visualizer": "visualizer"} but the one exception is the {"FINISH": "designer"} mapping.

We then call the add_conditional_edges method on the workflow object. This method takes the start point, so we pass in the TEAM_SUPERVISOR_NAME, a function that will return a value, and then a mapping that will map that value to the next desired node.

The function is a lambda that takes the state object as input and simply returns the state’s next key that the team supervisor has put in there. The conditional_map is the mapping we defined above, so if the team supervisor calls on a team member it will map to that team member’s node, but if it calls "FINISH" it will map to the "designer" node.
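Stripped of the graph machinery, the routing decision boils down to a dictionary lookup. A minimal sketch in plain Python, using placeholder names instead of the real constants:

```python
# Placeholder names for illustration; the real values come from the ...NAME constants.
EXAMPLE_MEMBERS = ["travel_agent", "language_assistant", "visualizer"]
EXAMPLE_DESIGNER_NAME = "designer"

conditional_map = {name: name for name in EXAMPLE_MEMBERS}
conditional_map["FINISH"] = EXAMPLE_DESIGNER_NAME


def pick_next_node(state: dict) -> str:
    """Mimic the conditional edge: read state['next'] and map it to a node name."""
    return conditional_map[state["next"]]


print(pick_next_node({"next": "visualizer"}))  # visualizer
print(pick_next_node({"next": "FINISH"}))      # designer
```

Every team member maps to its own node, and only FINISH is redirected, which is how the designer ends up outside the team.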

Now set the entry point and compile the graph:

workflow.set_entry_point(TEAM_SUPERVISOR_NAME)

travel_agent_graph = workflow.compile()

Our completed graph now looks like this:

In the graph diagram, the white lines represent the fixed edges and the dotted lines represent the conditional ones. Now let’s actually give this a test run and see what happens:

for chunk in travel_agent_graph.stream(
    {"messages": [HumanMessage(content="I want to go to Paris for three days")]}
):
    if "__end__" not in chunk:
        print(chunk)
        print(f"{Fore.GREEN}#############################{Style.RESET_ALL}")

So we’re going to call stream on the travel_agent_graph and pass in a dictionary with the messages key and a list with a single HumanMessage object in it, saying that we want to visit Paris for three days. We then loop over the chunks and print them out, and then print a line of #s in green to visually separate the chunks.

Now go ahead and run this and let’s see what happens! You may see some printer message popup, again just click X on it if it pops up for now. When it’s done running have a look in your output folder for the final result:

That is pretty darn cool right! Our whole team of AI agents is working together to do our bidding without any work on our part! Everything worked exactly as expected with the routing and everything, which you can confirm in your LangSmith dashboard (https://smith.langchain.com/) as well by checking out the trace for the run:

We can see that after each step the system returns to the team supervisor and at the end it breaks out of the team towards the designer. I’ve done a bunch more test runs to verify that it works well and here are some example runs for other destinations:

Remember that I’ve been using the gpt-3.5-turbo-0125 model all this time. You can easily swap out any of the models for gpt-4-turbo if you want more detail, or if you have some trouble with a specific node. Say the designer has trouble working consistently, you could just swap out only that node for a different model with a higher quality and leave the rest as is.

You can literally create just about any combination of agents, nodes, edges, and conditional edges you want. The combination possibilities are mind-boggling. We decided to have one agent outside of the team here, no problem! We can also have 2 teams or even more if we want, each with their own manager. Your imagination is the limit here.

That’s it for part 5! In the next and last part, we’ll take a look at writing and integrating asynchronous tools into our systems. I’ll see you there!

Multi-Agent LangGraph – Web Research and Asynchronous Tools

Hello and welcome back to the last part of the LangGraph/LangChain/LangSmith course. In this part, we’ll learn how to deal with asynchronous tools by building a graph that will do some web research for us, where one of the tools is going to be visiting several websites at once to feed info back into the graph.

This type of asynchronous action is very helpful when there are multiple steps or actions that can be performed at the same time for optimization as it will save a lot of time and make the user experience much better. It is a bit different to set up and work with though, which is why we’ll be going through it in this part.

I will try to cover the bare basics of async Python programming here, as it can look quite confusing and I want all skill levels to be able to follow along. If you are already very familiar with async programming the level of explanation may be a bit excessive for you and you can probably skip over some of the explanations and just look at the code.

Web research tool

Let’s start by building our tool as usual. This tool is going to visit a bunch of web URLs at the same time (asynchronously) and return the HTML content of each page. We will need to install the BeautifulSoup library to parse the HTML content of the pages. Run the following command in the terminal:

pipenv install beautifulsoup4==4.12.3

Then go ahead and create a new file called web.py in the tools directory:

 FINX_LANGGRAPH
     images
     output
     tools
         __init__.py
         image.py
         pdf.py
         weather.py
         web.py    New file
     .env
     langchain_basics.py
     multi_agent.py
     multi_agent_prompts.py
     Pipfile
     Pipfile.lock
     setup_environment.py
     simple_langgraph.py

In the web.py file let’s start with our imports as usual:

import asyncio
import json
import sys

import aiohttp
from bs4 import BeautifulSoup
from langchain.tools import tool
from pydantic import BaseModel, Field

We import asyncio to work with asynchronous code, aiohttp to make HTTP requests asynchronously, and BeautifulSoup to parse the HTML content of the pages. The tool decorator and pydantic imports are the same as for the other tools and json is to return the JSON responses in string format.

Async and event loops

First of all, we’ll use the sys import to set the type of event loop to use for the asynchronous code:

if sys.platform.startswith("win"):
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

Without going into too much detail, there is a known issue with the Python asyncio library on Windows specifically that happens when the Proactor event loop (the default on Windows) is closed while there are still outstanding tasks. It doesn’t affect the correct execution of the code, but something in the combination of Windows, asyncio, and LangChain/LangGraph triggers it. We’ll use the selector event loop policy to avoid this issue (this is only needed, and only triggers, if you’re on Windows).

While this tutorial part is way too short to really go in-depth on Python’s asynchronous programming, we’ll try to cover the basics as we go along. Basically, we get an event loop, and we can put tasks in there. Normally a task like fetching a webpage would block the code until it’s done, but with asyncio we can put it in the event loop and continue with other tasks while it’s being fetched. This allows us to run multiple operations at the same time.

This is not to be confused with multi-threading or multi-processing, which are quite different in nature:

Multi-processing: is about spreading tasks over a computer’s CPU cores, and is well suited for tasks that require lots of mathematical computations.
Multi-threading: is about running multiple threads in the same process, and is well suited for tasks that are I/O bound (like fetching webpages).
Asynchronous programming: is a single-process, single-threaded design that uses coroutines to handle multiple tasks concurrently. Async functions are able to sort of pause and resume their execution, allowing other tasks to run in the meantime during this pause.

Async programming in Python is very similar to the JavaScript async/await pattern, and it’s a great way to handle I/O-bound tasks like fetching web pages. If you’re a bit new to this all, just keep going and you’ll get a feel for how it works.
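To get a feel for this pause and resume behavior, here is a minimal sketch that uses asyncio.sleep as a stand-in for waiting on a web page. The worker names and delays are made up for illustration:

```python
import asyncio

events = []


async def worker(name: str, delay: float) -> None:
    events.append(f"{name} started")
    # The coroutine pauses here; the event loop is free to run other tasks meanwhile.
    await asyncio.sleep(delay)
    events.append(f"{name} finished")


async def demo() -> None:
    # Run both workers concurrently on a single thread.
    await asyncio.gather(worker("slow", 0.3), worker("fast", 0.1))


asyncio.run(demo())
print(events)
# ['slow started', 'fast started', 'fast finished', 'slow finished']
```

Both workers start before either finishes, and the fast one completes first even though it was started second, all without any extra threads or processes.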

Parsing HTML content

First, we’ll write a very basic function that uses BeautifulSoup to parse some HTML content:

def parse_html(html_content: str) -> str:
    soup = BeautifulSoup(html_content, "html.parser")
    for tag in ["nav", "footer", "aside", "script", "style", "img", "header"]:
        for match in soup.find_all(tag):
            match.decompose()

    text_content = soup.get_text()
    text_content = " ".join(text_content.split())
    return text_content[:8_000]

This function takes the HTML content of a webpage as a string and returns a string with the text content of the page. First we instantiate a new BeautifulSoup object passing in the html.parser string to select the parser. We then make a list of all the HTML tags we want to filter out, namely the navigation, footer, aside, script, style, image and header tags. We’re interested in the main content and don’t want all this pollution.

For each tag in this list of HTML tags, we run soup.find_all(tag) to find all the tags with that name in the HTML content, which returns all the matches for that tag. This allows us to loop over each match in soup.find_all(tag) and call match.decompose() to remove the tag from the HTML content.

We then get the text content of the page with soup.get_text() to remove as much HTML and unneeded stuff as possible from what was left.

Then we call text_content.split() to split the text content into a list of words, which has the side effect of removing long sequences of whitespace, tab, and newline characters. We then join the list of words back into a string with " ".join so that we’re left with only a single space between all words to save space. The LLM does not care about formatting and sending tons of whitespace to it is just a waste of space.

Finally, we return the first 8,000 characters of the text content, to make sure we don’t exceed the context limit if we load like 5 or 6 pages at once. You can set this higher if you use GPT-4-turbo instead of 3.5-turbo.
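The whitespace trick and the truncation are easy to try out on their own. A quick sketch with a made-up input string and a much smaller cut-off than the 8,000 characters used above:

```python
# Made-up messy text, similar to what get_text() can return for a real page.
raw_text = "  Hello\n\n\tworld,   this   is\n a   test.  "

# split() with no arguments splits on any run of whitespace and drops empty strings.
collapsed = " ".join(raw_text.split())
print(collapsed)       # Hello world, this is a test.

# Slicing keeps only the first N characters (11 here instead of 8_000).
print(collapsed[:11])  # Hello world
```

Slicing past the end of a shorter string is safe in Python, so short pages are simply returned in full.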

Fetching webpages

Notice that the parse_html function is just a regular synchronous function. Now let’s get started on the asynchronous part. The first thing we’ll do is write a function to fetch the HTML content of a single webpage, and then we can just call this function multiple times to fetch the content of multiple pages at once.

async def get_webpage_content(url: str) -> str:
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            html_content = await response.text()

    text_content = parse_html(html_content)
    print(f"URL: {url} - fetched successfully.")
    return text_content

First, we declare our async function using async def instead of the normal def. This will allow us to later call await on this function to make the code non-blocking and run other tasks while we wait for the response. We take a URL string and return a string.

Where we would normally fetch a webpage with the requests library, here we need to use aiohttp which is an asynchronous HTTP client/server library for Python that allows us to write this non-blocking code. The ClientSession object represents a single web session, so you could set headers or cookies here that apply to all requests in this session.

The whole thing can be used as a context manager giving us the async with aiohttp.ClientSession() as session syntax and any indented code afterward now takes place inside this context. Then we call get(url) on the session object and use that as a context manager in the same exact manner as the line above it.

The line after calls await on the response.text() and will then save this in the html_content variable. This await keyword is the magic, and whenever you see this keyword it sort of pauses this code, as time is needed to fetch the webpage. While this is happening, other tasks can run in the event loop.

When the html_content has finished fetching, we move outside of the two async context managers and call our parse_html function to get the text content of the page. We then print a message to the console that the URL was fetched successfully and return the text content.

Note that we could easily edit the above function to fetch the whole list of URLs we have inside the same ClientSession context manager, but as the overhead to calling this function a couple of times is minimal, I’ll just keep it as is for now.

Another fair point to make is that the parse_html function is technically blocking non-async code, but as it doesn’t take long to run at all, it’s fine to keep it here. The main time-waster is the fetching of the webpages and we made that asynchronous.

Input arguments and the tool

Before we get to the actual tool itself we need to make sure to define our pydantic object with the input arguments for the tool:

class ResearchInput(BaseModel):
    research_urls: list[str] = Field(description="Must be a list of valid URLs.")

No surprises here, we just want a list of URLs in string format. We’ve used this type of object several times before.

Now let’s write our tool, starting with the first half:

@tool("research", args_schema=ResearchInput)
async def research(research_urls: list[str]) -> str:
    """Get content of provided URLs for research purposes."""
    tasks = [asyncio.create_task(get_webpage_content(url)) for url in research_urls]

We use the @tool decorator to define our tool, passing in the name and the argument schema as always. We declare the function making sure to use async def, and we declare the same research_urls argument as we defined in the ResearchInput class. Again mind the docstring description for the LLM to use.

Then we use a list comprehension, so let’s read it from right to left. For each url in the list of research_urls, we call asyncio.create_task(get_webpage_content(url)) to create a task for each URL. The asyncio.create_task() function schedules the coroutine to run on the event loop and returns a Task object. However, it doesn’t automatically await the task.

What this means is that it will create our async task and also start it for us, but it won’t await it, or wait for it to finish, which would block the code. We are left with a list full of these task objects of tasks that are currently running but not yet finished.
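A small sketch of this behavior, using a stub coroutine with asyncio.sleep instead of a real web request (fetch_stub and its labels are made up for illustration):

```python
import asyncio


async def fetch_stub(label: str) -> str:
    await asyncio.sleep(0.05)  # stand-in for network wait time
    return f"content of {label}"


async def demo():
    # Tasks are scheduled right away, but none of them has finished yet.
    tasks = [asyncio.create_task(fetch_stub(label)) for label in ["a", "b"]]
    done_before = [task.done() for task in tasks]

    # Awaiting each task waits for it to finish and gives us its result.
    results = [await task for task in tasks]
    done_after = [task.done() for task in tasks]
    return done_before, results, done_after


done_before, results, done_after = asyncio.run(demo())
print(done_before)  # [False, False]
print(results)      # ['content of a', 'content of b']
print(done_after)   # [True, True]
```

The tasks only report as done after we have awaited them, which is why the tool still needs a step that waits for everything to complete.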

Let’s finish our tool:

@tool("research", args_schema=ResearchInput)
async def research(research_urls: list[str]) -> str:
    """Get content of provided URLs for research purposes."""
    tasks = [asyncio.create_task(get_webpage_content(url)) for url in research_urls]
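    contents = await asyncio.gather(*tasks, return_exceptions=True)
    # Exception objects are not JSON serializable, so report failed fetches as strings
    contents = [
        repr(content) if isinstance(content, BaseException) else content
        for content in contents
    ]
    return json.dumps(contents)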

The asyncio.gather() function is used to schedule multiple tasks to run and waits for all of them to complete. It will wait for all our tasks from the previous line to fetch their web pages and then gather the results. This is why we await this function, and then save the results in contents. *tasks is a way to unpack the list of tasks into separate arguments passing them into the function.

The return_exceptions parameter in asyncio.gather() determines how exceptions are handled. If return_exceptions is set to False, gather() will immediately raise the first exception it encounters. When set to True, instead of raising exceptions, it will return them in the result list so that contents will be a list of results or exceptions. We use this as we want to go ahead and fetch the rest of the pages even if one fails.

Finally, dump the response to a JSON string and return it, as naturally, LLMs need string input.
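Here is a minimal sketch of how return_exceptions=True behaves when one fetch fails, using a stub coroutine and example.com URLs invented for illustration. One thing to keep in mind is that exception objects cannot be passed to json.dumps directly, so the sketch turns them into strings first:

```python
import asyncio
import json


async def fetch_stub(url: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for network wait time
    if "broken" in url:
        raise ValueError(f"could not fetch {url}")
    return f"content of {url}"


async def demo():
    urls = ["https://example.com/ok", "https://example.com/broken"]
    # The failing coroutine does not stop the others; its exception is returned in place.
    return await asyncio.gather(*(fetch_stub(url) for url in urls), return_exceptions=True)


contents = asyncio.run(demo())

# Convert exceptions to strings so the whole list can be serialized.
serializable = [
    repr(item) if isinstance(item, BaseException) else item for item in contents
]
as_json = json.dumps(serializable)
print(as_json)
```

The successful result is kept as is, and the failure shows up as a readable error string that the LLM can take into account.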

Testing the tool

Now let’s add a quick test to this file to test our tool in isolation and make sure there are no problems:

if __name__ == "__main__":
    import time

    TEST_URLS = [
        "https://en.wikipedia.org/wiki/SpongeBob_SquarePants",
        "https://en.wikipedia.org/wiki/Stephen_Hillenburg",
        "https://en.wikipedia.org/wiki/The_SpongeBob_Movie:_Sponge_Out_of_Water",
    ]

    async def main():
        result = await research.ainvoke({"research_urls": TEST_URLS})

        with open("test.json", "w") as f:
            json.dump(result, f)

    start_time = time.time()
    asyncio.run(main())
    end_time = time.time()
    print(f"Async time: {end_time - start_time} seconds")

We’ve covered the if __name__ == "__main__": block before, so only if we run this file directly will the code inside this block run. We define a list of test URLs to use and then define an async function called main() to run our tool with these test URLs. Instead of invoking the tool as we normally do we now use ainvoke for the async version, and we have to await the result. This is why the main function is async as well.

We then open a file called test.json in write mode and dump the result to it so we can have a quick look to check if the output is as expected. Finally we run the main() function with asyncio.run(main()).

asyncio.run is a useful function that creates a new event loop, runs the given coroutine which is main in our case, closes the loop, and then returns the result. This makes it a convenient way to run async code from a synchronous context as it handles the whole event loop thing for us.

I’ve also sneaked a start and end timer in there using time.time() to see how long it takes to run the async code.

Now go ahead and run the web.py file and you’ll see something like this:

URL: https://en.wikipedia.org/wiki/The_SpongeBob_Movie:_Sponge_Out_of_Water - fetched successfully.
URL: https://en.wikipedia.org/wiki/Stephen_Hillenburg - fetched successfully.
URL: https://en.wikipedia.org/wiki/SpongeBob_SquarePants - fetched successfully.
Async time: 2.9387967586517334 seconds

I have also tried the synchronous normal version of this code using the requests library, and it took over 7 seconds, so we have a considerable time save here, and this is with only 3 URLs. If I increase the number of URLs to just 6, the async version takes about 4 seconds, while the synchronous version takes like 14.
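If you want to see this kind of difference without hitting any real websites, the following sketch simulates it with asyncio.sleep. The delays are made up, so the numbers will not match the measurements above; it only shows that waiting one after the other adds up while concurrent waiting overlaps:

```python
import asyncio
import time


async def fake_fetch(delay: float) -> float:
    await asyncio.sleep(delay)  # simulated network wait, not a real request
    return delay


async def fetch_sequentially(delays):
    # Each fetch only starts after the previous one has finished.
    return [await fake_fetch(delay) for delay in delays]


async def fetch_concurrently(delays):
    # All fetches wait at the same time.
    return await asyncio.gather(*(fake_fetch(delay) for delay in delays))


def timed(coroutine) -> float:
    start = time.perf_counter()
    asyncio.run(coroutine)
    return time.perf_counter() - start


DELAYS = [0.2, 0.2, 0.2]
sequential_time = timed(fetch_sequentially(DELAYS))
concurrent_time = timed(fetch_concurrently(DELAYS))
print(f"Sequential: {sequential_time:.2f}s, concurrent: {concurrent_time:.2f}s")
```

The sequential version takes roughly the sum of the delays, while the concurrent version takes roughly as long as the single slowest one.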

If you open the test.json file that has been created you should see something like the following that goes on for quite a while:

"[\"SpongeBob SquarePants - Wikipedia Jump to content From Wikipedia, the free encyclopedia American animated television series This article is about the television series.........

Web research graph setup

We have just written our first async tool! Now let’s put it to good use and write up a quick web research graph. In your root folder create two new files called web_research.py and web_research_prompts.py:

 FINX_LANGGRAPH
     images
     output
     tools
         __init__.py
         image.py
         pdf.py
         weather.py
         web.py
     .env
     langchain_basics.py
     multi_agent.py
     multi_agent_prompts.py
     Pipfile
     Pipfile.lock
     setup_environment.py
     simple_langgraph.py
     web_research.py          New file
     web_research_prompts.py    New file

The graph here will be reasonably simple, having two agents. One of them will use Tavily to do a basic search query, and the other one will use our async tool to do more in-depth research on the URLs provided by the first Tavily agent. You know the drill by now, so we’ll just define our system prompts for the agents before we get started on the main file. If you’re watching the video version of this tutorial make sure you open up the written version so you can more easily copy these. Start by opening up the web_research_prompts.py file.

We’ll get started with the Tavily agent’s system prompt first:

TAVILY_AGENT_SYSTEM_PROMPT = """
You are a search agent. Your task is simple. Use your tool to find results on the internet for the user query, and return the response, making sure to include all the sources with page title and URL at the bottom like this example:

1. [Title 1](https://www.url1.com/whatever): ...
2. [Title 2](https://www.url2.com/whatever): ...
3. [Title 3](https://www.url3.com/whatever): ...
4. [Title 4](https://www.url4.com/whatever): ...
5. [Title 5](https://www.url5.com/whatever): ...

Make sure you only return the URLs that are relevant for doing additional research. For instance:
User query Spongebob results from calling your tool:

1. [The SpongeBob Official Channel on YouTube](https://www.youtube.com/channel/UCx27Pkk8plpiosF14qXq-VA): ...
2. [Wikipedia - SpongeBob SquarePants](https://en.wikipedia.org/wiki/SpongeBob_SquarePants): ...
3. [Nickelodeon - SpongeBob SquarePants](https://www.nick.com/shows/spongebob-squarepants): ...
4. [Wikipedia - Excavators](https://en.wikipedia.org/wiki/Excavator): ...
5. [IMDB - SpongeBob SquarePants TV Series](https://www.imdb.com/title/tt0206512/): ...


Given the results above and an example topic of Spongebob, the Youtube channel is going to be relatively useless for written research, so you should skip it from your list. The Wikipedia article on Excavators is not related to the topic, which is Spongebob for this example, so it should be omitted. The others are relevant so you should include them in your response like this:
1. [Wikipedia - SpongeBob SquarePants](https://en.wikipedia.org/wiki/SpongeBob_SquarePants): ...
2. [Nickelodeon - SpongeBob SquarePants](https://www.nick.com/shows/spongebob-squarepants): ...
3. [IMDB - SpongeBob SquarePants TV Series](https://www.imdb.com/title/tt0206512/): ...
"""

This is a bit of a long prompt, but it’s quite simple. The Tavily agent is tasked with finding relevant URLs for a given query, and then returning the URLs that are relevant for further research. The prompt gives an example of what the response should look like and also gives an example of what URLs are relevant and what URLs are not.

Now let’s define the system prompt for the web research agent:

RESEARCHER_SYSTEM_PROMPT = """
You are an internet research information-providing agent. You will receive results for a search query. The results will look something like this:

1. [Wikipedia - SpongeBob SquarePants](https://en.wikipedia.org/wiki/SpongeBob_SquarePants): ...
2. [Nickelodeon - SpongeBob SquarePants](https://www.nick.com/shows/spongebob-squarepants): ...
3. [IMDB - SpongeBob SquarePants TV Series](https://www.imdb.com/title/tt0206512/): ...

Your job is to use your research tool to find more information on the topic and to write an article about the information you find in markdown format. You will call the research tool with a list of URLs, so for the above example your tool input will be:

["https://en.wikipedia.org/wiki/SpongeBob_SquarePants", "https://www.nick.com/shows/spongebob-squarepants", "https://www.imdb.com/title/tt0206512/"]

After you have finished your research you will write a long-form article on all the information you found and return it to the user, making sure not to leave out any relevant details. Make sure you include as much detail as possible and that the article you write is on the topic (for instance Pokemon) instead of being about the websites that you visited (e.g. Wikipedia, YouTube). Use markdown formatting and supply ONLY the resulting article in your response, with no extra chatter except for the fully formed, well-written, and formatted article. Use headers, sub-headers, bolding, bullet lists, and other markdown formatting to make the article easy to read and understand. Your only output will be the fully formed and detailed markdown article.
"""

The agent is tasked with using the web research tool to find more information on a topic and then writing an article about the information found. The prompt gives an example of what the input to the tool should look like and then specific instructions on using markdown formatting to write the output article and details on the article we want it to write. Save and close the web_research_prompts.py file.
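In our graph the LLM itself turns the numbered source list into the list of URLs for the tool call. Purely to make that expected input shape concrete, here is a small sketch that does the same conversion with a regular expression. The extract_urls helper is hypothetical and not part of the tutorial code:

```python
import re

# Matches markdown links like [Title](https://example.com/page).
# Simplification: URLs that themselves contain ")" would be cut short.
MARKDOWN_LINK = re.compile(r"\[[^\]]*\]\((https?://[^)\s]+)\)")


def extract_urls(search_results: str) -> list[str]:
    """Return the link targets in the order they appear."""
    return MARKDOWN_LINK.findall(search_results)


search_results = """
1. [Wikipedia - SpongeBob SquarePants](https://en.wikipedia.org/wiki/SpongeBob_SquarePants): ...
2. [Nickelodeon - SpongeBob SquarePants](https://www.nick.com/shows/spongebob-squarepants): ...
3. [IMDB - SpongeBob SquarePants TV Series](https://www.imdb.com/title/tt0206512/): ...
"""

tool_input = extract_urls(search_results)
print(tool_input)
```

The resulting list has the same shape as the example tool input shown in the prompt.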

Web research graph main file

Now let’s move on to the main file web_research.py and start by importing the necessary modules:

import asyncio
import functools
import operator
import uuid
from typing import Annotated, Sequence, TypedDict

from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_core.messages import BaseMessage, HumanMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI
from langgraph.graph import END, StateGraph

from setup_environment import set_environment_variables
from tools.pdf import OUTPUT_DIRECTORY
from tools.web import research
from web_research_prompts import RESEARCHER_SYSTEM_PROMPT, TAVILY_AGENT_SYSTEM_PROMPT

You’ve seen pretty much all of these imports before in some part of our code so far. We import the prompts we just created and the web research function as well as the OUTPUT_DIRECTORY we defined in the pdf.py file so that we can access this folder to save our output. To do this properly it would be best to store these project-wide constants like the paths in a separate file but for now, we’ll just import it from pdf.py.

Now continue below the imports:

set_environment_variables("Web_Search_Graph")

TAVILY_TOOL = TavilySearchResults(max_results=6)
LLM = ChatOpenAI(model="gpt-3.5-turbo-0125")

TAVILY_AGENT_NAME = "tavily_agent"
RESEARCH_AGENT_NAME = "search_evaluator_agent"
SAVE_FILE_NODE_NAME = "save_file"

We load up our variables and use the project name Web_Search_Graph for our LangSmith traces. We create a new instance of the Tavily search tool we imported setting the max_results to 6, and we create a ChatOpenAI object as usual. After that we set up some string constants for the names of our agents and nodes again.

We’ll have the create_agent function which is basically the same as last time:

def create_agent(llm: ChatOpenAI, tools: list, system_prompt: str):
    prompt = ChatPromptTemplate.from_messages(
        [
            ("system", system_prompt),
            MessagesPlaceholder(variable_name="messages"),
            MessagesPlaceholder(variable_name="agent_scratchpad"),
        ]
    )
    agent = create_openai_tools_agent(llm, tools, prompt)
    executor = AgentExecutor(agent=agent, tools=tools)  # type: ignore
    return executor

No real changes there so let’s move on to the AgentState definition:

class AgentState(TypedDict):
    messages: Annotated[Sequence[BaseMessage], operator.add]

We have a simple list of BaseMessage objects and every node in the graph will add a message to this list as the state passes through that particular node.
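The operator.add in the annotation is what makes each node’s messages get appended rather than overwrite the list. The sketch below shows just that merge step in isolation, using plain strings instead of BaseMessage objects to stay self-contained. It illustrates the idea and is not LangGraph’s internal code:

```python
import operator

# Current state and a node's return value, simplified to plain strings.
state = {"messages": ["user: Jaws"]}
update = {"messages": ["tavily_agent: here are some sources"]}

# For lists, operator.add(a, b) is the same as a + b: it concatenates
# and returns a new list, leaving the originals untouched.
merged = operator.add(state["messages"], update["messages"])
print(merged)
```

This is why every node returns a dictionary with a messages list, even when it only has a single message to add.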

Creating our nodes

Now we’ll have a basic function to create a new agent node similar to what we’ve done before, but this time we’ll also have one to create an async agent node:

def agent_node(state: AgentState, agent, name):
    result = agent.invoke(state)
    return {"messages": [HumanMessage(content=result["output"], name=name)]}


async def async_agent_node(state: AgentState, agent, name):
    result = await agent.ainvoke(state)
    return {"messages": [HumanMessage(content=result["output"], name=name)]}

The first one is pretty much the same as the one we used before, making sure we return a message in line with what we defined each node will add to the state object. The second one is the async version of the same function. We use async def and here we await the agent’s ainvoke method instead of just calling the normal invoke method.

Now we can create our Tavily agent and our research agent:

tavily_agent = create_agent(LLM, [TAVILY_TOOL], TAVILY_AGENT_SYSTEM_PROMPT)
tavily_agent_node = functools.partial(
    agent_node, agent=tavily_agent, name=TAVILY_AGENT_NAME
)


research_agent = create_agent(LLM, [research], RESEARCHER_SYSTEM_PROMPT)
research_agent_node = functools.partial(
    async_agent_node, agent=research_agent, name=RESEARCH_AGENT_NAME
)

You’ve seen all of this before, but make sure you use the async_agent_node function for the research agent instead of the normal one.
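If the functools.partial step feels a bit magical, here is a stripped-down sketch of what it does for the async node. EchoAgent is a hypothetical stand-in with an ainvoke method and is not a LangChain class. Messages are plain strings to keep it self-contained:

```python
import asyncio
import functools


class EchoAgent:
    """Hypothetical stand-in for an AgentExecutor (not a LangChain class)."""

    async def ainvoke(self, state: dict) -> dict:
        await asyncio.sleep(0)  # pretend to do some async work
        return {"output": f"researched {state['messages'][-1]}"}


async def async_agent_node(state: dict, agent, name: str) -> dict:
    result = await agent.ainvoke(state)
    return {"messages": [f"{name}: {result['output']}"]}


# partial pre-fills agent and name, leaving a callable that only needs state,
# which is the signature a graph node is expected to have.
node = functools.partial(async_agent_node, agent=EchoAgent(), name="researcher")

update = asyncio.run(node({"messages": ["Jaws"]}))
print(update)
```

Calling the partial object still returns a coroutine, so it has to be awaited or run in an event loop just like the original async function.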

Now we need one more node, that will take the output of the research agent and write it to a file for us. This node does not need any agents or LLM action, so we can just define it as a normal function:

def save_file_node(state: AgentState):
    markdown_content = str(state["messages"][-1].content)
    filename = f"{OUTPUT_DIRECTORY}/{uuid.uuid4()}.md"
    with open(filename, "w", encoding="utf-8") as file:
        file.write(markdown_content)
    return {
        "messages": [
            HumanMessage(
                content=f"Output written successfully to {filename}",
                name=SAVE_FILE_NODE_NAME,
            )
        ]
    }

This shows that the graph is really nothing but a state machine. We can just write any arbitrary function and use it as a node as long as we meet the conditions we set for the graph. The function takes the AgentState object as input, does whatever it wants to do, and then adds an update to the AgentState object as promised. It doesn’t matter that there is no agent or LLM in this step.

In this case, we extract the markdown content from the state object’s last message [-1] which is the research node’s output. We then generate a random filename using the uuid module and write the markdown content to a file with that name and the .md extension. Finally, we return a message to the state object that the output was written successfully.
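If you want to try the file-saving logic on its own, here is a small variation as a sketch. It uses pathlib and a temporary directory instead of our OUTPUT_DIRECTORY so it leaves nothing behind. The save_markdown helper is a name I made up for this example:

```python
import tempfile
import uuid
from pathlib import Path


def save_markdown(content: str, directory: Path) -> Path:
    """Write content to a uniquely named .md file and return its path."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{uuid.uuid4()}.md"
    path.write_text(content, encoding="utf-8")
    return path


# The temporary directory is deleted again when the with block ends.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_markdown("# Jaws\n\nA short article.", Path(tmp))
    saved_name = saved.name
    saved_text = saved.read_text(encoding="utf-8")

print(saved_name)
```

Using uuid4 for the filename means repeated runs never overwrite each other, at the cost of filenames that say nothing about the topic.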

Piecing our graph together

Now we can define our graph:

workflow = StateGraph(AgentState)
workflow.add_node(TAVILY_AGENT_NAME, tavily_agent_node)
workflow.add_node(RESEARCH_AGENT_NAME, research_agent_node)
workflow.add_node(SAVE_FILE_NODE_NAME, save_file_node)

workflow.add_edge(TAVILY_AGENT_NAME, RESEARCH_AGENT_NAME)
workflow.add_edge(RESEARCH_AGENT_NAME, SAVE_FILE_NODE_NAME)
workflow.add_edge(SAVE_FILE_NODE_NAME, END)

workflow.set_entry_point(TAVILY_AGENT_NAME)
research_graph = workflow.compile()

We just go from the Tavily agent to the research agent, and then from the research agent to the save file node. This example is pretty simple as we’re focusing on the async part. We can always add this to more complex graphs later on if we need to.
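To see how little is going on in a linear graph like this, here is a toy runner in plain Python. It is only a mental model under simplified assumptions (string messages, made-up node functions) and is not how LangGraph is implemented:

```python
def search_node(state: dict) -> dict:
    return {"messages": ["search: found 3 URLs"]}


def research_node(state: dict) -> dict:
    seen = len(state["messages"])
    return {"messages": [f"research: article based on {seen} earlier message(s)"]}


def save_node(state: dict) -> dict:
    article = state["messages"][-1]
    return {"messages": [f"save: stored {len(article)} characters"]}


def run_linear(nodes: list, state: dict) -> dict:
    # Visit the nodes in order and append each node's messages to the state.
    for node in nodes:
        update = node(state)
        state = {"messages": state["messages"] + update["messages"]}
    return state


final_state = run_linear(
    [search_node, research_node, save_node], {"messages": ["user: Jaws"]}
)
for message in final_state["messages"]:
    print(message)
```

Each node only sees the state so far and only returns what it wants to add, which is the same contract our real nodes follow.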

Now let’s create a main function to run the graph:

async def run_research_graph(input):
    async for output in research_graph.astream(input):
        for node_name, output_value in output.items():
            print("---")
            print(f"Output from node '{node_name}':")
            print(output_value)
        print("\n---\n")

This function is an async function that takes an input and then runs the graph with that input. It uses an async for loop to iterate over the output of the graph after we run astream (async stream) on it. For each output, we get the node’s name and the output value, so we print both to the console to see what is going on live.
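The async for loop works with any async generator, not just a compiled graph. The sketch below uses a made-up fake_astream generator that yields one dictionary per node, which is an assumption chosen to mirror the shape we loop over above:

```python
import asyncio


async def fake_astream(steps: list):
    # Hypothetical stand-in for research_graph.astream: yields one
    # {node_name: output} dictionary per finished node.
    for node_name, value in steps:
        await asyncio.sleep(0)  # give control back to the event loop
        yield {node_name: value}


async def collect_node_names(steps: list) -> list[str]:
    seen = []
    async for output in fake_astream(steps):
        for node_name, output_value in output.items():
            print(f"Output from node '{node_name}': {output_value}")
            seen.append(node_name)
    return seen


steps = [("tavily_agent", "sources"), ("save_file", "written")]
seen = asyncio.run(collect_node_names(steps))
```

Because results are yielded as soon as each step finishes, we can print progress live instead of waiting for the whole run to complete.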

Now we can run the graph with a simple test input:

test_input = {"messages": [HumanMessage(content="Jaws")]}

asyncio.run(run_research_graph(test_input))

We create the first input message for the state object and then use asyncio.run as we did before because it takes care of the event loop that runs the async code for us. Save and run this file and you should see the graph running and outputting the results to the console:

API Keys loaded and tracing set with project name:  Web_Search_Graph
Output from node 'tavily_agent':
---
{'messages': [HumanMessage(content='Here are some relevant sources about "Jaws": ... ', name='tavily_agent')]}

---

URL: https://www.imdb.com/title/tt0073195/ - fetched successfully.
URL: https://www.rottentomatoes.com/m/jaws - fetched successfully.
URL: https://www.britannica.com/topic/Jaws-film-by-Spielberg - fetched successfully.
URL: https://en.wikipedia.org/wiki/Jaws_(film) - fetched successfully.
Output from node 'search_evaluator_agent':
---
{'messages': [HumanMessage(content='# **Jaws: A Deep Dive into the Iconic Film**\n\n## markdown summary here... ', name='search_evaluator_agent')]}

---

Output from node 'save_file':
---
{'messages': [HumanMessage(content='Output written successfully to c:\\Coding_Vault\\FINX_LANGGRAPH_TUTS\\output/d22855f8-9f76-4fc6-8192-7490852e1644.md', name='save_file')]}

---

Output from node '__end__':
---
{'messages': ['The whole state object...']}

---

Go ahead and open the .md file that was created in the output folder and you should see the markdown article that was written by the research agent.

I’ve gone ahead and tried another one, inputting the topic “Pokemon”.

There you go! We’ve created a pretty fast and very useful internet research and article-writing tool!

From here on we can create PDF files, send emails, write articles, or do anything and everything we want really. We can tweak the output or the number of input URLs, or use gpt-4-turbo if we want a very long output article and large input context window so we can use even more sources.

We can add any conditional edges and paths and have the agents do whatever we want! All we’ve shown is just the basic ways in which you can combine stuff. You now have all the knowledge you need to build whatever you want. I’ll leave the rest up to your imagination.

It’s been a pleasure to take this journey together. I hope you learned a lot and had some fun along the way. I’ll see you again soon in the next one, until then, happy coding!



The post Building Complex Multi-Agent Teams and Setups with LangGraph appeared first on Be on the Right Side of Change.

[Full Tutorial] OpenAI Fine-Tuning: Creating a Chatbot of Yourself (Example: ChrisGPT)

Dirk van Meerveld — Sat, 06 Apr 2024 07:43:04 +0000

Hi and welcome to this course on fine-tuning ChatGPT with the OpenAI API. In this course we’ll not just look at how to deal with the fine-tuning API endpoints itself, but also how to run the fine-tuned model, and most importantly, the data preparation and validation steps so that you will be able to do this yourself using any data you want to when you are done.

Tip: This is a full-text tutorial on how to fine-tune ChatGPT using the OpenAI API with code! For a video guide-through, check out our premium course “Fine-Tuning OpenAI: How to Create and Scale Your Digital Self (Example: ChrisGPT)”.

For our fine-tuning example, we have a really fun one here. We are going to clone Chris! Yes, you read that correctly. We will use the Finxter emails Chris always sends us to train our own ChrisGPT! ChrisGPT is going to learn how to write emails in the distinctive Finxter style, on any topic we choose!

In part 1 we’ll start by looking at what fine-tuning is and why we would want to do it, and look at the type of data we will need in order to fine-tune our own ChatGPT models, getting started on our data preparation.
Part 2 will have us really dive into the data preparation, also using ChatGPT to make our work a bit easier, creating a perfectly formatted JSON dataset for training ChrisGPT.
In part 3 we’ll look at flattening our data into JSONL format, and then data validation to make sure we get it right the first time. We’ll also calculate the cost so that we know exactly what we’re getting into.
Part 4 is all about the fine-tuning API and how to use it. We’ll train ChrisGPT and then have a lot of fun having our own RoboChris write our emails about anything we want!

I hope you’re as excited as I am to get started. Let’s dive in!

Part 1: Fine-tuning Our Own ChatGPT Model

Welcome to this course on ChatGPT fine-tuning. My name is Dirk van Meerveld and together we will be taking a look at fine-tuning ChatGPT to make our own custom versions of the popular LLM. Before we dive in we’ll take a look at what fine-tuning entails and when we should use it.

Why fine-tuning?

First of all, let’s take a moment to consider how we usually get ChatGPT to do what we want. We tell it, using a prompt message, right? Basically, we tell it what we want it to do, and we probably give it some examples as well if the task has any complexity to it. This is called “few-shot learning” as we give a couple of demonstrations on how to perform the task.

So usually prompt engineering will get ChatGPT to do whatever we want and there’s not really any problem with that right? But what if the problem we want ChatGPT to solve is a bit more complex, and would require hundreds of reasonably sized examples? There are several use cases for this, but we’ll start with an example on brand identity.

Say that your company brand has a certain style and tone of communication, which is different from the default ChatGPT way of speaking. You are probably not a famous person, so you can’t just query GPT to write “In the style of Elon Musk” or “In the style of Barack Obama”. ChatGPT doesn’t know who you are!

So what do you do? Use the very largest GPT-4-turbo model with the largest context limit and just send 100 pages full of examples of your brand’s style of communication in the prompt setup message every single time? This will not work very well for several reasons:

Cost -> Sending that much information with every GPT call, especially when using the most expensive GPT-4 model, will be very expensive if you scale it up.
Latency -> Your calls will not only be expensive but also slow in comparison, as the amount of data sent and processed is very large.
Quality -> The normal model will have trouble learning an entire way of speaking, including the tone and nuances, from just a single system prompt setup message, even if it is very long. The input text is just a prompt and this style of speaking will not get ‘embedded into the neurons’ of the model, so to speak.

This is where fine-tuning comes to the rescue. Basically, OpenAI will give us a vanilla GPT model in a separate container. We then get to provide extra training data of our own, and OpenAI will further train the GPT model on the data we provide, creating our own custom fine-tuned version of ChatGPT.

We feed it a large amount of examples of our brand’s style of communication. This way we won’t have to send a million tokens in the context limit every time and can just query our custom-trained model which has our brand’s style of communication embedded into its very neurons!

I think you can see how this would be extremely helpful in many areas. A content creator may want some help writing initial drafts or ideas for new work but needs them to adhere to their own writing style. A large brand company may want to employ customer service bots, as many do these days, but needs them to adhere to the brand’s style and rules for communication, just like the human employees. Anyone with any kind of writing or speaking style may want some assistance from ChatGPT but in their own style and form of speech.

Let’s clone Chris!

To explore this idea and show you how to implement this for yourself or your clients using example data of their writing, we will be using an example most of you will be familiar with, Chris! You have probably received a fair amount of his Finxter emails if you’re taking this Finxter course, so you will know his writing. Today we are going to make ChrisGPT, a model that has been fine-tuned on Chris’ writing style.

I’ve chosen Chris as an example for several reasons:

Most of you are probably familiar with him from the emails.
He’s not so famous that we could just query GPT to write “In the style of Chris” and get a good result. This makes it into a realistic example of doing this for yourself or a client.
He has a distinct style of writing that we will be able to differentiate from the default ChatGPT style.
I have loads of data for him in the form of the Finxter emails on my email account.
He has agreed to let us do this (thanks, Chris!).

Of course, Finxter emails from my Gmail inbox aren’t perfectly clean ideal-world data examples, but they will be good enough to give you an idea of how fine-tuning works and how you can use it to implement a brand identity or writing style into ChatGPT with the proper data. They will also add some fun to our exploration and learning experience!

Now I could just give you the perfectly formatted data so that you can send it straight to the OpenAI API for fine-tuning and we could have the whole thing over and done with in a single video. But if I did that I’d be doing you a huge disservice. The fine-tuning API itself is very simple and doesn’t take much time to learn at all, but it’s only a small part of the picture.

The real challenge is getting your input data, and getting the input data right. There is a real process to getting the right training data, making sure you formulate it right and avoid the pitfalls, and then getting all the training data into properly formulated examples. The fine-tuning part itself is the easy part. We will look at the nitty-gritty details of getting the data right, as our GPT will only ever be as good as our training data.

Garbage in = Garbage out.

So what kind of data do we need?

Well, basically we need to show ChatGPT what we want, by giving it examples. These examples need to be in the same format as normal ChatGPT chat completions, which makes total sense, as we’re trying to teach it what kind of chat completions we want it to create right? So the basic idea is something like this:

[
  {
    "messages": [
      {
        "role": "system",
        "content": "You are a Harry Potter wizard spell creator who comes up with new magical spells! Alohamora!"
      },
      { "role": "user", "content": "How can I bake a cookie?" },
      {
        "role": "assistant",
        "content": "Confectio Deliciosum! This spell will help you bake cookies in no time!"
      }
    ]
  },
  {
    "messages": [
      {
        "role": "system",
        "content": "You are a Harry Potter wizard spell creator who comes up with new magical spells! Alohamora!"
      },
      { "role": "user", "content": "I need to fend off a shark!" },
      {
        "role": "assistant",
        "content": "Squalus Repellum! This spell will help you fend off a shark!"
      }
    ]
  },
  {
    "messages": [
      {
        "role": "system",
        "content": "You are a Harry Potter wizard spell creator who comes up with new magical spells! Alohamora!"
      },
      { "role": "user", "content": "I need to lose some weight fast." },
      {
        "role": "assistant",
        "content": "Adiposum Reductum! Your belly fat will just fly out the window!"
      }
    ]
  }
]

Notice that we have a list of objects, which look like the messages list we would typically work with. We are basically simulating conversations with ChatGPT that never occurred, so we can use these as training data. For each example we have a system message, then we have the user query in the second slot, and the assistant’s response in the third slot.

Now notice that the entire system message is sent and is exactly the same for all three examples. You might feel that this is wasteful, as we’re repeating the same thing over and over again, but remember that we’re trying to sort of hard-wire these instructions into ChatGPT, so this is a good thing.

If you try to leave out the system messages to save tokens the model will have to learn entirely through demonstration of the outcome, and your training may actually take longer. We recommend just leaving that system message in there for every example, and make sure it’s a good one because it is going to get baked into the model!

The second entry, the user query, is obviously going to be different each time. Make sure you include examples that match the way you intend to use your final fine-tuned model. Especially make sure you include any edge cases and harder-than-usual examples, as the training phase is the time to show the model what it will be up against.

The third entry, the assistant’s response, is going to be the exact perfect answer that we want ChatGPT to give for this query. In effect, ChatGPT is trained to learn: given this system message and this query, this is the response I should give.
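As a quick sketch of how you could build these three-slot examples in Python, the helper below wraps a query and its ideal answer in the structure shown above. The make_example name is my own invention for this illustration and is not part of the OpenAI library:

```python
SYSTEM_PROMPT = (
    "You are a Harry Potter wizard spell creator who comes up with "
    "new magical spells! Alohamora!"
)


def make_example(user_query: str, ideal_response: str,
                 system_prompt: str = SYSTEM_PROMPT) -> dict:
    """Return one training example in the chat-completions message format."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_query},
            {"role": "assistant", "content": ideal_response},
        ]
    }


example = make_example(
    "How can I bake a cookie?",
    "Confectio Deliciosum! This spell will help you bake cookies in no time!",
)
roles = [message["role"] for message in example["messages"]]
print(roles)
```

Keeping the system prompt in one constant makes it easy to guarantee that every example carries exactly the same system message.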

Note the example above is of course useless, as we could easily achieve this output without any fine-tuning at all from basically any LLM in existence. It is just an example of the training data structure. In reality, we need at least 10 examples for fine-tuning, but you should probably aim for at the very least 50 well-crafted examples if not more.

Also, the final format needs to be in JSONL format, with every object flattened down onto a single very long line, which looks kind of like this:

{"messages": [{system...}, {user...}, {assistant...}]}
{"messages": [{system...}, {user...}, {assistant...}]}
{"messages": [{system...}, {user...}, {assistant...}]}

But this is only a minor and easy conversion, so we’ll get back to that later.
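Just to show how small that conversion is, here is a minimal sketch using only the json module. The to_jsonl helper and the two tiny examples are made up for illustration, and the version we write later in the course may look different:

```python
import json


def to_jsonl(examples: list[dict]) -> str:
    # json.dumps escapes newlines inside strings as \n, so each example
    # is guaranteed to end up on a single line.
    return "\n".join(json.dumps(example, ensure_ascii=False) for example in examples)


examples = [
    {"messages": [{"role": "user", "content": "Line one\nLine two"}]},
    {"messages": [{"role": "user", "content": "Another example"}]},
]

jsonl_text = to_jsonl(examples)
lines = jsonl_text.split("\n")
print(len(lines))
```

Reading the data back is simply a matter of calling json.loads on every line.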

As for the length, each training example is limited to the context length of the model. So every single line of the JSONL data can be up to the context limit, which for gpt-3.5-turbo-1106 is 16,385 tokens. As this is a very high amount, we’re not going to worry about it too much for our use cases here, as we’re not going to be going over this limit.

Now we’ll be using gpt-3.5-turbo-1106 here as it is the newest version of the model that has fine-tuning support so far. This is probably a good thing though as fine-tuning on GPT-4 would be a lot more expensive and as we’ll be showing the model exactly what we want it to do anyway, we won’t really need GPT-4’s extra capabilities.

The data

So, I’ve gone through my email account and extracted a whole bunch of emails I have received from Chris at Finxter, the last 200 to be precise. This very first step, I have done for you, as I obviously cannot give you all access to my personal email inbox! But I will still cover roughly the steps taken:

I’ve applied a label to all the emails I wanted to extract from my inbox, so I could easily find them.
I went to Google Takeout and requested a download of all my emails with that label.
I received a file with all my emails in MBOX format.
I wrote a Python script, mbox_to_json_decode.py, which takes the emails, decodes them, takes all my personal unsubscribe links and other personal data out, and then writes them to a JSON file.

As this MBOX to JSON conversion is hyper-specific, and the MBOX file contains some of my personal data, this is the only step along the way we will skip, as the chances that you will also have to convert MBOX files to JSON are very slim and I want to keep this course relevant. If you do need information on MBOX to JSON conversion, I will add the mbox_to_json_decode script in the GitHub repository so you can check it out if you need to.

So now we are left with Finx_dataset.json, which will be our entry point for this tutorial. Normally I would include this file in the GitHub repository, but as it has a large amount of the Finxter email data, we have elected to not include it in the repository. Instead, the file will be available for download from the course page in the Finxter Academy. If you haven’t downloaded it yet, please do so now.

Then go ahead and create a base project folder to use for this course, I’ve named mine Finx_Fine_Tuning, and then create a folder named data inside of it. Then move the Finx_dataset.json file into the data folder to create the following structure:

Finx_Fine_Tuning
    data
        Finx_dataset.json

Create a venv in the root project folder

Ok, just a small detour before we continue with our project!

We will use a virtual environment for this project, as I will be using specific versions for the libraries we install as we go along, and I want to make sure that you have the exact same experience as I do. The virtual environment will make it easy for you to install my exact versions without worrying about affecting any of your other projects.

If you don’t have pipenv installed yet, install it first:

pip install pipenv

Make sure the terminal is inside your root project folder, e.g. /c/Coding_Vault/Finx_Fine_Tuning, and then run the following command to create a new virtual environment:

pipenv shell

This will create a new virtual environment and also a Pipfile in your project directory. Any packages you install using pipenv install will be added to the Pipfile.

To generate a Pipfile.lock, which is used to produce deterministic builds, run:

pipenv lock

This will create a Pipfile.lock in your project directory, which contains the exact version of each dependency to ensure that future installs are able to replicate the same environment.

Back to our data

Back to where we were. Our root project folder should now look like this:

Finx_Fine_Tuning
    data
        Finx_dataset.json
    Pipfile
    Pipfile.lock

Let’s go ahead and take a look at the Finx_dataset.json file we downloaded earlier to see what kind of raw data we are working with here:

[ { "subject": "5 Proxies to Investing in OpenAI", "body": "\n\n\t\n\n\n

\n\n

\u00a0

\n\n

Hey {User},

\n\n

To profit from change, we need to increase ownership of disruptive trends. Today's article covers a question that many Finxters frequently ask:

\n\n

\ud83e\udeb4 [Blog] How to Invest in OpenAI? \ud83c\udf33

\n\n

While it's not possible to invest in OpenAI directly, the blog discusses five alternatives:

\n\n

MSFT (49% stake in OpenAI),
NVIDIA (makes more revenue from OpenAI than any other company),
ARKVX (Anthropic!),
META (Llama 2!), and
TSLA (Optimus!).

\n\n

Check it out if you're interested in any of those! No financial advice. \ud83d\ude0a

\n\n

Be on the right side of change. \ud83d\ude80
\nChris

\n\n

\u00a0

\n\n

\u00a0

\n\n

\u2665\ufe0f Community Corner: Featured Resources

\n\n

TradeUnafraid is a trading platform owned and operated by Finxter community member Lee.

\n\n

Do you want to feature your own startup, YouTube channel, blog, or website as a Finxter premium member? Hit reply and let me know!

\n\n

\u00a0

\n\n

\u00a0

\n\n

\u00a0

\n\n

How are we doing?
\n\u2b50
\n\u2b50\u2b50
\n\u2b50\u2b50\u2b50
\n
\nTo make sure you keep getting these emails, please add chris@finxter.com to your address book.
\n
\nI'd love to hear your feedback so that I can improve this free email course over time. Please reply to this email and share everything on your mind!
\n
\nIf you find the Finxter Email Academy useful, please invite a friend or colleague! \u2764

\n\n

\nHere's the subscription link you can share:
\nhttps://blog.finxter.com/subscribe/
\n
\nDownload the Ultimate Python Cheat Sheet here (direct PDF download): \ud83d\udc0d

\n\n

The Ultimate Python Cheat Sheet
\n
\nNot very motivated to learn today? Consider this:
\n\"Knowledge compounds!\" -- Warren Buffett
\n
\nConsequently, if you improve your skills by 1% every day, you'll 36x your programming skills within a year!

\n\n

\u00a0

\n\n

\u00a0

\n\n

\u00a0

\n\n

\nFinxter, Dr. Christian Mayer
\n{Address}., {City}, {Country}

\n\n

Want out of the loop? I'm so sad to see you go. \ud83d\ude22 How could we have done better?
\nTo help future Finxters, please hit reply and tell us! \ud83e\udd17

\nUnsubscribe here\n\n

\u00a0

\n\n

\u00a0

\n\n

\u00a0

\n\n\n\n" }, { "subject": "Tech Deflation vs Inflation", "body": "Email no2..." }

As you can see, we have a list of objects, each with a subject and body key. The body key contains the raw HTML of the email, which we will need to clean up a bit before using it for our purposes. The only preprocessing I’ve done in the MBOX to JSON conversion is replacing links and personal data with generic placeholders such as {Link} and {User}.

If you’re wondering what the \uxxxx sequences are, like \ud83d\udc0d, they are Unicode escape sequences, which is how JSON can represent characters outside the plain ASCII range. This particular pair represents the “snake” emoji (🐍). You will see these quite a lot as Chris is of course famous for his creative emoji usage!
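To see what such an escape pair decodes to, here is a small sketch using only Python's standard json module. The sample document is made up for illustration; the escape pair is the one from the email above.

```python
import json

# A tiny JSON document containing the escaped pair from the email.
# The raw string keeps the backslashes literal, as they appear in the file.
raw_json = r'{"body": "Cheat sheet \ud83d\udc0d"}'

# json.loads combines the two \uXXXX escapes into one character.
decoded = json.loads(raw_json)
snake = decoded["body"][-1]

print(len(snake))       # 1
print(hex(ord(snake)))  # 0x1f40d
```

So by the time the JSON file has been loaded, the two escapes have already become a single emoji character.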

The full list has about 200 of these email objects, in non-chronological order. If you scroll through the data, you will see some noise in there, which will be reflected in our final product. For the purposes of this tutorial, it will be good enough. For professional use, you’d want to make sure to clean up the data more thoroughly, spending some more time here.

Preparing our data

We now have our basic data, and we know what kind of format we need for the training data, like the Harry Potter magical spells example we showed. Now let’s start wrangling the data into the format we need. As with all complex coding tasks let’s take it one step at a time, and let’s build our solution in small and reusable parts.

We’ll start with a utility to convert the email above into a more readable and simple format. Instead of the HTML above with all the emojis in a format that we cannot even read and loads of HTML tags all over the place, let’s have a utility function that takes that HTML email as input and returns a simple and readable markdown format version for us to work with instead.

So go ahead and create a new folder named utils in the root project folder, and then create a new file named html_email.py inside the utils folder:

Finx_Fine_Tuning
    data
        Finx_dataset.json
    utils
        html_email.py
    Pipfile
    Pipfile.lock

Now before we get started on the html_email.py file, we’ll need to install a library called html2text which will help us convert the HTML emails to markdown. Someone has already written a library to do this for us, so we don’t have to write it ourselves. Always use existing solutions when they exist to speed up your development cycle!

To install a specific version of a package in our Pipenv environment, you can use the pipenv install command followed by the package name and the version number. Run the following command:

pipenv install html2text==2020.1.16

This command will add html2text to our Pipfile under the [packages] section with the specified version. It will also update your Pipfile.lock to include the exact version of html2text and its dependencies.

Now let’s go ahead and open the html_email.py file and add the following code:

import html2text

def html_to_markdown(html: str) -> str:
    html = html.encode("utf-16", "surrogatepass").decode("utf-16")

    html_to_text_converter = html2text.HTML2Text()
    html_to_text_converter.ignore_links = False
    return html_to_text_converter.handle(html)

We first import the library we have just installed. Then we define a function html_to_markdown which takes an HTML string as input and returns a markdown string.

We then take the html variable, which is a string, and run it through an encode/decode round trip. An emoji written as \ud83d\ude80 in a Python string is stored as two separate “surrogate” code points instead of one character. The "surrogatepass" error handler lets Python encode these surrogates instead of raising an error, so that after running this line the \ud83d\ude80 patterns we talked about earlier are turned into the corresponding emoji characters (in this case, the rocket emoji 🚀).

This works because the .encode method converts the string to bytes using UTF-16 encoding, writing each surrogate half out as a 16-bit unit. The .decode method then reads those bytes back into a string, and a UTF-16 decoder treats two matching halves as a single character. So we basically did a round-trip conversion from split surrogate pairs to actual Unicode characters.
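The effect of the round trip can be checked in isolation with a short sketch. It uses only built-in string methods, and the variable names are just for illustration.

```python
# In Python source code, "\ud83d\ude80" creates two separate surrogate
# code points rather than one emoji character.
broken = "\ud83d\ude80"
print(len(broken))  # 2

# Encoding with "surrogatepass" writes both halves as UTF-16 code units;
# decoding then reads them back as a single surrogate pair.
fixed = broken.encode("utf-16", "surrogatepass").decode("utf-16")
print(len(fixed))  # 1
print(fixed == "\U0001F680")  # True: U+1F680 is the rocket emoji
```

Without "surrogatepass" the encode step would raise a UnicodeEncodeError, because lone surrogates are not normally allowed in encoded text.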

We then create an instance of the HTML2Text class and set the ignore_links attribute to False to include links in the output. We then call the handle method of the HTML2Text instance and pass the HTML string as an argument to convert it to markdown, and simply return the result.

Let’s test it out

Let’s go ahead and give it a test run. Above the html_to_markdown function, add the following variable holding a test email string:

test_email = '\n\n\t\n\n\n

\n\n

\u00a0

\n\n

Hey {User},

\n\n

To profit from change, we need to increase ownership of disruptive trends. Today\'s article covers a question that many Finxters frequently ask:

\n\n

\ud83e\udeb4 [Blog] How to Invest in OpenAI? \ud83c\udf33

\n\n

While it\'s not possible to invest in OpenAI directly, the blog discusses five alternatives:

\n\n

MSFT (49% stake in OpenAI),
NVIDIA (makes more revenue from OpenAI than any other company),
ARKVX (Anthropic!),
META (Llama 2!), and
TSLA (Optimus!).

\n\n

Check it out if you\'re interested in any of those! No financial advice. \ud83d\ude0a

\n\n

Be on the right side of change. \ud83d\ude80
\nChris

\n\n

\u00a0

\n\n

\u00a0

\n\n

\u2665\ufe0f Community Corner: Featured Resources

\n\n

TradeUnafraid is a trading platform owned and operated by Finxter community member Lee.

\n\n

Do you want to feature your own startup, YouTube channel, blog, or website as a Finxter premium member? Hit reply and let me know!

\n\n

\u00a0

\n\n

\u00a0

\n\n

\u00a0

\n\n

How are we doing?
\n\u2b50
\n\u2b50\u2b50
\n\u2b50\u2b50\u2b50
\n
\nTo make sure you keep getting these emails, please add chris@finxter.com to your address book.
\n
\nI\'d love to hear your feedback so that I can improve this free email course over time. Please reply to this email and share everything on your mind!
\n
\nIf you find the Finxter Email Academy useful, please invite a friend or colleague! \u2764

\n\n

\nHere\'s the subscription link you can share:
\nhttps://blog.finxter.com/subscribe/
\n
\nDownload the Ultimate Python Cheat Sheet here (direct PDF download): \ud83d\udc0d

\n\n

The Ultimate Python Cheat Sheet
\n
\nNot very motivated to learn today? Consider this:
\n"Knowledge compounds!" -- Warren Buffett
\n
\nConsequently, if you improve your skills by 1% every day, you\'ll 36x your programming skills within a year!

\n\n

\u00a0

\n\n

\u00a0

\n\n

\u00a0

\n\n

\nFinxter, Dr. Christian Mayer
\n{Address}., {City}, {Country}

\n\n

Want out of the loop? I\'m so sad to see you go. \ud83d\ude22 How could we have done better?
\nTo help future Finxters, please hit reply and tell us! \ud83e\udd17

\nUnsubscribe here\n\n

\u00a0

\n\n

\u00a0

\n\n

\u00a0

\n\n\n\n'

Just copy it from the written version of the tutorial, and make sure you insert it above the function we wrote:

import html2text

test_email = ...

def html_to_markdown(html: str) -> str:
    ...

Now, below the html_to_markdown function, add the following code to test the function:

if __name__ == "__main__":
    markdown_content = html_to_markdown(test_email)

    with open("test.md", "w", encoding="utf-8") as file:
        file.write(markdown_content)

This code will run the html_to_markdown function with the test_email string as input, and then write the result to a file named test.md. The if __name__ == "__main__": line ensures that the code inside the block only runs when the script is executed directly, and not when we import the html_to_markdown function into another script later on.

 Python Top-tip 
In Python, when a script is run, a special built-in variable called __name__ is set to "__main__". However, if a module is imported, __name__ is set to the module's name instead. By checking if __name__ == "__main__":, the script can determine whether it's being run directly or being imported as a module.

This allows for a flexible way to organize your code. You can put code that tests the functionality of the module or demonstrates how to use the module under this if statement. When the module is imported, this code won't run, but when the script is run directly, the code will execute. This is particularly useful for unit testing or for scripts that can be used both as utility modules and as standalone programs.

Now go ahead and run the script and a new file named test.md will be created. If you check it out it will have the markdown version of the email we provided as input.

[![]({Link})]({Link})

Hey {User},

To profit from change, we need to increase ownership of disruptive trends.
Today's article covers a question that many Finxters frequently ask:

🪴 [ **Blog** ] [How to Invest in OpenAI?]({Link}) 🌳

While it's not possible to invest in OpenAI directly, the blog discusses five
alternatives:

  * **MSFT** (49% stake in OpenAI),
  * **NVIDIA** (makes more revenue from OpenAI than any other company),
  * **ARKVX** ( _Anthropic!_ ),
  * **META** ( _Llama 2!_ ), and
  * **TSLA** (Optimus!).

Check it out if you're interested in any of those! No financial advice. 😊

Be on the right side of change. 🚀
Chris

**♥️ Community Corner: Featured Resources**

  * [TradeUnafraid]({Link}) is a trading platform owned and operated by Finxter community member Lee.

Do you want to feature your own startup, YouTube channel, blog, or website as
a [Finxter premium member]({Link})? Hit reply and let me know!

**[How are we doing?]({Link})**
[⭐]({Link})
[⭐⭐]({Link})
[⭐⭐⭐]({Link})

If we render this properly as markdown it will result in the following look:

###########################START##########################

Hey {User},

To profit from change, we need to increase ownership of disruptive trends.
Today’s article covers a question that many Finxters frequently ask:

[ Blog ] How to Invest in OpenAI?

While it’s not possible to invest in OpenAI directly, the blog discusses five
alternatives:

MSFT (49% stake in OpenAI),
NVIDIA (makes more revenue from OpenAI than any other company),
ARKVX ( Anthropic! ),
META ( Llama 2! ), and
TSLA (Optimus!).

Check it out if you’re interested in any of those! No financial advice.

Be on the right side of change.
Chris

Community Corner: Featured Resources

TradeUnafraid is a trading platform owned and operated by Finxter community member Lee.

Do you want to feature your own startup, YouTube channel, blog, or website as
a Finxter premium member? Hit reply and let me know!

How are we doing?

###########################END##########################

Which is good enough for our purposes for this tutorial. We will be using this markdown version of the emails as our training data for the fine-tuning process. We could go and clean up even further to have cleaner output, but for the purposes of this tutorial, this will be good enough.

Now that we have our HTML to Markdown function prepared, we’ll continue in part 2, where we will generate the actual training data for our fine-tuning of ChrisGPT. I’ll see you in part 2!

Part 2: Generating the Training Data

Hi and welcome back to part 2, where we will be generating the training data for our own ChrisGPT model! We’ll pick up where we left off in part 1 and use our HTML to Markdown converter utility and more to write and run our training dataset generator that will build the ChrisGPT training data for us.

API Key

We need to start by setting our key for the OpenAI API, but we cannot hardcode this one in our source code. Go to https://platform.openai.com/api-keys and copy your API key. If you don’t have one, make sure to get one. You’ll only pay for what you use which will be cents if you just play around with it casually. Then create a new file called .env in the root folder of your project:

Finx_Fine_Tuning
    data
        Finx_dataset.json
    utils
        html_email.py
    .env                  (new file)
    Pipfile
    Pipfile.lock

And paste your API key in there like this, making sure not to use any spaces or quotes:

OPENAI_API_KEY=your_api_key_here

Then go ahead and save and close this file. (You can also use environment variables to set the key if you prefer, but this is simpler to set up for tutorial purposes as it works the same on all platforms.)

Constants

Now, let’s define some project-wide settings by creating a new file named constants.py in the root directory of our project. This file will contain the constants that we will use multiple times throughout our project:

Finx_Fine_Tuning
    data
        Finx_dataset.json
    utils
        html_email.py
    .env
    constants.py        (new file)
    Pipfile
    Pipfile.lock

Run the following command in your terminal to add the python-decouple package inside your pipenv environment:

pipenv install python-decouple==3.7

We will use this package to read the .env file and get the API key from it. Then install the openai library as well:

pipenv install openai==1.12.0

Make sure you run the installs even if you already have the packages installed, as we need to make sure they are installed in the virtual environment we are using for this project. Now, open the constants.py file and add the following code:

from pathlib import Path

from decouple import config
from openai import OpenAI

CLIENT = OpenAI(api_key=str(config("OPENAI_API_KEY")))
DATA_DIRECTORY = Path(__file__).parent / "data"

We import Path from pathlib, the OpenAI class from the openai package, and the config function from the decouple package. We then create a CLIENT constant that will be an instance of the OpenAI class, passing it the API key that the config function reads from the .env file.

 Python Top-tip 
The 'config' function from the 'decouple' package reads the value of the environment variable from the '.env' file and returns it as a string. If you share or upload your code to a public repository, make sure to add the '.env' file to your '.gitignore' file to avoid sharing your API key with others.

We also create a DATA_DIRECTORY constant that will be a Path object pointing to the data directory. Path(__file__) gives us the path of the current file (constants.py), and parent gives us the directory that contains it, which is the root folder. We then use the / operator to add the data directory to the path.
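As a quick illustration of parent and the / operator, here is a sketch with a made-up file location. It uses PurePosixPath so the printed result is the same on every platform; the /home/user/... path is purely hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical location of constants.py, used only for this illustration.
constants_file = PurePosixPath("/home/user/Finx_Fine_Tuning/constants.py")

project_root = constants_file.parent               # folder containing the file
data_directory = project_root / "data"             # "/" joins path segments
input_file = data_directory / "Finx_dataset.json"  # and can be chained

print(project_root)
print(data_directory)
print(input_file.name)
```

Because the path is built relative to the file itself, it keeps working no matter which folder you run the script from.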

The Data Generator

Go ahead and save and close the constants.py file. Then create a new file in the root directory called chris_gpt_dataset_generator.py:

Finx_Fine_Tuning
    data
        Finx_dataset.json
    utils
        html_email.py
    .env
    constants.py
    chris_gpt_dataset_generator.py        (new file)
    Pipfile
    Pipfile.lock

Now let’s install two libraries before we dive in here. We’ll combine them into a single command this time:

pipenv install tqdm==4.65.0 && pipenv install tenacity==8.2.3

The tqdm package will be used to easily add a progress bar to our generator, and the tenacity package will be used to retry the API requests if they fail for some reason. Both are extremely simple in their usage, as you will see.

Ok now open the chris_gpt_dataset_generator.py file and we will start with our imports for the file:

import json

from tenacity import retry, stop_after_attempt, wait_fixed
from tqdm import tqdm

from constants import CLIENT, DATA_DIRECTORY
from utils import html_email

We use json to load and save back the JSON data, tenacity and tqdm for the reasons we just mentioned, and then we import the CLIENT and DATA_DIRECTORY constants from the constants module, and our html_email utility we just made. Now let’s set up some base constants for use in our generator only:

INPUT_DATA = DATA_DIRECTORY / "Finx_dataset.json"
OUTPUT_DATA = DATA_DIRECTORY / "Finx_completed_dataset.json"
MODEL: str = "gpt-3.5-turbo-0125"
TOTAL_TOKENS_USED = 0
RATE_ERROR_MESSAGE = "There was an error calling 'get_user_query'. Perhaps the OpenAI ChatGPT rate limit has been reached. Retrying one more time in 60 seconds to reset the rate limiter..."

We define INPUT_DATA and OUTPUT_DATA constants that will be Path objects pointing to the Finx_dataset.json and Finx_completed_dataset.json files, respectively. The output one doesn’t exist yet obviously, but that is fine. The MODEL constant is just a string with the model name we use.

 OpenAI Top-tip 
There are many versions of the gpt-3.5-turbo model:

 gpt-3.5-turbo-0613. This one is the older version.
 gpt-3.5-turbo-1106. Has much improved functionality and a larger context window compared to the 0613 model. This is the newest version that can be used for fine-tuning purposes.
 gpt-3.5-turbo-0125. The latest minor update to the 1106 model, this one cannot be used for fine-tuning yet. Aside from the minor improvements, it also has a reduced price.

This is why we will use the `0125` version for the data generation in this part of the tutorial, but later when we do the fine-tuning we will be using the `1106` version instead.

We also define a TOTAL_TOKENS_USED variable (written in capitals like the constants, although its value will change as the script runs) that will be used to keep track of the total number of tokens used in the API requests, and a RATE_ERROR_MESSAGE that we reference for an error message later. As all of the above will only be used for this script we defined them all inside the chris_gpt_dataset_generator.py file.

The data structure

Now we’ll define a class that will serve as the data structure “blueprint” if you will, for each example in our dataset. This is exactly the same structure we used in the Harry Potter magical spells example in part 1:

class TrainingDataEntry:
    def __init__(self, fictional_user_query, markdown_email) -> None:
        self.data = {
            "messages": [
                {
                    "role": "system",
                    "content": 'You are a helpful assistant that writes emails for the Finxter email newsletter, adhering perfectly to the style and tone of the Finxter brand and Chris\' writing style. You will respond in the following format: {"subject": "The subject of the email", "body": "The body of the email in Markdown formatting"}.',
                },
                {"role": "user", "content": fictional_user_query},
                {"role": "assistant", "content": markdown_email},
            ]
        }

We have a class named TrainingDataEntry, which has an __init__ method that takes two arguments: fictional_user_query and markdown_email. The __init__ method initializes the data attribute of the TrainingDataEntry object with a dictionary that has a messages key, which is a list of dictionaries.

The first dictionary in the list has a role key with the value "system" and a content key with a long string that explains the role of the assistant. This is going to be the same for all examples of the training data so I just hard-coded it in the class.

The second dictionary in the list has a role key with the value "user" and a content key with the value of the fictional_user_query argument. The third dictionary in the list has a role key with the value "assistant" and a content key with the value of the markdown_email argument.

So the system message is the same for each entry and the assistant message is simply the email that we already have saved in our dataset and can convert using the HTML to Markdown converter we built.

The only challenge we have left is the user message, which is supposed to hold the user query that resulted in the generated email. We will have to generate a fictional user query for each “result” email we have in our dataset. This should mostly be a short summary of the email contents for our case, as we want to train ChrisGPT to generate an email if we give it the topics to write on.
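To make the target structure concrete, here is a minimal sketch that builds one entry by hand and checks its shape. The three strings are shortened placeholders, not real data from the dataset.

```python
import json

# Shortened stand-ins for the real values; all three are placeholders.
system_message = "You are a helpful assistant that writes emails..."
user_query = "- Topic one\n- Topic two"
markdown_email = "{'subject': 'Example', 'body': 'Hey {User}, ...'}"

entry = {
    "messages": [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_query},
        {"role": "assistant", "content": markdown_email},
    ]
}

# Each entry serializes to a JSON object with a single "messages" key.
print(json.dumps(entry, indent=4))

roles = [message["role"] for message in entry["messages"]]
print(roles)  # ['system', 'user', 'assistant']
```

The system and assistant parts are already available to us, so the user part is the only one that still has to be generated.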

Getting the user query (that was never made)

So let’s create a function that will take an email as input and generate a fictional user query for it:

def get_fictional_user_query(email: str) -> str:
    global TOTAL_TOKENS_USED
    response = CLIENT.chat.completions.create(
        model=MODEL,
        messages=[
            {
                "role": "system",
                "content": 'You will be provided with an email in the following format:{"subject": "The subject of the email", "body": "The body of the email in Markdown formatting"}. Your task is to go back in time and write a rough draft of the topics the email writer decided to discuss in the email. This will basically be a summary of the email\'s topics formatted in short bullet points, that the author would have used as a basis to then write the real email.',
            },
            {
                "role": "user",
                "content": f"Here is the output LLM generated email: {email}. Bullet point draft of the topics discussed in the email:",
            },
        ],
    )
    if not response.usage or not response.choices[0].message.content:
        raise Exception("Invalid response from OpenAI API")
    TOTAL_TOKENS_USED += response.usage.total_tokens
    return response.choices[0].message.content

We define a function named get_fictional_user_query that takes an email string argument and will return a string. Inside the function, we first reference the TOTAL_TOKENS_USED variable as a global variable, so that we can update its value from inside the function. Then we call the CLIENT.chat.completions.create method with the model and messages arguments.

For the prompt setup message we basically just explain that the LLM will receive an email in the format of a subject and body and that it should write a rough draft of the topics the email writer decided to discuss in the email. This will basically be a summary of the email’s topics formatted in short bullet points. That way we can give our trained model a list of bullet points to write an email for us later on.

The user message is then just the email we received as input, and a short message requesting the bullet points draft of the topics discussed. Note how it is stated as “Bullet point draft of the topics discussed in the email:” ending with a : colon so that the LLM will answer with the bullet points right away without giving some introduction or “Sure, I can help you with that” or something like that.

We then double-check that the response has a usage attribute and that the message content of the first choice is not empty. If either check fails we raise an exception, but if everything is fine we add the total_tokens used in the response to the TOTAL_TOKENS_USED variable and return the message content.

Making it robust

Ok, so that is pretty good so far, but we will be calling this function many times in a row, and saving the data into an object in memory, which we write to a file at the end. If we get an error halfway then the script will crash out and we have to try again from the start. We can also hit an error because of a rate limit, as we will be doing many requests in a row. This is not ideal, so we will use the tenacity package to retry the API requests if they fail for some reason.

We can simply add the @retry decorator to our function, by adding it straight above the function definition like this:

@retry(
    wait=wait_fixed(60),
    stop=stop_after_attempt(2),
    reraise=True,
    before_sleep=lambda _: print(RATE_ERROR_MESSAGE),
)
def get_fictional_user_query(email: str) -> str:
    ...  # function body as before

We use the @retry decorator to specify that we want to retry the function if it fails due to any type of error. As you can see the tenacity library is very easy to read and use. We specify that we want to wait 60 seconds between attempts and that we want to stop after 2 attempts in total, meaning the original call plus one retry.

The 60 seconds are in case we hit a rate limit, as trying again right away might trigger the rate limit again. We also specify that we want to reraise the exception if the function fails after the maximum number of attempts, so we can see what the original error was.
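To get a feel for what these settings do, here is a rough standard-library sketch of the same idea: a fixed wait, a maximum number of attempts, and re-raising the original error at the end. This is not how tenacity is implemented internally, and retry_fixed and flaky_request are made-up names for illustration only.

```python
import time
from functools import wraps


def retry_fixed(max_attempts, wait_seconds):
    """Hypothetical helper: retry a function with a fixed pause between attempts."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the original error
                    print(f"Attempt {attempt} failed, retrying...")
                    time.sleep(wait_seconds)
        return wrapper
    return decorator


calls = []


# A wait of 0 keeps this demo fast; the tutorial script waits 60 seconds.
@retry_fixed(max_attempts=2, wait_seconds=0)
def flaky_request():
    calls.append("call")
    if len(calls) == 1:
        raise RuntimeError("simulated rate limit")
    return "ok"


result = flaky_request()
print(result, len(calls))
```

The first call fails, the second succeeds, and a third failure would never be attempted because the limit is 2.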

Now the last part is a bit trickier. As tenacity will stop the program from crashing out, we won’t see the error and we won’t be aware of what happened. This is why we added a before_sleep function that will print the RATE_ERROR_MESSAGE before each retry. This way we can keep ourselves up to date in the terminal about what is happening.

The problem is that if I just pass in print(RATE_ERROR_MESSAGE) then the print statement will be triggered right away when Python first reads the function definition, because print is being called with the parentheses, and before_sleep would only receive its return value (None). This is not what we want, so we need to wrap it in a lambda function that will make the print call only when the lambda itself is called.

The lambda _: print(RATE_ERROR_MESSAGE) is basically equivalent to:

# Do not put in your code
def before_sleep_function(_):
    print(RATE_ERROR_MESSAGE)

But it’s just much shorter to write. The _ is the name of the input argument, as tenacity will send an input argument to our before_sleep function, but we won’t be using it, so we just name it _ to sort of throw it away.
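The difference between calling a function right away and wrapping the call in a lambda can be seen in a small standalone sketch. The record function is a stand-in for print that remembers its calls, introduced here only so that we can count them.

```python
MESSAGE = "Retrying..."
printed = []


def record(text):
    """Stand-in for print that remembers what was printed."""
    printed.append(text)


# Calling the function immediately: it runs now, and the stored value is
# its return value (None), not something that can be called later.
immediate = record(MESSAGE)
print(immediate, len(printed))  # None 1

# Wrapping the call in a lambda: nothing runs until the lambda is called.
deferred = lambda _: record(MESSAGE)
print(callable(deferred), len(printed))  # True 1

deferred("ignored argument")
print(len(printed))  # 2
```

Only the lambda version hands over something callable, which is what a hook like before_sleep needs.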

Ok so now we have a TrainingDataEntry class to hold the data for each entry and a function to generate the fictional user queries that will automatically retry if there is an error. See how we’re making the problem smaller and smaller?

Putting it all together

All we need to do now is create a TrainingDataEntry object for each entry in the training data, get the fictional user query by calling our function, and then save all the objects to a new JSON file to get our training data. Let’s do that now:

with open(INPUT_DATA, "r", encoding="utf-8") as file:
    input_data = json.load(file)
    output_data = []

We open the INPUT_DATA file in read mode and use the json.load function to load the JSON data from the file into the input_data variable. We then create an empty list and assign it to the output_data variable. So far so good right? Let’s finish it:

with open(INPUT_DATA, "r", encoding="utf-8") as file:
    input_data = json.load(file)
    output_data = []
    for finx_email in tqdm(input_data, desc="Generating training data"):
        finx_email["body"] = html_email.html_to_markdown(finx_email["body"])
        training_data_entry = TrainingDataEntry(
            fictional_user_query=get_fictional_user_query(finx_email),
            markdown_email=str(finx_email),
        )
        output_data.append(training_data_entry.data)

First of all, we open a loop. We’re going to loop over each finx_email in the input_data list. The reason input_data is wrapped inside tqdm() here is that this will allow us to add a progress bar to the loop. The desc argument is used to specify the description of the progress bar for tqdm, which will be “Generating training data”.

Now remember that each finx_email is a dictionary with a subject and body key. We convert the body from HTML to Markdown using our html_to_markdown utility function, and then we create a TrainingDataEntry object for the finx_email. The fictional_user_query is obtained by calling our get_fictional_user_query function with the finx_email as input (the f-string inside that function turns the dictionary into text), and the markdown_email is just the finx_email dictionary converted to a string. The str() conversion here is important as the OpenAI API expects the content of each message to be a string and will return an error if you send it a nested object instead, but if it’s in string format it will work fine.
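One detail worth knowing about str() on a dictionary is that it produces Python-style text rather than JSON. The sketch below compares it with json.dumps, which is shown purely as a point of comparison and is not what the tutorial code uses; the sample email is made up.

```python
import json

email = {"subject": "Example", "body": "It's a test"}

as_repr = str(email)         # Python-style text, as used in the loop above
as_json = json.dumps(email)  # JSON text, an alternative not used here

print(as_repr)
print(as_json)

# Only the JSON version can be parsed back with json.loads.
print(json.loads(as_json) == email)
try:
    json.loads(as_repr)
except json.JSONDecodeError:
    print("str() output is not valid JSON")
```

Both versions are plain strings, so either satisfies the requirement that message content is a string. The difference only matters if you later want to parse the content back into an object.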

We then append the dictionary that the data attribute of the TrainingDataEntry holds to the output_data list. Now when we’ve called this once for every entry in the dataset we must save our data:

with open(OUTPUT_DATA, "w", encoding="utf-8") as file:
    json.dump(output_data, file, indent=4)

print(f"Total tokens used: {TOTAL_TOKENS_USED}")

This opens the OUTPUT_DATA file in write mode and uses the json.dump function to save the output_data list to the file in JSON format. We also specify the indent argument as 4 to make the JSON data more readable. Finally, we added a print statement to see how many tokens we used in the process. I’m not going to calculate the exact costs here as it would add unneeded complexity for a one-time-use script. It will be roughly $0.30 if you make the full 200 calls, possibly a bit less.

Time to sit back!

And that’s it! We now have a data generator. Go ahead and run your chris_gpt_dataset_generator.py file and you should see a progress bar in the terminal as it generates the data. It will take a few minutes to run. You may run into a rate limit, in which case tenacity will trigger for you and pause the process for 60 seconds before continuing. (This is also the reason we’re not using some fancy async API calls here, as we would hit the rate limit anyway and introduce pointless complexity to a script we will rarely run.)

Let it run and grab a coffee or your favorite snack while you watch that glorious progress bar creep up and relax for a moment, you’ve earned it! When it’s done you should have a new Finx_completed_dataset.json file in your data directory.

Generating training data: 100%|███████████████████████████████████████████████| 200/200 [08:56<00:00,  2.68s/it]
Total tokens used: 224800

Open it up and take a look at the data, it should be a list of dictionaries with the same structure as the data attribute of the TrainingDataEntry class:

[
    {
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant that writes emails for the Finxter email newsletter, adhering perfectly to the style and tone of the Finxter brand and Chris' writing style. You will respond in the following format: {\"subject\": \"The subject of the email\", \"body\": \"The body of the email in Markdown formatting\"}."
            },
            {
                "role": "user",
                "content": "- Investment opportunities in OpenAI\n- Five proxies to invest in OpenAI:\n  - MSFT\n  - NVIDIA\n  - ARKVX\n  - META\n  - TSLA\n- Community corner featuring TradeUnafraid\n- Invitation to feature one's startup, YouTube channel, blog, or website\n- Feedback request on the email course\n- Invitation to invite a friend or colleague to the Finxter Email Academy\n- Download link for the Ultimate Python Cheat Sheet\n- Motivational quote on knowledge compounding\n- Contact information for Dr. Christian Mayer\n- Unsubscribe option"
            },
            {
                "role": "assistant",
                "content": "{'subject': '5 Proxies to Investing in OpenAI', 'body': '[![]({Link})]({Link})\\n\\nHey {User},\\n\\nTo profit from change, we need to increase ownership of disruptive trends.\\nToday\\'s article covers a question that many Finxters frequently ask:\\n\\n\ud83e\udeb4 [ **Blog** ] [How to Invest in OpenAI?]({Link}) \ud83c\udf33\\n\\nWhile it\\'s not possible to invest in OpenAI directly, the blog discusses five\\nalternatives:\\n\\n  * **MSFT** (49% stake in OpenAI),\\n  * **NVIDIA** (makes more revenue from OpenAI than any other company),\\n  * **ARKVX** ( _Anthropic!_ ),\\n  * **META** ( _Llama 2!_ ), and\\n  * **TSLA** (Optimus!).\\n\\nCheck it out if you\\'re interested in any of those! No financial advice. \ud83d\ude0a\\n\\nBe on the right side of change. \ud83d\ude80  \\nChris\\n\\n**\u2665\ufe0f Community Corner: Featured Resources**\\n\\n  * [TradeUnafraid]({Link}) is a trading platform owned and operated by Finxter community member Lee.\\n\\nDo you want to feature your own startup, YouTube channel, blog, or website as\\na [Finxter premium member]({Link})? Hit reply and let me know!\\n\\n**[How are we doing?]({Link})**  \\n[\u2b50]({Link})  \\n[\u2b50\u2b50]({Link})  \\n[\u2b50\u2b50\u2b50]({Link})  \\n  \\nTo make sure you keep getting these emails, please add _chris@finxter.com_ to\\nyour address book.  \\n  \\nI\\'d love to hear your feedback so that I can improve this free email course\\nover time. Please reply to this email and share everything on your mind!  \\n  \\n**If you find the Finxter Email Academy useful, please invite a friend or\\ncolleague! \u2764**\\n\\n  \\nHere\\'s the subscription link you can share:  \\n[https://blog.finxter.com/subscribe/]({Link})  \\n  \\nDownload the Ultimate Python Cheat Sheet here (direct PDF download): \ud83d\udc0d\\n\\n**[The Ultimate Python Cheat Sheet]({Link})**  \\n  \\nNot very motivated to learn today? 
Consider this:  \\n**_\"Knowledge compounds!\"_** \\\\-- Warren Buffett  \\n  \\nConsequently, if you improve your skills by 1% every day, you\\'ll 36x your\\nprogramming skills within a year!\\n\\n  \\n_**Finxter, Dr. Christian Mayer**  \\n{Address}., {City}, {Country}_\\n\\nWant out of the loop? I\\'m so sad to see you go. \ud83d\ude22 How could we have done\\nbetter?  \\nTo help future Finxters, please hit reply and tell us! \ud83e\udd17\\n\\n[Unsubscribe here]({Link})\\n\\n![]({Link})\\n\\n'}"
            }
        ]
    },
    ... many more entries ...

Perfect, this has all the data we need to train our ChrisGPT model. We have the system message that is the same for all entries, the user message that is a fictional user query for the email, and the assistant’s response message that is the email itself. That’s it for part 2 of this tutorial. In part 3 we will validate the data we just generated and calculate what the training will cost before we fine-tune our ChrisGPT model. See you there!

Part 3: Data Validation and Training Cost

Welcome back to part 3! This is where we’re going to do the last preparation and data validation steps on our dataset and also calculate how much it’s going to cost us to train the model.

JSONL format

Remember in part 1 where we discussed the training data? We discussed the data needing to be in JSONL format. Well, it’s time to come back to that now. So what is JSONL format?

JSONL, or JSON Lines, is a convenient format for storing structured data that may be processed one record at a time. Each line in a JSONL file is a valid JSON object. This is different from a regular JSON file, where the entire file is a single JSON object or array.

Each line is a separate, independent JSON object. This means that a large file can be read into memory one line at a time, instead of needing to read the entire data into memory at once, which can be a significant advantage when working with very large datasets. This makes it very useful for streaming JSON data object by object through another process like training an LLM model!

So say we have an object that looks like this:

[
  {
    "employee": {
      "name": "John Doe",
      "age": 30,
      "department": "Sales",
      "address": {
        "street": "123 Main St",
        "city": "Springfield",
        "state": "IL",
        "zip": "62701"
      }
    }
  },
  {
    "employee": {
      "name": "Jane Smith",
      "age": 28,
      "department": "Marketing",
      "address": {
        "street": "456 Elm St",
        "city": "Springfield",
        "state": "IL",
        "zip": "62701"
      }
    }
  },
  {
    "employee": {
      "name": "Joe Schmoe",
      "age": 35,
      "department": "Engineering",
      "address": {
        "street": "789 Oak St",
        "city": "Springfield",
        "state": "IL",
        "zip": "62701"
      }
    }
  }
]

Then the JSONL version is essentially just a flattened-down version of this, with each object on a single line. Note that we can remove the brackets and the commas between different objects, as it is a given that each line contains one JSON object in this format:

{"employee": {"name": "John Doe", "age": 30, "department": "Sales", "address": {"street": "123 Main St", "city": "Springfield", "state": "IL", "zip": "62701"}}}
{"employee": {"name": "Jane Smith", "age": 28, "department": "Marketing", "address": {"street": "456 Elm St", "city": "Springfield", "state": "IL", "zip": "62701"}}}
{"employee": {"name": "Joe Schmoe", "age": 35, "department": "Engineering", "address": {"street": "789 Oak St", "city": "Springfield", "state": "IL", "zip": "62701"}}}

You will probably see the objects wrap around, but this is only a visual thing. In the actual file, each object is on a single line.
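To make the "one record at a time" idea concrete, here is a minimal sketch of a JSONL reader. The helper name iter_jsonl and the sample records are made up for illustration, and an in-memory StringIO stands in for a real file so the snippet runs anywhere:

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed object per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines, such as a trailing empty line
            yield json.loads(line)


# In-memory stand-in for a JSONL file (made-up sample data).
sample = io.StringIO(
    '{"employee": {"name": "John Doe", "age": 30}}\n'
    '{"employee": {"name": "Jane Smith", "age": 28}}\n'
)

# Only one line is parsed at a time; the full file is never held as one object.
names = [record["employee"]["name"] for record in iter_jsonl(sample)]
print(names)  # ['John Doe', 'Jane Smith']
```

Because iter_jsonl is a generator, the same loop would work on a file of any size opened with open(path, encoding="utf-8").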

Flattening our dataset into a JSONL file

So let’s create a utility function to flatten our dataset into a JSONL file. In your existing utils folder, make a new file called jsonl.py:

Finx_Fine_Tuning
    data
        Finx_completed_dataset.json
        Finx_dataset.json
    utils
        html_email.py
        jsonl.py          (new file)
    .env
    constants.py
    chris_gpt_dataset_generator.py
    Pipfile
    Pipfile.lock

In jsonl.py, add the following imports to get started:

import json
from pathlib import Path
from typing import Iterable

We import the json module to read and save JSON data. We import Path and Iterable only to use them as type hints, to make sure our code is as clear and readable as possible. First, let’s make the problem smaller by creating a function that takes a list or iterable of dictionaries, and converts them into a JSONL file. Add the following function to jsonl.py:

def dicts_to_jsonl(output_file: Path, data: Iterable[dict]) -> Path:
    with open(output_file, "w", encoding="utf-8") as file:
        for dict_obj in data:
            json_string = json.dumps(dict_obj)
            file.write(json_string + "\n")
    return output_file

This function takes two arguments: output_file is the path to the file we want to write, and data is an iterable of dictionaries. We open the file in write mode, and then loop through each dictionary in the iterable. We convert each dictionary to a JSON string using json.dumps, and then write it to the file. We add a newline character at the end of each line to separate the JSON objects. Finally, we return the path to the file as a Path object.

Ok, that handles directly converting a list of dictionaries stored in memory, now let’s add a second function below that will handle converting an existing JSON file into a JSONL file. Add the following function to jsonl.py:

def json_to_jsonl(input_file: Path, output_file: Path) -> Path:
    with open(input_file, "r", encoding="utf-8") as in_file:
        data = json.load(in_file)

    return dicts_to_jsonl(output_file, data)

This function takes two arguments: input_file is the path to the JSON file we want to read, and output_file is the path to the JSONL file we want to write. We open the input file in read mode, and then load the JSON data into memory using json.load. We then call the dicts_to_jsonl function we defined earlier to write the data to the output file.

Using this composition, we now have two functions, one for converting dictionaries and another for converting an existing JSON file, yet we did not duplicate any code. Go ahead and save and close jsonl.py.
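If you want to convince yourself that the two functions work before pointing them at the real dataset, a quick round trip on throwaway data is enough. The sketch below repeats both functions so it can run on its own; the sample records and file names are made up, and the files live in a temporary directory that is deleted afterwards:

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable


def dicts_to_jsonl(output_file: Path, data: Iterable[dict]) -> Path:
    with open(output_file, "w", encoding="utf-8") as file:
        for dict_obj in data:
            file.write(json.dumps(dict_obj) + "\n")
    return output_file


def json_to_jsonl(input_file: Path, output_file: Path) -> Path:
    with open(input_file, "r", encoding="utf-8") as in_file:
        data = json.load(in_file)
    return dicts_to_jsonl(output_file, data)


# Made-up sample records, only for this test run.
records = [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]

with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "sample.json"
    json_path.write_text(json.dumps(records, indent=2), encoding="utf-8")

    jsonl_path = json_to_jsonl(json_path, Path(tmp) / "sample.jsonl")
    lines = jsonl_path.read_text(encoding="utf-8").splitlines()

# Parsing every line again should give back the original list.
round_trip = [json.loads(line) for line in lines]
print(len(lines), round_trip == records)  # 2 True
```

Two records go in, two lines come out, and parsing those lines returns the same data we started with.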

Validating our dataset

Before we train our model, we need to make sure our dataset is in the right format and we’ll also check how much this is going to cost, and make sure none of the entries exceed the token limit. This may all seem a bit overkill, but you really don’t want to start training a model and have it fail halfway due to sloppy data or a single entry that is too long. It’s also considerably more expensive than other ways of using ChatGPT because we’re creating a whole custom model, so it’s nice to know ahead of time exactly how much money you’re going to spend.

We’re writing most of these specific things in utility functions in separate files, so you can reuse all of these for your future fine-tuning projects. We’ll do the same for the validation and price-calculator logic. In your existing utils folder, make a new file called data_validation.py:

Finx_Fine_Tuning
    data
        Finx_completed_dataset.json
        Finx_dataset.json
    utils
        data_validation.py          (new file)
        html_email.py
        jsonl.py
    .env
    constants.py
    chris_gpt_dataset_generator.py
    Pipfile
    Pipfile.lock

Time to install the tiktoken library before we start writing the code. Open your terminal and run the following command:

pipenv install tiktoken==0.6.0

The tiktoken library is a Python package developed by OpenAI. We’ll use it to count the number of tokens in a text string without making any API calls.

In data_validation.py, get started by adding the following imports:

import json
from decimal import Decimal
from pathlib import Path

import tiktoken

Most of these are familiar by now, but we also import Decimal from the decimal module. We’ll use this for the cost calculations, as it represents decimal amounts exactly and spares us the small rounding errors that floating point numbers introduce.
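As a quick illustration of the rounding issue, the classic example is adding 0.1 and 0.2. This is a small standalone sketch and not part of data_validation.py:

```python
from decimal import Decimal

float_sum = 0.1 + 0.2
decimal_sum = Decimal("0.1") + Decimal("0.2")

print(float_sum == 0.3)               # False, the float result is 0.30000000000000004
print(decimal_sum == Decimal("0.3"))  # True

# Build Decimals from strings: a Decimal built from a float inherits the float's error.
print(Decimal(0.1) == Decimal("0.1"))  # False
```

This is also why the cost constant we define next is created from a string rather than from a float.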

Now define a constant that will be used for our calculations:

TRAINING_COST_PER_1000_TOKENS = Decimal("0.0080")

This is the cost per 1000 tokens for training data at the time of writing, but it may have changed if you’re reading this tutorial in the future. You can check the current cost on the OpenAI pricing page and adjust this number accordingly.

Creating the Validator class

Now let’s create our Validator. As we’ll have a lot of related functions, let’s use a class to group them together and start with the __init__ method:

class Validator:
    def __init__(self, jsonl_file: Path) -> None:
        self.data = self._load_data(jsonl_file)
        self._token_list = None
        self.encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

So the __init__ method will get called when we instantiate a new instance of this class, and it will take a Path argument to the JSONL file we want to validate. We’ll load the data from the file and store it in the data attribute using the _load_data method we’ll define next.

We’ll also initialize the _token_list attribute to None for now, and we’ll use it to store the token count for each entry in the dataset. Finally we store the encoding for the model we’re going to use in the encoding attribute. As the tiktoken library was also made by OpenAI, it has a handy method to let us load up the proper encoding for the model we’re going to use.

Now let’s add the _load_data method. As our data file is not that massive, we’ll just load up the whole file at once and not worry about loading the JSONL one line at a time:

class Validator:
    def __init__():
        ...

    def _load_data(self, jsonl_file: Path) -> list:
        with open(jsonl_file, "r", encoding="utf-8") as file:
            data = [json.loads(line) for line in file]
        return data

No big surprises here, we take the path as input and return a list. The only difference is that, since the data is in JSONL format, we use a list comprehension. For each line in the file, we call json.loads to convert the JSON string to a Python dictionary, which will then become an element in the list saved as the variable data.

Now let’s add a method to calculate the token count for each entry in the dataset:

class Validator:
    def __init__():
        ...

    def _load_data():
        ...

    def _calculate_token_amount_per_entry(self) -> list[int]:
        extra_tokens_per_message = 2
        token_list = []
        for training_data_object in self.data:
            num_tokens = 0
            for message in training_data_object["messages"]:
                for _, value in message.items():
                    num_tokens += len(self.encoding.encode(str(value)))
                    num_tokens += extra_tokens_per_message
            token_list.append(num_tokens)
        return token_list

This method will return the approximate number of tokens per entry as a list of integers. We start by defining a variable extra_tokens_per_message and set it to 2. This is the current number of extra tokens we need to add to account for the object structure besides just the strings themselves to come to an accurate number. We then loop through each training_data_object in the dataset and set a counter num_tokens to 0.

As this is ChatCompletion data, we know that the messages are stored in a list under the key “messages”. We loop through each message and then through each key-value pair in the message. (We use an _ for the key because we don’t need it in this case, but we need to use it as a placeholder to unpack the tuple.)

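We call self.encoding.encode to encode the value to a list of tokens, and then add the length of this list to num_tokens, as it’s only the len or length that we are interested in. We then add the extra_tokens_per_message to account for the object structure as discussed, as this also takes up tokens. Note that this addition sits inside the inner loop, so the 2 extra tokens are added once for every key-value pair, which for a message with a role and a content comes to 4 extra tokens per message.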

After all the key-value pairs inside each index of the messages inside a training_data_object are processed, we append the num_tokens to the token_list and then move on to the next training_data_object in the list.
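To see the arithmetic without installing anything, here is a sketch of the same loop with a stand-in tokenizer. Be aware that whitespace_encode is a hypothetical placeholder that simply splits on spaces; it is not how tiktoken tokenizes, so the real counts will differ. The function name count_entry_tokens and the sample entry are also made up for this example:

```python
from typing import Callable


def count_entry_tokens(
    entry: dict, encode: Callable[[str], list], extra_tokens: int = 2
) -> int:
    """Same counting logic as the method above, for a single entry."""
    num_tokens = 0
    for message in entry["messages"]:
        for _, value in message.items():
            num_tokens += len(encode(str(value)))
            num_tokens += extra_tokens  # added once per key-value pair
    return num_tokens


def whitespace_encode(text: str) -> list:
    # Hypothetical stand-in: one "token" per whitespace-separated word.
    return text.split()


entry = {
    "messages": [
        {"role": "system", "content": "You are helpful"},
        {"role": "user", "content": "Hi there"},
    ]
}

total = count_entry_tokens(entry, whitespace_encode)
print(total)  # 15
```

The 15 breaks down into 7 word "tokens" (1 + 3 + 1 + 2) plus 2 extra tokens for each of the 4 values.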

Now let’s add a function to check if our formatting has any mistakes in it:

class Validator:
    def __init__():
        ...

    def _load_data():
        ...

    def _calculate_token_amount_per_entry():
        ...

    def _check_single_entry_format(self, entry) -> bool:
        if not isinstance(entry, dict):
            return False

        if list(entry.keys()) != ["messages"]:
            return False

        messages = entry.get("messages", [])

        return all(
            isinstance(message, dict) and "role" in message and "content" in message
            for message in messages
        )

This function will return True if the entry is in the correct format, and False if it’s not. It will check a single entry, or training_data_object, in our dataset at a time. First, it will check if the entry is a dictionary. After that, we call keys() on the entry to get the dictionary keys and call list() on it to convert it to a list. We then check if the list is equal to ["messages"], to make sure it has one key and only one, and that key is “messages”.

We then call the get() method on the entry to get the value of the “messages” key. Now the last line uses a generator expression and might look confusing if you’re not familiar with it, so let’s break it down step by step.

A generator expression is similar to a list comprehension, but it doesn’t store the list in memory. Instead, it generates each value on the fly as you iterate over it. This can be more memory-efficient than a list comprehension for large sequences, though it doesn’t matter much for our dataset size here. Stripped down to just its looping part, the generator expression in the code would be:

(message for message in messages)

This generates a sequence of message values, one for each message in messages. In our code, though, the value produced for each message is not the message itself but the result of a condition evaluated on it.

The isinstance(message, dict) and "role" in message and "content" in message part is that condition. It checks whether each message is a dictionary and whether it contains the keys role and content, so the generator yields one True or False per message.

The all() function takes an iterable (in this case, the generator expression) and returns True if all elements of the iterable are truthy (i.e., they evaluate to True), and False if even a single entry is not True. So, in simple terms, we check whether all messages in the messages list are dictionaries that contain the keys role and content, and return either True or False.
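Here is a small standalone sketch of all() working together with a generator expression. The helper names is_valid_message and tracked and the sample messages are made up; tracked only exists to record which messages actually got checked:

```python
def is_valid_message(message) -> bool:
    return isinstance(message, dict) and "role" in message and "content" in message


good = [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}]
bad = [{"role": "user", "content": "hi"}, {"role": "assistant"}, "not a dict"]

checked = []


def tracked(message) -> bool:
    # Records every message that gets evaluated, to show where all() stops.
    checked.append(message)
    return is_valid_message(message)


good_result = all(is_valid_message(m) for m in good)
bad_result = all(tracked(m) for m in bad)
empty_result = all(is_valid_message(m) for m in [])

print(good_result)   # True
print(bad_result)    # False
print(len(checked))  # 2, all() stopped at the first failing message
print(empty_result)  # True, all() over an empty sequence is True
```

The last line is worth remembering: an entry whose messages list is empty would pass this kind of check, so if that matters for your data you may want an extra test for an empty list.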

Now, let’s add a property to get the token_list, so we can easily access it:

class Validator:
    def __init__():
        ...

    def _load_data():
        ...

    def _calculate_token_amount_per_entry():
        ...

    def _check_single_entry_format():
        ...

    @property
    def token_list(self) -> list[int]:
        if self._token_list is None:
            self._token_list = self._calculate_token_amount_per_entry()
        return self._token_list

The @property decorator here means that we can access this particular method as a property, so using self.token_list instead of calling it as a method with self.token_list(). First, this will check if self._token_list is None, which it will be the first time we access it. If it is, it will call the _calculate_token_amount_per_entry method to calculate the token list and store it in the self._token_list attribute. Then it will return the _token_list attribute. If the property is accessed again, it will just return the _token_list attribute without recalculating it, as it’s already been calculated and stored.

Note that the methods with the _ prefix are meant to be private, so the _token_list is our implementation detail here, and the token_list property is the public interface to access it. This is a good practice because it ensures that _token_list is always in a valid state when it’s accessed, and it hides the details of how _token_list is implemented and managed from the rest of your program by providing token_list as an access point.
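The caching behavior is easy to verify with a toy class. TokenStats below is a made-up example and not part of our Validator; it counts words instead of tokens and keeps a counter so we can see how often the expensive calculation actually runs:

```python
class TokenStats:
    def __init__(self, texts: list[str]) -> None:
        self.texts = texts
        self._lengths = None
        self.compute_calls = 0  # only here to show how often we recalculate

    def _compute_lengths(self) -> list[int]:
        self.compute_calls += 1
        return [len(text.split()) for text in self.texts]

    @property
    def lengths(self) -> list[int]:
        if self._lengths is None:
            self._lengths = self._compute_lengths()
        return self._lengths


stats = TokenStats(["one two three", "four five"])
first = stats.lengths   # triggers the calculation
second = stats.lengths  # served from the cached attribute

print(first, stats.compute_calls)  # [3, 2] 1
```

Python’s standard library also offers functools.cached_property, which packages this same pattern into a single decorator, but writing it out by hand as we do here makes it clearer what is going on.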

Now let’s add a method to check if the dataset is valid:

class Validator:
    def __init__():
        ...

    def _load_data():
        ...

    def _calculate_token_amount_per_entry():
        ...

    def _check_single_entry_format():
        ...

    @property
    def token_list():
        ...

    def validate_data(self) -> bool:
        return all(self._check_single_entry_format(entry) for entry in self.data)

This method will return True if all entries in the dataset are in the correct format, and False if any of them are not. It uses a generator expression in the same style as we did before. Note that it will stop checking as soon as it finds an entry that fails the _check_single_entry_format test, because all stops iterating as soon as it encounters a False value.

Now let’s add a method to get the training cost in dollars:

class Validator:
    def __init__():
        ...

    def _load_data():
        ...

    def _calculate_token_amount_per_entry():
        ...

    def _check_single_entry_format():
        ...

    @property
    def token_list():
        ...

    def validate_data():
        ...

    def get_training_cost_in_dollars(self, epochs: int = 3) -> Decimal:
        total_tokens = sum(self.token_list)
        total_cost_dollars = (
            TRAINING_COST_PER_1000_TOKENS * total_tokens / 1000 * epochs
        )
        print(
            f"Total estimated cost: ~${total_cost_dollars:.3f} for training {epochs} epochs on {total_tokens} token dataset."
        )
        return total_cost_dollars

Machine-learning Top-tip
Epochs are the number of times the model will go through the entire dataset during training. The more epochs, the more the model will learn and internalize our dataset. If the number is too low, it will not fully internalize our training data, but if the number is too high it will internalize our specific examples too much and lose its ability to generalize, a concept called overfitting. Three epochs is a good starting point for most fine-tuning tasks.

This method will return the total cost in dollars for training the model for a given number of epochs as a Decimal type object. It uses the sum function to calculate the total number of tokens in the dataset and then does simple math to get the total cost in dollars. We print the total cost with an accuracy of 3 decimal places by using the :.3f format specifier in the f-string and then return the total cost.
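As a worked example of that math, here is the same formula as a standalone function. The name training_cost is made up for this sketch, the price is the constant we defined earlier, and 216,000 is simply a round token count picked for the example:

```python
from decimal import Decimal

TRAINING_COST_PER_1000_TOKENS = Decimal("0.0080")


def training_cost(total_tokens: int, epochs: int = 3) -> Decimal:
    # price per 1000 tokens * (tokens / 1000) * number of passes over the data
    return TRAINING_COST_PER_1000_TOKENS * total_tokens / 1000 * epochs


cost = training_cost(216_000)
print(f"${cost:.3f}")  # $5.184

one_epoch = training_cost(216_000, epochs=1)
print(f"${one_epoch:.3f}")  # $1.728
```

So a 216,000 token dataset costs $1.728 per epoch at this price, and three epochs triple that to $5.184.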

One last method and we’ll be done, I promise! We want to be able to make sure the longest entry is not above our token limit:

class Validator:
    def __init__():
        ...

    def _load_data():
        ...

    def _calculate_token_amount_per_entry():
        ...

    def _check_single_entry_format():
        ...

    @property
    def token_list():
        ...

    def validate_data():
        ...

    def get_training_cost_in_dollars():
        ...

    def longest_entry_token_count(self) -> int:
        return max(self.token_list)

We use the max function to get the maximum value from the token_list and return it. The token limit for a single training example, so for every line in our JSONL file, is the same as the context limit of the ChatGPT model we’re using. For gpt-3.5-turbo-1106, the maximum context length is 16,385 tokens, so as long as the longest entry stays below that, you’re safe.
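If the longest entry ever does exceed the limit, you will want to know which entries are the culprits. The helper below is an optional sketch and not part of the Validator class; entries_over_limit and the token counts are made up, and the limit is the 16,385 figure mentioned above:

```python
MAX_TOKENS_PER_EXAMPLE = 16_385  # context length quoted above for gpt-3.5-turbo-1106


def entries_over_limit(
    token_list: list[int], limit: int = MAX_TOKENS_PER_EXAMPLE
) -> list[int]:
    """Return the indexes of all entries whose token count exceeds the limit."""
    return [index for index, count in enumerate(token_list) if count > limit]


token_counts = [1200, 2441, 17000, 950]  # made-up counts for illustration
too_long = entries_over_limit(token_counts)

print(max(token_counts))  # 17000
print(too_long)           # [2]
```

The indexes line up with the line numbers in the JSONL file (starting from 0), so you can find and shorten or remove the offending examples.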

Here is the whole class again for reference:

class Validator:
    def __init__(self, jsonl_file: Path) -> None:
        self.data = self._load_data(jsonl_file)
        self._token_list = None
        self.encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

    def _load_data(self, jsonl_file: Path) -> list:
        with open(jsonl_file, "r", encoding="utf-8") as file:
            data = [json.loads(line) for line in file]
        return data

    def _calculate_token_amount_per_entry(self) -> list[int]:
        extra_tokens_per_message = 2
        token_list = []
        for training_data_object in self.data:
            num_tokens = 0
            for message in training_data_object["messages"]:
                for _, value in message.items():
                    num_tokens += len(self.encoding.encode(str(value)))
                    num_tokens += extra_tokens_per_message
            token_list.append(num_tokens)
        return token_list

    def _check_single_entry_format(self, entry) -> bool:
        if not isinstance(entry, dict):
            return False

        if list(entry.keys()) != ["messages"]:
            return False

        messages = entry.get("messages", [])

        return all(
            isinstance(message, dict) and "role" in message and "content" in message
            for message in messages
        )

    @property
    def token_list(self) -> list[int]:
        if self._token_list is None:
            self._token_list = self._calculate_token_amount_per_entry()
        return self._token_list

    def validate_data(self) -> bool:
        return all(self._check_single_entry_format(entry) for entry in self.data)

    def get_training_cost_in_dollars(self, epochs: int = 3) -> Decimal:
        total_tokens = sum(self.token_list)
        total_cost_dollars = (
            TRAINING_COST_PER_1000_TOKENS * total_tokens / 1000 * epochs
        )
        print(
            f"Total estimated cost: ~${total_cost_dollars:.3f} for training {epochs} epochs on {total_tokens} token dataset."
        )
        return total_cost_dollars

    def longest_entry_token_count(self) -> int:
        return max(self.token_list)

Using the Validator

So give yourself a pat on the back for that. Now let’s put our Validator to work on the ChrisGPT dataset! Save and close this file, then create a new file in your root directory named chris_gpt_dataset_validation.py:

Finx_Fine_Tuning
    data
        Finx_completed_dataset.json
        Finx_dataset.json
    utils
        data_validation.py
        html_email.py
        jsonl.py
    .env
    constants.py
    chris_gpt_dataset_generator.py
    chris_gpt_dataset_validation.py          (new file)
    Pipfile
    Pipfile.lock

In chris_gpt_dataset_validation.py, add the following setup to get started:

from utils import data_validation, jsonl
from constants import DATA_DIRECTORY


JSON_FILE = DATA_DIRECTORY / "Finx_completed_dataset.json"
JSONL_FILE = DATA_DIRECTORY / "Finx_completed_dataset.jsonl"

We import all the stuff we made and prepared ourselves, and then we define the paths to the existing JSON file and the JSONL file we want to create. Now let’s make some good use of all the hard work we’ve done so far:

jsonl.json_to_jsonl(JSON_FILE, JSONL_FILE)  # Only run once

data_validator = data_validation.Validator(JSONL_FILE)

print(f"Data valid: {data_validator.validate_data()}")
data_validator.get_training_cost_in_dollars()
print(f"Longest entry: {data_validator.longest_entry_token_count()} tokens")

We convert our JSON file to a JSONL file with the same name. It says “Only run once” so you can comment out the code after we run the file the first time. Nothing bad will happen if you don’t though, it just does some unneeded calculations to make the same file again.

Then we create a new instance of our Validator class and pass the path to the JSONL file as an argument. We call the validate_data method to check if the dataset is valid and print the result. We then call the get_training_cost_in_dollars method to get the estimated training cost, which will get printed to the console automatically, and finally, we call the longest_entry_token_count method to get the token count of the longest entry in the dataset so we can make sure we don’t exceed the token limit.

Let’s run the file we have so far just as an interim test. You should get an output in your terminal that looks something like this:

Data valid: True
Total estimated cost: ~$5.184 for training 3 epochs on 216000 token dataset.
Longest entry: 2441 tokens

Your numbers will be slightly different from mine, as the data is partly LLM generated, but it will be very close to this. We can see our data is valid, we have over 200,000 tokens in total, and the longest entry is 2441 tokens, which is well below the 16,385 token limit for the gpt-3.5-turbo-1106 model.

You’ll also notice that a JSONL file has been created in your data directory with the training data in JSONL format:

Finx_Fine_Tuning
    data
        Finx_completed_dataset.json
        Finx_completed_dataset.jsonl 
        Finx_dataset.json
    ...

Now you might be surprised by the cost here. While $5 is not a massive amount of money, it is a whole lot more than we typically consume when making regular ChatGPT calls. This is the reason we took so much time on the data validation, to make sure we get the data right the first time, and to know the exact cost before we commit to the training.

For those $5 you get something pretty damn cool though, your own custom ChatGPT. That being said, I understand if you’re not willing to spend $5 on this simple test project. You can run with half the training data, which is 100 examples, or even a quarter, which is 50 examples. But your output will not be as good as mine if you do so.

Limiting the dataset size

Let’s make some small changes to the code so you can limit your dataset size if you want to:

import json

from constants import DATA_DIRECTORY
from utils import data_validation, jsonl


JSON_FILE = DATA_DIRECTORY / "Finx_completed_dataset.json"
JSONL_FILE = DATA_DIRECTORY / "Finx_completed_dataset.jsonl"
LIMIT = 100


with open(JSON_FILE, "r", encoding="utf-8") as in_file:
    data = json.load(in_file)
    jsonl.dicts_to_jsonl(JSONL_FILE, data[:LIMIT])

data_validator = data_validation.Validator(JSONL_FILE)

print(f"Data valid: {data_validator.validate_data()}")
data_validator.get_training_cost_in_dollars()
print(f"Longest entry: {data_validator.longest_entry_token_count()} tokens")

We added an import for json, and we set a constant named LIMIT. We then simply manually load the data from the JSON_FILE and use the dicts_to_jsonl function instead of the json_to_jsonl function, passing in only the first LIMIT number of examples using a simple slice. Note how easy this is as we made the jsonl utility module out of pieces so we can simply use a different piece this time.

I’m going to set the LIMIT variable to None as I want to use the full 200 examples for mine. Choose whatever number you want to use for the LIMIT, and then run the file again. It will create the new JSONL_FILE with the limited number of examples, and then validate and tell you the new cost. Limiting to 100 examples will cost you around $2.55.
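In case you are wondering why setting LIMIT to None works: slicing a list with None as the end index just means "no end index", so you get the whole list back. A tiny sketch with a dummy list of 200 items:

```python
data = list(range(200))  # dummy stand-in for our 200 training examples

sizes = {}
for limit in (100, 50, None, 1000):
    sizes[limit] = len(data[:limit])

print(sizes)  # {100: 100, 50: 50, None: 200, 1000: 200}
```

A limit larger than the dataset is harmless too, as the slice simply stops at the end of the list.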

Now that we know the cost, and we know our data is valid, we can move on to the next part where we’ll actually train our model on the JSONL data. I’ll see you there!

Part 4: Training and Running ChrisGPT

Hi and welcome back to part 4, where we’ll be training and running ChrisGPT. In this part, we’ll finally be using the OpenAI fine-tuning API endpoints, which are fairly simple!

There are two ways to use the fine-tuning API, both of which are very simple. The first way to do this is programmatically, using Python code just like we do when sending normal calls to ChatGPT. We’ll be looking at this first. The second way is to use the web interface for the fine-tuning API.

Using the fine-tuning API programmatically

Go ahead and create a new file called chris_gpt_training.py in the root directory of your project:

Finx_Fine_Tuning
    data
        ...
    utils
        ...
    .env
    chris_gpt_dataset_generator.py
    chris_gpt_dataset_validation.py
    chris_gpt_training.py 
    constants.py
    Pipfile
    Pipfile.lock

So let’s start with our imports and basic setup:

from constants import CLIENT, DATA_DIRECTORY


JSONL_FILE = DATA_DIRECTORY / "Finx_completed_dataset.jsonl"
MODEL = "gpt-3.5-turbo-1106"
SUFFIX = "chris_gpt"

We import the OpenAI Client we stored in CLIENT and DATA_DIRECTORY. Then we quickly set up a path to the JSONL data for training (make sure you don’t accidentally use the json instead of jsonl one). We also set the model to the 1106 version as this is the newest one that has fine-tuning. My testing showed 1106 to be significantly better for fine-tuning than the older 0613 version.

Finally, the SUFFIX part will allow us to choose a part of the fine-tuning model’s name ourselves. This suffix will become part of the name you use to call your model, which is quite useful for identification as the model names are a bit long and all extremely similar if you don’t have a suffix. An example:

# Example model name without suffix
ft:gpt-3.5-turbo-1106:personal::8ot8ZLJR

# Example model name with suffix
ft:gpt-3.5-turbo-1106:personal:chris-gpt:8ot8ZLJR

Now let’s have a look at the file related methods in the openai client. Make sure you don’t run the file yet:

# File related methods
file = CLIENT.files.create(file=open(JSONL_FILE, "rb"), purpose="fine-tune")

CLIENT.files.list(purpose="fine-tune")

CLIENT.files.delete(file.id)

These methods are all fairly self-explanatory. The create method will upload your file. Make sure to use the rb (read-binary) mode for the file and provide the purpose as fine-tune so that the OpenAI servers know what this file is for. This returns an object we catch in the file variable. It looks like this and contains some basic file data, most importantly the id:

# Example of a file object

FileObject(
    id="file-DamWAnhgpnRvEyMZ3dOdHpvG",
    bytes=865053,
    created_at=1708303339,
    filename="Finx_completed_dataset.jsonl",
    object="file",
    purpose="fine-tune",
    status="processed",
    status_details=None,
)

The list method will list all files, allowing us to filter on a purpose, so we’ll filter for files with the purpose of fine-tune here. It just returns a list of the FileObject objects you see above. The delete method will delete a file by its id, using whatever ID you pass in to delete.
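One optional refinement: the open(JSONL_FILE, "rb") call above never closes the file handle explicitly. For a short script that is fine, but if you prefer to be tidy you can wrap the upload in a with block. The sketch below uses a made-up helper name, upload_training_file, and a stub object instead of the real OpenAI client so that it runs without an API key; with the real CLIENT the call inside the helper is the same files.create we just discussed:

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def upload_training_file(client, jsonl_file: Path):
    # The with block closes the file handle once the upload call returns.
    with open(jsonl_file, "rb") as file:
        return client.files.create(file=file, purpose="fine-tune")


# Stub standing in for the real client, only so this sketch runs offline.
def _fake_create(file, purpose):
    return SimpleNamespace(id="file-stub", bytes=len(file.read()), purpose=purpose)


fake_client = SimpleNamespace(files=SimpleNamespace(create=_fake_create))

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.jsonl"
    path.write_bytes(b'{"messages": []}\n')
    uploaded = upload_training_file(fake_client, path)

print(uploaded.id, uploaded.purpose)  # file-stub fine-tune
```

In the real script you would call upload_training_file(CLIENT, JSONL_FILE) and use the id of the returned object exactly as before.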

Fine-tuning methods

Now let’s take a look at the fine-tuning-job related methods. Again, don’t run this file yet, let’s just have a quick look first:

# Fine-tuning-job related methods
fine_tuning_job = CLIENT.fine_tuning.jobs.create(
    model=MODEL,
    training_file=file.id,
    hyperparameters={"n_epochs": 3},
    suffix=SUFFIX,
)

CLIENT.fine_tuning.jobs.list()

CLIENT.fine_tuning.jobs.retrieve(fine_tuning_job.id)

CLIENT.fine_tuning.jobs.cancel(fine_tuning_job.id)

The most important method is of course the create method, which will create a fine-tuning job. You pass in the model and the file.id of the file you want to do the training on. You can optionally pass in hyperparameters such as the number of epochs we discussed earlier, and the suffix we talked about. This method returns a FineTuningJob object, which looks like this:

FineTuningJob(
    id="ftjob-1OATxnQAgdY4yjPNmSBai95f",
    created_at=1708318423,
    error=Error(code=None, message=None, param=None, error=None),
    fine_tuned_model=None,
    finished_at=None,
    hyperparameters=Hyperparameters(
        n_epochs=3, batch_size="auto", learning_rate_multiplier="auto"
    ),
    model="gpt-3.5-turbo-1106",
    object="fine_tuning.job",
    organization_id="org-oMYMXpp7Cr9pG1rG5Z8a1T2w",
    result_files=[],
    status="validating_files",
    trained_tokens=None,
    training_file="file-EX13iLyISBZcreRCH3Fm51Pn",
    validation_file=None,
)

We can see that the FineTuningJob object also has an ID that we can use to refer to it, and some basic info, such as the fact that this one has not finished yet. We can see the hyperparameters, including the batch_size and learning_rate_multiplier which are set to auto by default.

The batch_size is the number of examples in each training batch. Batch size is the amount of data the model looks at before it learns something new. It’s like reading a few pages of a book, and then stopping to think about what you’ve read before continuing. At the end of the batch, the predictions are compared to the desired output, and the error is calculated and used to update the model. We’ll just leave this set to auto.

The learning_rate_multiplier is a value that multiplies the learning rate of the model. The learning rate is a hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated. If this is set too high, the model may very easily overfit to the specific data it has been trained on. We’ll also leave this set to auto.
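If you want a feel for what "how much to change the model on each update" means, here is a toy sketch. It has nothing to do with OpenAI’s actual training code: it simply minimizes f(x) = x² with plain gradient descent, and the base learning rate and the multipliers are made-up values. What it shows is that a multiplier scales the size of every update step, and that steps which are too large stop the process from settling down.

```python
def gradient_descent(learning_rate: float, steps: int = 20, start: float = 1.0) -> float:
    """Minimize f(x) = x**2 with plain gradient descent and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2 * x  # derivative of x**2
        x = x - learning_rate * gradient
    return x


base_learning_rate = 0.1  # made-up value for this toy problem

for multiplier in (1, 11):
    final_x = gradient_descent(base_learning_rate * multiplier)
    # The best possible answer is x = 0
    print(f"multiplier={multiplier}: final x = {final_x:.4f}")
```

With the small multiplier the value ends up very close to 0, while with the large one every step overshoots and the value ends up further from 0 than where it started.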

The list method, again, simply returns a list of all the FineTuningJob objects for your account. The retrieve method will return a single FineTuningJob object by its id and the cancel method will cancel a fine-tuning job by its id. These methods are extremely simple and there really is nothing more to them.
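As a small illustration of what you could do with the retrieve method, here is a hedged sketch of a polling helper that keeps checking a job until it is done. The wait_for_job function, its parameters, and the polling interval are my own hypothetical additions and not part of the OpenAI library. The status strings are the ones the fine-tuning API reports, such as the validating_files we saw above.

```python
import time
from typing import Any, Callable

# Statuses after which a fine-tuning job will not change anymore
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(
    retrieve: Callable[[str], Any],
    job_id: str,
    poll_seconds: float = 30.0,
    max_polls: int = 240,
) -> Any:
    """Call retrieve(job_id) repeatedly until the job reaches a terminal status."""
    for _ in range(max_polls):
        job = retrieve(job_id)
        print(f"Job {job_id} status: {job.status}")
        if job.status in TERMINAL_STATUSES:
            return job
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")


# Usage with the real client (assumes CLIENT and fine_tuning_job from the code above):
# finished_job = wait_for_job(CLIENT.fine_tuning.jobs.retrieve, fine_tuning_job.id)
# print(finished_job.fine_tuned_model)
```

Passing the retrieve method in as an argument keeps the helper easy to try out with a fake function before pointing it at the real API.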

Again, don’t run this file yet. The last method we have is for deleting a completed fine-tuned model:

# Fine-tuned-model related methods
CLIENT.models.delete("model_id_here")

For this, you need to have a fully trained model to delete obviously, and then just pass in the ID. Before we actually go ahead and run this file, I’ll comment out several of the methods to leave them in for our reference, as we naturally don’t want to delete the file we just uploaded nor cancel the fine-tuning-job:

from constants import CLIENT, DATA_DIRECTORY


JSONL_FILE = DATA_DIRECTORY / "Finx_completed_dataset.jsonl"
MODEL = "gpt-3.5-turbo-1106"
SUFFIX = "chris_gpt"


# File related methods
file = CLIENT.files.create(file=open(JSONL_FILE, "rb"), purpose="fine-tune")

print(CLIENT.files.list(purpose="fine-tune"))

# CLIENT.files.delete(file.id)


# Fine-tuning-job related methods
fine_tuning_job = CLIENT.fine_tuning.jobs.create(
    model=MODEL,
    training_file=file.id,
    hyperparameters={"n_epochs": 3},
    suffix=SUFFIX,
)

# CLIENT.fine_tuning.jobs.list()

print(CLIENT.fine_tuning.jobs.retrieve(fine_tuning_job.id))

# CLIENT.fine_tuning.jobs.cancel(fine_tuning_job.id)


# Fine-tuned-model related methods
# CLIENT.models.delete("model_id_here")

I’ve gone ahead and added print statements around the files.list and the fine_tuning.jobs.retrieve calls so we can see the output in our terminal. Now go ahead and run this file to start your fine-tuning job!

You should see the FineTuningJob object printed to the console. From here on we’ll be switching to the web interface for the fine-tuning API, as it’s much easier to work with and has real-time progress.

Using the web interface for the fine-tuning API

As developers we sometimes think we need to do everything programmatically, but there really is no need if we have an easy web interface that is much simpler to use. There really is no point in coding up some complex programmatic solution for something you’re only going to be doing once in a while.

If you have a look at https://platform.openai.com/files, you will see the web interface for managing the files you uploaded to OpenAI:

File API:

This interface is much nicer to read than the long file object list in your terminal window, and you can also upload new files here directly.

Now let’s switch to the fine-tuning tab at https://platform.openai.com/finetune, to see the fine-tuning job that we have started:

Fine-tuning jobs API:

As you can see I’ve done quite some testing so I have a whole load of models here, but you should see your model either “Validating files…” or maybe already in the training stage. You can also create new fine-tuning jobs on this page or cancel a running fine-tuning job using the cancel button. We can also see a lot of details on our fine-tuning job as it progresses:

Fine-tuning details:

Please ignore all the “Cancelled” fine-tunes in my list, I tend to double-check the code when writing these things so I triggered a lot of fine-tunes that were not needed and therefore canceled them straight after. The important thing here is first of all, when the model is done training, you will see the full model name for your fine-tune. In this case, I have ft:gpt-3.5-turbo-1106:personal:chris-gpt-full:8ot8ZLJR as my model name here, but yours is probably not done training yet.

It can take anywhere from a couple of minutes to potentially even hours for OpenAI to train your model, depending on the size of the data and how busy the servers are. Mine typically finished within 10 to 30 minutes, and OpenAI will send you an email when the model training is done. In the meantime, you can follow the progress in real-time in the right-side panel here, where you can watch the training loss go down. You can also see the messages at the bottom keeping you posted on the progress.

So go ahead, it’s time again for your well-deserved coffee break with your favorite snack. Have a small break and give your model time to train, and I’ll see you back here when it’s done!

Running our own fine-tuned ChrisGPT model!

Ok, so I assume you’re back and your model is done training, which means you now have the name of your personal fine-tuned model from the https://platform.openai.com/finetune fine-tuning page. The name will also be in the email you receive when the training is done. First, go and open the constants.py file in your root directory, as we want to make some updates to the file:

from pathlib import Path

from decouple import config
from openai import OpenAI


CLIENT = OpenAI(api_key=str(config("OPENAI_API_KEY")))
DATA_DIRECTORY = Path(__file__).parent / "data"
## Add the below variables
OUTPUT_DIRECTORY = Path(__file__).parent / "output"
CHRIS_GPT = "ft:gpt-3.5-turbo-1106:personal:chris-gpt-full:8ot8ZLJR"

We added an output directory for our generated files, and I stored the name for my ChrisGPT model in the CHRIS_GPT variable. Make sure you replace the CHRIS_GPT string with the name of your own model and do not copy mine, as you will not be able to access my personal model. Make sure you save and close this file.

Now create the empty folder for the output:

Finx_Fine_Tuning
    data
        ...
    output 
        (empty)
    utils
        ...
    .env
    chris_gpt_dataset_generator.py
    chris_gpt_dataset_validation.py
    chris_gpt_training.py
    constants.py
    Pipfile
    Pipfile.lock

And then create a new file in your root directory called chris_gpt.py:

Finx_Fine_Tuning
    data
        ...
    output
        (empty)
    utils
        ...
    .env
    chris_gpt_dataset_generator.py
    chris_gpt_dataset_validation.py
    chris_gpt_training.py
    chris_gpt.py 
    constants.py
    Pipfile
    Pipfile.lock

Now let’s start with our imports for chris_gpt.py:

from constants import CLIENT, OUTPUT_DIRECTORY, CHRIS_GPT
import time

We import the CLIENT, OUTPUT_DIRECTORY, and CHRIS_GPT model-name from the constants.py file, and we also import the time module so we can easily give our output files unique names.

Now I’m going to simply declare a string variable that will contain the contents I want our email to have:

leaf_blower = """
Introduction to the AI-powered leaf blower and its innovative features in the realm of yard maintenance equipment.
Description of how the AI technology enhances the efficiency and performance of the leaf blower compared to traditional models.
Overview of the specific AI algorithms and sensors integrated into the leaf blower for optimized leaf collection and debris management.
Real-world application scenarios demonstrating the effectiveness of the AI-powered leaf blower in various landscaping and gardening tasks.
Discussion on the environmental benefits of using the AI-powered leaf blower, such as reduced noise pollution and energy consumption.
Insights into the future development and potential advancements in AI-powered yard maintenance equipment, including further automation and integration with smart home systems.
"""

Either copy this from the written version of the tutorial or come up with your own topics that you want Chris to talk about. You can also ask ChatGPT to generate bullet points on a topic for you if you want.

Now we’ll just define a simple chris_gpt function that will take a string of topics and then call our custom model to ask RoboChris to write an email about them:

def chris_gpt(topics: str) -> str:
    response = CLIENT.chat.completions.create(
        model=CHRIS_GPT,
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant that writes emails for the Finxter email newsletter, adhering perfectly to the style and tone of the Finxter brand and Chris' writing style. You will respond in Simple text format. Don't insert any newline characters and such but use an actual newline. Make sure that the subject makes sense in regards to the content of the email. Keep the email CONCISE AND TO THE POINT, and STAY ON TOPIC. Do not repeat yourself. Don't forget to add Chris' signature emoticons. Also don't make up nonsense terms that do not exist, and make sure you ALWAYS USE CORRECT SPELLING! The user will inform you about the topics of the email:",
            },
            {"role": "user", "content": topics},
        ],
    )

    return (
        response.choices[0].message.content
        or "There was an error with the response. Please try again."
    )

Note that we edited the prompt a bit from the training data to emphasize conciseness and staying on-topic. Turns out Chris-GPT really goes wild if left unchecked, really talking about anything and everything. It’s very funny actually. I’ve also decided to ask it for simple text output for now. While we have baked much of the behavior into the model, we can still steer it.

Now let’s finish off the code. Add the following below and outside the chris_gpt function:

current_unix_time = int(time.time())

filename = f"chris_gpt_output_{current_unix_time}.txt"

with open(OUTPUT_DIRECTORY / filename, "w", encoding="utf-8") as file:
    file.write(chris_gpt(leaf_blower))

First, we get the Unix time in seconds, which is a unique number that will be different every second and simply refers to the number of seconds that have passed since 1970. This makes sure that files don’t overwrite each other as long as we don’t generate multiple files in the same second.

We then use this to create a filename for our output file. We open the file in write mode and write the output of our chris_gpt function to the file, using the utf-8 encoding to make sure we don’t blow up the code when emoticons are used.
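If you ever do generate several files within the same second, or forget to create the output folder first, a small helper can take care of both. This is just a sketch under my own assumptions: unique_output_path is a hypothetical function name and the counter suffix is my own choice, not something from the code above.

```python
import time
from pathlib import Path


def unique_output_path(directory: Path, prefix: str = "chris_gpt_output") -> Path:
    """Return a not-yet-existing .txt path based on the current Unix time."""
    directory.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    timestamp = int(time.time())
    candidate = directory / f"{prefix}_{timestamp}.txt"
    counter = 1
    while candidate.exists():  # same second as an earlier file: add a counter
        candidate = directory / f"{prefix}_{timestamp}_{counter}.txt"
        counter += 1
    return candidate


# Usage (assumes OUTPUT_DIRECTORY, chris_gpt and leaf_blower from the code above):
# with open(unique_output_path(OUTPUT_DIRECTORY), "w", encoding="utf-8") as file:
#     file.write(chris_gpt(leaf_blower))
```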

Let’s put RoboChris to work!

Go ahead and run the file, and your output will appear in the output folder. You can open the file and see the email Chris wrote for you. Here is a random example of what I got:

Hey {User}! This is one of the most unique products I've seen in a long time.

AI Leaf Blower: A Must-Have Garden Tech for 2022?

I found it on Twitter, 500,000 views in the last 24 hours! It's the next level in leaf blowers. The AI can identify holes, sticks, mud, and leaves.

Perception and decision-making are impressive: It assesses the weight and size of sticks, identifying problematic areas not only by visual information but also friction.

For example, if you collide with something hard, it'll learn from this feedback to avoid those spots in the future.

It also listens to the sound it makes on areas with a lot of leaves compared to an area with just a few leaves in order to gain a sensory perception of the leaf bed to optimize collection and airflow.

Overall a great machine for both pickup and distribution!

It's easy to see more products like this coming out and we're only at the beginning of AI integration in everyday life, first visually (showing the gardens), then providing communication and sensor data based on subscriber inputs.

A systems engineer with 5-10 years of experience could put this together quite easily so you won't see prohibitive costs anymore. This is a massive trend!

Check out the one-minute Twitter AI-Blower-Cheetah in action:
AI-Powered Leaf Blower

With cool new tech disrupting every industry, let's be on the right side of change!
Chris

PS: This is the future! Any dirt you throw at it with a shovel, it'll pick up in seconds! Scroll up to see the video demonstration if you missed it.

That is fascinating, right!? My mind is blown. If you compare this with the leaf_blower prompt we fed it, you can see that the original prompt is totally generic, yet this email reads convincingly like a Finxter email written by Chris! It’s sort of scary almost, but also very cool.

While this is not perfect, and you definitely cannot send these emails without some editing and checking, this really is a talking ChrisGPT now. It is admittedly a bit wild, but that’s because of our input data, which was a bit messed up because of the formatting and the shorter, much more attention-grabbing style of speech one uses in emails. We deliberately created a type of idea-generator here, intended to be used by the creator themself.

While this is perhaps not a useful commercial application in itself, it clearly demonstrates the power and potential of fine-tuning. It’s also a lot of fun to play around with!

I think you can easily see that this type of fine-tuning will absorb your business’s communication tone, as long as you feed it somewhat more boring and professional clean data as opposed to short promotional email data which we didn’t clean up perfectly.

Let’s have some fun

I’ll show you a couple of my other test inputs/outputs:

super_robot = """
new AI-powered robot:
All-in-one household chore solution
Capable of cleaning, laundry, dishwashing, cooking, and car washing
Utilizes advanced AI and robotics technology
Customizable chore scheduling options
Maximizes efficiency and productivity
Seamlessly integrates with smart home systems
"""

Hey {User},

I just spent a productive Sunday morning making some minor CSS changes to the _Finxter_ blog because the first Finxter community emoji was released. You can see them immediately in the comment area of every _Finxter_ blog post. Let's continue with some Finxter industry updates.

The coming years will be exciting for roboticists and us consumers. Many robotic companies are headed for an IPO. If Tesla and Bitcoin 2.0⃣ (i.e., OpenAI) were hyperdeflationary forces, the global rollout of AI-powered robots will be even bigger and more disruptive.

First, I've ordered a house-cleaning robot from Altan. I'll receive it next month in February.

_Disclosure: As an early Finxter of only [5] Bitcoin_

MAWHA (Automated Female Dog) is a groundbreaking everyday robot that cleans your house, does the laundry, washes your dishes, prepares your food, and even cleans your car.

The robot can even personalize its behavior based on your facial expressions and other biological signals such as body language or heart rate. No other robot does this. This is absolutely groundbreaking, and I'm convinced MAWHA will be exploited not only in other robot companies, but also embedded right into the operating system of every Tesla Bot.

Check it out: www.youtube.com/watch?v=O6fGGdUnyjo

I'll keep you posted on this.

Let me know if your personal bot shall get in touch with my personal bot. Okay, just kidding. But mindblowing developments are right around the corner.

Happy coding!

Chris

Now, there is some nonsense in here, but it’s still pretty good! I have not edited these at all, they are 100% direct ChrisGPT output. Of course, it’s based on fictional products as we are asking it to write about fictional products here.

I’ll ask ChrisGPT to write about something factual:

sharks = """
Briefly introduce the topic of great white sharks and why you're discussing them.
Describe the appearance and size of great white sharks, highlighting their distinctive features.
Discuss where great white sharks are commonly found and their preferred habitats, such as coastal waters and oceanic regions.
Diet and Feeding Behavior: Explain what great white sharks eat and how they hunt, including their role as apex predators in marine ecosystems.
Provide information about the conservation status of great white sharks, including any threats they face and conservation efforts to protect them.
Discuss human interactions with great white sharks, including encounters in the wild, conservation initiatives, and safety measures for beachgoers and divers.
"""

Hey {User},

Let's talk about great white sharks today. Nobody knows sharks, which create an ocean environment that is anything but monolithic. Maddening creatures indeed, bringing the oceans alive in a thoughtful way.

Anyway, enjoy the following educational snippets! Feel free to send this quick tutorial to other ocean nerds out there:

**Great White Sharks** 101

Appearance: The great white shark is one of the most iconic species. Named for its gray coloration and white underside, it has a torpedo-shaped body and impressive size, with an average length of 13-16 feet; 20 feet is common, and some sharks grow even larger.

Distribution: They are found mostly in cool coastal waters of all major oceans, often close to shore, but also travel hundreds of miles into open oceanic regions. Juveniles tend to stay closer to shore, while the adults roam through deeper waters.

Diet and Feeding Behavior: Being top predators for marine ecosystems, great white sharks hunt marine mammals such as seals, sea lions, and small toothed whales. Fast and powerful, they can launch themselves out of the water to catch prey, often attacking from below.

Conservation Status: As an apex species, the health of great white shark populations provides an indication of the health of the overall marine ecosystem. Pressure from overfishing, bycatch, pollution, and loss of prey have taken their toll, and great whites are now considered vulnerable.

Human Interactions: Great white sharks are frequently studied, with many researchers interested in conserving and understanding these animals better. More and more programs have been established to protect the sharks; providing a balanced view on whom we should really fear.

While the chance that you run into a great white is luckily low, I roughly calculated your lifetime risk of dying from any shark attack to be in the same ball park as your lifetime win from a shark lottery of 1 in 1,000.

We'll dive into the 2023 edition of Shark Diving technology once available.

Enjoy your breaks today, and keep hitting the books!

Best, Chris.

Now you can see it does a pretty good job writing about real and existing topics as well. It needs some minor edits, but the above is 100% unedited raw output. While talking about a real topic it still uses the fun emoticons and that Chris-specific style of speech that makes it so much more engaging to read. It’s those things like “Feel free to send this quick tutorial to other ocean nerds out there:” that really make me chuckle and make it feel like Chris is really writing this.

Be careful what you train the model to do

Notice that ChrisGPT is sometimes a bit wild and makes stuff up. I left it like that on purpose. The reason it is like this proves an important point here. We trained ChatGPT on data that had only a few bullet points in the user’s request but way more information in the response.

What this means is we trained ChrisGPT over and over, specifically, to come up with stuff that was not present in the second item which represented the user query, because all the examples we fed it showed ChatGPT responses that had more information than the user request had.

This taught ChrisGPT to include more information that was not present in the original user request because that’s what all the examples were doing. Now, in our case, that’s what we wanted, so that is absolutely fine. But when designing your training data be very careful and aware of exactly what you are teaching the model to do. Your input will equal your output here.
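One simple way to stay aware of what you are teaching the model is to measure it. The sketch below assumes your training data is in the chat fine-tuning JSONL format (one JSON object with a "messages" list per line) and counts the words in the user request versus the assistant response for each example. The function name and the sample example are made up for illustration.

```python
import json


def prompt_vs_response_words(jsonl_text: str) -> list[tuple[int, int]]:
    """Return (user_words, assistant_words) for every training example."""
    counts = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        messages = json.loads(line)["messages"]
        user_words = sum(
            len(m["content"].split()) for m in messages if m["role"] == "user"
        )
        assistant_words = sum(
            len(m["content"].split()) for m in messages if m["role"] == "assistant"
        )
        counts.append((user_words, assistant_words))
    return counts


# A single made-up training example, for illustration only
sample = json.dumps(
    {
        "messages": [
            {"role": "system", "content": "You write newsletter emails."},
            {"role": "user", "content": "AI leaf blower. Quieter. Saves energy."},
            {
                "role": "assistant",
                "content": "Hey! I found a leaf blower that uses AI to map your "
                "garden, and it went viral on social media last week.",
            },
        ]
    }
)

for user_words, assistant_words in prompt_vs_response_words(sample):
    print(f"user: {user_words} words, assistant: {assistant_words} words")
```

If the assistant side is consistently much longer than the user side, as it was in our dataset, you are teaching the model to add information of its own.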

When not to use fine-tuning

Before we end this tutorial series, let’s discuss when to use, and when not to use fine-tuning. You should not use fine-tuning for any tasks where you can get good results using prompt engineering and giving examples in the prompt setup. Most problems can be solved using simple prompt engineering and do not require the extra complexity and time investment of fine-tuning. Do not assume you need fine-tuning just because it sounds fancy.

You will see loads of video tutorials on YouTube that ‘fine-tune’ ChatGPT to be a sarcastic Reddit commenter, etc. This is completely pointless! You can do this with simple prompt engineering and perhaps a couple of examples in the prompt setup. This is missing the point of fine-tuning altogether, and the reason is that there isn’t that much that ChatGPT cannot do yet. Training it to do something it can already do is a waste of time, energy, and money.
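To show what "a couple of examples in the prompt setup" can look like, here is a minimal sketch that builds a few-shot message list for the chat completions API. The helper name and the example texts are made up. The idea is simply that each example is passed as a fake user/assistant exchange before the real question.

```python
def build_few_shot_messages(
    system_prompt: str,
    examples: list[tuple[str, str]],
    user_query: str,
) -> list[dict[str, str]]:
    """Build a chat message list with example exchanges before the real query."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": user_query})
    return messages


# Made-up examples, for illustration only
messages = build_few_shot_messages(
    system_prompt="You are a sarcastic commenter. Reply in one short sentence.",
    examples=[
        ("I just learned Python has lists.", "Wow, next you'll discover dictionaries."),
        ("My code works on the first try!", "Sure it does, and I'm a unicorn."),
    ],
    user_query="I think I found a bug in the compiler.",
)
print(len(messages))  # 6

# Usage with the real client (assumes CLIENT from constants.py):
# response = CLIENT.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
```

No training, no waiting, and you can change the examples at any time.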

Single-correct-answer type tasks

Fine-tuning can be used for very specific and focused tasks. Say you have a very narrow task that you want ChatGPT to do over and over, like extracting very complex CSV or JSON objects from unstructured data, with a lot of edge cases and exceptions, for all of which you’ll need to provide many examples.

You may find that GPT-4 is powerful enough to just do this, but GPT-3.5-turbo is not quite able to do the same task reliably. Yet, you may still want to use GPT-3.5-turbo for that specific task. Why? Because GPT-3.5-turbo is smaller and therefore much faster and cheaper. If you fine-tune GPT-3.5-turbo on that specific task, then it will generally be able to reach GPT-4 level quality on that specific task. This is a good use of fine-tuning.

This seems to actually reflect a general trend in the LLM space as well, where first the models just got bigger and bigger, but now more and more models are coming out that are specialized for certain specific uses, like writing code or handling math problems, etc, as opposed to having the one model to rule them all. This is a good thing, as it allows for more efficient use of resources and more specialized and accurate results for specific tasks.

Another benefit that you get here is that OpenAI will let you submit both a training and a testing portion of the data. After training on, say, 70% of the data, OpenAI will let you know how accurate the model is by testing it on the remaining 30%, which has not been shown to the model during training and therefore makes the perfect test.
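Splitting the data yourself is straightforward. Below is a minimal sketch of a 70/30 split. The function name, the seed, and the fake examples are my own assumptions. You would then write each part to its own JSONL file, upload both, and pass the ID of the second file as the validation_file argument when creating the fine-tuning job (the same validation_file field we saw set to None in the FineTuningJob object earlier).

```python
import random


def split_dataset(
    examples: list, train_fraction: float = 0.7, seed: int = 42
) -> tuple[list, list]:
    """Shuffle a copy of the examples and split it into a training and a testing part."""
    shuffled = examples[:]  # copy, so the original list stays untouched
    random.Random(seed).shuffle(shuffled)
    cutoff = round(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]


# Made-up stand-ins for real training examples
all_examples = [f"example_{i}" for i in range(10)]
training_part, testing_part = split_dataset(all_examples)

print(len(training_part), len(testing_part))  # 7 3
```

Shuffling before splitting makes sure the testing part is not just the tail end of your file, which might look different from the rest.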

This is also useful for increasing accuracy on complex function calls and such, as these are also a JSON object output type task, though the models have improved so much with each iteration lately that you probably will not need to fine-tune for this yourself anymore as OpenAI has sort of done this for us already.

Customer service chatbots

Fine-tuning is not a data retrieval system. You might think that you can simply train ChatGPT on your data and it will answer all questions about this data correctly from there on in. Like you feed all your business data and questions and answers from your customer service history and now your fine-tuned model will know everything about your business right? Well, the answer is yes, and no.

I tested this out to make sure my statement here is correct and fine-tuned a model on training data containing over 80 frequently asked questions and their answers from the customer service of a fictional product. Though the fine-tuned model was able to answer most questions correctly, it did give some wrong answers conflicting with the data it was trained on. You cannot use this as a foolproof data retrieval system for your customer service, as being correct 90% of the time there is not good enough.

That being said fine-tuning is being used for customer-service bots, but they should be used in combination with a data retrieval strategy that provides the correct answer to ChatGPT before answering the user, in which case they act as an extra reinforcement where the effect of both is added together. The fine-tuning takes care of the company-specific style of communication, and trains the actual data into the model to some degree, while the retrieval system takes care of the actual data retrieval, feeding the model with the exact and correct information to generate the response for the end user.

The retrieval part is usually achieved by cutting the entire dataset into pieces and storing these pieces as embeddings in an embedding database. You would then retrieve ONLY the pieces of text from your dataset that are similar to the user query and hand them to ChatGPT, giving it the perfect source material it needs to give a factually accurate answer. ChatGPT will then generate a response using the retrieved pieces of text it was handed as the source material. If you want to know more about embeddings, check out chapters 6 and 7 of my “function calls and embeddings” related course here on the Finxter Academy.
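To give you a rough idea of the retrieval step, here is a tiny sketch using cosine similarity. Note the assumptions: the three-number vectors are hand-made stand-ins (real embeddings come from an embedding model and have many more dimensions), the texts are fictional, and a real setup would use an embedding database rather than a Python list.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity of two vectors, where 1.0 means they point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vector: list[float], chunks: list[dict], k: int = 2) -> list[str]:
    """Return the texts of the k chunks most similar to the query."""
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vector, chunk["vector"]),
        reverse=True,
    )
    return [chunk["text"] for chunk in ranked[:k]]


# Hand-made toy "embeddings", for illustration only
chunks = [
    {"text": "Refunds are processed within 14 days.", "vector": [0.9, 0.1, 0.0]},
    {"text": "The device charges over USB-C.", "vector": [0.0, 0.2, 0.9]},
    {"text": "Returns require the original receipt.", "vector": [0.8, 0.3, 0.1]},
]
query_vector = [1.0, 0.2, 0.0]  # pretend embedding of "How do I get my money back?"

context = retrieve_top_k(query_vector, chunks, k=2)
print(context)
```

The two retrieved texts would then be pasted into the prompt as source material, so the model answers from the actual data instead of from memory.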

Thank you for participating!

That’s it for the fine-tuning course, I hope you enjoyed it and it has given you a good idea of when you can use fine-tuning in your own projects. As always, it’s been a pleasure and an honor to take this journey together with you!

Dirk.

The post [Full Tutorial] OpenAI Fine-Tuning: Creating a Chatbot of Yourself (Example: ChrisGPT) appeared first on Be on the Right Side of Change.

How to Set Up AutoGen Studio with Docker

Dirk van Meerveld — Sun, 11 Feb 2024 21:00:30 +0000

Hi and welcome to this tutorial series on AutoGen Studio. My name is Dirk van Meerveld, and I’ll be your host and guide for this three part tutorial series where we’ll take a look at AutoGen Studio.


AutoGen Studio is an open-source interface that runs on top of AutoGen. This combination will allow us to work with LLMs and give them skills (like tools/functions) that they can use and also allows us to use agents and even define multi-agent workflows. Think of it as a multi-agent framework. One of the most amazing features is that it can write and execute code on the fly to solve problems!

On top of that AutoGen Studio provides us with a sleek and easy-to-use interface to define all the above and chat with our agents, much of it without even having to write code! This truly is next-level prompt engineering and over the coming lessons, we will harness the power of digital agents and see them collaborate with each other!

In this first part, we’ll focus on the setup and take some time to properly get started. If you’re a more seasoned developer and intimately familiar with a particular topic like Docker, feel free to skip ahead a bit here and there, but we’ll cover all the bases so that everybody can follow along nicely. This is especially so as much of AutoGen Studio can be used via the interface which makes it more accessible to non-coders. We will leave no man or woman behind!

AutoGen safety measures

In order to answer more complex requests AutoGen will actually write Python code for you and execute this Python code on your local computer.

For example, there is a demo where you ask the following question: "Plot a chart of NVDA and TESLA stock price for 2023. Save the result to a file named nvda_tesla.png". AutoGen will solve this question by writing Python code to get the stock prices, installing the libraries needed to execute the code, writing code for creating the graph in Matplotlib, etc. It will then execute this code on your local computer and return the result to you.

The problem occurs where operating systems of course have some kind of built-in security. Generally, the Execution Policy on your OS will prevent AutoGen (and others) from just running random code and scripts on your computer. And this is a good thing! One solution would be to loosen up the Execution Policy on your computer, but this is not really the best idea. Even if you don’t have an Execution Policy problem on your system, do you really want to have any and all AI-generated code running on your machine without any checks or balances? It’s a bit dangerous, to say the least.

Why we won’t use Anaconda

Many tutorials try to take the 100% no-code approach, using Anaconda for virtual environments and not using an IDE like VS Code altogether in an attempt to claim that ‘You don’t need to know any code at all’. Then you will:

Still have to install Docker as well as Anaconda, in order to allow safe code execution inside of a Docker container.
Set your API key over and over every time you reload.
Type the same Anaconda commands over and over.
Possibly get stuck with a buggy experience trying to get it to use Docker for safe code execution, causing you a lot of frustration.

While there is absolutely nothing wrong with Anaconda, it doesn’t remove the need for Docker for safe code execution as it doesn’t provide any isolation from your local system. So why don’t we just use Docker instead of Anaconda AND Docker? This biting the bullet will make this part 1 of the tutorial a bit harder, but after that, it will be smooth sailing all the way!

Docker to the rescue

Using this approach we can skip Anaconda altogether as Docker will also be separate from our system-wide Python environment, and we have only 1 piece of software to worry about for both the separate environment and the safe code execution. We will use some bare basic code in the form of a ‘docker file’ and basic commands, but you can just copy mine, so no worries! Again, if you’re already a Docker expert, feel free to just skip ahead to the Dockerfile and get started.

Docker uses operating system-level virtualization to deliver software in packages called containers. Containers are isolated from one another and bundle their own software, libraries, and configuration files. All containers are run by a single operating system kernel and are thus more lightweight than virtual machines. Crudely stated, this means that Docker will let us have a separate container with Linux, Python, and all the libraries we need to run AutoGen Studio in it, kind of like running a different computer system within a virtual machine.

We will use this ‘virtual machine’, or Docker container, to run AutoGen Studio inside of it, which automatically also means that any code it generates is executed within the container, solving our code execution policy problems and safety concerns in one go. Caveat: Theoretically even Docker is not 100% secure and isolated from the rest of your system, but it’s a lot better than just running code directly on your local machine and good enough for non-enterprise use.

Getting Docker installed and running – WSL2

I’m going to assume you’re on a Windows machine. If you’re on a Mac or Linux machine, you’re in luck and can probably just run the installer for ‘Docker Desktop’ and be done with it. Just head over here and scroll down to select the Docker download for your platform. https://docs.docker.com/desktop/ (or Google for the updated link).

Continuing for the Windows users now, as they have a couple of extra steps to take. Most tutorials will just tell you to install Docker and give no details, which might leave you with a hard and long process of figuring out why it won’t work on your system. So let’s cover the pitfalls and make sure we get it right the first time!

First, we need to install WSL2 (Windows Subsystem for Linux 2) on our system. This is a Windows feature that allows us to run Linux on our Windows machine. This is needed because Docker Desktop for Windows requires WSL2 to run. (You can also use Hyper-V instead, but WSL2 is the recommended way).

(If you already use WSL but need to check if you have version 2, you can check your version by entering the command: wsl -l -v in PowerShell or Windows Command Prompt.)

You can find the details for installing WSL2 here, but basically you do this:

Open a PowerShell or Windows Command Prompt in administrator mode by right-clicking and selecting “Run as administrator”
Run the following command in the terminal: wsl --install
Restart your computer when prompted

This will enable WSL on your Windows machine and install a Linux distribution (Ubuntu) on your system. You may need to create a Username and Password for your Linux distribution and save them somewhere. More details can be found here or google "Setting up your Linux username and password."

Some things to check before we install Docker Desktop

A quick caveat: If anywhere along the way you run into problems I haven’t covered here, you’ll need to do some googling to find out what it means and what needs to be done, as I cannot prepare you for every possible problem you might run into.

The software development world sometimes requires messing around and googling for an hour or even two before you get something to work; it’s part of the game. That being said, I will try my best to cover all bases so you’ll hopefully have a very smooth experience.

We have a couple more things to check before we can install Docker Desktop. First, we need to make sure the Virtual Machine Platform option is enabled. This is fairly easy! Just press the Windows key or click in the Windows search bar and search for “Turn Windows features on or off”.

Open this and make sure the “Virtual Machine Platform” option is checked. If it’s not, check it and click OK. You may need to restart your computer after this.

(You don’t have to match the other checkboxes with the settings in the image!)

Next, we need to make sure that Virtualization is enabled in our BIOS. The easiest way to check if this is enabled is to open the Windows Task Manager by pressing Ctrl+Alt+Delete and selecting Task Manager. Then click on the Performance tab and click on CPU on the left-hand side. If Virtualization is enabled, you will see “Virtualization: Enabled” in the bottom right information block like this:

If Virtualization is not enabled, you will have to go into your BIOS settings and enable Virtualization. This is where you will have to do some googling and research on your own, as every system has slightly different keys to get into the BIOS setup menu, and the settings may be located in different parts of the BIOS menu for different manufacturers. I’ll leave you with two links to get you started, the first one describes the general process of getting this setting enabled in your BIOS:

When you’ve got that ready and set up to go, let’s continue on.

Installing Docker Desktop

Finally, it’s time! Head over to the Docker Desktop download page here and download the appropriate Docker version for your OS. I’m running Docker 4.26.1, but just download the latest version and you should be fine. When the download finishes start the installer. The installer will give you the following options:

Just accept both of these options and click OK, unless you chose not to install WSL 2 and use Hyper-V instead. Whether or not you want a shortcut on your desktop is entirely up to you, of course.

Now just let the installer do its magic:

Done!

Then go ahead and run the Docker Desktop application, where you’ll have to accept the service agreement:

And then just choose “Use recommended settings” and click Finish:

Now Docker Desktop will start and you will be prompted to either sign up or sign in. Docker is free for personal and even small-business use, so press the button to sign up and create an account. You can even use your Google or GitHub account to create one really fast. (You can also continue without signing in if you want to, and it should still work fine). I’m just going to go ahead and sign in with my Google account.

If everything was successful, you should be greeted by the following screen:

Congratulations! You’ve now installed Docker Desktop and are ready to go! If you still have problems, first try the below, and if that doesn’t work, google will have a solution. Never despair!

(Only for those who still have problems)

- Hypervisor enabled at Windows startup -
If you have completed the steps described above and are still experiencing Docker Desktop startup issues, this could be because the Hypervisor is installed, but not launched during Windows startup. Some tools (such as older versions of Virtual Box) and video game installers turn off hypervisor on boot. To turn it back on:

    - Open an administrative console prompt.
    - Run bcdedit /set hypervisorlaunchtype auto.
    - Restart Windows.

Creating a Dockerfile

Ok, now that we all have Docker Desktop installed and running, let’s move on to the next step! I’ll be using VS Code for this, just because it feels convenient to me. You can also use any other code editor or literally just copy the text into Notepad and use a separate terminal window, it makes no difference.

Create a base project directory and open it in VS Code. I’ll simply call my directory AUTOGEN:

AUTOGEN (root project folder)

Now inside the AUTOGEN folder create a new file called Dockerfile.base:

    AUTOGEN (root project folder)
        Dockerfile.base

What is a Dockerfile? A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image and ultimately run a Docker container. Think of this dockerfile as a recipe that Docker will follow to build a custom image that can be used to create new containers.

In a Dockerfile, you can specify the base image to use, define the working directory, copy files from your local system to the container, run commands to install packages, expose ports for the application, and specify the command that should be run when a container is launched from the image.

So open up your Dockerfile.base, and let’s type out our Docker recipe! This will be mostly based on the example recommendations from AutoGen itself, with some minor tweaks. First, let’s specify the base image:

FROM python:3.11-slim-bookworm

This will use the official Python image from Docker Hub, which is based on Debian Linux. We’ll use the slim-bookworm version, which is a lightweight version of Debian Linux. We’ll also use Python 3.11, which is in the range of current recommended versions for AutoGen at the time of writing.

Then continue in your Dockerfile.base with the following:

RUN apt-get update \
    && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        software-properties-common sudo\
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

I’m going to explain what all the Docker commands do, as these topics are interesting and very helpful as a developer. If you’re not interested in Dockerfiles and Linux commands and don’t want to do any coding stuff, you have my blessing to skip the explanations, scroll down to the finished Dockerfile at the end of this section, and copy it into your own Dockerfile.base.

RUN is a Dockerfile command that executes a command in the container. In this case, we’re running the apt-get update command, which refreshes the local package index with the latest versions. && is a command separator, which lets us chain commands.
DEBIAN_FRONTEND=noninteractive is an environment variable that is set to noninteractive to prevent the apt-get install command from asking any questions during the installation and just apply default settings.
apt-get install is the command to install packages. The -y flag is used to automatically answer yes to any questions that may come up during the installation process. The --no-install-recommends flag is used to prevent the installation of any recommended packages that are not strictly required for the package to function, keeping it light.
software-properties-common is a package that provides utilities for managing software repositories, and sudo is a utility that allows us to run commands as a superuser.
apt-get clean is a command that cleans up the local repository of retrieved package files, and rm -rf /var/lib/apt/lists/* is a command that removes the package lists that were downloaded during the apt-get update command. This is done to save disk space.

Next in our Dockerfile, we’ll set up a non-root user with sudo (superuser) access:

RUN adduser --disabled-password --gecos '' autogen
RUN adduser autogen sudo
RUN echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
USER autogen
WORKDIR /home/autogen

The first line creates a new user named autogen without a password and without prompting for additional information; --gecos '' simply sets the user info to an empty string. The next line adds the user autogen to the sudo group, granting it administrative privileges.

The third line configures the sudo group to allow members to execute commands as root without requiring a password. After that, we switch to the autogen user and set the working directory to /home/autogen.

Next up:

ENV PATH="/home/autogen/.local/bin:$PATH"
ENV OPENAI_API_KEY=paste_your_api_key_here

First, we set the PATH environment variable (using Docker’s ENV command) to include the .local/bin directory in the autogen user’s home directory.

Then we set the OPENAI_API_KEY environment variable to the API key that we got from OpenAI. This is needed so that AutoGen Studio can access the OpenAI API.

Make sure you paste your own OpenAI API key in there instead of paste_your_api_key_here, without adding "" double quotes or anything else around it. You can create new or extra keys by going to the OpenAI API keys page:

Ok now to continue in our Dockerfile:

RUN pip install pyautogen==0.2.8 autogenstudio==0.0.34a0 numpy pandas matplotlib seaborn scikit-learn requests urllib3 nltk pillow pytest beautifulsoup4

We just pre-install some of the popular packages and of course pyautogen and autogenstudio themselves. I have specified two specific versions for pyautogen and autogenstudio, as these are the versions I’m using when writing this tutorial. I advise you to use the same ones to make sure you have exactly the same experience as me. You can always upgrade the packages after you finish the tutorial series. (Don’t worry, at the end of the last part I’ll show you how and you won’t lose any of your work!)

Now next up in our Dockerfile:

# Expose port
EXPOSE 8081

# Start Command for AutoGen Studio
CMD ["autogenstudio", "ui", "--host", "0.0.0.0", "--port", "8081"]

The EXPOSE command exposes port 8081, which is the port that AutoGen Studio will run on. The CMD command specifies the command that will be run when the container is launched from the image. In this case, we’re running the autogenstudio command with the ui option, which will start the AutoGen Studio interface. The --host and --port options specify the host and port that the interface will be available on. This means we’ll only have to start the container and AutoGen Studio will be available on port 8081 automatically!

Note that we used the address 0.0.0.0 instead of the usual localhost address 127.0.0.1. If we bind to 127.0.0.1, the application is only accessible from the same machine, which in this case means the same Docker container. To make our application accessible from outside the Docker container, we need to bind it to 0.0.0.0 instead. This makes it listen on all network interfaces, so it can be reached from your host machine as well.
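To make the difference between the two bind addresses a bit more concrete, here is a minimal Python sketch using only the standard library. The helper name bind_and_report is made up for this illustration and is not part of AutoGen Studio or Uvicorn; it simply binds a socket to the given address on any free port and reports what the operating system bound it to.

```python
import socket


def bind_and_report(host: str) -> str:
    """Bind a TCP socket to `host` on a free port and return the bound address.

    Hypothetical helper for illustration only.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Port 0 asks the operating system to pick any free port.
        sock.bind((host, 0))
        address, _port = sock.getsockname()
        return address


# Loopback only: reachable solely from the same machine (or the same container).
print(bind_and_report("127.0.0.1"))
# All interfaces: this is what lets Docker's published port reach the server.
print(bind_and_report("0.0.0.0"))
```

A server bound the first way inside a container would not be reachable through the -p port mapping, which is why the Dockerfile passes --host 0.0.0.0.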

To finish off our Dockerfile, we’ll add a comment at the bottom for our own future reference:

# command to build the image:
    # docker build -t autogenstudio -f Dockerfile.base .
# command to run the container:
    # docker run -it --rm -p 8081:8081 --name autogenstudio autogenstudio
# Access AutoGen Studio at http://localhost:8081 make sure you don't click the 0.0.0.0:8081 link in the terminal, it won't work!

This is just a comment for future reference, so you don’t have to search the tutorial to find the commands to build and run the container. We’ll use and explain these commands in a second; for now, just copy them.

Building the Docker image and running the container

Your whole Dockerfile.base should now look like this:

FROM python:3.11-slim-bookworm

RUN apt-get update \
    && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        software-properties-common sudo\
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Setup a non-root user 'autogen' with sudo access
RUN adduser --disabled-password --gecos '' autogen
RUN adduser autogen sudo
RUN echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
USER autogen
WORKDIR /home/autogen

# Set environment variable
ENV PATH="/home/autogen/.local/bin:$PATH"
# Follow the = with your OpenAI API key (no quotes or anything, just OPENAI_API_KEY=sk-lotsOfLettersAndNumbers)
ENV OPENAI_API_KEY=

# Pre-load popular packages
RUN pip install pyautogen==0.2.8 autogenstudio==0.0.34a0 numpy pandas matplotlib seaborn scikit-learn requests urllib3 nltk pillow pytest beautifulsoup4

# Expose port
EXPOSE 8081

# Start Command for AutoGen Studio
CMD ["autogenstudio", "ui", "--host", "0.0.0.0", "--port", "8081"]

# command to build the image:
    # docker build -t autogenstudio -f Dockerfile.base .
# command to run the container:
    # docker run -it --rm -p 8081:8081 --name autogenstudio autogenstudio
# Access AutoGen Studio at http://localhost:8081 make sure you don't click the 0.0.0.0:8081 link in the terminal, it won't work!

Ok, so with that out of the way, first make sure that the Docker Desktop application is running, as we’ll need the Docker Engine process to be up and running. Then open up a terminal window and make sure you’re in the root project folder, in my case AUTOGEN:

admin@DirkMasterPC /c/Coding_Vault/AUTOGEN

Then run the following command to build the image:

docker build -t autogenstudio -f Dockerfile.base .

This will build the image and tag (-t) it with the name autogenstudio. The -f flag specifies the name of the Dockerfile to use, in this case Dockerfile.base. The . at the end specifies the build context, which is the current directory, as this is where our Dockerfile.base is located.

The Docker Image will be built and you will see the progress running through each of the steps in the Dockerfile. When it is done we can run the following command to start the container which will automatically run AutoGen Studio:

docker run -it --rm -p 8081:8081 --name autogenstudio autogenstudio

This works by running the docker run command, which creates a new container from the autogenstudio image that we just built. The -it flag specifies that we want to run the container in interactive mode, which means we can interact with the container via the terminal.

The --rm flag specifies that we want the container to be removed when it is stopped. The -p flag specifies that we want to map port 8081 from the container to port 8081 on our local machine. The --name flag specifies the name of the container, in this case, autogenstudio.

Now bring up your Docker Desktop application and open up the Containers tab (topmost icon in the left menu bar):

We can see that our container by the name autogenstudio, based on the image autogenstudio is running on port 8081! Hooray!

Victory!

You can either click on the blue 8081:8081 link to open up AutoGen Studio in your browser, or just open up your browser and go to http://localhost:8081. There is also a link in your terminal window that says that Uvicorn is running on http://0.0.0.0:8081. Do not click this link: 0.0.0.0 is the address the server binds to inside the Docker container (meaning “all network interfaces”), not an address your browser on the host can reliably connect to, so it will most likely not work.

Give yourself a pat on the back for a job well done! You’ve set up AutoGen Studio properly and inside a Docker container! One quick sidenote: when we wrote the Dockerfile.base, we hardcoded the OpenAI API key into the Dockerfile to keep things as simple as possible. We can of course avoid this, but it would complicate the tutorial further and I want to keep this one as low-code as possible.

Just make sure you do not share the Dockerfile.base without first removing your API key from it, as anyone with your API key can obviously use OpenAI’s API on your credit. Your Docker image also has the API key hardcoded into it. There is no reason you’d ever really want to share the image though; just don’t leave copies behind on a public computer or something.
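For the curious, one possible way to avoid the hardcoded key is to hand it to the container at run time with Docker’s -e flag. The sketch below is not what this tutorial uses; it assumes the key is already set as OPENAI_API_KEY in the shell you start Docker from, and the helper name docker_run_args is made up for this illustration. It only builds and prints the command, so nothing is started by accident.

```python
import os
import shlex


def docker_run_args(image: str = "autogenstudio", port: int = 8081) -> list[str]:
    """Build the tutorial's `docker run` arguments, plus `-e OPENAI_API_KEY`.

    Hypothetical helper for illustration only.
    """
    return [
        "docker", "run", "-it", "--rm",
        "-p", f"{port}:{port}",
        # Naming a variable without a value forwards the host's current value.
        "-e", "OPENAI_API_KEY",
        "--name", image,
        image,
    ]


if __name__ == "__main__":
    if not os.environ.get("OPENAI_API_KEY"):
        print("OPENAI_API_KEY is not set in this shell.")
    # Print the command instead of running it.
    print(shlex.join(docker_run_args()))
```

A value passed this way takes precedence over an ENV line in the image, so the key would no longer need to live in Dockerfile.base.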

That’s it for part 1! I hope it wasn’t too difficult. From now on the fun part begins. I’ll see you in part 2 where we’ll dive straight into AutoGen Studio.

Go Back to the Full Course: Next Level Prompt Engineering with AutoGen Studio


OpenAI Whisper – Speeding Up or Outsourcing the Processing

Dirk van Meerveld — Thu, 25 Jan 2024 19:57:21 +0000

Course: This article is based on a lesson from our Finxter Academy Course Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper. Check it out for video lessons, GitHub, and a downloadable PDF course certificate with your name on it!

Hi and welcome back! In this part, we’re going to look at some alternatives to speed things up or outsource the processing to OpenAI’s servers altogether. First, we’ll look at faster-whisper at a basic level. If you’re not sure whether you want to use it, you can just follow along with this part and decide later whether to install it, as we’re only going to cover it quickly before moving on to the web API version for the rest of this part.

So what is faster-whisper? Faster-Whisper is a quicker version of OpenAI’s Whisper speech-to-text model. As OpenAI released the whisper model as open-source this has naturally allowed others to try to build on and optimize it further. It uses CTranslate2, a fast engine for Transformer models, and is up to 4 times faster and uses considerably less memory than the original openai/whisper while claiming to maintain the same accuracy. You can find the GitHub repository here.

You can use this for the same apps we have built so far, just as a faster version of the Whisper model, so we won’t be building a new app specifically for this, as it would get repetitive and I don’t want to waste your time! You just need some syntax changes to make your app work with faster-whisper instead of the original whisper model. So we’ll take a look at the basics of faster-whisper, let you decide if you want to use/implement it, and then move on to the web-API version.

Installing faster-whisper

Note: If you do not plan on using faster-whisper or are not quite sure, there is no point in going through the install procedures, and you can skip ahead a couple of minutes to the web-API version, or just watch/read along and decide later if you want to use it.

Basically, to install faster-whisper you just have to run the following command in your terminal:

pip install faster-whisper

To support GPU execution, you need to have the appropriate libraries for CUDA installed, which are cuBLAS and cuDNN. This can be the slightly trickier part of the install, and again I cannot really give you platform-specific instructions or help you with the specific troubleshooting if you run into challenges. As always in software development, if you’re lucky you won’t have any problems, and if you’re not, you spend some time on Google and Stack Overflow to find the solution. If you just want to run faster-whisper on your CPU, which will of course be slower but may not be a big deal for small-scale development on your own machine, you can skip the cuBLAS and cuDNN installs.

Using faster-whisper

So let’s give it a spin to see how it works! First create a new file in your project root directory called 4_faster_whisper.py:

FINX_WHISPER (project root folder)
    output_temp_files
    output_video
    styles
    test_audio_files
    utils
    1_basic_call_english_only.py
    1_multiple_languages.py
    2_whisper_pods.py
    3_subtitle_master.py
    4_faster_whisper.py   (new file)
    settings.py
    .env

And inside let’s start with our imports:

from faster_whisper import WhisperModel
from settings import TEST_AUDIO_DIR

model_size = "small"

We import the WhisperModel class from the faster_whisper package, and the TEST_AUDIO_DIR variable from our settings.py file, and then set a string variable to the value small. Like whisper, faster-whisper also comes with different sizes of models. Using the same naming convention we have tiny.en, base.en, small.en, and medium.en as our English-only models. For the multi-language models, we can choose between tiny, base, small, medium, or one of several versions of the full-size model, namely: large-v1, large-v2, large-v3, or large.

Next, we’ll create a new instance of the WhisperModel class, picking only one of the two options below:

model = WhisperModel(model_size, device="cpu", compute_type="int8")
# Choose only one of these, depending on if you're running on CPU or GPU (cuda). (I'll be using the second option)
model = WhisperModel(model_size, device="cuda", compute_type="float16")

More options are available, like running on cuda using int8_float16 or even using float32, see here for more details.
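If you would rather not edit the file every time you switch machines, the choice between the two options can be made by a small helper. This is only a sketch: the function name pick_device_and_compute_type is made up here, and it assumes you obtain the number of CUDA devices yourself (for instance from CTranslate2’s get_cuda_device_count(), if your installed version provides it).

```python
def pick_device_and_compute_type(cuda_device_count: int) -> tuple[str, str]:
    """Return (device, compute_type) matching the two options shown above.

    Hypothetical helper: the caller supplies the number of CUDA devices.
    """
    if cuda_device_count > 0:
        # GPU available: use half precision on CUDA.
        return "cuda", "float16"
    # No GPU: fall back to 8-bit integers on the CPU.
    return "cpu", "int8"


device, compute_type = pick_device_and_compute_type(0)
print(device, compute_type)
```

The two returned values can then be passed on as WhisperModel(model_size, device=device, compute_type=compute_type).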

The .transcribe method for faster-whisper is slightly different:

segments, info = model.transcribe(
    str(TEST_AUDIO_DIR / "dutch_long_repeat_file.mp3"),
    beam_size=5,
)

As you can see we get two returns when calling model.transcribe instead of the single dictionary output we had before. The first is a list of segments which contains the transcription. The second is a NamedTuple (a Tuple with named fields) which allows us to access information like the language (info.language), language probability (info.language_probability), etc. So let’s add some print statements to print the information and then the transcription itself to the console:

print(f"Detected language '{info.language}' with probability {info.language_probability}")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

The first print statement just has us access some of the properties of the info object we discussed. The second print statement loops over the list of segments, and for each segment it will print the segment’s start time, end time, and the text of the segment itself. The :.2f is a formatting string that tells Python to print the number with two decimal places, for example: 1.23 instead of 1.23456789.
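As a quick standalone illustration of the :.2f format specifier, here is the same formatting applied to made-up numbers. The format_segment helper is hypothetical and only mirrors the print statement above.

```python
def format_segment(start: float, end: float, text: str) -> str:
    """Format one segment the same way as the print statement above."""
    # :.2f rounds each number to two decimal places for display.
    return f"[{start:.2f}s -> {end:.2f}s] {text}"


print(format_segment(1.23456789, 3.04, "hello"))
# prints: [1.23s -> 3.04s] hello
```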

One interesting thing to note here though is that segments is not actually a list. Segments is a generator, which is a different type of iterable. What this means is that the segments will be generated when you request them and not beforehand. In other words, the transcription only begins when we iterate over the segments and not before. Calling .transcribe() on our model did not start the transcription as vanilla whisper did. You can either loop over the segments as we did above, or you can convert the generator to a list by calling list(segments).
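If generators are new to you, this small plain-Python sketch shows the same lazy behavior without faster-whisper. The fake_segments function is a stand-in invented for this example; it records in a log when each item is actually produced.

```python
from typing import Iterator


def fake_segments(log: list[str]) -> Iterator[str]:
    """Stand-in generator: records when each 'segment' is actually produced."""
    for text in ("first", "second", "third"):
        log.append(f"produced {text}")
        yield text


work_log: list[str] = []
segments = fake_segments(work_log)
print(work_log)  # still empty: creating the generator did no work yet

remaining = list(segments)  # iterating drives the generator to completion
print(work_log)  # now holds one entry per produced segment
```

Keep in mind that a generator can only be consumed once, so convert it to a list first if you need to loop over the segments more than one time.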

One of the nice things about this generator is that we can very easily see the live transcription and print it to the console while it is still generating, which is exactly what this code will do. So let’s run it and see what happens:

Estimating duration from bitrate, this may be inaccurate
Detected language 'nl' with probability 0.931703
[0.00s -> 3.04s]  Hoi allemaal, dit is weer een testbestandje.
[3.04s -> 6.88s]  Deze keer om te testen of de Nederlandse taal goed herkent gaat worden.
[6.88s -> 12.68s]  Hierna kunnen we ook proberen deze tekst te laten vertalen naar het Engels om te zien hoe goed dat gaat.
[12.68s -> 13.88s]  Ik ben benieuwd.
[13.88s -> 16.84s]  Hoi allemaal, dit is weer een testbestandje.
[16.84s -> 20.72s]  Deze keer om te testen of de Nederlandse taal goed herkent gaat worden.
[20.72s -> 26.48s]  Hierna kunnen we ook proberen deze tekst te laten vertalen naar het Engels om te zien hoe goed dat gaat.
[26.48s -> 27.68s]  Ik ben benieuwd.
[27.68s -> 30.72s]  Hoi allemaal, dit is weer een testbestandje.
[30.72s -> 34.60s]  Deze keer om te testen of de Nederlandse taal goed herkent gaat worden.
[34.60s -> 40.36s]  Hierna kunnen we ook proberen deze tekst te laten vertalen naar het Engels om te zien hoe goed dat gaat.
[40.36s -> 41.52s]  Ik ben benieuwd.

You can see the output streaming to the console as the model transcribes. Unless you run on CPU, you will also notice a pretty good speed. Now as you’re probably not Dutch, I’ll just tell you the transcription above is perfect except for the one small (herkent/herkend) issue we had before, but as you know this can be fixed by loading a larger model size.

Play around with any audio file you want and see what model size you need. If you use English files pick a .en model for greater efficiency. Also be aware that you can pass in options into the .transcribe method much like the vanilla whisper model, for instance:

segments, info = model.transcribe(
    str(TEST_AUDIO_DIR / "dutch_long_repeat_file.mp3"),
    beam_size=5,
    # word_timestamps=True,  # uncomment this line to get word timestamps
    # without_timestamps=True,  # uncomment this line to get rid of timestamps and just transcribe
)

In conclusion, faster-whisper is a nice optimization to look into if you’re considering deploying this model in a production application somewhere. There are also other optimized versions of the whisper model out there that you can check out, like distil-whisper. Play around and see which gives you the best trade-offs between speed and accuracy. I’ll leave the rest up to you as we move on from faster-whisper to check out the web-API version.

Web-API version

Another option we have is to simply not deploy the model anywhere but outsource this to OpenAI’s fast servers. This is kind of like making a ChatGPT call except we request a transcription instead of a chat completion. The OpenAI servers are also very optimized for machine-learning calculations (obviously) and as you’ll see they are therefore quite fast!

So let’s take a look at the pricing first. The cost for using the Whisper API is $0.006 per minute transcribed, rounded to the nearest second. This means a 20-minute video would cost you $0.12. This is a good solution if you don’t want to deploy the model yourself, perhaps your application will only be used occasionally and it’s simply not worth it to invest that much into having a model running somewhere. For a high-use application dealing with longer files and many users, this is not the way to go though.
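The arithmetic behind that estimate can be sketched in a few lines. The price below is simply the figure quoted above, so check OpenAI’s pricing page for the current value; the function name whisper_api_cost is made up for this example.

```python
PRICE_PER_MINUTE_USD = 0.006  # figure quoted in this tutorial; may be outdated


def whisper_api_cost(duration_seconds: float) -> float:
    """Estimate the cost in USD for transcribing audio of the given length."""
    # Billing is described as rounded to the nearest second.
    billed_seconds = round(duration_seconds)
    return round(billed_seconds / 60 * PRICE_PER_MINUTE_USD, 4)


print(whisper_api_cost(20 * 60))  # a 20-minute video
# prints: 0.12
```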

So let’s take a quick look at how this would work practically, by building one last quick application, but this time using the web API. Our application will take any video in any language as input and will return a short quiz with questions about the video. First, create a new file in your utils folder named openai_api.py:

FINX_WHISPER (project root folder)
    output_temp_files
    output_video
    styles
    test_audio_files
    utils
        command.py
        openai_api.py   (new file)
        podcast.py
        subtitles.py
        video.py
    1_basic_call_english_only.py
    1_multiple_languages.py
    2_whisper_pods.py
    3_subtitle_master.py
    4_faster_whisper.py
    settings.py
    .env

Inside openai_api.py, let’s start with our imports and some basic setup:

import typing
from pathlib import Path

from decouple import config
from openai import OpenAI


CLIENT = OpenAI(api_key=str(config("OPENAI_API_KEY")))
MODEL = "whisper-1"

ResponseFormat = typing.Literal["text", "srt", "vtt"]

We’ll use typing to define our allowed response formats. The rest is all imports we have used before, config as we’ll need to load our API key and OpenAI to call the APIs for Whisper and ChatGPT. We create our CLIENT just like last time and we save the MODEL in a string variable, whisper-1 is the only option for the Whisper API for now.

Finally, we define a type alias named ResponseFormat which is a Literal type, which means it can only be one of the three strings we have defined, text, srt, or vtt. We can use this as a type hint later to indicate that if a particular variable is of type ResponseFormat then it should have one of these three values and nothing else. (json and verbose_json are also possible if you prefer JSON object output, but we will be skipping them as they are useless for our purposes.)
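Note that a Literal type hint is only checked by tools like your editor or a type checker, not by Python itself when the program runs. If you also want a runtime check, one option is typing.get_args, as in this small sketch. The is_valid_response_format helper is made up for illustration, and the alias is repeated so the snippet runs on its own.

```python
import typing

ResponseFormat = typing.Literal["text", "srt", "vtt"]


def is_valid_response_format(value: str) -> bool:
    """Return True if `value` is one of the strings allowed by ResponseFormat."""
    # get_args returns the tuple of literal values: ("text", "srt", "vtt").
    return value in typing.get_args(ResponseFormat)


print(is_valid_response_format("srt"))
print(is_valid_response_format("json"))
```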

Now we’ll define our transcription utility function:

def transcribe(
    file: Path,
    language: str | None = None,
    translate: bool = False,
    response_format: ResponseFormat = "text",
) -> str:

    print("Transcribing file...")
    options = {
        "file": file,
        "model": MODEL,
        "response_format": response_format,
    }

    if translate:
        transcript = CLIENT.audio.translations.create(**options)
    else:
        if language:
            options["language"] = language
        transcript = CLIENT.audio.transcriptions.create(**options)

    if not isinstance(transcript, str):
        raise TypeError(
            f"Expected a string value to be returned, but got {type(transcript)} instead."
        )
    print(f"Transcription successful:\n{transcript[:100]}...")

    return transcript

We define a function called transcribe which takes a file of type Path, a language of type str or None, which defaults to None, in which case the API will try to detect the language automatically. We also have a translate boolean which defaults to False, and a response_format which has to be of type ResponseFormat, so one of the three values we defined in the type alias, and defaults to text. The function returns a string.

We print a message to indicate the transcription is starting and then create a dictionary named options in which we pass in some options that are needed for both a translation and a transcript call, so the shared options if you will. These are the file, model, and response_format. If the user requests a translation we call the CLIENT.audio.translations.create method, passing in the **options dictionary as arguments as is. If translate is False it must be a transcription. For transcriptions, we can add the language key to the options dictionary to specify the language, but if the user didn’t provide it we can leave it out and it will just take a bit longer to do the auto-detection. This time we call the CLIENT.audio.transcriptions.create method, again passing in the **options dictionary which optionally now contains the language key.

Finally, we check if the transcript is a string, and if not we raise a TypeError to indicate something went wrong, just to make sure the user is not requesting JSON from this endpoint, which is possible and would crash the rest of our code. Otherwise, we print a message to indicate the transcription was successful and return the transcript.
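If the **options pattern is unfamiliar, the sketch below shows the same idea in isolation: build a dictionary of shared arguments, add an optional key only when it is provided, and unpack it into a function call. Both fake_create and build_options are stand-ins invented for this example, so no API call is made.

```python
from typing import Any, Optional


def fake_create(**kwargs: Any) -> dict[str, Any]:
    """Stand-in for an API method: returns the keyword arguments it received."""
    return kwargs


def build_options(
    file: str,
    response_format: str = "text",
    language: Optional[str] = None,
) -> dict[str, Any]:
    """Collect the shared options and add `language` only when it is given."""
    options: dict[str, Any] = {
        "file": file,
        "model": "whisper-1",
        "response_format": response_format,
    }
    if language:
        options["language"] = language
    return options


# ** unpacks the dictionary into keyword arguments: file=..., model=..., etc.
print(fake_create(**build_options("audio.mp3")))
print(fake_create(**build_options("audio.mp3", language="nl")))
```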

Video to Quiz

As we’re going to be building a video-to-quiz app, we need one more utility function inside this openai_api.py file, which will take a transcript and generate some questions for us. Continue below the transcribe function:

PROMPT_SETUP = """You are a text-to-quiz app. The user will provide you a video transcription in textual format. You will generate a list of questions for the user to answer about this video. Depending on the length of the transcription, stick to a maximum of 5 questions. All questions should be solely about the video transcription content provided by the user and should be answerable by reading the transcription. Do not provide the answers, but only the questions. The transcription the user provides is based on a video, and may include timestamps, please ignore these timestamps and just treat it as one single transcription containing all the content in the video.
List and number each item on a separate line.
"""

from tenacity import retry, stop_after_attempt, stop_after_delay

First, we define a constant to hold the prompt setup instruction for ChatGPT. Just go ahead and copy mine. It’s a fairly basic setup that asks for questions related to the video so we can make a quiz tailor-made for the input video. We also import retry, stop_after_attempt, and stop_after_delay from the tenacity package. (Go ahead and move the tenacity imports line to the top of your file with the other imports instead of here in the middle.) We can use these to make our code a bit more robust when calling APIs or taking actions that do not have a 100% success rate. It’s fairly easy to use and I just want to show you that this tool is out there, you’ll see how it works in a second.

Let’s code up the function:

def text_to_quiz(text: str) -> str:
    print("Converting text to quiz...")
    messages = [
        {"role": "system", "content": PROMPT_SETUP},
        {"role": "user", "content": text},
    ]
    result = CLIENT.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=messages,
    )
    content = result.choices[0].message.content
    if content is None:  # Just a quick sanity check
        raise ValueError("There was an error while trying to generate the quiz.")
    print("Text to quiz conversion completed.")
    return content

Our function takes a string, which is the transcription, and returns a string as output. We create a list of messages with the first being the system message, holding our PROMPT_SETUP, and the second being the user message, which has the transcription as its content. We then call the CLIENT.chat.completions.create method, passing in the model and messages as arguments. We’ll use gpt-3.5-turbo-1106, which at the time of writing is one of the newest gpt-3.5 models out there and is frankly good enough. You can use gpt-4, but make sure you consider the cost: it is considerably more expensive and not really needed for this use case. If you’re worried about the lower maximum input size, or ‘context window’, of gpt-3.5, know that this model has a 16k-token context limit that can easily handle long video transcriptions, and most transcriptions are not as long as you might think.

We then access the content of the first choice’s message in the result object, which should hold our quiz. We do a quick sanity check to make sure we received a valid response, and then print a message to indicate the conversion was successful and return the content.

So that’s pretty simple, right? But what if we get no content back? Do we really want to just raise an error and give up immediately? Let’s use the tenacity library so we can try again in case of a failure. The only thing we have to change is to add the @retry decorator on the line above our function definition:

@retry(stop=stop_after_attempt(3) | stop_after_delay(60))
def text_to_quiz(text: str) -> str:
    print("Converting text to quiz...")
    messages = [
        {"role": "system", "content": PROMPT_SETUP},
        {"role": "user", "content": text},
    ]
    result = CLIENT.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=messages,
    )
    content = result.choices[0].message.content
    if content is None:  # Just a quick sanity check
        raise ValueError("There was an error while trying to generate the quiz.")
    print("Text to quiz conversion completed.")
    return content

And just like that, our function is set up to try up to three times or (|) for a max of 60 seconds, just in case the API call fails for some reason. Notice how easy it is to use the Tenacity library. This is not required but it’s a nice way to make your code more robust just in case.
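To make the "three attempts or 60 seconds" stop condition a bit more concrete, here is a minimal sketch of the same idea using only the standard library. This is not how Tenacity is implemented; retry_until and flaky_call are hypothetical names I made up for illustration, and flaky_call just pretends to be an API call that fails twice before succeeding.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_until(func: Callable[[], T], max_attempts: int = 3, max_seconds: float = 60.0) -> T:
    """Call func until it succeeds, max_attempts is reached, or max_seconds have passed."""
    start = time.monotonic()
    attempt = 0
    while True:
        attempt += 1
        try:
            return func()
        except Exception:
            out_of_attempts = attempt >= max_attempts
            out_of_time = time.monotonic() - start >= max_seconds
            # Mirrors the | in the decorator: stop as soon as EITHER limit is hit.
            if out_of_attempts or out_of_time:
                raise


calls = {"count": 0}


def flaky_call() -> str:
    """Hypothetical stand-in for an API call that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ValueError("no content returned")
    return "quiz text"


result = retry_until(flaky_call)
print(result, "after", calls["count"], "attempts")
```

One difference to be aware of: this sketch re-raises the original error when it gives up, whereas Tenacity by default wraps the last failure in its own RetryError unless you pass reraise=True to the decorator.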

Putting it all together

That’s our openai_api.py file done! Go ahead and save and close it. Now let’s create a new file in our project root directory called 4_vid_to_quiz.py to put it all together:

FINX_WHISPER (project root folder)
    output_temp_files
    output_video
    styles
    test_audio_files
    utils
        command.py
        openai_api.py
        podcast.py
        subtitles.py
        video.py
    1_basic_call_english_only.py
    1_multiple_languages.py
    2_whisper_pods.py
    3_subtitle_master.py
    4_faster_whisper.py
    4_vid_to_quiz.py   (new file)
    settings.py
    .env

Inside 4_vid_to_quiz.py let’s start with our imports:

import os
import uuid
from pathlib import Path

import gradio as gr

from settings import BASE_DIR, OUTPUT_TEMP_DIR, STYLES_DIR
from utils import openai_api, video


API_UPLOAD_LIMIT_BYTES = 26214400  # 25mb

We will use os to check the size of the file we will upload, as there is a size limit to the API. We have some imports you’ve seen before, and some of our directories from the settings file plus our openai_api and video utilities. We also define a constant API_UPLOAD_LIMIT_BYTES which is the maximum size of the file we can upload to the API, which is 25 MB.

Let’s start with a quick function to check if the file is not too big:

def check_upload_size(input_file: str) -> None:
    """Check the video file size is within the API upload limit."""
    input_file_size = os.path.getsize(input_file)
    if input_file_size > API_UPLOAD_LIMIT_BYTES:
        raise ValueError(
            f"File size of {input_file_size} bytes ({input_file_size / 1024 / 1024:.2f} MB) exceeds the API upload limit of {API_UPLOAD_LIMIT_BYTES} bytes ({API_UPLOAD_LIMIT_BYTES / 1024 / 1024:.2f} MB). Please use a shorter video or lower the audio quality settings."
        )

We take an input file path as a string and then use os.path.getsize to get the size of the file in bytes, and then check if it is larger than our API_UPLOAD_LIMIT_BYTES. If it is, we raise a ValueError to indicate the file is too large. The error message includes both the file size and the API upload limit, in bytes and in MB. That’s all there is to this function.
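If you want to convince yourself the check behaves as expected without hunting for a 25 MB file, a quick sketch like the one below can help. Note that the limit_bytes parameter and the is_within_limit helper are hypothetical additions purely for this experiment; they are not part of the app itself.

```python
import os
import tempfile

API_UPLOAD_LIMIT_BYTES = 26214400  # 25 MB, same constant as in our app


def check_upload_size(input_file: str, limit_bytes: int = API_UPLOAD_LIMIT_BYTES) -> None:
    """Same check as in the app; limit_bytes is a hypothetical extra parameter for testing."""
    input_file_size = os.path.getsize(input_file)
    if input_file_size > limit_bytes:
        raise ValueError(
            f"File size of {input_file_size} bytes exceeds the limit of {limit_bytes} bytes."
        )


def is_within_limit(size_bytes: int, limit_bytes: int) -> bool:
    """Write a throwaway file of size_bytes and report whether it passes the check."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        file_path = os.path.join(tmp_dir, "dummy.mp3")
        with open(file_path, "wb") as file:
            file.write(b"\0" * size_bytes)
        try:
            check_upload_size(file_path, limit_bytes=limit_bytes)
            return True
        except ValueError:
            return False


print(is_within_limit(100, limit_bytes=1024))   # small file, passes the check
print(is_within_limit(2048, limit_bytes=1024))  # too large, the check raises
```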

Let’s move on to our main function:

def main(input_video: str) -> str:
    """Takes a video file as string path and returns a quiz as string."""
    unique_id = uuid.uuid4()

    mp3_file = video.to_mp3(
        input_video,
        log_directory=BASE_DIR,
        output_path=OUTPUT_TEMP_DIR / f"{unique_id}.mp3",
        mono=True,
    )

    check_upload_size(mp3_file)
    transcription = openai_api.transcribe(
        Path(mp3_file), language="en", translate=False, response_format="text"
    )

    quiz = openai_api.text_to_quiz(transcription)
    return quiz

This is the function the gradio button will call when clicked. It takes an input_video as string input and will return the quiz in string format. We don’t really care about the name of the mp3 file we’ll extract from the video here so we just use a uuid to make it unique. Now we use our video.to_mp3 utility function from the previous part to extract the audio from the video.

We pass in the input_video as the video file, our project root directory as the log_directory, and our output_path is the OUTPUT_TEMP_DIR with the uuid and .mp3 extension pasted on. Finally, this is the time to use the mono option we built into the to_mp3 function but didn’t use last time. So far the size of our files has not been that important, but now that we have a web API it suddenly becomes relevant.

Whisper down-mixes audio to mono before processing anyway, and the API has an upload limit of roughly 25MB per transcription request. So we can save a lot of space by dropping the channels to 1, from stereo to mono audio, which allows us to make much longer requests as we can drastically lower the bitrate with only 1 audio channel.

Sending stereo audio at 192kbps would exceed the file limit after a little under 20 minutes of audio. We more than halved the bitrate to 80kbps, which is still considered decent quality for mono mp3 files and allows us to transcribe much longer files. You can also try playing with the other audio quality settings or lower the bitrate even further to 64kbps for mono if you want to go even further.
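If you’re curious where these durations come from, here is a back-of-the-envelope sketch. It assumes a constant bitrate and ignores any metadata or container overhead, so treat the results as rough upper bounds rather than exact limits. The max_minutes helper is a hypothetical name, not something from our utils.

```python
API_UPLOAD_LIMIT_BYTES = 26214400  # 25 MB


def max_minutes(bitrate_kbps: int, limit_bytes: int = API_UPLOAD_LIMIT_BYTES) -> float:
    """Estimate how many minutes of constant-bitrate audio fit under limit_bytes."""
    bytes_per_second = bitrate_kbps * 1000 / 8  # kilobits per second -> bytes per second
    return limit_bytes / bytes_per_second / 60


for kbps in (192, 80, 64):
    print(f"{kbps} kbps: about {max_minutes(kbps):.1f} minutes")
# 192 kbps: about 18.2 minutes
# 80 kbps: about 43.7 minutes
# 64 kbps: about 54.6 minutes
```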

After that, we run our check_upload_size check to make sure the file is not too large, and then we call our openai_api.transcribe function, passing in the mp3_file as the file, language="en" as the language, translate=False as we don’t want to translate, and response_format="text" as we want the transcription in text format. We then call our openai_api.text_to_quiz function, passing in the transcription as the text and returning the resulting quiz.

Gradio Interface

Finally, we’ll create our gradio interface:

if __name__ == "__main__":
    block = gr.Blocks(
        css=str(STYLES_DIR / "vid2quiz.css"),
        theme=gr.themes.Soft(primary_hue=gr.themes.colors.yellow),
    )

    with block:
        with gr.Group():
            gr.HTML(
                f"""
                
                
                
                """
            )
            with gr.Row():
                input_video = gr.Video(
                    label="Input Video", sources=["upload"], mirror_webcam=False
                )
                output_quiz_text = gr.Textbox(label="Quiz")
            with gr.Row():
                button_text = " Make a quiz about this video! "
                btn = gr.Button(value=button_text, elem_classes=["button-row"])

            btn.click(main, inputs=[input_video], outputs=[output_quiz_text])

    block.launch(debug=True)

All of this will be familiar by now. I just used a different CSS file, which we’ll have to create, and a slightly different primary_hue for the theme than last time. The ‘imgur’ image link has changed as well to give you a new header logo, and below that we just take an input video and have an output Textbox. Our button has a CSS class of button-row again so we can style it, and clicking the button runs the main function with the input video, sending the output to the output textbox.

Let’s add the CSS file to our styles folder:

FINX_WHISPER (project root folder)
    output_temp_files
    output_video
    styles
        subtitle_master.css
        vid2quiz.css      (new file)
        whisper_pods.css
    test_audio_files
    utils
        command.py
        openai_api.py
        podcast.py
        subtitles.py
        video.py
    1_basic_call_english_only.py
    1_multiple_languages.py
    2_whisper_pods.py
    3_subtitle_master.py
    4_faster_whisper.py
    4_vid_to_quiz.py
    settings.py
    .env

And inside vid2quiz.css let’s add the following:

.header {
  display: flex;
  justify-content: center;
  align-items: center;
  padding: 2em 8em;
}

.header-img {
  max-width: 50%;
}

.header,
.button-row {
  background-color: #0c1d36;
}

We use flex to center the header image vertically and horizontally and apply the usual padding. We give the header-img class a max-width of 50% so it doesn’t take up the entire width of the screen. Finally, we give the header and button-row classes a background color of #0c1d36 which is a dark blue color.

Ok, you know the drill, let’s run it and see what happens!

Ok, looking good, so let’s upload a video and then request a quiz about it. I used a random video from YouTube, namely Hot Dr Pepper from the 1960s, just because it showed up when I opened the YouTube website. Let’s see how it does:

Perfect, exactly what we wanted, and this was all powered by the OpenAI API! You’ll also notice it was probably reasonably fast, considering it had to convert the whole video and then transcribe it and generate a quiz.

One important limitation of the app in this particular form is that it can only handle videos up to roughly 40 to 45 minutes in length (with the 80kbps mono settings), because of the upload limit. If you want to handle longer videos you could split them up and put the transcripts back together, but honestly, if you’re going to be handling files of that length you’re probably better off deploying the model yourself to save cost, as the API is billed per minute of audio.

A fun idea is that you can also use the translate option of our openai_api.transcribe function to have foreign language videos as input and then English questions about the foreign language video as output. This could be cool for a foreign language learning app or test.

So that’s it for the whisper course. I hope you enjoyed it and now have a good idea of how to use Whisper, what you can use it for, and the various deployment options. The next step is up to you and limited only by your imagination!

As always, it was an honor and a pleasure to take this journey together, and I hope to see you next time!

Full Course: OpenAI Whisper – Building Cutting-Edge Python Apps with OpenAI Whisper

Check out our full OpenAI Whisper course with video lessons, easy explanations, GitHub, and a downloadable PDF certificate to prove your speech processing skills to your employer and freelancing clients:

[Academy] Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper

The post OpenAI Whisper – Speeding Up or Outsourcing the Processing appeared first on Be on the Right Side of Change.

OpenAI Whisper Example – Building a Subtitle Generator & Embedder

Dirk van Meerveld — Thu, 25 Jan 2024 19:57:05 +0000

Welcome back to part 3, where we’ll use Whisper to build another really cool app. In this part, we’ll look at how to work with video files. After all, many of the practical applications of speech recognition don’t come in convenient MP3 files, but rather in video files. We’ll be building a subtitle generator and embedder, which will take a video file as input, transcribe it, and then embed the subtitles into the video file itself, feeding the result back to the end user.

Before we can get started on the main code, we will need to write some utilities again, just like in the previous part. The utilities we’ll need this time are:

Subtitles -> We can just reuse the subtitle-to-disk utility from the previous part. (Done)
Video -> We will need a way to convert a video file to an mp3 file so that we can feed it to Whisper.
Commands -> We will need a way to run commands on the command line, as there are multiple ffmpeg commands we’ll need to run both for the video conversion and the subtitle embedding.

So let’s get started with the command utility. Inside the utils folder, first create a new file named command.py:

FINX_WHISPER (project root folder)
    output_temp_files
    output_video
    styles
    test_audio_files
    utils
        __init__.py
        podcast.py
        subtitles.py
        command.py   (new file)
    1_basic_call_english_only.py
    1_multiple_languages.py
    2_whisper_pods.py
    settings.py
    .env

Then inside the command.py file let’s start with our imports:

import datetime
import subprocess
from pathlib import Path

We’re going to run commands and provide some very basic logging as well. We imported the datetime module so we can add timestamps to our logs, and pathlib should be familiar by now. The subprocess module in Python is used to spawn new processes, connect to their input/output/error pipes, and obtain their return codes. It allows you to execute system commands and interact with them programmatically. It’s basically a bit like opening a terminal window inside your Python code.

Next, we’ll start with an extremely simple function that will print a message but in blue letters:

def print_blue(message: str) -> None:
    print(f"\033[94m{message}\033[00m")

The \033[94m and \033[00m are ANSI escape codes, which are used to add color and formatting to text in terminal output. The 94 is the code for blue, and the 00 is the code for reset. You can find a list of all the codes here: https://en.wikipedia.org/wiki/ANSI_escape_code#Colors. We will print the commands we execute to the terminal in blue, which helps them stand out from the other white text output and makes it easier for us to check our commands.
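If you want more than just blue, the same trick generalizes easily. Below is a small sketch with a hypothetical colorize helper and a handful of the bright foreground codes from that table; the names are my own and not part of our command.py file.

```python
ANSI_CODES = {"red": 91, "green": 92, "yellow": 93, "blue": 94}
RESET = "\033[0m"  # resets all colors and formatting


def colorize(message: str, color: str = "blue") -> str:
    """Wrap message in the ANSI escape codes for a bright foreground color."""
    return f"\033[{ANSI_CODES[color]}m{message}{RESET}"


print(colorize("Running command...", "blue"))
print(colorize("Something went wrong", "red"))
```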

Running system commands

Next, we’ll create a function that will run a command like you would run on the command line:

def run_and_log(command: str, log_directory: Path) -> None:
    print_blue(f"Running command: \n{command}")
    with open(log_directory / "commands_log.txt", "a+", encoding="utf-8") as file:
        subprocess.call(
            command,
            stdout=file,
            stderr=file,
        )
        file.write(
            f"\nRan command: {command}\nDate/time: {datetime.datetime.now()}\n\n\n\n"
        )

We create a function called run_and_log, which takes two arguments: command which is a string, and log_directory which is a Path and indicates the directory where we want to save the log file. We then print the command we’re about to execute in blue, and then open the log file in append mode. The a+ means that we will append to the file if it exists, and create it if it doesn’t. Again, we use the encoding="utf-8" argument to make sure that we can write non-ASCII characters to the file as well. If you do not do this you will eventually run into trouble.

Inside the with open context manager, so while the file is open, we call the subprocess.call function. This function takes a command as input and executes it, so as the first argument we pass the command variable. The second argument is stdout=file, which means that we will write the output of the command to the file (instead of the console). The third argument is stderr=file, which means that we will write any errors to the file as well. So we basically execute the command and whatever output there is gets logged inside the text file.

After that, we write what command we executed and a timestamp to the file, and use a couple of \n to add some newlines to the file so that the next command will be lower down, making them easy to distinguish from each other.
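One thing to be aware of: passing the whole command as a single string works on Windows, but on macOS and Linux subprocess will look for a program with that literal name unless you also pass shell=True. A more portable option is to pass the command as a list of arguments. The sketch below shows this with a hypothetical run_and_log_args variant; it uses sys.executable (the Python interpreter that is currently running) instead of echo, purely so that the example runs the same on every system.

```python
import datetime
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_log_args(args: list[str], log_directory: Path) -> int:
    """Hypothetical variant of run_and_log that takes a list of arguments."""
    log_file = log_directory / "commands_log.txt"
    with open(log_file, "a+", encoding="utf-8") as file:
        # A list is handed straight to the program on every OS, no shell involved.
        return_code = subprocess.call(args, stdout=file, stderr=file)
        file.write(f"\nRan command: {args}\nDate/time: {datetime.datetime.now()}\n\n\n\n")
    return return_code


with tempfile.TemporaryDirectory() as tmp_dir:
    # Log into a throwaway directory so we don't leave files behind.
    code = run_and_log_args([sys.executable, "-c", "print('hello')"], Path(tmp_dir))
    log_text = (Path(tmp_dir) / "commands_log.txt").read_text(encoding="utf-8")

print(code)
print(log_text)
```

The rest of this tutorial keeps using the string version, so only switch to something like this if the string form gives you trouble on your system.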

Now let’s run a quick test, using the extremely simple terminal command echo 'hello', which will simply print hello to the console. Let’s run this command and see if our function works:

run_and_log("echo 'hello'", Path.cwd())

For the path we’ve used the Path.cwd() method in Python’s pathlib module which returns the current working directory as a Path object. This is the terminal’s current directory when you run the script. (This is just for a quick test, we don’t want to go through the trouble of importing the base directory in here)

Go ahead and run the command.py file, and whatever directory your terminal was in when you ran the script should now have a file named commands_log.txt with the following inside:

hello

Ran command: echo 'hello'
Date/time: 2024-01-14 12:13:49.535692

It worked! We’ve successfully logged the output of hello followed by our logging information of the time and command executed. Make sure you remove or comment out the run_and_log line before we continue, as we don’t want to run this command every time we run the script.

# run_and_log("echo 'hello'", Path.cwd())

A peculiar issue with slashes

With our run_and_log function completed, we have just one more function to create in here. There is a small discrepancy in file path formats: ffmpeg expects the subtitle path inside its system command in a different format than the one our Python code gives us. So we need to write a short utility to fix the path. This issue only occurs with the subtitle path when trying to embed the subtitles using ffmpeg system commands, most likely because that path is passed inside an ffmpeg filter argument, where characters like : and \ have a special meaning. This is the type of thing you will run into during your software development journey.

If you keep looking you’ll always find a solution, never despair, but I’ll save you this time and tell you about the issue ahead of time!

The path C:\Users\dirk\test/subtitle.vtt will not work in the command and will give errors, as it gets messed up and can then no longer be parsed as a valid path.
What we need is C\:\\Users\\dirk\\test\\subtitle.vtt instead. Notice there is an extra \ after the C and after every \ in the path, and the / has become \\ as well. The first \ is an escape character, which means that the second \ is not interpreted as a special character but as a literal \.
This issue only affects the subtitle path and not the input or output video paths, so we only need to fix the subtitle path.

Below the run_and_log function inside the command.py file, add a new function:

def format_ffmpeg_filepath(path: Path) -> str:
    # The r prefix keeps Python from reading \U as the start of a unicode escape.
    r"""Turns C:\Users\dirk\test/subtitle.vtt into C\:\\Users\\dirk\\test\\subtitle.vtt"""
    string_path = str(path)
    return string_path.replace("\\", "\\\\").replace("/", "\\\\").replace(":", "\\:")

We take a Path as input, and then first convert it to a string so we can use string methods on it to fix the format. We then use the replace method to replace all the \ with \\ and all the / with \\. We also replace the : with \:. Now I see you looking mighty confused! Why so many backslashes? Well, inside a Python string the backslash is the escape character, so to get one literal backslash character we have to type two.

So in order to replace \ we need to target it using "\\", as the first backslash escapes the second one. A single \ won’t work, as Python would read it as the start of an escape sequence instead of a literal backslash.
Likewise, to replace it with \\ we need to type "\\\\", as each literal backslash needs another backslash in front of it to escape it.
So the above function just means that \ is replaced by \\, / is replaced by \\, and : is replaced by \:. It just looks so confusing because of all the extra escape characters, which also happen to be backslashes! Phew.
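If you’d like to see the function in action before moving on, here is a small sketch you can run on any operating system. I’ve used PureWindowsPath for the example (and PurePath as the type hint) so that the Windows-style path is handled the same way even if you’re on macOS or Linux; that choice is mine for the demo and not part of our command.py file.

```python
from pathlib import PurePath, PureWindowsPath


def format_ffmpeg_filepath(path: PurePath) -> str:
    r"""Turns C:\Users\dirk\test/subtitle.vtt into C\:\\Users\\dirk\\test\\subtitle.vtt"""
    string_path = str(path)
    return string_path.replace("\\", "\\\\").replace("/", "\\\\").replace(":", "\\:")


# PureWindowsPath lets us try a Windows-style path on any operating system.
example = PureWindowsPath(r"C:\Users\dirk\test/subtitle.vtt")
print(str(example))                     # C:\Users\dirk\test\subtitle.vtt
print(format_ffmpeg_filepath(example))  # C\:\\Users\\dirk\\test\\subtitle.vtt
```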

Video utility functions

Okay so with that out of the way, go ahead and save and close the command.py file. It’s time for our video utility file next, so create a new file called video.py inside the utils folder:

FINX_WHISPER (project root folder)
    output_temp_files
    output_video
    styles
    test_audio_files
    utils
        __init__.py
        podcast.py
        subtitles.py
        command.py
        video.py   (new file)
    1_basic_call_english_only.py
    1_multiple_languages.py
    2_whisper_pods.py
    settings.py
    .env

Don’t worry, this one won’t be so bad! Open up your new video.py file and let’s start with our imports:

from pathlib import Path
from . import command

All we need is Path for input argument type-hinting and the command module we just created. Next, we’ll create a function that will convert a video file to an mp3 file so it can be fed to Whisper:

def to_mp3(
    input_video: str, log_directory: Path, output_path: Path, mono: bool = False
) -> str:
    output_path_string = str(output_path)

    channels = 1 if mono else 2
    bitrate = 80 if mono else 192

    command_to_run = f'ffmpeg -i "{input_video}" -vn -ar 44100 -ac {channels} -b:a {bitrate}k "{output_path_string}"'
    command.run_and_log(command_to_run, log_directory)
    print(f"Video converted to mp3 and saved to {output_path_string}")

    return output_path_string

We define a function named to_mp3 which takes an input_video as a string, a log_directory as a Path, an output_path as a Path, and a mono option as a boolean. The function returns a string in the end, which holds the output path. The input_video path is a string because gradio will feed it to us, which is why it is not a Path object like the log_directory and output_path. Make sure you always keep track of what type all the variables are or you will run into trouble eventually passing in a Path object where a string is expected, or vice versa.

First, we get a string version of the output_path and save it in output_path_string. Then we check if the mono option is set to True or False, and set the channels and bitrate variables accordingly. If mono is True we set channels to 1 and bitrate to 80, and if mono is False we set channels to 2 and bitrate to 192. We won’t actually need this mono option until part 4, but we might as well add it now.

Then we get to the command, first preparing it in a variable named command_to_run. We use the ffmpeg command and pass in the input_video as the input file (-i). We then use the -vn option to disable video recording, the -ar option to set the audio sampling frequency to 44100 Hz, the -ac option to set the number of audio channels to channels, and the -b:a option to set the audio bitrate to bitrate kbps. We then pass in the output_path_string as the output file location.

Notice that the command is contained inside an f-string which has single quotes on the outside (f'command'). Make sure you imitate this perfectly, using the single quotes on the outside and the double quotes around the variable names of "{input_video}" and "{output_path_string}". We need these double quotes because the user input video file is likely to have spaces in the name, and not having double quotes around a name with spaces inside will cause the command to fail.

Then we call the run_and_log function from our command module, passing in the command and the directory we want to log to, printing a message to the console, and returning the output_path_string.

That completes our video.py file, go ahead and save and close it. We’re ready to start on the main code now!

Subtitle Master – Putting it all together

In your root folder, create a new file named 3_subtitle_master.py:

FINX_WHISPER (project root folder)
    output_temp_files
    output_video
    styles
    test_audio_files
    utils
        __init__.py
        podcast.py
        subtitles.py
        command.py
        video.py
    1_basic_call_english_only.py
    1_multiple_languages.py
    2_whisper_pods.py
    3_subtitle_master.py   (new file)
    settings.py
    .env

Inside, let’s start with our imports:

import os
import uuid

import gradio as gr
import whisper
from whisper.utils import WriteVTT

from settings import BASE_DIR, OUTPUT_TEMP_DIR, OUTPUT_VIDEO_DIR, STYLES_DIR
from utils import command, subtitles, video

We import os to do some filename splitting, and all the other imports are familiar from previous parts. To finish up we import several directories from our settings file and the command, subtitles, and video modules from our utils folder, reusing the subtitles module from the previous part.

Next up are our constants for the file:

MODEL = whisper.load_model("base.en")
VTT_WRITER = WriteVTT(output_dir=str(OUTPUT_TEMP_DIR))

We just load up a model, I’ll start with base.en as it will probably be good enough to get started. Then we instantiate a WriteVTT object like we did last time, indicating we want to save the subtitles in the temp directory.

As we are going to be returning a video to the end user this time, I would like to include the original video name in the output file, though we’ll still need a uuid as well to guarantee unique names (the user might upload the same file twice!). So let’s create a quick function that gets us a unique project name. Say the user inputs a file named my_video.mp4, we want the function to return my_video_0f646333-0464-43a1-a75c-ed57c47fbcd5 so that we basically have a uuid with the filename in front of it. We can then add .mp3 or .srt or whatever file extension we need at the end, making sure all the files for this project have the same but unique project name.

def get_unique_project_name(input_video: str) -> str:
    """Get a unique subtitle-master project name to avoid file-name clashes."""
    unique_id = uuid.uuid4()
    filename = os.path.basename(input_video)
    base_fname, _ = os.path.splitext(filename)
    return f"{base_fname}_{unique_id}"

The function takes the input path as a string and then generates a uuid. We then get the filename using os.path.basename, which takes a path like C:\Users\dirk\test\my_video.mp4 and returns my_video.mp4. We then use os.path.splitext to split the filename into a base filename and an extension, so my_video.mp4 becomes my_video and .mp4. We catch the base name as base_fname and the extension under the variable name _ as we don’t need it. We then return the base filename with the uuid appended to it.
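Here is a quick sketch to try the function on its own. The uploads/my_video.mp4 path is just a made-up example, and the uuid part will of course be different every time you run it.

```python
import os
import uuid


def get_unique_project_name(input_video: str) -> str:
    """Get a unique subtitle-master project name to avoid file-name clashes."""
    unique_id = uuid.uuid4()
    filename = os.path.basename(input_video)
    base_fname, _ = os.path.splitext(filename)
    return f"{base_fname}_{unique_id}"


first_name = get_unique_project_name("uploads/my_video.mp4")
second_name = get_unique_project_name("uploads/my_video.mp4")

print(first_name)                 # my_video_ followed by a random uuid
print(first_name != second_name)  # same input file, different project names
```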

Now let’s get started on our main function below that will tie it all together:

def main(input_video: str) -> str:
    """Takes a video file as string path and returns a video file with subtitles embedded as string path."""
    unique_project_name = get_unique_project_name(input_video)
    get_temp_output_path = lambda ext: OUTPUT_TEMP_DIR / f"{unique_project_name}{ext}"
    mp3_file = video.to_mp3(
        input_video,
        log_directory=BASE_DIR,
        output_path=get_temp_output_path(".mp3"),
    )

We’ll take an input video, which gradio will pass to our main function as a string path. The function will return a string path pointing towards the processed video file with embedded subtitles back to gradio. First, we get a unique project name using the function we just wrote. Then we create a simple lambda function like the one we had in part 2. It takes an extension like .mp3 as input and returns output_dir/project_name.mp3. We’ll need temporary file paths for both our .mp3 and our .vtt files, and this way we only have one place to change if we ever need to change the output directory.

Then we call the to_mp3 function from our video module, passing in the input video, the project’s base directory as the log directory, and the output path as the get_temp_output_path lambda function with .mp3 as the extension. We save the return of the function as the variable named mp3_file.

Continuing on:

def main(input_video: str) -> str:
    ...previous code...

    whisper_output = MODEL.transcribe(mp3_file, beam_size=5)
    vtt_subs = subtitles.write_to_file(
        whisper_output,
        writer=VTT_WRITER,
        output_path=get_temp_output_path(".vtt"),
    )

We call the transcribe method on our MODEL object, which holds the loaded Whisper model, passing in the mp3_file as the input file, and setting the beam_size to 5. We then call the write_to_file function from our subtitles module, passing in the whisper_output as the transcript, the VTT_WRITER as the writer, and the result of calling the get_temp_output_path lambda function with .vtt as the output path.

So what is this beam_size parameter? Well, it’s one of a number of possible parameters we can pass into the transcribe method. The beam_size parameter is the number of beams to use in the beam search. It basically refers to the number of different potential paths that will be explored, from which the most likely one is chosen. The higher the number, the more accurate the transcription tends to be, but the slower it will be as well. The whisper command line tool uses 5 by default, and I’ve found that this is a good balance between speed and accuracy. As far as I can tell, when calling transcribe from Python, beam search is only used if you pass beam_size in explicitly, which is one reason to do so here; the other is to make you aware of these parameters. Here are some of the other possible parameters:

temperature -> The higher the temperature, the more likely it is that the model will choose a less likely token. You can think of it in a similar way as the temperature setting you get with ChatGPT calls. Decoding starts at 0, which simply always picks the most likely predictions, and 0 is what we have been using so far.
beam_size -> The number of beams to use in the beam search. We just discussed this one above. It is only applicable when the temperature is 0.
best_of -> The number of candidates to sample, only for use with a nonzero temperature, which will generate more diverse (and possibly wrong) samples.
task -> Either transcribe or translate. We’ve used this one before and it defaults to transcribe.
language -> The language spoken in the audio. Defaults to None, which will perform a language detection first.
device -> The device to use for inference. This one is set when loading the model (whisper.load_model) rather than in transcribe. Defaults to cuda if you have a cuda enabled GPU, otherwise, it will default to cpu.
verbose -> Whether to print out the progress and debug messages. The command line tool defaults to True.

And there are more. For general use, you’ll probably do fine with the defaults most of the time, but be aware that you can tweak these parameters to get better results if you need to.

Back to our code, let’s continue:

def main(input_video: str) -> str:
    ...previous code...

    vtt_string_path = command.format_ffmpeg_filepath(vtt_subs)
    output_video_path = OUTPUT_VIDEO_DIR / f"{unique_project_name}_subs.mp4"
    embed_subs_into_vid_command = f'ffmpeg -i "{input_video}" -vf "subtitles=\'{vtt_string_path}\'" "{output_video_path}"'

    command.run_and_log(embed_subs_into_vid_command, log_directory=BASE_DIR)

    return str(output_video_path)

We need to run another ffmpeg system command to embed the subtitles we have created into our video file. We first get the vtt_string_path by passing the vtt_subs path we already have into that crazy function with all the \\\\ backslashes we called format_ffmpeg_filepath, remember? After that, we save our desired output video path in a variable by just combining our OUTPUT_VIDEO_DIR with the unique_project_name and pasting _subs.mp4 at the end for good measure.

Now we prepare the ffmpeg command we’re about to run in a separate variable for readability. We use the input_video as the input file (-i), and then use the -vf option to add a video filter. The video filter we use is subtitles and we pass in the vtt_string_path as the subtitle file. We then pass in the output_video_path as the output file.

Notice again that the whole command is inside single quotes ', inside of which we have path variables in double quotes " to avoid trouble if there are spaces in the filename. But we also have to pass in "subtitles='{vtt_string_path}'", which requires yet another level of quotes. Going back to plain single quotes ' would cause trouble, as we have already used these to open the string at the start, so we have to escape them using the backslash \' instead.
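
To see the quoting at work without running ffmpeg, here is a small sketch that builds the same kind of command string and prints it. The three paths are made-up examples that contain spaces on purpose; in the real code the subtitle path comes from format_ffmpeg_filepath, which may format it differently.

```python
# Hypothetical example paths, chosen to contain spaces on purpose.
input_video = "my videos/holiday clip.mp4"
vtt_string_path = "output_temp_files/holiday clip.vtt"
output_video_path = "output_video/holiday clip_subs.mp4"

# Outer single quotes delimit the Python string, double quotes wrap each path,
# and the escaped \' pair wraps the subtitle path inside the filter argument.
command = f'ffmpeg -i "{input_video}" -vf "subtitles=\'{vtt_string_path}\'" "{output_video_path}"'

print(command)
```

The printed command contains exactly two single quotes, both around the subtitle path, while every path with spaces stays safely inside double quotes.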

Then we call the run_and_log function from our command module, passing in the command we just wrote, and the BASE_DIR as the log directory. We then return the output_video_path as a string, as gradio doesn’t want a Path object.

The whole main function now looks like this:

def main(input_video: str) -> str:
    """Takes a video file as string path and returns a video file with subtitles embedded as string path."""
    unique_project_name = get_unique_project_name(input_video)
    get_temp_output_path = lambda ext: OUTPUT_TEMP_DIR / f"{unique_project_name}{ext}"
    mp3_file = video.to_mp3(
        input_video,
        log_directory=BASE_DIR,
        output_path=get_temp_output_path(".mp3"),
    )

    whisper_output = MODEL.transcribe(mp3_file, beam_size=5)
    vtt_subs = subtitles.write_to_file(
        whisper_output,
        writer=VTT_WRITER,
        output_path=get_temp_output_path(".vtt"),
    )

    vtt_string_path = command.format_ffmpeg_filepath(vtt_subs)
    output_video_path = OUTPUT_VIDEO_DIR / f"{unique_project_name}_subs.mp4"
    embed_subs_into_vid_command = f'ffmpeg -i "{input_video}" -vf "subtitles=\'{vtt_string_path}\'" "{output_video_path}"'

    command.run_and_log(embed_subs_into_vid_command, log_directory=BASE_DIR)

    return str(output_video_path)

Building the interface

Now all we need to do to run this is create another gradio interface. As you are already familiar with gradio now we’ll go through this one a bit more quickly, the principles are the same as last time. Below your main function, continue with:

if __name__ == "__main__":
    block = gr.Blocks(
        css=str(STYLES_DIR / "subtitle_master.css"),
        theme=gr.themes.Soft(primary_hue=gr.themes.colors.emerald),
    )

    with block:
        with gr.Group():
            gr.HTML(
                f"""
                
                
                
                """
            )
            with gr.Row():
                input_video = gr.Video(
                    label="Input Video", sources=["upload"], mirror_webcam=False
                )
                output_video = gr.Video()
            with gr.Row():
                button_text = " Subtitle my video! "
                btn = gr.Button(value=button_text, elem_classes=["button-row"])

            btn.click(main, inputs=[input_video], outputs=[output_video])

    block.launch(debug=True)

We use the if __name__ == "__main__": guard to make sure that the code inside only runs when we run the file directly. We create the gradio block object just like we did before, passing in a css file that doesn’t exist yet, but this time we also pass in a theme. I’ll pass in the gr.themes.Soft() which has a bit of a different style to it, and set the accent color to emerald by passing in primary_hue=gr.themes.colors.emerald when calling Soft(). This will match nicely with the logo I have prepared for you with this application.

Then we open the block object using the with statement, and open up a new Group inside of it, just like we did before, so we can build our block interface. The HTML object is the same as in the last part, except I changed the image link URL to give you a new logo for this app. Then we open up a new Row and add a Video object for the input video, passing in sources=["upload"] so that the user can upload a video file, and setting mirror_webcam=False as we don’t want to take the user’s webcam as input. Still on the same Row, so next to the input video, we declare another Video object for the output video file.

We then have a row that only has a button for which we provide a text and a class of button-row so we can target it with CSS. The btn.click declaration is a lot simpler this time as we just call the main function with only a single input of input_video and only one output of output_video. Finally, we call .launch on the block just like last time.

That’s our code done! You’re probably dying to run it, but wait! We have to create a quick CSS file to finish it off. Create a new file named subtitle_master.css inside the styles folder:

FINX_WHISPER (project root folder)
    output_temp_files
    output_video
    styles
        subtitle_master.css   (new file)
        whisper_pods.css
    test_audio_files
    utils
        __init__.py
        podcast.py
        subtitles.py
        command.py
        video.py
    1_basic_call_english_only.py
    1_multiple_languages.py
    2_whisper_pods.py
    3_subtitle_master.py
    settings.py
    .env

Inside we’ll just write some quick CSS styles:

.header {
  padding: 2em 8em;
}

.header,
.button-row {
  background-color: #1d366f7e;
}

We just gave the header class some padding to stop the logo image from being too large and then gave both the header and button-row classes a background color of #1d366f7e which is a nice dark blue half-transparent color. Save and close the file, and we’re ready to run! Go ahead and run the 3_subtitle_master.py file, and give it some time to load. Click the link in your terminal window again to open the interface in your browser, and you should see something like this:

Yours won’t have Korean in the input video box though, but whatever your computer’s language is set to. Go ahead and upload a video file, wait a second for it to load, and then press the subtitle my video button. This may take quite a while if you’re not on the fastest system with a powerful GPU, but you’ll see the commands and steps being executed in your terminal window just like we set up. Eventually, you’ll see the output video appear with the subtitles embedded, each one perfectly in time with the video, and you can play it back and download it!

You can check the commands_log.txt file in the root directory to see all the commands that were run, the output_temp_files folder to see the temporary files that were created during the process, and the output_video folder to see the final output video file. If you need some extra quality, switch to a larger model like small.en or medium.en.

Conclusion

That’s pretty awesome! An automatic subtitler that will subtitle any video for you all on its own. You can build on this, maybe by accepting YouTube links or adding translation functionality so you can have English subtitles on foreign language videos, which could be cool for language learning. Obviously, make sure you don’t use a .en model if you want to work with other languages.

To make a real production-grade application, use a front-end framework and show some kind of progress indicator or stream the live transcription to the page to stop the user getting bored, or allow them to do something else while the file processes in the background. A production app would also have to run on a server with good processing power and a GPU.

That’s it for part 3, I’ll see you soon in part 4 where we’ll look at ways to speed up Whisper or outsource the processing using the OpenAI API endpoint in the cloud. We’ll also build one more app using the cloud API to round off the series. See you there soon!

Full Course: OpenAI Whisper – Building Cutting-Edge Python Apps with OpenAI Whisper

[Academy] Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper

The post OpenAI Whisper Example – Building a Subtitle Generator & Embedder appeared first on Be on the Right Side of Change.

OpenAI Whisper Speech-to-Text (Course Overview)

Dirk van Meerveld — Thu, 25 Jan 2024 19:56:32 +0000

Hi and welcome to this tutorial series on the OpenAI Whisper speech-to-text model. Whisper is a very powerful automatic speech recognition system and in this series, we’re going to learn all about it and create cool projects along the way.

In part 1 we’ll take a look at the basics of setting up and using the Whisper library to transcribe audio files on your local computer.
In the next part, we are going to create a podcast application with a user interface where the user will be able to input any Google Podcasts link and they will get a transcript and summary of the podcast and even subtitle files for good measure.
In part 3 we’ll look at dealing with transcribing video files by creating an application where the user inputs any video file and the output will be that same video file but with subtitles embedded in it.
Finally, in the last part, we’ll take a look at alternatives, first looking at faster-whisper to speed things up, and then looking at using the Web-API version that runs in the cloud. We’ll create a final video-to-quiz application to show how the Web-API version works.

So I hope you’re excited to learn about Whisper and let’s get started!

The post OpenAI Whisper Speech-to-Text (Course Overview) appeared first on Be on the Right Side of Change.

OpenAI Whisper – Building a Podcast Transcribing App in Python

Dirk van Meerveld — Thu, 25 Jan 2024 19:56:17 +0000

Welcome back to part 2, where we’ll start practically applying our Whisper skills to build useful stuff. We obviously cannot rely on the user always giving us MP3 files to transcribe; they may want to just link a podcast, for example. Here, we’ll be building a real application that can transcribe podcasts to text or subtitle format by taking just a podcast link as input.

Before we get started on the main code, we’ll do some basic setup work and create the helper functions we need to run in our main code. Keeping things separated across multiple functions and files will help keep our code a lot more clean and readable compared to just having one big script that does everything at the same time.

Saving our constants to a separate file

First, there are a couple of settings we’ll be using again and again over the next three parts, namely the paths to the input and output folders for the mp3 files, subtitles, and whatever else we will be processing. Instead of importing pathlib in every single file and then writing BASE_DIR = Path(__file__).parent we’ll just write this in a separate file and import it everywhere we need it. This will also make it easier to change the paths later if we need to.

In your project folder create a new file called settings.py, making sure to put it in the root folder of your project:

FINX_WHISPER (project root folder)
    test_audio_files
    1_basic_call_english_only.py
    1_multiple_languages.py
    settings.py

In settings.py, write the following code:

from pathlib import Path

BASE_DIR = Path(__file__).parent
OUTPUT_TEMP_DIR = BASE_DIR / "output_temp_files"
OUTPUT_VIDEO_DIR = BASE_DIR / "output_video"
STYLES_DIR = BASE_DIR / "styles"
TEST_AUDIO_DIR = BASE_DIR / "test_audio_files"

We first get the root directory of the project using Path(__file__).parent, and then we create a few more paths relative to the root directory. We’ll use these paths in our main code to save the output files to the correct folders. Go ahead and also create empty folders for the output_temp_files, output_video, and styles folders, making sure to spell them correctly:
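
If you would rather create these folders from code than by hand, something along these lines should work. This is only a sketch: the function name create_project_dirs is an assumption for illustration, and the demo runs inside a temporary directory so that it does not touch your real project.

```python
import tempfile
from pathlib import Path


def create_project_dirs(base_dir: Path) -> list[Path]:
    """Create the project folders under base_dir if they do not exist yet."""
    names = ["output_temp_files", "output_video", "styles", "test_audio_files"]
    created = []
    for name in names:
        folder = base_dir / name
        # exist_ok=True makes the call safe to repeat on later runs.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


# Demo in a throwaway directory instead of the real project root.
with tempfile.TemporaryDirectory() as tmp:
    for folder in create_project_dirs(Path(tmp)):
        print(folder.name, folder.is_dir())
```

In the real project you could call the same function with BASE_DIR from settings.py.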

FINX_WHISPER (project root folder)
    output_temp_files     (new empty folder)
    output_video          (new empty folder)
    styles                (new empty folder)
    test_audio_files      (already existing folder)
    1_basic_call_english_only.py
    1_multiple_languages.py
    settings.py

That’s our folders and paths setup done. We can just import these variables to access the folders from any file in our project. There is one more setting we need to define, but we cannot hardcode this one in our source code. We need to get our API key for OpenAI, as we’ll be using some ChatGPT in this part of the course. You’ll also need your API key for later parts. Go to https://platform.openai.com/api-keys and copy your API key. If you don’t have one, make sure to get one. You’ll only pay for what you use which will be cents if you just play around with it casually. Then create a new file called .env in the root folder of your project:

FINX_WHISPER (project root folder)
    output_temp_files
    output_video
    styles
    test_audio_files
    1_basic_call_english_only.py
    1_multiple_languages.py
    settings.py
    .env                  (new file)

And paste your API key in there like this, making sure not to use any spaces or quotes:

OPENAI_API_KEY=your_api_key_here

Then go ahead and save and close this file.

Creating a utils folder for our helper functions

Now let’s create a new folder named utils to hold our helper functions, and then inside this new folder create an empty file called __init__.py:

FINX_WHISPER (project root folder)
    output_temp_files
    output_video
    styles
    test_audio_files
    utils                 (new folder)
        __init__.py       (new empty file)
    1_basic_call_english_only.py
    1_multiple_languages.py
    settings.py
    .env

The __init__.py file is required to make Python treat the utils folder as a package, which will allow us to import the functions from within our other files. You don’t need to write anything in this file, just create it and leave it empty.

Our first utils file will deal with the podcast-related functions, so create a file called podcast.py in the utils folder:

FINX_WHISPER (project root folder)
    output_temp_files
    output_video
    styles
    test_audio_files
    utils
        __init__.py
        podcast.py        (new file)
    1_basic_call_english_only.py
    1_multiple_languages.py
    settings.py
    .env

Inside podcast.py get started with our imports:

import re
import uuid
from pathlib import Path

import requests
from decouple import config
from openai import OpenAI

The re library deals with regular expressions and will help us find the podcast download link amongst the page text. The uuid library lets us generate unique IDs, pathlib is familiar to us by now, and requests will help us download the podcast mp3 file. decouple will help us read our API key from the .env file, and openai will help us use the OpenAI API. If you have not used decouple before, make sure you run the install command in your terminal:

pip install python-decouple

Back in podcast.py let’s create a few constants that we’ll be using in our functions:

GPT_MODEL = "gpt-3.5-turbo-1106"
CLIENT = OpenAI(api_key=str(config("OPENAI_API_KEY")))

First, we set the ChatGPT model we’ll be using to request a podcast summary later on. Then we create a CLIENT object that we’ll use to make requests to the OpenAI API. We pass in our API key as a string, and we use config to read the API key from the .env file. Note that config("OPENAI_API_KEY") already returns a string value here; the str() call surrounding it just makes that explicit, and a value that is already a string simply passes through unchanged.

Scraping the podcast download link from the podcast page

So what are some of the functions we’ll need in here? For this example application I will be using Google Podcasts as our podcast source. This means we will get an input link like this:
https://podcasts.google.com/feed/aHR0cDovL2ZlZWRzLmZlZWRidXJuZXIuY29tL1RFRF9BaGFfQnVzaW5lc3M/episode/ZW4udmlkZW8udGFsay50ZWQuY29tOjExMTk3MDo4MA?sa=X&ved=0CAgQuIEEahcKEwiIzMnavduDAxUAAAAAHQAAAAAQAQ

If you load this page in your browser, you will see an HTML page, with a play button. This is the kind of page link the user will input into our app, so first of all we will need a function to extract the .mp3 download link from this page’s HTML.

Let’s get started on a function to do exactly that:

def scrape_link_from_page(page_url: str) -> str:
    podcast_page = requests.get(page_url).text
    regex = r"(?P<url>\;https?://[^\s]+)"
    ...

We start by defining our function which takes the page_url as a string and will return a string value as well. Then we use requests to get the HTML page text by sending a GET request to the URL, much like your internet browser would if you type a URL in the address bar. Now we define a regular expression that will match the pattern of the download link we want to extract. We’ll use this regex to find the download link in the HTML page text. Here’s how it works:

(?P<url>...) This is a named group. The matched text can be retrieved by the name url. So basically the URL we find will be stored in a group called url.
\; This matches a literal semicolon character. The semicolon has no special meaning in regular expressions, so the backslash is not strictly required here, but escaping it is harmless. We match the semicolon because there is one in front of the https that we want to match for the URL we need. (This is just a characteristic of this particular podcast page, other pages might have different patterns.)
https? This matches either http or https. The s? means “match zero or one s characters”. This allows the regex to match both http and https.
:// This matches the string ://, which is part of the standard format for URLs.
[^\s]+ This matches one or more (+) of any character that is not (^) a whitespace (\s) character. So basically this will match any character that is not a space, tab, or newline character. This will match the rest of the URL we need and stop adding characters as soon as a space appears which indicates the end of the URL.

So, in simple terms, this regular expression matches a semicolon followed by a URL that starts with either http or https, and continues until a whitespace character is encountered. The URL is captured in a group named url.
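
Here is a quick way to try the pattern on its own, assuming the named group is written as (?P<url>...) as described above. The sample text is a made-up fragment; the real Google Podcasts HTML will look different.

```python
import re

regex = r"(?P<url>\;https?://[^\s]+)"

# Made-up fragment of page text, only here to exercise the pattern.
sample_text = 'data="0;https://download.example.com/talks/episode.mp3;2" class="player"'

match = re.search(regex, sample_text)
if match:
    # The named group holds everything from the semicolon up to the next whitespace.
    print(match.group("url"))
```

Note that the captured text still starts with the semicolon and may carry a few trailing characters up to the next whitespace, so it needs a little cleaning before it can be used as a URL.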

Now let’s complete our function:

def scrape_link_from_page(page_url: str) -> str:
    podcast_page = requests.get(page_url).text
    regex = r"(?P<url>\;https?://[^\s]+)"
    podcast_url_dirty = re.findall(regex, podcast_page)[0]
    podcast_url = podcast_url_dirty.split(";")[1]
    return podcast_url

So after we declared the regex pattern, we use re.findall to find all matches of the pattern in the podcast page text. This will return a list of matches, and we take the first match with [0]. This will return a string that looks something like this:

;https://download.ted.com/talks/etcetcetc;

Which is pretty good, we just need to get rid of the ; characters before and after the URL. We do this by splitting the string on the ; character, and then taking the second item in the list with [1]. This will return the clean URL we need: https://download.ted.com/talks/etcetcetc
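
One thing to keep in mind is that indexing with [0] raises an IndexError when the pattern finds nothing, for example if the page layout changes. A slightly more defensive variant could look like the sketch below. The function name extract_download_url and the error message are assumptions for illustration, and it works on text you already have rather than fetching a page.

```python
import re

REGEX = r"(?P<url>\;https?://[^\s]+)"


def extract_download_url(page_text: str) -> str:
    """Return the first download URL found in page_text, or raise ValueError."""
    matches = re.findall(REGEX, page_text)
    if not matches:
        raise ValueError("No download link found in the page text.")
    # Same cleaning step as in the tutorial: keep the part between the semicolons.
    return matches[0].split(";")[1]


sample_text = "something ;https://download.example.com/talks/episode.mp3; something"
print(extract_download_url(sample_text))
```

Raising a clear error here makes it easier to show the user a helpful message instead of an unexplained crash.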

Downloading the podcast mp3 file

Ok, so now our utils file has a function to scrape the download link. It stands to reason we’ll also need a function to download the mp3 file from the URL. Let’s get started on that:

def download(podcast_url: str, unique_id: uuid.UUID, output_dir: Path) -> Path:
    print("Downloading podcast...")
    podcast_audio = requests.get(podcast_url)
    save_location = output_dir / f"{unique_id}.mp3"
    ...

We define a function called download that takes 3 input arguments. The podcast_url is the URL we scraped from the podcast page as a string variable. The unique_id is a unique ID we’ll use to name the downloaded file, so we can avoid name clashes where files have the same name. This argument should be an instance of the UUID class from the uuid built-in Python library, which we’ll have a look at in a bit. The output_dir is the directory where we want to save the downloaded file as a Path object. Finally, our function will also return a Path object, which will be the path to the downloaded file.

We print a simple message to the console to show it is busy actually doing something, and then we use requests to download the podcast audio file by sending a GET request to the URL just like we did in the previous function. Then we create a save_location variable which is the path to the file we want to save. We use the output_dir argument as the parent directory, and then we use an f-string to create a filename that is the unique_id followed by the .mp3 extension.

Now let’s complete our function:

def download(podcast_url: str, unique_id: uuid.UUID, output_dir: Path) -> Path:
    print("Downloading podcast...")
    podcast_audio = requests.get(podcast_url)
    save_location = output_dir / f"{unique_id}.mp3"

    with open(save_location, "wb") as file:
        file.write(podcast_audio.content)
    print("Podcast successfully downloaded!")

    return save_location

We use the open function to open the save_location file in write binary (wb) mode, and we write the podcast_audio.content to the file. This will save the podcast audio file to the save_location path. Then we print a message to the console to show the download was successful, and we return the save_location path which points to the mp3 file we just downloaded, awesome!
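
If you want to check the file-saving part without downloading anything, you can feed it some bytes directly. In this sketch the helper save_audio is a hypothetical stand-in for the second half of download, and the bytes are obviously not a real mp3 file.

```python
import tempfile
import uuid
from pathlib import Path


def save_audio(content: bytes, unique_id: uuid.UUID, output_dir: Path) -> Path:
    """Write raw bytes to <output_dir>/<unique_id>.mp3 and return the path."""
    save_location = output_dir / f"{unique_id}.mp3"
    with open(save_location, "wb") as file:
        file.write(content)
    return save_location


# Demo in a throwaway directory with placeholder bytes.
with tempfile.TemporaryDirectory() as tmp:
    fake_audio = b"\x00\x01\x02 not real mp3 data"
    saved = save_audio(fake_audio, uuid.uuid4(), Path(tmp))
    print(saved.name, saved.stat().st_size, "bytes")
```

The UUID is turned into its usual hyphenated text form inside the f-string, which is what ends up as the file name.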

Getting a summary

Now there is one more function we need in our utils/podcast file. Besides just the transcription, we will also provide the user with a summary of the podcast. We’ll use ChatGPT to generate this summary, so we’ll need a simple function to do that. This one will be easy, so let’s just whip it up:

def get_summary(transcription: str) -> str:
    print("Summarizing podcast...")
    prompt = f"Summarize the following podcast into the most important points:\n\n{transcription}\n\nSummary:"

    response = CLIENT.chat.completions.create(
        model=GPT_MODEL, messages=[{"role": "user", "content": prompt}]
    )

    print("Podcast summarized!")
    summary = response.choices[0].message.content
    return summary if summary else "There was a problem generating the summary."

I assume you’re familiar with ChatGPT (if not, check out my other courses on the Finxter Academy!). We just have a simple function that takes the full transcription as a string and will return a summary as a string. We have a console print message again just to keep ourselves posted that it is doing some work and then we have a simple ChatGPT prompt.

Note the prompt ends with Summary: to prompt the model to start the summary right away without including any awkward introduction text, this is just a neat little trick you can use. We then use our CLIENT object to call the chat.completions.create endpoint, passing in the GPT_MODEL and a list of messages. We’ll just pass in the prompt as a user message. We then extract the summary from the response.choices[0].message.content. Just in case there was a problem and the summary is empty, we return a default message to inform the user.
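
To look at the exact prompt that gets sent without spending any API credits, the message list can be built separately. This is just a sketch; build_summary_messages is a made-up helper name and the transcription text is a placeholder.

```python
def build_summary_messages(transcription: str) -> list[dict[str, str]]:
    """Build the chat messages for a summary request (no API call is made)."""
    prompt = (
        "Summarize the following podcast into the most important points:"
        f"\n\n{transcription}\n\nSummary:"
    )
    return [{"role": "user", "content": prompt}]


messages = build_summary_messages("Today we talk about sleep and memory.")
print(messages[0]["content"])
```

The returned list has the same shape as the messages argument used in get_summary, so it could be passed to the chat.completions.create call as is.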

Subtitles

Awesome! Our podcast utils are done now. Let’s move on to the subtitles utils. This one will be a much shorter file with a function that will allow us to output the transcription in subtitle format, with timestamps and everything. So go ahead and create a new file called subtitles.py in the utils folder:

FINX_WHISPER (project root folder)
    output_temp_files
    output_video
    styles
    test_audio_files
    utils
        __init__.py
        podcast.py
        subtitles.py      (new file)
    1_basic_call_english_only.py
    1_multiple_languages.py
    settings.py
    .env

And inside subtitles.py get started with our imports:

from typing import Callable
from pathlib import Path

Both of these imports will be used solely to indicate the type of our function arguments (type hinting). We’ll use Callable to indicate that a function is expected as an argument, and we’ll use Path to indicate that a Path object is expected as an argument. This just makes our code clearer to read and easier to understand. Now let’s write our function, whose purpose will be to take a transcription done by Whisper and then convert it to a valid subtitle file:

def write_to_file(whisper_output: dict, writer: Callable, output_path: Path) -> Path:
    """Takes the whisper output, a writer function, and an output path, and writes subtitles to disk in the specified format."""
    with open(output_path, "w", encoding="utf-8") as sub_file:
        writer.write_result(result=whisper_output, file=sub_file)
        print(f"Subtitles generated and saved to {output_path}")

    return output_path

We take a whisper_output argument which is a dictionary containing the output Whisper gives us after we transcribe the podcast’s mp3 file. We also take a writer argument, which is a writer object (more on this below) that will write the subtitles to disk, and we type-hint it with Callable. Finally, we take an output_path argument which is a Path object pointing to the file we want to save the subtitles to. We then simply open the output path in write mode, calling the file sub_file. We then call the writer.write_result function, passing in the whisper_output and the open file to write the subtitles to. Finally, we print a message to the console to show the subtitles were generated successfully, and we return the output_path which is the path to the subtitle file we just created.

Two important things to note here:

When you open the subtitle file, make sure you use the encoding="utf-8" argument. For normal English characters you might think this is not needed. However, the AI likes to use ♪ symbols when music starts playing to make the subtitles more interesting, and on systems where the default encoding is not UTF-8 your program can crash with an encoding error when it tries to write them. Specifying utf-8 makes sure these special characters can be mapped and saved!
You might be wondering what this magical writer function is. Whisper actually comes with some utility functions that will allow us to write subtitles in correct formatting, like SRT or VTT. These utilities have a .write_result function which is what we’re calling in our code above. So we’ll be able to pass in a SRT-writer or a VTT-writer depending on what subtitle type we want to save.
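
To make the idea of a writer a bit more concrete, here is a small stand-in that you can run without Whisper installed. SimpleVTTWriter is written purely for illustration and is not Whisper's WriteVTT; it only assumes that the output dictionary has a segments list whose items carry start, end, and text, and the sample segments are made up.

```python
import tempfile
from pathlib import Path
from typing import TextIO


def format_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS.mmm, the timestamp style used in VTT files."""
    milliseconds = round(seconds * 1000)
    hours, milliseconds = divmod(milliseconds, 3_600_000)
    minutes, milliseconds = divmod(milliseconds, 60_000)
    secs, milliseconds = divmod(milliseconds, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{milliseconds:03d}"


class SimpleVTTWriter:
    """Stand-in writer for illustration only; this is not Whisper's WriteVTT."""

    def write_result(self, result: dict, file: TextIO) -> None:
        file.write("WEBVTT\n\n")
        for segment in result["segments"]:
            start = format_timestamp(segment["start"])
            end = format_timestamp(segment["end"])
            file.write(f"{start} --> {end}\n{segment['text'].strip()}\n\n")


def write_to_file(whisper_output: dict, writer, output_path: Path) -> Path:
    # utf-8 so that symbols such as the music note can be saved on any system.
    with open(output_path, "w", encoding="utf-8") as sub_file:
        writer.write_result(result=whisper_output, file=sub_file)
    return output_path


# Made-up output in the shape the stand-in writer expects.
fake_output = {
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " ♪ Intro music ♪"},
        {"start": 2.5, "end": 5.0, "text": " Welcome to the show."},
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    path = write_to_file(fake_output, SimpleVTTWriter(), Path(tmp) / "demo.vtt")
    vtt_text = path.read_text(encoding="utf-8")
    print(f"Wrote {len(vtt_text)} characters to {path.name}")
```

Because write_to_file only relies on the writer having a write_result method, any object with that method can be swapped in, which is what lets us pass either an SRT writer or a VTT writer.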

Ok, so that is all our utility functions done. Now let’s move on to the main code.

Installing gradio

Before we get started you’ll need to install gradio, so in your terminal window, run:

pip install gradio

What is gradio? Gradio is a Python library that allows us to quickly create user-friendly interfaces for testing, demonstrating, and debugging machine learning models. We’ll use gradio to create a UI for our app with just a few lines of code, and it supports a wide range of input and output types like video, audio, and text. Using this super simple framework we can keep the focus on whisper and not on building a user interface. It’s pretty self-explanatory, so you’ll understand the idea as we just code along.

Creating the main file

Now let’s get started on our main code, where mostly we’ll just have to call our utility functions and tie it all together, plus create a quick gradio interface to make it user-friendly. Create a new file called 2_whisper_pods.py in the root folder of your project:

FINX_WHISPER (project root folder)
    output_temp_files
    output_video
    styles
    test_audio_files
    utils
        __init__.py
        podcast.py
        subtitles.py
    1_basic_call_english_only.py
    1_multiple_languages.py
    2_whisper_pods.py   (new file)
    settings.py
    .env

And inside 2_whisper_pods.py get started with our imports:

import uuid
from pathlib import Path

import gradio as gr
import whisper
from whisper.utils import WriteSRT, WriteVTT

from settings import BASE_DIR, OUTPUT_TEMP_DIR, STYLES_DIR
from utils import podcast, subtitles

uuid is Python’s built-in library to generate unique id’s, pathlib is familiar to us by now, and gradio is the library we just installed. We also import whisper and two writer utilities from whisper.utils, which are the writer functions we talked about in the previous section. Then we import our directory Path constants from the settings and our podcast and subtitles utils. Now continue below the imports:

WHISPER_MODEL = whisper.load_model("base")
VTT_WRITER = WriteVTT(output_dir=str(OUTPUT_TEMP_DIR))
SRT_WRITER = WriteSRT(output_dir=str(OUTPUT_TEMP_DIR))

We load the WHISPER_MODEL from the base model, and we create two writer objects by creating instances of the WriteVTT and WriteSRT classes we imported from Whisper’s utilities, passing in the output_dir as a string.

Now let’s create a function to tie it all together:

def transcribe_and_summarize(page_link: str) -> tuple[str, str, str, str]:
    unique_id = uuid.uuid4()

    podcast_download_url = podcast.scrape_link_from_page(page_link)
    mp3_file: Path = podcast.download(podcast_download_url, unique_id, OUTPUT_TEMP_DIR)
    ...

We define a function called transcribe_and_summarize which takes a page_link as a string and will return a tuple so we can have multiple outputs to this function. These four outputs will feed back into the gradio interface we will create later and will be:

The podcast summary
The podcast transcription
The VTT subtitle file (path)
The SRT subtitle file (path)

We then create a new unique_id which we’ll use to name the downloaded mp3 file. Note we do this inside the function as we need a unique identifier for every single transcription run to avoid name clashes. Then we use our podcast.scrape_link_from_page util to scrape the download link from the podcast page, and we use our podcast.download function to download the podcast mp3 file, passing in the podcast_download_url, unique_id, and the OUTPUT_TEMP_DIR as arguments. We then catch the mp3 file path in a variable called mp3_file. Notice how easy everything is to read because we used logical and descriptive names for all our variables and utility functions and files.

Let’s continue with our function:

def transcribe_and_summarize(page_link: str) -> tuple[str, str, str, str]:
    ...previous code...

    whisper_output = WHISPER_MODEL.transcribe(str(mp3_file))
    with open(BASE_DIR / "pods_log.txt", "w", encoding="utf-8") as f:
        f.write(str(whisper_output))

    transcription = str(whisper_output["text"])
    summary = podcast.get_summary(transcription)

We call the .transcribe function by passing in the mp3_file path as a string. This will return a dictionary with the transcription and other information, which we catch in whisper_output. We then open a file called pods_log.txt in our root directory in write mode, and we write the whisper_output to the file. This is just for debugging purposes, so we can see what the output looks like (it’s too long to print to the console). We then extract the transcription from the whisper_output dictionary. Note that whisper_output["text"] is already a string; the reason we wrapped it inside a str() call is just to make it explicit that this is a string for typing purposes. This will not add any meaningful overhead or computing time, as values that are already a string will just pass through the str() function unaltered. Then we call our podcast.get_summary function, passing in the transcription as an argument.

Now we just need to write the subtitles to disk and return all the outputs. Continue on:

def transcribe_and_summarize(page_link: str) -> tuple[str, str, str, str]:
    ...previous code...

    get_sub_path = lambda ext: OUTPUT_TEMP_DIR / f"{unique_id}{ext}"
    vtt_subs = subtitles.write_to_file(whisper_output, VTT_WRITER, get_sub_path(".vtt"))
    srt_subs = subtitles.write_to_file(whisper_output, SRT_WRITER, get_sub_path(".srt"))

    return (summary, transcription, str(vtt_subs), str(srt_subs))

We create a lambda (nameless) function that takes a file extension as input and then returns the path to the subtitle file with that extension. For example, inputting .vtt will yield output_temp_files/unique_id.vtt, but giving it .srt will yield output_temp_files/unique_id.srt, just so we can avoid repeating the same code twice. Then we call our subtitles.write_to_file function twice, passing in the whisper_output, the VTT_WRITER and SRT_WRITER writer functions, and the get_sub_path lambda function to get the path to the subtitle file. We catch the output of these two functions in vtt_subs and srt_subs respectively. Finally, we return a tuple containing the summary, transcription, vtt_subs, and srt_subs to finish off our function.
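
If the lambda looks a bit abstract, this tiny sketch shows what it returns for each extension. OUTPUT_TEMP_DIR is redefined here as a plain relative path so the snippet runs on its own; in the real file it comes from settings.py.

```python
import uuid
from pathlib import Path

OUTPUT_TEMP_DIR = Path("output_temp_files")  # stand-in for the constant from settings.py
unique_id = uuid.uuid4()

# One small function builds both subtitle paths from the same unique id.
get_sub_path = lambda ext: OUTPUT_TEMP_DIR / f"{unique_id}{ext}"

vtt_path = get_sub_path(".vtt")
srt_path = get_sub_path(".srt")
print(vtt_path.name)
print(srt_path.name)
```

Both paths share the same folder and the same unique name, and only the extension differs.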

The whole thing now looks like this:

def transcribe_and_summarize(page_link: str) -> tuple[str, str, str, str]:
    unique_id = uuid.uuid4()

    podcast_download_url = podcast.scrape_link_from_page(page_link)
    mp3_file: Path = podcast.download(podcast_download_url, unique_id, OUTPUT_TEMP_DIR)

    whisper_output = WHISPER_MODEL.transcribe(str(mp3_file))
    with open(BASE_DIR / "pods_log.txt", "w", encoding="utf-8") as f:
        f.write(str(whisper_output))

    transcription = str(whisper_output["text"])
    summary = podcast.get_summary(transcription)

    get_sub_path = lambda ext: OUTPUT_TEMP_DIR / f"{unique_id}{ext}"
    vtt_subs = subtitles.write_to_file(whisper_output, VTT_WRITER, get_sub_path(".vtt"))
    srt_subs = subtitles.write_to_file(whisper_output, SRT_WRITER, get_sub_path(".srt"))

    return (summary, transcription, str(vtt_subs), str(srt_subs))

Creating the gradio interface

That’s all well and good, but a typical end user does not know how to use Python, and this function is not very user-friendly. So let’s create a quick gradio interface to make it easy for the user to use our app. Continue below the function:

if __name__ == "__main__":
    block = gr.Blocks(css=str(STYLES_DIR / "whisper_pods.css"))

    with block:
        with gr.Group():
            # Header

            # Input textbox for podcast link

            # Button to start transcription

            # Output elements

            # btn.click definition

    block.launch(debug=True)

This is going to be the basic structure of our gradio application. First, we use if __name__ == "__main__": to make sure the code inside this block only runs if we run this file directly, and not if we import it from another file. Then we create a block object by calling gr.Blocks and passing in the path to our whisper_pods.css file in the styles directory as a string. This will allow us to style our app with CSS, which we’ll do in a bit (this .css file doesn’t exist yet). Then we open a with block: block, and inside this block we open a with gr.Group(): block. This will allow us to group elements together in our app. Then we have a bunch of comments to indicate what we’ll be doing in each block, which we’ll fill in in a moment. Finally, we call block.launch to launch our app, passing in debug=True so we get extra feedback in the console if anything goes wrong.

The header will hold a logo image for our application. We’ll use HTML to load it from the internet. We can call gr.HTML to create an HTML element, and we can pass in the HTML code as a string. We’ll use a div element with a header class, and inside this div we’ll have an img element with a link to our logo image, which I just quickly uploaded to “imgur”. We’ll also set the referrerpolicy to no-referrer to avoid any issues with the image not loading (imgur doesn’t work with a localhost referrer, which is what you’ll have when you run this app locally).

gr.HTML(
    f"""
    
    
    
    """
)
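As a rough sketch of the markup described above, the string could look something like the following: a div with a header class wrapping an img element with referrerpolicy set to no-referrer. The logo URL is a made-up placeholder rather than the actual image link, and build_header_html is a hypothetical helper name used only for this illustration.

```python
def build_header_html(logo_url: str) -> str:
    """Build the header markup: a div.header containing the logo image."""
    return f"""
    <div class="header">
        <img src="{logo_url}" referrerpolicy="no-referrer" />
    </div>
    """


# Placeholder URL for illustration only; substitute your own hosted logo.
header_html = build_header_html("https://example.com/your-logo.png")
print(header_html)
```

A string like header_html is what would be handed to gr.HTML inside the group.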

The input textbox will be where the user can paste in the podcast link. We can just call gr.Textbox to create a textbox element, and we can pass in a label to indicate what the textbox is for. We’ll call it “Google Podcasts Link” and we’ll catch the input in a variable called podcast_link_input.

podcast_link_input = gr.Textbox(label="Google Podcasts Link:")

The button will be the trigger that starts the main function. I want a full row button so we’ll call gr.Row to create a row element, and then we’ll call gr.Button to create a button element. We can just pass in the button text we want to display and associate the button with the variable name btn. We’ll use this btn object later to define the button’s behavior.

with gr.Row():
    btn = gr.Button(" Transcribe and summarize my podcast! ")

The output elements will be the summary, transcription, and two subtitle files. The first two are just a gr.Textbox, which does what you’d expect and allows us to pass in a label, placeholder, and the number of lines to display by default. With autoscroll enabled, the textbox would scroll all the way down to the bottom when a large transcription text is placed in it. Since we want the user to be able to start reading from the beginning instead of the end, we set this behavior to False. We then have another gr.Row with two gr.File elements, which will end up side-by-side in a single row. The label is just a label and elem_classes is a list of classes gradio will give the element, so we can target it with CSS later on using the names vtt-sub-file and srt-sub-file.

summary_output = gr.Textbox(
    label="Podcast Summary",
    placeholder="Podcast Summary",
    lines=4,
    autoscroll=False,
)

transcription_output = gr.Textbox(
    label="Podcast Transcription",
    placeholder="Podcast Transcription",
    lines=8,
    autoscroll=False,
)

with gr.Row():
    vtt_sub_output = gr.File(
        label="VTT Subtitle file download", elem_classes=["vtt-sub-file"]
    )
    srt_sub_output = gr.File(
        label="SRT Subtitle file download", elem_classes=["srt-sub-file"]
    )

The btn.click is where we define which function to call when the button is clicked, so we give it our transcribe_and_summarize function as the first argument. The second argument is a list of inputs, in this case only our podcast_link_input. The third argument is a list of outputs, in this case, our summary_output, transcription_output, vtt_sub_output, and srt_sub_output. We’ll use these outputs to display the results of our function to the user. We just told gradio what function to run, and how to map all of the input and output elements we defined in the interface to the input and output arguments of our function!

btn.click(
    transcribe_and_summarize,
    inputs=[podcast_link_input],
    outputs=[
        summary_output,
        transcription_output,
        vtt_sub_output,
        srt_sub_output,
    ],
)
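Conceptually, the mapping is positional: the values of the input components are passed to the function in order, and the returned values are handed to the output components in the same order. The sketch below imitates that idea in plain Python with a stand-in function and component names as strings. It is only an illustration of the ordering, not how gradio is implemented internally.

```python
def fake_transcribe_and_summarize(page_link: str) -> tuple[str, str, str, str]:
    """Stand-in for the real function; returns four dummy values."""
    return ("a summary", "a transcription", "subs.vtt", "subs.srt")


# One value per input component, in the order given to inputs=[...].
input_values = ["https://example.com/some-podcast-page"]

# Names of the output components, in the order given to outputs=[...].
output_names = [
    "summary_output",
    "transcription_output",
    "vtt_sub_output",
    "srt_sub_output",
]

# Inputs are unpacked into the function's arguments by position...
results = fake_transcribe_and_summarize(*input_values)

# ...and each returned value lands in the output at the same position.
mapping = dict(zip(output_names, results))
print(mapping["summary_output"])  # a summary
print(mapping["srt_sub_output"])  # subs.srt
```

This is why the order of the returned tuple must match the order of the outputs list.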

whisper_pods.py now looks like this:

imports

CONSTANTS


def transcribe_and_summarize(...)...
    ...


if __name__ == "__main__":
    block = gr.Blocks(css=str(STYLES_DIR / "whisper_pods.css"))

    with block:
        with gr.Group():
            gr.HTML(
                f"""
                
                
                
                """
            )

            podcast_link_input = gr.Textbox(label="Google Podcasts Link:")

            with gr.Row():
                btn = gr.Button(" Transcribe and summarize my podcast! ")

            summary_output = gr.Textbox(
                label="Podcast Summary",
                placeholder="Podcast Summary",
                lines=4,
                autoscroll=False,
            )

            transcription_output = gr.Textbox(
                label="Podcast Transcription",
                placeholder="Podcast Transcription",
                lines=8,
                autoscroll=False,
            )

            with gr.Row():
                vtt_sub_output = gr.File(
                    label="VTT Subtitle file download", elem_classes=["vtt-sub-file"]
                )
                srt_sub_output = gr.File(
                    label="SRT Subtitle file download", elem_classes=["srt-sub-file"]
                )

            btn.click(
                transcribe_and_summarize,
                inputs=[podcast_link_input],
                outputs=[
                    summary_output,
                    transcription_output,
                    vtt_sub_output,
                    srt_sub_output,
                ],
            )

    block.launch(debug=True)

Creating the CSS file

See how easy it was to write an interface using gradio! There is just one thing left to do: the STYLES_DIR / "whisper_pods.css" file we loaded into gradio doesn’t actually exist yet! Go ahead and create a new file in the styles directory called whisper_pods.css:

FINX_WHISPER (project root folder)
    output_temp_files
    output_video
    styles
        whisper_pods.css  (new file)
    test_audio_files
    utils
        __init__.py
        podcast.py
        subtitles.py
    1_basic_call_english_only.py
    1_multiple_languages.py
    2_whisper_pods.py
    settings.py
    .env

Inside whisper_pods.css paste the following code:

.header {
  padding: 2em 8em;
}

.vtt-sub-file,
.srt-sub-file {
  height: 80px;
}

We set some padding on the header image by targeting the header class, to stop the image from getting too big. Then we set the height of the subtitle file download boxes to a fixed 80px, keeping them nice and visible.

Now go back to your 2_whisper_pods.py file and run it. Give it some time to load up and you’ll see the following in your terminal:

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.

CTRL + click the link to open it in your browser. You should see the following:

Go ahead and get a Google podcasts link to input. I’ll use a short podcast just for the initial test:
https://podcasts.google.com/feed/aHR0cDovL2ZlZWRzLmZlZWRidXJuZXIuY29tL1RFRF9BaGFfQnVzaW5lc3M/episode/ZW4udmlkZW8udGFsay50ZWQuY29tOjEwNzMyNDo4MA?sa=X&ved=0CAgQuIEEahcKEwiImYLqr8qDAxUAAAAAHQAAAAAQAQ

And then click the button and wait (I’ve blurred out the transcription to respect the speaker’s copyright as this course will be published publicly):

Check the summary, transcription, and subtitle files. Try other podcasts from https://podcasts.google.com/. Play around and have fun! My transcription was very good using just the base whisper model we loaded up, and I never even needed a bigger one! If you use non-English languages you may need a bigger model though. You can also use a .en model like base.en or small.en to get higher accuracy if you will only input English podcasts.

Also take a look at the pods_log.txt file you wrote in the root directory of your project, which holds the full whisper output. It may help you pinpoint where the problems are and how confident the model is while transcribing.

Conclusion

There we go, that is a pretty good initial minimum viable product! Of course, it has much room for improvement, for instance by using a proper front-end framework like React and streaming the transcription live to the page so the user is not left waiting so long before seeing results.

You could also use asyncio to make the ChatGPT summary call asynchronous, slightly speeding up the code by writing the subtitle files to disk while the summary call is running. And of course, you’d want some kind of cleanup function to get rid of all the downloaded mp3 files hanging around in your output_temp_files folder. If you check it, you will see all the files with names like 0e0f5d05-9379-4124-a84d-81de7eb3e314.mp3 that we generated, plus the subtitle files with the same name for each mp3 file.
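A minimal sketch of the asyncio idea could look like this. The two functions are stand-ins that only sleep to simulate the blocking summary call and the subtitle writing; they are not the real podcast.get_summary or subtitles.write_to_file. The sketch assumes Python 3.9 or newer for asyncio.to_thread.

```python
import asyncio
import time


def get_summary(transcription: str) -> str:
    """Stand-in for the blocking ChatGPT summary call."""
    time.sleep(0.2)  # simulate network latency
    return f"Summary of {len(transcription.split())} words"


def write_subtitles(ext: str) -> str:
    """Stand-in for writing one subtitle file to disk."""
    time.sleep(0.1)  # simulate disk work
    return f"subtitles{ext}"


async def summarize_and_write(transcription: str) -> tuple[str, str, str]:
    # asyncio.to_thread runs each blocking call in a worker thread, so the
    # three tasks overlap instead of running one after the other.
    summary, vtt, srt = await asyncio.gather(
        asyncio.to_thread(get_summary, transcription),
        asyncio.to_thread(write_subtitles, ".vtt"),
        asyncio.to_thread(write_subtitles, ".srt"),
    )
    return summary, vtt, srt


result = asyncio.run(summarize_and_write("hello there podcast listeners"))
print(result)
```

The gain is modest, since the transcription itself still dominates the run time, but the waiting on the summary call no longer blocks the file writes.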

I’ll leave the rest up to your imagination! That’s it for part 2, I’ll see you soon in part 3, where we’ll be using Whisper to create a fully automatic video subtitling tool that takes only a video file as input, then transcribes the audio, creates subtitles, and embeds them into the video at the correct times! It will be fun, see you there!

Full Course: OpenAI Whisper – Building Cutting-Edge Python Apps with OpenAI Whisper

[Academy] Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper

The post OpenAI Whisper – Building a Podcast Transcribing App in Python appeared first on Be on the Right Side of Change.

OpenAI Whisper – Python Installation, Setup, & First Steps to Speech-to-Text Synthesis

Dirk van Meerveld — Thu, 25 Jan 2024 19:55:30 +0000

Welcome to this first part of the Whisper course. My name is Dirk van Meerveld and it is my pleasure to be your host and guide for this tutorial series where we will be looking at OpenAI’s amazing speech-to-text model called Whisper.

We’ll first take a look at what it is and how its basic usage works, and then we’ll explore ways in which we can practically use it in our projects. Along the way, we’ll learn about the balance between model size and accuracy, and in the final part, we’ll look at alternative options to speed it up or outsource the processing to OpenAI’s servers.

The local installation process should not be too much of a problem but is a bit different for all operating systems and system setups. Unfortunately, I cannot cover every single possible system setup configuration, so you may have to do some googling and trial and error along the way.

This is an inevitable part of software development. We all get stuck trying to get something to work with our particular system sometimes; it’s just part of the job. Don’t give up and you will get it working eventually.

If you do not like a particular configuration, like running the model locally, rest assured that we will cover both the different ways to run Whisper and various implementation projects over the series. So just watch through the whole thing, then take whatever projects you like and combine them with whichever way of running Whisper you prefer.

Installing Whisper

First, we need to install Whisper. We’ll be using the pip package manager for this, so make sure you have that installed, but you should if you’re a Python user. In a terminal window run the following command:

pip install -U openai-whisper

The -U flag in the pip install -U openai-whisper command stands for --upgrade. It means that Whisper will either be installed or upgraded to the latest version if it is already installed.

The second thing we need to have installed is ffmpeg. What is ffmpeg? FFmpeg is a versatile multimedia framework that allows us to work with audio and video files. It supports a wide range of formats, and is highly portable, running on pretty much any operating system.

The simplest way to install ffmpeg is to use a package manager. If you’re on Windows, you can use Chocolatey to install ffmpeg by running the following command in a terminal window:

# on Windows / Chocolatey
choco install ffmpeg

If you’re on MacOS using Homebrew, you can install ffmpeg by running the following command in a terminal window:

# on MacOS / Homebrew
brew install ffmpeg

If you’re on Linux, well, you probably know what to do and don’t need instructions! On Ubuntu or Debian, for example, you can run:

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

This may be the most challenging part of the tutorial series, to be honest. You may not run into any issues if your system is already set up well, or you may need to do quite some googling and setup work to get everything up and running. It took me some messing around to get everything working properly on my system and it’s unfortunately impossible to know exactly what you will need to do to resolve any issues you may run into. Google is your friend! Remember we’ll also cover the API in part 4 if you don’t want to run the model locally, but don’t just skip ahead as you’ll miss out on a lot of useful information.

What is Whisper?

Whisper is a speech-to-text model developed by OpenAI. What is really cool is that they released this model to the public as open source. It is a neural network that takes audio as input and outputs text. It is trained on a large dataset of audio and text pairs and has learned the text that corresponds to the audio. What is exciting about the model is that it’s not just effective at transcribing high-quality ‘gold-standard’ audio that has been recorded on studio microphones, but is also very good at transcribing audio of considerably lower quality, or even imperfect pronunciation with a foreign accent. If you compare it with auto-generated subtitles from YouTube, for example, you will see that it really is a level apart.

Instead of diving deep into the model’s architecture and technical details that make it work behind the scenes, this course will focus on the practical application of what we can do with it and how to use it to make cool stuff.

Model sizes

There are different sizes available for the Whisper model. The smaller the model, the less processing power and VRAM it needs, and the faster it will run. This comes at the cost of lower accuracy. Conversely, the larger the model, the more processing power and VRAM it needs and the longer it will take to run, but the more accurate it will be and the better it will deal with foreign languages, noise, and poor audio quality.

Size	Parameters	English-only model	Multilingual model	Required VRAM	Relative Speed
tiny	39M	tiny.en	tiny	~1GB	~32x
base	74M	base.en	base	~1GB	~16x
small	244M	small.en	small	~2GB	~6x
medium	769M	medium.en	medium	~5GB	~2x
large	1550M	N/A	large	~10GB	1x

As we can see in this table from the Whisper GitHub, there are 5 different model sizes in total. Four of them also come in an English-only version, namely tiny.en, base.en, small.en, and medium.en. Because these models are specialized for English only, it is highly recommended to use one of them when you know you’re going to be transcribing English, as they will give greater accuracy at a much smaller model size and run-time. This is why there is no large.en model: the medium.en model is already large enough to roughly match the accuracy of the large multilingual model on English.

For the multilingual models, we have the tiny, base, small, medium, and large sizes. These models are trained on a whopping 680,000 hours of audio data covering a total of 97 different languages, though the performance does vary per language, as more obscure languages may not work quite as well. The larger the model size, the more easily it will deal with such languages, specific accents, and poor audio quality.

Now if you don’t have 10GB of VRAM, don’t worry, you can often get away with using the smaller-size models, as you will see. Later on, in the last part of the series, we’ll look at smaller ‘distilled’ versions of the model that can help us optimize speed further, or at just outsourcing the processing to the lightning-fast OpenAI servers. Just keep watching! That being said, I actually recommend you always use the smallest version that you can get away with for your specific task. There is simply no point in adding more cost and complexity to your apps. If you don’t need it, the extra model size will only slow down your application and raise its cost.
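As a small illustration of "use the smallest model you can get away with", the sketch below lists which model sizes fit into a given amount of VRAM, using the approximate figures from the table above. The helper name models_that_fit is made up for this example, and the numbers are rough guidelines rather than hard limits.

```python
# Approximate VRAM needs in GB, taken from the table above,
# ordered from the smallest to the largest model.
MODEL_VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}


def models_that_fit(available_vram_gb: float) -> list[str]:
    """Return the model sizes whose approximate VRAM need fits the budget."""
    return [
        name for name, needed in MODEL_VRAM_GB.items()
        if needed <= available_vram_gb
    ]


candidates = models_that_fit(4)
print(candidates)  # ['tiny', 'base', 'small']
```

A sensible workflow is to start at the front of this list and only move up a size when the transcription quality is not good enough for your audio.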

Basic usage

Now that we have Whisper, fire up your favorite code editor, and let’s get started! I’ll be using VSCode, but you can use whatever IDE you like. Create a root folder for your project, I’ll call mine FINX_WHISPER, and then inside make a new file called 1_basic_call_english_only.py. (I’m using numbers for the file names so you can easily reference them later when you are busy coding some cool new project, but this is obviously not a good general naming convention):

FINX_WHISPER (project root folder)
    1_basic_call_english_only.py

Then open up the new Python file and start with the imports:

import whisper
from pathlib import Path

The whisper import is obvious, and pathlib will help us get the path to the audio files we want to transcribe, this way our Python file will be able to locate our audio files even if the terminal window is not currently in the same directory as the Python file. Now let’s declare some constants:

MODEL = whisper.load_model("base.en")
AUDIO_DIR = Path(__file__).parent / "test_audio_files"

First, we declare MODEL and load the base.en model. We start with the second-smallest English-only model and will scale up if and when we need to. Then we declare AUDIO_DIR and use pathlib to get the path. This works by first getting the path to the current file (1_basic_call_english_only.py), using __file__, and then getting the parent directory of that file, using .parent. Then we add the test_audio_files folder to the path using the / operator. This way we can easily access the audio files in the test_audio_files folder from our Python file.
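To make the path building a bit more concrete, here is a small standalone sketch. It uses a made-up script location in place of __file__ (the /home/user prefix is purely hypothetical) and PurePosixPath so the printed result looks the same on every operating system.

```python
from pathlib import PurePosixPath

# Hypothetical location of the script; in the real file this is Path(__file__).
script_path = PurePosixPath("/home/user/FINX_WHISPER/1_basic_call_english_only.py")

# .parent drops the file name and leaves the folder the script lives in.
project_dir = script_path.parent

# The / operator appends a path segment, whatever the current working directory is.
audio_dir = project_dir / "test_audio_files"

print(project_dir)  # /home/user/FINX_WHISPER
print(audio_dir)    # /home/user/FINX_WHISPER/test_audio_files
```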

Now let’s create the test_audio_files folder, as it doesn’t exist yet. Make sure you spell it correctly:

FINX_WHISPER (project root folder)
    test_audio_files
    1_basic_call_english_only.py

Then go ahead and add the audio files provided into the folder. They should be provided together with this video tutorial, but if for any reason you cannot find them, go to the Finxter GitHub repository for this course or you can find a copy at:

https://github.com/DirkMeer/finx_whisper

Download all the test files and put them in the folder (you can also add your own audio files if you want to, these are just provided for your convenience):

FINX_WHISPER (project root folder)
    test_audio_files
        dutch_long_repeat_file.mp3
        dutch_the_netherlands.mp3
        high_quality.mp3
        low_quality.mp3
        terrible_quality.mp3
    1_basic_call_english_only.py

Ok, back to our 1_basic_call_english_only.py file. Below the MODEL and AUDIO_DIR variables, let’s create a function that will transcribe the audio files for us:

def get_transcription(audio_file: str):
    result = MODEL.transcribe(audio_file)
    print(result)
    return result

This function takes an audio file’s path in string format as input. We then call the .transcribe() method Whisper provides for us, and pass in the audio file’s path in string format. Then we simply print and return the result for a basic test. Looks really simple right?

First, let’s try and transcribe a high-quality English audio file, as a sort of best-case scenario:

get_transcription(str(AUDIO_DIR / "high_quality.mp3"))

Notice that the function we wrote above takes a path as a string variable. This is because Whisper requires the path to the audio file as a string. AUDIO_DIR / "high_quality.mp3" returns a Path object, so we use str() to convert it to a string, or else Whisper will crash.

Getting a transcription

So go ahead and save and run the file, and you will see a large object containing all the output. Let’s take a quick look at the information available to us here, read the comments for an explanation:

{
    # First we get the full transcription
    "text": " Hi guys, this is just a quick test audio file for you. Let's see how well it does and if my speech is recognized and converted to text properly. I'm really excited to see how well this works and I hope that it will be a good test for you guys to see how well the whisper model works.",
    # Now we have the list of segments
    "segments": [
        {
            "id": 0,
            "seek": 0,
            # Start and end times in seconds
            "start": 0.0,
            "end": 3.52,
            "text": " Hi guys, this is just a quick test audio file for you.",
            # List of tokens from the transcription, where each token (a word or piece of a word) is represented by a unique number
            "tokens": [ 50363, 15902, 3730, 11, 428, 318, 655, 257, 2068, 1332, 6597, 2393, 329, 345, 13, 50539 ],
            "temperature": 0.0,
            # In the context of machine learning, temperature is a parameter that controls the randomness of predictions. A temperature of 0.0 suggests no randomness, or the model always selecting the tokens(words) with the highest probability (This is similar to the ChatGPT API temperature setting). You can pass a temperature value to the transcribe function when calling it if you want to introduce more randomness into your generations.
            # For instance: model.transcribe(audio_file, temperature=0.2)
            "avg_logprob": -0.1399546700554925,
            # The average log probability of the tokens in the segment. The closer to 0 the better, which means if the numbers get more negative, like -0.2 for instance, the model is less confident in its transcription (and there are probably more errors).
            "compression_ratio": 1.5898876404494382,
            "no_speech_prob": 0.0045762090012431145,
            # Represents the probability that the segment contains no speech. We can see that it is very low.
        },
        {
            '... more segments with the same structure as above, cut for brevity ...'
        },
    ],
    "language": "en",
}

As we can see, we really get a lot of information back from the model! What is most interesting is of course the transcription itself. Notice that it is a perfect word-for-word transcription even though we used base.en, the second-smallest model available. Very impressive for such a small version of the real model! Now let’s try a lower-quality audio file:

Replace the last call:

get_transcription(str(AUDIO_DIR / "high_quality.mp3"))

with:

get_transcription(str(AUDIO_DIR / "low_quality.mp3"))

And when we run this with the considerably lower quality audio file, still on the base.en model, I still get a perfect transcription. If we look closely at the output object, though, we can clearly see the avg_logprob (explained above) has moved further away from 0, from -0.1399546700554925 to -0.2179246875974867, indicating the model is now less confident in its transcription (though still correct).
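Log probabilities are not very intuitive to read, so one way to get a feel for these two numbers is to convert them back into ordinary probabilities. The sketch below assumes avg_logprob is the mean natural-log probability of the tokens in the segment, in which case taking the exponent gives the geometric mean of the per-token probabilities. The helper name is made up for this example.

```python
import math


def avg_token_probability(avg_logprob: float) -> float:
    """Convert an average log probability into a typical per-token probability."""
    return math.exp(avg_logprob)


high_quality = avg_token_probability(-0.1399546700554925)
low_quality = avg_token_probability(-0.2179246875974867)

print(f"high quality: {high_quality:.2f}")  # high quality: 0.87
print(f"low quality:  {low_quality:.2f}")   # low quality:  0.80
```

Read this way, the model went from being roughly 87% sure of a typical token to roughly 80% sure, which is a noticeable but not dramatic drop.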

Now let’s try a really poor-quality audio file:

get_transcription(str(AUDIO_DIR / "terrible_quality.mp3"))

And if we run this we can see that it is still largely correct, even though a human would have trouble understanding it:

Hi guys. This is just a quick test audio file for you. Let's see how well it does and if my speech is recognized, thank you for the context properly. I'm really excited to see how well this works and I hope that it will be a quick test for you guys to see how well the whisper model works.

We have clearly reached the limits of the base.en model here, as part of this is incorrect, and it’s time to step up to a bigger model size. (Remember, you generally want to use the smallest model you can get away with for your use case!)

I’m going to change the model to small.en by editing the MODEL variable at the top of our file:

MODEL = whisper.load_model("small.en")

Now if we run it again:

Hi guys, this is just a quick test audio file for you. Let's see how well it does, and if my speech is recognized and converted to text properly, I'm really excited to see how well this works, and I hope that it will be a good test for you guys to see how well the Whisper model works.

There is an awkward super-long sentence with a few too many commas, but apart from that it’s perfect, even though the audio quality of this file is pretty terrible. Switching to medium.en fixes this last small imperfection with the commas, by the way. This is the power of Whisper!

Taking a deeper look

Now let’s take a slightly deeper look at what is happening inside Whisper while looking at using other languages and even translation. Make a new file in your root folder called 1_multiple_languages.py:

FINX_WHISPER (project root folder)
    test_audio_files
    1_basic_call_english_only.py
    1_multiple_languages.py

Then open up the new 1_multiple_languages.py file and start with the imports:

import whisper
from pathlib import Path

AUDIO_DIR = Path(__file__).parent / "test_audio_files"
model = whisper.load_model("base")

Make sure to use the base model this time, and not the base.en model, as we want to use all available languages.

First, we’ll take a slightly deeper look to get a rough idea of what is going on, as this will help us understand some important nuances. After that, we’ll greatly simplify the whole thing using the higher-level code again. Let’s write a function that will detect the language and transcribe a file for us, explaining it line by line.

def detect_language_and_transcribe(audio_file: str):
    audio = whisper.load_audio(audio_file)

We define a function, which takes the path to an audio_file as a string argument. We then call Whisper’s load_audio() function and pass in the audio file’s path. This returns a NumPy array containing the audio waveform in float32 datatype, or in other words, an array containing the audio data as a giant list of numbers.

    audio = whisper.pad_or_trim(audio)

Next, we get a 30-second sample, either padding with silence if the file is shorter than 30 seconds or trimming it if it is longer. This is because the Whisper model is built and trained to take 30 seconds of audio as its input data each time. This doesn’t mean you cannot transcribe longer files but does have some implications we’ll get back to later.
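The padding and trimming idea can be sketched in a few lines of plain Python. This is a simplified illustration on ordinary lists, not Whisper’s own implementation (which works on arrays and tensors). Whisper works with audio sampled at 16 kHz, so 30 seconds comes down to 480,000 samples. The function name pad_or_trim_list is made up for this example.

```python
SAMPLE_RATE = 16_000                      # samples per second
CHUNK_SECONDS = 30                        # length of the input window
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # 480,000 samples


def pad_or_trim_list(samples: list[float], length: int = N_SAMPLES) -> list[float]:
    """Cut the audio to `length` samples, or pad it with silence (zeros)."""
    if len(samples) > length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


short_clip = [0.1] * (SAMPLE_RATE * 5)    # 5 seconds of dummy audio
long_clip = [0.1] * (SAMPLE_RATE * 45)    # 45 seconds of dummy audio

print(len(pad_or_trim_list(short_clip)))  # 480000
print(len(pad_or_trim_list(long_clip)))   # 480000
```

Note that everything after the first 30 seconds of the long clip is simply thrown away here, which is why longer files need some extra handling.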

    mel = whisper.log_mel_spectrogram(audio).to(model.device)

Next, we make a log-Mel spectrogram and move it to the same device as the model (e.g. your GPU). A log-Mel spectrogram is a representation of a sound or audio signal that has been transformed to highlight certain perceptual characteristics.

Spectrogram: A spectrogram is a visual representation of the spectrum of frequencies in a sound or other signal as they vary with time. It's essentially a heat map where the x-axis is time, the y-axis is frequency, and the color represents the loudness.

Mel Scale: The Mel scale is a perceptual scale of pitches that emulates the human ear's response to different frequencies. We humans are much better at distinguishing small changes in pitch at low frequencies than at high frequencies. The Mel scale makes the representation match more closely with human perception as opposed to the exact mathematical frequencies.

Logarithmic Scale: Taking the logarithm of the spectrogram values is another step to make the representation more closely match human perception. We perceive loudness on a logarithmic scale (which is why we use decibels, a logarithmic measurement, to express the loudness of sound).

Combining these, a log-Mel spectrogram is a representation of sound that is designed to highlight the aspects that are most important for human perception. It's commonly used in audio processing tasks, including speech and music recognition.
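To get a feel for the Mel scale, the sketch below uses one commonly quoted conversion formula from hertz to mels. This particular formula is an assumption chosen for illustration; it is not necessarily the exact variant used inside Whisper. It shows that the same 100 Hz step counts for much more at low frequencies than at high ones.

```python
import math


def hz_to_mel(frequency_hz: float) -> float:
    """Convert a frequency in Hz to mels using a common textbook formula."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


# The same 100 Hz step, once at the low end and once at the high end.
low_step = hz_to_mel(200) - hz_to_mel(100)
high_step = hz_to_mel(8100) - hz_to_mel(8000)

print(f"100 Hz -> 200 Hz:   {low_step:.1f} mels")
print(f"8000 Hz -> 8100 Hz: {high_step:.1f} mels")
```

The low-frequency step spans roughly ten times as many mels as the high-frequency one, mirroring how our ears are more sensitive to pitch changes in the lower range.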

Now that we have this log-Mel spectrogram, we can use it to detect the language of our audio file. We do this by passing it to the .detect_language() method of our model:

    language_token, language_probs = model.detect_language(mel)

This returns the language_token, which is a number we will not be using, and the language_probs, which is a large dictionary mapping each language code to the probability that it matches the sound file. As we won’t actually be using the language_token variable, we can replace it with a _ to indicate that we won’t be using it. This makes it into a sort of throwaway variable that we don’t care about.

    _, language_probs = model.detect_language(mel)

Let’s take what we have so far, add a print statement to check out the language_probs, and run it using the dutch_the_netherlands.mp3 file I prepared for you:

def detect_language_and_transcribe(audio_file: str):
    audio = whisper.load_audio(audio_file)
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    _, language_probs = model.detect_language(mel)
    print(language_probs)

detect_language_and_transcribe(str(AUDIO_DIR / "dutch_the_netherlands.mp3"))

Now when we run this we can see the massive language_probs dictionary printed to our console:

{
    '.. cut for brevity ..'
    "yi": 2.012418735830579e-05,
    "ka": 2.161949907986127e-07,
    "nl": 0.9650669693946838,
    "en": 0.010499916970729828,
    "ko": 9.358442184748128e-05,
    "mn": 5.96029394728248e-06,
    "de": 0.010318436659872532,
    '.. cut for brevity ..'
}

We have a huge dictionary of numbers here, as you can see. The higher the number, the more likely the language; many values are on the order of 10 to the power of -4, -5, -6, or even lower. We can clearly see that nl (Dutch) has by far the highest probability, close to a perfect score of 1 with 0.965. The second and third highest are en (English) and de (German) with roughly 0.0105 and 0.0103 respectively, which is not even close, so we can be very confident that this is Dutch. That is impressive for a model as small as base, which has to deal with so many languages, especially as Dutch is not really that big a language.

Of course, we don’t want this whole list, we just want to know the most probable language, so we can use the max function to get the highest probability.

def detect_language_and_transcribe(audio_file: str):
    ...
    language: str = max(language_probs, key=language_probs.get)
    print(f"Detected language: {language}")

max returns the key with the largest value in the dictionary. We pass in the dictionary as the first argument. Iterating over a dictionary yields its keys, so the key argument is a function that is called on each key, and the key for which the function returns the largest value is the result of max. We can just use the dictionary’s built-in .get() method as that function to look up the value for each key.
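
To see the max trick in isolation, here is a minimal sketch on a tiny hand-made dictionary. The values are rounded from the output above and the variable names are only for illustration; the real language_probs dictionary contains many more entries.

```python
# Hypothetical subset of the probabilities shown above (values rounded).
language_probs = {
    "yi": 2.0e-05,
    "nl": 0.965,
    "en": 0.0105,
    "de": 0.0103,
}

# Iterating a dict yields its keys; key=language_probs.get ranks each key by its value.
language = max(language_probs, key=language_probs.get)

# The same idea extends to a top-3 ranking using sorted().
top_three = sorted(language_probs, key=language_probs.get, reverse=True)[:3]

print(language)
print(top_three)
```

The sorted variant can be handy if you want to inspect the runner-up languages when the detection is less clear-cut.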

The language name codes are in ISO 639-1 format and can be found here. We add a print statement to print the detected language. I removed the previous print statement print(language_probs) we added before.

def detect_language_and_transcribe(audio_file: str):
    ...
    language: str = max(language_probs, key=language_probs.get)
    print(f"Detected language: {language}")
    options = whisper.DecodingOptions(language=language, task="transcribe")
    result = whisper.decode(model, mel, options)
    print(result)
    return result.text

Now we’ll decode this 30-second audio file into text. First, we create a DecodingOptions object and save it in the variable named options. The DecodingOptions object lets you set more advanced decoding options, but we’ll stick to basics for now, passing in the language we detected and the task of “transcribe”. We then call the whisper.decode function which performs decoding of the 30-second audio segment(s), provided as log-Mel spectrogram(s). We pass in the model, the mel spectrogram, and the options. This returns a DecodingResult object which we save in the variable named result. We then print the result and return the result.text.

The whole function now looks like this:

def detect_language_and_transcribe(audio_file: str):
    audio = whisper.load_audio(audio_file)
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    _, language_probs = model.detect_language(mel)
    language: str = max(language_probs, key=language_probs.get)
    print(f"Detected language: {language}")
    options = whisper.DecodingOptions(language=language, task="transcribe")
    result = whisper.decode(model, mel, options)
    print(result)
    return result.text

Now let’s run it with the dutch_the_netherlands.mp3 file again:

dutch_test = detect_language_and_transcribe(
    str(AUDIO_DIR / "dutch_the_netherlands.mp3")
)

When you run this the object printed to the console will have the following transcription:

'Hoi, allemaal. Dit is weer een testbestandje. Deze keer om te testen of de Nederlandse taal goed herkend gaat worden. Hierna kunnen we ook proberen deze text te laten vertalen naar het Engels om te zien hoe goed dat gaat. Ik ben benieuwd.'

There we go, a perfect transcription! Now you probably don’t speak Dutch, but the above is a perfect word-for-word transcription of the spoken text.

Back to .transcribe

Now I’ll be honest, that was a little bit overcomplicated if we don’t need to do much personalization and just want to call the model. Also, we don’t want to limit ourselves to just 30 seconds of audio. Let’s take it back to whisper’s higher level .transcribe function which basically does all the above for us.

Make sure you comment out the dutch_test code so it doesn’t keep running:

# dutch_test = detect_language_and_transcribe(
#     str(AUDIO_DIR / "dutch_the_netherlands.mp3")
# )

Now all we need to do to use .transcribe is load a model (model = whisper.load_model("base")) which we already did in this file, and then call the .transcribe method on the model and pass in the path to the audio file as a string:

result = model.transcribe(str(AUDIO_DIR / "dutch_the_netherlands.mp3"), verbose=True)
print(result["text"])

It also has some options, in this case, we’ve set verbose to True so it will give us extra information in the console. If you go ahead and run this you will get the exact same transcription in the output as we did above:

'Hoi, allemaal. Dit is weer een testbestandje. Deze keer om te testen of de Nederlandse taal goed herkend gaat worden. Hierna kunnen we ook proberen deze text te laten vertalen naar het Engels om te zien hoe goed dat gaat. Ik ben benieuwd.'

Again, you probably don’t speak Dutch, but that’s not the point. So underneath the hood, the .transcribe function reads the entire audio file and basically processes it in 30-second windows. You could also see it did the language detection part for us automatically before starting.

Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: Dutch

Working with longer files

So that’s pretty good, right? Well, let’s try a longer audio file and see what happens. I’ve provided dutch_long_repeat_file.mp3 which is just the same audio file but it repeats 3 times, totaling just over 40 seconds. Let’s see what happens when we try to transcribe this file (make sure you comment out the run above):

# result = model.transcribe(str(AUDIO_DIR / "dutch_the_netherlands.mp3"), verbose=True)
# print(result["text"])


result = model.transcribe(
    str(AUDIO_DIR / "dutch_long_repeat_file.mp3"),
    verbose=True,
    language="nl",
    task="transcribe",
)
print(result["text"])

Note we can pass in the language if we already know it, so we can skip the detection step and save some time there. So for applications where you always know the language ahead of time, just pass it in to optimize your application. We pass in nl as it is the ISO 639-1 code for Dutch.

Now let’s run this and check the output (yours will look different from mine):

Hoi j allemaal! Dit is weer een testbestandje! Deze keer om te testen of de Nederlandse taal goed herkent gaat worden. Je en bırak�� collecte geval. Je gievous raakt deze tekst te laten vertalen naar het Engels om te zien hoe goed dat gaat. Ik ben benieuwd! Hoi jlynn allemaal! Dit is weer een testbestandje. Deze keer om te testen of de Nederlandse taal goed herkent gaat worden. Je en driesbredmontie kunt wiring die text er metυτ�� mesma halen te laten vertalen naar het Engels om te zien hoe goed dat gaat! Ik ben benieuwd. Hoi allemaal! Dit is weer een testbestandje. Deze keer om te testen of de Nederlandse taal goed herkend gaat worden. Hierna kunnen we ook proberen deze tekst te laten vertalen naar het Engels om te zien hoe goed dat gaat. Ik ben benieuwd.

Now I’m not going to make you read this, but as a Dutch person, I will tell you this output is terrible and there are several characters and many words here that do not even exist in the Dutch language! So what happened? It’s the same model and the audio file is exactly the same as before, it’s just a bit longer and repeats itself. We should have gotten the same output right?

Well, it is because Whisper’s machine-learning model is limited to audio segments of only 30 seconds as its input. Because of this, it is more challenging for it to transcribe longer audio files. The .transcribe function took care of cutting the audio into 30-second segments for us and feeding them through and sort of stitching them back together, making our life a lot easier, so we didn’t really notice this extra challenge.

Whisper does use some clever tricks to improve the quality when transcribing longer audio files that need to be cut into 30-second pieces and put back together again, but this is inherently a bit trickier. That is why we saw a significant drop in transcription quality even though the audio we were transcribing was exactly the same as before (just repeated 3 times in a row to make it longer).
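
To get a feel for the windowing, here is a deliberately simplified sketch that only computes where fixed 30-second windows would start and end. It assumes Whisper's 16 kHz sample rate; the function name window_bounds is made up for this illustration, and the real .transcribe logic is more involved than fixed cuts like these.

```python
SAMPLE_RATE = 16_000      # Whisper works on audio resampled to 16 kHz
WINDOW_SECONDS = 30       # the model sees at most 30 seconds at a time


def window_bounds(n_samples: int) -> list[tuple[int, int]]:
    """Return (start, end) sample indices for consecutive 30-second windows."""
    window = SAMPLE_RATE * WINDOW_SECONDS
    return [
        (start, min(start + window, n_samples))
        for start in range(0, n_samples, window)
    ]


# A hypothetical 42-second file needs two windows; the second is only partly filled.
bounds = window_bounds(42 * SAMPLE_RATE)
print(bounds)
```

Every boundary is a place where a word or sentence can be cut in half, which is one intuition for why longer files are harder than a single 30-second clip.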

Does this mean Whisper is only good for small files? Not at all! All we need to solve this bigger challenge of a minor language (Dutch) combined with files longer than 30 seconds is to just step up to a bigger model!

When changing the model to small instead of base:

model = whisper.load_model("small")

I got an almost perfect output with only a single very minor spelling mistake. When I changed to medium afterward it was absolutely perfect. It’s just a matter of using a bigger model until it works. Pick the model size that corresponds to the size of your challenge.

Translating

Besides just transcribing, as if that wasn’t awesome enough, Whisper can also translate pretty much all major languages to English. (If you get very hacky it can even translate English to other languages, but that is not an intended or supported feature).

So now let’s give it an audio file in a non-English language and then ask it for an English translation. We’ll feed it the dutch_the_netherlands.mp3 file again, but this time ask it for a translation (to English) so you can finally find out what I said in the audio!

result = model.transcribe(
    str(AUDIO_DIR / "dutch_the_netherlands.mp3"),
    verbose=True,
    language="nl",
    task="translate",
)
print(result["text"])

Make sure you comment out any calls above so you don’t run them by accident. I’ve already tested this out, and you’ll need roughly the medium model size to get a good translation, so make sure you load that BEFORE the call above (if your computer can handle it, otherwise just try a smaller one).

model = whisper.load_model("medium")

The output is:

Hey everyone, this is a test file again. This time to test whether the Dutch language will be recognized well. After this, we can also try to translate this text into English to see how well that goes. I'm curious.

It’s really quite a decent translation, straight from spoken text. That is very impressive. For sloppy pronunciation it still works quite well – I tested this using my Korean pronunciation which is not great and the results were still pretty good.

So other languages, longer files, or slightly less native pronunciation will all benefit a lot from going to larger versions of the model (as long as you have the VRAM for it). I’ll be sticking with the models at the lower end of the spectrum for this series as much as possible, as not everyone will have the GPU to run the larger models, but feel free to use a larger model if you have the VRAM for it.

On the flip side, if you can only run the small or even the base models, do not despair! The next two tutorials will actually do very well for accuracy running on these smaller models, and again, in the last part, we’ll look at speeding up, optimizing, or outsourcing the processing altogether.

Now that we’ve got the more boring basics out of the way, it’s time to build some cool and fun stuff and look at practical applications and integration in the next couple of parts! See you there!

Full Course: OpenAI Whisper – Building Cutting-Edge Python Apps with OpenAI Whisper

[Academy] Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper

The post OpenAI Whisper – Python Installation, Setup, & First Steps to Speech-to-Text Synthesis appeared first on Be on the Right Side of Change.

OpenAI JSON Mode & DALL·E 3 Seeds (Examples)

Dirk van Meerveld — Tue, 28 Nov 2023 11:50:35 +0000

This article originally appeared on the Finxter Academy for premium members (including course lesson video). Check out the video course here.

All right, welcome back to part 2, where we’re going to be looking at JSON mode and seeds.

This lets us use just one part of function calling on its own. Namely, when the model generates the arguments it wants to use to call a function, it returns these as valid JSON (JavaScript Object Notation).

You saw in the previous part that we parsed these arguments and then passed them into our functions.

So what if we always want a JSON response from ChatGPT? We can now use the new JSON mode to do this.

Why would this be useful? Well, it’s really easy to parse into an object, that we can manipulate with code or feed into some kind of software or API, just like we did in the previous part. This can be really helpful for extracting data from text.

If we ask GPT to generate something in textual form, it’s pretty hard to use the output in our Python code, for example.

Still, if we ask it to output the data in JSON in exactly the way we specify, it’s very easy to parse this into a dictionary and then save the data in a database or manipulate it in some other way.
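
As a minimal sketch of that parsing step, using only Python's standard json module (the reply string here is a made-up stand-in for what a model might return):

```python
import json

# Hypothetical model reply: a JSON document delivered as a plain string.
raw_reply = '{"title": "Duncan Campbell", "author": "James Hogg"}'

record = json.loads(raw_reply)  # str -> dict
print(type(record))
print(record["author"])
```

Once the data is a dictionary, storing it or passing it on to other code is ordinary Python.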

So, let’s get started with a simple example to see how this works. You’ll then be able to adapt this to your specific use case.

Preparing Some Data

Let’s get something simple to extract data from. Remember the data could also be generated or acquired in some other way, the point here is the output.

Make a file called chapters.py in a new folder named 2_JSON_mode_and_seeds like this:

FINX_OPENAI_UPDATES (root project folder)
    1_Parallel_function_calling
    2_JSON_mode_and_seeds
        chapters.py
    .env

Now go and visit https://gutenberg.org/cache/epub/72064/pg72064.txt in your browser.

This will take you to the text version of the book “The book of Scottish story: historical, humorous, legendary, and imaginative”, which is in the public domain (copyright expired), so we can use it for our example.

Copy the entire list of contents (it’s pretty long) all the way from 'The Henpecked Man' to 'Catching a Tartar' and paste it into the chapters.py file.

It should look like this:

table_of_contents = """
CONTENTS.

The Henpecked Man,                              _John Mackay Wilson_

Duncan Campbell,                                _James Hogg_

...loads more entries in between...

The Fight for the Standard,                     _James Paterson_

Catching a Tartar,                              _D. M. Moir_
"""

Notice it’s a simple variable named table_of_contents which is a very long multiline string so we can easily import this later.

The formatting of the table of contents is wonky with underscores, and some have “quotes” around them while others don’t, so this will make an excellent simple example.

JSON Mode

Go ahead and save this chapters.py file.

Now, create a new file in the 2_JSON_mode_and_seeds folder called json_mode.py.

FINX_OPENAI_UPDATES (root project folder)
    1_Parallel_function_calling
    2_JSON_mode_and_seeds
        chapters.py
        json_mode.py
    .env

Inside, let’s get started with our imports:

from decouple import config
from openai import OpenAI
from chapters import table_of_contents
import json
import pprint

client = OpenAI(api_key=config("OPENAI_API_KEY"))

We have all our basic imports here,

config,
OpenAI,
the table_of_contents variable we just defined,
json, and
pprint.

We’ll use pprint, or pretty print to print the output in a nice way. It will print objects like dictionaries to the console in a much more readable manner, as you’ll see later.
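
If you have not used pprint before, this small sketch shows the difference with a plain print. The dictionary is hand-written sample data for illustration only.

```python
import pprint

# Hand-written sample data, only here to give pprint something to wrap.
people = {
    "people": [
        {"name": "John", "height_cm": 182.88, "age_years": 41.67},
        {"name": "Mary", "height_cm": 152.4, "age_years": 30},
        {"name": "Bob", "height_cm": 170, "age_years": 25},
    ]
}

print(people)           # everything on one long line
pprint.pprint(people)   # wrapped and aligned, with keys sorted alphabetically

# pformat returns the same layout as a string instead of printing it.
formatted = pprint.pformat(people)
```

Note that pprint sorts dictionary keys by default, so the printed key order can differ from the order in which the data was created.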

We then initialize our client as before.

Now, let’s start our json_gpt function:

def json_gpt(query, model="gpt-3.5-turbo-1106", system_message=None):
    if not system_message:
        system_message = "You are a JSON generator which outputs JSON objects according to user request"

We’re going to be using the new version of 3.5 Turbo for this one. Don’t worry, we’ll get to GPT-4 Turbo very soon!

But for now, it’s simply not needed to get good results, and as 3.5 Turbo is much cheaper it’s better to use it when GPT-4 is not needed.

More on pricing details later.

Again, make sure you use the 1106 version and not any older one, because only the newest GPT-3.5 Turbo and GPT-4 Turbo versions support JSON mode.

We define our function and set a default for the model and system message but allow the user to override either. Next, still inside the function, define the messages list:

    messages = [
        {"role": "system", "content": system_message},
        {
            "role": "user",
            "content": f"Please return Json for the following as instructed above:\n{query}",
        },
    ]

Note that the user query is preceded by a specific request for JSON output, even in the user message.

Even though we will enable JSON mode, we still have to specifically mention the word JSON somewhere in our messages, either in the system message or in the user message.

If we don’t, the model may produce weird generations, which is why the API returns a failsafe error if we forget to include this word anywhere in our context.

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        response_format={"type": "json_object"},
    )

Now we make a pretty normal request to ChatGPT using the new client syntax.

Note we cannot just set the response_format argument to the string json_object; we have to specifically pass in a dictionary with the key-value pair "type": "json_object".

    content: str = response.choices[0].message.content
    content: dict = json.loads(content)
    print(f"\033[94m {type(content)} \033[0m")
    pprint.pprint(content)
    return content

The content is initially in string format even though it represents JSON.

We then convert it to a dictionary so we can work with the data like any other dictionary.

Note that whatever format you want with whatever keynames and values is possible, as we’ll demonstrate later.

We then print the type of content to show that ChatGPT’s output is, in fact, a valid dictionary object (after conversion from JSON) and pretty print it to the console.

Finally, we return the content so we can use it in our code.

The whole function is as follows:

def json_gpt(query, model="gpt-3.5-turbo-1106", system_message=None):
    if not system_message:
        system_message = "You are a JSON generator which outputs JSON objects according to user request"

    messages = [
        {"role": "system", "content": system_message},
        {
            "role": "user",
            "content": f"Please return Json for the following as instructed above:\n{query}",
        },
    ]

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        response_format={"type": "json_object"},
    )

    content: str = response.choices[0].message.content
    content: dict = json.loads(content)
    print(f"\033[94m {type(content)} \033[0m")
    pprint.pprint(content)
    return content
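
Because JSON mode requires the word JSON to appear somewhere in the messages, a small client-side check before sending can save a failed request. This is only a sketch; mentions_json is a hypothetical helper and not part of the OpenAI library.

```python
def mentions_json(messages: list[dict]) -> bool:
    """Hypothetical pre-flight check: does any message contain the word 'json'?"""
    return any("json" in str(m.get("content", "")).lower() for m in messages)


good = [
    {"role": "system", "content": "You are a JSON generator."},
    {"role": "user", "content": "List three fruits."},
]
bad = [{"role": "user", "content": "List three fruits."}]

print(mentions_json(good), mentions_json(bad))
```

In json_gpt above this check would always pass, since both the default system message and the user message template mention JSON, but it can be useful when callers supply their own system message.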

A Simple Test

Let’s start with a very simple test by adding the following function call:

json_gpt(
    "Give me a Json object with the height in cm and age in years of all people in the following text: John is 6 feet tall and 500 months old. Mary is 5 feet tall and 30 years old. Bob is 170cm in length and was born 25 years ago."
)

And we can see it does absolutely fine and converts all the ages and heights to the same units just like we requested, even using 3.5-Turbo.

{'people': [{'age_years': 41.67, 'height_cm': 182.88, 'name': 'John'},
            {'age_years': 30, 'height_cm': 152.4, 'name': 'Mary'},
            {'age_years': 25, 'height_cm': 170, 'name': 'Bob'}]}

This is a valid dictionary that we can manipulate straight away in our code or store in a database without any additional parsing, though we could round the values if we wanted to.
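
Rounding the values, for example, is just regular dictionary handling. A minimal sketch using the output shown above (the variable names are only for illustration):

```python
# The dictionary as returned in the run above.
result = {
    "people": [
        {"age_years": 41.67, "height_cm": 182.88, "name": "John"},
        {"age_years": 30, "height_cm": 152.4, "name": "Mary"},
        {"age_years": 25, "height_cm": 170, "name": "Bob"},
    ]
}

# Round ages and heights to whole numbers, keeping the names as they are.
rounded = [
    {
        "name": person["name"],
        "age_years": round(person["age_years"]),
        "height_cm": round(person["height_cm"]),
    }
    for person in result["people"]
]

print(rounded)
```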

So this can be used for data extraction, even if the values are given in different units or formats and are interwoven in a piece of text. Also notice that the pprint function made it nice and easy to read by lining up the values in the dictionary.

A More Complex Test

Make sure you comment out the function call above and then let’s use our table of contents file and give it a very specific output format, so we can basically use GPT as a data parser without having to write a real output parser.

json_gpt(
    query=table_of_contents,
    system_message="""
    You are a JSON generator which outputs JSON objects according to user request.
    Please extract the author and title for all lines going all the way from start to end in the following text and return it as a JSON object following the example provided below.

    Example input:
    The Lily of Liddisdale,                         _Professor Wilson_

    The Unlucky Present,                            _Robert Chambers_

    The Sutor of Selkirk                            “_The Odd Volume_,”

    Example output:
    {'contents': [
        {'author': 'Professor Wilson', 'title': 'The Lily of Liddisdale'},
        {'author': 'Robert Chambers', 'title': 'The Unlucky Present'},
        {'author': 'The Odd Volume', 'title': 'The Sutor of Selkirk'},
    ]}
    """,
)

Note that the only guarantee we get with JSON mode is JSON output, not the specific format!

We still have the responsibility to be very specific to get the output we desire. Providing specific examples like the above is your best friend, as GPT tends to perform much better this way.
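
Since the structure itself is not guaranteed, it can be worth checking the parsed result before relying on it. Below is a minimal sketch of such a check for the format requested above; has_expected_shape is a hypothetical helper name, not something the OpenAI library provides.

```python
def has_expected_shape(data: object) -> bool:
    """Check for {'contents': [{'author': str, 'title': str}, ...]}."""
    if not isinstance(data, dict):
        return False
    contents = data.get("contents")
    if not isinstance(contents, list):
        return False
    return all(
        isinstance(entry, dict)
        and isinstance(entry.get("author"), str)
        and isinstance(entry.get("title"), str)
        for entry in contents
    )


ok = {"contents": [{"author": "James Hogg", "title": "Duncan Campbell"}]}
wrong_key = {"chapters": [{"author": "James Hogg", "title": "Duncan Campbell"}]}

print(has_expected_shape(ok), has_expected_shape(wrong_key))
```

If the check fails you could retry the request or tighten the example in the system message.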

Now go ahead and run the file and you should get the following:

gpt3_5 = {
    "contents": [
        {"author": "John Mackay Wilson", "title": "The Henpecked Man"},
        {"author": "James Hogg", "title": "Duncan Campbell"},
        ... many many more entries in between ...
        {"author": "James Paterson", "title": "The Fight for the Standard"},
        {"author": "D. M. Moir", "title": "Catching a Tartar"},
    ]
}

Notice that it followed our example perfectly. It also got rid of the pesky extra quotes and underscores that appeared on the entries. This is just 3.5 Turbo, we haven’t even tried GPT-4 Turbo yet!

If you do have something harder to parse, try GPT-4 Turbo, and it will do a better job. But in this case, 3.5 Turbo was more than enough to get the job done.

So yeah, that’s JSON mode, pretty darn cool and useful.

Have ChatGPT extract structured data for you from any text, and return it in an object format that doesn’t require any complex parsing, or even use ChatGPT as a parser without having to write a real parser to account for all the edge cases.

It’s pretty clever at handling even unforeseen edge cases as long as you provide a solid example of the end output you want.

The Seed Parameter

Go ahead and save and close this file and now let’s look at the Seed parameter. Create a new file called seed_param.py:

FINX_OPENAI_UPDATES (root project folder)
    1_Parallel_function_calling
    2_JSON_mode_and_seeds
        chapters.py
        json_mode.py
        seed_param.py
    .env

Now the idea behind seed parameters is of course that they can make some type of random generator predictable, provided you pass in the same seed, like generating the same Minecraft world by copying the seed from a friend.
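
The classic version of this idea can be sketched with Python's own random module: the same seed gives the same sequence of "random" numbers. The roll_dice function is just an illustration.

```python
import random


def roll_dice(seed: int, count: int = 5) -> list[int]:
    """Roll `count` dice using a generator initialised with `seed`."""
    rng = random.Random(seed)  # independent generator, does not touch global state
    return [rng.randint(1, 6) for _ in range(count)]


first = roll_dice(2424)
second = roll_dice(2424)

print(first)
print(second)
print(first == second)  # same seed, same rolls
```

For a local pseudo-random generator like this, the repeatability is exact, which is not quite the case for a remote language model.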

While ChatGPT can now use a seed parameter, the very nondeterministic nature of ChatGPT means that it’s not quite a 100% guarantee, but the answers are definitely more similar and predictable than without a seed, so let’s check it out.

Inside the seed_param.py file go ahead and start with our imports and basic setup:

from decouple import config
from openai import OpenAI

client = OpenAI(api_key=config("OPENAI_API_KEY"))

This should be fairly familiar by now.

Now let’s code up a very simple printing utility to help us clean our code by cutting out the repetitive stuff:

def consistency_printer(response):
    response_content = response.choices[0].message.content
    system_fingerprint = response.system_fingerprint
    print(f"\033[94m {response_content} \033[0m")
    print(f"\033[92m {system_fingerprint} \033[0m")

What this function will do is receive the response we get from ChatGPT, extract the message’s content and the system fingerprint, and print them to the console in blue and green, respectively.

So what is the system fingerprint?

The system fingerprint, as the name implies, identifies the exact backend configuration that the model works with. This system fingerprint will change if you change the request parameters or if OpenAI updates the models in some way behind the scenes, which is likely to happen a couple of times per year.

If these fingerprints are the same, therefore, it means that both your configuration and the remote configuration are the same between both requests.

When we make consecutive requests in a moment, you’ll notice this fingerprint is basically always the same, but if you have a model run for months it is likely the backend configuration on OpenAI’s end will change at some point, which will affect determinism and therefore the output.

Simply said, as long as the fingerprint and the seed remain the same between calls, the output should be similar or even the same.
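
A quick way to apply this rule in code is to compare the fingerprints of the responses you want to treat as comparable. The sketch below uses simple stand-in objects instead of real API responses, and same_backend is a hypothetical helper name.

```python
from types import SimpleNamespace


def same_backend(responses) -> bool:
    """True when every response reports the same system_fingerprint."""
    fingerprints = {r.system_fingerprint for r in responses}
    return len(fingerprints) == 1


# Stand-ins for API responses; only the attribute we read is mocked here.
a = SimpleNamespace(system_fingerprint="fp_eeff13170a")
b = SimpleNamespace(system_fingerprint="fp_eeff13170a")
c = SimpleNamespace(system_fingerprint="fp_a24b4d720c")

print(same_backend([a, b]))     # same fingerprint
print(same_backend([a, b, c]))  # mixed fingerprints
```

When this returns False, differences in output may come from the backend rather than from your prompt or seed.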

Bedtime Stories

So let’s code up a very simple function that outputs something very nondeterministic, like bedtime stories!

def bedtime_stories(query, seed=None, model="gpt-3.5-turbo-1106"):
    messages = [
        {
            "role": "system",
            "content": "You make up fun children's stories according to the user request. The stories are only 100 characters long.",
        },
        {"role": "user", "content": query},
    ]

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        seed=seed,
        temperature=0.7,
        stop=["\n"],
    )
    consistency_printer(response)

We set up a very simple system message and then pass in the user query in the second message entry.

We call the GPT-3.5 Turbo model, again making sure to use the new 1106 version as older models don’t support the seed parameter, and we pass in the messages and the seed.

We also set the temperature to 0.7 and the stop parameter to a newline character so we don’t get a huge wall of text.

The stop parameter simply means that the model will stop generating text when it encounters a newline character, limiting the length of the output we need to compare.
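
The effect of stop can be mimicked locally to make it concrete. This sketch only imitates the result on a finished string; with the real API the generation itself halts on the server, and apply_stop is a made-up name for illustration.

```python
def apply_stop(text: str, stop: list[str]) -> str:
    """Mimic the effect of `stop`: cut the text at the earliest stop sequence."""
    cut = len(text)
    for sequence in stop:
        index = text.find(sequence)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]  # the stop sequence itself is not included


story = "Once upon a time, a unicorn flew to the moon.\nThe next day, she..."
print(apply_stop(story, ["\n"]))
```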

Testing the Seed Parameter with Bedtime Stories

Now let’s add a loop that runs 3 calls without a seed parameter:

for i in range(3):
    bedtime_stories(
        "Tell me a story about a unicorn in space.",
    )

Go ahead and run it.

Note how the unicorn has a different name in every single story, and the stories are quite different:

Once upon a time, a unicorn named Luna soared through the galaxy, spreading stardust and kindness wherever she went.
fp_eeff13170a

Once upon a time, a brave unicorn named Stardust soared through the galaxy, spreading magic and joy to all the stars.
fp_eeff13170a

Once upon a time, a unicorn named Nova flew through space, sprinkling stardust and bringing light to dark corners.
fp_eeff13170a

Now change the loop like this, and run it again:

for i in range(3):
    bedtime_stories(
        "Tell me a story about a unicorn in space.",
        seed=2424,
    )

Note that the seed can be an arbitrary number, we chose 2424 at random. If we run this we get:

Once upon a time, a magical unicorn flew through space, sprinkling stardust on planets and making new friends.
fp_eeff13170a

Once upon a time, a magical unicorn flew through space, sprinkling stardust on planets and making new friends.
fp_eeff13170a

Once upon a time, a magical unicorn soared through space, sprinkling stardust on planets and granting wishes to lonely stars.
fp_eeff13170a

We can see they are not quite the same. The first and second ones are identical, but the third, while similar, is different. If you run this several times you’ll sometimes get 3 identical outputs, and sometimes they’ll all be different.

This is because the seed parameter is not a 100% guarantee, but it does make the output more consistent and similar.

You might think that the temperature setting of 0.7 is the culprit, but this is not the problem. Setting it to 0 does not make much difference in this case.

If we swap out our function’s default 3.5 Turbo model for GPT-4 Turbo (more on GPT-4 Turbo in the next part):

for i in range(3):
    bedtime_stories(
        "Tell me a story about a unicorn in space.",
        seed=2424,
        model="gpt-4-1106-preview",
    )

We see a similar story:

Star Unicorn zooms, finds a comet friend. Together, they race across the Milky Way! 
fp_a24b4d720c

Star Unicorn zooms, finds a comet friend. Cosmic races begin!
fp_a24b4d720c

Star Unicorn zooms, finds a comet friend. Together, they race across the Milky Way! 
fp_a24b4d720c

Very similar, and the unicorn has the same name, but the last part is different in the middle generation. Just know that the seed parameter provides no guarantees.

Fruitclopedia, More Deterministic Questions

So let’s try with something a little more stable, like fruits.

Where children’s stories can be about literally anything, so there is no fixed definition of what ChatGPT should output, fruits are quite predictable. Asking about a pineapple is a very concrete question that is not open to artistic interpretation as to what the answer should be.

We have a very basic function, just copy this:

def fruit_gpt(query, seed=None, temperature=0.2):
    messages = [
        {
            "role": "system",
            "content": "You are the fruitclopedia. Users name a fruit and you give information.",
        },
        {"role": "user", "content": query},
    ]

    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=messages,
        seed=seed,
        temperature=temperature,
        stop=["\n"],
    )
    consistency_printer(response)

It is basically the same as before, but the temperature defaults to 0.2 for this one.

We still use the stop parameter to limit the output length to one paragraph, so when the model inserts a newline to go to the next paragraph, it will stop generating text as it hits our stop condition.

Testing the Seed Parameter with Fruitclopedia

Running this without a seed:

for i in range(3):
    fruit_gpt(
        "Grapefruit.",
        temperature=0,
    )

And we can interestingly see that they start quite the same but then diverge:

Grapefruit is a subtropical citrus fruit known for its sour to semi-sweet taste. It is a hybrid of the sweet orange and the pomelo. Grapefruits are rich in vitamins C and A, and they also contain fiber and antioxidants. They are often enjoyed fresh, juiced, or added to salads and desserts. There are different varieties of grapefruit, including white, pink, and red, each with its own unique flavor profile.
fp_eeff13170a

Grapefruit is a subtropical citrus fruit known for its sour to semi-sweet taste. It is a hybrid of the sweet orange and the pomelo, and it is typically larger than an orange with a thicker rind. Grapefruits are rich in vitamins C and A, as well as antioxidants. They are often enjoyed fresh, juiced, or added to salads and desserts. There are different varieties of grapefruit, including white, pink, and red, each with its own unique flavor profile.
fp_eeff13170a

Grapefruit is a subtropical citrus fruit known for its sour to semi-sweet taste. It is a hybrid of the pomelo and the sweet orange. Grapefruits are rich in vitamins C and A, as well as dietary fiber. They are often enjoyed fresh, juiced, or added to salads and desserts. There are different varieties of grapefruit, including white, pink, and red, each with its own unique flavor profile.
fp_eeff13170a

This is not so much because we set the temperature to 0 but more because our question is much more specific.

“Tell me a children’s story about a unicorn” could have a million answers, all of which are correct. The number of correct answers for basic info about pineapples is limited.

So let’s try this with a seed, which is where the seed parameter really shines:

for i in range(3):
    fruit_gpt(
        "Grapefruit.",
        seed=123,
        temperature=0,
    )

As you can see below, the answers are now 100% identical!

Grapefruit is a subtropical citrus fruit known for its slightly bitter and sour taste. It is a hybrid of the pomelo and the sweet orange. Grapefruits are rich in vitamins C and A, as well as dietary fiber. They are often enjoyed fresh, juiced, or added to fruit salads. There are different varieties of grapefruit, including white, pink, and red, each with its own unique flavor profile.
fp_eeff13170a

Grapefruit is a subtropical citrus fruit known for its slightly bitter and sour taste. It is a hybrid of the pomelo and the sweet orange. Grapefruits are rich in vitamins C and A, as well as dietary fiber. They are often enjoyed fresh, juiced, or added to fruit salads. There are different varieties of grapefruit, including white, pink, and red, each with its own unique flavor profile.
fp_eeff13170a

Grapefruit is a subtropical citrus fruit known for its slightly bitter and sour taste. It is a hybrid of the pomelo and the sweet orange. Grapefruits are rich in vitamins C and A, as well as dietary fiber. They are often enjoyed fresh, juiced, or added to fruit salads. There are different varieties of grapefruit, including white, pink, and red, each with its own unique flavor profile.
fp_eeff13170a

However, remember that this is not 100% guaranteed! You may still see occasional variation if you run this multiple times. If you use this to write tests for your application, make sure to also check the system fingerprint (the fp_... value printed under each answer), because if OpenAI updates the system configuration on their end, the output can change. It also helps to make multiple calls and pass the test if one of them matches.
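One way to turn this advice into a test helper is sketched below. It is only an illustration: seeded_output_matches and the call_model callable are hypothetical names, and call_model stands in for whatever wrapper you write that returns the response text together with its system fingerprint.

```python
from typing import Callable, Tuple


def seeded_output_matches(
    call_model: Callable[[], Tuple[str, str]],
    expected_text: str,
    expected_fingerprint: str,
    attempts: int = 3,
) -> bool:
    """Return True if any attempt reproduces the expected text.

    call_model is a hypothetical wrapper returning (text, fingerprint).
    A different fingerprint means the backend configuration changed, so the
    stored expectation is stale and should be refreshed.
    """
    for _ in range(attempts):
        text, fingerprint = call_model()
        if fingerprint != expected_fingerprint:
            raise RuntimeError("System fingerprint changed; refresh the expected output.")
        if text == expected_text:
            return True
    return False
```

Raising on a fingerprint change, instead of silently failing, makes it obvious that the stored answer needs updating and that your application code is probably not at fault.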

So yeah, that’s the seed parameter.

Pretty reliable but not guaranteed, as long as you ask somewhat focused questions. If you ask something very open-ended, the answers will still be more similar than without a seed, but the effect is weaker.

That’s it for part 2. In the next part, we’ll look at GPT-4 Turbo and its really exciting new abilities like vision! See you there!

Full Course: OpenAI API Mastery: Innovating with GPT-4 Turbo, Text-to-Speech (TTS), and DALL·E 3

The post OpenAI JSON Mode & DALL·E 3 Seeds (Examples) appeared first on Be on the Right Side of Change.

OpenAI Parallel Function Calling (WeatherGPT Example)

Dirk van Meerveld — Mon, 27 Nov 2023 21:48:19 +0000

Welcome to part 1 of the course! My name is Dirk van Meerveld and I will be your host and guide for this series in which we will be exploring all the new features of the OpenAI APIs and what we can do with them.

To get started, we’re going to be looking at the OpenAI function calling updates, especially the new ability to call multiple functions in parallel. We’ll also discuss some of the important syntax changes to go along with this and other new functionality in the API.

This article originally appeared on the Finxter Academy for premium members (including course lesson video). Check out the video course here.

Let’s create a new folder and file in our base directory to get started.

FINX_OPENAI_UPDATES (root project folder)
    1_Parallel_function_calling
        function_descriptions.py

Function Descriptions

Open up function_descriptions.py.

Here we’ll describe the functions. Each description tells ChatGPT what a function is called, what it does, and what parameters it needs as input. Notice that this is all text.

The only purpose of these objects is for ChatGPT to know what functions are available, when it should use a particular function, and what arguments it needs to provide to call a specific function. As such, they are not the functions themselves, which we have separately, but merely a description of the functions.

Let’s get started:

describe_get_current_weather = {
    "type": "function",
    "function": {

Note the syntax has slightly changed from what function calling used to be. We now wrap the entire object inside a "function" key and also have a "type": "function" key-value pair on the outermost level.

        "name": "get_current_weather",
        "description": "This function provides the current weather in a specific location.",

The name we provide here is the name that ChatGPT will use when it wants to call this particular function. The description tells ChatGPT what the purpose of the function is, and therefore when it should call it.

        "parameters": {
            "type": "object",

Here, you describe what parameters this function needs to be able to run.

The overall parameters are an object, and as properties, it needs a location which is of type string. We also provide a description of what this parameter should contain, namely the name of a city.

Note the required key, which is an array with the names of the required parameters (you can list multiple parameter names here).

            "properties": {
                "location": {
                    "type": "string",
                    "description": "The location as a city name, e.g. Amsterdam.",
                },
            },
            "required": ["location"],
        },
    },
}

So the whole description is:

describe_get_current_weather = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "This function provides the current weather in a specific location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The location as a city name, e.g. Amsterdam.",
                },
            },
            "required": ["location"],
        },
    },
}
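Because these descriptions are plain dictionaries, a typo in a key only shows up once the API rejects or ignores it. The sketch below is a small sanity check for the conventions used in this tutorial. It is not the API’s full schema validation, and check_tool_description is a hypothetical helper name.

```python
def check_tool_description(description: dict) -> list:
    """Return a list of problems found in a tool description (empty if none)."""
    problems = []
    if description.get("type") != "function":
        problems.append('outer "type" should be "function"')
    function = description.get("function", {})
    for key in ("name", "description", "parameters"):
        if key not in function:
            problems.append(f'missing "{key}" inside "function"')
    parameters = function.get("parameters", {})
    properties = parameters.get("properties", {})
    # Every required parameter must also be described under "properties".
    for required_name in parameters.get("required", []):
        if required_name not in properties:
            problems.append(f'required parameter "{required_name}" is not in "properties"')
    return problems


example = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "This function provides the current weather in a specific location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}

print(check_tool_description(example))  # []
```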

Now we have the second one, which is much the same:

describe_get_weather_forecast = {
    "type": "function",
    "function": {
        "name": "get_weather_forecast",
        "description": "This function provides the weather forecast in a specific location for a specified number of days.",

Here, we have multiple parameters. Note that only one of them is required.

        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The location as a city name, e.g. Amsterdam.",
                },
                "days": {
                    "type": "integer",
                    "description": "The number of days to forecast, between 1 and 14.",
                },
            },
            "required": ["location"],
        },
    },
}

The entire function_descriptions.py file now looks like this:

describe_get_current_weather = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "This function provides the current weather in a specific location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The location as a city name, e.g. Amsterdam.",
                },
            },
            "required": ["location"],
        },
    },
}


describe_get_weather_forecast = {
    "type": "function",
    "function": {
        "name": "get_weather_forecast",
        "description": "This function provides the weather forecast in a specific location for a specified number of days.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The location as a city name, e.g. Amsterdam.",
                },
                "days": {
                    "type": "integer",
                    "description": "The number of days to forecast, between 1 and 14.",
                },
            },
            "required": ["location"],
        },
    },
}

Prompt Setup

Ok, go ahead and close that and create another file in the '1_Parallel_function_calling' folder called 'prompt_setup.py'. This is where we’ll set up the prompt for ChatGPT to use.

FINX_OPENAI_UPDATES (root project folder)
    1_Parallel_function_calling
        function_descriptions.py
        prompt_setup.py

Inside, put the following variable:

current_and_forecast_setup = "You are a regular ChatGPT chatbot, just like normal, however, you also have access to some functions that can be called if you need them. One will provide the current weather and one will provide the weather forecast. IF THE USER DOES NOT ASK A WEATHER RELATED QUESTION JUST ANSWER THEM AS NORMAL WITHOUT CALLING ANY FUNCTIONS."

This is just a basic prompt setup telling the model it has functions available but also emphasizing that we don’t want to use them if they are not needed to answer the question. You can always play around with the specific wording and details of this prompt to see what works best for you.

We put it in a separate file to keep large string variables out of the main code and keep that code readable. In a larger project, the setup would likely be longer and have several versions.

Weather API

Now save and close that file as well. It’s time to create the actual functions that we’re going to be giving to ChatGPT to call. Create a new file in the same folder called 'weather.py':

FINX_OPENAI_UPDATES (root project folder)
    1_Parallel_function_calling
        function_descriptions.py
        prompt_setup.py
        weather.py

First, sign up for a free account on weatherapi.com.

They will give you the pro plan free for 14 days, after which your account automatically switches back to the free plan. You don’t have to provide any payment or credit card information, so you can use this API for free without any hassle.

Now create a '.env' file in the base directory of your project:

FINX_OPENAI_UPDATES (root project folder)
    1_Parallel_function_calling
        function_descriptions.py
        prompt_setup.py
        weather.py
    .env

And inside this file put both your weatherapi API key and OpenAI API key using the following syntax, making sure not to use quotes or spaces:

OPENAI_API_KEY=yoursupersecretopenaiapikeygoeshere
WEATHER_API_KEY=yoursupersecretweatherapikeygoeshere

Close and save that file so we can load our secret API keys from this file later. Now open a terminal and run the following command:

pip install python-decouple

Writing the Functions

The python-decouple library we just installed will allow us to load our API keys from the .env file we created. Now open up weather.py (which is still empty) and put the following code inside:

from decouple import config
from json import dumps
import requests

The config function will allow us to easily read the content of our .env file, so we can load our API keys without hard-coding their values in our code. The json module is a part of Python’s standard library and provides methods for working with JSON data.

The dumps function is used to convert a Python object into a JSON string, which basically holds the same information but in a string format. This is useful as ChatGPT cannot take Python objects as input, but it can take strings.

Finally, the requests module is a Python library used for making HTTP requests, providing us with a simple API. We’ll use it to send requests to the weatherapi.com API.
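To make the dumps step concrete, here is a tiny standalone example. The weather dictionary is made-up illustration data, not real API output.

```python
from json import dumps, loads

# Made-up data standing in for an API response.
weather = {"city": "Amsterdam", "temp_c": 11.5, "raining": False}

as_text = dumps(weather)

print(type(as_text).__name__)  # str
print(as_text)  # {"city": "Amsterdam", "temp_c": 11.5, "raining": false}

# loads reverses the conversion, giving back an equal dictionary.
print(loads(as_text) == weather)  # True
```

Note how Python’s False becomes JSON’s false in the string version.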

Now we define a simple function below:

def get_current_weather(location) -> str:
    if not location:
        return (
            "Please provide a location and call the get_current_weather function again."
        )
    API_params = {
        "key": config("WEATHER_API_KEY"),
        "q": location,
        "aqi": "no",
        "alerts": "no",
    }

We use the config function to load the API key from the .env file. Make sure the name matches exactly and that the line in the .env file does not contain any spaces.

The q parameter holds the location. We set aqi (air quality index) and alerts to "no" because we don’t need that data.

Continue inside the function:

    response: requests.models.Response = requests.get(
        "http://api.weatherapi.com/v1/current.json", params=API_params
    )
    str_response: str = dumps(response.json())
    return str_response

We make a get request, passing in our URL and parameters, and get a response object, which contains the server’s response.

We then convert the response to a dictionary by calling the .json method and convert this dictionary to a string using the dumps function we imported above.

This is the whole function:

def get_current_weather(location) -> str:
    if not location:
        return (
            "Please provide a location and call the get_current_weather function again."
        )
    API_params = {
        "key": config("WEATHER_API_KEY"),
        "q": location,
        "aqi": "no",
        "alerts": "no",
    }
    response: requests.models.Response = requests.get(
        "http://api.weatherapi.com/v1/current.json", params=API_params
    )
    str_response: str = dumps(response.json())
    return str_response

Testing

Give it a quick test run to make sure it’s working. Add the print statement below and run your file:

print(get_current_weather("Seoul"))

You should see something like this in your terminal:

{"location": {"name": "Seoul", "region": "", "country": "South Korea", "lat": 37.57, "lon": 127.0, "tz_id": "Asia/Seoul", "localtime_epoch": 1699705164, "localtime": "2023-11-11 21:19"}, "current": {"last_updated_epoch": 1699704900, "last_updated": "2023-11-11 21:15", "temp_c": 1.0, "temp_f": 33.8, "is_day": 0, "condition": {"text": "Clear", "icon": "//cdn.weatherapi.com/weather/64x64/night/113.png", "code": 1000}, "wind_mph": 6.9, "wind_kph": 11.2, "wind_degree": 330, "wind_dir": "NNW", "pressure_mb": 1029.0, "pressure_in": 30.39, "precip_mm": 0.0, "precip_in": 0.0, "humidity": 55, "cloud": 0, "feelslike_c": -3.1, "feelslike_f": 26.3, "vis_km": 10.0, "vis_miles": 6.0, "uv": 1.0, "gust_mph": 12.1, "gust_kph": 19.4}}

Make sure you comment out the print statement so it won’t run every time we import this file in the future.

Writing the Second Function

Now we’ll create a second function to get the weather forecast.

This one is a bit more complicated as we need to provide a number of days to forecast. We’ll also need to do some error handling to make sure the user provides a valid number of days.

def get_weather_forecast(location, days=7) -> str:
    try:
        days = 1 if days < 1 else 14 if days > 14 else days
    except TypeError:
        days = 7

We take a location and set a default of 7 days.

If the days variable is less than 1 we set it to 1, but if it’s more than 14 we set it to 14. If neither condition is true the user provided a valid value and we just use the input argument value.

Finally, if some weird type gets passed in we just default to 7 days.
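The clamping logic is easy to try in isolation. The sketch below pulls it into a standalone helper (clamp_days is a hypothetical name used only for this illustration) so you can see what happens for a few different inputs.

```python
def clamp_days(days, default=7):
    """Clamp days to the 1-14 range; fall back to default for non-numeric input."""
    try:
        return 1 if days < 1 else 14 if days > 14 else days
    except TypeError:
        # Comparing e.g. a string or None with an int raises TypeError.
        return default


print([clamp_days(d) for d in (0, 3, 30, "3", None)])  # [1, 3, 14, 7, 7]
```

Note that a numeric string like "3" is not converted. It triggers the TypeError and falls back to the default of 7.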

    params = {
        "key": config("WEATHER_API_KEY"),
        "q": location,
        "days": days,
        "aqi": "no",
        "alerts": "no",
    }

    response: requests.models.Response = requests.get(
        "http://api.weatherapi.com/v1/forecast.json", params=params
    )

The parameters are largely the same, except that we now also pass a number of days. The only problem is that the API sends back a lot of data, including hourly data with 24 entries per day. That is way too much, so we need to do some filtering:

    response: dict = response.json()
    filtered_response = {}
    filtered_response["location"] = response["location"]
    filtered_response["current"] = response["current"]
    filtered_response["forecast"] = [
        [day["date"], day["day"]] for day in response["forecast"]["forecastday"]
    ]
    return dumps(filtered_response)

First convert the response to a dictionary. Keep the location and the current weather by copying them from the response to the empty dictionary named filtered_response we just created.

For the forecast, we only want the daily data, as the hourly data would completely overload the response. The list comprehension extracts only the data we want and is based on the structure of the response from the API.

I don’t want to get too deeply into it here as this course is on OpenAI and not on list comprehensions but basically, we extract the date and day data from each day in the forecast and put it in a list.

Finally, we convert the filtered_response dictionary to a string and return it, without all the hourly data that the API sent to us.
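If you want to see the filtering without calling the API, the sketch below runs the same steps on a made-up, heavily shortened stand-in for the response. The values and the keys inside each entry are placeholders for illustration only.

```python
from json import dumps

# Made-up stand-in for a forecast response (not real data).
fake_response = {
    "location": {"name": "Seoul"},
    "current": {"temp_c": 1.0},
    "forecast": {
        "forecastday": [
            {"date": "2023-11-11", "day": {"maxtemp_c": 5.0}, "hour": [{"temp_c": 1.0}] * 24},
            {"date": "2023-11-12", "day": {"maxtemp_c": 7.0}, "hour": [{"temp_c": 2.0}] * 24},
        ]
    },
}

filtered = {
    "location": fake_response["location"],
    "current": fake_response["current"],
    # Keep only the date and the daily summary; the hourly list is dropped.
    "forecast": [
        [day["date"], day["day"]] for day in fake_response["forecast"]["forecastday"]
    ],
}

print(dumps(filtered["forecast"]))
# [["2023-11-11", {"maxtemp_c": 5.0}], ["2023-11-12", {"maxtemp_c": 7.0}]]
```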

The second function now looks like this:

def get_weather_forecast(location, days=7) -> str:
    try:
        days = 1 if days < 1 else 14 if days > 14 else days
    except TypeError:
        days = 7

    params = {
        "key": config("WEATHER_API_KEY"),
        "q": location,
        "days": days,
        "aqi": "no",
        "alerts": "no",
    }

    response: requests.models.Response = requests.get(
        "http://api.weatherapi.com/v1/forecast.json", params=params
    )

    response: dict = response.json()
    filtered_response = {}
    filtered_response["location"] = response["location"]
    filtered_response["current"] = response["current"]
    filtered_response["forecast"] = [
        [day["date"], day["day"]] for day in response["forecast"]["forecastday"]
    ]
    return dumps(filtered_response)

Give it a test run:

print(get_weather_forecast("Seoul", days=3))

And you should get a fairly large output in your terminal. Again, make sure you comment out the print statement so it won’t run every time we import this file in the future.

Parallel Function Calling

Ok go ahead and close your weather.py file. We’re done with it for now. Now we’ll create a new file called 'parallel_function_calling.py' in the same folder:

FINX_OPENAI_UPDATES (root project folder)
    1_Parallel_function_calling
        function_descriptions.py
        prompt_setup.py
        weather.py
        parallel_function_calling.py
    .env

Important! Before we get started make sure you run this in a terminal window:

pip install openai --upgrade

This gets the latest version of the openai library, so that your syntax is the same as mine. There are quite a few differences between the old and new versions, which we’ll be going over in the coming parts.

Recommended: How to Install OpenAI in Python?

Open the parallel_function_calling.py file and let’s have some fun!

import json
from decouple import config
from openai import OpenAI
from typing import Callable

We import the built-in json module to work with JSON data, config to load our OpenAI API key from the .env file, and OpenAI to access the API.

Note that the syntax is different. Where we would just import 'openai' itself in the past, in this new version of the openai library we need to import the OpenAI class instead. The 'Callable' from typing is only used for a type hint later on in our code.

from weather import get_current_weather, get_weather_forecast
from prompt_setup import current_and_forecast_setup
from function_descriptions import (
    describe_get_current_weather,
    describe_get_weather_forecast,
)

Here we just import our own stuff we prepared ahead of time.

MODEL = "gpt-3.5-turbo-1106"
client = OpenAI(api_key=config("OPENAI_API_KEY"))

We define the model up top, and then create a 'client' by calling the OpenAI class we imported and passing in the api_key, which we load from the .env file using config.

This is part of the new standard syntax. From here on, we will interact with this client object to make API calls to OpenAI’s various API endpoints.

Utility Printer Function

Now create a quick utility to print the output in a more readable manner:

def quick_dirty_printer(messages):
    """
    Prints messages in alternating colors (irrespective of role) and the final message in green. (92 is green, 93 is yellow, 94 is blue)
    """
    for index, message in enumerate(messages):
        if index == len(messages) - 1:
            print(f"\033[92m {message} \033[0m")
        elif index % 2 == 0:
            print(f"\033[93m {message} \033[0m")
        else:
            print(f"\033[94m {message} \033[0m")

This function takes a list of messages and then loops over each index and message in the messages. If the index is the last one, it prints the message in green, otherwise, it prints it in alternating yellow and blue colors using the remainder operator to distinguish odd and even indexes.

This is just a quick and dirty way to make the output more readable.

The \033[92m part is an ANSI color code, which is a special character sequence that tells the terminal to change the color of the text. The \033[0m part resets the color back to the default.
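As a small standalone illustration of these escape sequences, the hypothetical colorize helper below wraps a string in a color code and the reset code. Whether the colors actually show up depends on your terminal supporting ANSI codes.

```python
GREEN, YELLOW, BLUE = 92, 93, 94
RESET = "\033[0m"


def colorize(text: str, color_code: int) -> str:
    """Wrap text in an ANSI color escape sequence followed by a reset."""
    return f"\033[{color_code}m{text}{RESET}"


print(colorize("final answer", GREEN))
print(colorize("even message", YELLOW))
print(colorize("odd message", BLUE))
```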

GPT Function

Now let’s start on our GPT function:

def ask_weather_gpt(query, message_history=None, simulate_failure=False):
    need_to_fail_once = simulate_failure
    messages = []

We’re going to take a query as input, and optionally a message history. So if we want to call this function a second time with an already established history of the messages sent between ChatGPT and the user, we can pass that existing message history in.

I’m also going to add a simulate_failure parameter, because we’re going to build in a fallback in case ChatGPT fails somehow. We want to be able to test this fallback, and since real failures are not that likely, a very simple simulate_failure switch lets us check that our fail-safes are working.

Then we have the variable need_to_fail_once, which is just a boolean copy of whatever was passed in for simulate_failure.

And then we create messages. This is a list that will hold all the messages: perhaps the system message first, which tells ChatGPT that it is a helpful assistant that’s supposed to do this or that, then the user message with a query, then the assistant message from ChatGPT coming back to us with an answer. All of these messages are appended to this list.

So, every single time ChatGPT sends us a response, we will append it to this message history list.

As we’re going to be doing this several times over, and there’s also currently a small bug that we want to avoid, we’re going to create a small inner function that we can call every single time we want to append something to this list:

    def handle_and_append_response(response):
        """
        Appends message to history and extracts the message,
        prevents a current bug by explicitly setting .content and .function_call
        """
        response_message = response.choices[0].message
        if response_message.content is None:
            response_message.content = ""
        if response_message.function_call is None:
            del response_message.function_call
        messages.append(response_message)
        return response_message

This may look a little bit confusing, but let’s go over it. This function is going to handle and append a response. When we make a call to ChatGPT and we get a response in return, we just pass the response into this function.

This inner handle_and_append_response function is going to append the message to the history, that is, the 'messages' list.

First, we extract the message from the response and save it as 'response_message'. Then we’re going to prevent a current bug by explicitly setting the .content and .function_call.

So if the response message’s .content is None, which is the case when ChatGPT tries to call a function, we set the .content to an empty string.

Now, why do we do this?

There’s currently a bug that if you append a message to the messages history, and then you send this back to ChatGPT, it’s going to complain that there’s no message.content. So this is kind of a bug that we’re circumventing by making sure this key exists, even though it’s an empty string.

The same goes for the second one, if response_message.function_call is None, then we’re just going to get rid of this particular key, just to make sure it doesn’t bug out on us. If you’re watching this in the future, they may have actually fixed this so you can try removing these lines later on.
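The same clean-up idea can be illustrated on a plain dictionary. This is only a sketch under the assumption that you have the assistant message available as a dict. The code in this tutorial works on the response object directly, and clean_assistant_message is a hypothetical helper name.

```python
def clean_assistant_message(message: dict) -> dict:
    """Return a copy with empty-string content and no None-valued function_call."""
    cleaned = {
        key: value
        for key, value in message.items()
        if not (key == "function_call" and value is None)
    }
    if cleaned.get("content") is None:
        cleaned["content"] = ""
    return cleaned


raw = {"role": "assistant", "content": None, "function_call": None, "tool_calls": ["..."]}
print(clean_assistant_message(raw))
# {'role': 'assistant', 'content': '', 'tool_calls': ['...']}
```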

Then we’re just going to take the messages and append our response message with these small edits and also return the response_message from the function.

Now we’re outside of the inner function again:

def ask_weather_gpt(...)
    ...
    def handle_and_append_response(...)
        ...
        ...

    # continue down here outside the inner function
    if message_history:
        messages = message_history
    else:
        messages = [
            {"role": "system", "content": current_and_forecast_setup},
            {"role": "user", "content": query},
        ]

If we passed in a message_history as an argument when calling the function, we’re going to use that as the message_history.

Otherwise, we’ll define a basic message history with a system message containing our prompt from prompt_setup.py and the user query in the second message.

    tools = [
        describe_get_current_weather,
        describe_get_weather_forecast,
    ]

Now we create a list of 'tools'.

Notice OpenAI has adopted the LangChain tool naming convention. This is a list of the descriptions of the functions from function_descriptions.py, and not the actual functions themselves!

    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        tools=tools,
        tool_choice="auto",
    )

Now we make a call to ChatGPT, passing in our model, messages, and tools. We set the tool choice to 'auto' to let ChatGPT decide if and which function(s) it should call. You can force a call to one specific function by passing an object that names that function here instead of 'auto'.
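For reference, forcing one particular function means passing a small dictionary as tool_choice instead of the string "auto". The force_tool helper below is a hypothetical convenience wrapper that just builds that dictionary.

```python
def force_tool(function_name: str) -> dict:
    """Build a tool_choice value that requires a call to the named function."""
    return {"type": "function", "function": {"name": function_name}}


tool_choice = force_tool("get_current_weather")
print(tool_choice)
# {'type': 'function', 'function': {'name': 'get_current_weather'}}
```

You would then pass tool_choice=tool_choice in the create call. Forcing a tool is mostly useful for testing, as the model can then no longer answer non-weather questions normally.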

Now we’re going to handle the response using our inner function:

    response_message = handle_and_append_response(response)

Which means it’s now also in our messages list.

    while response_message.tool_calls:
        tool_calls = response_message.tool_calls
        available_functions = {
            "get_current_weather": get_current_weather,
            "get_weather_forecast": get_weather_forecast,
        }

We open a while loop. As long as ChatGPT wants to call a function, the response_message will have a .tool_calls attribute, which is a list of the functions it wants to call.

So, while ChatGPT wants to call functions we will run this loop. We save this list as tool_calls. Then we define a simple dictionary of available functions, mapping the function names we gave to ChatGPT to the actual functions we defined in weather.py.

        try:
            if need_to_fail_once:
                need_to_fail_once = False
                raise Exception("Simulating failure")
            for call in tool_calls:
                func_name: str = call.function.name
                func_to_call: Callable = available_functions[func_name]
                func_args: dict = json.loads(call.function.arguments)
                func_response = func_to_call(**func_args)

                messages.append(
                    {
                        "tool_call_id": call.id,
                        "role": "tool",
                        "name": func_name,
                        "content": func_response,
                    }
                )

From here on, we work inside a try/except block. Remember that ChatGPT is generating the function names and input arguments at this point. If it makes any name or syntax mistakes our function might blow up, which is why we use try and except to catch any errors and handle them.

First, if the need_to_fail_once variable is set to true, we simulate failure by raising an exception. We also make sure to set the variable to false so we only raise an exception once. By raising an exception we force the except block to run so we can test out our fail-safe code.

Then we loop over each call in the tool_calls list. We extract the function name from the call, and then we get the actual function from our available_functions dictionary.

We also extract the function arguments from the call and convert them from a string to a dictionary using json.loads.

We then call the function, unpacking the arguments dictionary into keyword arguments with the double-asterisk ** operator. Finally, we append a message to our messages list, containing the tool_call_id, the role, the name of the function, and the response from the function.

We will feed this message history back to ChatGPT again later and the call id helps ChatGPT discern which answer is related to which function call, as multiple functions are being called in parallel here.
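The dispatch loop can be tried without any API calls. In the sketch below, the tool calls are faked with SimpleNamespace objects that mimic the attributes used above (id, function.name, function.arguments), and the two weather functions are stubs. All of these stand-ins are assumptions for illustration only.

```python
import json
from types import SimpleNamespace


def fake_current_weather(location):
    return json.dumps({"location": location, "temp_c": 11.1})


def fake_forecast(location, days=7):
    return json.dumps({"location": location, "days": days})


available_functions = {
    "get_current_weather": fake_current_weather,
    "get_weather_forecast": fake_forecast,
}


def make_call(call_id, name, arguments):
    """Build an object shaped like the tool calls used in the loop above."""
    function = SimpleNamespace(name=name, arguments=arguments)
    return SimpleNamespace(id=call_id, function=function)


tool_calls = [
    make_call("call_1", "get_current_weather", '{"location": "Amsterdam"}'),
    make_call("call_2", "get_weather_forecast", '{"location": "Seoul", "days": 3}'),
]

messages = []
for call in tool_calls:
    func_args = json.loads(call.function.arguments)  # JSON string -> dict
    func_response = available_functions[call.function.name](**func_args)
    messages.append(
        {
            "tool_call_id": call.id,
            "role": "tool",
            "name": call.function.name,
            "content": func_response,
        }
    )

for message in messages:
    print(message["tool_call_id"], message["content"])
```

Each tool message carries the id of the call it answers, which is what lets two parallel calls be matched to their results.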

Now we go to the except block:

        except Exception:
            messages.pop()
            messages.append(
                {
                    "role": "system",
                    "content": "Based on the above information, please generate the appropriate tool calls with valid arguments per the schema provided.",
                }
            )
            return ask_weather_gpt(query, message_history=messages)

If we get an exception, we pop the last message from the messages list, and then we append a system message telling ChatGPT to generate the appropriate tool calls with valid arguments per the schema provided.

Then we call the ask_weather_gpt function again, passing in the query and the message history, which now contains the system message we just appended.

Basically what this comes down to is that ChatGPT generated faulty arguments, we popped this generation off the stack and put in a system message reminding ChatGPT to generate correct arguments. Then we return out of this function by calling the function itself again passing in our message history with the reminder.

This is by no means perfect error handling, but I just want to give you a starting point, an idea from which you can start to build your own error handling, without making this example too complex.
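One limitation worth knowing about: if the exception happens after a tool result has already been appended, a single pop() removes that tool message and not the assistant message that requested the calls. A possible refinement, sketched here with the hypothetical helper rollback_failed_tool_round, is to remove any trailing tool results first and then the assistant message.

```python
def rollback_failed_tool_round(messages: list) -> list:
    """Drop trailing tool results and the assistant message that requested them."""

    def role_of(message):
        # Messages may be plain dicts or objects with a .role attribute.
        return message["role"] if isinstance(message, dict) else message.role

    trimmed = list(messages)  # work on a copy, leave the input untouched
    while trimmed and role_of(trimmed[-1]) == "tool":
        trimmed.pop()
    if trimmed and role_of(trimmed[-1]) == "assistant":
        trimmed.pop()
    return trimmed


history = [
    {"role": "system", "content": "setup"},
    {"role": "user", "content": "Weather in Seoul and Amsterdam?"},
    {"role": "assistant", "content": ""},
    {"role": "tool", "content": "{...}"},
]
print(len(rollback_failed_tool_round(history)))  # 2
```

You would still append the reminder system message afterwards, and in a real application you would probably also want to cap the number of retries.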

Now outside the try/except block:

        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
        )

        response_message = handle_and_append_response(response)

        quick_dirty_printer(messages)
        return response_message

    quick_dirty_printer(messages)
    return response_message

We make a second request to ChatGPT passing in the message history which now contains all the responses from the functions we called.

We then handle the response using our inner function, append the response to the message history, print the messages using our quick and dirty printer utility function, and return the response message.

After that, we call the quick and dirty printer and return the response once more, but notice that these two lines sit one indentation level further out, outside the while loop. Because the loop body ends with a return, it runs at most once per call.

If the user asked a question that didn’t require a function call, we skip the whole while loop and directly print the messages and return the response message.

As this can be a bit confusing in snippets here is the whole function once more:

def ask_weather_gpt(query, message_history=None, simulate_failure=False):
    need_to_fail_once = simulate_failure
    messages = []

    def handle_and_append_response(response):
        """
        Appends message to history and extracts the message,
        prevents a current bug by explicitly setting .content and .function_call
        """
        response_message = response.choices[0].message
        if response_message.content is None:
            response_message.content = ""
        if response_message.function_call is None:
            del response_message.function_call
        messages.append(response_message)
        return response_message

    if message_history:
        messages = message_history
    else:
        messages = [
            {"role": "system", "content": current_and_forecast_setup},
            {"role": "user", "content": query},
        ]

    tools = [
        describe_get_current_weather,
        describe_get_weather_forecast,
    ]

    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        tools=tools,
        tool_choice="auto",
    )

    response_message = handle_and_append_response(response)

    while response_message.tool_calls:
        tool_calls = response_message.tool_calls
        available_functions = {
            "get_current_weather": get_current_weather,
            "get_weather_forecast": get_weather_forecast,
        }

        try:
            if need_to_fail_once:
                need_to_fail_once = False
                raise Exception("Simulating failure")
            for call in tool_calls:
                func_name: str = call.function.name
                func_to_call: Callable = available_functions[func_name]
                func_args: dict = json.loads(call.function.arguments)
                func_response = func_to_call(**func_args)

                messages.append(
                    {
                        "tool_call_id": call.id,
                        "role": "tool",
                        "name": func_name,
                        "content": func_response,
                    }
                )

        except Exception:
            messages.pop()
            messages.append(
                {
                    "role": "system",
                    "content": "Based on the above information, please generate the appropriate tool calls with valid arguments per the schema provided.",
                }
            )
            return ask_weather_gpt(query, message_history=messages)

        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
        )

        response_message = handle_and_append_response(response)

        quick_dirty_printer(messages)
        return response_message

    quick_dirty_printer(messages)
    return response_message

Running a Single Function Call

So let’s try it out! Add the following call at the bottom of your file and run it (the function prints the message history itself):

ask_weather_gpt("What's the weather in San Francisco?", simulate_failure=False)

You should see something like this in your terminal:

{'role': 'system', 'content': 'You are a regular ChatGPT chatbot, just like normal, however you also have access to some functions that can be called if you need them. One will provide the current weather and one will provide the weather forecast. IF THE USER DOES NOT ASK A WEATHER RELATED QUESTION JUST ANSWER THEM AS NORMAL WITHOUT CALLING ANY FUNCTIONS.'}

{'role': 'user', 'content': "What's the weather in San Francisco?"}

ChatCompletionMessage(content='', role='assistant', tool_calls=[ChatCompletionMessageToolCall(id='call_8oWdO9OoMqwXUp7QEE7kaaCX', function=Function(arguments='{"location":"San Francisco"}', name='get_current_weather'), type='function')])

{'tool_call_id': 'call_8oWdO9OoMqwXUp7QEE7kaaCX', 'role': 'tool', 'name': 'get_current_weather', 'content': '{"location": {"name": "San Francisco", "region": "California", "country": "United States of America", "lat": 37.78, "lon": -122.42, "tz_id": "America/Los_Angeles", "localtime_epoch": 1699771110, "localtime": "2023-11-11 22:38"}, "current": {"last_updated_epoch": 1699770600, "last_updated": "2023-11-11 22:30", "temp_c": 11.1, "temp_f": 52.0, "is_day": 0, "condition": {"text": "Clear", "icon": "//cdn.weatherapi.com/weather/64x64/night/113.png", "code": 1000}, "wind_mph": 2.2, "wind_kph": 3.6, "wind_degree": 10, "wind_dir": "N", "pressure_mb": 1019.0, "pressure_in": 30.1, "precip_mm": 0.0, "precip_in": 0.0, "humidity": 83, "cloud": 0, "feelslike_c": 11.5, "feelslike_f": 52.6, "vis_km": 16.0, "vis_miles": 9.0, "uv": 1.0, "gust_mph": 1.3, "gust_kph": 2.1}}'}

ChatCompletionMessage(content='The current weather in San Francisco is clear with a temperature of 52.0°F. The wind is blowing at 3.6 km/h from the north, and the humidity is at 83%.', role='assistant', tool_calls=None)

So first we have the system message we set up, followed by the user query. We can then see that ChatGPT requests a function call, passing us the arguments along with an id for that call. Next comes the tool call result, carrying the matching id to link the two together, and finally ChatGPT gives us a readable final answer!
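To make the id linking concrete, here is a minimal sketch of turning one tool call into its matching tool message. It runs without the API: the `call` object only mimics the shape of an entry in `response_message.tool_calls`, and `fake_get_current_weather` is a hypothetical stub, not the real weather function from this tutorial.

```python
import json
from types import SimpleNamespace


def fake_get_current_weather(location):
    """Hypothetical stub standing in for the real weather API call."""
    return json.dumps({"location": location, "condition": "Clear"})


# Mimics the shape of one entry in response_message.tool_calls (assumed shape,
# based on the terminal output shown above).
call = SimpleNamespace(
    id="call_123",
    function=SimpleNamespace(
        name="get_current_weather",
        arguments='{"location": "San Francisco"}',
    ),
)

available_functions = {"get_current_weather": fake_get_current_weather}

# The arguments arrive as a JSON string, so they are parsed before the call.
func_args = json.loads(call.function.arguments)
result = available_functions[call.function.name](**func_args)

# The tool message repeats the id of the call it answers.
tool_message = {
    "tool_call_id": call.id,
    "role": "tool",
    "name": call.function.name,
    "content": result,
}
print(tool_message)
```

The `tool_call_id` is what lets ChatGPT pair each result with the request it made.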

Testing the Failsafe

Before we get into multiple function calls, let’s quickly test our fail-safe. Change the call to this:

ask_weather_gpt("What's the weather in San Francisco?", simulate_failure=True)

Your output should look the same as above, with one extra entry between the user query and the ChatGPT message:

{'role': 'system', 'content': 'setup...'}

{'role': 'user', 'content': "What's the weather in San Francisco?"}

{'role': 'system', 'content': 'Based on the above information, please generate the appropriate tool calls with valid arguments per the schema provided.'}

ChatCompletionMessage(content='', role='assistant', tool_calls=[....])

{'tool_call_id': 'call_QrLTbUe3RqPYyfbXUkyIVqZe', 'role': 'tool', 'name': 'get_current_weather', 'content': '....'}

Exactly as expected: the simulated error triggered, our except block removed the failed assistant message and appended the reminder as a third message, asking the model to generate appropriate tool calls with valid arguments, and the recursive call then carried on like nothing happened.
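The fail-safe flow can also be traced without touching the API. The sketch below is a simplified model of the pattern, not the tutorial's function: `fake_model_reply` and `ask` are hypothetical names, and the model reply and tool result are hard-coded placeholders.

```python
REMINDER = {
    "role": "system",
    "content": "Based on the above information, please generate the "
    "appropriate tool calls with valid arguments per the schema provided.",
}


def fake_model_reply(messages):
    """Hypothetical stand-in for the API call; always requests one tool call."""
    return {"role": "assistant", "content": "", "tool_calls": ["get_current_weather"]}


def ask(messages, simulate_failure=False):
    messages.append(fake_model_reply(messages))
    try:
        if simulate_failure:
            raise RuntimeError("Simulating failure")
        # Placeholder tool result; the real code calls the weather function here.
        messages.append({"role": "tool", "content": "clear, 52.0F"})
    except Exception:
        messages.pop()  # drop the assistant message whose tool calls failed
        messages.append(REMINDER)  # nudge the model before trying again
        return ask(messages)  # the retry runs with simulate_failure=False
    return messages


history = ask(
    [
        {"role": "system", "content": "setup..."},
        {"role": "user", "content": "What's the weather in San Francisco?"},
    ],
    simulate_failure=True,
)
roles = [m["role"] for m in history]
print(roles)  # ['system', 'user', 'system', 'assistant', 'tool']
```

The resulting role order matches the terminal output above: the reminder sits between the user query and the new assistant message.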

Running Parallel Function Calls

Comment out the above call and let’s try parallel function calls now:

ask_weather_gpt(
    "Please give me the current weather in Seoul and the weather forecast in Amsterdam for the coming three days."
)

You can see two function calls come back in a single response. We then call both functions and return the results to ChatGPT, which gives us the final answer:

{'role': 'system', 'content': 'setup....'}

{'role': 'user', 'content': 'Please give me the current weather in Seoul and the weather forecast in Amsterdam for the coming three days.'}

ChatCompletionMessage(content='', role='assistant', tool_calls=[ChatCompletionMessageToolCall(id='call_QKpzrTXdoh2Carn0bvyhott5', function=Function(arguments='{"location": "Seoul"}', name='get_current_weather'), type='function'), ChatCompletionMessageToolCall(id='call_n7BmrrjgnKEAROSZHaOoWxLf', function=Function(arguments='{"location": "Amsterdam", "days": 3}', name='get_weather_forecast'), type='function')])

{'tool_call_id': 'call_QKpzrTXdoh2Carn0bvyhott5', 'role': 'tool', 'name': 'get_current_weather', 'content': '{......}'}

{'tool_call_id': 'call_n7BmrrjgnKEAROSZHaOoWxLf', 'role': 'tool', 'name': 'get_weather_forecast', 'content': '{......}'}

ChatCompletionMessage(content='The current weather in Seoul is sunny with a temperature of 5°C (41°F). The wind is blowing from the north at 11.2 km/h.\n\nIn Amsterdam, the current weather is foggy with a temperature of 5°C (41°F). Over the next three days, expect patchy rain with a high of 8.8°C (47.8°F) and a low of 6.0°C (42.8°F) tomorrow, followed by moderate rain with a high of 14.6°C (58.3°F) and a low of 5.9°C (42.6°F) the day after, and more moderate rain with a high of 11.9°C (53.4°F) and a low of 10.1°C (50.2°F) on the third day.', role='assistant', tool_calls=None)

Perfect! ChatGPT can now request multiple tool calls in a single response, so we no longer need a separate round trip to ChatGPT for each one, which greatly speeds up the process.
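As a rough illustration of the loop that handles several tool calls from one response, here is a self-contained sketch. The helper names (`make_call`, `run_tool_calls`) and the two stub functions are assumptions made for this example; the real functions in this tutorial query a weather API instead.

```python
import json
from types import SimpleNamespace


def make_call(call_id, name, arguments):
    """Builds an object shaped like one entry in response_message.tool_calls."""
    return SimpleNamespace(
        id=call_id,
        function=SimpleNamespace(name=name, arguments=json.dumps(arguments)),
    )


def fake_current(location):
    """Hypothetical stub for get_current_weather."""
    return json.dumps({"location": location, "kind": "current"})


def fake_forecast(location, days):
    """Hypothetical stub for get_weather_forecast."""
    return json.dumps({"location": location, "kind": "forecast", "days": days})


def run_tool_calls(tool_calls, available_functions):
    """Runs each requested call in order and returns one tool message per call."""
    tool_messages = []
    for call in tool_calls:
        func = available_functions[call.function.name]
        args = json.loads(call.function.arguments)
        tool_messages.append(
            {
                "tool_call_id": call.id,
                "role": "tool",
                "name": call.function.name,
                "content": func(**args),
            }
        )
    return tool_messages


tool_calls = [
    make_call("call_a", "get_current_weather", {"location": "Seoul"}),
    make_call("call_b", "get_weather_forecast", {"location": "Amsterdam", "days": 3}),
]
available_functions = {
    "get_current_weather": fake_current,
    "get_weather_forecast": fake_forecast,
}

tool_messages = run_tool_calls(tool_calls, available_functions)
for message in tool_messages:
    print(message)
```

Note that the functions themselves still run one after another in this loop; the saving comes from ChatGPT requesting both calls in one response.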

Asking a Simple Question

Finally, ask a non-weather question to make sure ChatGPT still answers normally, without calling functions when none are needed:

ask_weather_gpt("What is a zombie watermelon?")

{'role': 'system', 'content': 'You are a regular ChatGPT chatbot, just like normal, however you also have access to some functions that can be called if you need them. One will provide the current weather and one will provide the weather forecast. IF THE USER DOES NOT ASK A WEATHER RELATED QUESTION JUST ANSWER THEM AS NORMAL WITHOUT CALLING ANY FUNCTIONS.'}

{'role': 'user', 'content': 'What is a zombie watermelon?'}

ChatCompletionMessage(content='A "zombie watermelon" is a term that\'s used for a watermelon that has been left in the field for an extended period, causing it to turn mushy and ooze behind its rind after being picked. This causes the inside to rot while the exterior remains vibrant and green, hence the name "zombie watermelon." It\'s not an official term but more of a colloquial description.', role='assistant', tool_calls=None)

Yep, ChatGPT goes straight into the answer. Now that you are up to date with parallel function calls and the new syntax, let’s look at the new JSON mode and seeds in the next part. See you there soon!

Full Course: OpenAI API Mastery: Innovating with GPT-4 Turbo, Text-to-Speech (TTS), and DALL·E 3
