[Full Tutorial] OpenAI Fine-Tuning: Creating a Chatbot of Yourself (Example: ChrisGPT)


Hi and welcome to this course on fine-tuning ChatGPT with the OpenAI API. In this course we’ll not just look at how to use the fine-tuning API endpoints themselves, but also how to run the fine-tuned model and, most importantly, the data preparation and validation steps, so that you will be able to do this yourself with any data you want once you are done.

💡 Tip: This is a full-text tutorial on how to fine-tune ChatGPT using the OpenAI API with code! For a video guide-through, check out our premium course “Fine-Tuning OpenAI: How to Create and Scale Your Digital Self (Example: ChrisGPT)”.

For our fine-tuning example, we have a really fun one here. We are going to clone Chris! Yes, you read that correctly😉. We will use the Finxter emails Chris always sends us to train our own ChrisGPT! ChrisGPT is going to learn how to write emails in the distinctive Finxter style, on any topic we choose!

  • In part 1 we’ll start by looking at what fine-tuning is and why we would want to do it, and look at the type of data we will need in order to fine-tune our own ChatGPT models, getting started on our data preparation.
  • Part 2 will have us really dive into the data preparation, also using ChatGPT to make our work a bit easier, creating a perfectly formatted JSON dataset for training ChrisGPT.
  • In part 3 we’ll look at flattening our data into JSONL format, and then data validation to make sure we get it right the first time. We’ll also calculate the cost so that we know exactly what we’re getting into.
  • Part 4 is all about the fine-tuning API and how to use it. We’ll train ChrisGPT and then have a lot of fun having our own Robo🤖Chris🧑🏼 write our emails about anything we want!

I hope you’re as excited as I am to get started. Let’s dive in!

Part 1: Fine-tuning Our Own ChatGPT Model

Welcome to this course on ChatGPT fine-tuning. My name is Dirk van Meerveld and together we will be taking a look at fine-tuning ChatGPT to make our own custom versions of the popular LLM. Before we dive in we’ll take a look at what fine-tuning entails and when we should use it.

Why fine-tuning?

First of all, let’s take a moment to consider how we usually get ChatGPT to do what we want. We tell it, using a prompt message, right? Basically, we tell it what we want it to do, and we probably give it some examples as well if the task has any complexity to it. This is called “few-shot learning”, as we give a couple of demonstrations of how to perform the task.

So usually prompt engineering will get ChatGPT to do whatever we want, and there’s not really any problem with that, right? But what if the problem we want ChatGPT to solve is a bit more complex and would require hundreds of reasonably sized examples? There are several use cases for this, but we’ll start with an example about brand identity.

Say that your company brand has a certain style and tone of communication, which is different from the default ChatGPT way of speaking. You are probably not a famous person, so you can’t just query GPT to write “In the style of Elon Musk” or “In the style of Barack Obama”. ChatGPT doesn’t know who you are!

So what do you do? Use the very largest GPT-4-turbo model with the largest context limit and just send 100 pages full of examples of your brand’s style of communication in the prompt setup message every single time? This will not work very well for several reasons:

  • Cost -> Sending that much information with every GPT call, especially when using the most expensive GPT-4 model, will be very expensive if you scale it up.
  • Latency -> Your calls will not only be expensive but also slow in comparison, as the amount of data sent and processed is very large.
  • Quality -> The normal model will have trouble learning an entire way of speaking, including the tone and nuances, from just a single system prompt setup message, even if it is very long. The input text is just a prompt, and this style of speaking will not get 'embedded into the neurons' of the model, so to speak.

This is where fine-tuning comes to the rescue. Basically, OpenAI will give us a vanilla GPT model in a separate container. We then get to provide extra training data of our own, and OpenAI will further train the GPT model on the data we provide, creating our own custom fine-tuned version of ChatGPT.

We feed it a large number of examples of our brand’s style of communication. This way we won’t have to send a million tokens in the context window every time and can just query our custom-trained model, which has our brand’s style of communication embedded into its very neurons!

I think you can see how this would be extremely helpful in many areas. A content creator may want help writing initial drafts or ideas for new work but needs them to adhere to their own writing style. A large brand may want to employ customer service bots, as most do these days, but needs them to adhere to the brand’s style and rules for communication, just like its human employees. Anyone with any kind of writing or speaking style may want some assistance from ChatGPT, but in their own style and form of speech.

Let’s clone Chris!

To explore this idea and show you how to implement this for yourself or your clients using example data of their writing, we will be using an example most of you will be familiar with: Chris! You probably know Chris’ writing from the Finxter emails, as you’ve received a fair amount of them if you’re taking this Finxter course. Today we are going to make ChrisGPT, a model that has been fine-tuned on Chris’ writing style.

I’ve chosen Chris as an example for several reasons:

  1. Most of you are probably familiar with him from the emails.
  2. He’s not so famous that we could just query GPT to write “In the style of Chris” and get a good result. This makes it a realistic example of doing this for yourself or a client.
  3. He has a distinct style of writing that we will be able to differentiate from the default ChatGPT style.
  4. I have loads of data for him in the form of the Finxter emails on my email account.
  5. He has agreed to let us do this (thanks Chris! 😉).

Of course, Finxter emails from my Gmail inbox aren’t perfectly clean ideal-world data examples, but they will be good enough to give you an idea of how fine-tuning works and how you can use it to implement a brand identity or writing style into ChatGPT with the proper data. They will also add some fun to our exploration and learning experience!

Now I could just give you the perfectly formatted data so that you can send it straight to the OpenAI API for fine-tuning and we could have the whole thing over and done with in a single video. But if I did that I’d be doing you a huge disservice. The fine-tuning API itself is very simple and doesn’t take much time to learn at all, but it’s only a small part of the picture.

The real challenge is getting your input data, and getting it right. There is a real process to gathering the right training data, making sure you formulate it well and avoid the pitfalls, and then getting it all into properly formatted examples. The fine-tuning part itself is the easy part. We will look at the nitty-gritty details of getting the data right, as our GPT will only ever be as good as our training data.

Garbage in = Garbage out.

So what kind of data do we need?

Well, basically we need to show ChatGPT what we want, by giving it examples. These examples need to be in the same format as normal ChatGPT chat completions, which makes total sense, as we’re trying to teach it what kind of chat completions we want it to create right? So the basic idea is something like this:

[
  {
    "messages": [
      {
        "role": "system",
        "content": "You are a Harry Potter wizard spell creator who comes up with new magical spells! Alohamora!"
      },
      { "role": "user", "content": "How can I bake a cookie?" },
      {
        "role": "assistant",
        "content": "Confectio Deliciosum! This spell will help you bake cookies in no time!"
      }
    ]
  },
  {
    "messages": [
      {
        "role": "system",
        "content": "You are a Harry Potter wizard spell creator who comes up with new magical spells! Alohamora!"
      },
      { "role": "user", "content": "I need to fend off a shark!" },
      {
        "role": "assistant",
        "content": "Squalus Repellum! This spell will help you fend off a shark!"
      }
    ]
  },
  {
    "messages": [
      {
        "role": "system",
        "content": "You are a Harry Potter wizard spell creator who comes up with new magical spells! Alohamora!"
      },
      { "role": "user", "content": "I need to lose some weight fast." },
      {
        "role": "assistant",
        "content": "Adiposum Reductum! Your belly fat will just fly out the window!"
      }
    ]
  }
]

Notice that we have a list of objects, which look like the messages list we would typically work with. We are basically simulating conversations with ChatGPT that never occurred, so we can use these as training data. For each example we have a system message, then we have the user query in the second slot, and the assistant’s response in the third slot.

Now notice that the entire system message is sent with every example and is exactly the same for all three. You might feel that this is wasteful, as we’re repeating the same thing over and over again, but remember that we’re trying to sort of hard-wire these instructions into ChatGPT, so this is a good thing.

If you try to leave out the system messages to save tokens the model will have to learn entirely through demonstration of the outcome, and your training may actually take longer. We recommend just leaving that system message in there for every example, and make sure it’s a good one because it is going to get baked into the model!

The second entry, the user query, is obviously going to be different each time. Make sure you include examples that match the kind of use you intend for your final fine-tuned model. Especially make sure you include any edge cases and harder-than-usual examples; the training phase is the time to show the model what it will be up against.

The third entry, the assistant’s response, is going to be the exact, perfect answer that we want ChatGPT to give for this query. ChatGPT will be trained to learn: given this system message and this query, this is the response I should give.

Note that the example above is of course useless, as we could easily achieve this output without any fine-tuning at all from basically any LLM in existence; it is just an illustration of the training data structure. In reality, we need at least 10 examples for fine-tuning, but you should probably aim for at the very least 50 well-crafted examples, if not more.

Also, the final format needs to be in JSONL format, with every object flattened down onto a single very long line, which looks kind of like this:

{"messages": [{system...}, {user...}, {assistant...}]}
{"messages": [{system...}, {user...}, {assistant...}]}
{"messages": [{system...}, {user...}, {assistant...}]}

But this is only a minor and easy conversion, so we’ll get back to that later.
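Still, as a quick preview: flattening a Python list of example objects into JSONL is just one json.dumps call per object. Here is a minimal sketch, where the examples list is a stand-in reusing the spell-creator shape from above:

```python
import json

# Stand-in training examples; our real dataset will have the same shape.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a wizard spell creator!"},
            {"role": "user", "content": "How can I bake a cookie?"},
            {"role": "assistant", "content": "Confectio Deliciosum!"},
        ]
    },
]

def to_jsonl(examples: list[dict]) -> str:
    # json.dumps emits each object on a single line -- exactly what JSONL needs.
    return "\n".join(json.dumps(example) for example in examples)

print(to_jsonl(examples))
```

Each output line is a complete, valid JSON object on its own, which is all the JSONL format requires.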

As for the length, each training example is limited to the context length of the model. So every single line of the JSONL data can be up to the context limit, which for gpt-3.5-turbo-1106 is 16,385 tokens. That is a generous limit, and we won’t be going over it for our use cases here, so we’re not going to worry about it too much.
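If you want a quick sanity check that an example fits, a common rule of thumb is roughly 4 characters per English token. This is only an approximation (OpenAI’s tiktoken library gives exact counts), but it is enough for a ballpark check:

```python
CONTEXT_LIMIT = 16_385  # context limit for gpt-3.5-turbo-1106

def rough_token_estimate(text: str) -> int:
    """Very rough heuristic: ~4 characters per English token.
    Use OpenAI's tiktoken library when you need exact counts."""
    return max(1, len(text) // 4)

example_text = "Hey {User}, to profit from change... " * 50  # stand-in email
tokens = rough_token_estimate(example_text)
print(tokens, tokens < CONTEXT_LIMIT)
```

For emails of the size we are working with, the estimate lands far below the limit, confirming we don’t need to worry about truncation here.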

Now we’ll be using gpt-3.5-turbo-1106 here as it is the newest version of the model with fine-tuning support so far. This is probably a good thing though, as fine-tuning on GPT-4 would be a lot more expensive, and as we’ll be showing the model exactly what we want it to do anyway, we won’t really need GPT-4’s extra capabilities.

The data

So, I’ve gone through my email account and extracted a whole bunch of emails I have received from Chris at Finxter, the last 200 to be precise. I have done this very first step for you, as I obviously cannot give you all access to my personal email inbox! But I will still cover roughly the steps taken:

  1. I’ve applied a label to all the emails I wanted to extract from my inbox, so I could easily find them.
  2. I went to Google Takeout and requested a download of all my emails with that label.
  3. I received a file with all my emails in MBOX format.
  4. I wrote a Python script, mbox_to_json_decode.py, which takes the emails, decodes them, takes all my personal unsubscribe links and other personal data out, and then writes them to a JSON file.

As this MBOX to JSON conversion is hyper-specific, and the MBOX file contains some of my personal data, this is the only step along the way we will skip: the chances that you will also have to convert MBOX files to JSON are very slim, and I want to keep this course relevant. If you do need information on MBOX to JSON conversion, I will add the mbox_to_json_decode script to the GitHub repository so you can check it out.

So now we are left with Finx_dataset.json, which will be our entry point for this tutorial. Normally I would include this file in the GitHub repository, but as it has a large amount of the Finxter email data, we have elected to not include it in the repository. Instead, the file will be available for download from the course page in the Finxter Academy. If you haven’t downloaded it yet, please do so now.

Then go ahead and create a base project folder to use for this course (I’ve named mine Finx_Fine_Tuning), and create a folder named data inside of it. Then move the Finx_dataset.json file into the data folder to create the following structure:

📁Finx_Fine_Tuning
    📁data
        📄Finx_dataset.json

Create a venv in the root project folder

Ok, just a small detour before we continue with our project!

We’ll be running this project inside a virtual environment. A virtual environment is a self-contained directory that will allow us to install specific versions of packages inside the virtual environment without affecting the global Python installation.

We will use this as I will be using specific versions for the libraries we install as we go along, and I want to make sure that you have the exact same experience as I do. The virtual environment will make it easy for you to install my exact versions without worrying about affecting any of your other projects.

To create a new virtual environment we’ll use a tool called pipenv. If you don’t have pipenv installed, you can install it using pip, which is Python’s package manager. Run the following command in your terminal:

pip install pipenv

Make sure the terminal is inside your root project folder, e.g. /c/Coding_Vault/Finx_Fine_Tuning, and then run the following command to create a new virtual environment:

pipenv shell

This will create a new virtual environment and also a Pipfile in your project directory. Any packages you install using pipenv install will be added to the Pipfile.

To generate a Pipfile.lock, which is used to produce deterministic builds, run:

pipenv lock

This will create a Pipfile.lock in your project directory, which contains the exact version of each dependency to ensure that future installs are able to replicate the same environment.

We don’t need to install a library first to create a Pipfile.lock. From now on, when we install a library in this virtual environment with pipenv install library_name, it will be added to both the Pipfile and the Pipfile.lock.

Back to our data

Back to where we were. Our root project folder should now look like this:

📁Finx_Fine_Tuning
    📁data
        📄Finx_dataset.json
    📄Pipfile
    📄Pipfile.lock

Let’s go ahead and take a look at the Finx_dataset.json file we downloaded earlier to see what kind of raw data we are working with here:

[
  {
    "subject": "5 Proxies to Investing in OpenAI",
    "body": "<html>\n<head>\n\t<title></title>\n</head>\n<body data-gr-ext-installed=\"\" data-new-gr-c-s-check-loaded=\"8.909.0\" data-new-gr-c-s-loaded=\"8.909.0\" style=\"font-family:Arial;font-size:16px;\">\n<p style=\"text-align: center;\"><a href=\"{Link}\" target=\"_blank\"><img alt=\"\" height=\"39\" src=\"{Link}\" width=\"153\" /></a></p>\n\n<p>\u00a0</p>\n\n<p>Hey {User},</p>\n\n<p>To profit from change, we need to increase ownership of disruptive trends. Today's article covers a question that many Finxters frequently ask:</p>\n\n<p>\ud83e\udeb4 [<strong>Blog</strong>] <a href=\"{Link}\">How to Invest in OpenAI?</a> \ud83c\udf33</p>\n\n<p>While it's not possible to invest in OpenAI directly, the blog discusses five alternatives:</p>\n\n<ul>\n\t<li><strong>MSFT </strong>(49% stake in OpenAI),</li>\n\t<li><strong>NVIDIA </strong>(makes more revenue from OpenAI than any other company),</li>\n\t<li><strong>ARKVX </strong>(<em>Anthropic!</em>),</li>\n\t<li><strong>META </strong>(<em>Llama 2!</em>), and</li>\n\t<li><strong>TSLA </strong>(Optimus!).</li>\n</ul>\n\n<p>Check it out if you're interested in any of those! No financial advice. \ud83d\ude0a</p>\n\n<p>Be on the right side of change. \ud83d\ude80<br />\nChris</p>\n\n<p>\u00a0</p>\n\n<p>\u00a0</p>\n\n<p><strong>\u2665\ufe0f Community Corner: Featured Resources</strong></p>\n\n<ul>\n\t<li><a href=\"{Link}\">TradeUnafraid</a> is a trading platform owned and operated by Finxter community member Lee.</li>\n</ul>\n\n<p>Do you want to feature your own startup, YouTube channel, blog, or website as a <a href=\"{Link}\">Finxter premium member</a>? 
Hit reply and let me know!</p>\n\n<p>\u00a0</p>\n\n<p>\u00a0</p>\n\n<p>\u00a0</p>\n\n<div style=\"background:#eeeeee;border:1px solid #fcfcfc;padding:20px 20px;\">\n<p><span><strong><a href=\"{Link}\">How are we doing?</a></strong><br />\n<a href=\"{Link}\">\u2b50</a><br />\n<a href=\"{Link}\">\u2b50\u2b50</a><br />\n<a href=\"{Link}\">\u2b50\u2b50\u2b50</a><br />\n<br />\nTo make sure you keep getting these emails, please add <em>chris@finxter.com</em> to your address book.<br />\n<br />\nI'd love to hear your feedback so that I can improve this free email course over time. Please reply to this email and share everything on your mind!<br />\n<br />\n<strong>If you find the Finxter Email Academy useful, please invite a friend or colleague! \u2764</strong></span></p>\n\n<p><br />\n<span>Here's the subscription link you can share:<br />\n<a href=\"{Link}\" target=\"_blank\">https://blog.finxter.com/subscribe/</a><br />\n<br />\nDownload the Ultimate Python Cheat Sheet here (direct PDF download): \ud83d\udc0d</span></p>\n\n<p><span><strong><a href=\"{Link}\" target=\"_blank\">The Ultimate Python Cheat Sheet</a></strong><br />\n<br />\nNot very motivated to learn today? Consider this:<br />\n<strong><em>\"Knowledge compounds!\"</em></strong> -- Warren Buffett<br />\n<br />\nConsequently, if you improve your skills by 1% every day, you'll 36x your programming skills within a year!</span></p>\n</div>\n\n<p>\u00a0</p>\n\n<p>\u00a0</p>\n\n<p>\u00a0</p>\n\n<p><br />\n<em><strong><span>Finxter, Dr. Christian Mayer</span></strong><br />\n<span>{Address}., {City}, {Country}</span></em></p>\n\n<p><span>Want out of the loop? I'm so sad to see you go. \ud83d\ude22 How could we have done better? </span><br />\n<span>To help future Finxters, please hit reply and tell us! \ud83e\udd17</span></p>\n<a href=\"{Link}\" >Unsubscribe here</a>\n\n<p>\u00a0</p>\n\n<p>\u00a0</p>\n\n<p>\u00a0</p>\n</body>\n</html>\n<img src=\"{Link}\" alt=\"\" style=\"width:1px;height:1px;\"/>\n"
  },
  {
    "subject": "Tech Deflation vs Inflation",
    "body": "Email no2..."
  }

As you can see, we have a list of objects, each with a subject and a body key. The body key contains the raw HTML of the email, which we will need to clean up a bit before using it for our purposes. The only preprocessing I’ve done in the MBOX to JSON conversion is replacing links and personal data with generic {Link} and {User} placeholders.
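To get a feel for the data, here is a small sketch of how we might load and inspect the file on our side (plain json is all we need; the path assumes the folder structure created above):

```python
import json
from pathlib import Path

def load_emails(path: str) -> list[dict]:
    """Load the raw dataset: a list of {'subject': ..., 'body': ...} objects."""
    return json.loads(Path(path).read_text(encoding="utf-8"))

if __name__ == "__main__":
    dataset_path = Path("data/Finx_dataset.json")
    if dataset_path.exists():
        emails = load_emails(str(dataset_path))
        print(f"{len(emails)} emails, first subject: {emails[0]['subject']}")
```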

If you’re wondering what the \uxxxx characters are, like the sequence \ud83d\udc0d, they are Unicode escape sequences that represent characters in the Unicode standard. Specifically, this sequence represents the “snake” emoji (🐍). You will see these quite a lot as Chris is of course famous for his creative emoji usage!
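You can see how such a surrogate pair becomes a single character with a quick experiment; the encode/decode trick shown here is the same one our converter utility will use later:

```python
# "\ud83d\udc0d" stores two lone surrogate code points in the Python string.
pair = "\ud83d\udc0d"
assert len(pair) == 2  # two code points, not yet one character

# Round-tripping through UTF-16 with "surrogatepass" joins the pair
# into the single character it encodes: U+1F40D, the snake emoji.
snake = pair.encode("utf-16", "surrogatepass").decode("utf-16")
print(snake)  # 🐍
```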

The full list has about 200 of these email objects, in non-chronological order. If you scroll through the data, you will see some noise in there, which will be reflected in our final product. For the purposes of this tutorial it will be good enough, but for professional use you’d want to spend some more time here and clean up the data more thoroughly.

Preparing our data

We now have our basic data, and we know what kind of format we need for the training data, like the Harry Potter magical spells example we showed. Now let’s start wrangling the data into the format we need. As with all complex coding tasks let’s take it one step at a time, and let’s build our solution in small and reusable parts.

We’ll start with a utility to convert the email above into a simpler, more readable format. Instead of the HTML above, with emojis in a format we cannot even read and HTML tags all over the place, let’s have a utility function that takes that HTML email as input and returns a simple, readable markdown version for us to work with instead.

So go ahead and create a new folder named utils in the root project folder, and then create a new file named html_email.py inside the utils folder:

📁Finx_Fine_Tuning
    📁data
        📄Finx_dataset.json
    📁utils
        📄html_email.py
    📄Pipfile
    📄Pipfile.lock

Now before we get started on the html_email.py file, we’ll need to install a library called html2text which will help us convert the HTML emails to markdown. Someone has already written a library to do this for us, so we don’t have to write it ourselves. Always use existing solutions when they exist to speed up your development cycle!

To install a specific version of a package in our Pipenv environment, you can use the pipenv install command followed by the package name and the version number. Run the following command:

pipenv install html2text==2020.1.16

This command will add html2text to our Pipfile under the [packages] section with the specified version. It will also update your Pipfile.lock to include the exact version of html2text and its dependencies.

Now let’s go ahead and open the html_email.py file and add the following code:

import html2text

def html_to_markdown(html: str) -> str:
    html = html.encode("utf-16", "surrogatepass").decode("utf-16")

    html_to_text_converter = html2text.HTML2Text()
    html_to_text_converter.ignore_links = False
    return html_to_text_converter.handle(html)

We first import the library we have just installed. Then we define a function html_to_markdown which takes an HTML string as input and returns a markdown string.

We then take the html variable, which is a string, and convert any surrogate pairs in it back into their corresponding characters. The "surrogatepass" error handler instructs Python to allow the lone surrogate code points in the string when encoding, so that after running this line, the \ud83d\ude80 patterns we talked about earlier will be turned into the corresponding characters (in this case, the rocket emoji 🚀).

This works because the .encode method converts the string to bytes using UTF-16 encoding, with surrogatepass letting the surrogate code points through. The .decode method then reads those bytes back as UTF-16, which joins each surrogate pair into the single Unicode character it represents. So we basically did a round-trip conversion from surrogate pairs to actual Unicode characters.

We then create an instance of the HTML2Text class and set the ignore_links attribute to False to include links in the output. We then call the handle method of the HTML2Text instance and pass the HTML string as an argument to convert it to markdown, and simply return the result.

Let’s test it out

Let’s go ahead and give it a test run. Above the html_to_markdown function, add the following variable holding a test email string:

test_email = '<html>\n<head>\n\t<title></title>\n</head>\n<body data-gr-ext-installed="" data-new-gr-c-s-check-loaded="8.909.0" data-new-gr-c-s-loaded="8.909.0" style="font-family:Arial;font-size:16px;">\n<p style="text-align: center;"><a href="{Link}" target="_blank"><img alt="" height="39" src="{Link}" width="153" /></a></p>\n\n<p>\u00a0</p>\n\n<p>Hey {User},</p>\n\n<p>To profit from change, we need to increase ownership of disruptive trends. Today\'s article covers a question that many Finxters frequently ask:</p>\n\n<p>\ud83e\udeb4 [<strong>Blog</strong>] <a href="{Link}">How to Invest in OpenAI?</a> \ud83c\udf33</p>\n\n<p>While it\'s not possible to invest in OpenAI directly, the blog discusses five alternatives:</p>\n\n<ul>\n\t<li><strong>MSFT </strong>(49% stake in OpenAI),</li>\n\t<li><strong>NVIDIA </strong>(makes more revenue from OpenAI than any other company),</li>\n\t<li><strong>ARKVX </strong>(<em>Anthropic!</em>),</li>\n\t<li><strong>META </strong>(<em>Llama 2!</em>), and</li>\n\t<li><strong>TSLA </strong>(Optimus!).</li>\n</ul>\n\n<p>Check it out if you\'re interested in any of those! No financial advice. \ud83d\ude0a</p>\n\n<p>Be on the right side of change. \ud83d\ude80<br />\nChris</p>\n\n<p>\u00a0</p>\n\n<p>\u00a0</p>\n\n<p><strong>\u2665\ufe0f Community Corner: Featured Resources</strong></p>\n\n<ul>\n\t<li><a href="{Link}">TradeUnafraid</a> is a trading platform owned and operated by Finxter community member Lee.</li>\n</ul>\n\n<p>Do you want to feature your own startup, YouTube channel, blog, or website as a <a href="{Link}">Finxter premium member</a>? 
Hit reply and let me know!</p>\n\n<p>\u00a0</p>\n\n<p>\u00a0</p>\n\n<p>\u00a0</p>\n\n<div style="background:#eeeeee;border:1px solid #fcfcfc;padding:20px 20px;">\n<p><span><strong><a href="{Link}">How are we doing?</a></strong><br />\n<a href="{Link}">\u2b50</a><br />\n<a href="{Link}">\u2b50\u2b50</a><br />\n<a href="{Link}">\u2b50\u2b50\u2b50</a><br />\n<br />\nTo make sure you keep getting these emails, please add <em>chris@finxter.com</em> to your address book.<br />\n<br />\nI\'d love to hear your feedback so that I can improve this free email course over time. Please reply to this email and share everything on your mind!<br />\n<br />\n<strong>If you find the Finxter Email Academy useful, please invite a friend or colleague! \u2764</strong></span></p>\n\n<p><br />\n<span>Here\'s the subscription link you can share:<br />\n<a href="{Link}" target="_blank">https://blog.finxter.com/subscribe/</a><br />\n<br />\nDownload the Ultimate Python Cheat Sheet here (direct PDF download): \ud83d\udc0d</span></p>\n\n<p><span><strong><a href="{Link}" target="_blank">The Ultimate Python Cheat Sheet</a></strong><br />\n<br />\nNot very motivated to learn today? Consider this:<br />\n<strong><em>"Knowledge compounds!"</em></strong> -- Warren Buffett<br />\n<br />\nConsequently, if you improve your skills by 1% every day, you\'ll 36x your programming skills within a year!</span></p>\n</div>\n\n<p>\u00a0</p>\n\n<p>\u00a0</p>\n\n<p>\u00a0</p>\n\n<p><br />\n<em><strong><span>Finxter, Dr. Christian Mayer</span></strong><br />\n<span>{Address}., {City}, {Country}</span></em></p>\n\n<p><span>Want out of the loop? I\'m so sad to see you go. \ud83d\ude22 How could we have done better? </span><br />\n<span>To help future Finxters, please hit reply and tell us! \ud83e\udd17</span></p>\n<a href="{Link}" >Unsubscribe here</a>\n\n<p>\u00a0</p>\n\n<p>\u00a0</p>\n\n<p>\u00a0</p>\n</body>\n</html>\n<img src="{Link}" alt="" style="width:1px;height:1px;"/>\n'

Just copy it from the written version of the tutorial, and make sure you insert it above the function we wrote:

import html2text

test_email = ...

def html_to_markdown(html: str) -> str:
    ...

Now, below the html_to_markdown function, add the following code to test the function:

if __name__ == "__main__":
    markdown_content = html_to_markdown(test_email)

    with open("test.md", "w", encoding="utf-8") as file:
        file.write(markdown_content)

This code will run the html_to_markdown function with the test_email string as input, and then write the result to a file named test.md. The if __name__ == "__main__": line ensures that the code inside the block only runs when the script is executed directly, and not when we import the html_to_markdown function into another script later on.

💡 Python Top-tip 💡
In Python, when a script is run, a special built-in variable called __name__ is set to "__main__". However, if a module is imported, __name__ is set to the module's name instead. By checking if __name__ == "__main__":, the script can determine whether it's being run directly or being imported as a module.

This allows for a flexible way to organize your code. You can put code that tests the functionality of the module or demonstrates how to use the module under this if statement. When the module is imported, this code won't run, but when the script is run directly, the code will execute. This is particularly useful for unit testing or for scripts that can be used both as utility modules and as standalone programs.

Now go ahead and run the script, and a new file named test.md will be created. If you check it out, it will have the markdown version of the email we provided as input:

[![]({Link})]({Link})

Hey {User},

To profit from change, we need to increase ownership of disruptive trends.
Today's article covers a question that many Finxters frequently ask:

🪴 [ **Blog** ] [How to Invest in OpenAI?]({Link}) 🌳

While it's not possible to invest in OpenAI directly, the blog discusses five
alternatives:

  * **MSFT** (49% stake in OpenAI),
  * **NVIDIA** (makes more revenue from OpenAI than any other company),
  * **ARKVX** ( _Anthropic!_ ),
  * **META** ( _Llama 2!_ ), and
  * **TSLA** (Optimus!).

Check it out if you're interested in any of those! No financial advice. 😊

Be on the right side of change. 🚀
Chris

**♥️ Community Corner: Featured Resources**

  * [TradeUnafraid]({Link}) is a trading platform owned and operated by Finxter community member Lee.

Do you want to feature your own startup, YouTube channel, blog, or website as
a [Finxter premium member]({Link})? Hit reply and let me know!

**[How are we doing?]({Link})**
[⭐]({Link})
[⭐⭐]({Link})
[⭐⭐⭐]({Link})

If we render this properly as markdown it will result in the following look:

###########################START##########################

Hey {User},

To profit from change, we need to increase ownership of disruptive trends.
Today’s article covers a question that many Finxters frequently ask:

🪴 [ Blog ] How to Invest in OpenAI? 🌳

While it’s not possible to invest in OpenAI directly, the blog discusses five
alternatives:

  • MSFT (49% stake in OpenAI),
  • NVIDIA (makes more revenue from OpenAI than any other company),
  • ARKVX ( Anthropic! ),
  • META ( Llama 2! ), and
  • TSLA (Optimus!).

Check it out if you’re interested in any of those! No financial advice. 😊

Be on the right side of change. 🚀
Chris

♥️ Community Corner: Featured Resources

  • TradeUnafraid is a trading platform owned and operated by Finxter community member Lee.

Do you want to feature your own startup, YouTube channel, blog, or website as
a Finxter premium member? Hit reply and let me know!

How are we doing?

⭐
⭐⭐
⭐⭐⭐

###########################END##########################

This is good enough for our purposes in this tutorial. We will be using this markdown version of the emails as our training data for the fine-tuning process. We could clean it up even further for cleaner output, but for this tutorial it will do.

Now that we have our HTML to Markdown function prepared, we’ll continue in part 2, where we will generate the actual training data for our fine-tuning of ChrisGPT. I’ll see you in part 2!

💡 Tip: This is a full-text tutorial on how to fine-tune ChatGPT using the OpenAI API with code! For a video guide-through, check out our premium course “Fine-Tuning OpenAI: How to Create and Scale Your Digital Self (Example: ChrisGPT)

Part 2: Generating the Training Data

Hi and welcome back to part 2, where we will be generating the training data for our own ChrisGPT model! We’ll pick up where we left off in part 1 and use our HTML to Markdown converter utility and more to write and run our training dataset generator that will build the ChrisGPT training data for us.

Api Key

We need to start by setting our key for the OpenAI API, but we cannot hardcode it in our source code. Go to https://platform.openai.com/api-keys and copy your API key (create one if you don’t have one yet). You’ll only pay for what you use, which will be mere cents if you just play around with it casually. Then create a new file called .env in the root folder of your project:

📁Finx_Fine_Tuning
    📁data
        📄Finx_dataset.json
    📁utils
        📄html_email.py
    📄.env                  (new file)
    📄Pipfile
    📄Pipfile.lock

And paste your API key in there like this, making sure not to use any spaces or quotes:

OPENAI_API_KEY=your_api_key_here

Then go ahead and save and close this file. (You can also use environment variables to set the key if you prefer, but this is simpler to set up for tutorial purposes as it works the same on all platforms.)

Constants

Now, let’s define some project-wide settings by creating a new file named constants.py in the root directory of our project. This file will contain the constants that we will use multiple times throughout our project:

📁Finx_Fine_Tuning
    📁data
        📄Finx_dataset.json
    📁utils
        📄html_email.py
    📄.env
    📄constants.py        (new file)
    📄Pipfile
    📄Pipfile.lock

Run the following command in your terminal to add the python-decouple package inside your pipenv environment:

pipenv install python-decouple==3.7

We will use this package to read the .env file and get the API key from it. Then install the openai library as well:

pipenv install openai==1.12.0

Make sure you run the installs even if you already have the packages installed, as we need to make sure they are installed in the virtual environment we are using for this project. Now, open the constants.py file and add the following code:

from pathlib import Path

from decouple import config
from openai import OpenAI

CLIENT = OpenAI(api_key=str(config("OPENAI_API_KEY")))
DATA_DIRECTORY = Path(__file__).parent / "data"

We import Path from pathlib, the OpenAI class from the openai package, and the config function from the decouple package. We then create a CLIENT constant that is an instance of the OpenAI class, passing it the API key, which we read from the .env file using the config function.

💡 Python Top-tip 💡
The 'config' function from the 'decouple' package reads the value of the environment variable from the '.env' file and returns it as a string. If you share or upload your code to a public repository, make sure to add the '.env' file to your '.gitignore' file to avoid sharing your API key with others.

We also create a DATA_DIRECTORY constant that will be a Path object pointing to the data directory. Path(__file__) returns the absolute path of the current file, and parent returns the parent directory of the current file (which is constants.py so that gives us the root folder). We then use the / operator to add the data directory to the path.
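As a quick illustration of how these Path operations compose, here is a minimal sketch. It uses PurePosixPath with a made-up location so the example is platform-independent; your actual path will of course differ:

```python
from pathlib import PurePosixPath

# Hypothetical location of constants.py, purely for illustration
file_path = PurePosixPath("/home/user/Finx_Fine_Tuning/constants.py")

root_folder = file_path.parent  # the folder that contains constants.py
data_directory = root_folder / "data"  # the / operator joins path segments

print(data_directory)  # /home/user/Finx_Fine_Tuning/data
```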

The Data Generator

Go ahead and save and close the constants.py file. Then create a new file in the root directory called chris_gpt_dataset_generator.py:

📁Finx_Fine_Tuning
    📁data
        📄Finx_dataset.json
    📁utils
        📄html_email.py
    📄.env
    📄constants.py
    📄chris_gpt_dataset_generator.py        (new file)
    📄Pipfile
    📄Pipfile.lock

Now let’s install two libraries before we dive in here. We’ll combine them into a single command this time:

pipenv install tqdm==4.65.0 && pipenv install tenacity==8.2.3

The tqdm package will be used to easily add a progress bar to our generator, and the tenacity package will be used to retry the API requests if they fail for some reason. Both are extremely simple in their usage, as you will see.

Ok now open the chris_gpt_dataset_generator.py file and we will start with our imports for the file:

import json

from tenacity import retry, stop_after_attempt, wait_fixed
from tqdm import tqdm

from constants import CLIENT, DATA_DIRECTORY
from utils import html_email

We use json to load and save back the JSON data, tenacity and tqdm for the reasons we just mentioned, and then we import the CLIENT and DATA_DIRECTORY constants from the constants module, and our html_email utility we just made. Now let’s set up some base constants for use in our generator only:

INPUT_DATA = DATA_DIRECTORY / "Finx_dataset.json"
OUTPUT_DATA = DATA_DIRECTORY / "Finx_completed_dataset.json"
MODEL: str = "gpt-3.5-turbo-0125"
TOTAL_TOKENS_USED = 0
RATE_ERROR_MESSAGE = "There was an error calling 'get_user_query'. Perhaps the OpenAI ChatGPT rate limit has been reached. Retrying one more time in 60 seconds to reset the rate limiter..."

We define INPUT_DATA and OUTPUT_DATA constants that will be Path objects pointing to the Finx_dataset.json and Finx_completed_dataset.json files, respectively. The output one doesn’t exist yet obviously, but that is fine. The MODEL constant is just a string with the model name we use.

💡 OpenAI Top-tip 💡
There are many versions of the gpt-3.5-turbo model:

✨ gpt-3.5-turbo-0613. This one is the older version.
✨ gpt-3.5-turbo-1106. Has much improved functionality and a larger context window compared to the 0613 model. This is the newest version that can be used for fine-tuning purposes.
✨ gpt-3.5-turbo-0125. The latest minor update to the 1106 model, this one cannot be used for fine-tuning yet. Aside from the minor improvements, it also has a reduced price.

This is why we will use the `0125` version for the data generation in this part of the tutorial, but later when we do the fine-tuning we will be using the `1106` version instead.

We also define TOTAL_TOKENS_USED, which we will use to keep track of the total number of tokens used in the API requests (it is written in constant-style uppercase, but we will update it as we go), and a RATE_ERROR_MESSAGE that we reference for an error message later. As all of the above are only used by this script, we defined them inside the chris_gpt_dataset_generator.py file.

The data structure

Now we’ll define a class that will serve as the data structure “blueprint” if you will, for each example in our dataset. This is exactly the same structure we used in the Harry Potter magical spells example in part 1:

class TrainingDataEntry:
    def __init__(self, fictional_user_query, markdown_email) -> None:
        self.data = {
            "messages": [
                {
                    "role": "system",
                    "content": 'You are a helpful assistant that writes emails for the Finxter email newsletter, adhering perfectly to the style and tone of the Finxter brand and Chris\' writing style. You will respond in the following format: {"subject": "The subject of the email", "body": "The body of the email in Markdown formatting"}.',
                },
                {"role": "user", "content": fictional_user_query},
                {"role": "assistant", "content": markdown_email},
            ]
        }

We have a class named TrainingDataEntry, which has an __init__ method that takes two arguments: fictional_user_query and markdown_email. The __init__ method initializes the data attribute of the TrainingDataEntry object with a dictionary that has a messages key, which is a list of dictionaries.

The first dictionary in the list has a role key with the value "system" and a content key with a long string that explains the role of the assistant. This is going to be the same for all examples of the training data so I just hard-coded it in the class.

The second dictionary in the list has a role key with the value "user" and a content key with the value of the fictional_user_query argument. The third dictionary in the list has a role key with the value "assistant" and a content key with the value of the markdown_email argument.

So the system message is the same for each entry and the assistant message is simply the email that we already have saved in our dataset and can convert using the HTML to Markdown converter we built.

The only challenge we have left is the user message, which is supposed to hold the user query that resulted in the generated email. We will have to generate a fictional user query for each “result” email we have in our dataset. This should mostly be a short summary of the email contents for our case, as we want to train ChrisGPT to generate an email if we give it the topics to write on.
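To make the resulting structure concrete, here is a minimal sketch instantiating a condensed version of the class. The system message is shortened and the query and email values are made up for illustration:

```python
class TrainingDataEntry:
    def __init__(self, fictional_user_query, markdown_email) -> None:
        self.data = {
            "messages": [
                # Shortened stand-in for the real system message
                {"role": "system", "content": "You write Finxter newsletter emails."},
                {"role": "user", "content": fictional_user_query},
                {"role": "assistant", "content": markdown_email},
            ]
        }

# Made-up placeholder values
entry = TrainingDataEntry(
    fictional_user_query="- OpenAI investment options\n- Community corner",
    markdown_email='{"subject": "Example subject", "body": "Hey {User}, ..."}',
)

print([message["role"] for message in entry.data["messages"]])
# ['system', 'user', 'assistant']
```

Every training example in the dataset will follow this exact three-message shape.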

Getting the user query (that was never made)

So let’s create a function that will take an email as input and generate a fictional user query for it:

def get_fictional_user_query(email: str) -> str:
    global TOTAL_TOKENS_USED
    response = CLIENT.chat.completions.create(
        model=MODEL,
        messages=[
            {
                "role": "system",
                "content": 'You will be provided with an email in the following format:{"subject": "The subject of the email", "body": "The body of the email in Markdown formatting"}. Your task is to go back in time and write a rough draft of the topics the email writer decided to discuss in the email. This will basically be a summary of the email\'s topics formatted in short bullet points, that the author would have used as a basis to then write the real email.',
            },
            {
                "role": "user",
                "content": f"Here is the output LLM generated email: {email}. Bullet point draft of the topics discussed in the email:",
            },
        ],
    )
    if not response.usage or not response.choices[0].message.content:
        raise Exception("Invalid response from OpenAI API")
    TOTAL_TOKENS_USED += response.usage.total_tokens
    return response.choices[0].message.content

We define a function named get_fictional_user_query that takes an email string argument and will return a string. Inside the function, we first reference the TOTAL_TOKENS_USED variable as a global variable, so that we can update its value from inside the function. Then we call the CLIENT.chat.completions.create method with the model and messages arguments.

For the system prompt we explain that the LLM will receive an email in the format of a subject and body, and that its task is to write a rough draft of the topics the email writer decided to discuss: basically a summary of the email’s topics formatted as short bullet points. That way we can later give our trained model a list of bullet points and have it write an email for us.

The user message is then just the email we received as input, plus a short request for the bullet-point draft of the topics discussed. Note how it ends with “Bullet point draft of the topics discussed in the email:”, finishing on a colon so that the LLM answers with the bullet points right away instead of prefacing them with an introduction like “Sure, I can help you with that”.

We then double-check if the response has a usage and choices attribute and if the content of the choices attribute is not empty. If any of these checks fail we raise an exception, but if everything is fine we add the total_tokens used in the response to the TOTAL_TOKENS_USED variable and return the content of the choices attribute.

Making it robust

Ok, that is pretty good so far, but we will be calling this function many times in a row, saving the results into an object in memory that we only write to a file at the end. If we get an error halfway, the script will crash and we have to start over from scratch. We can also hit an error because of a rate limit, as we will be making many requests in a row. This is not ideal, so we will use the tenacity package to retry the API requests if they fail for some reason.

We can simply add the @retry decorator to our function, by adding it straight above the function definition like this:

@retry(
    wait=wait_fixed(60),
    stop=stop_after_attempt(2),
    reraise=True,
    before_sleep=lambda _: print(RATE_ERROR_MESSAGE),
)
def get_fictional_user_query(email):
    ... function here ...

We use the @retry decorator to specify that we want to retry the function if it fails due to any type of error. As you can see the tenacity library is very easy to read and use. We specify that we want to wait 60 seconds between each retry and that we want to stop after 2 attempts, maximum.

The 60 seconds are in case we hit a rate limit, as trying again right away might trigger the rate limit again. We also specify that we want to reraise the exception if the function fails after the maximum number of attempts, so we can see what the original error was.

Now the last part is a bit trickier. As tenacity will stop the program from crashing out, we won’t see the error and we won’t be aware of what happened. This is why we added a before_sleep function that will print the RATE_ERROR_MESSAGE before each retry. This way we can keep ourselves up to date in the terminal about what is happening.

The problem is that if we just passed in print(RATE_ERROR_MESSAGE), the print call would execute immediately when Python evaluates the decorator arguments, long before any retry happens. This is not what we want, so we wrap it in a lambda function, which only runs the print call when the lambda itself is called.

The lambda _: print(RATE_ERROR_MESSAGE) is basically equivalent to:

# Do not put in your code
def before_sleep_function(_):
    print(RATE_ERROR_MESSAGE)

But it’s just much shorter to write. The _ is the name of the input argument, as tenacity will send an input argument to our before_sleep function, but we won’t be using it, so we just name it _ to sort of throw it away.
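The eager-versus-deferred difference is easy to demonstrate with a tiny standalone sketch. No tenacity involved, just Python’s evaluation order; fake_retry_setup is a made-up stand-in for the decorator machinery:

```python
messages = []

def fake_retry_setup(before_sleep):
    # Stand-in for the decorator machinery: it stores the callback
    # and only invokes it when a retry actually happens.
    messages.append("setup done")
    return before_sleep

# Wrong: print(...) would run immediately, and its return value (None)
# would be what gets passed in as the callback:
# callback = fake_retry_setup(print("I run too early!"))

# Right: the lambda is merely stored; its body only runs when the
# callback is invoked later.
callback = fake_retry_setup(lambda _: messages.append("retrying..."))

callback(None)  # simulate a retry happening
print(messages)  # ['setup done', 'retrying...']
```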

Ok so now we have a TrainingDataEntry class to hold the data for each entry and a function to generate the fictional user queries that will automatically retry if there is an error. See how we’re making the problem smaller and smaller?

Putting it all together

All we need to do now is create a TrainingDataEntry object for each entry in the training data, get the fictional user query by calling our function, and then save all the objects to a new JSON file to get our training data. Let’s do that now:

with open(INPUT_DATA, "r", encoding="utf-8") as file:
    input_data = json.load(file)
    output_data = []

We open the INPUT_DATA file in read mode and use the json.load function to load the JSON data from the file into the input_data variable. We then create an empty list and assign it to the output_data variable. So far so good right? Let’s finish it:

with open(INPUT_DATA, "r", encoding="utf-8") as file:
    input_data = json.load(file)
    output_data = []
    for finx_email in tqdm(input_data, desc="Generating training data"):
        finx_email["body"] = html_email.html_to_markdown(finx_email["body"])
        training_data_entry = TrainingDataEntry(
            fictional_user_query=get_fictional_user_query(finx_email),
            markdown_email=str(finx_email),
        )
        output_data.append(training_data_entry.data)

First of all, we open a loop over each finx_email in the input_data list. Wrapping input_data in tqdm() gives us a progress bar for the loop, and the desc argument sets the progress bar’s description, which will be “Generating training data”.

Now remember that each finx_email is a dictionary with a subject and body key. We convert the body from HTML to Markdown using our html_to_markdown utility function, and then we create a TrainingDataEntry object for the finx_email. The fictional_user_query is obtained by calling our get_fictional_user_query function with the finx_email as input, and the markdown_email is just the finx_email dictionary converted to a string. The str() conversion here is important as the OpenAI API will crash if you send it all sorts of nested objects as input, but if they’re in string format it will work fine.

We then append the dictionary that the data attribute of the TrainingDataEntry holds to the output_data list. Now when we’ve called this once for every entry in the dataset we must save our data:

with open(OUTPUT_DATA, "w", encoding="utf-8") as file:
    json.dump(output_data, file, indent=4)

print(f"Total tokens used: {TOTAL_TOKENS_USED}")

This opens the OUTPUT_DATA file in write mode and uses the json.dump function to save the output_data list to the file in JSON format. We also specify the indent argument as 4 to make the JSON data more readable. Finally, we added a print statement to see how many tokens we used in the process. I’m not going to calculate the exact costs here as it would add unneeded complexity for a one-time-use script. It will be roughly $0.30 if you make the full 200 calls, possibly a bit less.

Time to sit back!

And that’s it! We now have a data generator. Go ahead and run your chris_gpt_dataset_generator.py file and you should see a progress bar in the terminal as it generates the data. It will take a few minutes to run. You may run into a rate limit, in which case tenacity will trigger for you and pause the process for 60 seconds before continuing. (This is also the reason we’re not using fancy async API calls here: we would hit the rate limit regardless, and they would add pointless complexity to a script we will rarely run.)

Let it run and grab a coffee☕ or your favorite snack🥨 while you watch that glorious progress bar creep up and relax for a moment, you’ve earned it 😄! When it’s done you should have a new Finx_completed_dataset.json file in your data directory.

Generating training data: 100%|███████████████████████████████████████████████| 200/200 [08:56<00:00,  2.68s/it]
Total tokens used: 224800

Open it up and take a look at the data, it should be a list of dictionaries with the same structure as the data attribute of the TrainingDataEntry class:

[
    {
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant that writes emails for the Finxter email newsletter, adhering perfectly to the style and tone of the Finxter brand and Chris' writing style. You will respond in the following format: {\"subject\": \"The subject of the email\", \"body\": \"The body of the email in Markdown formatting\"}."
            },
            {
                "role": "user",
                "content": "- Investment opportunities in OpenAI\n- Five proxies to invest in OpenAI:\n  - MSFT\n  - NVIDIA\n  - ARKVX\n  - META\n  - TSLA\n- Community corner featuring TradeUnafraid\n- Invitation to feature one's startup, YouTube channel, blog, or website\n- Feedback request on the email course\n- Invitation to invite a friend or colleague to the Finxter Email Academy\n- Download link for the Ultimate Python Cheat Sheet\n- Motivational quote on knowledge compounding\n- Contact information for Dr. Christian Mayer\n- Unsubscribe option"
            },
            {
                "role": "assistant",
                "content": "{'subject': '5 Proxies to Investing in OpenAI', 'body': '[![]({Link})]({Link})\\n\\nHey {User},\\n\\nTo profit from change, we need to increase ownership of disruptive trends.\\nToday\\'s article covers a question that many Finxters frequently ask:\\n\\n\ud83e\udeb4 [ **Blog** ] [How to Invest in OpenAI?]({Link}) \ud83c\udf33\\n\\nWhile it\\'s not possible to invest in OpenAI directly, the blog discusses five\\nalternatives:\\n\\n  * **MSFT** (49% stake in OpenAI),\\n  * **NVIDIA** (makes more revenue from OpenAI than any other company),\\n  * **ARKVX** ( _Anthropic!_ ),\\n  * **META** ( _Llama 2!_ ), and\\n  * **TSLA** (Optimus!).\\n\\nCheck it out if you\\'re interested in any of those! No financial advice. \ud83d\ude0a\\n\\nBe on the right side of change. \ud83d\ude80  \\nChris\\n\\n**\u2665\ufe0f Community Corner: Featured Resources**\\n\\n  * [TradeUnafraid]({Link}) is a trading platform owned and operated by Finxter community member Lee.\\n\\nDo you want to feature your own startup, YouTube channel, blog, or website as\\na [Finxter premium member]({Link})? Hit reply and let me know!\\n\\n**[How are we doing?]({Link})**  \\n[\u2b50]({Link})  \\n[\u2b50\u2b50]({Link})  \\n[\u2b50\u2b50\u2b50]({Link})  \\n  \\nTo make sure you keep getting these emails, please add _chris@finxter.com_ to\\nyour address book.  \\n  \\nI\\'d love to hear your feedback so that I can improve this free email course\\nover time. Please reply to this email and share everything on your mind!  \\n  \\n**If you find the Finxter Email Academy useful, please invite a friend or\\ncolleague! \u2764**\\n\\n  \\nHere\\'s the subscription link you can share:  \\n[https://blog.finxter.com/subscribe/]({Link})  \\n  \\nDownload the Ultimate Python Cheat Sheet here (direct PDF download): \ud83d\udc0d\\n\\n**[The Ultimate Python Cheat Sheet]({Link})**  \\n  \\nNot very motivated to learn today? 
Consider this:  \\n**_\"Knowledge compounds!\"_** \\\\-- Warren Buffett  \\n  \\nConsequently, if you improve your skills by 1% every day, you\\'ll 36x your\\nprogramming skills within a year!\\n\\n  \\n_**Finxter, Dr. Christian Mayer**  \\n{Address}., {City}, {Country}_\\n\\nWant out of the loop? I\\'m so sad to see you go. \ud83d\ude22 How could we have done\\nbetter?  \\nTo help future Finxters, please hit reply and tell us! \ud83e\udd17\\n\\n[Unsubscribe here]({Link})\\n\\n![]({Link})\\n\\n'}"
            }
        ]
    },
    ... many more entries ...

Perfect, this has all the data we need to train our ChrisGPT model. We have the system message that is the same for all entries, the user message that is a fictional user query for the email, and the assistant’s response message that is the email itself. That’s it for part 2 of this tutorial. I’ll see you in part 3, where we will flatten and validate the data we just generated and calculate the training cost 🚀. See you there!

💡 Tip: This is a full-text tutorial on how to fine-tune ChatGPT using the OpenAI API with code! For a video guide-through, check out our premium course “Fine-Tuning OpenAI: How to Create and Scale Your Digital Self (Example: ChrisGPT)

Part 3: Data Validation and Training Cost

Welcome back to part 3! This is where we’re going to do the last preparation and data validation steps on our dataset and also calculate how much it’s going to cost us to train the model.

JSONL format

Remember in part 1, where we discussed the training data needing to be in JSONL format? Well, it’s time to come back to that now. So what is JSONL format?

JSONL, or JSON Lines, is a convenient format for storing structured data that may be processed one record at a time. Each line in a JSONL file is a valid JSON object. This is different from a regular JSON file, where the entire file is a single JSON object or array.

Each line is a separate, independent JSON object. This means that a large file can be read into memory one line at a time, instead of needing to read the entire data into memory at once, which can be a significant advantage when working with very large datasets. This makes it very useful for streaming JSON data object by object through another process like training an LLM model!

So say we have an object that looks like this:

[
  {
    "employee": {
      "name": "John Doe",
      "age": 30,
      "department": "Sales",
      "address": {
        "street": "123 Main St",
        "city": "Springfield",
        "state": "IL",
        "zip": "62701"
      }
    }
  },
  {
    "employee": {
      "name": "Jane Smith",
      "age": 28,
      "department": "Marketing",
      "address": {
        "street": "456 Elm St",
        "city": "Springfield",
        "state": "IL",
        "zip": "62701"
      }
    }
  },
  {
    "employee": {
      "name": "Joe Schmoe",
      "age": 35,
      "department": "Engineering",
      "address": {
        "street": "789 Oak St",
        "city": "Springfield",
        "state": "IL",
        "zip": "62701"
      }
    }
  }
]

Then the JSONL version is essentially just a flattened-down version of this, with each object on a single line. Note that we can remove the brackets and the commas between different objects, as it is a given that each line contains one JSON object in this format:

{"employee": {"name": "John Doe", "age": 30, "department": "Sales", "address": {"street": "123 Main St", "city": "Springfield", "state": "IL", "zip": "62701"}}}
{"employee": {"name": "Jane Smith", "age": 28, "department": "Marketing", "address": {"street": "456 Elm St", "city": "Springfield", "state": "IL", "zip": "62701"}}}
{"employee": {"name": "Joe Schmoe", "age": 35, "department": "Engineering", "address": {"street": "789 Oak St", "city": "Springfield", "state": "IL", "zip": "62701"}}}

You will probably see the objects wrap around, but this is only a visual thing. In the actual file, each object is on a single line.
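A minimal sketch of why this format is so convenient: each line can be parsed independently with the standard json module, so a large file can be processed one record at a time. Here the “file” is just a string for demonstration:

```python
import json

# Two JSONL records, one JSON object per line
jsonl_text = (
    '{"name": "John Doe", "age": 30}\n'
    '{"name": "Jane Smith", "age": 28}\n'
)

# Each line is parsed on its own, so we never need the whole
# dataset in memory as a single JSON structure.
records = []
for line in jsonl_text.splitlines():
    records.append(json.loads(line))

print(records[1]["name"])  # Jane Smith
```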

Flattening our dataset into a JSONL file

So let’s create a utility function to flatten our dataset into a JSONL file. In your existing utils folder, make a new file called jsonl.py:

📁Finx_Fine_Tuning
    📁data
        📄Finx_completed_dataset.json
        📄Finx_dataset.json
    📁utils
        📄html_email.py
        📄jsonl.py          (new file)
    📄.env
    📄constants.py
    📄chris_gpt_dataset_generator.py
    📄Pipfile
    📄Pipfile.lock

In jsonl.py, add the following imports to get started:

import json
from pathlib import Path
from typing import Iterable

We import the json module to read and save JSON data. We import Path and Iterable only to use them as type hints, to make sure our code is as clear and readable as possible. First, let’s make the problem smaller by creating a function that takes a list or iterable of dictionaries, and converts them into a JSONL file. Add the following function to jsonl.py:

def dicts_to_jsonl(output_file: Path, data: Iterable[dict]) -> Path:
    with open(output_file, "w") as file:
        for dict_obj in data:
            json_string = json.dumps(dict_obj)
            file.write(json_string + "\n")
    return output_file

This function takes two arguments: output_file is the path to the file we want to write, and data is an iterable of dictionaries. We open the file in write mode, and then loop through each dictionary in the iterable. We convert each dictionary to a JSON string using json.dumps, and then write it to the file. We add a newline character at the end of each line to separate the JSON objects. Finally, we return the path to the file as a Path object.

Ok, that handles directly converting a list of dictionaries stored in memory, now let’s add a second function below that will handle converting an existing JSON file into a JSONL file. Add the following function to jsonl.py:

def json_to_jsonl(input_file: Path, output_file: Path) -> Path:
    with open(input_file, "r") as in_file:
        data = json.load(in_file)

    return dicts_to_jsonl(output_file, data)

This function takes two arguments: input_file is the path to the JSON file we want to read, and output_file is the path to the JSONL file we want to write. We open the input file in read mode, and then load the JSON data into memory using json.load. We then call the dicts_to_jsonl function we defined earlier to write the data to the output file.

With this composition we now have two functions, one for converting dictionaries in memory and another for an existing JSON file, without duplicating any code. Go ahead and save and close jsonl.py.
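Here is a quick usage sketch of the two functions, run against throwaway temp files (the file names are made up for the demo; in the project we point them at our real dataset paths). The function bodies are repeated so the snippet runs on its own:

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable

def dicts_to_jsonl(output_file: Path, data: Iterable[dict]) -> Path:
    # Write one JSON object per line
    with open(output_file, "w") as file:
        for dict_obj in data:
            file.write(json.dumps(dict_obj) + "\n")
    return output_file

def json_to_jsonl(input_file: Path, output_file: Path) -> Path:
    # Load a regular JSON array and delegate to dicts_to_jsonl
    with open(input_file, "r") as in_file:
        data = json.load(in_file)
    return dicts_to_jsonl(output_file, data)

# Demo with throwaway files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "data.json"
    jsonl_path = Path(tmp) / "data.jsonl"
    json_path.write_text(json.dumps([{"a": 1}, {"b": 2}]))

    json_to_jsonl(json_path, jsonl_path)
    lines = jsonl_path.read_text().splitlines()
    print(lines)  # ['{"a": 1}', '{"b": 2}']
```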

Validating our dataset

Before we train our model, we need to make sure our dataset is in the right format and we’ll also check how much this is going to cost, and make sure none of the entries exceed the token limit. This may all seem a bit overkill, but you really don’t want to start training a model and have it fail halfway due to sloppy data or a single entry that is too long. It’s also considerably more expensive than other ways of using ChatGPT because we’re creating a whole custom model, so it’s nice to know ahead of time exactly how much money you’re going to spend.

We’re writing most of these specific things in utility functions in separate files, so you can reuse all of these for your future fine-tuning projects. We’ll do the same for the validation and price-calculator logic. In your existing utils folder, make a new file called data_validation.py:

📁Finx_Fine_Tuning
    📁data
        📄Finx_completed_dataset.json
        📄Finx_dataset.json
    📁utils
        📄data_validation.py          (new file)
        📄html_email.py
        📄jsonl.py
    📄.env
    📄constants.py
    📄chris_gpt_dataset_generator.py
    📄Pipfile
    📄Pipfile.lock

Time to install the tiktoken library before we start writing the code. Open your terminal and run the following command:

pipenv install tiktoken==0.6.0

The tiktoken library is a Python package developed by OpenAI. We’ll use it to count the number of tokens in a text string without making any API calls.

In data_validation.py, get started by adding the following imports:

import json
from decimal import Decimal
from pathlib import Path

import tiktoken

Most of these are familiar by now, but we also import Decimal from the decimal module. We’ll use this to handle the cost calculations, as it’s more precise than floating point numbers and spares us the annoying rounding errors.

Now define a constant that will be used for our calculations:

TRAINING_COST_PER_1000_TOKENS = Decimal("0.0080")

This is the cost per 1000 tokens for training data at the time of writing, but it may have changed by the time you’re reading this tutorial. You can check the current cost on the OpenAI pricing page and adjust this number accordingly.
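The cost arithmetic itself is simple; here is a minimal sketch using the rate above. The 500,000-token figure is a made-up example, and note that the real training bill also depends on how many epochs the model is trained for:

```python
from decimal import Decimal

TRAINING_COST_PER_1000_TOKENS = Decimal("0.0080")

def training_cost(total_tokens: int) -> Decimal:
    # Price scales linearly with the number of training tokens
    return TRAINING_COST_PER_1000_TOKENS * total_tokens / 1000

# Hypothetical token count, for illustration only
print(training_cost(500_000))
```

Because Decimal arithmetic is exact for these values, we avoid the binary floating-point artifacts we would get from computing 0.008 * 500000 / 1000 with floats.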

Creating the Validator class

Now let’s create our Validator. As we’ll have a lot of related functions, let’s use a class to group them together and start with the __init__ method:

class Validator:
    def __init__(self, jsonl_file: Path) -> None:
        self.data = self._load_data(jsonl_file)
        self._token_list = None
        self.encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

So the __init__ method will get called when we instantiate a new instance of this class, and it will take a Path argument to the JSONL file we want to validate. We’ll load the data from the file and store it in the data attribute using the _load_data method we’ll define next.

We’ll also initialize the _token_list attribute to None for now, and we’ll use it to store the token count for each entry in the dataset. Finally we store the encoding for the model we’re going to use in the encoding attribute. As the tiktoken library was also made by OpenAI, it has a handy method to let us load up the proper encoding for the model we’re going to use.

Now let’s add the _load_data method. As our data file is not that massive, we’ll just load up the whole file at once and not worry about loading the JSONL one line at a time:

class Validator:
    def __init__():
        ...

    def _load_data(self, jsonl_file: Path) -> list:
        with open(jsonl_file, "r", encoding="utf-8") as file:
            data = [json.loads(line) for line in file]
        return data

No big surprises here: we take the path as input and return a list. The only difference is that, since the data is in JSONL format, we use a list comprehension. For each line in the file, we call json.loads to convert the JSON string to a Python dictionary, which then becomes an element of the list saved as the variable data.

Now let’s add a method to calculate the token count for each entry in the dataset:

class Validator:
    def __init__():
        ...

    def _load_data():
        ...

    def _calculate_token_amount_per_entry(self) -> list[int]:
        extra_tokens_per_message = 2
        token_list = []
        for training_data_object in self.data:
            num_tokens = 0
            for message in training_data_object["messages"]:
                for _, value in message.items():
                    num_tokens += len(self.encoding.encode(str(value)))
                    num_tokens += extra_tokens_per_message
            token_list.append(num_tokens)
        return token_list

This method returns the approximate token count per entry as a list of integers. We start by defining a variable extra_tokens_per_message and setting it to 2. This is the current number of extra tokens we need to add to account for the object structure, beyond just the strings themselves, to arrive at an accurate number. We then loop through each training_data_object in the dataset and set a counter num_tokens to 0.

As this is ChatCompletion data, we know that the messages are stored in a list under the key “messages”. We loop through each message and then through each key-value pair in the message. (We use an _ for the key because we don’t need it in this case, but we need to use it as a placeholder to unpack the tuple.)

We call self.encoding.encode to encode the value to a list of tokens, and then add the length of this list to num_tokens, as it’s only the len or length that we are interested in. We then add the extra_tokens_per_message to account for the object structure as discussed, as this also takes up tokens.

After all the key-value pairs inside each index of the messages inside a training_data_object are processed, we append the num_tokens to the token_list and then move on to the next training_data_object in the list.
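
The counting logic is easy to check in isolation. Here is a sketch with a hypothetical stand-in encoder (one that just splits on whitespace, instead of tiktoken) so the numbers are simple to verify by hand:

```python
# Hypothetical stand-in encoder: one "token" per whitespace-separated word
def fake_encode(text: str) -> list[str]:
    return text.split()


extra_tokens_per_message = 2
entry = {
    "messages": [
        {"role": "user", "content": "Write me an email"},  # 1 + 4 words
        {"role": "assistant", "content": "Hey Finxters!"},  # 1 + 2 words
    ]
}

# Same accumulation as _calculate_token_amount_per_entry
num_tokens = 0
for message in entry["messages"]:
    for _, value in message.items():
        num_tokens += len(fake_encode(str(value)))
        num_tokens += extra_tokens_per_message

print(num_tokens)  # 16: (1+2) + (4+2) + (1+2) + (2+2)
```

With the real tiktoken encoder the per-string counts differ, but the accumulation works exactly the same way.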

Now let’s add a function to check if our formatting has any mistakes in it:

class Validator:
    def __init__():
        ...

    def _load_data():
        ...

    def _calculate_token_amount_per_entry():
        ...

    def _check_single_entry_format(self, entry) -> bool:
        if not isinstance(entry, dict):
            return False

        if list(entry.keys()) != ["messages"]:
            return False

        messages = entry.get("messages", [])

        return all(
            isinstance(message, dict) and "role" in message and "content" in message
            for message in messages
        )

This function will return True if the entry is in the correct format, and False if it's not. It checks a single entry, or training_data_object, in our dataset at a time. First, it checks whether the entry is a dictionary. After that, we call keys() on the entry to get the dictionary keys and call list() on the result to convert it to a list. We then check whether that list equals ["messages"], to make sure the entry has one and only one key, and that key is “messages”.

We then call the get() method on the entry to get the value of the “messages” key. Now the last line uses a generator expression and might look confusing if you’re not familiar with it, so let’s break it down step by step.

A generator expression is similar to a list comprehension, but it doesn’t store the list in memory. Instead, it generates each value on the fly as you iterate over it. This can be more memory-efficient than a list comprehension for large sequences, though it doesn’t matter much for our dataset size here. The generator expression in the code is:

(message for message in messages)

This generates a sequence of message values, one for each message in messages.

The isinstance(message, dict) and "role" in message and "content" in message part is a condition that checks whether each message is a dictionary and whether it contains the keys role and content.

The all() function takes an iterable (in this case, the generator expression) and returns True if all elements of the iterable are truthy (i.e., they evaluate to True), and False as soon as a single element is not. So, in simple terms, we check whether all messages in the messages list are dictionaries containing the keys role and content, and return either True or False.
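
Here is a quick standalone sketch of the same all() plus generator-expression pattern, with one valid and one invalid message list:

```python
messages_good = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
# Same list, plus one entry that is not a dictionary at all
messages_bad = messages_good + ["not a dict"]

check_good = all(
    isinstance(m, dict) and "role" in m and "content" in m for m in messages_good
)
check_bad = all(
    isinstance(m, dict) and "role" in m and "content" in m for m in messages_bad
)

print(check_good)  # True
print(check_bad)   # False: all() short-circuits on the bad entry
```

Note that all() stops iterating as soon as it hits a falsy value, so invalid datasets fail fast.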

Now, let’s add a property to get the token_list, so we can easily access it:

class Validator:
    def __init__():
        ...

    def _load_data():
        ...

    def _calculate_token_amount_per_entry():
        ...

    def _check_single_entry_format():
        ...

    @property
    def token_list(self) -> list[int]:
        if self._token_list is None:
            self._token_list = self._calculate_token_amount_per_entry()
        return self._token_list

The @property decorator here means that we can access this particular method as a property, so using self.token_list instead of calling it as a method with self.token_list(). First, it checks whether self._token_list is None, which it will be the first time we access it. If so, it calls the _calculate_token_amount_per_entry method to calculate the token list and stores the result in the self._token_list attribute, then returns it. On subsequent accesses it simply returns the stored _token_list without recalculating it, as it's already been calculated and stored.

Note that the methods with the _ prefix are meant to be private, so the _token_list is our implementation detail here, and the token_list property is the public interface to access it. This is a good practice because it ensures that _token_list is always in a valid state when it’s accessed, and it hides the details of how _token_list is implemented and managed from the rest of your program by providing token_list as an access point.
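
This lazy-caching pattern is worth knowing on its own. Here is a minimal standalone sketch with a hypothetical ExpensiveReport class, showing that the computation really only runs once:

```python
class ExpensiveReport:
    def __init__(self) -> None:
        self._summary = None
        self.calls = 0  # just to count how often we really compute

    def _build_summary(self) -> str:
        self.calls += 1  # pretend this is slow and expensive
        return "summary"

    @property
    def summary(self) -> str:
        if self._summary is None:  # only computed on first access
            self._summary = self._build_summary()
        return self._summary


report = ExpensiveReport()
print(report.summary)  # summary
print(report.summary)  # summary (cached, no recomputation)
print(report.calls)    # 1
```

As a side note, Python's standard library also offers functools.cached_property, which implements the same idea with less boilerplate.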

Now let’s add a method to check if the dataset is valid:

class Validator:
    def __init__():
        ...

    def _load_data():
        ...

    def _calculate_token_amount_per_entry():
        ...

    def _check_single_entry_format():
        ...

    @property
    def token_list():
        ...

    def validate_data(self) -> bool:
        return all(self._check_single_entry_format(entry) for entry in self.data)

This method will return True if all entries in the dataset are in the correct format, and False if any of them are not. It uses a generator expression in the same style as we did before. Note that it will stop checking as soon as it finds an entry that fails the _check_single_entry_format test, because all stops iterating as soon as it encounters a False value.

Now let’s add a method to get the training cost in dollars:

class Validator:
    def __init__():
        ...

    def _load_data():
        ...

    def _calculate_token_amount_per_entry():
        ...

    def _check_single_entry_format():
        ...

    @property
    def token_list():
        ...

    def validate_data():
        ...

    def get_training_cost_in_dollars(self, epochs: int = 3) -> Decimal:
        total_tokens = sum(self.token_list)
        total_cost_dollars = (
            TRAINING_COST_PER_1000_TOKENS * total_tokens / 1000 * epochs
        )
        print(
            f"Total estimated cost: ~${total_cost_dollars:.3f} for training {epochs} epochs on {total_tokens} token dataset."
        )
        return total_cost_dollars
💡 Machine-learning Top-tip 💡
Epochs are the number of times the model will go through the entire dataset during training. The more epochs, the more the model will learn and internalize our dataset. If the number is too low, it will not fully internalize our training data, but if the number is too high it will internalize our specific examples too much and lose its ability to generalize, a concept called overfitting. 3 epochs is a good starting point for most fine-tuning tasks.

This method will return the total cost in dollars for training the model for a given number of epochs as a Decimal type object. It uses the sum function to calculate the total number of tokens in the dataset and then does simple math to get the total cost in dollars. We print the total cost to three decimal places using the :.3f format specifier in the f-string and then return it.
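
The math is easy to verify by hand. Assuming TRAINING_COST_PER_1000_TOKENS was set to Decimal("0.008") earlier (the rate that matches the example output below; check the pricing page for the current number), the calculation for a 216,000-token dataset works out like this:

```python
from decimal import Decimal

# Assumed rate: $0.008 per 1000 training tokens at the time of writing
TRAINING_COST_PER_1000_TOKENS = Decimal("0.008")

total_tokens = 216_000
epochs = 3

# Same formula as get_training_cost_in_dollars
total_cost_dollars = TRAINING_COST_PER_1000_TOKENS * total_tokens / 1000 * epochs
print(f"~${total_cost_dollars:.3f}")  # ~$5.184
```

Using Decimal instead of float keeps the money math exact, avoiding binary floating-point rounding surprises.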

One last method and we’ll be done, I promise! 😄 We want to be able to make sure the longest entry is not above our token limit:

class Validator:
    def __init__():
        ...

    def _load_data():
        ...

    def _calculate_token_amount_per_entry():
        ...

    def _check_single_entry_format():
        ...

    @property
    def token_list():
        ...

    def validate_data():
        ...

    def get_training_cost_in_dollars():
        ...

    def longest_entry_token_count(self) -> int:
        return max(self.token_list)

We use the max function to get the maximum value from the token_list and return it. The token limit per training example, so for every line in our JSONL file, is the same as the context limit of the ChatGPT model we're using. For gpt-3.5-turbo-1106, the maximum context length is 16,385 tokens, so as long as this number is below that, you'll know you're safe.

Here is the whole class again for reference:

class Validator:
    def __init__(self, jsonl_file: Path) -> None:
        self.data = self._load_data(jsonl_file)
        self._token_list = None
        self.encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

    def _load_data(self, jsonl_file: Path) -> list:
        with open(jsonl_file, "r", encoding="utf-8") as file:
            data = [json.loads(line) for line in file]
        return data

    def _calculate_token_amount_per_entry(self) -> list[int]:
        extra_tokens_per_message = 2
        token_list = []
        for training_data_object in self.data:
            num_tokens = 0
            for message in training_data_object["messages"]:
                for _, value in message.items():
                    num_tokens += len(self.encoding.encode(str(value)))
                    num_tokens += extra_tokens_per_message
            token_list.append(num_tokens)
        return token_list

    def _check_single_entry_format(self, entry) -> bool:
        if not isinstance(entry, dict):
            return False

        if list(entry.keys()) != ["messages"]:
            return False

        messages = entry.get("messages", [])

        return all(
            isinstance(message, dict) and "role" in message and "content" in message
            for message in messages
        )

    @property
    def token_list(self) -> list[int]:
        if self._token_list is None:
            self._token_list = self._calculate_token_amount_per_entry()
        return self._token_list

    def validate_data(self) -> bool:
        return all(self._check_single_entry_format(entry) for entry in self.data)

    def get_training_cost_in_dollars(self, epochs: int = 3) -> Decimal:
        total_tokens = sum(self.token_list)
        total_cost_dollars = (
            TRAINING_COST_PER_1000_TOKENS * total_tokens / 1000 * epochs
        )
        print(
            f"Total estimated cost: ~${total_cost_dollars:.3f} for training {epochs} epochs on {total_tokens} token dataset."
        )
        return total_cost_dollars

    def longest_entry_token_count(self) -> int:
        return max(self.token_list)

Using the Validator

So give yourself a pat on the back for that 😎. Now let’s train us some ChrisGPT! Save and close this file, then create a new file in your root directory named chris_gpt_dataset_validation.py:

📁Finx_Fine_Tuning
    📁data
        📄Finx_completed_dataset.json
        📄Finx_dataset.json
    📁utils
        📄data_validation.py
        📄html_email.py
        📄jsonl.py
    📄.env
    📄constants.py
    📄chris_gpt_dataset_generator.py
    📄chris_gpt_dataset_validation.py          (new file)
    📄Pipfile
    📄Pipfile.lock

In chris_gpt_dataset_validation.py, add the following setup to get started:

from utils import data_validation, jsonl
from constants import DATA_DIRECTORY


JSON_FILE = DATA_DIRECTORY / "Finx_completed_dataset.json"
JSONL_FILE = DATA_DIRECTORY / "Finx_completed_dataset.jsonl"

We import all the stuff we made and prepared ourselves, and then we define the paths to the existing JSON file and the JSONL file we want to create. Now let’s make some good use of all the hard work we’ve done so far:

jsonl.json_to_jsonl(JSON_FILE, JSONL_FILE)  # Only run once

data_validator = data_validation.Validator(JSONL_FILE)

print(f"Data valid: {data_validator.validate_data()}")
data_validator.get_training_cost_in_dollars()
print(f"Longest entry: {data_validator.longest_entry_token_count()} tokens")

We convert our JSON file to a JSONL file with the same name. The “Only run once” comment means you can comment out that line after running the file the first time. Nothing bad will happen if you don't, though; it will just do some unneeded work recreating the same file.

Then we create a new instance of our Validator class and pass the path to the JSONL file as an argument. We call the validate_data method to check if the dataset is valid and print the result. We then call the get_training_cost_in_dollars method to get the estimated training cost, which will get printed to the console automatically, and finally, we call the longest_entry_token_count method to get the token count of the longest entry in the dataset so we can make sure we don’t exceed the token limit.

Let’s run the file we have so far just as an interim test. You should get an output in your terminal that looks something like this:

Data valid: True
Total estimated cost: ~$5.184 for training 3 epochs on 216000 token dataset.
Longest entry: 2441 tokens

Your numbers will be slightly different from mine, as the data is partly LLM generated, but they will be very close to this. We can see our data is valid, we have over 200,000 tokens in total, and the longest entry is 2441 tokens, which is well below the 16,385 token limit for the gpt-3.5-turbo-1106 model.

You’ll also notice that a JSONL file has been created in your data directory with the training data in JSONL format:

📁Finx_Fine_Tuning
    📁data
        📄Finx_completed_dataset.json
        📄Finx_completed_dataset.jsonl ✨
        📄Finx_dataset.json
    ...

Now you might be surprised by the cost here. While $5 is not a massive amount of money it is a whole lot more than we typically consume when making regular ChatGPT calls. This is the reason we took so much time on the data validation, to make sure we get the data right the first time, and to know the exact cost before we commit to the training.

For those $5 you get something pretty damn cool though, your own custom ChatGPT 😎. That being said, I understand if you’re not willing to spend $5 on this simple test project. You can run with half the training data, which is 100 examples, or even a quarter, which is 50 examples. But your output will not be as good as mine if you do so.

Limiting the dataset size

Let’s make some small changes to the code so you can limit your dataset size if you want to:

import json

from constants import DATA_DIRECTORY
from utils import data_validation, jsonl


JSON_FILE = DATA_DIRECTORY / "Finx_completed_dataset.json"
JSONL_FILE = DATA_DIRECTORY / "Finx_completed_dataset.jsonl"
LIMIT = 100


with open(JSON_FILE, "r", encoding="utf-8") as in_file:
    data = json.load(in_file)
    jsonl.dicts_to_jsonl(JSONL_FILE, data[:LIMIT])

data_validator = data_validation.Validator(JSONL_FILE)

print(f"Data valid: {data_validator.validate_data()}")
data_validator.get_training_cost_in_dollars()
print(f"Longest entry: {data_validator.longest_entry_token_count()} tokens")

We added an import for json, and we set a constant named LIMIT. We then simply manually load the data from the JSON_FILE and use the dicts_to_jsonl function instead of the json_to_jsonl function, passing in only the first LIMIT number of examples using a simple slice. Note how easy this is: because we built the jsonl utility module out of small pieces, we can simply use a different piece this time.

I’m going to set the LIMIT variable to None as I want to use the full 200 examples for mine. Choose whatever number you want to use for the LIMIT, and then run the file again. It will create the new JSONL_FILE with the limited number of examples, and then validate and tell you the new cost. Limiting to 100 examples will cost you around $2.55.
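
In case you're wondering why LIMIT = None works with the slice: slicing with None as a bound is the same as leaving the bound out, so data[:None] simply returns the whole list:

```python
data = ["example_1", "example_2", "example_3", "example_4"]

print(data[:2])             # ['example_1', 'example_2']
print(data[:None])          # the full list, same as data[:]
print(data[:None] == data)  # True
```

This is why the same line of code handles both a numeric LIMIT and "no limit at all" without any special casing.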

Now that we know the cost, and we know our data is valid, we can move on to the next part where we’ll actually train our model on the JSONL data. I’ll see you there! 🚀

💡 Tip: This is a full-text tutorial on how to fine-tune ChatGPT using the OpenAI API with code! For a video guide-through, check out our premium course “Fine-Tuning OpenAI: How to Create and Scale Your Digital Self (Example: ChrisGPT)

Part 4: Training and Running ChrisGPT

Hi and welcome back to part 4, where we’ll be training and running ChrisGPT. In this part, we’ll finally be using the OpenAI fine-tuning API endpoints, which are fairly simple!

There are two ways to use the fine-tuning API, both of which are very simple. The first way to do this is programmatically, using Python code just like we do when sending normal calls to ChatGPT. We’ll be looking at this first. The second way is to use the web interface for the fine-tuning API.

Using the fine-tuning API programmatically

Go ahead and create a new file called chris_gpt_training.py in the root directory of your project:

📁Finx_Fine_Tuning
    📁data
        ...
    📁utils
        ...
    📄.env
    📄chris_gpt_dataset_generator.py
    📄chris_gpt_dataset_validation.py
    📄chris_gpt_training.py ✨
    📄constants.py
    📄Pipfile
    📄Pipfile.lock

So let’s start with our imports and basic setup:

from constants import CLIENT, DATA_DIRECTORY


JSONL_FILE = DATA_DIRECTORY / "Finx_completed_dataset.jsonl"
MODEL = "gpt-3.5-turbo-1106"
SUFFIX = "chris_gpt"

We import the OpenAI client we stored in CLIENT, along with DATA_DIRECTORY. Then we quickly set up a path to the JSONL training data (make sure you don't accidentally use the .json file instead of the .jsonl one). We also set the model to the 1106 version, as this is the newest one that supports fine-tuning. My testing showed 1106 to be significantly better for fine-tuning than the older 0613 version.

Finally, the SUFFIX part will allow us to choose a part of the fine-tuning model’s name ourselves. This suffix will become part of the name you use to call your model, which is quite useful for identification as the model names are a bit long and all extremely similar if you don’t have a suffix. An example:

# Example model name without suffix
ft:gpt-3.5-turbo-1106:personal::8ot8ZLJR

# Example model name with suffix
ft:gpt-3.5-turbo-1106:personal:chris-gpt:8ot8ZLJR

Now let’s have a look at the file-related methods in the OpenAI client. Make sure you don’t run the file yet:

# File related methods
file = CLIENT.files.create(file=open(JSONL_FILE, "rb"), purpose="fine-tune")

CLIENT.files.list(purpose="fine-tune")

CLIENT.files.delete(file.id)

These methods are all fairly self-explanatory. The create method will upload your file, make sure to use the rb (read-binary) mode for the file and provide the purpose as fine-tune so that the OpenAI servers know what this file is for. This returns an object we catch in the file variable. It looks like this and contains some basic file data, most importantly the id:

# Example of a file object

FileObject(
    id="file-DamWAnhgpnRvEyMZ3dOdHpvG",
    bytes=865053,
    created_at=1708303339,
    filename="Finx_completed_dataset.jsonl",
    object="file",
    purpose="fine-tune",
    status="processed",
    status_details=None,
)

The list method will list all files, allowing us to filter on a purpose, so we filter for files with the purpose fine-tune here. It returns a list of the FileObject objects you see above. The delete method will delete a file by the id you pass in.

Fine-tuning methods

Now let’s take a look at the fine-tuning-job related methods. Again, don’t run this file yet, let’s just have a quick look first:

# Fine-tuning-job related methods
fine_tuning_job = CLIENT.fine_tuning.jobs.create(
    model=MODEL,
    training_file=file.id,
    hyperparameters={"n_epochs": 3},
    suffix=SUFFIX,
)

CLIENT.fine_tuning.jobs.list()

CLIENT.fine_tuning.jobs.retrieve(fine_tuning_job.id)

CLIENT.fine_tuning.jobs.cancel(fine_tuning_job.id)

The most important method is of course the create method, which will create a fine-tuning job. You pass in the model and the file.id of the file you want to do the training on. You can optionally pass in hyperparameters such as the number of epochs we discussed earlier, and the suffix we talked about. This method returns a FineTuningJob object, which looks like this:

FineTuningJob(
    id="ftjob-1OATxnQAgdY4yjPNmSBai95f",
    created_at=1708318423,
    error=Error(code=None, message=None, param=None, error=None),
    fine_tuned_model=None,
    finished_at=None,
    hyperparameters=Hyperparameters(
        n_epochs=3, batch_size="auto", learning_rate_multiplier="auto"
    ),
    model="gpt-3.5-turbo-1106",
    object="fine_tuning.job",
    organization_id="org-oMYMXpp7Cr9pG1rG5Z8a1T2w",
    result_files=[],
    status="validating_files",
    trained_tokens=None,
    training_file="file-EX13iLyISBZcreRCH3Fm51Pn",
    validation_file=None,
)

We can see that the FineTuningJob object also has an ID that we can use to refer to it, and some basic info, such as the fact that this one has not finished yet. We can see the hyperparameters, including the batch_size and learning_rate_multiplier which are set to auto by default.

The batch_size is the number of training examples the model processes before updating its weights. It's like reading a few pages of a book, then stopping to think about what you've read before continuing. At the end of each batch, the predictions are compared to the desired output, and the error is calculated and used to update the model. We'll just leave this set to auto.

The learning_rate_multiplier is a value that multiplies the learning rate of the model. The learning rate is a hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated. If this is set too high, the model may easily overfit to the specific data it has been trained on. We'll also leave this set to auto.

The list method, again, simply returns a list of all the FineTuningJob objects for your account. The retrieve method will return a single FineTuningJob object by its id and the cancel method will cancel a fine-tuning job by its id. These methods are extremely simple and there really is nothing more to them.
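
If you did want to wait for the job to finish programmatically instead of using the web interface we'll switch to shortly, a simple polling loop over the retrieve call would do it. Here is a minimal sketch; the fetch_status parameter is a hypothetical stand-in for something like lambda: CLIENT.fine_tuning.jobs.retrieve(fine_tuning_job.id).status, so we can show the loop without a live API call:

```python
import time
from typing import Callable

# Statuses after which the job will not change anymore
TERMINAL_STATUSES = frozenset({"succeeded", "failed", "cancelled"})


def wait_for_job(fetch_status: Callable[[], str], poll_seconds: float = 30.0) -> str:
    """Poll until the fine-tuning job reaches a terminal status."""
    while True:
        status = fetch_status()
        print(f"Job status: {status}")
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)


# Simulated status sequence standing in for real API calls
statuses = iter(["validating_files", "running", "succeeded"])
print(wait_for_job(lambda: next(statuses), poll_seconds=0))  # succeeded
```

In practice you'd rarely need this, as OpenAI emails you when training finishes, but it's handy for automated pipelines.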

Again, don’t run this file yet. The last method we have is for deleting a completed fine-tuned model:

# Fine-tuned-model related methods
CLIENT.models.delete("model_id_here")

For this, you obviously need a fully trained model to delete, and you just pass in its ID. Before we actually run this file, I'll comment out several of the methods to leave them in for our reference, as we naturally don't want to delete the file we just uploaded nor cancel the fine-tuning job:

from constants import CLIENT, DATA_DIRECTORY


JSONL_FILE = DATA_DIRECTORY / "Finx_completed_dataset.jsonl"
MODEL = "gpt-3.5-turbo-1106"
SUFFIX = "chris_gpt"


# File related methods
file = CLIENT.files.create(file=open(JSONL_FILE, "rb"), purpose="fine-tune")

print(CLIENT.files.list(purpose="fine-tune"))

# CLIENT.files.delete(file.id)


# Fine-tuning-job related methods
fine_tuning_job = CLIENT.fine_tuning.jobs.create(
    model=MODEL,
    training_file=file.id,
    hyperparameters={"n_epochs": 3},
    suffix=SUFFIX,
)

# CLIENT.fine_tuning.jobs.list()

print(CLIENT.fine_tuning.jobs.retrieve(fine_tuning_job.id))

# CLIENT.fine_tuning.jobs.cancel(fine_tuning_job.id)


# Fine-tuned-model related methods
# CLIENT.models.delete("model_id_here")

I’ve gone ahead and added print statements around the files.list and the fine_tuning.jobs.retrieve calls so we can see the output in our terminal. Now go ahead and run this file to start your fine-tuning job!

You should see the FineTuningJob object printed to the console. From here on we’ll be switching to the web interface for the fine-tuning API, as it’s much easier to work with and has real-time progress.

Using the web interface for the fine-tuning API

As developers we sometimes think we need to do everything programmatically, but there really is no need when an easy web interface is available. There is little point in coding up a complex programmatic solution for something you're only going to do once in a while.

If you have a look at https://platform.openai.com/files, you will see the web interface for managing the files you uploaded to OpenAI:

File API:

This interface is much nicer to read than the long file object list in your terminal window, and you can also upload new files here directly.

Now let’s switch to the fine-tuning tab at https://platform.openai.com/finetune, to see the fine-tuning job that we have started:

Fine-tuning jobs API:

As you can see I’ve done quite some testing so I have a whole load of models here, but you should see your model either “Validating files…” or maybe already in the training stage. You can also create new fine-tuning jobs on this page or cancel a running fine-tuning job using the cancel button. We can also see a lot of details on our fine-tuning job as it progresses:

Fine-tuning details:

Please ignore all the “Cancelled” fine-tunes in my list; I tend to double-check the code when writing these things, so I triggered a lot of fine-tunes that were not needed and canceled them straight afterward. The important thing here is that when the model is done training, you will see the full model name for your fine-tune. In this case, I have ft:gpt-3.5-turbo-1106:personal:chris-gpt-full:8ot8ZLJR as my model name here, but yours is probably not done training yet.

It can take anywhere from a couple of minutes to potentially even hours for OpenAI to train your model, depending on the size of the data and how busy the servers are. Mine typically finished within 10 to 30 minutes, and OpenAI will send you an email when the model training is done. In the meantime, you can see the progress update in real-time in the right-side panel here, where you can see the training loss go down in real-time. You can also see the messages at the bottom keeping you posted on the progress.

So go ahead, it’s time again for your well-deserved coffee break☕ with your favorite snack🥨. Have a small break and give your model time to train, and I’ll see you back here when it’s done!

Running our own fine-tuned ChrisGPT model!

Ok, so I assume you’re back and your model is done training, which means you now have the name of your personal fine-tuned model from the https://platform.openai.com/finetune fine-tuning page. The name will also be in the email you receive when the training is done. First, go and open the constants.py file in your root directory, as we want to make some updates to the file:

from pathlib import Path

from decouple import config
from openai import OpenAI


CLIENT = OpenAI(api_key=str(config("OPENAI_API_KEY")))
DATA_DIRECTORY = Path(__file__).parent / "data"
## Add the below variables
OUTPUT_DIRECTORY = Path(__file__).parent / "output"
CHRIS_GPT = "ft:gpt-3.5-turbo-1106:personal:chris-gpt-full:8ot8ZLJR"

We added an output directory for our generated files, and I stored the name for my ChrisGPT model in the CHRIS_GPT variable. Make sure you replace the CHRIS_GPT string with the name of your own model and do not copy mine, as you will not be able to access my personal model. Make sure you save and close this file.

Now create the empty folder for the output:

📁Finx_Fine_Tuning
    📁data
        ...
    📁output ✨
        (empty)
    📁utils
        ...
    📄.env
    📄chris_gpt_dataset_generator.py
    📄chris_gpt_dataset_validation.py
    📄chris_gpt_training.py
    📄constants.py
    📄Pipfile
    📄Pipfile.lock

And then create a new file in your root directory called chris_gpt.py:

📁Finx_Fine_Tuning
    📁data
        ...
    📁output
        (empty)
    📁utils
        ...
    📄.env
    📄chris_gpt_dataset_generator.py
    📄chris_gpt_dataset_validation.py
    📄chris_gpt_training.py
    📄chris_gpt.py ✨
    📄constants.py
    📄Pipfile
    📄Pipfile.lock

Now let’s start with our imports for chris_gpt.py:

from constants import CLIENT, OUTPUT_DIRECTORY, CHRIS_GPT
import time

We import the CLIENT, OUTPUT_DIRECTORY, and CHRIS_GPT model-name from the constants.py file, and we also import the time module so we can easily give our output files unique names.

Now I’m going to simply declare a string variable that will contain the contents I want our email to have:

leaf_blower = """
Introduction to the AI-powered leaf blower and its innovative features in the realm of yard maintenance equipment.
Description of how the AI technology enhances the efficiency and performance of the leaf blower compared to traditional models.
Overview of the specific AI algorithms and sensors integrated into the leaf blower for optimized leaf collection and debris management.
Real-world application scenarios demonstrating the effectiveness of the AI-powered leaf blower in various landscaping and gardening tasks.
Discussion on the environmental benefits of using the AI-powered leaf blower, such as reduced noise pollution and energy consumption.
Insights into the future development and potential advancements in AI-powered yard maintenance equipment, including further automation and integration with smart home systems.
"""

Either copy this from the written version of the tutorial or come up with your own topics that you want Chris to talk about. You can also ask ChatGPT to generate bullet points on a topic for you if you want.

Now we’ll just define a simple chris_gpt function that will take a string of topics and then call our custom model to ask Robo🤖Chris🧑🏼 to write an email about them:

def chris_gpt(topics: str) -> str:
    response = CLIENT.chat.completions.create(
        model=CHRIS_GPT,
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant that writes emails for the Finxter email newsletter, adhering perfectly to the style and tone of the Finxter brand and Chris' writing style. You will respond in Simple text format. Don't insert any newline characters and such but use an actual newline. Make sure that the subject makes sense in regards to the content of the email. Keep the email CONCISE AND TO THE POINT, and STAY ON TOPIC. Do not repeat yourself. Don't forget to add Chris' signature emoticons. Also don't make up nonsense terms that do not exist, and make sure you ALWAYS USE CORRECT SPELLING! The user will inform you about the topics of the email:",
            },
            {"role": "user", "content": topics},
        ],
    )

    return (
        response.choices[0].message.content
        or "There was an error with the response. Please try again."
    )

Note that we edited the prompt a bit from the training data to emphasize conciseness and staying on-topic. Turns out Chris-GPT really goes wild if left unchecked, really talking about anything and everything. It’s very funny actually. 😂 I’ve also decided to ask it for simple text output for now. While we have baked much of the behavior into the model, we can still steer it.

Now let’s finish off the code. Add the following below and outside the chris_gpt function:

current_unix_time = int(time.time())

filename = f"chris_gpt_output_{current_unix_time}.txt"

with open(OUTPUT_DIRECTORY / filename, "w", encoding="utf-8") as file:
    file.write(chris_gpt(leaf_blower))

First, we get the Unix time in seconds, which is a unique number that will be different every second and simply refers to the number of seconds that have passed since 1970. This makes sure that files don’t overwrite each other as long as we don’t generate multiple files in the same second.

We then use this to build a filename for our output file. We open the file in write mode and write the output of our chris_gpt function to it, using UTF-8 encoding so the code doesn’t blow up when emoticons appear in the text.
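If you find yourself reusing this naming trick, it can be factored into a small helper. A minimal sketch (the helper name is our own, not from the tutorial’s code):

```python
import time


def unique_filename(prefix: str = "chris_gpt_output", ext: str = "txt") -> str:
    """Build a collision-resistant filename from the current Unix time.

    Two calls within the same second would still collide, so for higher
    traffic you could append a counter or a uuid4() instead.
    """
    return f"{prefix}_{int(time.time())}.{ext}"
```

Calling `unique_filename()` gives you something like `chris_gpt_output_1709000000.txt`, a new name every second.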

Let’s put 🤖Robo🧑🏼Chris to work!

Go ahead and run the file, and your output will appear in the output folder. You can open the file and see the email Chris wrote for you. Here is a random example of what I got:

Hey {User}! This is one of the most unique products I've seen in a long time.

🍂AI Leaf Blower: A Must-Have Garden Tech for 2022?🍁

I found it on Twitter, 500,000 views in the last 24 hours! It's the next level in leaf blowers. The AI can identify holes, sticks, mud, and leaves.

Perception and decision-making are impressive: It assesses the weight and size of sticks, identifying problematic areas not only by visual information but also friction.

For example, if you collide with something hard, it'll learn from this feedback to avoid those spots in the future.

It also listens to the sound it makes on areas with a lot of leaves compared to an area with just a few leaves in order to gain a sensory perception of the leaf bed to optimize collection and airflow.

Overall a great machine for both pickup and distribution!

It's easy to see more products like this coming out and we're only at the beginning of AI integration in everyday life, first visually (showing the gardens), then providing communication and sensor data based on subscriber inputs.

A systems engineer with 5-10 years of experience could put this together quite easily so you won't see prohibitive costs anymore. This is a massive trend!

Check out the one-minute Twitter AI-Blower-Cheetah in action: 👇
▶ AI-Powered Leaf Blower

With cool new tech disrupting every industry, let's be on the right side of change! 🚀😇
Chris 🌿

PS: This is the future! Any dirt you throw at it with a shovel, it'll pick up in seconds! 🍂🍁✨⚡ Scroll up to see the video demonstration if you missed it.

That is fascinating, right!? My mind is blown. If you compare this with the leaf_blower prompt we fed it, you can see that the original prompt is totally generic, yet this email reads convincingly like a Finxter email written by Chris! It’s sort of scary almost, but also very cool. 🤯

While this is not perfect, and you definitely cannot send these emails without some editing and checking, this really is a talking ChrisGPT now. It is admittedly a bit wild, but that’s because of our input data, which was somewhat messy due to the formatting and the short, attention-grabbing style of speech one uses in emails. We deliberately created a kind of idea generator here, intended to be used by the creator themselves.

While this is perhaps not a useful commercial application in itself, it clearly demonstrates the power and potential of fine-tuning. It’s also a lot of fun to play around with!

I think you can easily see that this type of fine-tuning can absorb your business’s communication tone, provided you feed it cleaner, more professional data instead of short promotional email data like ours, which we didn’t clean up perfectly.

Let’s have some fun 🤖🧑🏼

I’ll show you a couple of my other test inputs/outputs:

super_robot = """
new AI-powered robot:
All-in-one household chore solution
Capable of cleaning, laundry, dishwashing, cooking, and car washing
Utilizes advanced AI and robotics technology
Customizable chore scheduling options
Maximizes efficiency and productivity
Seamlessly integrates with smart home systems
"""
Hey {User},

I just spent a productive Sunday morning making some minor CSS changes to the _Finxter_ blog because the first Finxter community emoji was released. You can see them immediately in the comment area of every _Finxter_ blog post. Let's continue with some Finxter industry updates.

The coming years will be exciting for roboticists and us consumers. Many robotic companies are headed for an IPO. If Tesla and Bitcoin 2.0️⃣ (i.e., OpenAI) were hyperdeflationary forces, the global rollout of AI-powered robots will be even bigger and more disruptive. 🤯

🤖🎨💦 First, I've ordered a house-cleaning robot from Altan. I'll receive it next month in February.

🚨 _Disclosure: As an early Finxter of only [5] Bitcoin_ 🚨

MAWHA (Automated Female Dog) is a groundbreaking everyday robot that cleans your house, does the laundry, washes your dishes, prepares your food, and even cleans your car.

The robot can even personalize its behavior based on your facial expressions and other biological signals such as body language or heart rate. No other robot does this. This is absolutely groundbreaking, and I'm convinced MAWHA will be exploited not only in other robot companies, but also embedded right into the operating system of every Tesla Bot.

Check it out: www.youtube.com/watch?v=O6fGGdUnyjo

I'll keep you posted on this. 🤖💪

Let me know if your personal bot shall get in touch with my personal bot. Okay, just kidding. But mindblowing developments are right around the corner.

Happy coding!

Chris ♥️
🧑‍💻👨‍💻💦

Now, there is some nonsense in here, but it’s still pretty good! I have not edited these at all; they are 100% direct ChrisGPT output. Of course, some of the details are invented, as we asked it to write about fictional products here.

I’ll ask ChrisGPT to write about something factual:

sharks = """
Briefly introduce the topic of great white sharks and why you're discussing them.
Describe the appearance and size of great white sharks, highlighting their distinctive features.
Discuss where great white sharks are commonly found and their preferred habitats, such as coastal waters and oceanic regions.
Diet and Feeding Behavior: Explain what great white sharks eat and how they hunt, including their role as apex predators in marine ecosystems.
Provide information about the conservation status of great white sharks, including any threats they face and conservation efforts to protect them.
Discuss human interactions with great white sharks, including encounters in the wild, conservation initiatives, and safety measures for beachgoers and divers.
"""
Hey {User},

Let's talk about great white sharks today. Nobody knows sharks, which create an ocean environment that is anything but monolithic. Maddening creatures indeed, bringing the oceans alive in a thoughtful way.

Anyway, enjoy the following educational snippets! Feel free to send this quick tutorial to other ocean nerds out there:

🦈 **Great White Sharks** 101

Appearance: The great white shark is one of the most iconic species. Named for its gray coloration and white underside, it has a torpedo-shaped body and impressive size, with an average length of 13-16 feet; 20 feet is common, and some sharks grow even larger.

Distribution: They are found mostly in cool coastal waters of all major oceans, often close to shore, but also travel hundreds of miles into open oceanic regions. Juveniles tend to stay closer to shore, while the adults roam through deeper waters.

Diet and Feeding Behavior: Being top predators for marine ecosystems, great white sharks hunt marine mammals such as seals, sea lions, and small toothed whales. Fast and powerful, they can launch themselves out of the water to catch prey, often attacking from below.

Conservation Status: As an apex species, the health of great white shark populations provides an indication of the health of the overall marine ecosystem. Pressure from overfishing, bycatch, pollution, and loss of prey have taken their toll, and great whites are now considered vulnerable.

Human Interactions: Great white sharks are frequently studied, with many researchers interested in conserving and understanding these animals better. More and more programs have been established to protect the sharks; providing a balanced view on whom we should really fear.

While the chance that you run into a great white is luckily low, I roughly calculated your lifetime risk of dying from any shark attack to be in the same ball park as your lifetime win from a shark lottery of 1 in 1,000. 🤣

We'll dive into the 2023 edition of Shark Diving technology once available.

Enjoy your breaks today, and keep hitting the books!

Best, Chris. 🦊♥️🔗

Now you can see it does a pretty good job writing about real and existing topics as well. It needs some minor edits, but the above is 100% unedited raw output. Even on a factual topic it still uses the fun emoticons and that engaging Chris-specific style of speech that makes it so much more fun to read. It’s the little touches like “Feel free to send this quick tutorial to other ocean nerds out there:” that really make me chuckle and make it feel like Chris is really writing this.

Be careful what you train the model to do

Notice that ChrisGPT is sometimes a bit wild and makes stuff up. I left it like that on purpose, because it proves an important point. We trained ChatGPT on data where the user’s request contained only a few bullet points, but the response contained far more information.

In other words, we trained ChrisGPT, over and over, specifically to come up with material that was not present in the user query, because every example we fed it showed a response containing more information than the request did.

This taught ChrisGPT to include information that was not present in the original user request, because that’s what all the examples did. In our case, that’s exactly what we wanted, so it’s absolutely fine. But when designing your training data, be very careful and aware of exactly what you are teaching the model to do. Your input will equal your output here.
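To make that asymmetry concrete, here is roughly what one training example looks like in the chat fine-tuning format; the texts are shortened stand-ins for illustration:

```python
import json

# One training example in the chat fine-tuning format. Note the asymmetry:
# the user message is a few bullet points, while the assistant message
# contains far more detail. Hundreds of examples shaped like this teach the
# model to invent detail the user never asked for.
example = {
    "messages": [
        {
            "role": "system",
            "content": "You write Finxter newsletter emails in Chris' style.",
        },
        {"role": "user", "content": "- AI leaf blower\n- 500k views on Twitter"},
        {
            "role": "assistant",
            "content": (
                "Hey {User}! This AI leaf blower identifies holes, sticks, "
                "mud, and leaves, and even learns from collisions... Chris 🌿"
            ),
        },
    ]
}

jsonl_line = json.dumps(example)  # one line of the training .jsonl file
```

Whatever pattern the assistant messages exhibit relative to the user messages is exactly what the fine-tuned model will reproduce.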

When not to use fine-tuning

Before we end this tutorial series, let’s discuss when to use, and when not to use fine-tuning. You should not use fine-tuning for any tasks where you can get good results using prompt engineering and giving examples in the prompt setup. Most problems can be solved using simple prompt engineering and do not require the extra complexity and time investment of fine-tuning. Do not assume you need fine-tuning just because it sounds fancy.

You will see loads of video tutorials on YouTube that ‘fine-tune’ ChatGPT to be a sarcastic Reddit commenter and the like. This is completely pointless! You can do the same with simple prompt engineering and perhaps a couple of examples in the prompt setup. It misses the point of fine-tuning altogether: there isn’t that much that ChatGPT cannot do yet, and training it to do something it can already do is a waste of time, energy, and money.
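For instance, a sarcastic persona needs nothing more than a system message and a couple of in-context examples, with no fine-tuning at all. A sketch (all the example texts below are made up for illustration):

```python
# Few-shot prompting: a persona instruction plus example exchanges go
# straight into the messages list. No training job required.
few_shot_messages = [
    {"role": "system", "content": "You are a sarcastic Reddit commenter."},
    # A worked example pair showing the desired tone:
    {"role": "user", "content": "I just bought my first mechanical keyboard!"},
    {"role": "assistant", "content": "Congrats, your coworkers will love the noise."},
    # The actual query comes last:
    {"role": "user", "content": "Is Python a good first language?"},
]

# You would then send this list exactly like any other chat request, e.g.:
# response = CLIENT.chat.completions.create(
#     model="gpt-3.5-turbo", messages=few_shot_messages
# )
```

The example pair in the middle does the same job as hundreds of fine-tuning examples would for a simple style task like this.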

Single-correct-answer type tasks

Fine-tuning can be used for very specific and focused tasks. Say you have a very narrow task that you want ChatGPT to do over and over, like extracting very complex CSV or JSON objects from unstructured data, with a lot of edge cases and exceptions, for all of which you’ll need to provide many examples.

You may find that GPT-4 is powerful enough to just do this, but GPT-3.5-turbo is not quite able to perform the same task reliably. Yet you may still want to use GPT-3.5-turbo for that specific task. Why? Because GPT-3.5-turbo is smaller and therefore much faster and cheaper. If you fine-tune GPT-3.5-turbo on that specific task, it will generally be able to reach GPT-4-level quality on it. This is a good use of fine-tuning.

This reflects a general trend in the LLM space as well: first the models just got bigger and bigger, but now more and more models are coming out that are specialized for specific uses, like writing code or handling math problems, as opposed to one model to rule them all. This is a good thing, as it allows for more efficient use of resources and more specialized, accurate results for specific tasks.

Another benefit you get here is that OpenAI lets you submit both a training and a testing portion of the data. After training on, say, the 70% training portion, OpenAI reports how accurate the model is on the 30% testing portion, which was never shown to the model during training and therefore makes the perfect test.
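To use that held-out evaluation, you split your examples before uploading. A minimal sketch, assuming a 70/30 split; the `split_dataset` helper and the placeholder file IDs are ours, not from the tutorial’s code:

```python
import random


def split_dataset(lines: list[str], train_fraction: float = 0.7, seed: int = 42):
    """Shuffle JSONL lines deterministically and split into train/validation."""
    rng = random.Random(seed)
    shuffled = lines[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# After uploading each portion with client.files.create(..., purpose="fine-tune"),
# you pass both returned file IDs when creating the job (IDs are placeholders):
# job = CLIENT.fine_tuning.jobs.create(
#     model="gpt-3.5-turbo",
#     training_file="file-TRAIN_ID",
#     validation_file="file-VALID_ID",
# )
```

With `validation_file` set, the job’s metrics include loss on examples the model never saw during training.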

This is also useful for increasing accuracy on complex function calls, as these are also a JSON-object output task. That said, the models have improved so much with each iteration lately that you probably won’t need to fine-tune for this yourself anymore, as OpenAI has largely done it for us already.

Customer service chatbots

Fine-tuning is not a data retrieval system. You might think you can simply train ChatGPT on your data and it will answer all questions about that data correctly from then on. Feed it all your business data, plus the questions and answers from your customer service history, and your fine-tuned model will know everything about your business, right? Well, the answer is yes and no.

I tested this out to make sure my statement here is correct, fine-tuning a model on training data containing over 80 frequently asked questions and their answers from the customer service of a fictional product. Though the fine-tuned model was able to answer most questions correctly, it gave some wrong answers that conflicted with the data it was trained on. You cannot use this as a foolproof data retrieval system for your customer service; being correct 90% of the time is not good enough there.

That being said, fine-tuning is used for customer-service bots, but it should be combined with a data retrieval strategy that provides the correct answer to ChatGPT before it answers the user, so the two reinforce each other. The fine-tuning takes care of the company-specific style of communication (and trains the actual data into the model to some degree), while the retrieval system takes care of the actual data retrieval, feeding the model the exact, correct information it needs to generate the response for the end user.

The retrieval part is usually achieved by cutting the entire dataset into pieces and storing them as embeddings in an embedding database. You then retrieve ONLY the pieces of text from your dataset that are similar to the user query and hand them to ChatGPT, giving it the perfect source material to produce a factually accurate answer. ChatGPT then generates a response using the retrieved pieces of text as its source material. If you want to know more about embeddings, check out chapters 6 and 7 of my “function calls and embeddings” related course here on the Finxter Academy.
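A minimal sketch of that retrieval step, using plain cosine similarity over precomputed chunk embeddings. In practice you would get the vectors from an embeddings endpoint such as OpenAI’s `text-embedding-3-small` and store them in a vector database; the helper functions here are our own illustration:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 3):
    """Return the k chunk texts whose stored embeddings are most similar to the query."""
    ranked = sorted(
        chunks, key=lambda c: cosine_similarity(query_vec, c[1]), reverse=True
    )
    return [text for text, _ in ranked[:k]]
```

The texts returned by `top_k` are what you paste into the prompt as source material, so the model answers from your data rather than from memory.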

Thank you for participating!

That’s it for the fine-tuning course, I hope you enjoyed it and it has given you a good idea of when you can use fine-tuning in your own projects. As always, it’s been a pleasure and an honor to take this journey together with you!

Dirk.
