<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Speech Recognition and Generation Archives - Be on the Right Side of Change</title>
	<atom:link href="https://blog.finxter.com/category/speech-recognition-and-generation/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.finxter.com/category/speech-recognition-and-generation/</link>
	<description></description>
	<lastBuildDate>Thu, 25 Jan 2024 19:57:22 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.finxter.com/wp-content/uploads/2020/08/cropped-cropped-finxter_nobackground-32x32.png</url>
	<title>Speech Recognition and Generation Archives - Be on the Right Side of Change</title>
	<link>https://blog.finxter.com/category/speech-recognition-and-generation/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>OpenAI Whisper &#8211; Speeding Up or Outsourcing the Processing</title>
		<link>https://blog.finxter.com/openai-whisper-speeding-up-or-outsourcing-the-processing/</link>
		
		<dc:creator><![CDATA[Dirk van Meerveld]]></dc:creator>
		<pubDate>Thu, 25 Jan 2024 19:57:21 +0000</pubDate>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Large Language Model (LLM)]]></category>
		<category><![CDATA[OpenAI]]></category>
		<category><![CDATA[Prompt Engineering]]></category>
		<category><![CDATA[Speech Recognition and Generation]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1654500</guid>

					<description><![CDATA[<p>🎙️ Course: This article is based on a lesson from our Finxter Academy Course Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper. Check it out for video lessons, GitHub, and a downloadable PDF course certificate with your name on it! Hi and welcome back! In this part, we&#8217;re going to look at some ... <a title="OpenAI Whisper &#8211; Speeding Up or Outsourcing the Processing" class="read-more" href="https://blog.finxter.com/openai-whisper-speeding-up-or-outsourcing-the-processing/" aria-label="Read more about OpenAI Whisper &#8211; Speeding Up or Outsourcing the Processing">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/openai-whisper-speeding-up-or-outsourcing-the-processing/">OpenAI Whisper &#8211; Speeding Up or Outsourcing the Processing</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f399.png" alt="🎙" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Course</strong>: This article is based on a lesson from our <strong>Finxter Academy Course</strong> <a href="https://academy.finxter.com/university/openai-whisper/"><em>Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper</em></a>. Check it out for video lessons, GitHub, and a downloadable PDF course certificate with your name on it!</p>



<p>Hi and welcome back! In this part, we&#8217;re going to look at some alternatives that either speed up the processing or outsource it to OpenAI&#8217;s servers altogether. First, we&#8217;ll look at <code>faster-whisper</code> at a basic level. We&#8217;ll only cover it briefly before moving on to the web API version for the rest of this part, so if you&#8217;re not sure whether you want to use it, you can just read along for now and decide later whether to install it.</p>



<p>So what is <code>faster-whisper</code>? Faster-Whisper is a quicker version of OpenAI&#8217;s Whisper speech-to-text model. Because OpenAI released the <code>whisper</code> model as open source, others have naturally been able to build on it and optimize it further. It uses CTranslate2, a fast inference engine for Transformer models, and is up to 4 times faster than the original openai/whisper while using considerably less memory and claiming to maintain the same accuracy. You can find the GitHub repository <a href="https://github.com/SYSTRAN/faster-whisper">here</a>.</p>



<p>You can use this for the same apps we have built so far, just as a faster version of the Whisper model, so we won&#8217;t be building a new app specifically for it, as that would get repetitive and I don&#8217;t want to waste your time! You just need some syntax changes to make your app work with <code>faster-whisper</code> instead of the original whisper model. So we&#8217;ll take a look at the basics of faster-whisper, let you decide if you want to use or implement it, and then move on to the web-API version.</p>



<h2 class="wp-block-heading">Installing faster-whisper</h2>



<p>Note: If you do not plan on using faster-whisper, or are not quite sure, there is no point in going through the install procedures; you can skip ahead to the web-API version, or just read along and decide later if you want to use it.</p>



<p>Basically, to install faster-whisper you just have to run the following command in your terminal:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pip install faster-whisper</pre>



<p>And to support GPU execution you need to have the appropriate libraries for CUDA installed, namely <a href="https://developer.nvidia.com/cublas">cuBLAS</a> and <a href="https://developer.nvidia.com/cudnn">cuDNN</a>. This can be the slightly trickier part of the install, and again I cannot really give you platform-specific instructions or help you with the specific troubleshooting if you run into challenges. As always in software development, if you&#8217;re lucky you won&#8217;t have any problems, and if you&#8217;re not, you spend some time on Google and Stack Overflow to find the solution. If you just want to run faster-whisper on your CPU, which will of course be slower but may not be a big deal for small-scale development on your own machine, you can skip the <code>cuBLAS</code> and <code>cuDNN</code> installs.</p>
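

<p>If you want a quick way to check whether a CUDA GPU is visible before loading a model, here is a minimal sketch. I&#8217;m assuming the <code>ctranslate2</code> package that faster-whisper pulls in as a dependency; this check is optional and not used in the rest of this part:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import ctranslate2

# A count of 0 means no CUDA GPU is visible, so you'd run on CPU instead.
print(ctranslate2.get_cuda_device_count())</pre>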



<h2 class="wp-block-heading">Using faster-whisper</h2>



<p>So let&#8217;s give it a spin to see how it works! First create a new file in your project root directory called <code>4_faster_whisper.py</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />utils
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />2_whisper_pods.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />3_subtitle_master.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />4_faster_whisper.py   (<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2728.png" alt="✨" class="wp-smiley" style="height: 1em; max-height: 1em;" />new file)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env</pre>



<p>And inside let&#8217;s start with our imports:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from faster_whisper import WhisperModel
from settings import TEST_AUDIO_DIR

model_size = "small"</pre>



<p>We import the <code>WhisperModel</code> class from the <code>faster_whisper</code> package, and the <code>TEST_AUDIO_DIR</code> variable from our <code>settings.py</code> file, and then set a string variable to the value <code>small</code>. Like whisper, faster-whisper also comes with different sizes of models. Using the same naming convention we have <code>tiny.en</code>, <code>base.en</code>, <code>small.en</code>, and <code>medium.en</code> as our English-only models. For the multi-language models, we can choose between <code>tiny</code>, <code>base</code>, <code>small</code>, <code>medium</code>, or one of several versions of the full-size model, namely: <code>large-v1</code>, <code>large-v2</code>, <code>large-v3</code>, or <code>large</code>.</p>



<p>Next, we&#8217;ll create a new instance of the <code>WhisperModel</code> class, picking only one of the two options below:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">model = WhisperModel(model_size, device="cpu", compute_type="int8")
# Choose only one of these, depending on if you're running on CPU or GPU (cuda). (I'll be using the second option)
model = WhisperModel(model_size, device="cuda", compute_type="float16")</pre>



<p>More options are available, like running on <code>cuda</code> using <code>int8_float16</code> or even using <code>float32</code>, see <a href="https://opennmt.net/CTranslate2/quantization.html">here</a> for more details.</p>
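

<p>By the way, if you want a script that works on machines with or without a GPU, one possible pattern is to try the <code>cuda</code> device first and fall back to CPU. This is just a sketch of the idea under that assumption, not something we use in the rest of this part:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from faster_whisper import WhisperModel

model_size = "small"

try:
    # Try the GPU first; constructing the model raises if CUDA isn't usable.
    model = WhisperModel(model_size, device="cuda", compute_type="float16")
except Exception:
    # Fall back to CPU with int8 quantization.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")</pre>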



<p>The <code>.transcribe</code> method for faster-whisper is slightly different:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">segments, info = model.transcribe(
    str(TEST_AUDIO_DIR / "dutch_long_repeat_file.mp3"),
    beam_size=5,
)</pre>



<p>As you can see, we get two return values when calling <code>model.transcribe</code> instead of the single dictionary output we had before. The first is <code>segments</code>, which contains the transcription. The second is a <code>NamedTuple</code> (a <code>Tuple</code> with named fields) which allows us to access information like the language (<code>info.language</code>), the language probability (<code>info.language_probability</code>), etc. So let&#8217;s add some print statements to print the information and then the transcription itself to the console:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">print(f"Detected language '{info.language}' with probability {info.language_probability}")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")</pre>



<p>The first print statement just has us access some of the properties of the <code>info</code> object we discussed. The second print statement loops over the list of <code>segments</code>, and for each <code>segment</code> it will print the segment&#8217;s start time, end time, and the text of the segment itself. The <code>:.2f</code> is a formatting string that tells Python to print the number with two decimal places, for example: <code>1.23</code> instead of <code>1.23456789</code>.</p>



<p>One interesting thing to note here, though, is that <code>segments</code> is not actually a list. It is a generator, which is a different type of iterable. What this means is that the segments will be generated when you request them and not beforehand. In other words, the transcription only begins when we iterate over <code>segments</code> and not before. Calling <code>.transcribe()</code> on our model did not start the transcription the way vanilla whisper did. You can either loop over the <code>segments</code> as we did above, or you can convert the generator to a list with <code>list(segments)</code>.</p>
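

<p>To make this lazy behavior concrete, here is a small sketch. Right after the <code>.transcribe()</code> call no work has been done yet; forcing the generator with <code>list()</code> runs the whole transcription up front:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">segments, info = model.transcribe(
    str(TEST_AUDIO_DIR / "dutch_long_repeat_file.mp3"),
    beam_size=5,
)
# Nothing has been transcribed yet at this point.

all_segments = list(segments)  # This line triggers the full transcription.
full_text = " ".join(segment.text.strip() for segment in all_segments)
print(full_text)</pre>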



<p>One of the nice things about this generator is that we can very easily see the live transcription and print it to the console while it is still generating, which is exactly what this code will do. So let&#8217;s run it and see what happens:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">Estimating duration from bitrate, this may be inaccurate
Detected language 'nl' with probability 0.931703
[0.00s -> 3.04s]  Hoi allemaal, dit is weer een testbestandje.
[3.04s -> 6.88s]  Deze keer om te testen of de Nederlandse taal goed herkent gaat worden.
[6.88s -> 12.68s]  Hierna kunnen we ook proberen deze tekst te laten vertalen naar het Engels om te zien hoe goed dat gaat.
[12.68s -> 13.88s]  Ik ben benieuwd.
[13.88s -> 16.84s]  Hoi allemaal, dit is weer een testbestandje.
[16.84s -> 20.72s]  Deze keer om te testen of de Nederlandse taal goed herkent gaat worden.
[20.72s -> 26.48s]  Hierna kunnen we ook proberen deze tekst te laten vertalen naar het Engels om te zien hoe goed dat gaat.
[26.48s -> 27.68s]  Ik ben benieuwd.
[27.68s -> 30.72s]  Hoi allemaal, dit is weer een testbestandje.
[30.72s -> 34.60s]  Deze keer om te testen of de Nederlandse taal goed herkent gaat worden.
[34.60s -> 40.36s]  Hierna kunnen we ook proberen deze tekst te laten vertalen naar het Engels om te zien hoe goed dat gaat.
[40.36s -> 41.52s]  Ik ben benieuwd.</pre>



<p>You can see the output streaming to the console as the model transcribes. Unless you&#8217;re running on CPU, you will also notice a pretty good speed. Now, as you&#8217;re probably not Dutch, I&#8217;ll just tell you the transcription above is perfect except for the one small (<code>herkent/herkend</code>) issue we had before, but as you know this can be fixed by loading a larger model size.</p>



<p>Play around with any audio file you want and see what model size you need. If you use English files, pick a <code>.en</code> model for greater efficiency. Also be aware that you can pass options into the <code>.transcribe</code> method much like with the vanilla whisper model, for instance:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">segments, info = model.transcribe(
    str(TEST_AUDIO_DIR / "dutch_long_repeat_file.mp3"),
    beam_size=5,
    word_timestamps=True,  # include this line to get word-level timestamps
    # without_timestamps=True,  # uncomment this line to get rid of timestamps and just transcribe
)</pre>



<p>In conclusion, faster-whisper is a nice optimization to look into if you&#8217;re considering deploying this model in a production application somewhere. There are also other optimized versions of the whisper model out there that you can check out, like <a href="https://github.com/huggingface/distil-whisper">distil-whisper</a>. Play around and see which gives you the best trade-offs between speed and accuracy. I&#8217;ll leave the rest up to you as we move on from faster-whisper to check out the web-API version.</p>



<h2 class="wp-block-heading">Web-API version</h2>



<p>Another option we have is to simply not deploy the model anywhere, but to outsource the work to OpenAI&#8217;s fast servers instead. This is kind of like making a ChatGPT call, except we request a transcription instead of a chat completion. The OpenAI servers are highly optimized for machine-learning workloads (obviously) and, as you&#8217;ll see, they are therefore quite fast!</p>



<p>So let&#8217;s take a look at the pricing first. The cost for using the Whisper API is $0.006 per minute transcribed, rounded to the nearest second. This means a 20-minute video would cost you $0.12. This is a good solution if you don&#8217;t want to deploy the model yourself, perhaps your application will only be used occasionally and it&#8217;s simply not worth it to invest that much into having a model running somewhere. For a high-use application dealing with longer files and many users, this is not the way to go though.</p>
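

<p>The math is simple enough to put into a tiny helper. This is just an illustration of the $0.006/minute price above; the function name and rounding are my own:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def estimate_whisper_cost(duration_seconds: float, price_per_minute: float = 0.006) -> float:
    """Estimate the Whisper API cost, billed per second at $0.006 per minute."""
    return round(duration_seconds) / 60 * price_per_minute

print(f"${estimate_whisper_cost(20 * 60):.2f}")  # A 20-minute video: $0.12</pre>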



<p>So let&#8217;s take a quick look at how this would work practically, by building one last quick application, but this time using the web API. Our application will take any video in any language as input and will return a short quiz with questions about the video. First, create a new file in your <code>utils</code> folder named <code>openai_api.py</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />utils
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />command.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />openai_api.py   (<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2728.png" alt="✨" class="wp-smiley" style="height: 1em; max-height: 1em;" />new file)
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />podcast.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />subtitles.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />video.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />2_whisper_pods.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />3_subtitle_master.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />4_faster_whisper.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env</pre>



<p>Inside <code>openai_api.py</code>, let&#8217;s start with our imports and some basic setup:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import typing
from pathlib import Path

from decouple import config
from openai import OpenAI


CLIENT = OpenAI(api_key=str(config("OPENAI_API_KEY")))
MODEL = "whisper-1"

ResponseFormat = typing.Literal["text", "srt", "vtt"]</pre>



<p>We&#8217;ll use <code>typing</code> to define our allowed response formats. The rest are all imports we have used before: <code>config</code>, as we&#8217;ll need to load our API key, and <code>OpenAI</code> to call the APIs for Whisper and ChatGPT. We create our <code>CLIENT</code> just like last time and save the <code>MODEL</code> name in a string variable; <code>whisper-1</code> is the only option for the Whisper API for now.</p>



<p>Finally, we define a type alias named <code>ResponseFormat</code> which is a <code>Literal</code> type, which means it can only be one of the three strings we have defined, <code>text</code>, <code>srt</code>, or <code>vtt</code>. We can use this as a type hint later to indicate that if a particular variable is of type <code>ResponseFormat</code> then it should have one of these three values and nothing else. (<code>json</code> and <code>verbose_json</code> are also possible if you prefer JSON object output, but we will be skipping them as they are useless for our purposes.)</p>
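

<p>To illustrate how the alias behaves, here is a quick sketch. At runtime it is just a plain string, but a static type checker such as mypy or Pylance will flag values outside the <code>Literal</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">fmt: ResponseFormat = "srt"  # Fine, one of the three allowed values.

# A type checker would reject the next line, since "json" is not in the Literal:
# fmt: ResponseFormat = "json"</pre>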



<p>Now we&#8217;ll define our transcription utility function:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def transcribe(
    file: Path,
    language: str | None = None,
    translate: bool = False,
    response_format: ResponseFormat = "text",
) -> str:

    print("Transcribing file...")
    options = {
        "file": file,
        "model": MODEL,
        "response_format": response_format,
    }

    if translate:
        transcript = CLIENT.audio.translations.create(**options)
    else:
        if language:
            options["language"] = language
        transcript = CLIENT.audio.transcriptions.create(**options)

    if not isinstance(transcript, str):
        raise TypeError(
            f"Expected a string value to be returned, but got {type(transcript)} instead."
        )
    print(f"Transcription successful:\n{transcript[:100]}...")

    return transcript</pre>



<p>We define a function called <code>transcribe</code> which takes a <code>file</code> of type <code>Path</code> and a <code>language</code> of type <code>str</code> or <code>None</code>, defaulting to <code>None</code>, in which case the API will try to detect the language automatically. We also have a <code>translate</code> boolean which defaults to <code>False</code>, and a <code>response_format</code> which has to be of type <code>ResponseFormat</code>, so one of the three values we defined in the type alias, defaulting to <code>text</code>. The function returns a string.</p>



<p>We print a message to indicate the transcription is starting and then create a dictionary named <code>options</code> in which we pass some options that are needed for both a translation and a transcription call, the shared options if you will. These are the <code>file</code>, <code>model</code>, and <code>response_format</code>. If the user requests a translation we call the <code>CLIENT.audio.translations.create</code> method, passing in the <code>**options</code> dictionary as arguments as is. If <code>translate</code> is <code>False</code>, it must be a transcription. For transcriptions, we can add the <code>language</code> key to the options dictionary to specify the language, but if the user didn&#8217;t provide it we can leave it out and it will just take a bit longer to do the auto-detection. This time we call the <code>CLIENT.audio.transcriptions.create</code> method, again passing in the <code>**options</code> dictionary, which now optionally contains the <code>language</code> key.</p>



<p>Finally, we check if the <code>transcript</code> is a string, and if not we raise a <code>TypeError</code> to indicate something went wrong, just to make sure the user is not requesting JSON from this endpoint, which is possible and would crash the rest of our code. Otherwise, we print a message to indicate the transcription was successful and return the <code>transcript</code>.</p>
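

<p>If you want to give the function a quick spin on its own, a hypothetical test run could look like this, reusing the Dutch test file from earlier (any audio file you have lying around will do):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">if __name__ == "__main__":
    from settings import TEST_AUDIO_DIR

    # Transcribe in the original language and print the plain-text result.
    result = transcribe(TEST_AUDIO_DIR / "dutch_long_repeat_file.mp3")
    print(result)</pre>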



<h2 class="wp-block-heading">Video to Quiz</h2>



<p>As we&#8217;re going to be building a video-to-quiz app, we need one more utility function inside this <code>openai_api.py</code> file, which will take a transcript and generate some questions for us. Continue below the <code>transcribe</code> function:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">PROMPT_SETUP = """You are a text-to-quiz app. The user will provide you a video transcription in textual format. You will generate a list of questions for the user to answer about this video. Depending on the length of the transcription, stick to a maximum of 5 questions. All questions should be solely about the video transcription content provided by the user and should be answerable by reading the transcription. Do not provide the answers, but only the questions. The transcription the user provides is based on a video, and may include timestamps, please ignore these timestamps and just treat it as one single transcription containing all the content in the video.
List and number each item on a separate line.
"""

from tenacity import retry, stop_after_attempt, stop_after_delay</pre>



<p>First, we define a constant to hold the prompt setup instruction for ChatGPT. Just go ahead and copy mine. It&#8217;s a fairly basic setup that asks for questions related to the video so we can make a quiz tailor-made for the input video. We also import <code>retry</code>, <code>stop_after_attempt</code>, and <code>stop_after_delay</code> from the <code>tenacity</code> package. (Go ahead and move the tenacity import line to the top of your file with the other imports instead of leaving it here in the middle.) We can use these to make our code a bit more robust when calling APIs or taking actions that do not have a 100% success rate. It&#8217;s fairly easy to use, and I just want to show you that this tool is out there; you&#8217;ll see how it works in a second.</p>



<p>Let&#8217;s code up the function:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def text_to_quiz(text: str) -> str:
    print("Converting text to quiz...")
    messages = [
        {"role": "system", "content": PROMPT_SETUP},
        {"role": "user", "content": text},
    ]
    result = CLIENT.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=messages,
    )
    content = result.choices[0].message.content
    if content is None:  # Just a quick sanity check
        raise ValueError("There was an error while trying to generate the quiz.")
    print("Text to quiz conversion completed.")
    return content</pre>



<p>Our function takes a string, which is the transcription, and returns a string as output. We create a list of messages, the first being the system message holding our <code>PROMPT_SETUP</code>, and the second being the user message with the transcription as its content. We then call the <code>CLIENT.chat.completions.create</code> method, passing in the <code>model</code> and <code>messages</code> as arguments. We&#8217;ll use <code>gpt-3.5-turbo-1106</code>, which is the newest gpt-3.5 model out there and is frankly good enough. You can use gpt-4, but make sure you consider the cost; it is considerably more expensive and not really needed for this use case. If you&#8217;re worried about the lower maximum input size, or &#8216;context window&#8217;, of gpt-3.5, know that its 16k-token context can easily handle long video transcriptions; most are not as long as you might think.</p>



<p>We then access the <code>content</code> of the first choice&#8217;s message in the <code>result</code> object, which should hold our quiz. We do a quick sanity check to make sure we received a valid response, and then print a message to indicate the conversion was successful and return the <code>content</code>.</p>
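

<p>As an aside, if you ever want to verify that a transcription will fit within the context window before sending it, you can count the tokens first. This sketch assumes you have the <code>tiktoken</code> package installed; it is not part of our app:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import tiktoken

transcription_text = "..."  # Placeholder: your transcript string goes here.

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
token_count = len(encoding.encode(transcription_text))
print(f"{token_count} tokens out of the 16k context budget")</pre>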



<p>So that&#8217;s pretty simple, right? But what if we get no content back? Do we really want to just raise an error and give up immediately? Let&#8217;s use the tenacity library so we can try again in case of a failure. The only thing we have to change is to add the <code>@retry</code> decorator before our function; everything except the first line stays the same:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">@retry(stop=stop_after_attempt(3) | stop_after_delay(60))
def text_to_quiz(text: str) -> str:
    print("Converting text to quiz...")
    messages = [
        {"role": "system", "content": PROMPT_SETUP},
        {"role": "user", "content": text},
    ]
    result = CLIENT.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=messages,
    )
    content = result.choices[0].message.content
    if content is None:  # Just a quick sanity check
        raise ValueError("There was an error while trying to generate the quiz.")
    print("Text to quiz conversion completed.")
    return content</pre>



<p>And just like that, our function is set up to try up to three times or (<code>|</code>) for a max of 60 seconds, just in case the API call fails for some reason. Notice how easy it is to use the Tenacity library. This is not required but it&#8217;s a nice way to make your code more robust just in case.</p>
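

<p>If you want to see the retry behavior in isolation, here is a tiny standalone sketch; the function and counter are made up purely for demonstration. The first two calls raise, the third succeeds, and tenacity hides the failures from the caller:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from tenacity import retry, stop_after_attempt

attempts = {"count": 0}

@retry(stop=stop_after_attempt(3))
def sometimes_fails() -> str:
    attempts["count"] += 1
    if attempts["count"] >= 3:
        return f"succeeded on attempt {attempts['count']}"
    raise ValueError("transient failure")

print(sometimes_fails())  # Prints: succeeded on attempt 3</pre>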



<h2 class="wp-block-heading">Putting it all together</h2>



<p>That&#8217;s our <code>openai_api.py</code> file done! Go ahead and save and close it. Now let&#8217;s create a new file in our project root directory called <code>4_vid_to_quiz.py</code> to put it all together:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />utils
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />command.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />openai_api.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />podcast.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />subtitles.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />video.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />2_whisper_pods.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />3_subtitle_master.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />4_faster_whisper.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />4_vid_to_quiz.py   (<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2728.png" alt="✨" class="wp-smiley" style="height: 1em; max-height: 1em;" />new file)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env</pre>



<p>Inside <code>4_vid_to_quiz.py</code> let&#8217;s start with our imports:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import os
import uuid
from pathlib import Path

import gradio as gr

from settings import BASE_DIR, OUTPUT_TEMP_DIR, STYLES_DIR
from utils import openai_api, video


API_UPLOAD_LIMIT_BYTES = 26214400  # 25mb</pre>



<p>We will use <code>os</code> to check the size of the file we will upload, as there is a size limit to the API. We have some imports you&#8217;ve seen before, and some of our directories from the <code>settings</code> file plus our <code>openai_api</code> and <code>video</code> utilities. We also define a constant <code>API_UPLOAD_LIMIT_BYTES</code> which is the maximum size of the file we can upload to the API, which is 25 MB.</p>



<p>Let&#8217;s start with a quick function to check if the file is not too big:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def check_upload_size(input_file: str) -> None:
    """Check the video file size is within the API upload limit."""
    input_file_size = os.path.getsize(input_file)
    if input_file_size > API_UPLOAD_LIMIT_BYTES:
        raise ValueError(
            f"File size of {input_file_size} bytes ({input_file_size / 1024 / 1024:.2f} MB) exceeds the API upload limit of {API_UPLOAD_LIMIT_BYTES} bytes ({API_UPLOAD_LIMIT_BYTES / 1024 / 1024:.2f} MB). Please use a shorter video or lower the audio quality settings."
        )</pre>



<p>We take an input file path as a string, use <code>os.path.getsize</code> to get the size of the file in bytes, and then check whether it is larger than our <code>API_UPLOAD_LIMIT_BYTES</code>. If it is, we raise a <code>ValueError</code> to indicate the file is too large; the error message includes both the file size and the API upload limit. That&#8217;s all there is to this function.</p>



<p>Let&#8217;s move on to our <code>main</code> function:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def main(input_video: str) -> str:
    """Takes a video file as string path and returns a quiz as string."""
    unique_id = uuid.uuid4()

    mp3_file = video.to_mp3(
        input_video,
        log_directory=BASE_DIR,
        output_path=OUTPUT_TEMP_DIR / f"{unique_id}.mp3",
        mono=True,
    )

    check_upload_size(mp3_file)
    transcription = openai_api.transcribe(
        Path(mp3_file), language="en", translate=False, response_format="text"
    )

    quiz = openai_api.text_to_quiz(transcription)
    return quiz</pre>



<p>This is the function the gradio button will call when clicked. It takes an <code>input_video</code> string path as input and returns the quiz in string format. We don&#8217;t really care about the name of the mp3 file we&#8217;ll extract from the video here, so we just use a <code>uuid</code> to make it unique. We then use our <code>video.to_mp3</code> utility function from the previous part to extract the audio from the video.</p>



<p>We pass in the <code>input_video</code> as the video file, our project root directory as the <code>log_directory</code>, and our <code>output_path</code> is the <code>OUTPUT_TEMP_DIR</code> with the <code>uuid</code> and <code>.mp3</code> extension pasted on. Finally, this is the time to use the <code>mono</code> option we built into the <code>to_mp3</code> function but didn&#8217;t use last time. So far the size of our files has not been that important, but now that we have a web API it suddenly becomes relevant.</p>



<p>Whisper down-mixes audio to mono before processing anyway, and the API has an upload limit of roughly 25MB per transcription request. So we can save a lot of space by dropping the channels to 1, from stereo to mono audio, which allows us to make much longer requests as we can drastically lower the bitrate with only 1 audio channel.</p>



<p>Sending stereo audio at 192kbps quality would exceed the file limit after about 18 minutes of audio. We more than halved the bitrate to 80kbps, which is still considered decent quality for mono mp3 files and allows us to transcribe much longer files. You can also play with the other audio quality settings or lower the bitrate even further, to 64kbps for mono, if you want to go further still.</p>
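

<p>To see where these numbers come from, here is the back-of-the-envelope calculation as a quick sketch; the helper is mine, while the 25 MB limit and the bitrates are from above:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">API_UPLOAD_LIMIT_BYTES = 26214400  # 25mb

def max_minutes(kbps: int) -> float:
    """How many minutes of mp3 audio fit in the upload limit at a given bitrate."""
    bytes_per_minute = kbps * 1000 / 8 * 60
    return API_UPLOAD_LIMIT_BYTES / bytes_per_minute

print(f"{max_minutes(192):.0f} minutes at 192kbps")  # ~18 minutes
print(f"{max_minutes(80):.0f} minutes at 80kbps")    # ~44 minutes</pre>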



<p>After that, we run our <code>check_upload_size</code> check to make sure the file is not too large, and then we call our <code>openai_api.transcribe</code> function, passing in the <code>mp3_file</code> as the <code>file</code>, <code>language="en"</code> as the language, <code>translate=False</code> as we don&#8217;t want to translate, and <code>response_format="text"</code> as we want the transcription in text format. We then call our <code>openai_api.text_to_quiz</code> function, passing in the <code>transcription</code> as the <code>text</code> and returning the resulting <code>quiz</code>.</p>



<h2 class="wp-block-heading">Gradio Interface</h2>



<p>Finally, we&#8217;ll create our gradio interface:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">if __name__ == "__main__":
    block = gr.Blocks(
        css=str(STYLES_DIR / "vid2quiz.css"),
        theme=gr.themes.Soft(primary_hue=gr.themes.colors.yellow),
    )

    with block:
        with gr.Group():
            gr.HTML(
                f"""
                &lt;div class="header">
                &lt;img src="https://i.imgur.com/oEtZKEh.png" referrerpolicy="no-referrer" class="header-img" />
                &lt;/div>
                """
            )
            with gr.Row():
                input_video = gr.Video(
                    label="Input Video", sources=["upload"], mirror_webcam=False
                )
                output_quiz_text = gr.Textbox(label="Quiz")
            with gr.Row():
                button_text = "<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4dd.png" alt="📝" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Make a quiz about this video! <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4dd.png" alt="📝" class="wp-smiley" style="height: 1em; max-height: 1em;" />"
                btn = gr.Button(value=button_text, elem_classes=["button-row"])

            btn.click(main, inputs=[input_video], outputs=[output_quiz_text])

    block.launch(debug=True)</pre>



<p>All of this will be familiar by now; I just used a different CSS file, which we&#8217;ll have to create, and a slightly different <code>primary_hue</code> for the theme than last time. The &#8216;imgur&#8217; image link has changed as well to give you a new header logo, and below that we just take an input video and have an output <code>Textbox</code>. Our button has a CSS class of <code>button-row</code> again so we can style it, and clicking the button runs the <code>main</code> function with the input video, sending the output to the output textbox.</p>



<p>Let&#8217;s add the CSS file to our <code>styles</code> folder:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />subtitle_master.css
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />vid2quiz.css      (<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2728.png" alt="✨" class="wp-smiley" style="height: 1em; max-height: 1em;" />new file)
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />whisper_pods.css
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />utils
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />command.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />openai_api.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />podcast.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />subtitles.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />video.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />2_whisper_pods.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />3_subtitle_master.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />4_faster_whisper.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />4_vid_to_quiz.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env</pre>



<p>And inside <code>vid2quiz.css</code> let&#8217;s add the following:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">.header {
  display: flex;
  justify-content: center;
  align-items: center;
  padding: 2em 8em;
}

.header-img {
  max-width: 50%;
}

.header,
.button-row {
  background-color: #0c1d36;
}</pre>



<p>We use <code>flex</code> to center the header image vertically and horizontally and apply the usual padding. We give the <code>header-img</code> class a <code>max-width</code> of 50% so it doesn&#8217;t take up the entire width of the screen. Finally, we give the <code>header</code> and <code>button-row</code> classes a background color of <code>#0c1d36</code> which is a dark blue color.</p>



<p>Ok, you know the drill, let&#8217;s run it and see what happens!</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img decoding="async" src="https://academy.finxter.com/wp-content/uploads/2024/01/4_vid2quiz_interface-1024x531.png" alt="" class="wp-image-4066"/></figure>
</div>


<p>Ok, looking good, so let&#8217;s upload a video and then request a quiz about it. I used a random video from YouTube, namely <a href="https://www.youtube.com/watch?v=fb-58KobeFU">Hot Dr Pepper from the 1960s</a>, just because it showed up when I opened the YouTube website. Let&#8217;s see how it does:</p>



<figure class="wp-block-image size-large"><img decoding="async" src="https://academy.finxter.com/wp-content/uploads/2024/01/4_vid2quiz_output-1024x587.png" alt="" class="wp-image-4065"/></figure>



<p>Perfect, exactly what we wanted, and this was all powered by the OpenAI API! You&#8217;ll also notice it was probably reasonably fast, considering it had to convert the whole video and then transcribe it and generate a quiz.</p>



<p>One important limitation of the app in this particular form is that, because of the upload limit, it can only handle videos up to roughly 44 minutes in length with the 80kbps mono settings (80kbps works out to 600,000 bytes per minute, and 26,214,400 / 600,000 is about 44). If you want to handle longer videos you could split them up and put the transcripts back together, but honestly, if you&#8217;re going to be handling files of that length you&#8217;re probably better off deploying the model yourself to save cost, as the API is billed per minute of audio.</p>



<p>A fun idea: you can also use the <code>translate</code> option in our <code>openai_api.transcribe</code> function to take foreign-language videos as input and get English questions about them as output. This could be cool for a foreign-language learning app or test.</p>



<p>So that&#8217;s it for the whisper course. I hope you enjoyed it and now have a good idea of how to use Whisper, what you can use it for, and the various deployment options. The next step is up to you and limited only by your imagination!</p>



<p>As always, it was an honor and a pleasure to take this journey together, and I hope to see you next time!</p>



<h2 class="wp-block-heading">Full Course: OpenAI Whisper &#8211; Building Cutting-Edge Python Apps with OpenAI Whisper</h2>



<p>Check out our full OpenAI Whisper course with video lessons, easy explanations, GitHub, and a downloadable PDF certificate to prove your speech processing skills to your employer and freelancing clients:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://academy.finxter.com/university/openai-whisper/"><img fetchpriority="high" decoding="async" width="908" height="257" src="https://blog.finxter.com/wp-content/uploads/2024/01/image-154.png" alt="" class="wp-image-1654506" srcset="https://blog.finxter.com/wp-content/uploads/2024/01/image-154.png 908w, https://blog.finxter.com/wp-content/uploads/2024/01/image-154-300x85.png 300w, https://blog.finxter.com/wp-content/uploads/2024/01/image-154-768x217.png 768w" sizes="(max-width: 908px) 100vw, 908px" /></a></figure>
</div>


<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f449.png" alt="👉" class="wp-smiley" style="height: 1em; max-height: 1em;" /> [<strong>Academy</strong>] <a href="https://academy.finxter.com/university/openai-whisper/" data-type="link" data-id="https://academy.finxter.com/university/openai-whisper/">Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper</a></p>
<p>The post <a href="https://blog.finxter.com/openai-whisper-speeding-up-or-outsourcing-the-processing/">OpenAI Whisper &#8211; Speeding Up or Outsourcing the Processing</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>OpenAI Whisper Example &#8211; Building a Subtitle Generator &#038; Embedder</title>
		<link>https://blog.finxter.com/openai-whisper-example-building-a-subtitle-generator-embedder/</link>
		
		<dc:creator><![CDATA[Dirk van Meerveld]]></dc:creator>
		<pubDate>Thu, 25 Jan 2024 19:57:05 +0000</pubDate>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Large Language Model (LLM)]]></category>
		<category><![CDATA[OpenAI]]></category>
		<category><![CDATA[Speech Recognition and Generation]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1654504</guid>

					<description><![CDATA[<p>🎙️ Course: This article is based on a lesson from our Finxter Academy Course Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper. Check it out for video lessons, GitHub, and a downloadable PDF course certificate with your name on it! Welcome back to part 3, where we&#8217;ll use Whisper to build another really ... <a title="OpenAI Whisper Example &#8211; Building a Subtitle Generator &#038; Embedder" class="read-more" href="https://blog.finxter.com/openai-whisper-example-building-a-subtitle-generator-embedder/" aria-label="Read more about OpenAI Whisper Example &#8211; Building a Subtitle Generator &#038; Embedder">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/openai-whisper-example-building-a-subtitle-generator-embedder/">OpenAI Whisper Example &#8211; Building a Subtitle Generator &#038; Embedder</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f399.png" alt="🎙" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Course</strong>: This article is based on a lesson from our <strong>Finxter Academy Course</strong> <a href="https://academy.finxter.com/university/openai-whisper/"><em>Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper</em></a>. Check it out for video lessons, GitHub, and a downloadable PDF course certificate with your name on it!</p>



<p>Welcome back to part 3, where we&#8217;ll use Whisper to build another really cool app. In this part, we&#8217;ll look at how to work with video files. After all, many of the practical applications of speech recognition don&#8217;t come in convenient MP3 files, but rather in video files. We&#8217;ll be building a subtitle generator and embedder, which will take a video file as input, transcribe it, and then embed the subtitles into the video file itself, feeding the result back to the end user.</p>



<p>Before we can get started on the main code, we will need to write some utilities again, just like in the previous part. The utilities we&#8217;ll need this time are:</p>



<ul class="wp-block-list">
<li>Subtitles -&gt; We can just reuse the subtitle-to-disk utility from the previous part. (Done<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2714.png" alt="✔" class="wp-smiley" style="height: 1em; max-height: 1em;" />)</li>



<li>Video -&gt; We will need a way to convert a video file to an mp3 file so that we can feed it to Whisper.</li>



<li>Commands -&gt; We will need a way to run commands on the command line, as there are multiple ffmpeg commands we&#8217;ll need to run both for the video conversion and the subtitle embedding.</li>
</ul>



<p>So let&#8217;s get started with the command utility. Inside the <code>utils</code> folder, first create a new file named <code>command.py</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />utils
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />__init__.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />podcast.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />subtitles.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />command.py   (<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2728.png" alt="✨" class="wp-smiley" style="height: 1em; max-height: 1em;" />new file)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />2_whisper_pods.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env</pre>



<p>Then inside the <code>command.py</code> file let&#8217;s start with our imports:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import datetime
import subprocess
from pathlib import Path</pre>



<p>We&#8217;re going to run commands and provide some very basic logging as well. We imported the <code>datetime</code> module so we can add timestamps to our logs, and <code>pathlib</code> should be familiar by now. The <code>subprocess</code> module in Python is used to spawn new processes, connect to their input/output/error pipes, and obtain their return codes. It allows you to execute system commands and interact with them programmatically. It&#8217;s basically a bit like opening a terminal window inside your Python code.</p>



<p>Next, we&#8217;ll start with an extremely simple function that prints a message in blue letters:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def print_blue(message: str) -> None:
    print(f"\033[94m{message}\033[00m")</pre>



<p>The <code>\033[94m</code> and <code>\033[00m</code> are ANSI escape codes, which are used to add color and formatting to text in terminal output. The <code>94</code> is the code for blue, and the <code>00</code> is the code for reset. You can find a list of all the codes here: https://en.wikipedia.org/wiki/ANSI_escape_code#Colors. We will print the commands we execute to the terminal in blue, which helps them stand out from the other white text output and makes it easier for us to check our commands.</p>
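


<p>If you ever want other colors, the pattern is identical; here&#8217;s a small sketch using two more codes from that table (<code>93</code> is bright yellow and <code>91</code> is bright red):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def print_yellow(message: str) -> None:
    print(f"\033[93m{message}\033[00m")  # 93 = bright yellow


def print_red(message: str) -> None:
    print(f"\033[91m{message}\033[00m")  # 91 = bright red</pre>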



<h2 class="wp-block-heading">Running system commands</h2>



<p>Next, we&#8217;ll create a function that will run a command just like you would on the command line:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def run_and_log(command: str, log_directory: Path) -> None:
    print_blue(f"Running command: \n{command}")
    with open(log_directory / "commands_log.txt", "a+", encoding="utf-8") as file:
        subprocess.call(
            command,
            stdout=file,
            stderr=file,
        )
        file.write(
            f"\nRan command: {command}\nDate/time: {datetime.datetime.now()}\n\n\n\n"
        )</pre>



<p>We create a function called <code>run_and_log</code>, which takes two arguments: <code>command</code>, which is a string, and <code>log_directory</code>, which is a Path indicating the directory where we want to save the log file. We then print the command we&#8217;re about to execute in blue, and then open the log file in append mode. The <code>a+</code> means that we will append to the file if it exists, and create it if it doesn&#8217;t. Again, we use the <code>encoding="utf-8"</code> argument to make sure that we can write non-ASCII characters to the file as well. If you do not do this, you will eventually run into trouble.</p>



<p>Inside the <code>with open</code> context manager, so while the file is open, we call the <code>subprocess.call</code> function. This function takes a command as input and executes it, so as the first argument we pass the <code>command</code> variable. The second argument is <code>stdout=file</code>, which means that we will write the output of the command to the file (instead of the console). The third argument is <code>stderr=file</code>, which means that we will write any errors to the file as well. So we basically execute the command and whatever output there is gets logged inside the text file.</p>



<p>After that, we write what command we executed and a timestamp to the file, and use a couple of <code>\n</code> to add some newlines to the file so that the next command will be lower down, making them easy to distinguish from each other.</p>
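


<p>One portability note: passing the command as a single string like this is typically fine on Windows, but on macOS or Linux <code>subprocess.call</code> will raise a <code>FileNotFoundError</code> for a plain command string. If that happens on your system, either split the string into an argument list or pass <code>shell=True</code>; a minimal sketch:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import shlex
import subprocess

# Option 1: split the command string into an argument list:
subprocess.call(shlex.split("echo 'hello'"))  # ['echo', 'hello']

# Option 2: let the system shell interpret the string
# (be careful with untrusted input when using shell=True):
subprocess.call("echo 'hello'", shell=True)</pre>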



<p>Now let&#8217;s do a quick test using the extremely simple terminal command <code>echo 'hello'</code>, which just prints <code>hello</code> to the console. Let&#8217;s pass this command to our function and see if it works:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">run_and_log("echo 'hello'", Path.cwd())</pre>



<p>For the path we&#8217;ve used the <code>Path.cwd()</code> method from Python&#8217;s <code>pathlib</code> module, which returns the current working directory as a <code>Path</code> object. This is the terminal&#8217;s current directory when you run the script. (This is just for a quick test; we don&#8217;t want to go through the trouble of importing the base directory in here.)</p>



<p>Go ahead and run the <code>command.py</code> file, and whatever directory your terminal was in when you ran the script should now have a file named <code>commands_log.txt</code> with the following inside:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">hello

Ran command: echo 'hello'
Date/time: 2024-01-14 12:13:49.535692</pre>



<p>It worked! We&#8217;ve successfully logged the output <code>hello</code>, followed by the command that was executed and a timestamp. Make sure you remove or comment out the <code>run_and_log</code> line before we continue, as we don&#8217;t want to run this command every time we run the script.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># run_and_log("echo 'hello'", Path.cwd())</pre>



<h2 class="wp-block-heading">A peculiar issue with slashes</h2>



<p>With our <code>run_and_log</code> function completed, we have just one more function to create in here. There is a small discrepancy in file-path formats: ffmpeg expects a different format in its system commands than the one our Python code produces, so we need to write a short utility to fix the path. This issue only occurs with the subtitle path when trying to embed the subtitles using ffmpeg system commands, and I&#8217;m honestly not sure why it occurs, but this is the type of thing you will run into during your software development journey.</p>



<p>If you keep looking you&#8217;ll always find a solution, so never despair. This time, though, I&#8217;ll save you the search and tell you about the issue ahead of time!</p>



<ul class="wp-block-list">
<li>The path <code>C:\Users\dirk\test/subtitle.vtt</code> will not work in the command and will give errors, as it gets mangled and can no longer be parsed as a valid path.</li>



<li>What we need is <code>C\:\\Users\\dirk\\test\\subtitle.vtt</code> instead. Notice there is an extra <code>\</code> after the <code>C</code> and after every <code>\</code> in the path. The first <code>\</code> is an escape character, which means that the second <code>\</code> is not interpreted as a special character but as a literal <code>\</code>.</li>



<li>This issue only affects the subtitle path and not the input or output video paths, so we only need to fix the subtitle path.</li>
</ul>



<p>Below the <code>run_and_log</code> function inside the <code>command.py</code> file, add a new function:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def format_ffmpeg_filepath(path: Path) -> str:
    """Turns C:\Users\dirk\test/subtitle.vtt into C\:\\Users\\dirk\\test\\subtitle.vtt"""
    string_path = str(path)
    return string_path.replace("\\", "\\\\").replace("/", "\\\\").replace(":", "\\:")</pre>



<p>We take a <code>Path</code> as input, and then first convert it to a string so we can use string methods on it to fix the format. We then use the <code>replace</code> method to replace every <code>\</code> with <code>\\</code> and every <code>/</code> with <code>\\</code>. We also replace the <code>:</code> with <code>\:</code>. Now I see you looking mighty confused! Why so many backslashes? Well, remember the first <code>\</code> is the escape character, which means the second one is interpreted not as the start of an escape sequence but as a literal backslash character.</p>



<ul class="wp-block-list">
<li>So in order to target <code>\</code> we need to write <code>\\</code>: the first backslash escapes the second, telling Python we mean the literal <code>\</code> character rather than the start of an escape sequence, so a single <code>\</code> on its own won&#8217;t work.</li>



<li>Likewise, to replace it with <code>\\</code> we need to type <code>\\\\</code>, as every backslash we want in the output needs another backslash to escape it, so that each second backslash is interpreted as a literal backslash character.</li>



<li>So the above function just means that <code>\</code> is replaced by <code>\\</code>, <code>/</code> is replaced by <code>\\</code>, and <code>:</code> is replaced by <code>\:</code>. It just looks so confusing because all the extra escape characters also happen to be backslashes! Phew<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f92f.png" alt="🤯" class="wp-smiley" style="height: 1em; max-height: 1em;" />.</li>
</ul>
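


<p>A tiny sketch of the function in action on Windows (the path is hypothetical, purely for illustration):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from pathlib import Path

# Hypothetical subtitle path, just to demonstrate the conversion:
subs = Path(r"C:\Users\dirk\test") / "subtitle.vtt"
print(format_ffmpeg_filepath(subs))
# C\:\\Users\\dirk\\test\\subtitle.vtt</pre>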



<h2 class="wp-block-heading">Video utility functions</h2>



<p>Okay so with that out of the way, go ahead and save and close the <code>command.py</code> file. It&#8217;s time for our <code>video</code> utility file next, so create a new file called <code>video.py</code> inside the utils folder:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />utils
            <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />__init__.py
            <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />podcast.py
            <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />subtitles.py
            <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />command.py
            <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />video.py   (<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2728.png" alt="✨" class="wp-smiley" style="height: 1em; max-height: 1em;" />new file)
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />2_whisper_pods.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env</pre>



<p>Don&#8217;t worry, this one won&#8217;t be so bad <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f600.png" alt="😀" class="wp-smiley" style="height: 1em; max-height: 1em;" />! Open up your new <code>video.py</code> file and let&#8217;s start with our imports:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from pathlib import Path
from . import command</pre>



<p>All we need is <code>Path</code> for input argument type-hinting and the <code>command</code> module we just created. Next, we&#8217;ll create a function that will convert a video file to an mp3 file so it can be fed to Whisper:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def to_mp3(
    input_video: str, log_directory: Path, output_path: Path, mono: bool = False
) -> str:
    output_path_string = str(output_path)

    channels = 1 if mono else 2
    bitrate = 80 if mono else 192

    command_to_run = f'ffmpeg -i "{input_video}" -vn -ar 44100 -ac {channels} -b:a {bitrate}k "{output_path_string}"'
    command.run_and_log(command_to_run, log_directory)
    print(f"Video converted to mp3 and saved to {output_path_string}")

    return output_path_string</pre>



<p>We define a function named <code>to_mp3</code> which takes an <code>input_video</code> as a string, a <code>log_directory</code> as a Path, an <code>output_path</code> as a Path, and a <code>mono</code> option as a boolean. The function returns a string holding the output path. The <code>input_video</code> path is a string because gradio will feed it to us as one, which is why it is not a <code>Path</code> object like the <code>log_directory</code> and <code>output_path</code>. Make sure you always keep track of the type of every variable, or you will eventually run into trouble passing a Path object where a string is expected, or vice versa.</p>



<p>First, we get a string version of the <code>output_path</code> and save it in <code>output_path_string</code>. Then we check if the <code>mono</code> option is set to <code>True</code> or <code>False</code>, and set the <code>channels</code> and <code>bitrate</code> variables accordingly. If <code>mono</code> is <code>True</code> we set <code>channels</code> to <code>1</code> and <code>bitrate</code> to <code>80</code>, and if <code>mono</code> is <code>False</code> we set <code>channels</code> to <code>2</code> and <code>bitrate</code> to <code>192</code>. We won&#8217;t actually need this mono option until part 4, but we might as well add it now.</p>



<p>Then we get to the command, first preparing it in a variable named <code>command_to_run</code>. We use the <code>ffmpeg</code> command and pass in the <code>input_video</code> as the input file (<code>-i</code>). We then use the <code>-vn</code> option to disable video recording, the <code>-ar</code> option to set the audio sampling frequency to 44100 Hz, the <code>-ac</code> option to set the number of audio channels to <code>channels</code>, and the <code>-b:a</code> option to set the audio bitrate to <code>bitrate</code> kbps. We then pass in the <code>output_path_string</code> as the output file location.</p>
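


<p>To make that concrete, here is roughly what an assembled command could look like with the default <code>mono=False</code> (the file paths here are made up, purely for illustration):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">ffmpeg -i "C:\videos\my video.mp4" -vn -ar 44100 -ac 2 -b:a 192k "C:\FINX_WHISPER\output_temp_files\my video.mp3"</pre>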



<p>Notice that the command is contained inside an f-string which has single quotes on the outside (<code>f'command'</code>). Make sure you imitate this perfectly, using the single quotes on the outside and the double quotes around the variable names of <code>"{input_video}"</code> and <code>"{output_path_string}"</code>. We need these double quotes because the user input video file is likely to have spaces in the name, and not having double quotes around a name with spaces inside will cause the command to fail.</p>



<p>Then we call the <code>run_and_log</code> function from our <code>command</code> module, passing in the command and the directory we want to log to. Finally, we print a message to the console and return the <code>output_path_string</code>.</p>



<p>That completes our <code>video.py</code> file, go ahead and save and close it. We&#8217;re ready to start on the main code now!</p>



<h2 class="wp-block-heading">Subtitle Master &#8211; Putting it all together</h2>



<p>In your root folder, create a new file named <code>3_subtitle_master.py</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />utils
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />__init__.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />podcast.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />subtitles.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />command.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />video.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />2_whisper_pods.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />3_subtitle_master.py   (<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2728.png" alt="✨" class="wp-smiley" style="height: 1em; max-height: 1em;" />new file)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env</pre>



<p>Inside, let&#8217;s start with our imports:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import os
import uuid

import gradio as gr
import whisper
from whisper.utils import WriteVTT

from settings import BASE_DIR, OUTPUT_TEMP_DIR, OUTPUT_VIDEO_DIR, STYLES_DIR
from utils import command, subtitles, video</pre>



<p>We import <code>os</code> to do some filename splitting, and all the other imports are familiar from previous parts. To finish up, we import several directory paths from our <code>settings</code> file and the <code>command</code>, <code>subtitles</code>, and <code>video</code> modules from our <code>utils</code> folder, reusing the <code>subtitles</code> module from the previous part.</p>



<p>Next up are our constants for the file:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">MODEL = whisper.load_model("base.en")
VTT_WRITER = WriteVTT(output_dir=str(OUTPUT_TEMP_DIR))</pre>



<p>We just load up a model; I&#8217;ll start with <code>base.en</code> as it will probably be good enough to get started. Then we instantiate a <code>WriteVTT</code> object like we did last time, indicating we want to save the subtitles in the temp directory.</p>



<p>As we are going to be returning a video to the end user this time, I would like to include the original video name in the output file, though we&#8217;ll still need a uuid as well to guarantee unique names (the user might upload the same file twice!). So let&#8217;s create a quick function that gets us a unique project name. Say the user inputs a file named <code>my_video.mp4</code>; we want the function to return <code>my_video_0f646333-0464-43a1-a75c-ed57c47fbcd5</code>, so that we basically have a uuid with the filename in front of it. We can then add <code>.mp3</code> or <code>.srt</code> or whatever file extension we need at the end, making sure all the files for this project have the same but unique project name.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def get_unique_project_name(input_video: str) -> str:
    """Get a unique subtitle-master project name to avoid file-name clashes."""
    unique_id = uuid.uuid4()
    filename = os.path.basename(input_video)
    base_fname, _ = os.path.splitext(filename)
    return f"{base_fname}_{unique_id}"</pre>



<p>The function takes the input path as a string and then generates a <code>uuid</code>. We then get the filename using <code>os.path.basename</code>, which takes a path like <code>C:\Users\dirk\test\my_video.mp4</code> and returns <code>my_video.mp4</code>. We then use <code>os.path.splitext</code> to split the filename into a base filename and an extension, so <code>my_video.mp4</code> becomes <code>my_video</code> and <code>.mp4</code>. We catch the base name as <code>base_fname</code> and the extension under the variable name <code>_</code> as we don&#8217;t need it. We then return the base filename with the uuid appended to it.</p>
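


<p>As a quick sketch, calling it with the example path from above would give us something like this (the uuid will of course differ on every call):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">print(get_unique_project_name(r"C:\Users\dirk\test\my_video.mp4"))
# my_video_0f646333-0464-43a1-a75c-ed57c47fbcd5</pre>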



<p>Now let&#8217;s get started on our main function below that will tie it all together:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def main(input_video: str) -> str:
    """Takes a video file as string path and returns a video file with subtitles embedded as string path."""
    unique_project_name = get_unique_project_name(input_video)
    get_temp_output_path = lambda ext: OUTPUT_TEMP_DIR / f"{unique_project_name}{ext}"
    mp3_file = video.to_mp3(
        input_video,
        log_directory=BASE_DIR,
        output_path=get_temp_output_path(".mp3"),
    )</pre>



<p>We&#8217;ll take an input video, which gradio will pass to our main function as a string path. The function will return a string path pointing to the processed video file with embedded subtitles, which goes back to gradio. First, we get a unique project name using the function we just wrote. Then we create a simple lambda function like the one we had in part 2. It takes an extension like <code>.mp3</code> as input and returns <code>output_dir/project_name.mp3</code>. We&#8217;ll need temporary file paths for both our <code>.mp3</code> and our <code>.vtt</code> files, and this way we only have one place to change if we ever need to change the output directory.</p>
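


<p>Concretely, here&#8217;s an illustration of what the lambda hands back (the shortened project name is hypothetical, just for readability):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Illustration only -- suppose unique_project_name == "my_video_1234":
get_temp_output_path(".mp3")  # OUTPUT_TEMP_DIR / "my_video_1234.mp3"
get_temp_output_path(".vtt")  # OUTPUT_TEMP_DIR / "my_video_1234.vtt"</pre>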



<p>Then we call the <code>to_mp3</code> function from our <code>video</code> module, passing in the input video, the project&#8217;s base directory as the log directory, and the output path as the <code>get_temp_output_path</code> lambda function with <code>.mp3</code> as the extension. We save the return of the function as the variable named <code>mp3_file</code>.</p>



<p>Continuing on:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def main(input_video: str) -> str:
    ...previous code...

    whisper_output = MODEL.transcribe(mp3_file, beam_size=5)
    vtt_subs = subtitles.write_to_file(
        whisper_output,
        writer=VTT_WRITER,
        output_path=get_temp_output_path(".vtt"),
    )</pre>



<p>We call the <code>transcribe</code> method on our <code>MODEL</code> object, which has an instance of Whisper, passing in the <code>mp3_file</code> as the input file, and setting the <code>beam_size</code> to <code>5</code>. We then call the <code>write_to_file</code> function from our <code>subtitles</code> module, passing in the <code>whisper_output</code> as the transcript, the <code>VTT_WRITER</code> as the writer, and the <code>get_temp_output_path</code> lambda function with <code>.vtt</code> as the extension as the output path.</p>



<p>So what is this <code>beam_size</code> parameter? Well, it&#8217;s one of a number of possible parameters we can pass into the <code>transcribe</code> method. The <code>beam_size</code> parameter is the number of beams to use in the beam search. The higher the number, the more accurate the transcription will be, but the slower it will be as well. The default is <code>5</code>, and I&#8217;ve found that this is a good balance between speed and accuracy. The only reason I&#8217;ve passed it in explicitly here is to make you aware of these parameters. It basically refers to the number of different potential paths that will be explored, from which the most likely one is chosen. Here are some of the other possible parameters:</p>



<ul class="wp-block-list">
<li><code>temperature</code> -&gt; The higher the temperature, the more likely it is that the model will choose a less likely token. You can think of it in a similar way as the <code>temperature</code> setting you get with ChatGPT calls. The default is <code>0</code> and will simply always return the most likely predictions only; <code>0</code> is what we have been using so far.</li>



<li><code>beam_size</code> -&gt; The number of beams to use in the beam search. We just discussed this one above. It is only applicable when the temperature is set to <code>0</code>, and its default value is <code>5</code>.</li>



<li><code>best_of</code> -&gt; The number of candidate samples to draw. Only for use with a nonzero temperature, and it will generate more diverse (and possibly wrong) samples.</li>



<li><code>task</code> -&gt; Either <code>transcribe</code> or <code>translate</code>. We&#8217;ve used this one before and it defaults to <code>transcribe</code>.</li>



<li><code>language</code> -&gt; The language spoken in the audio. Defaults to <code>None</code>, which will perform a language detection first.</li>



<li><code>device</code> -&gt; The device to use for inference. This one is set when loading the model, e.g. <code>whisper.load_model("base.en", device="cpu")</code>, rather than passed to <code>transcribe</code>. It defaults to <code>cuda</code> if you have a CUDA-enabled GPU, otherwise it will default to <code>cpu</code>.</li>



<li><code>verbose</code> -&gt; Controls the console output during transcription. <code>True</code> prints the decoded text as it goes, <code>False</code> shows minimal progress details, and the default of <code>None</code> prints nothing extra.</li>
</ul>



<p>And there are more. For general use, you&#8217;ll probably do fine with the defaults most of the time, but be aware that you can tweak these parameters to get better results if you need to.</p>
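


<p>As a quick, purely illustrative sketch (the values here are arbitrary, not recommendations), a call tweaking a few of these parameters might look like this:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">result = MODEL.transcribe(
    mp3_file,
    temperature=0.2,    # nonzero: sample instead of deterministic decoding
    best_of=3,          # number of candidates to sample (temperature > 0)
    task="transcribe",  # or "translate" to translate into English
    language="en",      # declare the audio language to skip detection
    verbose=False,      # minimal progress details in the console
)
print(result["text"])</pre>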



<p>Back to our code, let&#8217;s continue:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def main(input_video: str) -> str:
    ...previous code...

    vtt_string_path = command.format_ffmpeg_filepath(vtt_subs)
    output_video_path = OUTPUT_VIDEO_DIR / f"{unique_project_name}_subs.mp4"
    embed_subs_into_vid_command = f'ffmpeg -i "{input_video}" -vf "subtitles=\'{vtt_string_path}\'" "{output_video_path}"'

    command.run_and_log(embed_subs_into_vid_command, log_directory=BASE_DIR)

    return str(output_video_path)</pre>



<p>We need to run another <code>ffmpeg</code> system command to embed the subtitles we have created into our video file. We first get the <code>vtt_string_path</code> by passing the <code>vtt_subs</code> path we already have into that crazy function with all the <code>\\\\</code> backslashes we called <code>format_ffmpeg_filepath</code>, remember? After that, we save our desired output video path in a variable by just combining our <code>OUTPUT_VIDEO_DIR</code> with the <code>unique_project_name</code> and pasting <code>_subs.mp4</code> at the end for good measure.</p>



<p>Now we prepare the <code>ffmpeg</code> command we&#8217;re about to run in a separate variable for readability. We use the <code>input_video</code> as the input file (<code>-i</code>), and then use the <code>-vf</code> option to add a video filter. The video filter we use is <code>subtitles</code> and we pass in the <code>vtt_string_path</code> as the subtitle file. We then pass in the <code>output_video_path</code> as the output file.</p>



<p>Notice again that the whole command is inside single quotes <code>'</code>, inside of which we have path variables in double quotes <code>"</code> to avoid trouble if there are spaces in the filename. But we also have to pass in <code>"subtitles='{vtt_string_path}'"</code>, which requires yet another level of quoting. Going back to plain single quotes <code>'</code> would cause trouble, as we already used those to open the f-string, so we have to escape them with a backslash as <code>\'</code> instead.</p>
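


<p>For illustration, the finished embed command could look roughly like this, with hypothetical paths and the uuid shortened to <code>1234</code> for readability:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">ffmpeg -i "C:\videos\my video.mp4" -vf "subtitles='C\:\\FINX_WHISPER\\output_temp_files\\my video_1234.vtt'" "C:\FINX_WHISPER\output_video\my video_1234_subs.mp4"</pre>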



<p>Then we call the <code>run_and_log</code> function from our <code>command</code> module, passing in the command we just wrote, and the <code>BASE_DIR</code> as the log directory. We then return the <code>output_video_path</code> as a string, as gradio doesn&#8217;t want a Path object.</p>



<p>The whole <code>main</code> function now looks like this:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def main(input_video: str) -> str:
    """Takes a video file as string path and returns a video file with subtitles embedded as string path."""
    unique_project_name = get_unique_project_name(input_video)
    get_temp_output_path = lambda ext: OUTPUT_TEMP_DIR / f"{unique_project_name}{ext}"
    mp3_file = video.to_mp3(
        input_video,
        log_directory=BASE_DIR,
        output_path=get_temp_output_path(".mp3"),
    )

    whisper_output = MODEL.transcribe(mp3_file, beam_size=5)
    vtt_subs = subtitles.write_to_file(
        whisper_output,
        writer=VTT_WRITER,
        output_path=get_temp_output_path(".vtt"),
    )

    vtt_string_path = command.format_ffmpeg_filepath(vtt_subs)
    output_video_path = OUTPUT_VIDEO_DIR / f"{unique_project_name}_subs.mp4"
    embed_subs_into_vid_command = f'ffmpeg -i "{input_video}" -vf "subtitles=\'{vtt_string_path}\'" "{output_video_path}"'

    command.run_and_log(embed_subs_into_vid_command, log_directory=BASE_DIR)

    return str(output_video_path)</pre>



<h2 class="wp-block-heading">Building the interface</h2>



<p>Now all we need to do to run this is create another gradio interface. As you are already familiar with gradio by now, we&#8217;ll go through this one a bit more quickly; the principles are the same as last time. Below your main function, continue with:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">if __name__ == "__main__":
    block = gr.Blocks(
        css=str(STYLES_DIR / "subtitle_master.css"),
        theme=gr.themes.Soft(primary_hue=gr.themes.colors.emerald),
    )

    with block:
        with gr.Group():
            gr.HTML(
                f"""
                &lt;div class="header">
                &lt;img src="https://i.imgur.com/dxHMfCI.png" referrerpolicy="no-referrer" />
                &lt;/div>
                """
            )
            with gr.Row():
                input_video = gr.Video(
                    label="Input Video", sources=["upload"], mirror_webcam=False
                )
                output_video = gr.Video()
            with gr.Row():
                button_text = "<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f39e.png" alt="🎞" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Subtitle my video! <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f39e.png" alt="🎞" class="wp-smiley" style="height: 1em; max-height: 1em;" />"
                btn = gr.Button(value=button_text, elem_classes=["button-row"])

            btn.click(main, inputs=[input_video], outputs=[output_video])

    block.launch(debug=True)</pre>



<p>We use the <code>if __name__ == "__main__":</code> guard to make sure that the code inside only runs when we run the file directly. We create the gradio <code>block</code> object just like we did before, passing in a <code>css</code> file that doesn&#8217;t exist yet, but this time we also pass in a <code>theme</code>. I&#8217;ll pass in the <code>gr.themes.Soft()</code> which has a bit of a different style to it, and set the accent color to emerald by passing in <code>primary_hue=gr.themes.colors.emerald</code> when calling <code>Soft()</code>. This will match nicely with the logo I have prepared for you with this application.</p>



<p>Then we open the <code>block</code> object using the with statement, and open up a new <code>Group</code> inside of it, just like we did before, so we can build our block interface. The HTML object is the same as in the last part, except I changed the image link URL to give you a new logo for this app. Then we open up a new <code>Row</code> and add a <code>Video</code> object for the input video, passing in <code>sources=["upload"]</code> so the user uploads a video file rather than recording from their webcam, and setting <code>mirror_webcam=False</code> since we don&#8217;t want webcam input anyway. Still on the same <code>Row</code>, so next to the input video, we declare another <code>Video</code> object for the output video file.</p>



<p>We then have a row that only holds a button, for which we provide a text label and a class of <code>button-row</code> so we can target it with CSS. The <code>btn.click</code> declaration is a lot simpler this time, as we just call the <code>main</code> function with only a single input of <code>input_video</code> and only one output of <code>output_video</code>. Finally, we call <code>.launch</code> on the block just like last time.</p>



<p>That&#8217;s our code done! You&#8217;re probably dying to run it, but wait! We have to create a quick CSS file to finish it off. Create a new file named <code>subtitle_master.css</code> inside the <code>styles</code> folder:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />subtitle_master.css   (<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2728.png" alt="✨" class="wp-smiley" style="height: 1em; max-height: 1em;" />new file)
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />whisper_pods.css
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />utils
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />__init__.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />podcast.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />subtitles.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />command.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />video.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />2_whisper_pods.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />3_subtitle_master.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env</pre>



<p>Inside we&#8217;ll just write some quick CSS styles:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">.header {
  padding: 2em 8em;
}

.header,
.button-row {
  background-color: #1d366f7e;
}</pre>



<p>We just gave the <code>header</code> class some padding to stop the logo image from being too large, and then gave both the <code>header</code> and <code>button-row</code> classes a background color of <code>#1d366f7e</code>, which is a nice half-transparent dark blue. Save and close the file, and we&#8217;re ready to run! Go ahead and run the <code>3_subtitle_master.py</code> file, and give it some time to load. Click the link in your terminal window again to open the interface in your browser, and you should see something like this:</p>



<figure class="wp-block-image size-large"><img decoding="async" src="https://academy.finxter.com/wp-content/uploads/2024/01/3_subtitle_master-1024x543.png" alt="" class="wp-image-4061"/></figure>



<p>Yours won&#8217;t have Korean in the input video box though, but whatever language your computer is set to. Go ahead and upload a video file, wait a second for it to load, and then press the <code>Subtitle my video</code> button. This may take quite a while if you&#8217;re not on a fast system with a powerful GPU, but you&#8217;ll see the commands and steps being executed in your terminal window just like we set up. Eventually, you&#8217;ll see the output video appear with the subtitles embedded, each one perfectly in time with the video, and you can play it back and download it!</p>



<figure class="wp-block-image size-large"><img decoding="async" src="https://academy.finxter.com/wp-content/uploads/2024/01/3_subtitle_output-1024x609.png" alt="" class="wp-image-4060"/></figure>



<p>You can check the <code>commands_log.txt</code> file in the root directory to see all the commands that were run, the <code>output_temp_files</code> folder to see the temporary files that were created during the process, and the <code>output_video</code> folder to see the final output video file. If you need some extra quality, load a larger model like <code>small.en</code> or <code>medium.en</code>.</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p>That&#8217;s pretty awesome! An automatic subtitler that will subtitle any video for you all on its own. You could build on this by accepting YouTube links or adding translation functionality so you can have English subtitles on foreign-language videos, which could be great for language learning. Just make sure you don&#8217;t use a <code>.en</code> model if you want to work with other languages, obviously.</p>



<p>To make a real production-grade application, use a front-end framework, and show some kind of progress indicator or stream the live transcription to the page so the user doesn&#8217;t get bored, or allow them to do something else while the file processes in the background. A production app would also have to run on a server with good processing power and a GPU.</p>



<p>That&#8217;s it for part 3, I&#8217;ll see you soon in part 4 where we&#8217;ll look at ways to speed up Whisper or outsource the processing using the OpenAI API endpoint in the cloud. We&#8217;ll also build one more app using the cloud API to round off the series. See you there soon!</p>



<h2 class="wp-block-heading">Full Course: OpenAI Whisper &#8211; Building Cutting-Edge Python Apps with OpenAI Whisper</h2>



<p>Check out our full OpenAI Whisper course with video lessons, easy explanations, GitHub, and a downloadable PDF certificate to prove your speech processing skills to your employer and freelancing clients:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://academy.finxter.com/university/openai-whisper/"><img decoding="async" width="908" height="257" src="https://blog.finxter.com/wp-content/uploads/2024/01/image-154.png" alt="" class="wp-image-1654506" srcset="https://blog.finxter.com/wp-content/uploads/2024/01/image-154.png 908w, https://blog.finxter.com/wp-content/uploads/2024/01/image-154-300x85.png 300w, https://blog.finxter.com/wp-content/uploads/2024/01/image-154-768x217.png 768w" sizes="(max-width: 908px) 100vw, 908px" /></a></figure>
</div>


<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f449.png" alt="👉" class="wp-smiley" style="height: 1em; max-height: 1em;" /> [<strong>Academy</strong>] <a href="https://academy.finxter.com/university/openai-whisper/" data-type="link" data-id="https://academy.finxter.com/university/openai-whisper/">Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper</a></p>
<p>The post <a href="https://blog.finxter.com/openai-whisper-example-building-a-subtitle-generator-embedder/">OpenAI Whisper Example &#8211; Building a Subtitle Generator &#038; Embedder</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>OpenAI Whisper &#8211; Building a Podcast Transcribing App in Python</title>
		<link>https://blog.finxter.com/openai-whisper-building-a-podcast-transcribing-app-in-python/</link>
		
		<dc:creator><![CDATA[Dirk van Meerveld]]></dc:creator>
		<pubDate>Thu, 25 Jan 2024 19:56:17 +0000</pubDate>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Large Language Model (LLM)]]></category>
		<category><![CDATA[OpenAI]]></category>
		<category><![CDATA[Speech Recognition and Generation]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1654503</guid>

					<description><![CDATA[<p>🎙️ Course: This article is based on a lesson from our Finxter Academy Course Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper. Check it out for video lessons, GitHub, and a downloadable PDF course certificate with your name on it! Welcome back to part 2, where we&#8217;ll start practically applying our Whisper skills ... <a title="OpenAI Whisper &#8211; Building a Podcast Transcribing App in Python" class="read-more" href="https://blog.finxter.com/openai-whisper-building-a-podcast-transcribing-app-in-python/" aria-label="Read more about OpenAI Whisper &#8211; Building a Podcast Transcribing App in Python">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/openai-whisper-building-a-podcast-transcribing-app-in-python/">OpenAI Whisper &#8211; Building a Podcast Transcribing App in Python</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f399.png" alt="🎙" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Course</strong>: This article is based on a lesson from our <strong>Finxter Academy Course</strong> <a href="https://academy.finxter.com/university/openai-whisper/"><em>Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper</em></a>. Check it out for video lessons, GitHub, and a downloadable PDF course certificate with your name on it!</p>



<p>Welcome back to part 2, where we&#8217;ll start practically applying our Whisper skills to build useful stuff. We obviously cannot rely on the user always handing us MP3 files to transcribe; they may want to just link a podcast, for example. Here, we&#8217;ll be building a real application that can transcribe podcasts to text or subtitle format, taking just a podcast link as input.</p>



<p>Before we get started on the main code, we&#8217;ll do some basic setup work and create the helper functions we need to run in our main code. Keeping things separated across multiple functions and files will keep our code a lot cleaner and more readable than one big script that does everything at once.</p>



<h2 class="wp-block-heading">Saving our constants to a separate file</h2>



<p>First, there are a couple of settings we&#8217;ll be using again and again over the next three parts, namely the paths to the input and output folders for the mp3 files, subtitles, and whatever else we will be processing. Instead of importing <code>pathlib</code> in every single file and then writing <code>BASE_DIR = Path(__file__).parent</code> we&#8217;ll just write this in a separate file and import it everywhere we need it. This will also make it easier to change the paths later if we need to.</p>



<p>In your project folder create a new file called <code>settings.py</code>, making sure to put it in the root folder of your project:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py</pre>



<p>In <code>settings.py</code>, write the following code:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from pathlib import Path

BASE_DIR = Path(__file__).parent
OUTPUT_TEMP_DIR = BASE_DIR / "output_temp_files"
OUTPUT_VIDEO_DIR = BASE_DIR / "output_video"
STYLES_DIR = BASE_DIR / "styles"
TEST_AUDIO_DIR = BASE_DIR / "test_audio_files"</pre>



<p>We first get the root directory of the project using <code>Path(__file__).parent</code>, and then we create a few more paths relative to the root directory. We&#8217;ll use these paths in our main code to save the output files to the correct folders. Go ahead and also create empty folders for the <code>output_temp_files</code>, <code>output_video</code>, and <code>styles</code> folders, making sure to spell them correctly:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files     (new empty folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video          (new empty folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles                (new empty folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files      (already existing folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py</pre>



<p>That&#8217;s our folders and paths setup done; as sketched above, we can simply import these variables to access the folders from any file in our project. There is one more <code>setting</code> we need to define, but we cannot hardcode this one in our source code. We need to get our API key for OpenAI, as we&#8217;ll be using some ChatGPT in this part of the course. You&#8217;ll also need your API key for later parts. Go to https://platform.openai.com/api-keys and copy your API key. If you don&#8217;t have one, make sure to get one. You&#8217;ll only pay for what you use, which will be mere cents if you just play around with it casually. Then create a new file called <code>.env</code> in the root folder of your project:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env                  (new file)</pre>



<p>And paste your API key in there like this, making sure not to use any spaces or quotes:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">OPENAI_API_KEY=your_api_key_here</pre>



<p>Then go ahead and save and close this file.</p>



<h2 class="wp-block-heading">Creating a utils folder for our helper functions</h2>



<p>Now let&#8217;s create a new folder named <code>utils</code> to hold our helper functions, and then inside this new folder create an empty file called <code>__init__.py</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />utils                 (new folder)
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />__init__.py       (new empty file)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env</pre>



<p>The <code>__init__.py</code> file is required to make Python treat the <code>utils</code> folder as a package, which will allow us to import the functions from within our other files. You don&#8217;t need to write anything in this file, just create it and leave it empty.</p>
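


<p>For example, once the package exists (and we&#8217;ve written the modules below), any file in the project root can import our helpers like this:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># With utils/__init__.py in place, this works from any file in the project root:
from utils import podcast, subtitles</pre>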



<p>Our first utils file will deal with the podcast-related functions, so create a file called <code>podcast.py</code> in the <code>utils</code> folder:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />utils
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />__init__.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />podcast.py        (new file)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env</pre>



<p>Inside <code>podcast.py</code> get started with our imports:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import re
import uuid
from pathlib import Path

import requests
from decouple import config
from openai import OpenAI</pre>



<p>The <code>re</code> library deals with regular expressions and will help us find the podcast download link amongst the page text. The <code>uuid</code> library lets us generate unique IDs, <code>pathlib</code> is familiar to us by now, and <code>requests</code> will help us download the podcast mp3 file. <code>decouple</code> will help us read our API key from the <code>.env</code> file, and <code>openai</code> will help us use the OpenAI API. If you have not used <code>decouple</code> before, make sure you run the install command in your terminal:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pip install python-decouple</pre>



<p>Back in <code>podcast.py</code> let&#8217;s create a few constants that we&#8217;ll be using in our functions:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">GPT_MODEL = "gpt-3.5-turbo-1106"
CLIENT = OpenAI(api_key=str(config("OPENAI_API_KEY")))</pre>



<p>First, we set the ChatGPT model we&#8217;ll be using to request a podcast summary later on. Then we create a <code>CLIENT</code> object that we&#8217;ll use to make requests to the OpenAI API, using <code>config</code> to read the API key from the <code>.env</code> file. Note that <code>config("OPENAI_API_KEY")</code> already returns a string here; the surrounding <code>str()</code> call just makes the type explicit for type checkers. Calling <code>str()</code> on a value that is already a string returns it unchanged, so this adds no meaningful overhead.</p>
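


<p>In case <code>decouple</code> is new to you, here is a minimal sketch of how <code>config</code> behaves; the <code>DEBUG</code> variable is purely a hypothetical illustration of the optional <code>default</code> and <code>cast</code> arguments:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from decouple import config

api_key = config("OPENAI_API_KEY")  # read from .env (or the environment); raises if missing
debug_mode = config("DEBUG", default=False, cast=bool)  # hypothetical variable, showing default/cast</pre>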



<h2 class="wp-block-heading">Scraping the podcast download link from the podcast page</h2>



<p>So what are some of the functions we&#8217;ll need in here? For this example application I will be using <code>Google Podcasts</code> as our podcast source. This means we will get an input link like this:<br>https://podcasts.google.com/feed/aHR0cDovL2ZlZWRzLmZlZWRidXJuZXIuY29tL1RFRF9BaGFfQnVzaW5lc3M/episode/ZW4udmlkZW8udGFsay50ZWQuY29tOjExMTk3MDo4MA?sa=X&amp;ved=0CAgQuIEEahcKEwiIzMnavduDAxUAAAAAHQAAAAAQAQ</p>



<p>If you load this page in your browser, you will see an HTML page with a play button. This is the kind of page link the user will input into our app, so first of all we need a function to extract the <code>.mp3</code> download link from this page&#8217;s HTML.</p>



<p>Let&#8217;s get started on a function to do exactly that:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def scrape_link_from_page(page_url: str) -> str:
    podcast_page = requests.get(page_url).text
    regex = r"(?P&lt;url>\;https?://[^\s]+)"
    ...</pre>



<p>We start by defining our function which takes the <code>page_url</code> as a string and will return a string value as well. Then we use <code>requests</code> to get the HTML page text by sending a <code>GET</code> request to the URL, much like your internet browser would if you type a URL in the address bar. Now we define a regular expression that will match the pattern of the download link we want to extract. We&#8217;ll use this regex to find the download link in the HTML page text. Here&#8217;s how it works:</p>



<ul class="wp-block-list">
<li><code>(?P&lt;url&gt;...)</code> This is a named group: the matched text can later be retrieved under the name <code>url</code>. In other words, the URL pattern we find will be accessible via the group name <code>url</code>.</li>



<li><code>\;</code> This matches a literal semicolon character. The backslash is actually unnecessary here, since a semicolon has no special meaning in regular expressions, but it is harmless and the pattern still matches the literal <code>;</code>. We need it because there is a semicolon directly in front of the https URL we want to match. (This is just a characteristic of this particular podcast page; other pages might have different patterns.)</li>



<li><code>https?</code> This matches either http or https. The s? means &#8220;match zero or one s characters&#8221;. This allows the regex to match both http and https.</li>



<li><code>://</code> This matches the string ://, which is part of the standard format for URLs.</li>



<li><code>[^\s]+</code> This matches one or more (<code>+</code>) characters that are not (<code>^</code>) whitespace (<code>\s</code>), i.e. anything except spaces, tabs, and newlines. This consumes the rest of the URL and stops as soon as whitespace appears, which marks the end of the URL.</li>
</ul>



<p>So, in simple terms, this regular expression matches a semicolon followed by a URL that starts with either http or https, and continues until a whitespace character is encountered. The URL is captured in a group named url.</p>



<p>Now let&#8217;s complete our function:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def scrape_link_from_page(page_url: str) -> str:
    podcast_page = requests.get(page_url).text
    regex = r"(?P&lt;url>\;https?://[^\s]+)"
    podcast_url_dirty = re.findall(regex, podcast_page)[0]
    podcast_url = podcast_url_dirty.split(";")[1]
    return podcast_url</pre>



<p>After declaring the regex pattern, we use <code>re.findall</code> to find all matches of the pattern in the podcast page text. This returns a list of matches, and we take the first one with <code>[0]</code>, giving us a string that looks something like this:</p>



<p><code>;https://download.ted.com/talks/etcetcetc;</code></p>



<p>That&#8217;s pretty close; we just need to get rid of the <code>;</code> characters before and after the URL. We do this by splitting the string on the <code>;</code> character and taking the second item in the resulting list with <code>[1]</code>. This gives us the clean URL we need: https://download.ted.com/talks/etcetcetc</p>
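


<p>As a quick sanity check, here is the whole match-and-split dance on a made-up page snippet (the HTML string below is hypothetical, just to illustrate the steps):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import re

# Hypothetical fragment of the podcast page's HTML, purely for illustration:
page_text = 'data-url=";https://download.ted.com/talks/etcetcetc;" plus more page text'

regex = r"(?P&lt;url>\;https?://[^\s]+)"
podcast_url_dirty = re.findall(regex, page_text)[0]  # ';https://download.ted.com/talks/etcetcetc;"'
podcast_url = podcast_url_dirty.split(";")[1]        # 'https://download.ted.com/talks/etcetcetc'</pre>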



<h2 class="wp-block-heading">Downloading the podcast mp3 file</h2>



<p>Ok, so now our utils file has a function to scrape the download link. It stands to reason we&#8217;ll also need a function to download the mp3 file from the URL. Let&#8217;s get started on that:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def download(podcast_url: str, unique_id: uuid.UUID, output_dir: Path) -> Path:
    print("Downloading podcast...")
    podcast_audio = requests.get(podcast_url)
    save_location = output_dir / f"{unique_id}.mp3"
    ...</pre>



<p>We define a function called <code>download</code> that takes three input arguments. The <code>podcast_url</code> is the URL we scraped from the podcast page, as a string. The <code>unique_id</code> is a unique ID we&#8217;ll use to name the downloaded file so we can avoid name clashes between downloads. This argument should be an instance of the <code>UUID</code> class from Python&#8217;s built-in <code>uuid</code> library, which we&#8217;ll have a look at in a bit. The <code>output_dir</code> is the directory where we want to save the downloaded file, as a <code>Path</code> object. Finally, our function will also return a <code>Path</code> object: the path to the downloaded file.</p>



<p>We print a simple message to the console to show it is busy actually doing something, and then we use <code>requests</code> to download the podcast audio file by sending a <code>GET</code> request to the URL just like we did in the previous function. Then we create a <code>save_location</code> variable which is the path to the file we want to save. We use the <code>output_dir</code> argument as the parent directory, and then we use an f-string to create a filename that is the <code>unique_id</code> followed by the <code>.mp3</code> extension.</p>



<p>Now let&#8217;s complete our function:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def download(podcast_url: str, unique_id: uuid.UUID, output_dir: Path) -> Path:
    print("Downloading podcast...")
    podcast_audio = requests.get(podcast_url)
    save_location = output_dir / f"{unique_id}.mp3"

    with open(save_location, "wb") as file:
        file.write(podcast_audio.content)
    print("Podcast successfully downloaded!")

    return save_location</pre>



<p>We use the <code>open</code> function to open the <code>save_location</code> file in write binary (<code>wb</code>) mode, and we write the <code>podcast_audio.content</code> to the file. This will save the podcast audio file to the <code>save_location</code> path. Then we print a message to the console to show the download was successful, and we return the <code>save_location</code> path which points to the mp3 file we just downloaded, awesome!</p>
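


<p>To see how the two helpers fit together, a hypothetical end-to-end usage could look like this; the page link and output folder are placeholders:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import uuid
from pathlib import Path

page_link = "https://podcasts.google.com/feed/..."  # placeholder for a real podcast page link
podcast_download_url = scrape_link_from_page(page_link)
mp3_file = download(podcast_download_url, uuid.uuid4(), Path("output_temp_files"))
print(mp3_file)  # e.g. output_temp_files/0e0f5d05-9379-4124-a84d-81de7eb3e314.mp3</pre>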



<h2 class="wp-block-heading">Getting a summary</h2>



<p>Now there is one more function we need in our <code>utils/podcast.py</code> file. Besides the transcription itself, we will also provide the user with a summary of the podcast. We&#8217;ll use ChatGPT to generate this summary, so we need a simple function to do that. This one will be easy, so let&#8217;s just whip it up:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def get_summary(transcription: str) -> str:
    print("Summarizing podcast...")
    prompt = f"Summarize the following podcast into the most important points:\n\n{transcription}\n\nSummary:"

    response = CLIENT.chat.completions.create(
        model=GPT_MODEL, messages=[{"role": "user", "content": prompt}]
    )

    print("Podcast summarized!")
    summary = response.choices[0].message.content
    return summary if summary else "There was a problem generating the summary."</pre>



<p>I assume you&#8217;re familiar with ChatGPT (if not, check out my other courses on the Finxter Academy!). We just have a simple function that takes the full <code>transcription</code> as a string and will return a summary as a string. We have a console print message again just to keep ourselves posted that it is doing some work and then we have a simple ChatGPT prompt.</p>



<p>Note the prompt ends with <code>Summary:</code> to nudge the model into starting the summary right away without any awkward introduction text; this is just a neat little trick you can use. We then use our <code>CLIENT</code> object to call the <code>chat.completions.create</code> endpoint, passing in the <code>GPT_MODEL</code> and a list of messages; we simply pass in the prompt as a user message. We then extract the <code>summary</code> from <code>response.choices[0].message.content</code>. Just in case there was a problem and the summary is empty, we return a default message to inform the user.</p>



<h2 class="wp-block-heading">Subtitles</h2>



<p>Awesome! Our <code>podcast</code> utils are done now. Let&#8217;s move on to the <code>subtitles</code> utils. This one will be a much shorter file with a function that will allow us to output the transcription in subtitle format, with timestamps and everything. So go ahead and create a new file called <code>subtitles.py</code> in the <code>utils</code> folder:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />utils
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />__init__.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />podcast.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />subtitles.py      (new file)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env</pre>



<p>And inside <code>subtitles.py</code> get started with our imports:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from typing import Callable
from pathlib import Path</pre>



<p>Both of these imports will be used solely to indicate the type of our function arguments (type hinting). We&#8217;ll use <code>Callable</code> to indicate that a function is expected as an argument, and we&#8217;ll use <code>Path</code> to indicate that a <code>Path</code> object is expected as an argument. This just makes our code clearer to read and easier to understand. Now let&#8217;s write our function, whose purpose will be to take a transcription done by Whisper and then convert it to a valid subtitle file:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def write_to_file(whisper_output: dict, writer: Callable, output_path: Path) -> Path:
    """Takes the whisper output, a writer function, and an output path, and writes subtitles to disk in the specified format."""
    with open(output_path, "w", encoding="utf-8") as sub_file:
        writer.write_result(result=whisper_output, file=sub_file)
        print(f"Subtitles generated and saved to {output_path}")

    return output_path</pre>



<p>We take a <code>whisper_output</code> argument which is a dictionary containing the output Whisper gives us after we transcribe the podcast&#8217;s mp3 file. We also take a <code>writer</code> argument which is a function that will write the subtitles to disk, so we type-hint it with <code>Callable</code>. Finally, we take an <code>output_path</code> argument which is a <code>Path</code> object to the file we want to save the subtitles to. We then simply open the output path in write mode, calling the file <code>sub_file</code>. We then call the <code>writer.write_result</code> function, passing in the <code>whisper_output</code> and the location to save the subtitles to. Finally, we print a message to the console to show the subtitles were generated successfully, and we return the <code>output_path</code> which is the path to the subtitle file we just created.</p>



<p>Two important things to note here:</p>



<ul class="wp-block-list">
<li>When you open the subtitle file, make sure you use the <code>encoding="utf-8"</code> argument. For plain English characters this is not strictly necessary, so you might think you can skip it. However, the AI likes to use ♪ symbols when music starts playing to make the subtitles more interesting, and writing those will crash with a <code>UnicodeEncodeError</code> if your system&#8217;s default encoding cannot represent them. Specifying utf-8 encoding lets Python map and save these special characters.</li>



<li>You might be wondering what this magical <code>writer</code> function is. Whisper actually comes with some utility classes that write subtitles in correct formatting, like <code>SRT</code> or <code>VTT</code>. These utilities have a <code>.write_result</code> method, which is what we&#8217;re calling in the code above. So we&#8217;ll be able to pass in an SRT writer or a VTT writer depending on which subtitle format we want to save (see the short sketch after this list).</li>
</ul>
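


<p>As a quick standalone sketch, assuming <code>whisper_output</code> already holds a transcription result, using one of these writers directly would look roughly like this (we&#8217;ll wire this up properly in the main file shortly):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from whisper.utils import WriteSRT

writer = WriteSRT(output_dir="output_temp_files")  # the output directory is passed as a string
with open("output_temp_files/example.srt", "w", encoding="utf-8") as sub_file:
    writer.write_result(result=whisper_output, file=sub_file)</pre>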



<p>Ok, so that is all our utility functions done. Now let&#8217;s move on to the main code.</p>



<h2 class="wp-block-heading">Installing gradio</h2>



<p>Before we get started you&#8217;ll need to install <code>gradio</code>, so in your terminal window, run:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pip install gradio</pre>



<p>What is <code>gradio</code>? Gradio is a Python library that allows us to quickly create user-friendly interfaces for testing, demonstrating, and debugging machine learning models. We&#8217;ll use gradio to create a UI for our app with just a few lines of code, and it supports a wide range of input and output types like video, audio, and text. Using this super simple framework we can keep the focus on whisper and not on building a user interface. It&#8217;s pretty self-explanatory, so you&#8217;ll understand the idea as we just code along.</p>
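


<p>If you&#8217;ve never seen gradio before, here is the classic minimal example (hypothetical, not part of our app) just to show the pattern of wiring a Python function to a web UI:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import gradio as gr


def greet(name: str) -> str:
    return f"Hello, {name}!"


# One input textbox, one output textbox, served at a local URL:
gr.Interface(fn=greet, inputs="text", outputs="text").launch()</pre>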



<h2 class="wp-block-heading">Creating the main file</h2>



<p>Now let&#8217;s get started on our main code, where mostly we&#8217;ll just have to call our utility functions and tie it all together, plus create a quick gradio interface to make it user-friendly. Create a new file called <code>2_whisper_pods.py</code> in the root folder of your project:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />utils
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />__init__.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />podcast.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />subtitles.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />2_whisper_pods.py   (new file)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env</pre>



<p>And inside <code>2_whisper_pods.py</code> get started with our imports:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import uuid
from pathlib import Path

import gradio as gr
import whisper
from whisper.utils import WriteSRT, WriteVTT

from settings import BASE_DIR, OUTPUT_TEMP_DIR, STYLES_DIR
from utils import podcast, subtitles</pre>



<p><code>uuid</code> is Python&#8217;s built-in library for generating unique IDs, <code>pathlib</code> is familiar to us by now, and <code>gradio</code> is the library we just installed. We also import <code>whisper</code> and two writer utilities from <code>whisper.utils</code>, which are the writer classes we talked about in the previous section. Then we import our directory <code>Path</code> constants from <code>settings</code> and our <code>podcast</code> and <code>subtitles</code> utils. Now continue below the imports:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">WHISPER_MODEL = whisper.load_model("base")
VTT_WRITER = WriteVTT(output_dir=str(OUTPUT_TEMP_DIR))
SRT_WRITER = WriteSRT(output_dir=str(OUTPUT_TEMP_DIR))</pre>



<p>We load the <code>WHISPER_MODEL</code> from the <code>base</code> model, and we create two writer objects by creating instances of the <code>WriteVTT</code> and <code>WriteSRT</code> classes we imported from Whisper&#8217;s utilities, passing in the <code>output_dir</code> as a string.</p>



<p>Now let&#8217;s create a function to tie it all together:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def transcribe_and_summarize(page_link: str) -> tuple[str, str, str, str]:
    unique_id = uuid.uuid4()

    podcast_download_url = podcast.scrape_link_from_page(page_link)
    mp3_file: Path = podcast.download(podcast_download_url, unique_id, OUTPUT_TEMP_DIR)
    ...</pre>



<p>We define a function called <code>transcribe_and_summarize</code> which takes a <code>page_link</code> as a string and will return a tuple so we can have multiple outputs to this function. These four outputs will feed back into the gradio interface we will create later and will be:</p>



<ul class="wp-block-list">
<li>The podcast summary</li>



<li>The podcast transcription</li>



<li>The VTT subtitle file (path)</li>



<li>The SRT subtitle file (path)</li>
</ul>



<p>We then create a new <code>unique_id</code> which we&#8217;ll use to name the downloaded mp3 file. Note we do this inside the function as we need a unique identifier for every single transcription run to avoid name clashes. Then we use our <code>podcast.scrape_link_from_page</code> util to scrape the download link from the podcast page, and we use our <code>podcast.download</code> function to download the podcast mp3 file, passing in the <code>podcast_download_url</code>, <code>unique_id</code>, and the <code>OUTPUT_TEMP_DIR</code> as arguments. We then catch the mp3 file path in a variable called <code>mp3_file</code>. Notice how easy everything is to read because we used logical and descriptive names for all our variables and utility functions and files.</p>
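


<p>In case <code>uuid</code> is new to you: <code>uuid.uuid4()</code> generates a random 128-bit identifier, and collisions are astronomically unlikely, which is what makes it safe for file names:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import uuid

unique_id = uuid.uuid4()
print(unique_id)           # e.g. 0e0f5d05-9379-4124-a84d-81de7eb3e314
print(f"{unique_id}.mp3")  # the kind of file name we build from it</pre>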



<p>Let&#8217;s continue with our function:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def transcribe_and_summarize(page_link: str) -> tuple[str, str, str, str]:
    ...previous code...

    whisper_output = WHISPER_MODEL.transcribe(str(mp3_file))
    with open(BASE_DIR / "pods_log.txt", "w", encoding="utf-8") as f:
        f.write(str(whisper_output))

    transcription = str(whisper_output["text"])
    summary = podcast.get_summary(transcription)</pre>



<p>We call the <code>.transcribe</code> function, passing in the <code>mp3_file</code> path as a string. This returns a dictionary with the transcription and other information, which we catch in <code>whisper_output</code>. We then open a file called <code>pods_log.txt</code> in our root directory in write mode and write the <code>whisper_output</code> to it. This is just for debugging purposes, so we can see what the output looks like (it&#8217;s too long to print to the console). We then extract the <code>transcription</code> from the <code>whisper_output</code> dictionary. Note that <code>whisper_output["text"]</code> is already a string; we wrapped it in a <code>str()</code> call just to make the type explicit for typing purposes. This adds no extra overhead, as values that are already strings pass through <code>str()</code> unaltered. Then we call our <code>podcast.get_summary</code> function, passing in the <code>transcription</code> as an argument.</p>
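


<p>For reference, the dictionary Whisper returns contains more than just the text. Roughly sketched (values abbreviated, not actual output):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Rough shape of whisper_output (abbreviated):
# {
#     "text": " The full transcription as one string...",
#     "segments": [
#         {"id": 0, "start": 0.0, "end": 4.2, "text": " First few words...", ...},
#         ...
#     ],
#     "language": "en",
# }</pre>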



<p>Now we just need to write the subtitles to disk and return all the outputs. Continue on:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def transcribe_and_summarize(page_link: str) -> tuple[str, str, str, str]:
    ...previous code...

    get_sub_path = lambda ext: OUTPUT_TEMP_DIR / f"{unique_id}{ext}"
    vtt_subs = subtitles.write_to_file(whisper_output, VTT_WRITER, get_sub_path(".vtt"))
    srt_subs = subtitles.write_to_file(whisper_output, SRT_WRITER, get_sub_path(".srt"))

    return (summary, transcription, str(vtt_subs), str(srt_subs))</pre>



<p>We create a lambda (nameless) function that takes a file extension as input and returns the path to the subtitle file with that extension. For example, inputting <code>.vtt</code> will yield <code>output_temp_files/unique_id.vtt</code>, while <code>.srt</code> will yield <code>output_temp_files/unique_id.srt</code>; this just saves us from repeating the same code twice. Then we call our <code>subtitles.write_to_file</code> function twice, passing in the <code>whisper_output</code>, the <code>VTT_WRITER</code> or <code>SRT_WRITER</code> writer object, and the result of <code>get_sub_path</code> for the matching extension. We catch the output of these two calls in <code>vtt_subs</code> and <code>srt_subs</code> respectively. Finally, we return a tuple containing the <code>summary</code>, <code>transcription</code>, <code>vtt_subs</code>, and <code>srt_subs</code> to finish off our function.</p>



<p>The whole thing now looks like this:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def transcribe_and_summarize(page_link: str) -> tuple[str, str, str, str]:
    unique_id = uuid.uuid4()

    podcast_download_url = podcast.scrape_link_from_page(page_link)
    mp3_file: Path = podcast.download(podcast_download_url, unique_id, OUTPUT_TEMP_DIR)

    whisper_output = WHISPER_MODEL.transcribe(str(mp3_file))
    with open(BASE_DIR / "pods_log.txt", "w", encoding="utf-8") as f:
        f.write(str(whisper_output))

    transcription = str(whisper_output["text"])
    summary = podcast.get_summary(transcription)

    get_sub_path = lambda ext: OUTPUT_TEMP_DIR / f"{unique_id}{ext}"
    vtt_subs = subtitles.write_to_file(whisper_output, VTT_WRITER, get_sub_path(".vtt"))
    srt_subs = subtitles.write_to_file(whisper_output, SRT_WRITER, get_sub_path(".srt"))

    return (summary, transcription, str(vtt_subs), str(srt_subs))</pre>



<h2 class="wp-block-heading">Creating the gradio interface</h2>



<p>That&#8217;s all well and good, but a typical end user does not know how to use Python, and this function is not very user-friendly on its own. So let&#8217;s create a quick gradio interface to make our app easy to use. Continue below the function:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">if __name__ == "__main__":
    block = gr.Blocks(css=str(STYLES_DIR / "whisper_pods.css"))

    with block:
        with gr.Group():
            # Header

            # Input textbox for podcast link

            # Button to start transcription

            # Output elements

            # btn.click definition

    block.launch(debug=True)</pre>



<p>This is going to be the basic structure of our <code>gradio</code> application. First, we use <code>if __name__ == "__main__":</code> to make sure the code inside this block only runs if we run this file directly, and not if we import it from another file. Then we create a <code>block</code> object by calling <code>gr.Blocks</code> and passing in the path to our <code>whisper_pods.css</code> file in the <code>styles</code> directory as a string. This will allow us to style our app with CSS, which we&#8217;ll do in a bit (this .css file doesn&#8217;t exist yet). Then we open a <code>with block:</code> block, and inside this block we open a <code>with gr.Group():</code> block. This will allow us to group elements together in our app. Then we have a bunch of comments to indicate what we&#8217;ll be doing in each block, which we&#8217;ll fill in in a moment. Finally, we call <code>block.launch</code> to launch our app, passing in <code>debug=True</code> so we get extra feedback in the console if anything goes wrong.</p>



<ul class="wp-block-list">
<li>The header will hold a logo image for our application. We&#8217;ll use HTML to load it from the internet. We can call <code>gr.HTML</code> to create an HTML element, and we can pass in the HTML code as a string. We&#8217;ll use a <code>div</code> element with a <code>header</code> class, and inside this <code>div</code> we&#8217;ll have an <code>img</code> element with a link to our logo image, which I just quickly uploaded to &#8220;imgur&#8221;. We&#8217;ll also set the <code>referrerpolicy</code> to <code>no-referrer</code> to avoid any issues with the image not loading (imgur doesn&#8217;t work with a <code>localhost</code> referrer, which is what you&#8217;ll have when you run this app locally).</li>
</ul>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">gr.HTML(
    f"""
    &lt;div class="header">
    &lt;img src="https://i.imgur.com/8Xu2rwG.png" referrerpolicy="no-referrer" />
    &lt;/div>
    """
)</pre>



<ul class="wp-block-list">
<li>The input textbox will be where the user can paste in the podcast link. We can just call <code>gr.Textbox</code> to create a textbox element, and we can pass in a label to indicate what the textbox is for. We&#8217;ll call it &#8220;Google Podcasts Link&#8221; and we&#8217;ll catch the input in a variable called <code>podcast_link_input</code>.</li>
</ul>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">podcast_link_input = gr.Textbox(label="Google Podcasts Link:")</pre>



<ul class="wp-block-list">
<li>The button will be the trigger that starts the main function. I want a full row button so we&#8217;ll call <code>gr.Row</code> to create a row element, and then we&#8217;ll call <code>gr.Button</code> to create a button element. We can just pass in the button text we want to display and associate the button with the variable name <code>btn</code>. We&#8217;ll use this <code>btn</code> object later to define the button&#8217;s behavior.</li>
</ul>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">with gr.Row():
    btn = gr.Button("<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f399.png" alt="🎙" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Transcribe and summarize my podcast! <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f399.png" alt="🎙" class="wp-smiley" style="height: 1em; max-height: 1em;" />")</pre>



<ul class="wp-block-list">
<li>The output elements will be the summary, transcription, and two subtitle files. The first two are just a <code>gr.Textbox</code>, which does what you&#8217;d expect and allows us to pass in a label, placeholder, and the number of lines to display by default. The <code>autoscroll</code> behavior would scroll all the way down to the bottom when a large transcription text is placed in the box; since we want the user to start reading from the beginning instead of the end, we set this to <code>False</code>. We then have another <code>gr.Row</code> with two <code>gr.File</code> elements, which will end up side-by-side in a single row. The <code>label</code> is just a label, and <code>elem_classes</code> is a list of classes gradio will give the element, so we can target it with CSS later on using the names <code>vtt-sub-file</code> and <code>srt-sub-file</code>.</li>
</ul>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">summary_output = gr.Textbox(
    label="Podcast Summary",
    placeholder="Podcast Summary",
    lines=4,
    autoscroll=False,
)

transcription_output = gr.Textbox(
    label="Podcast Transcription",
    placeholder="Podcast Transcription",
    lines=8,
    autoscroll=False,
)

with gr.Row():
    vtt_sub_output = gr.File(
        label="VTT Subtitle file download", elem_classes=["vtt-sub-file"]
    )
    srt_sub_output = gr.File(
        label="SRT Subtitle file download", elem_classes=["srt-sub-file"]
    )</pre>



<ul class="wp-block-list">
<li>The <code>btn.click</code> is where we define which function to call when the button is clicked, so we give it our <code>transcribe_and_summarize</code> function as the first argument. The second argument is a list of inputs, in this case only our <code>podcast_link_input</code>. The third argument is a list of outputs, in this case, our <code>summary_output</code>, <code>transcription_output</code>, <code>vtt_sub_output</code>, and <code>srt_sub_output</code>. We&#8217;ll use these outputs to display the results of our function to the user. We just told gradio what function to run, and how to map all of the input and output elements we defined in the interface to the input and output arguments of our function!</li>
</ul>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">btn.click(
    transcribe_and_summarize,
    inputs=[podcast_link_input],
    outputs=[
        summary_output,
        transcription_output,
        vtt_sub_output,
        srt_sub_output,
    ],
)</pre>



<p><code>2_whisper_pods.py</code> now looks like this:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">imports

CONSTANTS


def transcribe_and_summarize(...)...
    ...


if __name__ == "__main__":
    block = gr.Blocks(css=str(STYLES_DIR / "whisper_pods.css"))

    with block:
        with gr.Group():
            gr.HTML(
                f"""
                &lt;div class="header">
                &lt;img src="https://i.imgur.com/8Xu2rwG.png" referrerpolicy="no-referrer" />
                &lt;/div>
                """
            )

            podcast_link_input = gr.Textbox(label="Google Podcasts Link:")

            with gr.Row():
                btn = gr.Button("<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f399.png" alt="🎙" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Transcribe and summarize my podcast! <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f399.png" alt="🎙" class="wp-smiley" style="height: 1em; max-height: 1em;" />")

            summary_output = gr.Textbox(
                label="Podcast Summary",
                placeholder="Podcast Summary",
                lines=4,
                autoscroll=False,
            )

            transcription_output = gr.Textbox(
                label="Podcast Transcription",
                placeholder="Podcast Transcription",
                lines=8,
                autoscroll=False,
            )

            with gr.Row():
                vtt_sub_output = gr.File(
                    label="VTT Subtitle file download", elem_classes=["vtt-sub-file"]
                )
                srt_sub_output = gr.File(
                    label="SRT Subtitle file download", elem_classes=["srt-sub-file"]
                )

            btn.click(
                transcribe_and_summarize,
                inputs=[podcast_link_input],
                outputs=[
                    summary_output,
                    transcription_output,
                    vtt_sub_output,
                    srt_sub_output,
                ],
            )

    block.launch(debug=True)</pre>



<h2 class="wp-block-heading">Creating the CSS file</h2>



<p>See how easy it was to write an interface using gradio! There is just one thing left to do: the <code>STYLES_DIR / "whisper_pods.css"</code> file we loaded into gradio doesn&#8217;t actually exist yet! Go ahead and create a new file in the <code>styles</code> directory called <code>whisper_pods.css</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />whisper_pods.css  (new file)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />utils
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />__init__.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />podcast.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />subtitles.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />2_whisper_pods.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env</pre>



<p>Inside <code>whisper_pods.css</code> paste the following code:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">.header {
  padding: 2em 8em;
}

.vtt-sub-file,
.srt-sub-file {
  height: 80px;
}</pre>



<p>We set some padding on the header image by targeting the <code>header</code> class, to stop the image from getting too big. Then we set the height of the subtitle file download boxes to 80px, so they don&#8217;t get smaller than this, keeping them nice and visible.</p>



<p>Now go back to your <code>2_whisper_pods.py</code> file and run it. Give it some time to load up and you&#8217;ll see the following in your terminal:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.</pre>



<p>CTRL + click the link to open it in your browser. You should see the following:</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img decoding="async" src="https://academy.finxter.com/wp-content/uploads/2024/01/2_gradio_interface-1024x877.png" alt="" class="wp-image-4056"/></figure>
</div>


<p>Go ahead and get a Google Podcasts link to input. I&#8217;ll use a short podcast just for the initial test:<br>https://podcasts.google.com/feed/aHR0cDovL2ZlZWRzLmZlZWRidXJuZXIuY29tL1RFRF9BaGFfQnVzaW5lc3M/episode/ZW4udmlkZW8udGFsay50ZWQuY29tOjEwNzMyNDo4MA?sa=X&amp;ved=0CAgQuIEEahcKEwiImYLqr8qDAxUAAAAAHQAAAAAQAQ</p>



<p>And then click the button and wait (I&#8217;ve blurred out the transcription to respect the speaker&#8217;s copyright as this course will be published publicly):</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img decoding="async" src="https://academy.finxter.com/wp-content/uploads/2024/01/2_gradio_output-974x1024.png" alt="" class="wp-image-4055"/></figure>
</div>


<p>Check the summary, transcription, and subtitle files. Try other podcasts from https://podcasts.google.com/. Play around and have fun! My transcription was very good using just the <code>base</code> Whisper model we loaded up; I never even needed a bigger one. If you use non-English languages you may need a bigger model, though. You can also use a <code>.en</code> model like <code>base.en</code> or <code>small.en</code> to get higher accuracy if you will only input English podcasts.</p>
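


<p>Swapping models is a one-line change at the top of the file, for example:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">WHISPER_MODEL = whisper.load_model("base.en")  # English-only variant of "base"
# or, trading speed and memory for accuracy:
# WHISPER_MODEL = whisper.load_model("small.en")</pre>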



<p>Also take a look at the <code>pods_log.txt</code> file you wrote to the root directory of your project, which holds the full Whisper output. It can help you pinpoint where any problems are and how confident the model was while transcribing.</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p>There we go, that is a pretty good initial minimum viable product! Of course, it has much room for improvement, for instance by using a proper front-end framework like React and streaming the transcription live to the page so the user is not left waiting so long before seeing results.</p>



<p>You could also use asyncio to make the ChatGPT summary call asynchronous, speeding things up slightly by writing the subtitle files to disk while the summary call runs. And of course, you&#8217;d want some kind of cleanup function to get rid of all the downloaded mp3 files hanging around in your <code>output_temp_files</code> folder; if you check it, you will see all the files with names like <code>0e0f5d05-9379-4124-a84d-81de7eb3e314.mp3</code> we generated, plus the subtitle files with the same name for each mp3 file. A minimal sketch of such a cleanup follows below.</p>
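


<p>As one possible starting point, here is a minimal cleanup sketch, assuming you simply want to delete every generated file in the temp folder:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from settings import OUTPUT_TEMP_DIR


def cleanup_temp_files() -> None:
    """Delete the generated mp3 and subtitle files from the temp folder."""
    for file in OUTPUT_TEMP_DIR.iterdir():
        if file.suffix in {".mp3", ".vtt", ".srt"}:
            file.unlink()</pre>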



<p>I&#8217;ll leave the rest up to your imagination! That&#8217;s it for part 2, I&#8217;ll see you soon in part 3, where we&#8217;ll be using Whisper to create a fully automatic video subtitling tool that takes only a video file as input, then transcribes the audio, creates subtitles, and embeds them into the video at the correct times! It will be fun, see you there!</p>



<h2 class="wp-block-heading">Full Course: OpenAI Whisper &#8211; Building Cutting-Edge Python Apps with OpenAI Whisper</h2>



<p>Check out our full OpenAI Whisper course with video lessons, easy explanations, GitHub, and a downloadable PDF certificate to prove your speech processing skills to your employer and freelancing clients:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://academy.finxter.com/university/openai-whisper/"><img decoding="async" width="908" height="257" src="https://blog.finxter.com/wp-content/uploads/2024/01/image-154.png" alt="" class="wp-image-1654506" srcset="https://blog.finxter.com/wp-content/uploads/2024/01/image-154.png 908w, https://blog.finxter.com/wp-content/uploads/2024/01/image-154-300x85.png 300w, https://blog.finxter.com/wp-content/uploads/2024/01/image-154-768x217.png 768w" sizes="(max-width: 908px) 100vw, 908px" /></a></figure>
</div>


<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f449.png" alt="👉" class="wp-smiley" style="height: 1em; max-height: 1em;" /> [<strong>Academy</strong>] <a href="https://academy.finxter.com/university/openai-whisper/" data-type="link" data-id="https://academy.finxter.com/university/openai-whisper/">Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper</a></p>
<p>The post <a href="https://blog.finxter.com/openai-whisper-building-a-podcast-transcribing-app-in-python/">OpenAI Whisper &#8211; Building a Podcast Transcribing App in Python</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>OpenAI Whisper &#8211; Python Installation, Setup, &#038; First Steps to Speech-to-Text Synthesis</title>
		<link>https://blog.finxter.com/openai-whisper-python-installation-setup-first-steps-to-speech-to-text-synthesis/</link>
		
		<dc:creator><![CDATA[Dirk van Meerveld]]></dc:creator>
		<pubDate>Thu, 25 Jan 2024 19:55:30 +0000</pubDate>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Large Language Model (LLM)]]></category>
		<category><![CDATA[OpenAI]]></category>
		<category><![CDATA[Prompt Engineering]]></category>
		<category><![CDATA[Speech Recognition and Generation]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1654502</guid>

					<description><![CDATA[<p>🎙️ Course: This article is based on a lesson from our Finxter Academy Course Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper. Check it out for video lessons, GitHub, and a downloadable PDF course certificate with your name on it! Welcome to this first part of the Whisper course. My name is Dirk ... <a title="OpenAI Whisper &#8211; Python Installation, Setup, &#038; First Steps to Speech-to-Text Synthesis" class="read-more" href="https://blog.finxter.com/openai-whisper-python-installation-setup-first-steps-to-speech-to-text-synthesis/" aria-label="Read more about OpenAI Whisper &#8211; Python Installation, Setup, &#038; First Steps to Speech-to-Text Synthesis">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/openai-whisper-python-installation-setup-first-steps-to-speech-to-text-synthesis/">OpenAI Whisper &#8211; Python Installation, Setup, &#038; First Steps to Speech-to-Text Synthesis</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f399.png" alt="🎙" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Course</strong>: This article is based on a lesson from our <strong>Finxter Academy Course</strong> <a href="https://academy.finxter.com/university/openai-whisper/"><em>Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper</em></a>. Check it out for video lessons, GitHub, and a downloadable PDF course certificate with your name on it!</p>



<p>Welcome to this first part of the Whisper course. My name is Dirk van Meerveld and it is my pleasure to be your host and guide for this tutorial series where we will be looking at OpenAI&#8217;s amazing speech-to-text model called Whisper.</p>



<p>We&#8217;ll first take a look at what it is and how its basic usage works, and then we&#8217;ll explore ways in which we can practically use it in our projects. Along the way, we&#8217;ll learn about the balance between model size and accuracy, and in the final part, we&#8217;ll look at alternative options to speed it up or outsource the processing to OpenAI&#8217;s servers.</p>



<p>The local installation process should not be too much of a problem, but it differs a bit across operating systems and setups. Unfortunately, I cannot cover every single possible system configuration, so you may have to do some googling and trial and error along the way.</p>



<p>This is an inevitable part of software development. Don&#8217;t give up; you will get it working eventually. We all get stuck trying to make something work on our particular system sometimes; it&#8217;s just part of the job.</p>



<p>If you do not like a particular configuration, such as running the model locally, rest assured we will cover both the different ways to run Whisper and various implementation projects over the series. Just watch through the whole thing, then take whatever projects you like and combine them with whichever way of running Whisper you preferred.</p>



<h2 class="wp-block-heading">Installing Whisper</h2>



<p>First, we need to install Whisper. We&#8217;ll be using the pip package manager for this, so make sure you have it installed (as a Python user, you almost certainly do). In a terminal window, run the following command:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pip install -U openai-whisper</pre>



<p>The <code>-U</code> flag in the <code>pip install -U openai-whisper</code> command stands for <code>--upgrade</code>. It means that Whisper will either be installed or upgraded to the latest version if it is already installed.</p>



<p>The second thing we need to have installed is <code>ffmpeg</code>. What is <code>ffmpeg</code>? FFmpeg is a versatile multimedia framework that allows us to work with audio and video files. It supports a wide range of formats and is highly portable, running on pretty much any operating system.</p>



<p>The simplest way to install <code>ffmpeg</code> is to use a package manager. If you&#8217;re on Windows, you can use <a href="https://chocolatey.org/install">Chocolatey</a> to install <code>ffmpeg</code> by running the following command in a terminal window:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># on Windows / Chocolatey
choco install ffmpeg</pre>



<p>If you&#8217;re on MacOS using Homebrew, you can install <code>ffmpeg</code> by running the following command in a terminal window:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># on MacOS / Homebrew
brew install ffmpeg</pre>



<p>If you&#8217;re on Linux, you probably know what to do and don&#8217;t need instructions! On Debian or Ubuntu, for example:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># on Linux / apt
sudo apt update &amp;&amp; sudo apt install ffmpeg</pre>



<p>This may honestly be the most challenging part of the tutorial series. You may not run into any issues if your system is already set up well, or you may need to do quite a bit of googling and setup work to get everything up and running. It took me some messing around to get everything working properly on my system, and it&#8217;s unfortunately impossible to predict exactly what you will need to do to resolve any issues you may run into. Google is your friend! Remember, we&#8217;ll also cover the API in part 4 if you don&#8217;t want to run the model locally, but don&#8217;t just skip ahead, as you&#8217;d miss out on a lot of useful information.</p>
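


<p>Once both installs are done, a quick sanity check can save debugging time later. Here is a minimal sketch (just an optional check, not one of the project files) that verifies Python can see Whisper and that <code>ffmpeg</code> is on your PATH:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Minimal sanity check: can Python import Whisper, and is ffmpeg on the PATH?
import shutil

import whisper

print(whisper.available_models())  # should list tiny, base, small, medium, large, ...
print(shutil.which("ffmpeg") or "ffmpeg not found on PATH!")</pre>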



<h2 class="wp-block-heading">What is Whisper?</h2>



<p>Whisper is a speech-to-text model developed by OpenAI. What is really cool is that they released this model as open source to the public. It is a neural network that takes audio as input and outputs text. It was trained on a large dataset of audio and text pairs and has learned which text corresponds to which audio. What is exciting about the model is that it&#8217;s not just effective at transcribing high-quality &#8216;gold-standard&#8217; audio recorded on studio microphones, but is also very good at transcribing audio of considerably lower quality, or even imperfect pronunciation with a foreign accent. If you compare it with auto-generated subtitles from YouTube, for example, you will see that it really is a level apart.</p>



<p>Instead of diving deep into the model&#8217;s architecture and technical details that make it work behind the scenes, this course will focus on the practical application of what we can do with it and how to use it to make cool stuff.</p>



<h2 class="wp-block-heading">Model sizes</h2>



<p>There are different sizes available for the Whisper model. The smaller the model, the less processing power and VRAM it needs, and the faster it will run; this comes at the cost of lower accuracy. Conversely, the larger the model, the more processing power and VRAM it needs and the longer it will take to run, but the more accurate it will be and the better it will deal with foreign languages, noise, and poor audio quality.</p>



<figure class="wp-block-table"><table><thead><tr><th>Size</th><th>Parameters</th><th>English-only model</th><th>Multilingual model</th><th>Required VRAM</th><th>Relative Speed</th></tr></thead><tbody><tr><td>tiny</td><td>39M</td><td>tiny.en</td><td>tiny</td><td>~1GB</td><td>~32x</td></tr><tr><td>base</td><td>74M</td><td>base.en</td><td>base</td><td>~1GB</td><td>~16x</td></tr><tr><td>small</td><td>244M</td><td>small.en</td><td>small</td><td>~2GB</td><td>~6x</td></tr><tr><td>medium</td><td>769M</td><td>medium.en</td><td>medium</td><td>~5GB</td><td>~2x</td></tr><tr><td>large</td><td>1550M</td><td>N/A</td><td>large</td><td>~10GB</td><td>1x</td></tr></tbody></table></figure>



<p>As we can see in this table from the <a href="https://github.com/openai/whisper">Whisper GitHub</a>, we have 5 different model sizes in total. There are 4 sizes for the English-only model, namely <code>tiny.en</code>, <code>base.en</code>, <code>small.en</code>, and <code>medium.en</code>. It is highly recommended to use one of these when you know you&#8217;re going to be transcribing English, as these models specialize in English only and therefore give greater accuracy at a much smaller model size and run-time. This is also why there is no <code>large.en</code> model: <code>medium.en</code> is already large enough to match the accuracy of the <code>large</code> multilingual model.</p>



<p>For the multilingual models, we have the <code>tiny</code>, <code>base</code>, <code>small</code>, <code>medium</code>, and <code>large</code> sizes. Whisper was trained on a whopping 680,000 hours of audio data covering a total of 97 different languages, though performance does vary per language, as more obscure languages may not work quite as well. The larger the model size, the better it will deal with such languages, specific accents, and poor audio quality.</p>



<p>Now if you don&#8217;t have 10GB of VRAM, don&#8217;t worry: you can often get away with using the smaller-size models, as you will see. Later on, in the last part of the series, we&#8217;ll look at smaller &#8216;distilled&#8217; versions of the model that can help us optimize speed further, or at outsourcing the processing to the lightning-fast OpenAI servers. Just keep watching! That being said, I actually recommend you always use the smallest version that you can get away with for your specific task. There is simply no point in adding more cost and complexity to your apps; if you don&#8217;t need it, the extra model size will only slow down your application and raise its cost.</p>



<h2 class="wp-block-heading">Basic usage</h2>



<p>Now that we have Whisper, fire up your favorite code editor and let&#8217;s get started! I&#8217;ll be using VSCode, but you can use whatever IDE you like. Create a root folder for your project (I&#8217;ll call mine <code>FINX_WHISPER</code>), and then inside, make a new file called <code>1_basic_call_english_only.py</code>. (I&#8217;m using numbers for the file names so you can easily reference them later when you&#8217;re busy coding some cool new project, but this is obviously not a good general naming convention):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py</pre>



<p>Then open up the new Python file and start with the imports:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import whisper
from pathlib import Path</pre>



<p>The <code>whisper</code> import is obvious, and <code>pathlib</code> will help us get the path to the audio files we want to transcribe. This way, our Python file will be able to locate our audio files even if the terminal window is not currently in the same directory as the Python file. Now let&#8217;s declare some constants:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">MODEL = whisper.load_model("base.en")
AUDIO_DIR = Path(__file__).parent / "test_audio_files"</pre>



<p>First, we declare <code>MODEL</code> and load the <code>base.en</code> model. We start with the second-smallest English-only model and will scale up if and when we need to. Then we declare <code>AUDIO_DIR</code> and use <code>pathlib</code> to build the path. This works by first getting the path to the current file (<code>1_basic_call_english_only.py</code>) using <code>__file__</code>, then getting that file&#8217;s parent directory using <code>.parent</code>, and finally appending the <code>test_audio_files</code> folder using the <code>/</code> operator. This way we can easily access the audio files in the <code>test_audio_files</code> folder from our Python file.</p>



<p>Now let&#8217;s create the <code>test_audio_files</code> folder, as it doesn&#8217;t actually exist yet. Make sure you spell it correctly:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py</pre>



<p>Then go ahead and add the provided audio files to the folder. They should come together with this video tutorial, but if for any reason you cannot find them, go to the Finxter GitHub repository for this course, where you can find a copy at:</p>



<figure class="wp-block-embed"><div class="wp-block-embed__wrapper">
https://github.com/DirkMeer/finx_whisper
</div></figure>



<p>Download all the test files and put them in the folder (you can also add your own audio files if you want to, these are just provided for your convenience):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f50a.png" alt="🔊" class="wp-smiley" style="height: 1em; max-height: 1em;" />dutch_long_repeat_file.mp3
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f50a.png" alt="🔊" class="wp-smiley" style="height: 1em; max-height: 1em;" />dutch_the_netherlands.mp3
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f50a.png" alt="🔊" class="wp-smiley" style="height: 1em; max-height: 1em;" />high_quality.mp3
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f50a.png" alt="🔊" class="wp-smiley" style="height: 1em; max-height: 1em;" />low_quality.mp3
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f50a.png" alt="🔊" class="wp-smiley" style="height: 1em; max-height: 1em;" />terrible_quality.mp3
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py</pre>



<p>Ok, back to our <code>1_basic_call_english_only.py</code> file. Below the <code>MODEL</code> and <code>AUDIO_DIR</code> variables, let&#8217;s create a function that will transcribe the audio files for us:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def get_transcription(audio_file: str):
    result = MODEL.transcribe(audio_file)
    print(result)
    return result</pre>



<p>This function takes an audio file&#8217;s path as a string. We then call the <code>.transcribe()</code> method Whisper provides for us and pass in that path. Then we simply print and return the result for a basic test. Looks really simple, right?</p>



<p>First, let&#8217;s try and transcribe a high-quality English audio file, as a sort of best-case scenario:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">get_transcription(str(AUDIO_DIR / "high_quality.mp3"))</pre>



<p>Notice that the function we wrote above takes the path as a string. This is because Whisper requires the path to the audio file as a string. <code>AUDIO_DIR / "high_quality.mp3"</code> returns a <code>Path</code> object, so we use <code>str()</code> to convert it to a string, or else Whisper will crash.</p>
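


<p>To make the distinction concrete, here is a tiny sketch (assuming the <code>AUDIO_DIR</code> constant from above) showing both types side by side:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Path object vs. the plain string Whisper expects
p = AUDIO_DIR / "high_quality.mp3"
print(type(p))  # pathlib.PosixPath (or WindowsPath, depending on your OS)
print(str(p))   # the plain string path we pass to .transcribe()</pre>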



<h2 class="wp-block-heading">Getting a transcription</h2>



<p>So go ahead and save and run the file, and you will see a large object containing all the output. Let&#8217;s take a quick look at the information available to us here, read the comments for an explanation:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">{
    # First we get the full transcription
    "text": " Hi guys, this is just a quick test audio file for you. Let's see how well it does and if my speech is recognized and converted to text properly. I'm really excited to see how well this works and I hope that it will be a good test for you guys to see how well the whisper model works.",
    # Now we have the list of segments
    "segments": [
        {
            "id": 0,
            "seek": 0,
            # Start and end times in seconds
            "start": 0.0,
            "end": 3.52,
            "text": " Hi guys, this is just a quick test audio file for you.",
            # list of tokenized words from the transcription, where each word is represented by a unique number
            "tokens": [ 50363, 15902, 3730, 11, 428, 318, 655, 257, 2068, 1332, 6597, 2393, 329, 345, 13, 50539 ],
            "temperature": 0.0,
            # In the context of machine learning, temperature is a parameter that controls the randomness of predictions. A temperature of 0.0 suggests no randomness, or the model always selecting the tokens(words) with the highest probability (This is similar to the ChatGPT API temperature setting). You can pass a temperature value to the transcribe function when calling it if you want to introduce more randomness into your generations.
            # For instance: model.transcribe(audio_file, temperature=0.2)
            "avg_logprob": -0.1399546700554925,
            # The average log probability of the tokens in the segment. The closer to 0 the better, which means if the numbers get more negative, like -0.2 for instance, it means it's much less confident in it's transcription (and there are probably more errors).
            "compression_ratio": 1.5898876404494382,
            "no_speech_prob": 0.0045762090012431145,
            # Represents the probability that the segment contains no speech. We can see that it is very low.
        },
        {
            '... more segments with the same structure as above, cut for brevity ...'
        },
    ],
    "language": "en",
}</pre>



<p>As we can see, we really get a lot of information back from the model! Most interesting is of course the transcription itself. Notice that it is a perfect word-for-word transcription even though we used <code>base.en</code>, the second-smallest English-only model. Very impressive for such a small version of the model! Now let&#8217;s try a lower-quality audio file:</p>



<p>Replace the last call:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">get_transcription(str(AUDIO_DIR / "high_quality.mp3"))</pre>



<p>with:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">get_transcription(str(AUDIO_DIR / "low_quality.mp3"))</pre>



<p>And when we run this with the considerably lower-quality audio file, still on the <code>base.en</code> model, I still get a perfect transcription. If we look closely at the output object, though, we can clearly see the <code>avg_logprob</code> (explained above) has moved further away from 0, from <code>-0.1399546700554925</code> to <code>-0.2179246875974867</code>, indicating the model is now much less confident in its transcription (though still correct).</p>
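


<p>If you want to compare this confidence signal across files yourself, here is a small sketch (reusing the <code>MODEL</code> and <code>AUDIO_DIR</code> constants from this file) that averages <code>avg_logprob</code> over all segments of each file:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Compare Whisper's confidence per file (avg_logprob: closer to 0 is better)
for name in ("high_quality.mp3", "low_quality.mp3"):
    result = MODEL.transcribe(str(AUDIO_DIR / name))
    probs = [segment["avg_logprob"] for segment in result["segments"]]
    print(f"{name}: {sum(probs) / len(probs):.4f}")</pre>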



<p>Now let&#8217;s try a really poor-quality audio file:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">get_transcription(str(AUDIO_DIR / "terrible_quality.mp3"))</pre>



<p>And if we run this we can see that it is still half correct even though a human would have trouble understanding it:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">Hi guys. This is just a quick test audio file for you. Let's see how well it does and if my speech is recognized, thank you for the context properly. I'm really excited to see how well this works and I hope that it will be a quick test for you guys to see how well the whisper model works.</pre>



<p>We have clearly reached the limits of the base model here as part of this is incorrect, and it&#8217;s time to step up to a bigger model size. (Remember, you generally want to use the smallest model you can get away with for your use case!)</p>



<p>I&#8217;m going to change the model to <code>small.en</code> by editing the <code>MODEL</code> variable at the top of our file:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">MODEL = whisper.load_model("small.en")</pre>



<p>Now if we run it again:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">Hi guys, this is just a quick test audio file for you. Let's see how well it does, and if my speech is recognized and converted to text properly, I'm really excited to see how well this works, and I hope that it will be a good test for you guys to see how well the Whisper model works.</pre>



<p>There is an awkward, super-long sentence with a few too many commas, but apart from that it&#8217;s perfect, even though the audio quality of this file is pretty terrible. Switching to <code>medium.en</code> fixes that last small imperfection with the commas, by the way. This is the power of Whisper!</p>



<h2 class="wp-block-heading">Taking a deeper look</h2>



<p>Now let&#8217;s take a slightly deeper look at what is happening inside Whisper, and at using other languages and even translation along the way. Make a new file in your root folder called <code>1_multiple_languages.py</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py</pre>



<p>Then open up the new <code>1_multiple_languages.py</code> file and start with the imports:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import whisper
from pathlib import Path

AUDIO_DIR = Path(__file__).parent / "test_audio_files"
model = whisper.load_model("base")</pre>



<p>Make sure to use the <code>base</code> model this time, and not the <code>base.en</code> model, as we want to use all available languages.</p>



<p>First, we&#8217;ll take a slightly deeper look under the hood to get a rough idea of what is going on, as this will help us understand some important nuances. After that, we&#8217;ll greatly simplify the whole thing using the higher-level code again. Let&#8217;s write a function that detects the language and transcribes a file for us, and we&#8217;ll explain it line by line.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def detect_language_and_transcribe(audio_file: str):
    audio = whisper.load_audio(audio_file)</pre>



<p>We define a function which takes the path to an <code>audio_file</code> as a string argument. We then call Whisper&#8217;s <code>.load_audio()</code> method and pass in the audio file&#8217;s path. This returns a NumPy array containing the audio waveform in float32 datatype; in other words, an array containing the audio data as a giant list of numbers.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">    audio = whisper.pad_or_trim(audio)</pre>



<p>Next, we get a 30-second sample, either padding with silence if the file is shorter than 30 seconds or trimming it if it is longer. This is because the Whisper model is built and trained to take 30 seconds of audio as its input data each time. This doesn&#8217;t mean you cannot transcribe longer files but does have some implications we&#8217;ll get back to later.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">    mel = whisper.log_mel_spectrogram(audio).to(model.device)</pre>



<p>Make a log-Mel spectrogram and move it to the same device as the model (e.g. your GPU). A log-Mel spectrogram is a representation of a sound or audio signal that has been transformed to highlight certain perceptual characteristics.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f468-200d-1f3eb.png" alt="👨‍🏫" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Spectrogram: A spectrogram is a visual representation of the spectrum of frequencies in a sound or other signal as they vary with time. It's essentially a heat map where x is time, the y-axis is frequency, and the color represents the loudness.

<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f468-200d-1f3eb.png" alt="👨‍🏫" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Mel Scale: The Mel scale is a perceptual scale of pitches that emulates the human ear's response to different frequencies. We humans are much better at distinguishing small changes in pitch at low frequencies than at high frequencies. The Mel scale makes the representation match more closely with human perception as opposed to the exact mathematical frequencies.

<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f468-200d-1f3eb.png" alt="👨‍🏫" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Logarithmic Scale: Taking the logarithm of the spectrogram values is another step to make the representation more closely match human perception. We perceive loudness on a logarithmic scale (which is why we use decibels, a logarithmic measurement, to express the loudness of sound).

<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f468-200d-1f3eb.png" alt="👨‍🏫" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Combining these, a log-Mel spectrogram is a representation of sound that is designed to highlight the aspects that are most important for human perception. It's commonly used in audio processing tasks, including speech and music recognition.</pre>



<p>Now that we have this log-Mel spectrogram, we can use it to detect the language of our audio file. We do this by passing it to the <code>.detect_language()</code> method of our model:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">    language_token, language_probs = model.detect_language(mel)</pre>



<p>This returns the <code>language_token</code>, which is a number we will not be using, and <code>language_probs</code>, a huge dictionary mapping each candidate language to the probability that it matches the sound file. As we won&#8217;t actually be using the <code>language_token</code> variable, we can replace it with a <code>_</code> to indicate as much. This makes it a sort of throwaway variable that we don&#8217;t care about.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">    _, language_probs = model.detect_language(mel)</pre>



<p>Let&#8217;s take what we have so far, add a print statement to check out the <code>language_probs</code>, and run it using the <code>dutch_the_netherlands.mp3</code> file I prepared for you:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def detect_language_and_transcribe(audio_file: str):
    audio = whisper.load_audio(audio_file)
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    _, language_probs = model.detect_language(mel)
    print(language_probs)

detect_language_and_transcribe(str(AUDIO_DIR / "dutch_the_netherlands.mp3"))</pre>



<p>Now when we run this we can see the massive <code>language_probs</code> list printed to our console:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">{
    '.. cut for brevity ..'
    "yi": 2.012418735830579e-05,
    "ka": 2.161949907986127e-07,
    "nl": 0.9650669693946838,
    "en": 0.010499916970729828,
    "ko": 9.358442184748128e-05,
    "mn": 5.96029394728248e-06,
    "de": 0.010318436659872532,
    '.. cut for brevity ..'
}</pre>



<p>We have a huge dictionary of numbers here, as you can see. The higher the number, the more likely the language; many of the probabilities are on the order of 10 to the power of <code>-4</code>, <code>-5</code>, <code>-6</code>, or even lower. We can clearly see that <code>nl</code> (Dutch) is by far the highest probability, close to a perfect 1 score at <code>0.965</code>. The second and third highest are <code>en</code> (English) and <code>de</code> (German) with <code>0.010</code> each, which is not even close, so we can be very confident that this is Dutch. Impressive for a model as small as <code>base</code> that deals with so many languages, especially since Dutch is not that big a language.</p>



<p>Of course, we don&#8217;t want this whole dictionary; we just want the most probable language, so we can use the <code>max</code> function to find it.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def detect_language_and_transcribe(audio_file: str):
    ...
    language: str = max(language_probs, key=language_probs.get)
    print(f"Detected language: {language}")</pre>



<p><code>max</code> returns the key of the largest value in the dictionary. We pass in the dictionary as the first argument, which makes <code>max</code> iterate over its keys. The <code>key</code> argument is a function that is called on each of those keys, and the key for which the function returns the largest value is the result of the <code>max</code> call. We can simply use the dictionary&#8217;s own <code>.get()</code> method as that function to look up the probability of each language.</p>
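


<p>Here is the same trick in isolation on a toy dictionary (hypothetical values, purely for illustration):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># max() iterates over the keys; key=scores.get ranks them by their values
scores = {"nl": 0.965, "en": 0.010, "de": 0.010}
print(max(scores, key=scores.get))  # nl</pre>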



<p>The language name codes are in ISO 639-1 format and can be found <a href="https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes">here</a>. We add a print statement to print the detected language. I removed the previous print statement <code>print(language_probs)</code> we added before.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def detect_language_and_transcribe(audio_file: str):
    ...
    language: str = max(language_probs, key=language_probs.get)
    print(f"Detected language: {language}")
    options = whisper.DecodingOptions(language=language, task="transcribe")
    result = whisper.decode(model, mel, options)
    print(result)
    return result.text</pre>



<p>Now we&#8217;ll decode this 30-second audio file into text. First, we create a <code>DecodingOptions</code> object and save it in the variable named options. The <code>DecodingOptions</code> object lets you set more advanced decoding options, but we&#8217;ll stick to basics for now, passing in the <code>language</code> we detected and the task of &#8220;transcribe&#8221;. We then call the <code>whisper.decode</code> function which performs decoding of the 30-second audio segment(s), provided as log-Mel spectrogram(s). We pass in the model, the mel spectrogram, and the options. This returns a <code>DecodingResult</code> object which we save in the variable named <code>result</code>. We then print the <code>result</code> and return the <code>result.text</code>.</p>



<p>The whole function now looks like this:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def detect_language_and_transcribe(audio_file: str):
    audio = whisper.load_audio(audio_file)
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    _, language_probs = model.detect_language(mel)
    language: str = max(language_probs, key=language_probs.get)
    print(f"Detected language: {language}")
    options = whisper.DecodingOptions(language=language, task="transcribe")
    result = whisper.decode(model, mel, options)
    print(result)
    return result.text</pre>



<p>Now let&#8217;s run it with the <code>dutch_the_netherlands.mp3</code> file again:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">dutch_test = detect_language_and_transcribe(
    str(AUDIO_DIR / "dutch_the_netherlands.mp3")
)</pre>



<p>When you run this the object printed to the console will have the following transcription:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">'Hoi, allemaal. Dit is weer een testbestandje. Deze keer om te testen of de Nederlandse taal goed herkend gaat worden. Hierna kunnen we ook proberen deze text te laten vertalen naar het Engels om te zien hoe goed dat gaat. Ik ben benieuwd.'</pre>



<p>There we go, a perfect transcription! Now you probably don&#8217;t speak Dutch, but the above is a perfect word-for-word transcription of the spoken text.</p>



<h2 class="wp-block-heading">Back to .transcribe</h2>



<p>Now I&#8217;ll be honest: that was a little overcomplicated if we don&#8217;t need much customization and just want to call the model. Also, we don&#8217;t want to limit ourselves to just 30 seconds of audio. Let&#8217;s go back to Whisper&#8217;s higher-level <code>.transcribe</code> function, which basically does all of the above for us.</p>



<p>Make sure you comment out the <code>dutch_test</code> code so it doesn&#8217;t keep running:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># dutch_test = detect_language_and_transcribe(
#     str(AUDIO_DIR / "dutch_the_netherlands.mp3")
# )</pre>



<p>Now all we need to do to use <code>.transcribe</code> is load a model (<code>model = whisper.load_model("base")</code>) which we already did in this file, and then call the <code>.transcribe</code> method on the model and pass in the path to the audio file as a string:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">result = model.transcribe(str(AUDIO_DIR / "dutch_the_netherlands.mp3"), verbose=True)
print(result["text"])</pre>



<p>It also has some options; in this case, we&#8217;ve set <code>verbose</code> to <code>True</code> so it will give us extra information in the console. If you go ahead and run this, you will get the exact same transcription in the output as we did above:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">'Hoi, allemaal. Dit is weer een testbestandje. Deze keer om te testen of de Nederlandse taal goed herkend gaat worden. Hierna kunnen we ook proberen deze text te laten vertalen naar het Engels om te zien hoe goed dat gaat. Ik ben benieuwd.'</pre>



<p>Again, you probably don&#8217;t speak Dutch, but that&#8217;s not the point. Under the hood, the <code>.transcribe</code> function reads the entire audio file and processes it in 30-second windows. You can also see that it did the language detection for us automatically before starting:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: Dutch</pre>



<h2 class="wp-block-heading">Working with longer files</h2>



<p>So that&#8217;s pretty good, right? Well, let&#8217;s try a longer audio file and see what happens. I&#8217;ve provided <code>dutch_long_repeat_file.mp3</code>, which is the same audio file repeated 3 times, totaling just over 40 seconds. Let&#8217;s see what happens when we try to transcribe it (make sure you comment out the run above):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># result = model.transcribe(str(AUDIO_DIR / "dutch_the_netherlands.mp3"), verbose=True)
# print(result["text"])


result = model.transcribe(
    str(AUDIO_DIR / "dutch_long_repeat_file.mp3"),
    verbose=True,
    language="nl",
    task="transcribe",
)
print(result["text"])</pre>



<p>Note that we can pass in the language if we already know it, skipping the detection step and saving some time. So for applications where you always know the language ahead of time, just pass it in to optimize your application. We pass in <code>nl</code>, the ISO 639-1 code for Dutch.</p>



<p>Now let&#8217;s run this and check the output (yours will look different from mine):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">Hoi j allemaal! Dit is weer een testbestandje! Deze keer om te testen of de Nederlandse taal goed herkent gaat worden. Je en bırak�� collecte geval. Je gievous raakt deze tekst te laten vertalen naar het Engels om te zien hoe goed dat gaat. Ik ben benieuwd! Hoi jlynn allemaal! Dit is weer een testbestandje. Deze keer om te testen of de Nederlandse taal goed herkent gaat worden. Je en driesbredmontie kunt wiring die text er metυτ�� mesma halen te laten vertalen naar het Engels om te zien hoe goed dat gaat! Ik ben benieuwd. Hoi allemaal! Dit is weer een testbestandje. Deze keer om te testen of de Nederlandse taal goed herkend gaat worden. Hierna kunnen we ook proberen deze tekst te laten vertalen naar het Engels om te zien hoe goed dat gaat. Ik ben benieuwd.</pre>



<p>Now I&#8217;m not going to make you read this, but as a Dutch person, I will tell you this output is terrible, and there are several characters and many words here that do not even exist in the Dutch language! So what happened? It&#8217;s the same model, and the audio is exactly the same as before; it&#8217;s just a bit longer and repeats itself. We should have gotten the same output, right?</p>



<p>Well, it is because Whisper&#8217;s machine-learning model is limited to audio segments of only 30 seconds as its input, which makes longer audio files more challenging to transcribe. The <code>.transcribe</code> function took care of cutting the audio into 30-second segments, feeding them through, and sort of stitching them back together, making our life a lot easier, so we didn&#8217;t really notice this extra challenge.</p>



<p>While Whisper does use some clever tricks to improve the quality when transcribing longer audio files that need to be cut into 30-second pieces and put back together again, this is inherently just a bit trickier, so we saw a significant drop in transcription quality even though the audio was exactly the same as before (just repeated 3 times in a row to make it longer).</p>
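


<p>To make that windowing concrete, here is a deliberately naive conceptual sketch of fixed 30-second chunking, reusing the lower-level calls from earlier and the library&#8217;s 16 kHz sample-rate constant. This is a simplification: the real <code>.transcribe</code> implementation is smarter and slides the window based on the timestamps it predicts.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Conceptual sketch only: naive fixed 30-second windowing over a long file
CHUNK = 30 * whisper.audio.SAMPLE_RATE  # 30 s of samples at 16 kHz

audio = whisper.load_audio(str(AUDIO_DIR / "dutch_long_repeat_file.mp3"))
texts = []
for start in range(0, len(audio), CHUNK):
    segment = whisper.pad_or_trim(audio[start : start + CHUNK])
    mel = whisper.log_mel_spectrogram(segment).to(model.device)
    options = whisper.DecodingOptions(language="nl", task="transcribe")
    texts.append(whisper.decode(model, mel, options).text)

print(" ".join(texts))</pre>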



<p>Does this mean Whisper is only good for small files? Not at all! All we need to solve this bigger challenge of a minor language (Dutch) combined with files longer than 30 seconds is to just step up to a bigger model!</p>



<p>When changing the model to <code>small</code> instead of <code>base</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">model = whisper.load_model("small")</pre>



<p>I got an almost perfect output with only a single very minor spelling mistake. When I changed to <code>medium</code> afterward it was absolutely perfect. It&#8217;s just a matter of using a bigger model until it works. Pick the model size that corresponds to the size of your challenge.</p>



<h2 class="wp-block-heading">Translating</h2>



<p>Besides just transcribing, as if that wasn&#8217;t awesome enough, Whisper can also translate pretty much all major languages to English. (If you get very hacky it can even translate English to other languages, but that is not an intended or supported feature).</p>



<p>So now let&#8217;s give it an audio file in a non-English language and then ask it for an English translation. We&#8217;ll feed it the <code>dutch_the_netherlands.mp3</code> file again, but this time ask it for a translation (to English) so you can finally find out what I said in the audio!</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">result = model.transcribe(
    str(AUDIO_DIR / "dutch_the_netherlands.mp3"),
    verbose=True,
    language="nl",
    task="translate",
)
print(result["text"])</pre>



<p>Make sure you comment out any calls above so you don&#8217;t run them by accident. I&#8217;ve already tested this, and you&#8217;ll need at least the <code>medium</code> model size to get a good translation, so make sure you load that BEFORE the call above (if your computer can handle it; otherwise just try a smaller one).</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">model = whisper.load_model("medium")</pre>



<p>The output is:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">Hey everyone, this is a test file again. This time to test whether the Dutch language will be recognized well. After this, we can also try to translate this text into English to see how well that goes. I'm curious.</pre>



<p>It&#8217;s really quite a decent translation, straight from spoken audio. That is very impressive. It also handles sloppy pronunciation quite well &#8211; I tested this using my Korean pronunciation, which is not great, and the results were still pretty good.</p>



<p>So different languages, longer files, or slightly less native pronunciation all benefit a lot from moving to larger versions of the model (as long as you have the VRAM for it). I&#8217;ll be sticking to the lower end of the model spectrum for this series as much as possible, as not everyone will have the GPU to run the larger models, but feel free to use a larger one if you have the VRAM for it.</p>



<p>On the flip side, if you can only run the <code>small</code> or even the <code>base</code> models, do not despair! The next two tutorials will achieve very good accuracy on these smaller models, and again, in the last part, we&#8217;ll look at speeding up, optimizing, or outsourcing the processing altogether.</p>



<p>Now that we&#8217;ve got the more boring basics out of the way, it&#8217;s time to build some cool and fun stuff and look at practical applications and integration in the next couple of parts! See you there!</p>



<h2 class="wp-block-heading">Full Course: OpenAI Whisper &#8211; Building Cutting-Edge Python Apps with OpenAI Whisper</h2>



<p>Check out our full OpenAI Whisper course with video lessons, easy explanations, GitHub, and a downloadable PDF certificate to prove your speech processing skills to your employer and freelancing clients:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://academy.finxter.com/university/openai-whisper/"><img loading="lazy" decoding="async" width="908" height="257" src="https://blog.finxter.com/wp-content/uploads/2024/01/image-154.png" alt="" class="wp-image-1654506" srcset="https://blog.finxter.com/wp-content/uploads/2024/01/image-154.png 908w, https://blog.finxter.com/wp-content/uploads/2024/01/image-154-300x85.png 300w, https://blog.finxter.com/wp-content/uploads/2024/01/image-154-768x217.png 768w" sizes="auto, (max-width: 908px) 100vw, 908px" /></a></figure>
</div>


<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f449.png" alt="👉" class="wp-smiley" style="height: 1em; max-height: 1em;" /> [<strong>Academy</strong>] <a href="https://academy.finxter.com/university/openai-whisper/" data-type="link" data-id="https://academy.finxter.com/university/openai-whisper/">Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper</a></p>



<p>The post <a href="https://blog.finxter.com/openai-whisper-python-installation-setup-first-steps-to-speech-to-text-synthesis/">OpenAI Whisper &#8211; Python Installation, Setup, &#038; First Steps to Speech-to-Text Synthesis</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>OpenAI Text-to-Speech (TTS) &#8211; I Tried the Top Ten Languages with 10 Amazing Speech Samples</title>
		<link>https://blog.finxter.com/openai-text-to-speech-tts-i-tried-the-top-ten-languages-10-speech-samples/</link>
		
		<dc:creator><![CDATA[Chris]]></dc:creator>
		<pubDate>Tue, 14 Nov 2023 10:18:00 +0000</pubDate>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Large Language Model (LLM)]]></category>
		<category><![CDATA[OpenAI]]></category>
		<category><![CDATA[Prompt Engineering]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Speech Recognition and Generation]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1652983</guid>

					<description><![CDATA[<p>Let&#8217;s check out OpenAI&#8217;s fantastic Text-to-Speech (TTS) technology. I was blown away when I first heard these voices; they sound so incredibly human, it&#8217;s almost hard to believe! It&#8217;s like having a friendly chat in different languages, all thanks to OpenAI&#8217;s amazing speech-generation skills in the world&#8217;s top ten languages. I used the following code: ... <a title="OpenAI Text-to-Speech (TTS) &#8211; I Tried the Top Ten Languages with 10 Amazing Speech Samples" class="read-more" href="https://blog.finxter.com/openai-text-to-speech-tts-i-tried-the-top-ten-languages-10-speech-samples/" aria-label="Read more about OpenAI Text-to-Speech (TTS) &#8211; I Tried the Top Ten Languages with 10 Amazing Speech Samples">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/openai-text-to-speech-tts-i-tried-the-top-ten-languages-10-speech-samples/">OpenAI Text-to-Speech (TTS) &#8211; I Tried the Top Ten Languages with 10 Amazing Speech Samples</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Let&#8217;s check out OpenAI&#8217;s fantastic Text-to-Speech (TTS) technology. I was blown away when I first heard these voices; they sound so incredibly human, it&#8217;s almost hard to believe! </p>



<p>It&#8217;s like having a friendly chat in different languages, all thanks to OpenAI&#8217;s amazing speech-generation skills in the world&#8217;s top ten languages.</p>



<p>I used the following code:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import openai

your_openai_key = 'sk-...'
d = {
    'English': 'Finxter helps you stay on the right side of change!',
    'Mandarin Chinese (Simplified)': '...',
    'Hindi': '...',
    'Spanish': '¡Finxter te ayuda a mantenerte del lado correcto del cambio!',
    'French': 'Finxter vous aide à rester du bon côté du changement !',
    'Arabic': '...',
    'Bengali': '...',
    'Russian': 'Финкстер помогает вам оставаться на правильной стороне изменений!',
    'Portuguese': 'Finxter ajuda você a permanecer no lado certo da mudança!',
    'Indonesian': 'Finxter membantu Anda tetap di sisi yang benar dari perubahan!'
}


client = openai.OpenAI(api_key=your_openai_key)
voices = ['alloy', 'echo', 'fable', 'onyx', 'nova', 'shimmer']  # all available voices; we use 'onyx' below

for language in d:
    response = client.audio.speech.create(
        model="tts-1",
        voice='onyx',
        input=d[language]
    )

    response.stream_to_file(f'{language}.mp3')
</pre>



<p>This code snippet uses <a href="https://blog.finxter.com/openai-text-to-speech-tts-minimal-example-in-python/" data-type="post" data-id="1652777">OpenAI&#8217;s Text-to-Speech (TTS)</a> capabilities through the <a href="https://blog.finxter.com/openai-python-api-a-helpful-illustrated-guide-in-5-steps/" data-type="post" data-id="1487700">OpenAI Python</a> library. It begins by importing the OpenAI module and setting up an API key. You should have installed OpenAI:</p>



<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f9d1-200d-1f4bb.png" alt="🧑‍💻" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/how-to-install-openai-in-python/" data-type="post" data-id="1170845" target="_blank" rel="noreferrer noopener">How to Install OpenAI in Python?</a></p>



<p>A dictionary <code>d</code> is defined, containing sentences in various languages, each associated with a language key. I used the world&#8217;s 10 most spoken languages but for formatting reasons skipped some translations &#8212; my blog software cannot display the Unicode symbols.</p>



<p>The code then initializes an OpenAI client with the specified API key. It iterates over the languages in the <a href="https://blog.finxter.com/python-create-dictionary-the-ultimate-guide/" data-type="post" data-id="1651200">dictionary</a> <code>d</code>, using the <code>client.audio.speech.create</code> function to convert the text in each language to speech. </p>



<p>The chosen model for TTS is <code>"tts-1"</code> and the voice is set to &#8216;onyx&#8217; for all languages. </p>
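


<p>If you want to hear how the other voices compare, a small variation of the loop renders the same English sentence once per voice. A minimal sketch, reusing the <code>client</code> and <code>d</code> dictionary from the snippet above:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Render one sentence in each of the six voices for comparison
for voice in ['alloy', 'echo', 'fable', 'onyx', 'nova', 'shimmer']:
    response = client.audio.speech.create(
        model="tts-1",
        voice=voice,
        input=d['English']
    )
    response.stream_to_file(f'English_{voice}.mp3')</pre>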



<p>The audio output for each language is then saved as an MP3 file named after the respective language. Here are the language samples &#8212; look at how amazing these sound: <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f447.png" alt="👇" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>English:</strong> Finxter helps you stay on the right side of change!</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/English-1.mp3"></audio></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>Mandarin Chinese (Simplified):</strong> Finxter 帮助你保持在变化的正确一边！</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/Mandarin-Chinese-Simplified-1.mp3"></audio></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>Hindi:</strong> फिंक्सटर आपको परिवर्तन के सही पक्ष में बने रहने में मदद करता है!</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/Hindi-1.mp3"></audio></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>Spanish:</strong> ¡Finxter te ayuda a mantenerte del lado correcto del cambio!</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/Spanish-2.mp3"></audio></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>French:</strong> Finxter vous aide à rester du bon côté du changement !</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/French-1.mp3"></audio></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>Arabic:</strong> فينكستر يساعدك على البقاء على الجانب الصحيح من التغيير!</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/Arabic-1.mp3"></audio></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>Bengali:</strong> ফিন্ক্সটার আপনাকে পরিবর্তনের সঠিক দিকে থাকতে সাহায্য করে!</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/Bengali-1.mp3"></audio></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>Russian:</strong> Финкстер помогает вам оставаться на правильной стороне изменений!</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/Russian-1.mp3"></audio></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>Portuguese:</strong> Finxter ajuda você a permanecer no lado certo da mudança!</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/Portuguese-1.mp3"></audio></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>Indonesian:</strong> Finxter membantu Anda tetap di sisi yang benar dari perubahan!</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/Indonesian-1.mp3"></audio></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>Bonus &#8211; <strong>German</strong>: Finxter hilft dir, auf der richtigen Seite der Veränderung zu bleiben.</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/German.mp3"></audio></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>Thanks for being an avid Finxter reader! Check out this article next: <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f447.png" alt="👇" class="wp-smiley" style="height: 1em; max-height: 1em;" /> </p>



<figure class="wp-block-image size-large"><a href="https://blog.finxter.com/openai-text-to-speech-tts-minimal-example-in-python/"><img loading="lazy" decoding="async" width="1024" height="585" src="https://blog.finxter.com/wp-content/uploads/2023/11/54aa8411-185a-4204-8159-abcf9bfc0d29-1-1024x585.webp" alt="" class="wp-image-1653017" srcset="https://blog.finxter.com/wp-content/uploads/2023/11/54aa8411-185a-4204-8159-abcf9bfc0d29-1-1024x585.webp 1024w, https://blog.finxter.com/wp-content/uploads/2023/11/54aa8411-185a-4204-8159-abcf9bfc0d29-1-300x171.webp 300w, https://blog.finxter.com/wp-content/uploads/2023/11/54aa8411-185a-4204-8159-abcf9bfc0d29-1-768x439.webp 768w, https://blog.finxter.com/wp-content/uploads/2023/11/54aa8411-185a-4204-8159-abcf9bfc0d29-1-1536x878.webp 1536w, https://blog.finxter.com/wp-content/uploads/2023/11/54aa8411-185a-4204-8159-abcf9bfc0d29-1.webp 1792w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>



<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f9d1-200d-1f4bb.png" alt="🧑‍💻" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/openai-text-to-speech-tts-minimal-example-in-python/" data-type="link" data-id="https://blog.finxter.com/openai-text-to-speech-tts-minimal-example-in-python/">OpenAI Text to Speech (TTS): Minimal Example in Python</a></p>



<p>Feel free to check out our <a href="http://academy.finxter.com/">academy courses</a> to keep mastering prompt engineering, e.g., with Llama 2:</p>



<h2 class="wp-block-heading">Prompt Engineering with Llama 2</h2>



<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> The <strong><a href="https://academy.finxter.com/university/prompt-engineering-with-llama-2/">Llama 2 Prompt Engineering course</a></strong> helps you stay on the right side of change. Our course is meticulously designed to provide you with <em>hands-on experience through genuine projects</em>.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://academy.finxter.com/university/prompt-engineering-with-llama-2/" target="_blank" rel="noreferrer noopener"><img loading="lazy" decoding="async" width="919" height="261" src="https://blog.finxter.com/wp-content/uploads/2023/09/image-101.png" alt="" class="wp-image-1651689" srcset="https://blog.finxter.com/wp-content/uploads/2023/09/image-101.png 919w, https://blog.finxter.com/wp-content/uploads/2023/09/image-101-300x85.png 300w, https://blog.finxter.com/wp-content/uploads/2023/09/image-101-768x218.png 768w" sizes="auto, (max-width: 919px) 100vw, 919px" /></a></figure>
</div>


<p>You&#8217;ll delve into practical applications such as book PDF querying, payroll auditing, and hotel review analytics. These aren&#8217;t just theoretical exercises; they&#8217;re real-world challenges that businesses face daily.</p>



<p>By studying these projects, you&#8217;ll gain a deeper comprehension of how to harness the power of Llama 2 using <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f40d.png" alt="🐍" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Python, <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f517.png" alt="🔗" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f99c.png" alt="🦜" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Langchain, <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f332.png" alt="🌲" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Pinecone, and a whole stack of highly <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2692.png" alt="⚒" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f6e0.png" alt="🛠" class="wp-smiley" style="height: 1em; max-height: 1em;" /> practical tools of exponential coders in a post-ChatGPT world.</p>
<p>The post <a href="https://blog.finxter.com/openai-text-to-speech-tts-i-tried-the-top-ten-languages-10-speech-samples/">OpenAI Text-to-Speech (TTS) &#8211; I Tried the Top Ten Languages with 10 Amazing Speech Samples</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/English-1.mp3" length="64800" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/Mandarin-Chinese-Simplified-1.mp3" length="72000" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/Hindi-1.mp3" length="92640" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/Spanish-2.mp3" length="74400" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/French-1.mp3" length="71040" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/Arabic-1.mp3" length="103680" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/Bengali-1.mp3" length="98880" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/Russian-1.mp3" length="90720" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/Portuguese-1.mp3" length="72480" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/Indonesian-1.mp3" length="77760" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/German.mp3" length="87840" type="audio/mpeg" />

			</item>
		<item>
		<title>OpenAI Text to Speech (TTS): Minimal Example in Python</title>
		<link>https://blog.finxter.com/openai-text-to-speech-tts-minimal-example-in-python/</link>
		
		<dc:creator><![CDATA[Chris]]></dc:creator>
		<pubDate>Wed, 08 Nov 2023 15:30:44 +0000</pubDate>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Large Language Model (LLM)]]></category>
		<category><![CDATA[OpenAI]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Speech Recognition and Generation]]></category>
		<category><![CDATA[Text Processing]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1652777</guid>

					<description><![CDATA[<p>To use OpenAI&#8217;s amazing Text-to-Speech (TTS) functionality, first install the openai Python library and obtain an API key from OpenAI. Instantiate an OpenAI client with openai.OpenAI(api_key). Call client.audio.speech.create(model='tts-1', voice='alloy', input=your_text) to generate speech with the 'alloy' voice. You can then save the result as an MP3 file using response.stream_to_file('your_file.mp3'). First, install the OpenAI library and set ... <a title="OpenAI Text to Speech (TTS): Minimal Example in Python" class="read-more" href="https://blog.finxter.com/openai-text-to-speech-tts-minimal-example-in-python/" aria-label="Read more about OpenAI Text to Speech (TTS): Minimal Example in Python">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/openai-text-to-speech-tts-minimal-example-in-python/">OpenAI Text to Speech (TTS): Minimal Example in Python</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p class="has-global-color-8-background-color has-background">To use OpenAI&#8217;s amazing Text-to-Speech (TTS) functionality, first install the <code>openai</code> Python library and obtain an API key from OpenAI. <br><br>Instantiate an OpenAI <code>client</code> with <code>openai.OpenAI(api_key)</code>. <br><br>Call <code>client.audio.speech.create(model='tts_1', voice='alloy', input=your_text)</code> to use the <code>'alloy'</code> voice model. <br><br>This generates speech you can save as an MP3 file using <code>response.stream_to_file('your_file.mp3')</code>. </p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>First, install the OpenAI library and set up your OpenAI key. </p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pip install openai # Python 2 or 3
pip3 install openai # Python 3 
!pip install openai # Google Colab</pre>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://blog.finxter.com/how-to-install-openai-in-python/"><img loading="lazy" decoding="async" width="1024" height="575" src="https://blog.finxter.com/wp-content/uploads/2023/11/image-1-2-1024x575.png" alt="" class="wp-image-1652778" srcset="https://blog.finxter.com/wp-content/uploads/2023/11/image-1-2-1024x575.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/11/image-1-2-300x168.png 300w, https://blog.finxter.com/wp-content/uploads/2023/11/image-1-2-768x431.png 768w, https://blog.finxter.com/wp-content/uploads/2023/11/image-1-2.png 1364w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>
</div>


<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f9d1-200d-1f4bb.png" alt="🧑‍💻" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/how-to-install-openai-in-python/" data-type="post" data-id="1170845" target="_blank" rel="noreferrer noopener">How to Install OpenAI in Python?</a></p>



<p>Second, copy and paste the following code into your Python script or notebook, replacing the OpenAI API key with your own.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="3,4" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import openai

your_openai_key = 'sk-...'
your_text = 'Finxter helps you stay on the right side of change!'

client = openai.OpenAI(api_key=your_openai_key)

response = client.audio.speech.create(
  model="tts-1",
  voice="alloy", # other voices: 'echo', 'fable', 'onyx', 'nova', 'shimmer'
  input=your_text
)

response.stream_to_file('speech.mp3')
</pre>



<p>You can now find the file <code>'speech.mp3'</code> in the same folder where you ran your Python script. Easy as that!</p>
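<p>If you&#8217;d rather play the file straight from Python instead of opening it manually, a small helper like the third-party <code>playsound</code> package works; this is just one option for illustration, and any media player does the job equally well:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># pip install playsound
from playsound import playsound

# Blocks until the MP3 generated above has finished playing
playsound('speech.mp3')</pre>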



<p>Now have a listen to the amazing result: the voice sounds like a genuine human being, doesn&#8217;t it? <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f447.png" alt="👇" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/speech4.mp3"></audio></figure>



<p>At the time of writing, you can use the following voices:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">voices = ['alloy', 'echo', 'fable', 'onyx', 'nova', 'shimmer']</pre>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="585" src="https://blog.finxter.com/wp-content/uploads/2023/11/895d4972-fc04-42f0-8d4f-8186acab703a-1024x585.webp" alt="" class="wp-image-1652793" srcset="https://blog.finxter.com/wp-content/uploads/2023/11/895d4972-fc04-42f0-8d4f-8186acab703a-1024x585.webp 1024w, https://blog.finxter.com/wp-content/uploads/2023/11/895d4972-fc04-42f0-8d4f-8186acab703a-300x171.webp 300w, https://blog.finxter.com/wp-content/uploads/2023/11/895d4972-fc04-42f0-8d4f-8186acab703a-768x439.webp 768w, https://blog.finxter.com/wp-content/uploads/2023/11/895d4972-fc04-42f0-8d4f-8186acab703a-1536x878.webp 1536w, https://blog.finxter.com/wp-content/uploads/2023/11/895d4972-fc04-42f0-8d4f-8186acab703a.webp 1792w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>Here are the six different voices in that order:</p>



<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f468.png" alt="👨" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Alloy </strong>(male):</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/speech_alloy.mp3"></audio></figure>



<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f468-200d-1f9b2.png" alt="👨‍🦲" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Echo </strong>(male):</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/speech_echo.mp3"></audio></figure>



<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f984.png" alt="🦄" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Fable </strong>(female?):</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/speech_fable.mp3"></audio></figure>



<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f469.png" alt="👩" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Onyx </strong>(female):</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/speech_nova.mp3"></audio></figure>



<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f9d3.png" alt="🧓" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Nova </strong>(deep male):</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/speech_onyx.mp3"></audio></figure>



<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f483.png" alt="💃" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Shimmer </strong>(female):</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/speech_shimmer.mp3"></audio></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>Staying up to date in these rapidly changing times is crucial. Feel free to join our free email newsletter by downloading our Python and OpenAI cheat sheets:</p>






<p>You can also take our prompt engineering courses to keep growing your skills:</p>



<h2 class="wp-block-heading">Prompt Engineering with Llama 2</h2>



<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> The <strong><a href="https://academy.finxter.com/university/prompt-engineering-with-llama-2/">Llama 2 Prompt Engineering course</a></strong> helps you stay on the right side of change. Our course is meticulously designed to provide you with <em>hands-on experience through genuine projects</em>.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://academy.finxter.com/university/prompt-engineering-with-llama-2/" target="_blank" rel="noreferrer noopener"><img loading="lazy" decoding="async" width="919" height="261" src="https://blog.finxter.com/wp-content/uploads/2023/09/image-101.png" alt="" class="wp-image-1651689" srcset="https://blog.finxter.com/wp-content/uploads/2023/09/image-101.png 919w, https://blog.finxter.com/wp-content/uploads/2023/09/image-101-300x85.png 300w, https://blog.finxter.com/wp-content/uploads/2023/09/image-101-768x218.png 768w" sizes="auto, (max-width: 919px) 100vw, 919px" /></a></figure>
</div>


<p>You&#8217;ll delve into practical applications such as book PDF querying, payroll auditing, and hotel review analytics. These aren&#8217;t just theoretical exercises; they&#8217;re real-world challenges that businesses face daily.</p>



<p>By studying these projects, you&#8217;ll gain a deeper comprehension of how to harness the power of Llama 2 using <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f40d.png" alt="🐍" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Python, <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f517.png" alt="🔗" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f99c.png" alt="🦜" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Langchain, <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f332.png" alt="🌲" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Pinecone, and a whole stack of highly <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2692.png" alt="⚒" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f6e0.png" alt="🛠" class="wp-smiley" style="height: 1em; max-height: 1em;" /> practical tools of exponential coders in a post-ChatGPT world.</p>
<p>The post <a href="https://blog.finxter.com/openai-text-to-speech-tts-minimal-example-in-python/">OpenAI Text to Speech (TTS): Minimal Example in Python</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/speech4.mp3" length="63840" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/speech_alloy.mp3" length="63840" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/speech_echo.mp3" length="60000" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/speech_fable.mp3" length="64800" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/speech_nova.mp3" length="64800" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/speech_onyx.mp3" length="65760" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/speech_shimmer.mp3" length="65280" type="audio/mpeg" />

			</item>
	</channel>
</rss>
