<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Speech Recognition and Generation Archives - Be on the Right Side of Change</title>
	<atom:link href="https://blog.finxter.com/category/speech-recognition-and-generation/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.finxter.com/category/speech-recognition-and-generation/</link>
	<description></description>
	<lastBuildDate>Thu, 25 Jan 2024 19:57:22 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.finxter.com/wp-content/uploads/2020/08/cropped-cropped-finxter_nobackground-32x32.png</url>
	<title>Speech Recognition and Generation Archives - Be on the Right Side of Change</title>
	<link>https://blog.finxter.com/category/speech-recognition-and-generation/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>OpenAI Whisper &#8211; Speeding Up or Outsourcing the Processing</title>
		<link>https://blog.finxter.com/openai-whisper-speeding-up-or-outsourcing-the-processing/</link>
		
		<dc:creator><![CDATA[Dirk van Meerveld]]></dc:creator>
		<pubDate>Thu, 25 Jan 2024 19:57:21 +0000</pubDate>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Large Language Model (LLM)]]></category>
		<category><![CDATA[OpenAI]]></category>
		<category><![CDATA[Prompt Engineering]]></category>
		<category><![CDATA[Speech Recognition and Generation]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1654500</guid>

					<description><![CDATA[<p>🎙️ Course: This article is based on a lesson from our Finxter Academy Course Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper. Check it out for video lessons, GitHub, and a downloadable PDF course certificate with your name on it! Hi and welcome back! In this part, we&#8217;re going to look at some ... <a title="OpenAI Whisper &#8211; Speeding Up or Outsourcing the Processing" class="read-more" href="https://blog.finxter.com/openai-whisper-speeding-up-or-outsourcing-the-processing/" aria-label="Read more about OpenAI Whisper &#8211; Speeding Up or Outsourcing the Processing">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/openai-whisper-speeding-up-or-outsourcing-the-processing/">OpenAI Whisper &#8211; Speeding Up or Outsourcing the Processing</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f399.png" alt="🎙" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Course</strong>: This article is based on a lesson from our <strong>Finxter Academy Course</strong> <a href="https://academy.finxter.com/university/openai-whisper/"><em>Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper</em></a>. Check it out for video lessons, GitHub, and a downloadable PDF course certificate with your name on it!</p>



<p>Hi and welcome back! In this part, we&#8217;re going to look at some alternatives that either speed up the processing or outsource it to OpenAI&#8217;s servers altogether. First, we&#8217;ll look at <code>faster-whisper</code> at a basic level. We&#8217;ll only cover it briefly before moving on to the web API version for the rest of this part, so if you&#8217;re not sure whether you want to use it, you can just read along for now and decide later whether to install it.</p>



<p>So what is <code>faster-whisper</code>? Faster-Whisper is a quicker version of OpenAI&#8217;s Whisper speech-to-text model. Because OpenAI released the <code>whisper</code> model as open source, others have naturally been able to build on it and optimize it further. It uses CTranslate2, a fast inference engine for Transformer models, and is up to 4 times faster than the original openai/whisper while using considerably less memory and claiming to maintain the same accuracy. You can find the GitHub repository <a href="https://github.com/SYSTRAN/faster-whisper">here</a>.</p>



<p>You can use this for the same apps we have built so far, just as a faster version of the Whisper model, so we won&#8217;t be building a new app specifically for it, as that would get repetitive and I don&#8217;t want to waste your time! You just need some syntax changes to make your app work with <code>faster-whisper</code> instead of the original whisper model. So we&#8217;ll take a look at the basics of faster-whisper, let you decide if you want to use or implement it, and then move on to the web-API version.</p>



<h2 class="wp-block-heading">Installing faster-whisper</h2>



<p>Note: If you do not plan on using faster-whisper, or are not quite sure, there is no point in going through the install procedures; you can skip ahead to the web-API version, or just read along and decide later if you want to use it.</p>



<p>Basically, to install faster-whisper you just have to run the following command in your terminal:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pip install faster-whisper</pre>



<p>And to support GPU execution you need to have the appropriate libraries for CUDA installed, namely <a href="https://developer.nvidia.com/cublas">cuBLAS</a> and <a href="https://developer.nvidia.com/cudnn">cuDNN</a>. This can be the slightly trickier part of the install, and again I cannot really give you platform-specific instructions or help you with the specific troubleshooting if you run into challenges. As always in software development, if you&#8217;re lucky you won&#8217;t have any problems, and if you&#8217;re not, you spend some time on Google and Stack Overflow to find the solution. If you just want to run faster-whisper on your CPU, which will of course be slower but may not be a big deal for small-scale development on your own machine, you can skip the <code>cuBLAS</code> and <code>cuDNN</code> installs.</p>
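

<p>If you want a quick way to check whether a CUDA GPU is visible before loading a model, here is a minimal sketch. I&#8217;m assuming the <code>ctranslate2</code> package that faster-whisper pulls in as a dependency; this check is optional and not used in the rest of this part:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import ctranslate2

# A count of 0 means no CUDA GPU is visible, so you'd run on CPU instead.
print(ctranslate2.get_cuda_device_count())</pre>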



<h2 class="wp-block-heading">Using faster-whisper</h2>



<p>So let&#8217;s give it a spin to see how it works! First create a new file in your project root directory called <code>4_faster_whisper.py</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />utils
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />2_whisper_pods.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />3_subtitle_master.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />4_faster_whisper.py   (<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2728.png" alt="✨" class="wp-smiley" style="height: 1em; max-height: 1em;" />new file)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env</pre>



<p>And inside let&#8217;s start with our imports:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from faster_whisper import WhisperModel
from settings import TEST_AUDIO_DIR

model_size = "small"</pre>



<p>We import the <code>WhisperModel</code> class from the <code>faster_whisper</code> package, and the <code>TEST_AUDIO_DIR</code> variable from our <code>settings.py</code> file, and then set a string variable to the value <code>small</code>. Like whisper, faster-whisper also comes with different sizes of models. Using the same naming convention we have <code>tiny.en</code>, <code>base.en</code>, <code>small.en</code>, and <code>medium.en</code> as our English-only models. For the multi-language models, we can choose between <code>tiny</code>, <code>base</code>, <code>small</code>, <code>medium</code>, or one of several versions of the full-size model, namely: <code>large-v1</code>, <code>large-v2</code>, <code>large-v3</code>, or <code>large</code>.</p>



<p>Next, we&#8217;ll create a new instance of the <code>WhisperModel</code> class, picking only one of the two options below:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">model = WhisperModel(model_size, device="cpu", compute_type="int8")
# Choose only one of these, depending on if you're running on CPU or GPU (cuda). (I'll be using the second option)
model = WhisperModel(model_size, device="cuda", compute_type="float16")</pre>



<p>More options are available, like running on <code>cuda</code> using <code>int8_float16</code> or even using <code>float32</code>, see <a href="https://opennmt.net/CTranslate2/quantization.html">here</a> for more details.</p>
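

<p>By the way, if you want a script that works on machines with or without a GPU, one possible pattern is to try the <code>cuda</code> device first and fall back to CPU. This is just a sketch of the idea under that assumption, not something we use in the rest of this part:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from faster_whisper import WhisperModel

model_size = "small"

try:
    # Try the GPU first; constructing the model raises if CUDA isn't usable.
    model = WhisperModel(model_size, device="cuda", compute_type="float16")
except Exception:
    # Fall back to CPU with int8 quantization.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")</pre>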



<p>The <code>.transcribe</code> method for faster-whisper is slightly different:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">segments, info = model.transcribe(
    str(TEST_AUDIO_DIR / "dutch_long_repeat_file.mp3"),
    beam_size=5,
)</pre>



<p>As you can see, we get two return values when calling <code>model.transcribe</code> instead of the single dictionary output we had before. The first is <code>segments</code>, which contains the transcription. The second is a <code>NamedTuple</code> (a <code>Tuple</code> with named fields) which allows us to access information like the language (<code>info.language</code>), the language probability (<code>info.language_probability</code>), etc. So let&#8217;s add some print statements to print the information and then the transcription itself to the console:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">print(f"Detected language '{info.language}' with probability {info.language_probability}")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")</pre>



<p>The first print statement just has us access some of the properties of the <code>info</code> object we discussed. The second print statement loops over the list of <code>segments</code>, and for each <code>segment</code> it will print the segment&#8217;s start time, end time, and the text of the segment itself. The <code>:.2f</code> is a formatting string that tells Python to print the number with two decimal places, for example: <code>1.23</code> instead of <code>1.23456789</code>.</p>



<p>One interesting thing to note here, though, is that <code>segments</code> is not actually a list. It is a generator, which is a different type of iterable. What this means is that the segments will be generated when you request them and not beforehand. In other words, the transcription only begins when we iterate over <code>segments</code> and not before. Calling <code>.transcribe()</code> on our model did not start the transcription the way vanilla whisper did. You can either loop over the <code>segments</code> as we did above, or you can convert the generator to a list with <code>list(segments)</code>.</p>
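

<p>To make this lazy behavior concrete, here is a small sketch. Right after the <code>.transcribe()</code> call no work has been done yet; forcing the generator with <code>list()</code> runs the whole transcription up front:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">segments, info = model.transcribe(
    str(TEST_AUDIO_DIR / "dutch_long_repeat_file.mp3"),
    beam_size=5,
)
# Nothing has been transcribed yet at this point.

all_segments = list(segments)  # This line triggers the full transcription.
full_text = " ".join(segment.text.strip() for segment in all_segments)
print(full_text)</pre>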



<p>One of the nice things about this generator is that we can very easily see the live transcription and print it to the console while it is still generating, which is exactly what this code will do. So let&#8217;s run it and see what happens:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">Estimating duration from bitrate, this may be inaccurate
Detected language 'nl' with probability 0.931703
[0.00s -> 3.04s]  Hoi allemaal, dit is weer een testbestandje.
[3.04s -> 6.88s]  Deze keer om te testen of de Nederlandse taal goed herkent gaat worden.
[6.88s -> 12.68s]  Hierna kunnen we ook proberen deze tekst te laten vertalen naar het Engels om te zien hoe goed dat gaat.
[12.68s -> 13.88s]  Ik ben benieuwd.
[13.88s -> 16.84s]  Hoi allemaal, dit is weer een testbestandje.
[16.84s -> 20.72s]  Deze keer om te testen of de Nederlandse taal goed herkent gaat worden.
[20.72s -> 26.48s]  Hierna kunnen we ook proberen deze tekst te laten vertalen naar het Engels om te zien hoe goed dat gaat.
[26.48s -> 27.68s]  Ik ben benieuwd.
[27.68s -> 30.72s]  Hoi allemaal, dit is weer een testbestandje.
[30.72s -> 34.60s]  Deze keer om te testen of de Nederlandse taal goed herkent gaat worden.
[34.60s -> 40.36s]  Hierna kunnen we ook proberen deze tekst te laten vertalen naar het Engels om te zien hoe goed dat gaat.
[40.36s -> 41.52s]  Ik ben benieuwd.</pre>



<p>You can see the output streaming to the console as the model transcribes. Unless you&#8217;re running on CPU, you will also notice a pretty good speed. Now, as you&#8217;re probably not Dutch, I&#8217;ll just tell you the transcription above is perfect except for the one small (<code>herkent/herkend</code>) issue we had before, but as you know this can be fixed by loading a larger model size.</p>



<p>Play around with any audio file you want and see what model size you need. If you use English files, pick a <code>.en</code> model for greater efficiency. Also be aware that you can pass options into the <code>.transcribe</code> method much like with the vanilla whisper model, for instance:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">segments, info = model.transcribe(
    str(TEST_AUDIO_DIR / "dutch_long_repeat_file.mp3"),
    beam_size=5,
    word_timestamps=True,  # include this line to get word-level timestamps
    # without_timestamps=True,  # uncomment this line to get rid of timestamps and just transcribe
)</pre>



<p>In conclusion, faster-whisper is a nice optimization to look into if you&#8217;re considering deploying this model in a production application somewhere. There are also other optimized versions of the whisper model out there that you can check out, like <a href="https://github.com/huggingface/distil-whisper">distil-whisper</a>. Play around and see which gives you the best trade-offs between speed and accuracy. I&#8217;ll leave the rest up to you as we move on from faster-whisper to check out the web-API version.</p>



<h2 class="wp-block-heading">Web-API version</h2>



<p>Another option we have is to simply not deploy the model anywhere, but to outsource the work to OpenAI&#8217;s fast servers instead. This is kind of like making a ChatGPT call, except we request a transcription instead of a chat completion. The OpenAI servers are highly optimized for machine-learning workloads (obviously) and, as you&#8217;ll see, they are therefore quite fast!</p>



<p>So let&#8217;s take a look at the pricing first. The cost for using the Whisper API is $0.006 per minute transcribed, rounded to the nearest second. This means a 20-minute video would cost you $0.12. This is a good solution if you don&#8217;t want to deploy the model yourself, perhaps your application will only be used occasionally and it&#8217;s simply not worth it to invest that much into having a model running somewhere. For a high-use application dealing with longer files and many users, this is not the way to go though.</p>
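

<p>The math is simple enough to put into a tiny helper. This is just an illustration of the $0.006/minute price above; the function name and rounding are my own:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def estimate_whisper_cost(duration_seconds: float, price_per_minute: float = 0.006) -> float:
    """Estimate the Whisper API cost, billed per second at $0.006 per minute."""
    return round(duration_seconds) / 60 * price_per_minute

print(f"${estimate_whisper_cost(20 * 60):.2f}")  # A 20-minute video: $0.12</pre>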



<p>So let&#8217;s take a quick look at how this would work practically, by building one last quick application, but this time using the web API. Our application will take any video in any language as input and will return a short quiz with questions about the video. First, create a new file in your <code>utils</code> folder named <code>openai_api.py</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />utils
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />command.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />openai_api.py   (<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2728.png" alt="✨" class="wp-smiley" style="height: 1em; max-height: 1em;" />new file)
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />podcast.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />subtitles.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />video.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />2_whisper_pods.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />3_subtitle_master.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />4_faster_whisper.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env</pre>



<p>Inside <code>openai_api.py</code>, let&#8217;s start with our imports and some basic setup:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import typing
from pathlib import Path

from decouple import config
from openai import OpenAI


CLIENT = OpenAI(api_key=str(config("OPENAI_API_KEY")))
MODEL = "whisper-1"

ResponseFormat = typing.Literal["text", "srt", "vtt"]</pre>



<p>We&#8217;ll use <code>typing</code> to define our allowed response formats. The rest are all imports we have used before: <code>config</code>, as we&#8217;ll need to load our API key, and <code>OpenAI</code> to call the APIs for Whisper and ChatGPT. We create our <code>CLIENT</code> just like last time and save the <code>MODEL</code> name in a string variable; <code>whisper-1</code> is the only option for the Whisper API for now.</p>



<p>Finally, we define a type alias named <code>ResponseFormat</code> which is a <code>Literal</code> type, which means it can only be one of the three strings we have defined, <code>text</code>, <code>srt</code>, or <code>vtt</code>. We can use this as a type hint later to indicate that if a particular variable is of type <code>ResponseFormat</code> then it should have one of these three values and nothing else. (<code>json</code> and <code>verbose_json</code> are also possible if you prefer JSON object output, but we will be skipping them as they are useless for our purposes.)</p>
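

<p>To illustrate how the alias behaves, here is a quick sketch. At runtime it is just a plain string, but a static type checker such as mypy or Pylance will flag values outside the <code>Literal</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">fmt: ResponseFormat = "srt"  # Fine, one of the three allowed values.

# A type checker would reject the next line, since "json" is not in the Literal:
# fmt: ResponseFormat = "json"</pre>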



<p>Now we&#8217;ll define our transcription utility function:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def transcribe(
    file: Path,
    language: str | None = None,
    translate: bool = False,
    response_format: ResponseFormat = "text",
) -> str:

    print("Transcribing file...")
    options = {
        "file": file,
        "model": MODEL,
        "response_format": response_format,
    }

    if translate:
        transcript = CLIENT.audio.translations.create(**options)
    else:
        if language:
            options["language"] = language
        transcript = CLIENT.audio.transcriptions.create(**options)

    if not isinstance(transcript, str):
        raise TypeError(
            f"Expected a string value to be returned, but got {type(transcript)} instead."
        )
    print(f"Transcription successful:\n{transcript[:100]}...")

    return transcript</pre>



<p>We define a function called <code>transcribe</code> which takes a <code>file</code> of type <code>Path</code> and a <code>language</code> of type <code>str</code> or <code>None</code>, defaulting to <code>None</code>, in which case the API will try to detect the language automatically. We also have a <code>translate</code> boolean which defaults to <code>False</code>, and a <code>response_format</code> which has to be of type <code>ResponseFormat</code>, so one of the three values we defined in the type alias, defaulting to <code>text</code>. The function returns a string.</p>



<p>We print a message to indicate the transcription is starting and then create a dictionary named <code>options</code> in which we pass some options that are needed for both a translation and a transcription call, the shared options if you will. These are the <code>file</code>, <code>model</code>, and <code>response_format</code>. If the user requests a translation we call the <code>CLIENT.audio.translations.create</code> method, passing in the <code>**options</code> dictionary as arguments as is. If <code>translate</code> is <code>False</code>, it must be a transcription. For transcriptions, we can add the <code>language</code> key to the options dictionary to specify the language, but if the user didn&#8217;t provide it we can leave it out and it will just take a bit longer to do the auto-detection. This time we call the <code>CLIENT.audio.transcriptions.create</code> method, again passing in the <code>**options</code> dictionary, which now optionally contains the <code>language</code> key.</p>



<p>Finally, we check if the <code>transcript</code> is a string, and if not we raise a <code>TypeError</code> to indicate something went wrong, just to make sure the user is not requesting JSON from this endpoint, which is possible and would crash the rest of our code. Otherwise, we print a message to indicate the transcription was successful and return the <code>transcript</code>.</p>
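

<p>If you want to give the function a quick spin on its own, a hypothetical test run could look like this, reusing the Dutch test file from earlier (any audio file you have lying around will do):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">if __name__ == "__main__":
    from settings import TEST_AUDIO_DIR

    # Transcribe in the original language and print the plain-text result.
    result = transcribe(TEST_AUDIO_DIR / "dutch_long_repeat_file.mp3")
    print(result)</pre>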



<h2 class="wp-block-heading">Video to Quiz</h2>



<p>As we&#8217;re going to be building a video-to-quiz app, we need one more utility function inside this <code>openai_api.py</code> file, which will take a transcript and generate some questions for us. Continue below the <code>transcribe</code> function:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">PROMPT_SETUP = """You are a text-to-quiz app. The user will provide you a video transcription in textual format. You will generate a list of questions for the user to answer about this video. Depending on the length of the transcription, stick to a maximum of 5 questions. All questions should be solely about the video transcription content provided by the user and should be answerable by reading the transcription. Do not provide the answers, but only the questions. The transcription the user provides is based on a video, and may include timestamps, please ignore these timestamps and just treat it as one single transcription containing all the content in the video.
List and number each item on a separate line.
"""

from tenacity import retry, stop_after_attempt, stop_after_delay</pre>



<p>First, we define a constant to hold the prompt setup instruction for ChatGPT. Just go ahead and copy mine. It&#8217;s a fairly basic setup that asks for questions related to the video so we can make a quiz tailor-made for the input video. We also import <code>retry</code>, <code>stop_after_attempt</code>, and <code>stop_after_delay</code> from the <code>tenacity</code> package. (Go ahead and move the tenacity import line to the top of your file with the other imports instead of leaving it here in the middle.) We can use these to make our code a bit more robust when calling APIs or taking actions that do not have a 100% success rate. It&#8217;s fairly easy to use, and I just want to show you that this tool is out there; you&#8217;ll see how it works in a second.</p>



<p>Let&#8217;s code up the function:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def text_to_quiz(text: str) -> str:
    print("Converting text to quiz...")
    messages = [
        {"role": "system", "content": PROMPT_SETUP},
        {"role": "user", "content": text},
    ]
    result = CLIENT.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=messages,
    )
    content = result.choices[0].message.content
    if content is None:  # Just a quick sanity check
        raise ValueError("There was an error while trying to generate the quiz.")
    print("Text to quiz conversion completed.")
    return content</pre>



<p>Our function takes a string, which is the transcription, and returns a string as output. We create a list of messages, the first being the system message holding our <code>PROMPT_SETUP</code>, and the second being the user message with the transcription as its content. We then call the <code>CLIENT.chat.completions.create</code> method, passing in the <code>model</code> and <code>messages</code> as arguments. We&#8217;ll use <code>gpt-3.5-turbo-1106</code>, which is the newest gpt-3.5 model out there and is frankly good enough. You can use gpt-4, but make sure you consider the cost; it is considerably more expensive and not really needed for this use case. If you&#8217;re worried about the lower maximum input size, or &#8216;context window&#8217;, of gpt-3.5, know that its 16k-token context can easily handle long video transcriptions; most are not as long as you might think.</p>



<p>We then access the <code>content</code> of the first choice&#8217;s message in the <code>result</code> object, which should hold our quiz. We do a quick sanity check to make sure we received a valid response, and then print a message to indicate the conversion was successful and return the <code>content</code>.</p>
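

<p>As an aside, if you ever want to verify that a transcription will fit within the context window before sending it, you can count the tokens first. This sketch assumes you have the <code>tiktoken</code> package installed; it is not part of our app:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import tiktoken

transcription_text = "..."  # Placeholder: your transcript string goes here.

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
token_count = len(encoding.encode(transcription_text))
print(f"{token_count} tokens out of the 16k context budget")</pre>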



<p>So that&#8217;s pretty simple, right? But what if we get no content back? Do we really want to just raise an error and give up immediately? Let&#8217;s use the tenacity library so we can try again in case of a failure. The only thing we have to change is to add the <code>@retry</code> decorator before our function; everything except the first line stays the same:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">@retry(stop=stop_after_attempt(3) | stop_after_delay(60))
def text_to_quiz(text: str) -> str:
    print("Converting text to quiz...")
    messages = [
        {"role": "system", "content": PROMPT_SETUP},
        {"role": "user", "content": text},
    ]
    result = CLIENT.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=messages,
    )
    content = result.choices[0].message.content
    if content is None:  # Just a quick sanity check
        raise ValueError("There was an error while trying to generate the quiz.")
    print("Text to quiz conversion completed.")
    return content</pre>



<p>And just like that, our function is set up to try up to three times or (<code>|</code>) for a max of 60 seconds, just in case the API call fails for some reason. Notice how easy it is to use the Tenacity library. This is not required but it&#8217;s a nice way to make your code more robust just in case.</p>
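

<p>If you want to see the retry behavior in isolation, here is a tiny standalone sketch; the function and counter are made up purely for demonstration. The first two calls raise, the third succeeds, and tenacity hides the failures from the caller:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from tenacity import retry, stop_after_attempt

attempts = {"count": 0}

@retry(stop=stop_after_attempt(3))
def sometimes_fails() -> str:
    attempts["count"] += 1
    if attempts["count"] >= 3:
        return f"succeeded on attempt {attempts['count']}"
    raise ValueError("transient failure")

print(sometimes_fails())  # Prints: succeeded on attempt 3</pre>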



<h2 class="wp-block-heading">Putting it all together</h2>



<p>That&#8217;s our <code>openai_api.py</code> file done! Go ahead and save and close it. Now let&#8217;s create a new file in our project root directory called <code>4_vid_to_quiz.py</code> to put it all together:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />utils
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />command.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />openai_api.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />podcast.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />subtitles.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />video.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />2_whisper_pods.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />3_subtitle_master.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />4_faster_whisper.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />4_vid_to_quiz.py   (<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2728.png" alt="✨" class="wp-smiley" style="height: 1em; max-height: 1em;" />new file)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env</pre>



<p>Inside <code>4_vid_to_quiz.py</code> let&#8217;s start with our imports:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import os
import uuid
from pathlib import Path

import gradio as gr

from settings import BASE_DIR, OUTPUT_TEMP_DIR, STYLES_DIR
from utils import openai_api, video


API_UPLOAD_LIMIT_BYTES = 26214400  # 25mb</pre>



<p>We will use <code>os</code> to check the size of the file we will upload, as there is a size limit to the API. We have some imports you&#8217;ve seen before, and some of our directories from the <code>settings</code> file plus our <code>openai_api</code> and <code>video</code> utilities. We also define a constant <code>API_UPLOAD_LIMIT_BYTES</code> which is the maximum size of the file we can upload to the API, which is 25 MB.</p>



<p>Let&#8217;s start with a quick function to check if the file is not too big:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def check_upload_size(input_file: str) -> None:
    """Check the video file size is within the API upload limit."""
    input_file_size = os.path.getsize(input_file)
    if input_file_size > API_UPLOAD_LIMIT_BYTES:
        raise ValueError(
            f"File size of {input_file_size} bytes ({input_file_size / 1024 / 1024:.2f} MB) exceeds the API upload limit of {API_UPLOAD_LIMIT_BYTES} bytes ({API_UPLOAD_LIMIT_BYTES / 1024 / 1024:.2f} MB). Please use a shorter video or lower the audio quality settings."
        )</pre>



<p>We take an input file path as a string, use <code>os.path.getsize</code> to get the size of the file in bytes, and then check whether it is larger than our <code>API_UPLOAD_LIMIT_BYTES</code>. If it is, we raise a <code>ValueError</code> to indicate the file is too large; the error message includes both the file size and the API upload limit. That&#8217;s all there is to this function.</p>



<p>Let&#8217;s move on to our <code>main</code> function:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def main(input_video: str) -> str:
    """Takes a video file as string path and returns a quiz as string."""
    unique_id = uuid.uuid4()

    mp3_file = video.to_mp3(
        input_video,
        log_directory=BASE_DIR,
        output_path=OUTPUT_TEMP_DIR / f"{unique_id}.mp3",
        mono=True,
    )

    check_upload_size(mp3_file)
    transcription = openai_api.transcribe(
        Path(mp3_file), language="en", translate=False, response_format="text"
    )

    quiz = openai_api.text_to_quiz(transcription)
    return quiz</pre>



<p>This is the function the gradio button will call when clicked. It takes an <code>input_video</code> string path as input and returns the quiz in string format. We don&#8217;t really care about the name of the mp3 file we&#8217;ll extract from the video here, so we just use a <code>uuid</code> to make it unique. We then use our <code>video.to_mp3</code> utility function from the previous part to extract the audio from the video.</p>



<p>We pass in the <code>input_video</code> as the video file, our project root directory as the <code>log_directory</code>, and our <code>output_path</code> is the <code>OUTPUT_TEMP_DIR</code> with the <code>uuid</code> and <code>.mp3</code> extension pasted on. Finally, this is the time to use the <code>mono</code> option we built into the <code>to_mp3</code> function but didn&#8217;t use last time. So far the size of our files has not been that important, but now that we have a web API it suddenly becomes relevant.</p>



<p>Whisper down-mixes audio to mono before processing anyway, and the API has an upload limit of roughly 25MB per transcription request. So we can save a lot of space by dropping the channels to 1, from stereo to mono audio, which allows us to make much longer requests as we can drastically lower the bitrate with only 1 audio channel.</p>



<p>Sending stereo audio at 192kbps quality would exceed the file limit after about 18 minutes of audio. We more than halved the bitrate to 80kbps, which is still considered decent quality for mono mp3 files and allows us to transcribe much longer files. You can also play with the other audio quality settings or lower the bitrate even further, to 64kbps for mono, if you want to go further still.</p>
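

<p>To see where these numbers come from, here is the back-of-the-envelope calculation as a quick sketch; the helper is mine, while the 25 MB limit and the bitrates are from above:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">API_UPLOAD_LIMIT_BYTES = 26214400  # 25mb

def max_minutes(kbps: int) -> float:
    """How many minutes of mp3 audio fit in the upload limit at a given bitrate."""
    bytes_per_minute = kbps * 1000 / 8 * 60
    return API_UPLOAD_LIMIT_BYTES / bytes_per_minute

print(f"{max_minutes(192):.0f} minutes at 192kbps")  # ~18 minutes
print(f"{max_minutes(80):.0f} minutes at 80kbps")    # ~44 minutes</pre>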



<p>After that, we run our <code>check_upload_size</code> check to make sure the file is not too large, and then we call our <code>openai_api.transcribe</code> function, passing in the <code>mp3_file</code> as the <code>file</code>, <code>language="en"</code> as the language, <code>translate=False</code> as we don&#8217;t want to translate, and <code>response_format="text"</code> as we want the transcription in text format. We then call our <code>openai_api.text_to_quiz</code> function, passing in the <code>transcription</code> as the <code>text</code> and returning the resulting <code>quiz</code>.</p>



<h2 class="wp-block-heading">Gradio Interface</h2>



<p>Finally, we&#8217;ll create our gradio interface:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">if __name__ == "__main__":
    block = gr.Blocks(
        css=str(STYLES_DIR / "vid2quiz.css"),
        theme=gr.themes.Soft(primary_hue=gr.themes.colors.yellow),
    )

    with block:
        with gr.Group():
            gr.HTML(
                f"""
                &lt;div class="header">
                &lt;img src="https://i.imgur.com/oEtZKEh.png" referrerpolicy="no-referrer" class="header-img" />
                &lt;/div>
                """
            )
            with gr.Row():
                input_video = gr.Video(
                    label="Input Video", sources=["upload"], mirror_webcam=False
                )
                output_quiz_text = gr.Textbox(label="Quiz")
            with gr.Row():
                button_text = "<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4dd.png" alt="📝" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Make a quiz about this video! <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4dd.png" alt="📝" class="wp-smiley" style="height: 1em; max-height: 1em;" />"
                btn = gr.Button(value=button_text, elem_classes=["button-row"])

            btn.click(main, inputs=[input_video], outputs=[output_quiz_text])

    block.launch(debug=True)</pre>



<p>All of this will be familiar by now; I just used a different CSS file, which we&#8217;ll have to create, and a slightly different <code>primary_hue</code> for the theme than last time. The &#8216;imgur&#8217; image link has changed as well to give you a new header logo, and below that we just take an input video and have an output <code>Textbox</code>. Our button has a CSS class of <code>button-row</code> again so we can style it, and clicking the button runs the <code>main</code> function with the input video, sending the output to the output textbox.</p>



<p>Let&#8217;s add the CSS file to our <code>styles</code> folder:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />subtitle_master.css
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />vid2quiz.css      (<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2728.png" alt="✨" class="wp-smiley" style="height: 1em; max-height: 1em;" />new file)
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />whisper_pods.css
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />utils
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />command.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />openai_api.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />podcast.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />subtitles.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />video.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />2_whisper_pods.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />3_subtitle_master.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />4_faster_whisper.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />4_vid_to_quiz.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env</pre>



<p>And inside <code>vid2quiz.css</code> let&#8217;s add the following:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">.header {
  display: flex;
  justify-content: center;
  align-items: center;
  padding: 2em 8em;
}

.header-img {
  max-width: 50%;
}

.header,
.button-row {
  background-color: #0c1d36;
}</pre>



<p>We use <code>flex</code> to center the header image vertically and horizontally and apply the usual padding. We give the <code>header-img</code> class a <code>max-width</code> of 50% so it doesn&#8217;t take up the entire width of the screen. Finally, we give the <code>header</code> and <code>button-row</code> classes a background color of <code>#0c1d36</code> which is a dark blue color.</p>



<p>Ok, you know the drill, let&#8217;s run it and see what happens!</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img decoding="async" src="https://academy.finxter.com/wp-content/uploads/2024/01/4_vid2quiz_interface-1024x531.png" alt="" class="wp-image-4066"/></figure>
</div>


<p>Ok, looking good, so let&#8217;s upload a video and then request a quiz about it. I used a random video from YouTube, namely <a href="https://www.youtube.com/watch?v=fb-58KobeFU">Hot Dr Pepper from the 1960s</a>, just because it showed up when I opened the YouTube website. Let&#8217;s see how it does:</p>



<figure class="wp-block-image size-large"><img decoding="async" src="https://academy.finxter.com/wp-content/uploads/2024/01/4_vid2quiz_output-1024x587.png" alt="" class="wp-image-4065"/></figure>



<p>Perfect, exactly what we wanted, and this was all powered by the OpenAI API! You&#8217;ll also notice it was probably reasonably fast, considering it had to convert the whole video and then transcribe it and generate a quiz.</p>



<p>One important limitation of the app in this particular form is that, because of the upload limit, it can only handle videos up to roughly 44 minutes in length with the 80kbps mono settings (80kbps works out to 600,000 bytes per minute, and 26,214,400 / 600,000 is about 44). If you want to handle longer videos you could split them up and put the transcripts back together, but honestly, if you&#8217;re going to be handling files of that length you&#8217;re probably better off deploying the model yourself to save cost, as the API is billed per minute of audio.</p>



<p>A fun idea: you can also use the <code>translate</code> option in our <code>openai_api.transcribe</code> function to take foreign-language videos as input and get English questions about them as output. This could be cool for a foreign-language learning app or test.</p>



<p>So that&#8217;s it for the whisper course. I hope you enjoyed it and now have a good idea of how to use Whisper, what you can use it for, and the various deployment options. The next step is up to you and limited only by your imagination!</p>



<p>As always, it was an honor and a pleasure to take this journey together, and I hope to see you next time!</p>



<h2 class="wp-block-heading">Full Course: OpenAI Whisper &#8211; Building Cutting-Edge Python Apps with OpenAI Whisper</h2>



<p>Check out our full OpenAI Whisper course with video lessons, easy explanations, GitHub, and a downloadable PDF certificate to prove your speech processing skills to your employer and freelancing clients:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://academy.finxter.com/university/openai-whisper/"><img fetchpriority="high" decoding="async" width="908" height="257" src="https://blog.finxter.com/wp-content/uploads/2024/01/image-154.png" alt="" class="wp-image-1654506" srcset="https://blog.finxter.com/wp-content/uploads/2024/01/image-154.png 908w, https://blog.finxter.com/wp-content/uploads/2024/01/image-154-300x85.png 300w, https://blog.finxter.com/wp-content/uploads/2024/01/image-154-768x217.png 768w" sizes="(max-width: 908px) 100vw, 908px" /></a></figure>
</div>


<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f449.png" alt="👉" class="wp-smiley" style="height: 1em; max-height: 1em;" /> [<strong>Academy</strong>] <a href="https://academy.finxter.com/university/openai-whisper/" data-type="link" data-id="https://academy.finxter.com/university/openai-whisper/">Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper</a></p>
<p>The post <a href="https://blog.finxter.com/openai-whisper-speeding-up-or-outsourcing-the-processing/">OpenAI Whisper &#8211; Speeding Up or Outsourcing the Processing</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>OpenAI Whisper Example &#8211; Building a Subtitle Generator &#038; Embedder</title>
		<link>https://blog.finxter.com/openai-whisper-example-building-a-subtitle-generator-embedder/</link>
		
		<dc:creator><![CDATA[Dirk van Meerveld]]></dc:creator>
		<pubDate>Thu, 25 Jan 2024 19:57:05 +0000</pubDate>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Large Language Model (LLM)]]></category>
		<category><![CDATA[OpenAI]]></category>
		<category><![CDATA[Speech Recognition and Generation]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1654504</guid>

					<description><![CDATA[<p>🎙️ Course: This article is based on a lesson from our Finxter Academy Course Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper. Check it out for video lessons, GitHub, and a downloadable PDF course certificate with your name on it! Welcome back to part 3, where we&#8217;ll use Whisper to build another really ... <a title="OpenAI Whisper Example &#8211; Building a Subtitle Generator &#038; Embedder" class="read-more" href="https://blog.finxter.com/openai-whisper-example-building-a-subtitle-generator-embedder/" aria-label="Read more about OpenAI Whisper Example &#8211; Building a Subtitle Generator &#038; Embedder">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/openai-whisper-example-building-a-subtitle-generator-embedder/">OpenAI Whisper Example &#8211; Building a Subtitle Generator &#038; Embedder</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f399.png" alt="🎙" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Course</strong>: This article is based on a lesson from our <strong>Finxter Academy Course</strong> <a href="https://academy.finxter.com/university/openai-whisper/"><em>Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper</em></a>. Check it out for video lessons, GitHub, and a downloadable PDF course certificate with your name on it!</p>



<p>Welcome back to part 3, where we&#8217;ll use Whisper to build another really cool app. In this part, we&#8217;ll look at how to work with video files. After all, many of the practical applications of speech recognition don&#8217;t come in convenient MP3 files, but rather in video files. We&#8217;ll be building a subtitle generator and embedder, which will take a video file as input, transcribe it, and then embed the subtitles into the video file itself, feeding the result back to the end user.</p>



<p>Before we can get started on the main code, we will need to write some utilities again, just like in the previous part. The utilities we&#8217;ll need this time are:</p>



<ul class="wp-block-list">
<li>Subtitles -&gt; We can just reuse the subtitle-to-disk utility from the previous part. (Done<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2714.png" alt="✔" class="wp-smiley" style="height: 1em; max-height: 1em;" />)</li>



<li>Video -&gt; We will need a way to convert a video file to an mp3 file so that we can feed it to Whisper.</li>



<li>Commands -&gt; We will need a way to run commands on the command line, as there are multiple ffmpeg commands we&#8217;ll need to run both for the video conversion and the subtitle embedding.</li>
</ul>



<p>So let&#8217;s get started with the command utility. Inside the <code>utils</code> folder, first create a new file named <code>command.py</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />utils
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />__init__.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />podcast.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />subtitles.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />command.py   (<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2728.png" alt="✨" class="wp-smiley" style="height: 1em; max-height: 1em;" />new file)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />2_whisper_pods.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env</pre>



<p>Then inside the <code>command.py</code> file let&#8217;s start with our imports:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import datetime
import subprocess
from pathlib import Path</pre>



<p>We&#8217;re going to run commands and provide some very basic logging as well. We imported the <code>datetime</code> module so we can add timestamps to our logs, and <code>pathlib</code> should be familiar by now. The <code>subprocess</code> module in Python is used to spawn new processes, connect to their input/output/error pipes, and obtain their return codes. It allows you to execute system commands and interact with them programmatically. It&#8217;s basically a bit like opening a terminal window inside your Python code.</p>



<p>Next, we&#8217;ll start with an extremely simple function that prints a message in blue letters:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def print_blue(message: str) -> None:
    print(f"\033[94m{message}\033[00m")</pre>



<p>The <code>\033[94m</code> and <code>\033[00m</code> are ANSI escape codes, which are used to add color and formatting to text in terminal output. The <code>94</code> is the code for blue, and the <code>00</code> is the code for reset. You can find a list of all the codes here: https://en.wikipedia.org/wiki/ANSI_escape_code#Colors. We will print the commands we execute to the terminal in blue, which helps them stand out from the other white text output and makes it easier for us to check our commands.</p>
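


<p>If you ever want other colors, the pattern is identical; here&#8217;s a small sketch using two more codes from that table (<code>93</code> is bright yellow and <code>91</code> is bright red):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def print_yellow(message: str) -> None:
    print(f"\033[93m{message}\033[00m")  # 93 = bright yellow


def print_red(message: str) -> None:
    print(f"\033[91m{message}\033[00m")  # 91 = bright red</pre>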



<h2 class="wp-block-heading">Running system commands</h2>



<p>Next, we&#8217;ll create a function that will run a command just like you would on the command line:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def run_and_log(command: str, log_directory: Path) -> None:
    print_blue(f"Running command: \n{command}")
    with open(log_directory / "commands_log.txt", "a+", encoding="utf-8") as file:
        subprocess.call(
            command,
            stdout=file,
            stderr=file,
        )
        file.write(
            f"\nRan command: {command}\nDate/time: {datetime.datetime.now()}\n\n\n\n"
        )</pre>



<p>We create a function called <code>run_and_log</code>, which takes two arguments: <code>command</code>, which is a string, and <code>log_directory</code>, which is a Path indicating the directory where we want to save the log file. We then print the command we&#8217;re about to execute in blue, and then open the log file in append mode. The <code>a+</code> means that we will append to the file if it exists, and create it if it doesn&#8217;t. Again, we use the <code>encoding="utf-8"</code> argument to make sure that we can write non-ASCII characters to the file as well. If you do not do this, you will eventually run into trouble.</p>



<p>Inside the <code>with open</code> context manager, so while the file is open, we call the <code>subprocess.call</code> function. This function takes a command as input and executes it, so as the first argument we pass the <code>command</code> variable. The second argument is <code>stdout=file</code>, which means that we will write the output of the command to the file (instead of the console). The third argument is <code>stderr=file</code>, which means that we will write any errors to the file as well. So we basically execute the command and whatever output there is gets logged inside the text file.</p>



<p>After that, we write what command we executed and a timestamp to the file, and use a couple of <code>\n</code> to add some newlines to the file so that the next command will be lower down, making them easy to distinguish from each other.</p>
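


<p>One portability note: passing the command as a single string like this is typically fine on Windows, but on macOS or Linux <code>subprocess.call</code> will raise a <code>FileNotFoundError</code> for a plain command string. If that happens on your system, either split the string into an argument list or pass <code>shell=True</code>; a minimal sketch:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import shlex
import subprocess

# Option 1: split the command string into an argument list:
subprocess.call(shlex.split("echo 'hello'"))  # ['echo', 'hello']

# Option 2: let the system shell interpret the string
# (be careful with untrusted input when using shell=True):
subprocess.call("echo 'hello'", shell=True)</pre>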



<p>Now let&#8217;s do a quick test using the extremely simple terminal command <code>echo 'hello'</code>, which just prints <code>hello</code> to the console. Let&#8217;s pass this command to our function and see if it works:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">run_and_log("echo 'hello'", Path.cwd())</pre>



<p>For the path we&#8217;ve used the <code>Path.cwd()</code> method from Python&#8217;s <code>pathlib</code> module, which returns the current working directory as a <code>Path</code> object. This is the terminal&#8217;s current directory when you run the script. (This is just for a quick test; we don&#8217;t want to go through the trouble of importing the base directory in here.)</p>



<p>Go ahead and run the <code>command.py</code> file, and whatever directory your terminal was in when you ran the script should now have a file named <code>commands_log.txt</code> with the following inside:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">hello

Ran command: echo 'hello'
Date/time: 2024-01-14 12:13:49.535692</pre>



<p>It worked! We&#8217;ve successfully logged the output <code>hello</code>, followed by the command that was executed and a timestamp. Make sure you remove or comment out the <code>run_and_log</code> line before we continue, as we don&#8217;t want to run this command every time we run the script.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># run_and_log("echo 'hello'", Path.cwd())</pre>



<h2 class="wp-block-heading">A peculiar issue with slashes</h2>



<p>With our <code>run_and_log</code> function completed, we have just one more function to create in here. There is a small discrepancy in file-path formats: ffmpeg expects a different format in its system commands than the one our Python code produces, so we need to write a short utility to fix the path. This issue only occurs with the subtitle path when trying to embed the subtitles using ffmpeg system commands, and I&#8217;m honestly not sure why it occurs, but this is the type of thing you will run into during your software development journey.</p>



<p>If you keep looking you&#8217;ll always find a solution, so never despair. This time, though, I&#8217;ll save you the search and tell you about the issue ahead of time!</p>



<ul class="wp-block-list">
<li>The path <code>C:\Users\dirk\test/subtitle.vtt</code> will not work in the command and will give errors, as it gets mangled and can no longer be parsed as a valid path.</li>



<li>What we need is <code>C\:\\Users\\dirk\\test\\subtitle.vtt</code> instead. Notice there is an extra <code>\</code> after the <code>C</code> and after every <code>\</code> in the path. The first <code>\</code> is an escape character, which means that the second <code>\</code> is not interpreted as a special character but as a literal <code>\</code>.</li>



<li>This issue only affects the subtitle path and not the input or output video paths, so we only need to fix the subtitle path.</li>
</ul>



<p>Below the <code>run_and_log</code> function inside the <code>command.py</code> file, add a new function:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def format_ffmpeg_filepath(path: Path) -> str:
    """Turns C:\Users\dirk\test/subtitle.vtt into C\:\\Users\\dirk\\test\\subtitle.vtt"""
    string_path = str(path)
    return string_path.replace("\\", "\\\\").replace("/", "\\\\").replace(":", "\\:")</pre>



<p>We take a <code>Path</code> as input, and then first convert it to a string so we can use string methods on it to fix the format. We then use the <code>replace</code> method to replace every <code>\</code> with <code>\\</code> and every <code>/</code> with <code>\\</code>. We also replace the <code>:</code> with <code>\:</code>. Now I see you looking mighty confused! Why so many backslashes? Well, remember the first <code>\</code> is the escape character, which means the second one is interpreted not as the start of an escape sequence but as a literal backslash character.</p>



<ul class="wp-block-list">
<li>So in order to target <code>\</code> we need to write <code>\\</code>: the first backslash escapes the second, telling Python we mean the literal <code>\</code> character rather than the start of an escape sequence, so a single <code>\</code> on its own won&#8217;t work.</li>



<li>Likewise, to replace it with <code>\\</code> we need to type <code>\\\\</code>, as every backslash we want in the output needs another backslash to escape it, so that each second backslash is interpreted as a literal backslash character.</li>



<li>So the above function just means that <code>\</code> is replaced by <code>\\</code>, <code>/</code> is replaced by <code>\\</code>, and <code>:</code> is replaced by <code>\:</code>. It just looks so confusing because all the extra escape characters also happen to be backslashes! Phew<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f92f.png" alt="🤯" class="wp-smiley" style="height: 1em; max-height: 1em;" />.</li>
</ul>
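


<p>A tiny sketch of the function in action on Windows (the path is hypothetical, purely for illustration):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from pathlib import Path

# Hypothetical subtitle path, just to demonstrate the conversion:
subs = Path(r"C:\Users\dirk\test") / "subtitle.vtt"
print(format_ffmpeg_filepath(subs))
# C\:\\Users\\dirk\\test\\subtitle.vtt</pre>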



<h2 class="wp-block-heading">Video utility functions</h2>



<p>Okay so with that out of the way, go ahead and save and close the <code>command.py</code> file. It&#8217;s time for our <code>video</code> utility file next, so create a new file called <code>video.py</code> inside the utils folder:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />utils
            <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />__init__.py
            <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />podcast.py
            <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />subtitles.py
            <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />command.py
            <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />video.py   (<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2728.png" alt="✨" class="wp-smiley" style="height: 1em; max-height: 1em;" />new file)
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />2_whisper_pods.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env</pre>



<p>Don&#8217;t worry, this one won&#8217;t be so bad <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f600.png" alt="😀" class="wp-smiley" style="height: 1em; max-height: 1em;" />! Open up your new <code>video.py</code> file and let&#8217;s start with our imports:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from pathlib import Path
from . import command</pre>



<p>All we need is <code>Path</code> for input argument type-hinting and the <code>command</code> module we just created. Next, we&#8217;ll create a function that will convert a video file to an mp3 file so it can be fed to Whisper:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def to_mp3(
    input_video: str, log_directory: Path, output_path: Path, mono: bool = False
) -> str:
    output_path_string = str(output_path)

    channels = 1 if mono else 2
    bitrate = 80 if mono else 192

    command_to_run = f'ffmpeg -i "{input_video}" -vn -ar 44100 -ac {channels} -b:a {bitrate}k "{output_path_string}"'
    command.run_and_log(command_to_run, log_directory)
    print(f"Video converted to mp3 and saved to {output_path_string}")

    return output_path_string</pre>



<p>We define a function named <code>to_mp3</code> which takes an <code>input_video</code> as a string, a <code>log_directory</code> as a Path, an <code>output_path</code> as a Path, and a <code>mono</code> option as a boolean. The function returns a string holding the output path. The <code>input_video</code> path is a string because gradio will feed it to us as one, which is why it is not a <code>Path</code> object like the <code>log_directory</code> and <code>output_path</code>. Make sure you always keep track of the type of every variable, or you will eventually run into trouble passing a Path object where a string is expected, or vice versa.</p>



<p>First, we get a string version of the <code>output_path</code> and save it in <code>output_path_string</code>. Then we check if the <code>mono</code> option is set to <code>True</code> or <code>False</code>, and set the <code>channels</code> and <code>bitrate</code> variables accordingly. If <code>mono</code> is <code>True</code> we set <code>channels</code> to <code>1</code> and <code>bitrate</code> to <code>80</code>, and if <code>mono</code> is <code>False</code> we set <code>channels</code> to <code>2</code> and <code>bitrate</code> to <code>192</code>. We won&#8217;t actually need this mono option until part 4, but we might as well add it now.</p>



<p>Then we get to the command, first preparing it in a variable named <code>command_to_run</code>. We use the <code>ffmpeg</code> command and pass in the <code>input_video</code> as the input file (<code>-i</code>). We then use the <code>-vn</code> option to disable video recording, the <code>-ar</code> option to set the audio sampling frequency to 44100 Hz, the <code>-ac</code> option to set the number of audio channels to <code>channels</code>, and the <code>-b:a</code> option to set the audio bitrate to <code>bitrate</code> kbps. We then pass in the <code>output_path_string</code> as the output file location.</p>
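


<p>To make that concrete, here is roughly what an assembled command could look like with the default <code>mono=False</code> (the file paths here are made up, purely for illustration):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">ffmpeg -i "C:\videos\my video.mp4" -vn -ar 44100 -ac 2 -b:a 192k "C:\FINX_WHISPER\output_temp_files\my video.mp3"</pre>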



<p>Notice that the command is contained inside an f-string which has single quotes on the outside (<code>f'command'</code>). Make sure you imitate this perfectly, using the single quotes on the outside and the double quotes around the variable names of <code>"{input_video}"</code> and <code>"{output_path_string}"</code>. We need these double quotes because the user input video file is likely to have spaces in the name, and not having double quotes around a name with spaces inside will cause the command to fail.</p>



<p>Then we call the <code>run_and_log</code> function from our <code>command</code> module, passing in the command and the directory we want to log to. Finally, we print a message to the console and return the <code>output_path_string</code>.</p>



<p>That completes our <code>video.py</code> file, go ahead and save and close it. We&#8217;re ready to start on the main code now!</p>



<h2 class="wp-block-heading">Subtitle Master &#8211; Putting it all together</h2>



<p>In your root folder, create a new file named <code>3_subtitle_master.py</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />utils
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />__init__.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />podcast.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />subtitles.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />command.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />video.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />2_whisper_pods.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />3_subtitle_master.py   (<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2728.png" alt="✨" class="wp-smiley" style="height: 1em; max-height: 1em;" />new file)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env</pre>



<p>Inside, let&#8217;s start with our imports:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import os
import uuid

import gradio as gr
import whisper
from whisper.utils import WriteVTT

from settings import BASE_DIR, OUTPUT_TEMP_DIR, OUTPUT_VIDEO_DIR, STYLES_DIR
from utils import command, subtitles, video</pre>



<p>We import <code>os</code> to do some filename splitting, and all the other imports are familiar from previous parts. To finish up, we import several directory paths from our <code>settings</code> file and the <code>command</code>, <code>subtitles</code>, and <code>video</code> modules from our <code>utils</code> folder, reusing the <code>subtitles</code> module from the previous part.</p>



<p>Next up are our constants for the file:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">MODEL = whisper.load_model("base.en")
VTT_WRITER = WriteVTT(output_dir=str(OUTPUT_TEMP_DIR))</pre>



<p>We just load up a model; I&#8217;ll start with <code>base.en</code> as it will probably be good enough to get started. Then we instantiate a <code>WriteVTT</code> object like we did last time, indicating we want to save the subtitles in the temp directory.</p>



<p>As we are going to be returning a video to the end user this time, I would like to include the original video name in the output file, though we&#8217;ll still need a uuid as well to guarantee unique names (the user might upload the same file twice!). So let&#8217;s create a quick function that gets us a unique project name. Say the user inputs a file named <code>my_video.mp4</code>; we want the function to return <code>my_video_0f646333-0464-43a1-a75c-ed57c47fbcd5</code>, so that we basically have a uuid with the filename in front of it. We can then add <code>.mp3</code> or <code>.srt</code> or whatever file extension we need at the end, making sure all the files for this project have the same but unique project name.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def get_unique_project_name(input_video: str) -> str:
    """Get a unique subtitle-master project name to avoid file-name clashes."""
    unique_id = uuid.uuid4()
    filename = os.path.basename(input_video)
    base_fname, _ = os.path.splitext(filename)
    return f"{base_fname}_{unique_id}"</pre>



<p>The function takes the input path as a string and then generates a <code>uuid</code>. We then get the filename using <code>os.path.basename</code>, which takes a path like <code>C:\Users\dirk\test\my_video.mp4</code> and returns <code>my_video.mp4</code>. We then use <code>os.path.splitext</code> to split the filename into a base filename and an extension, so <code>my_video.mp4</code> becomes <code>my_video</code> and <code>.mp4</code>. We catch the base name as <code>base_fname</code> and the extension under the variable name <code>_</code> as we don&#8217;t need it. We then return the base filename with the uuid appended to it.</p>
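


<p>As a quick sketch, calling it with the example path from above would give us something like this (the uuid will of course differ on every call):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">print(get_unique_project_name(r"C:\Users\dirk\test\my_video.mp4"))
# my_video_0f646333-0464-43a1-a75c-ed57c47fbcd5</pre>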



<p>Now let&#8217;s get started on our main function below that will tie it all together:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def main(input_video: str) -> str:
    """Takes a video file as string path and returns a video file with subtitles embedded as string path."""
    unique_project_name = get_unique_project_name(input_video)
    get_temp_output_path = lambda ext: OUTPUT_TEMP_DIR / f"{unique_project_name}{ext}"
    mp3_file = video.to_mp3(
        input_video,
        log_directory=BASE_DIR,
        output_path=get_temp_output_path(".mp3"),
    )</pre>



<p>We&#8217;ll take an input video, which gradio will pass to our main function as a string path. The function will return a string path pointing to the processed video file with embedded subtitles, which goes back to gradio. First, we get a unique project name using the function we just wrote. Then we create a simple lambda function like the one we had in part 2. It takes an extension like <code>.mp3</code> as input and returns <code>output_dir/project_name.mp3</code>. We&#8217;ll need temporary file paths for both our <code>.mp3</code> and our <code>.vtt</code> files, and this way we only have one place to change if we ever need to change the output directory.</p>
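


<p>Concretely, here&#8217;s an illustration of what the lambda hands back (the shortened project name is hypothetical, just for readability):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Illustration only -- suppose unique_project_name == "my_video_1234":
get_temp_output_path(".mp3")  # OUTPUT_TEMP_DIR / "my_video_1234.mp3"
get_temp_output_path(".vtt")  # OUTPUT_TEMP_DIR / "my_video_1234.vtt"</pre>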



<p>Then we call the <code>to_mp3</code> function from our <code>video</code> module, passing in the input video, the project&#8217;s base directory as the log directory, and the output path as the <code>get_temp_output_path</code> lambda function with <code>.mp3</code> as the extension. We save the return of the function as the variable named <code>mp3_file</code>.</p>



<p>Continuing on:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def main(input_video: str) -> str:
    ...previous code...

    whisper_output = MODEL.transcribe(mp3_file, beam_size=5)
    vtt_subs = subtitles.write_to_file(
        whisper_output,
        writer=VTT_WRITER,
        output_path=get_temp_output_path(".vtt"),
    )</pre>



<p>We call the <code>transcribe</code> method on our <code>MODEL</code> object, which has an instance of Whisper, passing in the <code>mp3_file</code> as the input file, and setting the <code>beam_size</code> to <code>5</code>. We then call the <code>write_to_file</code> function from our <code>subtitles</code> module, passing in the <code>whisper_output</code> as the transcript, the <code>VTT_WRITER</code> as the writer, and the <code>get_temp_output_path</code> lambda function with <code>.vtt</code> as the extension as the output path.</p>



<p>So what is this <code>beam_size</code> parameter? Well, it&#8217;s one of a number of possible parameters we can pass into the <code>transcribe</code> method. The <code>beam_size</code> parameter is the number of beams to use in the beam search. The higher the number, the more accurate the transcription will be, but the slower it will be as well. The default is <code>5</code>, and I&#8217;ve found that this is a good balance between speed and accuracy. The only reason I&#8217;ve passed it in explicitly here is to make you aware of these parameters. It basically refers to the number of different potential paths that will be explored, from which the most likely one is chosen. Here are some of the other possible parameters:</p>



<ul class="wp-block-list">
<li><code>temperature</code> -&gt; The higher the temperature, the more likely it is that the model will choose a less likely token. You can think of it in a similar way as the <code>temperature</code> setting you get with ChatGPT calls. The default is <code>0</code> and will simply always return the most likely predictions only; <code>0</code> is what we have been using so far.</li>



<li><code>beam_size</code> -&gt; The number of beams to use in the beam search. We just discussed this one above. It is only applicable when the temperature is set to <code>0</code>, and its default value is <code>5</code>.</li>



<li><code>best_of</code> -&gt; The number of candidate samples to draw. Only for use with a nonzero temperature, and it will generate more diverse (and possibly wrong) samples.</li>



<li><code>task</code> -&gt; Either <code>transcribe</code> or <code>translate</code>. We&#8217;ve used this one before and it defaults to <code>transcribe</code>.</li>



<li><code>language</code> -&gt; The language spoken in the audio. Defaults to <code>None</code>, which will perform a language detection first.</li>



<li><code>device</code> -&gt; The device to use for inference. This one is set when loading the model, e.g. <code>whisper.load_model("base.en", device="cpu")</code>, rather than passed to <code>transcribe</code>. It defaults to <code>cuda</code> if you have a CUDA-enabled GPU, otherwise it will default to <code>cpu</code>.</li>



<li><code>verbose</code> -&gt; Controls the console output during transcription. <code>True</code> prints the decoded text as it goes, <code>False</code> shows minimal progress details, and the default of <code>None</code> prints nothing extra.</li>
</ul>



<p>And there are more. For general use, you&#8217;ll probably do fine with the defaults most of the time, but be aware that you can tweak these parameters to get better results if you need to.</p>
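


<p>As a quick, purely illustrative sketch (the values here are arbitrary, not recommendations), a call tweaking a few of these parameters might look like this:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">result = MODEL.transcribe(
    mp3_file,
    temperature=0.2,    # nonzero: sample instead of deterministic decoding
    best_of=3,          # number of candidates to sample (temperature > 0)
    task="transcribe",  # or "translate" to translate into English
    language="en",      # declare the audio language to skip detection
    verbose=False,      # minimal progress details in the console
)
print(result["text"])</pre>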



<p>Back to our code, let&#8217;s continue:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def main(input_video: str) -> str:
    ...previous code...

    vtt_string_path = command.format_ffmpeg_filepath(vtt_subs)
    output_video_path = OUTPUT_VIDEO_DIR / f"{unique_project_name}_subs.mp4"
    embed_subs_into_vid_command = f'ffmpeg -i "{input_video}" -vf "subtitles=\'{vtt_string_path}\'" "{output_video_path}"'

    command.run_and_log(embed_subs_into_vid_command, log_directory=BASE_DIR)

    return str(output_video_path)</pre>



<p>We need to run another <code>ffmpeg</code> system command to embed the subtitles we have created into our video file. We first get the <code>vtt_string_path</code> by passing the <code>vtt_subs</code> path we already have into that crazy function with all the <code>\\\\</code> backslashes we called <code>format_ffmpeg_filepath</code>, remember? After that, we save our desired output video path in a variable by just combining our <code>OUTPUT_VIDEO_DIR</code> with the <code>unique_project_name</code> and pasting <code>_subs.mp4</code> at the end for good measure.</p>



<p>Now we prepare the <code>ffmpeg</code> command we&#8217;re about to run in a separate variable for readability. We use the <code>input_video</code> as the input file (<code>-i</code>), and then use the <code>-vf</code> option to add a video filter. The video filter we use is <code>subtitles</code> and we pass in the <code>vtt_string_path</code> as the subtitle file. We then pass in the <code>output_video_path</code> as the output file.</p>



<p>Notice again that the whole command is inside single quotes <code>'</code>, inside of which we have path variables in double quotes <code>"</code> to avoid trouble if there are spaces in the filename. But we also have to pass in <code>"subtitles='{vtt_string_path}'"</code>, which requires yet another level of quoting. Going back to plain single quotes <code>'</code> would cause trouble, as we already used those to open the f-string, so we have to escape them with a backslash as <code>\'</code> instead.</p>
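


<p>For illustration, the finished embed command could look roughly like this, with hypothetical paths and the uuid shortened to <code>1234</code> for readability:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">ffmpeg -i "C:\videos\my video.mp4" -vf "subtitles='C\:\\FINX_WHISPER\\output_temp_files\\my video_1234.vtt'" "C:\FINX_WHISPER\output_video\my video_1234_subs.mp4"</pre>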



<p>Then we call the <code>run_and_log</code> function from our <code>command</code> module, passing in the command we just wrote, and the <code>BASE_DIR</code> as the log directory. We then return the <code>output_video_path</code> as a string, as gradio doesn&#8217;t want a Path object.</p>



<p>The whole <code>main</code> function now looks like this:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def main(input_video: str) -> str:
    """Takes a video file as string path and returns a video file with subtitles embedded as string path."""
    unique_project_name = get_unique_project_name(input_video)
    get_temp_output_path = lambda ext: OUTPUT_TEMP_DIR / f"{unique_project_name}{ext}"
    mp3_file = video.to_mp3(
        input_video,
        log_directory=BASE_DIR,
        output_path=get_temp_output_path(".mp3"),
    )

    whisper_output = MODEL.transcribe(mp3_file, beam_size=5)
    vtt_subs = subtitles.write_to_file(
        whisper_output,
        writer=VTT_WRITER,
        output_path=get_temp_output_path(".vtt"),
    )

    vtt_string_path = command.format_ffmpeg_filepath(vtt_subs)
    output_video_path = OUTPUT_VIDEO_DIR / f"{unique_project_name}_subs.mp4"
    embed_subs_into_vid_command = f'ffmpeg -i "{input_video}" -vf "subtitles=\'{vtt_string_path}\'" "{output_video_path}"'

    command.run_and_log(embed_subs_into_vid_command, log_directory=BASE_DIR)

    return str(output_video_path)</pre>



<h2 class="wp-block-heading">Building the interface</h2>



<p>Now all we need to do to run this is create another gradio interface. As you are already familiar with gradio by now, we&#8217;ll go through this one a bit more quickly; the principles are the same as last time. Below your main function, continue with:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">if __name__ == "__main__":
    block = gr.Blocks(
        css=str(STYLES_DIR / "subtitle_master.css"),
        theme=gr.themes.Soft(primary_hue=gr.themes.colors.emerald),
    )

    with block:
        with gr.Group():
            gr.HTML(
                f"""
                &lt;div class="header">
                &lt;img src="https://i.imgur.com/dxHMfCI.png" referrerpolicy="no-referrer" />
                &lt;/div>
                """
            )
            with gr.Row():
                input_video = gr.Video(
                    label="Input Video", sources=["upload"], mirror_webcam=False
                )
                output_video = gr.Video()
            with gr.Row():
                button_text = "<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f39e.png" alt="🎞" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Subtitle my video! <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f39e.png" alt="🎞" class="wp-smiley" style="height: 1em; max-height: 1em;" />"
                btn = gr.Button(value=button_text, elem_classes=["button-row"])

            btn.click(main, inputs=[input_video], outputs=[output_video])

    block.launch(debug=True)</pre>



<p>We use the <code>if __name__ == "__main__":</code> guard to make sure that the code inside only runs when we run the file directly. We create the gradio <code>block</code> object just like we did before, passing in a <code>css</code> file that doesn&#8217;t exist yet, but this time we also pass in a <code>theme</code>. I&#8217;ll pass in the <code>gr.themes.Soft()</code> which has a bit of a different style to it, and set the accent color to emerald by passing in <code>primary_hue=gr.themes.colors.emerald</code> when calling <code>Soft()</code>. This will match nicely with the logo I have prepared for you with this application.</p>



<p>Then we open the <code>block</code> object using the with statement, and open up a new <code>Group</code> inside of it, just like we did before, so we can build our block interface. The HTML object is the same as in the last part, except I changed the image link URL to give you a new logo for this app. Then we open up a new <code>Row</code> and add a <code>Video</code> object for the input video, passing in <code>sources=["upload"]</code> so the user uploads a video file rather than recording from their webcam, and setting <code>mirror_webcam=False</code> since we don&#8217;t want webcam input anyway. Still on the same <code>Row</code>, so next to the input video, we declare another <code>Video</code> object for the output video file.</p>



<p>We then have a row that only holds a button, for which we provide a text label and a class of <code>button-row</code> so we can target it with CSS. The <code>btn.click</code> declaration is a lot simpler this time, as we just call the <code>main</code> function with only a single input of <code>input_video</code> and only one output of <code>output_video</code>. Finally, we call <code>.launch</code> on the block just like last time.</p>



<p>That&#8217;s our code done! You&#8217;re probably dying to run it, but wait! We have to create a quick CSS file to finish it off. Create a new file named <code>subtitle_master.css</code> inside the <code>styles</code> folder:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />subtitle_master.css   (<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2728.png" alt="✨" class="wp-smiley" style="height: 1em; max-height: 1em;" />new file)
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />whisper_pods.css
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />utils
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />__init__.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />podcast.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />subtitles.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />command.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />video.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />2_whisper_pods.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />3_subtitle_master.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env</pre>



<p>Inside we&#8217;ll just write some quick CSS styles:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">.header {
  padding: 2em 8em;
}

.header,
.button-row {
  background-color: #1d366f7e;
}</pre>



<p>We just gave the <code>header</code> class some padding to stop the logo image from being too large, and then gave both the <code>header</code> and <code>button-row</code> classes a background color of <code>#1d366f7e</code>, which is a nice half-transparent dark blue. Save and close the file, and we&#8217;re ready to run! Go ahead and run the <code>3_subtitle_master.py</code> file, and give it some time to load. Click the link in your terminal window again to open the interface in your browser, and you should see something like this:</p>



<figure class="wp-block-image size-large"><img decoding="async" src="https://academy.finxter.com/wp-content/uploads/2024/01/3_subtitle_master-1024x543.png" alt="" class="wp-image-4061"/></figure>



<p>Yours won&#8217;t have Korean in the input video box though, but whatever language your computer is set to. Go ahead and upload a video file, wait a second for it to load, and then press the <code>Subtitle my video</code> button. This may take quite a while if you&#8217;re not on a fast system with a powerful GPU, but you&#8217;ll see the commands and steps being executed in your terminal window just like we set up. Eventually, you&#8217;ll see the output video appear with the subtitles embedded, each one perfectly in time with the video, and you can play it back and download it!</p>



<figure class="wp-block-image size-large"><img decoding="async" src="https://academy.finxter.com/wp-content/uploads/2024/01/3_subtitle_output-1024x609.png" alt="" class="wp-image-4060"/></figure>



<p>You can check the <code>commands_log.txt</code> file in the root directory to see all the commands that were run, the <code>output_temp_files</code> folder to see the temporary files that were created during the process, and the <code>output_video</code> folder to see the final output video file. If you need some extra quality, load a larger model like <code>small.en</code> or <code>medium.en</code>.</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p>That&#8217;s pretty awesome! An automatic subtitler that will subtitle any video for you all on its own. You could build on this by accepting YouTube links or adding translation functionality so you can have English subtitles on foreign-language videos, which could be great for language learning. Just make sure you don&#8217;t use a <code>.en</code> model if you want to work with other languages, obviously.</p>



<p>To make a real production-grade application, use a front-end framework, and show some kind of progress indicator or stream the live transcription to the page so the user doesn&#8217;t get bored, or allow them to do something else while the file processes in the background. A production app would also have to run on a server with good processing power and a GPU.</p>



<p>That&#8217;s it for part 3, I&#8217;ll see you soon in part 4 where we&#8217;ll look at ways to speed up Whisper or outsource the processing using the OpenAI API endpoint in the cloud. We&#8217;ll also build one more app using the cloud API to round off the series. See you there soon!</p>



<h2 class="wp-block-heading">Full Course: OpenAI Whisper &#8211; Building Cutting-Edge Python Apps with OpenAI Whisper</h2>



<p>Check out our full OpenAI Whisper course with video lessons, easy explanations, GitHub, and a downloadable PDF certificate to prove your speech processing skills to your employer and freelancing clients:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://academy.finxter.com/university/openai-whisper/"><img decoding="async" width="908" height="257" src="https://blog.finxter.com/wp-content/uploads/2024/01/image-154.png" alt="" class="wp-image-1654506" srcset="https://blog.finxter.com/wp-content/uploads/2024/01/image-154.png 908w, https://blog.finxter.com/wp-content/uploads/2024/01/image-154-300x85.png 300w, https://blog.finxter.com/wp-content/uploads/2024/01/image-154-768x217.png 768w" sizes="(max-width: 908px) 100vw, 908px" /></a></figure>
</div>


<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f449.png" alt="👉" class="wp-smiley" style="height: 1em; max-height: 1em;" /> [<strong>Academy</strong>] <a href="https://academy.finxter.com/university/openai-whisper/" data-type="link" data-id="https://academy.finxter.com/university/openai-whisper/">Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper</a></p>
<p>The post <a href="https://blog.finxter.com/openai-whisper-example-building-a-subtitle-generator-embedder/">OpenAI Whisper Example &#8211; Building a Subtitle Generator &#038; Embedder</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>OpenAI Whisper &#8211; Building a Podcast Transcribing App in Python</title>
		<link>https://blog.finxter.com/openai-whisper-building-a-podcast-transcribing-app-in-python/</link>
		
		<dc:creator><![CDATA[Dirk van Meerveld]]></dc:creator>
		<pubDate>Thu, 25 Jan 2024 19:56:17 +0000</pubDate>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Large Language Model (LLM)]]></category>
		<category><![CDATA[OpenAI]]></category>
		<category><![CDATA[Speech Recognition and Generation]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1654503</guid>

					<description><![CDATA[<p>🎙️ Course: This article is based on a lesson from our Finxter Academy Course Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper. Check it out for video lessons, GitHub, and a downloadable PDF course certificate with your name on it! Welcome back to part 2, where we&#8217;ll start practically applying our Whisper skills ... <a title="OpenAI Whisper &#8211; Building a Podcast Transcribing App in Python" class="read-more" href="https://blog.finxter.com/openai-whisper-building-a-podcast-transcribing-app-in-python/" aria-label="Read more about OpenAI Whisper &#8211; Building a Podcast Transcribing App in Python">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/openai-whisper-building-a-podcast-transcribing-app-in-python/">OpenAI Whisper &#8211; Building a Podcast Transcribing App in Python</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f399.png" alt="🎙" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Course</strong>: This article is based on a lesson from our <strong>Finxter Academy Course</strong> <a href="https://academy.finxter.com/university/openai-whisper/"><em>Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper</em></a>. Check it out for video lessons, GitHub, and a downloadable PDF course certificate with your name on it!</p>



<p>Welcome back to part 2, where we&#8217;ll start practically applying our Whisper skills to build useful stuff. We obviously cannot rely on the user always handing us MP3 files to transcribe; they may want to just link a podcast, for example. Here, we&#8217;ll be building a real application that can transcribe podcasts to text or subtitle format, taking just a podcast link as input.</p>



<p>Before we get started on the main code, we&#8217;ll do some basic setup work and create the helper functions we need to run in our main code. Keeping things separated across multiple functions and files will keep our code a lot cleaner and more readable than one big script that does everything at once.</p>



<h2 class="wp-block-heading">Saving our constants to a separate file</h2>



<p>First, there are a couple of settings we&#8217;ll be using again and again over the next three parts, namely the paths to the input and output folders for the mp3 files, subtitles, and whatever else we will be processing. Instead of importing <code>pathlib</code> in every single file and then writing <code>BASE_DIR = Path(__file__).parent</code> we&#8217;ll just write this in a separate file and import it everywhere we need it. This will also make it easier to change the paths later if we need to.</p>



<p>In your project folder create a new file called <code>settings.py</code>, making sure to put it in the root folder of your project:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py</pre>



<p>In <code>settings.py</code>, write the following code:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from pathlib import Path

BASE_DIR = Path(__file__).parent
OUTPUT_TEMP_DIR = BASE_DIR / "output_temp_files"
OUTPUT_VIDEO_DIR = BASE_DIR / "output_video"
STYLES_DIR = BASE_DIR / "styles"
TEST_AUDIO_DIR = BASE_DIR / "test_audio_files"</pre>



<p>We first get the root directory of the project using <code>Path(__file__).parent</code>, and then we create a few more paths relative to the root directory. We&#8217;ll use these paths in our main code to save the output files to the correct folders. Go ahead and also create empty folders for the <code>output_temp_files</code>, <code>output_video</code>, and <code>styles</code> folders, making sure to spell them correctly:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files     (new empty folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video          (new empty folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles                (new empty folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files      (already existing folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py</pre>



<p>That&#8217;s our folders and paths setup done; as sketched above, we can simply import these variables to access the folders from any file in our project. There is one more <code>setting</code> we need to define, but we cannot hardcode this one in our source code. We need to get our API key for OpenAI, as we&#8217;ll be using some ChatGPT in this part of the course. You&#8217;ll also need your API key for later parts. Go to https://platform.openai.com/api-keys and copy your API key. If you don&#8217;t have one, make sure to get one. You&#8217;ll only pay for what you use, which will be mere cents if you just play around with it casually. Then create a new file called <code>.env</code> in the root folder of your project:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env                  (new file)</pre>



<p>And paste your API key in there like this, making sure not to use any spaces or quotes:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">OPENAI_API_KEY=your_api_key_here</pre>



<p>Then go ahead and save and close this file.</p>



<h2 class="wp-block-heading">Creating a utils folder for our helper functions</h2>



<p>Now let&#8217;s create a new folder named <code>utils</code> to hold our helper functions, and then inside this new folder create an empty file called <code>__init__.py</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />utils                 (new folder)
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />__init__.py       (new empty file)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env</pre>



<p>The <code>__init__.py</code> file is required to make Python treat the <code>utils</code> folder as a package, which will allow us to import the functions from within our other files. You don&#8217;t need to write anything in this file, just create it and leave it empty.</p>
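


<p>For example, once the package exists (and we&#8217;ve written the modules below), any file in the project root can import our helpers like this:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># With utils/__init__.py in place, this works from any file in the project root:
from utils import podcast, subtitles</pre>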



<p>Our first utils file will deal with the podcast-related functions, so create a file called <code>podcast.py</code> in the <code>utils</code> folder:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />utils
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />__init__.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />podcast.py        (new file)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env</pre>



<p>Inside <code>podcast.py</code> get started with our imports:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import re
import uuid
from pathlib import Path

import requests
from decouple import config
from openai import OpenAI</pre>



<p>The <code>re</code> library deals with regular expressions and will help us find the podcast download link amongst the page text. The <code>uuid</code> library lets us generate unique IDs, <code>pathlib</code> is familiar to us by now, and <code>requests</code> will help us download the podcast mp3 file. <code>decouple</code> will help us read our API key from the <code>.env</code> file, and <code>openai</code> will help us use the OpenAI API. If you have not used <code>decouple</code> before, make sure you run the install command in your terminal:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pip install python-decouple</pre>



<p>Back in <code>podcast.py</code> let&#8217;s create a few constants that we&#8217;ll be using in our functions:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">GPT_MODEL = "gpt-3.5-turbo-1106"
CLIENT = OpenAI(api_key=str(config("OPENAI_API_KEY")))</pre>



<p>First, we set the ChatGPT model we&#8217;ll be using to request a podcast summary later on. Then we create a <code>CLIENT</code> object that we&#8217;ll use to make requests to the OpenAI API, using <code>config</code> to read the API key from the <code>.env</code> file. Note that <code>config("OPENAI_API_KEY")</code> already returns a string here; the surrounding <code>str()</code> call just makes the type explicit for type checkers. Calling <code>str()</code> on a value that is already a string returns it unchanged, so this adds no meaningful overhead.</p>
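


<p>In case <code>decouple</code> is new to you, here is a minimal sketch of how <code>config</code> behaves; the <code>DEBUG</code> variable is purely a hypothetical illustration of the optional <code>default</code> and <code>cast</code> arguments:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from decouple import config

api_key = config("OPENAI_API_KEY")  # read from .env (or the environment); raises if missing
debug_mode = config("DEBUG", default=False, cast=bool)  # hypothetical variable, showing default/cast</pre>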



<h2 class="wp-block-heading">Scraping the podcast download link from the podcast page</h2>



<p>So what are some of the functions we&#8217;ll need in here? For this example application I will be using <code>Google Podcasts</code> as our podcast source. This means we will get an input link like this:<br>https://podcasts.google.com/feed/aHR0cDovL2ZlZWRzLmZlZWRidXJuZXIuY29tL1RFRF9BaGFfQnVzaW5lc3M/episode/ZW4udmlkZW8udGFsay50ZWQuY29tOjExMTk3MDo4MA?sa=X&amp;ved=0CAgQuIEEahcKEwiIzMnavduDAxUAAAAAHQAAAAAQAQ</p>



<p>If you load this page in your browser, you will see an HTML page with a play button. This is the kind of page link the user will input into our app, so first of all we need a function to extract the <code>.mp3</code> download link from this page&#8217;s HTML.</p>



<p>Let&#8217;s get started on a function to do exactly that:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def scrape_link_from_page(page_url: str) -> str:
    podcast_page = requests.get(page_url).text
    regex = r"(?P&lt;url>\;https?://[^\s]+)"
    ...</pre>



<p>We start by defining our function which takes the <code>page_url</code> as a string and will return a string value as well. Then we use <code>requests</code> to get the HTML page text by sending a <code>GET</code> request to the URL, much like your internet browser would if you type a URL in the address bar. Now we define a regular expression that will match the pattern of the download link we want to extract. We&#8217;ll use this regex to find the download link in the HTML page text. Here&#8217;s how it works:</p>



<ul class="wp-block-list">
<li><code>(?P&lt;url&gt;...)</code> This is a named group: the matched text can later be retrieved under the name <code>url</code>. In other words, the URL pattern we find will be accessible via the group name <code>url</code>.</li>



<li><code>\;</code> This matches a literal semicolon character. The backslash is actually unnecessary here, since a semicolon has no special meaning in regular expressions, but it is harmless and the pattern still matches the literal <code>;</code>. We need it because there is a semicolon directly in front of the https URL we want to match. (This is just a characteristic of this particular podcast page; other pages might have different patterns.)</li>



<li><code>https?</code> This matches either http or https. The s? means &#8220;match zero or one s characters&#8221;. This allows the regex to match both http and https.</li>



<li><code>://</code> This matches the string ://, which is part of the standard format for URLs.</li>



<li><code>[^\s]+</code> This matches one or more (<code>+</code>) characters that are not (<code>^</code>) whitespace (<code>\s</code>), i.e. anything except spaces, tabs, and newlines. This consumes the rest of the URL and stops as soon as whitespace appears, which marks the end of the URL.</li>
</ul>



<p>So, in simple terms, this regular expression matches a semicolon followed by a URL that starts with either http or https, and continues until a whitespace character is encountered. The URL is captured in a group named url.</p>



<p>Now let&#8217;s complete our function:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def scrape_link_from_page(page_url: str) -> str:
    podcast_page = requests.get(page_url).text
    regex = r"(?P&lt;url>\;https?://[^\s]+)"
    podcast_url_dirty = re.findall(regex, podcast_page)[0]
    podcast_url = podcast_url_dirty.split(";")[1]
    return podcast_url</pre>



<p>After declaring the regex pattern, we use <code>re.findall</code> to find all matches of the pattern in the podcast page text. This returns a list of matches, and we take the first one with <code>[0]</code>, giving us a string that looks something like this:</p>



<p><code>;https://download.ted.com/talks/etcetcetc;</code></p>



<p>That&#8217;s pretty close; we just need to get rid of the <code>;</code> characters before and after the URL. We do this by splitting the string on the <code>;</code> character and taking the second item in the resulting list with <code>[1]</code>. This gives us the clean URL we need: https://download.ted.com/talks/etcetcetc</p>
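


<p>As a quick sanity check, here is the whole match-and-split dance on a made-up page snippet (the HTML string below is hypothetical, just to illustrate the steps):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import re

# Hypothetical fragment of the podcast page's HTML, purely for illustration:
page_text = 'data-url=";https://download.ted.com/talks/etcetcetc;" plus more page text'

regex = r"(?P&lt;url>\;https?://[^\s]+)"
podcast_url_dirty = re.findall(regex, page_text)[0]  # ';https://download.ted.com/talks/etcetcetc;"'
podcast_url = podcast_url_dirty.split(";")[1]        # 'https://download.ted.com/talks/etcetcetc'</pre>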



<h2 class="wp-block-heading">Downloading the podcast mp3 file</h2>



<p>Ok, so now our utils file has a function to scrape the download link. It stands to reason we&#8217;ll also need a function to download the mp3 file from the URL. Let&#8217;s get started on that:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def download(podcast_url: str, unique_id: uuid.UUID, output_dir: Path) -> Path:
    print("Downloading podcast...")
    podcast_audio = requests.get(podcast_url)
    save_location = output_dir / f"{unique_id}.mp3"
    ...</pre>



<p>We define a function called <code>download</code> that takes three input arguments. The <code>podcast_url</code> is the URL we scraped from the podcast page, as a string. The <code>unique_id</code> is a unique ID we&#8217;ll use to name the downloaded file so we can avoid name clashes between downloads. This argument should be an instance of the <code>UUID</code> class from Python&#8217;s built-in <code>uuid</code> library, which we&#8217;ll have a look at in a bit. The <code>output_dir</code> is the directory where we want to save the downloaded file, as a <code>Path</code> object. Finally, our function will also return a <code>Path</code> object: the path to the downloaded file.</p>



<p>We print a simple message to the console to show it is busy actually doing something, and then we use <code>requests</code> to download the podcast audio file by sending a <code>GET</code> request to the URL just like we did in the previous function. Then we create a <code>save_location</code> variable which is the path to the file we want to save. We use the <code>output_dir</code> argument as the parent directory, and then we use an f-string to create a filename that is the <code>unique_id</code> followed by the <code>.mp3</code> extension.</p>



<p>Now let&#8217;s complete our function:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def download(podcast_url: str, unique_id: uuid.UUID, output_dir: Path) -> Path:
    print("Downloading podcast...")
    podcast_audio = requests.get(podcast_url)
    save_location = output_dir / f"{unique_id}.mp3"

    with open(save_location, "wb") as file:
        file.write(podcast_audio.content)
    print("Podcast successfully downloaded!")

    return save_location</pre>



<p>We use the <code>open</code> function to open the <code>save_location</code> file in write binary (<code>wb</code>) mode, and we write the <code>podcast_audio.content</code> to the file. This will save the podcast audio file to the <code>save_location</code> path. Then we print a message to the console to show the download was successful, and we return the <code>save_location</code> path which points to the mp3 file we just downloaded, awesome!</p>
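


<p>To see how the two helpers fit together, a hypothetical end-to-end usage could look like this; the page link and output folder are placeholders:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import uuid
from pathlib import Path

page_link = "https://podcasts.google.com/feed/..."  # placeholder for a real podcast page link
podcast_download_url = scrape_link_from_page(page_link)
mp3_file = download(podcast_download_url, uuid.uuid4(), Path("output_temp_files"))
print(mp3_file)  # e.g. output_temp_files/0e0f5d05-9379-4124-a84d-81de7eb3e314.mp3</pre>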



<h2 class="wp-block-heading">Getting a summary</h2>



<p>Now there is one more function we need in our <code>utils/podcast.py</code> file. Besides the transcription itself, we will also provide the user with a summary of the podcast. We&#8217;ll use ChatGPT to generate this summary, so we need a simple function to do that. This one will be easy, so let&#8217;s just whip it up:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def get_summary(transcription: str) -> str:
    print("Summarizing podcast...")
    prompt = f"Summarize the following podcast into the most important points:\n\n{transcription}\n\nSummary:"

    response = CLIENT.chat.completions.create(
        model=GPT_MODEL, messages=[{"role": "user", "content": prompt}]
    )

    print("Podcast summarized!")
    summary = response.choices[0].message.content
    return summary if summary else "There was a problem generating the summary."</pre>



<p>I assume you&#8217;re familiar with ChatGPT (if not, check out my other courses on the Finxter Academy!). We just have a simple function that takes the full <code>transcription</code> as a string and will return a summary as a string. We have a console print message again just to keep ourselves posted that it is doing some work and then we have a simple ChatGPT prompt.</p>



<p>Note the prompt ends with <code>Summary:</code> to nudge the model into starting the summary right away without any awkward introduction text; this is just a neat little trick you can use. We then use our <code>CLIENT</code> object to call the <code>chat.completions.create</code> endpoint, passing in the <code>GPT_MODEL</code> and a list of messages; we simply pass in the prompt as a user message. We then extract the <code>summary</code> from <code>response.choices[0].message.content</code>. Just in case there was a problem and the summary is empty, we return a default message to inform the user.</p>



<h2 class="wp-block-heading">Subtitles</h2>



<p>Awesome! Our <code>podcast</code> utils are done now. Let&#8217;s move on to the <code>subtitles</code> utils. This one will be a much shorter file with a function that will allow us to output the transcription in subtitle format, with timestamps and everything. So go ahead and create a new file called <code>subtitles.py</code> in the <code>utils</code> folder:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />utils
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />__init__.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />podcast.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />subtitles.py      (new file)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env</pre>



<p>And inside <code>subtitles.py</code> get started with our imports:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from typing import Callable
from pathlib import Path</pre>



<p>Both of these imports will be used solely to indicate the type of our function arguments (type hinting). We&#8217;ll use <code>Callable</code> to indicate that a function is expected as an argument, and we&#8217;ll use <code>Path</code> to indicate that a <code>Path</code> object is expected as an argument. This just makes our code clearer to read and easier to understand. Now let&#8217;s write our function, whose purpose will be to take a transcription done by Whisper and then convert it to a valid subtitle file:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def write_to_file(whisper_output: dict, writer: Callable, output_path: Path) -> Path:
    """Takes the whisper output, a writer function, and an output path, and writes subtitles to disk in the specified format."""
    with open(output_path, "w", encoding="utf-8") as sub_file:
        writer.write_result(result=whisper_output, file=sub_file)
        print(f"Subtitles generated and saved to {output_path}")

    return output_path</pre>



<p>We take a <code>whisper_output</code> argument which is a dictionary containing the output Whisper gives us after we transcribe the podcast&#8217;s mp3 file. We also take a <code>writer</code> argument which is a function that will write the subtitles to disk, so we type-hint it with <code>Callable</code>. Finally, we take an <code>output_path</code> argument which is a <code>Path</code> object to the file we want to save the subtitles to. We then simply open the output path in write mode, calling the file <code>sub_file</code>. We then call the <code>writer.write_result</code> function, passing in the <code>whisper_output</code> and the location to save the subtitles to. Finally, we print a message to the console to show the subtitles were generated successfully, and we return the <code>output_path</code> which is the path to the subtitle file we just created.</p>



<p>Two important things to note here:</p>



<ul class="wp-block-list">
<li>When you open the subtitle file, make sure you use the <code>encoding="utf-8"</code> argument. For plain English characters this is not strictly necessary, so you might think you can skip it. However, the AI likes to use ♪ symbols when music starts playing to make the subtitles more interesting, and writing those will crash with a <code>UnicodeEncodeError</code> if your system&#8217;s default encoding cannot represent them. Specifying utf-8 encoding lets Python map and save these special characters.</li>



<li>You might be wondering what this magical <code>writer</code> function is. Whisper actually comes with some utility classes that write subtitles in correct formatting, like <code>SRT</code> or <code>VTT</code>. These utilities have a <code>.write_result</code> method, which is what we&#8217;re calling in the code above. So we&#8217;ll be able to pass in an SRT writer or a VTT writer depending on which subtitle format we want to save (see the short sketch after this list).</li>
</ul>
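


<p>As a quick standalone sketch, assuming <code>whisper_output</code> already holds a transcription result, using one of these writers directly would look roughly like this (we&#8217;ll wire this up properly in the main file shortly):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from whisper.utils import WriteSRT

writer = WriteSRT(output_dir="output_temp_files")  # the output directory is passed as a string
with open("output_temp_files/example.srt", "w", encoding="utf-8") as sub_file:
    writer.write_result(result=whisper_output, file=sub_file)</pre>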



<p>Ok, so that is all our utility functions done. Now let&#8217;s move on to the main code.</p>



<h2 class="wp-block-heading">Installing gradio</h2>



<p>Before we get started you&#8217;ll need to install <code>gradio</code>, so in your terminal window, run:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pip install gradio</pre>



<p>What is <code>gradio</code>? Gradio is a Python library that allows us to quickly create user-friendly interfaces for testing, demonstrating, and debugging machine learning models. We&#8217;ll use gradio to create a UI for our app with just a few lines of code, and it supports a wide range of input and output types like video, audio, and text. Using this super simple framework we can keep the focus on whisper and not on building a user interface. It&#8217;s pretty self-explanatory, so you&#8217;ll understand the idea as we just code along.</p>
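


<p>If you&#8217;ve never seen gradio before, here is the classic minimal example (hypothetical, not part of our app) just to show the pattern of wiring a Python function to a web UI:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import gradio as gr


def greet(name: str) -> str:
    return f"Hello, {name}!"


# One input textbox, one output textbox, served at a local URL:
gr.Interface(fn=greet, inputs="text", outputs="text").launch()</pre>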



<h2 class="wp-block-heading">Creating the main file</h2>



<p>Now let&#8217;s get started on our main code, where mostly we&#8217;ll just have to call our utility functions and tie it all together, plus create a quick gradio interface to make it user-friendly. Create a new file called <code>2_whisper_pods.py</code> in the root folder of your project:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />utils
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />__init__.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />podcast.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />subtitles.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />2_whisper_pods.py   (new file)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env</pre>



<p>And inside <code>2_whisper_pods.py</code> get started with our imports:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import uuid
from pathlib import Path

import gradio as gr
import whisper
from whisper.utils import WriteSRT, WriteVTT

from settings import BASE_DIR, OUTPUT_TEMP_DIR, STYLES_DIR
from utils import podcast, subtitles</pre>



<p><code>uuid</code> is Python&#8217;s built-in library for generating unique IDs, <code>pathlib</code> is familiar to us by now, and <code>gradio</code> is the library we just installed. We also import <code>whisper</code> and two writer utilities from <code>whisper.utils</code>, which are the writer classes we talked about in the previous section. Then we import our directory <code>Path</code> constants from <code>settings</code> and our <code>podcast</code> and <code>subtitles</code> utils. Now continue below the imports:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">WHISPER_MODEL = whisper.load_model("base")
VTT_WRITER = WriteVTT(output_dir=str(OUTPUT_TEMP_DIR))
SRT_WRITER = WriteSRT(output_dir=str(OUTPUT_TEMP_DIR))</pre>



<p>We load the <code>WHISPER_MODEL</code> from the <code>base</code> model, and we create two writer objects by creating instances of the <code>WriteVTT</code> and <code>WriteSRT</code> classes we imported from Whisper&#8217;s utilities, passing in the <code>output_dir</code> as a string.</p>



<p>Now let&#8217;s create a function to tie it all together:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def transcribe_and_summarize(page_link: str) -> tuple[str, str, str, str]:
    unique_id = uuid.uuid4()

    podcast_download_url = podcast.scrape_link_from_page(page_link)
    mp3_file: Path = podcast.download(podcast_download_url, unique_id, OUTPUT_TEMP_DIR)
    ...</pre>



<p>We define a function called <code>transcribe_and_summarize</code> which takes a <code>page_link</code> as a string and will return a tuple so we can have multiple outputs to this function. These four outputs will feed back into the gradio interface we will create later and will be:</p>



<ul class="wp-block-list">
<li>The podcast summary</li>



<li>The podcast transcription</li>



<li>The VTT subtitle file (path)</li>



<li>The SRT subtitle file (path)</li>
</ul>



<p>We then create a new <code>unique_id</code> which we&#8217;ll use to name the downloaded mp3 file. Note we do this inside the function as we need a unique identifier for every single transcription run to avoid name clashes. Then we use our <code>podcast.scrape_link_from_page</code> util to scrape the download link from the podcast page, and we use our <code>podcast.download</code> function to download the podcast mp3 file, passing in the <code>podcast_download_url</code>, <code>unique_id</code>, and the <code>OUTPUT_TEMP_DIR</code> as arguments. We then catch the mp3 file path in a variable called <code>mp3_file</code>. Notice how easy everything is to read because we used logical and descriptive names for all our variables and utility functions and files.</p>
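


<p>In case <code>uuid</code> is new to you: <code>uuid.uuid4()</code> generates a random 128-bit identifier, and collisions are astronomically unlikely, which is what makes it safe for file names:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import uuid

unique_id = uuid.uuid4()
print(unique_id)           # e.g. 0e0f5d05-9379-4124-a84d-81de7eb3e314
print(f"{unique_id}.mp3")  # the kind of file name we build from it</pre>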



<p>Let&#8217;s continue with our function:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def transcribe_and_summarize(page_link: str) -> tuple[str, str, str, str]:
    ...previous code...

    whisper_output = WHISPER_MODEL.transcribe(str(mp3_file))
    with open(BASE_DIR / "pods_log.txt", "w", encoding="utf-8") as f:
        f.write(str(whisper_output))

    transcription = str(whisper_output["text"])
    summary = podcast.get_summary(transcription)</pre>



<p>We call the <code>.transcribe</code> function, passing in the <code>mp3_file</code> path as a string. This returns a dictionary with the transcription and other information, which we catch in <code>whisper_output</code>. We then open a file called <code>pods_log.txt</code> in our root directory in write mode and write the <code>whisper_output</code> to it. This is just for debugging purposes, so we can see what the output looks like (it&#8217;s too long to print to the console). We then extract the <code>transcription</code> from the <code>whisper_output</code> dictionary. Note that <code>whisper_output["text"]</code> is already a string; we wrapped it in a <code>str()</code> call just to make the type explicit for typing purposes. This adds no extra overhead, as values that are already strings pass through <code>str()</code> unaltered. Then we call our <code>podcast.get_summary</code> function, passing in the <code>transcription</code> as an argument.</p>
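


<p>For reference, the dictionary Whisper returns contains more than just the text. Roughly sketched (values abbreviated, not actual output):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Rough shape of whisper_output (abbreviated):
# {
#     "text": " The full transcription as one string...",
#     "segments": [
#         {"id": 0, "start": 0.0, "end": 4.2, "text": " First few words...", ...},
#         ...
#     ],
#     "language": "en",
# }</pre>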



<p>Now we just need to write the subtitles to disk and return all the outputs. Continue on:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def transcribe_and_summarize(page_link: str) -> tuple[str, str, str, str]:
    ...previous code...

    get_sub_path = lambda ext: OUTPUT_TEMP_DIR / f"{unique_id}{ext}"
    vtt_subs = subtitles.write_to_file(whisper_output, VTT_WRITER, get_sub_path(".vtt"))
    srt_subs = subtitles.write_to_file(whisper_output, SRT_WRITER, get_sub_path(".srt"))

    return (summary, transcription, str(vtt_subs), str(srt_subs))</pre>



<p>We create a lambda (nameless) function that takes a file extension as input and returns the path to the subtitle file with that extension. For example, inputting <code>.vtt</code> will yield <code>output_temp_files/unique_id.vtt</code>, while <code>.srt</code> will yield <code>output_temp_files/unique_id.srt</code>; this just saves us from repeating the same code twice. Then we call our <code>subtitles.write_to_file</code> function twice, passing in the <code>whisper_output</code>, the <code>VTT_WRITER</code> or <code>SRT_WRITER</code> writer object, and the result of <code>get_sub_path</code> for the matching extension. We catch the output of these two calls in <code>vtt_subs</code> and <code>srt_subs</code> respectively. Finally, we return a tuple containing the <code>summary</code>, <code>transcription</code>, <code>vtt_subs</code>, and <code>srt_subs</code> to finish off our function.</p>



<p>The whole thing now looks like this:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def transcribe_and_summarize(page_link: str) -> tuple[str, str, str, str]:
    unique_id = uuid.uuid4()

    podcast_download_url = podcast.scrape_link_from_page(page_link)
    mp3_file: Path = podcast.download(podcast_download_url, unique_id, OUTPUT_TEMP_DIR)

    whisper_output = WHISPER_MODEL.transcribe(str(mp3_file))
    with open(BASE_DIR / "pods_log.txt", "w", encoding="utf-8") as f:
        f.write(str(whisper_output))

    transcription = str(whisper_output["text"])
    summary = podcast.get_summary(transcription)

    get_sub_path = lambda ext: OUTPUT_TEMP_DIR / f"{unique_id}{ext}"
    vtt_subs = subtitles.write_to_file(whisper_output, VTT_WRITER, get_sub_path(".vtt"))
    srt_subs = subtitles.write_to_file(whisper_output, SRT_WRITER, get_sub_path(".srt"))

    return (summary, transcription, str(vtt_subs), str(srt_subs))</pre>



<h2 class="wp-block-heading">Creating the gradio interface</h2>



<p>That&#8217;s all well and good, but a typical end user does not know how to use Python, and this function is not very user-friendly on its own. So let&#8217;s create a quick gradio interface to make our app easy to use. Continue below the function:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">if __name__ == "__main__":
    block = gr.Blocks(css=str(STYLES_DIR / "whisper_pods.css"))

    with block:
        with gr.Group():
            # Header

            # Input textbox for podcast link

            # Button to start transcription

            # Output elements

            # btn.click definition

    block.launch(debug=True)</pre>



<p>This is going to be the basic structure of our <code>gradio</code> application. First, we use <code>if __name__ == "__main__":</code> to make sure the code inside this block only runs if we run this file directly, and not if we import it from another file. Then we create a <code>block</code> object by calling <code>gr.Blocks</code> and passing in the path to our <code>whisper_pods.css</code> file in the <code>styles</code> directory as a string. This will allow us to style our app with CSS, which we&#8217;ll do in a bit (this .css file doesn&#8217;t exist yet). Then we open a <code>with block:</code> block, and inside this block we open a <code>with gr.Group():</code> block. This will allow us to group elements together in our app. Then we have a bunch of comments to indicate what we&#8217;ll be doing in each block, which we&#8217;ll fill in in a moment. Finally, we call <code>block.launch</code> to launch our app, passing in <code>debug=True</code> so we get extra feedback in the console if anything goes wrong.</p>



<ul class="wp-block-list">
<li>The header will hold a logo image for our application. We&#8217;ll use HTML to load it from the internet. We can call <code>gr.HTML</code> to create an HTML element, and we can pass in the HTML code as a string. We&#8217;ll use a <code>div</code> element with a <code>header</code> class, and inside this <code>div</code> we&#8217;ll have an <code>img</code> element with a link to our logo image, which I just quickly uploaded to &#8220;imgur&#8221;. We&#8217;ll also set the <code>referrerpolicy</code> to <code>no-referrer</code> to avoid any issues with the image not loading (imgur doesn&#8217;t work with a <code>localhost</code> referrer, which is what you&#8217;ll have when you run this app locally).</li>
</ul>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">gr.HTML(
    f"""
    &lt;div class="header">
    &lt;img src="https://i.imgur.com/8Xu2rwG.png" referrerpolicy="no-referrer" />
    &lt;/div>
    """
)</pre>



<ul class="wp-block-list">
<li>The input textbox will be where the user can paste in the podcast link. We can just call <code>gr.Textbox</code> to create a textbox element, and we can pass in a label to indicate what the textbox is for. We&#8217;ll call it &#8220;Google Podcasts Link&#8221; and we&#8217;ll catch the input in a variable called <code>podcast_link_input</code>.</li>
</ul>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">podcast_link_input = gr.Textbox(label="Google Podcasts Link:")</pre>



<ul class="wp-block-list">
<li>The button will be the trigger that starts the main function. I want a full row button so we&#8217;ll call <code>gr.Row</code> to create a row element, and then we&#8217;ll call <code>gr.Button</code> to create a button element. We can just pass in the button text we want to display and associate the button with the variable name <code>btn</code>. We&#8217;ll use this <code>btn</code> object later to define the button&#8217;s behavior.</li>
</ul>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">with gr.Row():
    btn = gr.Button("<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f399.png" alt="🎙" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Transcribe and summarize my podcast! <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f399.png" alt="🎙" class="wp-smiley" style="height: 1em; max-height: 1em;" />")</pre>



<ul class="wp-block-list">
<li>The output elements will be the summary, transcription, and two subtitle files. The first two are just a <code>gr.Textbox</code>, which does what you&#8217;d expect and allows us to pass in a label, placeholder, and the number of lines to display by default. The <code>autoscroll</code> behavior would scroll all the way down to the bottom when a large transcription text is placed in the box; since we want the user to start reading from the beginning instead of the end, we set this to <code>False</code>. We then have another <code>gr.Row</code> with two <code>gr.File</code> elements, which will end up side-by-side in a single row. The <code>label</code> is just a label, and <code>elem_classes</code> is a list of classes gradio will give the element, so we can target it with CSS later on using the names <code>vtt-sub-file</code> and <code>srt-sub-file</code>.</li>
</ul>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">summary_output = gr.Textbox(
    label="Podcast Summary",
    placeholder="Podcast Summary",
    lines=4,
    autoscroll=False,
)

transcription_output = gr.Textbox(
    label="Podcast Transcription",
    placeholder="Podcast Transcription",
    lines=8,
    autoscroll=False,
)

with gr.Row():
    vtt_sub_output = gr.File(
        label="VTT Subtitle file download", elem_classes=["vtt-sub-file"]
    )
    srt_sub_output = gr.File(
        label="SRT Subtitle file download", elem_classes=["srt-sub-file"]
    )</pre>



<ul class="wp-block-list">
<li>The <code>btn.click</code> is where we define which function to call when the button is clicked, so we give it our <code>transcribe_and_summarize</code> function as the first argument. The second argument is a list of inputs, in this case only our <code>podcast_link_input</code>. The third argument is a list of outputs, in this case, our <code>summary_output</code>, <code>transcription_output</code>, <code>vtt_sub_output</code>, and <code>srt_sub_output</code>. We&#8217;ll use these outputs to display the results of our function to the user. We just told gradio what function to run, and how to map all of the input and output elements we defined in the interface to the input and output arguments of our function!</li>
</ul>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">btn.click(
    transcribe_and_summarize,
    inputs=[podcast_link_input],
    outputs=[
        summary_output,
        transcription_output,
        vtt_sub_output,
        srt_sub_output,
    ],
)</pre>



<p><code>2_whisper_pods.py</code> now looks like this:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">imports

CONSTANTS


def transcribe_and_summarize(...)...
    ...


if __name__ == "__main__":
    block = gr.Blocks(css=str(STYLES_DIR / "whisper_pods.css"))

    with block:
        with gr.Group():
            gr.HTML(
                f"""
                &lt;div class="header">
                &lt;img src="https://i.imgur.com/8Xu2rwG.png" referrerpolicy="no-referrer" />
                &lt;/div>
                """
            )

            podcast_link_input = gr.Textbox(label="Google Podcasts Link:")

            with gr.Row():
                btn = gr.Button("<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f399.png" alt="🎙" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Transcribe and summarize my podcast! <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f399.png" alt="🎙" class="wp-smiley" style="height: 1em; max-height: 1em;" />")

            summary_output = gr.Textbox(
                label="Podcast Summary",
                placeholder="Podcast Summary",
                lines=4,
                autoscroll=False,
            )

            transcription_output = gr.Textbox(
                label="Podcast Transcription",
                placeholder="Podcast Transcription",
                lines=8,
                autoscroll=False,
            )

            with gr.Row():
                vtt_sub_output = gr.File(
                    label="VTT Subtitle file download", elem_classes=["vtt-sub-file"]
                )
                srt_sub_output = gr.File(
                    label="SRT Subtitle file download", elem_classes=["srt-sub-file"]
                )

            btn.click(
                transcribe_and_summarize,
                inputs=[podcast_link_input],
                outputs=[
                    summary_output,
                    transcription_output,
                    vtt_sub_output,
                    srt_sub_output,
                ],
            )

    block.launch(debug=True)</pre>



<h2 class="wp-block-heading">Creating the CSS file</h2>



<p>See how easy it was to write an interface using gradio! There is just one thing left to do: the <code>STYLES_DIR / "whisper_pods.css"</code> file we loaded into gradio doesn&#8217;t actually exist yet! Go ahead and create a new file in the <code>styles</code> directory called <code>whisper_pods.css</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_temp_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />output_video
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />styles
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />whisper_pods.css  (new file)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />utils
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />__init__.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />podcast.py
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />subtitles.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />2_whisper_pods.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />settings.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />.env</pre>



<p>Inside <code>whisper_pods.css</code> paste the following code:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">.header {
  padding: 2em 8em;
}

.vtt-sub-file,
.srt-sub-file {
  height: 80px;
}</pre>



<p>We set some padding on the header image by targeting the <code>header</code> class, to stop the image from getting too big. Then we set the height of the subtitle file download boxes to 80px, so they don&#8217;t get smaller than this, keeping them nice and visible.</p>



<p>Now go back to your <code>2_whisper_pods.py</code> file and run it. Give it some time to load up and you&#8217;ll see the following in your terminal:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.</pre>



<p>CTRL + click the link to open it in your browser. You should see the following:</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img decoding="async" src="https://academy.finxter.com/wp-content/uploads/2024/01/2_gradio_interface-1024x877.png" alt="" class="wp-image-4056"/></figure>
</div>


<p>Go ahead and get a Google Podcasts link to input. I&#8217;ll use a short podcast just for the initial test:<br>https://podcasts.google.com/feed/aHR0cDovL2ZlZWRzLmZlZWRidXJuZXIuY29tL1RFRF9BaGFfQnVzaW5lc3M/episode/ZW4udmlkZW8udGFsay50ZWQuY29tOjEwNzMyNDo4MA?sa=X&amp;ved=0CAgQuIEEahcKEwiImYLqr8qDAxUAAAAAHQAAAAAQAQ</p>



<p>And then click the button and wait (I&#8217;ve blurred out the transcription to respect the speaker&#8217;s copyright as this course will be published publicly):</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img decoding="async" src="https://academy.finxter.com/wp-content/uploads/2024/01/2_gradio_output-974x1024.png" alt="" class="wp-image-4055"/></figure>
</div>


<p>Check the summary, transcription, and subtitle files. Try other podcasts from https://podcasts.google.com/. Play around and have fun! My transcription was very good using just the <code>base</code> Whisper model we loaded up; I never even needed a bigger one. If you use non-English languages you may need a bigger model, though. You can also use a <code>.en</code> model like <code>base.en</code> or <code>small.en</code> to get higher accuracy if you will only input English podcasts.</p>
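


<p>Swapping models is a one-line change at the top of the file, for example:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">WHISPER_MODEL = whisper.load_model("base.en")  # English-only variant of "base"
# or, trading speed and memory for accuracy:
# WHISPER_MODEL = whisper.load_model("small.en")</pre>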



<p>Also take a look at the <code>pods_log.txt</code> file you wrote to the root directory of your project, which holds the full Whisper output. It can help you pinpoint where any problems are and how confident the model was while transcribing.</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p>There we go, that is a pretty good initial minimum viable product! Of course, it has much room for improvement, for instance by using a proper front-end framework like React and streaming the transcription live to the page so the user is not left waiting so long before seeing results.</p>



<p>You could also use asyncio to make the ChatGPT summary call asynchronous, speeding things up slightly by writing the subtitle files to disk while the summary call runs. And of course, you&#8217;d want some kind of cleanup function to get rid of all the downloaded mp3 files hanging around in your <code>output_temp_files</code> folder; if you check it, you will see all the files with names like <code>0e0f5d05-9379-4124-a84d-81de7eb3e314.mp3</code> we generated, plus the subtitle files with the same name for each mp3 file. A minimal sketch of such a cleanup follows below.</p>
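


<p>As one possible starting point, here is a minimal cleanup sketch, assuming you simply want to delete every generated file in the temp folder:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from settings import OUTPUT_TEMP_DIR


def cleanup_temp_files() -> None:
    """Delete the generated mp3 and subtitle files from the temp folder."""
    for file in OUTPUT_TEMP_DIR.iterdir():
        if file.suffix in {".mp3", ".vtt", ".srt"}:
            file.unlink()</pre>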



<p>I&#8217;ll leave the rest up to your imagination! That&#8217;s it for part 2, I&#8217;ll see you soon in part 3, where we&#8217;ll be using Whisper to create a fully automatic video subtitling tool that takes only a video file as input, then transcribes the audio, creates subtitles, and embeds them into the video at the correct times! It will be fun, see you there!</p>



<h2 class="wp-block-heading">Full Course: OpenAI Whisper &#8211; Building Cutting-Edge Python Apps with OpenAI Whisper</h2>



<p>Check out our full OpenAI Whisper course with video lessons, easy explanations, GitHub, and a downloadable PDF certificate to prove your speech processing skills to your employer and freelancing clients:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://academy.finxter.com/university/openai-whisper/"><img decoding="async" width="908" height="257" src="https://blog.finxter.com/wp-content/uploads/2024/01/image-154.png" alt="" class="wp-image-1654506" srcset="https://blog.finxter.com/wp-content/uploads/2024/01/image-154.png 908w, https://blog.finxter.com/wp-content/uploads/2024/01/image-154-300x85.png 300w, https://blog.finxter.com/wp-content/uploads/2024/01/image-154-768x217.png 768w" sizes="(max-width: 908px) 100vw, 908px" /></a></figure>
</div>


<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f449.png" alt="👉" class="wp-smiley" style="height: 1em; max-height: 1em;" /> [<strong>Academy</strong>] <a href="https://academy.finxter.com/university/openai-whisper/" data-type="link" data-id="https://academy.finxter.com/university/openai-whisper/">Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper</a></p>
<p>The post <a href="https://blog.finxter.com/openai-whisper-building-a-podcast-transcribing-app-in-python/">OpenAI Whisper &#8211; Building a Podcast Transcribing App in Python</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>OpenAI Whisper &#8211; Python Installation, Setup, &#038; First Steps to Speech-to-Text Synthesis</title>
		<link>https://blog.finxter.com/openai-whisper-python-installation-setup-first-steps-to-speech-to-text-synthesis/</link>
		
		<dc:creator><![CDATA[Dirk van Meerveld]]></dc:creator>
		<pubDate>Thu, 25 Jan 2024 19:55:30 +0000</pubDate>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Large Language Model (LLM)]]></category>
		<category><![CDATA[OpenAI]]></category>
		<category><![CDATA[Prompt Engineering]]></category>
		<category><![CDATA[Speech Recognition and Generation]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1654502</guid>

					<description><![CDATA[<p>🎙️ Course: This article is based on a lesson from our Finxter Academy Course Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper. Check it out for video lessons, GitHub, and a downloadable PDF course certificate with your name on it! Welcome to this first part of the Whisper course. My name is Dirk ... <a title="OpenAI Whisper &#8211; Python Installation, Setup, &#038; First Steps to Speech-to-Text Synthesis" class="read-more" href="https://blog.finxter.com/openai-whisper-python-installation-setup-first-steps-to-speech-to-text-synthesis/" aria-label="Read more about OpenAI Whisper &#8211; Python Installation, Setup, &#038; First Steps to Speech-to-Text Synthesis">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/openai-whisper-python-installation-setup-first-steps-to-speech-to-text-synthesis/">OpenAI Whisper &#8211; Python Installation, Setup, &#038; First Steps to Speech-to-Text Synthesis</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f399.png" alt="🎙" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Course</strong>: This article is based on a lesson from our <strong>Finxter Academy Course</strong> <a href="https://academy.finxter.com/university/openai-whisper/"><em>Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper</em></a>. Check it out for video lessons, GitHub, and a downloadable PDF course certificate with your name on it!</p>



<p>Welcome to this first part of the Whisper course. My name is Dirk van Meerveld and it is my pleasure to be your host and guide for this tutorial series where we will be looking at OpenAI&#8217;s amazing speech-to-text model called Whisper.</p>



<p>We&#8217;ll first take a look at what it is and how its basic usage works, and then we&#8217;ll explore ways in which we can practically use it in our projects. Along the way, we&#8217;ll learn about the balance between model size and accuracy, and in the final part, we&#8217;ll look at alternative options to speed it up or outsource the processing to OpenAI&#8217;s servers.</p>



<p>The local installation process should not be too much of a problem, but it differs a bit across operating systems and setups. Unfortunately, I cannot cover every single possible system configuration, so you may have to do some googling and trial and error along the way.</p>



<p>This is an inevitable part of software development. Don&#8217;t give up; you will get it working eventually. We all get stuck trying to make something work on our particular system sometimes; it&#8217;s just part of the job.</p>



<p>If you do not like a particular configuration, such as running the model locally, rest assured we will cover both the different ways to run Whisper and various implementation projects over the series. Just watch through the whole thing, then take whatever projects you like and combine them with whichever way of running Whisper you preferred.</p>



<h2 class="wp-block-heading">Installing Whisper</h2>



<p>First, we need to install Whisper. We&#8217;ll be using the pip package manager for this, so make sure you have it installed (as a Python user, you almost certainly do). In a terminal window, run the following command:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pip install -U openai-whisper</pre>



<p>The <code>-U</code> flag in the <code>pip install -U openai-whisper</code> command stands for <code>--upgrade</code>. It means that Whisper will either be installed or upgraded to the latest version if it is already installed.</p>



<p>The second thing we need to have installed is <code>ffmpeg</code>. What is <code>ffmpeg</code>? FFmpeg is a versatile multimedia framework that allows us to work with audio and video files. It supports a wide range of formats and is highly portable, running on pretty much any operating system.</p>



<p>The simplest way to install <code>ffmpeg</code> is to use a package manager. If you&#8217;re on Windows, you can use <a href="https://chocolatey.org/install">Chocolatey</a> to install <code>ffmpeg</code> by running the following command in a terminal window:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># on Windows / Chocolatey
choco install ffmpeg</pre>



<p>If you&#8217;re on MacOS using Homebrew, you can install <code>ffmpeg</code> by running the following command in a terminal window:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># on MacOS / Homebrew
brew install ffmpeg</pre>



<p>If you&#8217;re on Linux, you probably know what to do and don&#8217;t need instructions! On Debian or Ubuntu, for example:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># on Linux / apt
sudo apt update &amp;&amp; sudo apt install ffmpeg</pre>



<p>This may honestly be the most challenging part of the tutorial series. You may not run into any issues if your system is already set up well, or you may need to do quite a bit of googling and setup work to get everything up and running. It took me some messing around to get everything working properly on my system, and it&#8217;s unfortunately impossible to predict exactly what you will need to do to resolve any issues you may run into. Google is your friend! Remember, we&#8217;ll also cover the API in part 4 if you don&#8217;t want to run the model locally, but don&#8217;t just skip ahead, as you&#8217;d miss out on a lot of useful information.</p>
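


<p>Once both installs are done, a quick sanity check can save debugging time later. Here is a minimal sketch (just an optional check, not one of the project files) that verifies Python can see Whisper and that <code>ffmpeg</code> is on your PATH:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Minimal sanity check: can Python import Whisper, and is ffmpeg on the PATH?
import shutil

import whisper

print(whisper.available_models())  # should list tiny, base, small, medium, large, ...
print(shutil.which("ffmpeg") or "ffmpeg not found on PATH!")</pre>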



<h2 class="wp-block-heading">What is Whisper?</h2>



<p>Whisper is a speech-to-text model developed by OpenAI. What is really cool is that they released this model as open source to the public. It is a neural network that takes audio as input and outputs text. It was trained on a large dataset of audio and text pairs and has learned which text corresponds to which audio. What is exciting about the model is that it&#8217;s not just effective at transcribing high-quality &#8216;gold-standard&#8217; audio recorded on studio microphones, but is also very good at transcribing audio of considerably lower quality, or even imperfect pronunciation with a foreign accent. If you compare it with auto-generated subtitles from YouTube, for example, you will see that it really is a level apart.</p>



<p>Instead of diving deep into the model&#8217;s architecture and technical details that make it work behind the scenes, this course will focus on the practical application of what we can do with it and how to use it to make cool stuff.</p>



<h2 class="wp-block-heading">Model sizes</h2>



<p>There are different sizes available for the Whisper model. The smaller the model, the less processing power and VRAM it needs, and the faster it will run; this comes at the cost of lower accuracy. Conversely, the larger the model, the more processing power and VRAM it needs and the longer it will take to run, but the more accurate it will be and the better it will deal with foreign languages, noise, and poor audio quality.</p>



<figure class="wp-block-table"><table><thead><tr><th>Size</th><th>Parameters</th><th>English-only model</th><th>Multilingual model</th><th>Required VRAM</th><th>Relative Speed</th></tr></thead><tbody><tr><td>tiny</td><td>39M</td><td>tiny.en</td><td>tiny</td><td>~1GB</td><td>~32x</td></tr><tr><td>base</td><td>74M</td><td>base.en</td><td>base</td><td>~1GB</td><td>~16x</td></tr><tr><td>small</td><td>244M</td><td>small.en</td><td>small</td><td>~2GB</td><td>~6x</td></tr><tr><td>medium</td><td>769M</td><td>medium.en</td><td>medium</td><td>~5GB</td><td>~2x</td></tr><tr><td>large</td><td>1550M</td><td>N/A</td><td>large</td><td>~10GB</td><td>1x</td></tr></tbody></table></figure>



<p>As we can see in this table from the <a href="https://github.com/openai/whisper">Whisper GitHub</a>, we have 5 different model sizes in total. There are 4 sizes for the English-only model, namely <code>tiny.en</code>, <code>base.en</code>, <code>small.en</code>, and <code>medium.en</code>. It is highly recommended to use one of these when you know you&#8217;re going to be transcribing English, as these models specialize in English only and therefore give greater accuracy at a much smaller model size and run-time. This is also why there is no <code>large.en</code> model: <code>medium.en</code> is already large enough to match the accuracy of the <code>large</code> multilingual model.</p>



<p>For the multilingual models, we have the <code>tiny</code>, <code>base</code>, <code>small</code>, <code>medium</code>, and <code>large</code> sizes. Whisper was trained on a whopping 680,000 hours of audio data covering a total of 97 different languages, though performance does vary per language, as more obscure languages may not work quite as well. The larger the model size, the better it will deal with such languages, specific accents, and poor audio quality.</p>



<p>Now if you don&#8217;t have 10GB of VRAM, don&#8217;t worry: you can often get away with using the smaller-size models, as you will see. Later on, in the last part of the series, we&#8217;ll look at smaller &#8216;distilled&#8217; versions of the model that can help us optimize speed further, or at outsourcing the processing to the lightning-fast OpenAI servers. Just keep watching! That being said, I actually recommend you always use the smallest version that you can get away with for your specific task. There is simply no point in adding more cost and complexity to your apps; if you don&#8217;t need it, the extra model size will only slow down your application and raise its cost.</p>



<h2 class="wp-block-heading">Basic usage</h2>



<p>Now that we have Whisper, fire up your favorite code editor and let&#8217;s get started! I&#8217;ll be using VSCode, but you can use whatever IDE you like. Create a root folder for your project (I&#8217;ll call mine <code>FINX_WHISPER</code>), and then inside, make a new file called <code>1_basic_call_english_only.py</code>. (I&#8217;m using numbers for the file names so you can easily reference them later when you&#8217;re busy coding some cool new project, but this is obviously not a good general naming convention):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py</pre>



<p>Then open up the new Python file and start with the imports:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import whisper
from pathlib import Path</pre>



<p>The <code>whisper</code> import is obvious, and <code>pathlib</code> will help us get the path to the audio files we want to transcribe. This way, our Python file will be able to locate our audio files even if the terminal window is not currently in the same directory as the Python file. Now let&#8217;s declare some constants:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">MODEL = whisper.load_model("base.en")
AUDIO_DIR = Path(__file__).parent / "test_audio_files"</pre>



<p>First, we declare <code>MODEL</code> and load the <code>base.en</code> model. We start with the second-smallest English-only model and will scale up if and when we need to. Then we declare <code>AUDIO_DIR</code> and use <code>pathlib</code> to build the path. This works by first getting the path to the current file (<code>1_basic_call_english_only.py</code>) using <code>__file__</code>, then getting that file&#8217;s parent directory using <code>.parent</code>, and finally appending the <code>test_audio_files</code> folder using the <code>/</code> operator. This way we can easily access the audio files in the <code>test_audio_files</code> folder from our Python file.</p>



<p>Now let&#8217;s create the <code>test_audio_files</code> folder, as it doesn&#8217;t actually exist yet. Make sure you spell it correctly:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py</pre>



<p>Then go ahead and add the provided audio files to the folder. They should come together with this video tutorial, but if for any reason you cannot find them, go to the Finxter GitHub repository for this course, where you can find a copy at:</p>



<figure class="wp-block-embed"><div class="wp-block-embed__wrapper">
https://github.com/DirkMeer/finx_whisper
</div></figure>



<p>Download all the test files and put them in the folder (you can also add your own audio files if you want to, these are just provided for your convenience):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f50a.png" alt="🔊" class="wp-smiley" style="height: 1em; max-height: 1em;" />dutch_long_repeat_file.mp3
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f50a.png" alt="🔊" class="wp-smiley" style="height: 1em; max-height: 1em;" />dutch_the_netherlands.mp3
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f50a.png" alt="🔊" class="wp-smiley" style="height: 1em; max-height: 1em;" />high_quality.mp3
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f50a.png" alt="🔊" class="wp-smiley" style="height: 1em; max-height: 1em;" />low_quality.mp3
        <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f50a.png" alt="🔊" class="wp-smiley" style="height: 1em; max-height: 1em;" />terrible_quality.mp3
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py</pre>



<p>Ok, back to our <code>1_basic_call_english_only.py</code> file. Below the <code>MODEL</code> and <code>AUDIO_DIR</code> variables, let&#8217;s create a function that will transcribe the audio files for us:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def get_transcription(audio_file: str):
    result = MODEL.transcribe(audio_file)
    print(result)
    return result</pre>



<p>This function takes an audio file&#8217;s path as a string. We then call the <code>.transcribe()</code> method Whisper provides for us and pass in that path. Then we simply print and return the result for a basic test. Looks really simple, right?</p>



<p>First, let&#8217;s try and transcribe a high-quality English audio file, as a sort of best-case scenario:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">get_transcription(str(AUDIO_DIR / "high_quality.mp3"))</pre>



<p>Notice that the function we wrote above takes the path as a string. This is because Whisper requires the path to the audio file as a string. <code>AUDIO_DIR / "high_quality.mp3"</code> returns a <code>Path</code> object, so we use <code>str()</code> to convert it to a string, or else Whisper will crash.</p>
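


<p>To make the distinction concrete, here is a tiny sketch (assuming the <code>AUDIO_DIR</code> constant from above) showing both types side by side:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Path object vs. the plain string Whisper expects
p = AUDIO_DIR / "high_quality.mp3"
print(type(p))  # pathlib.PosixPath (or WindowsPath, depending on your OS)
print(str(p))   # the plain string path we pass to .transcribe()</pre>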



<h2 class="wp-block-heading">Getting a transcription</h2>



<p>So go ahead and save and run the file, and you will see a large object containing all the output. Let&#8217;s take a quick look at the information available to us here, read the comments for an explanation:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">{
    # First we get the full transcription
    "text": " Hi guys, this is just a quick test audio file for you. Let's see how well it does and if my speech is recognized and converted to text properly. I'm really excited to see how well this works and I hope that it will be a good test for you guys to see how well the whisper model works.",
    # Now we have the list of segments
    "segments": [
        {
            "id": 0,
            "seek": 0,
            # Start and end times in seconds
            "start": 0.0,
            "end": 3.52,
            "text": " Hi guys, this is just a quick test audio file for you.",
            # list of tokenized words from the transcription, where each word is represented by a unique number
            "tokens": [ 50363, 15902, 3730, 11, 428, 318, 655, 257, 2068, 1332, 6597, 2393, 329, 345, 13, 50539 ],
            "temperature": 0.0,
            # In the context of machine learning, temperature is a parameter that controls the randomness of predictions. A temperature of 0.0 suggests no randomness, or the model always selecting the tokens(words) with the highest probability (This is similar to the ChatGPT API temperature setting). You can pass a temperature value to the transcribe function when calling it if you want to introduce more randomness into your generations.
            # For instance: model.transcribe(audio_file, temperature=0.2)
            "avg_logprob": -0.1399546700554925,
            # The average log probability of the tokens in the segment. The closer to 0 the better, which means if the numbers get more negative, like -0.2 for instance, it means it's much less confident in it's transcription (and there are probably more errors).
            "compression_ratio": 1.5898876404494382,
            "no_speech_prob": 0.0045762090012431145,
            # Represents the probability that the segment contains no speech. We can see that it is very low.
        },
        {
            '... more segments with the same structure as above, cut for brevity ...'
        },
    ],
    "language": "en",
}</pre>



<p>As we can see, we really get a lot of information back from the model! Most interesting is of course the transcription itself. Notice that it is a perfect word-for-word transcription even though we used <code>base.en</code>, the second-smallest English-only model. Very impressive for such a small version of the model! Now let&#8217;s try a lower-quality audio file:</p>



<p>Replace the last call:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">get_transcription(str(AUDIO_DIR / "high_quality.mp3"))</pre>



<p>with:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">get_transcription(str(AUDIO_DIR / "low_quality.mp3"))</pre>



<p>And when we run this with the considerably lower-quality audio file, still on the <code>base.en</code> model, I still get a perfect transcription. If we look closely at the output object, though, we can clearly see the <code>avg_logprob</code> (explained above) has moved further away from 0, from <code>-0.1399546700554925</code> to <code>-0.2179246875974867</code>, indicating the model is now much less confident in its transcription (though still correct).</p>
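


<p>If you want to compare this confidence signal across files yourself, here is a small sketch (reusing the <code>MODEL</code> and <code>AUDIO_DIR</code> constants from this file) that averages <code>avg_logprob</code> over all segments of each file:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Compare Whisper's confidence per file (avg_logprob: closer to 0 is better)
for name in ("high_quality.mp3", "low_quality.mp3"):
    result = MODEL.transcribe(str(AUDIO_DIR / name))
    probs = [segment["avg_logprob"] for segment in result["segments"]]
    print(f"{name}: {sum(probs) / len(probs):.4f}")</pre>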



<p>Now let&#8217;s try a really poor-quality audio file:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">get_transcription(str(AUDIO_DIR / "terrible_quality.mp3"))</pre>



<p>And if we run this we can see that it is still half correct even though a human would have trouble understanding it:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">Hi guys. This is just a quick test audio file for you. Let's see how well it does and if my speech is recognized, thank you for the context properly. I'm really excited to see how well this works and I hope that it will be a quick test for you guys to see how well the whisper model works.</pre>



<p>We have clearly reached the limits of the base model here as part of this is incorrect, and it&#8217;s time to step up to a bigger model size. (Remember, you generally want to use the smallest model you can get away with for your use case!)</p>



<p>I&#8217;m going to change the model to <code>small.en</code> by editing the <code>MODEL</code> variable at the top of our file:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">MODEL = whisper.load_model("small.en")</pre>



<p>Now if we run it again:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">Hi guys, this is just a quick test audio file for you. Let's see how well it does, and if my speech is recognized and converted to text properly, I'm really excited to see how well this works, and I hope that it will be a good test for you guys to see how well the Whisper model works.</pre>



<p>There is an awkward, super-long sentence with a few too many commas, but apart from that it&#8217;s perfect, even though the audio quality of this file is pretty terrible. Switching to <code>medium.en</code> fixes that last small imperfection with the commas, by the way. This is the power of Whisper!</p>



<h2 class="wp-block-heading">Taking a deeper look</h2>



<p>Now let&#8217;s take a slightly deeper look at what is happening inside Whisper, and at using other languages and even translation along the way. Make a new file in your root folder called <code>1_multiple_languages.py</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />FINX_WHISPER (project root folder)
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c1.png" alt="📁" class="wp-smiley" style="height: 1em; max-height: 1em;" />test_audio_files
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_basic_call_english_only.py
    <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4c4.png" alt="📄" class="wp-smiley" style="height: 1em; max-height: 1em;" />1_multiple_languages.py</pre>



<p>Then open up the new <code>1_multiple_languages.py</code> file and start with the imports:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import whisper
from pathlib import Path

AUDIO_DIR = Path(__file__).parent / "test_audio_files"
model = whisper.load_model("base")</pre>



<p>Make sure to use the <code>base</code> model this time, and not the <code>base.en</code> model, as we want to use all available languages.</p>



<p>First, we&#8217;ll take a slightly deeper look under the hood to get a rough idea of what is going on, as this will help us understand some important nuances. After that, we&#8217;ll greatly simplify the whole thing using the higher-level code again. Let&#8217;s write a function that detects the language and transcribes a file for us, and we&#8217;ll explain it line by line.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def detect_language_and_transcribe(audio_file: str):
    audio = whisper.load_audio(audio_file)</pre>



<p>We define a function which takes the path to an <code>audio_file</code> as a string argument. We then call Whisper&#8217;s <code>.load_audio()</code> method and pass in the audio file&#8217;s path. This returns a NumPy array containing the audio waveform in float32 datatype; in other words, an array containing the audio data as a giant list of numbers.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">    audio = whisper.pad_or_trim(audio)</pre>



<p>Next, we get a 30-second sample, either padding with silence if the file is shorter than 30 seconds or trimming it if it is longer. This is because the Whisper model is built and trained to take 30 seconds of audio as its input data each time. This doesn&#8217;t mean you cannot transcribe longer files but does have some implications we&#8217;ll get back to later.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">    mel = whisper.log_mel_spectrogram(audio).to(model.device)</pre>



<p>Make a log-Mel spectrogram and move it to the same device as the model (e.g. your GPU). A log-Mel spectrogram is a representation of a sound or audio signal that has been transformed to highlight certain perceptual characteristics.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f468-200d-1f3eb.png" alt="👨‍🏫" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Spectrogram: A spectrogram is a visual representation of the spectrum of frequencies in a sound or other signal as they vary with time. It's essentially a heat map where x is time, the y-axis is frequency, and the color represents the loudness.

<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f468-200d-1f3eb.png" alt="👨‍🏫" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Mel Scale: The Mel scale is a perceptual scale of pitches that emulates the human ear's response to different frequencies. We humans are much better at distinguishing small changes in pitch at low frequencies than at high frequencies. The Mel scale makes the representation match more closely with human perception as opposed to the exact mathematical frequencies.

<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f468-200d-1f3eb.png" alt="👨‍🏫" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Logarithmic Scale: Taking the logarithm of the spectrogram values is another step to make the representation more closely match human perception. We perceive loudness on a logarithmic scale (which is why we use decibels, a logarithmic measurement, to express the loudness of sound).

<img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f468-200d-1f3eb.png" alt="👨‍🏫" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Combining these, a log-Mel spectrogram is a representation of sound that is designed to highlight the aspects that are most important for human perception. It's commonly used in audio processing tasks, including speech and music recognition.</pre>



<p>Now that we have this log-Mel spectrogram, we can use it to detect the language of our audio file. We do this by passing it to the <code>.detect_language()</code> method of our model:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">    language_token, language_probs = model.detect_language(mel)</pre>



<p>This returns the <code>language_token</code>, which is a number we will not be using, and <code>language_probs</code>, a huge dictionary mapping each candidate language to the probability that it matches the sound file. As we won&#8217;t actually be using the <code>language_token</code> variable, we can replace it with a <code>_</code> to indicate as much. This makes it a sort of throwaway variable that we don&#8217;t care about.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">    _, language_probs = model.detect_language(mel)</pre>



<p>Let&#8217;s take what we have so far, add a print statement to check out the <code>language_probs</code>, and run it using the <code>dutch_the_netherlands.mp3</code> file I prepared for you:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def detect_language_and_transcribe(audio_file: str):
    audio = whisper.load_audio(audio_file)
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    _, language_probs = model.detect_language(mel)
    print(language_probs)

detect_language_and_transcribe(str(AUDIO_DIR / "dutch_the_netherlands.mp3"))</pre>



<p>Now when we run this we can see the massive <code>language_probs</code> list printed to our console:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">{
    '.. cut for brevity ..'
    "yi": 2.012418735830579e-05,
    "ka": 2.161949907986127e-07,
    "nl": 0.9650669693946838,
    "en": 0.010499916970729828,
    "ko": 9.358442184748128e-05,
    "mn": 5.96029394728248e-06,
    "de": 0.010318436659872532,
    '.. cut for brevity ..'
}</pre>



<p>We have a huge dictionary of numbers here, as you can see. The higher the number, the more likely the language; many of the probabilities are on the order of 10 to the power of <code>-4</code>, <code>-5</code>, <code>-6</code>, or even lower. We can clearly see that <code>nl</code> (Dutch) is by far the highest probability, close to a perfect 1 score at <code>0.965</code>. The second and third highest are <code>en</code> (English) and <code>de</code> (German) with <code>0.010</code> each, which is not even close, so we can be very confident that this is Dutch. Impressive for a model as small as <code>base</code> that deals with so many languages, especially since Dutch is not that big a language.</p>



<p>Of course, we don&#8217;t want this whole dictionary; we just want the most probable language, so we can use the <code>max</code> function to find it.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def detect_language_and_transcribe(audio_file: str):
    ...
    language: str = max(language_probs, key=language_probs.get)
    print(f"Detected language: {language}")</pre>



<p><code>max</code> returns the key of the largest value in the dictionary. We pass in the dictionary as the first argument, which makes <code>max</code> iterate over its keys. The <code>key</code> argument is a function that is called on each of those keys, and the key for which the function returns the largest value is the result of the <code>max</code> call. We can simply use the dictionary&#8217;s own <code>.get()</code> method as that function to look up the probability of each language.</p>
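


<p>Here is the same trick in isolation on a toy dictionary (hypothetical values, purely for illustration):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># max() iterates over the keys; key=scores.get ranks them by their values
scores = {"nl": 0.965, "en": 0.010, "de": 0.010}
print(max(scores, key=scores.get))  # nl</pre>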



<p>The language name codes are in ISO 639-1 format and can be found <a href="https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes">here</a>. We add a print statement to print the detected language. I removed the previous print statement <code>print(language_probs)</code> we added before.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def detect_language_and_transcribe(audio_file: str):
    ...
    language: str = max(language_probs, key=language_probs.get)
    print(f"Detected language: {language}")
    options = whisper.DecodingOptions(language=language, task="transcribe")
    result = whisper.decode(model, mel, options)
    print(result)
    return result.text</pre>



<p>Now we&#8217;ll decode this 30-second audio file into text. First, we create a <code>DecodingOptions</code> object and save it in the variable named options. The <code>DecodingOptions</code> object lets you set more advanced decoding options, but we&#8217;ll stick to basics for now, passing in the <code>language</code> we detected and the task of &#8220;transcribe&#8221;. We then call the <code>whisper.decode</code> function which performs decoding of the 30-second audio segment(s), provided as log-Mel spectrogram(s). We pass in the model, the mel spectrogram, and the options. This returns a <code>DecodingResult</code> object which we save in the variable named <code>result</code>. We then print the <code>result</code> and return the <code>result.text</code>.</p>



<p>The whole function now looks like this:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def detect_language_and_transcribe(audio_file: str):
    audio = whisper.load_audio(audio_file)
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    _, language_probs = model.detect_language(mel)
    language: str = max(language_probs, key=language_probs.get)
    print(f"Detected language: {language}")
    options = whisper.DecodingOptions(language=language, task="transcribe")
    result = whisper.decode(model, mel, options)
    print(result)
    return result.text</pre>



<p>Now let&#8217;s run it with the <code>dutch_the_netherlands.mp3</code> file again:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">dutch_test = detect_language_and_transcribe(
    str(AUDIO_DIR / "dutch_the_netherlands.mp3")
)</pre>



<p>When you run this the object printed to the console will have the following transcription:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">'Hoi, allemaal. Dit is weer een testbestandje. Deze keer om te testen of de Nederlandse taal goed herkend gaat worden. Hierna kunnen we ook proberen deze text te laten vertalen naar het Engels om te zien hoe goed dat gaat. Ik ben benieuwd.'</pre>



<p>There we go, a perfect transcription! Now you probably don&#8217;t speak Dutch, but the above is a perfect word-for-word transcription of the spoken text.</p>



<h2 class="wp-block-heading">Back to .transcribe</h2>



<p>Now I&#8217;ll be honest: that was a little overcomplicated if we don&#8217;t need much customization and just want to call the model. Also, we don&#8217;t want to limit ourselves to just 30 seconds of audio. Let&#8217;s go back to Whisper&#8217;s higher-level <code>.transcribe</code> function, which basically does all of the above for us.</p>



<p>Make sure you comment out the <code>dutch_test</code> code so it doesn&#8217;t keep running:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># dutch_test = detect_language_and_transcribe(
#     str(AUDIO_DIR / "dutch_the_netherlands.mp3")
# )</pre>



<p>Now all we need to do to use <code>.transcribe</code> is load a model (<code>model = whisper.load_model("base")</code>) which we already did in this file, and then call the <code>.transcribe</code> method on the model and pass in the path to the audio file as a string:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">result = model.transcribe(str(AUDIO_DIR / "dutch_the_netherlands.mp3"), verbose=True)
print(result["text"])</pre>



<p>It also has some options; in this case, we&#8217;ve set <code>verbose</code> to <code>True</code> so it will give us extra information in the console. If you go ahead and run this, you will get the exact same transcription in the output as we did above:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">'Hoi, allemaal. Dit is weer een testbestandje. Deze keer om te testen of de Nederlandse taal goed herkend gaat worden. Hierna kunnen we ook proberen deze text te laten vertalen naar het Engels om te zien hoe goed dat gaat. Ik ben benieuwd.'</pre>



<p>Again, you probably don&#8217;t speak Dutch, but that&#8217;s not the point. Under the hood, the <code>.transcribe</code> function reads the entire audio file and processes it in 30-second windows. You can also see that it did the language detection for us automatically before starting:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: Dutch</pre>



<h2 class="wp-block-heading">Working with longer files</h2>



<p>So that&#8217;s pretty good, right? Well, let&#8217;s try a longer audio file and see what happens. I&#8217;ve provided <code>dutch_long_repeat_file.mp3</code>, which is the same audio file repeated 3 times, totaling just over 40 seconds. Let&#8217;s see what happens when we try to transcribe it (make sure you comment out the run above):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># result = model.transcribe(str(AUDIO_DIR / "dutch_the_netherlands.mp3"), verbose=True)
# print(result["text"])


result = model.transcribe(
    str(AUDIO_DIR / "dutch_long_repeat_file.mp3"),
    verbose=True,
    language="nl",
    task="transcribe",
)
print(result["text"])</pre>



<p>Note that we can pass in the language if we already know it, skipping the detection step and saving some time. So for applications where you always know the language ahead of time, just pass it in to optimize your application. We pass in <code>nl</code>, the ISO 639-1 code for Dutch.</p>



<p>Now let&#8217;s run this and check the output (yours will look different from mine):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">Hoi j allemaal! Dit is weer een testbestandje! Deze keer om te testen of de Nederlandse taal goed herkent gaat worden. Je en bırak�� collecte geval. Je gievous raakt deze tekst te laten vertalen naar het Engels om te zien hoe goed dat gaat. Ik ben benieuwd! Hoi jlynn allemaal! Dit is weer een testbestandje. Deze keer om te testen of de Nederlandse taal goed herkent gaat worden. Je en driesbredmontie kunt wiring die text er metυτ�� mesma halen te laten vertalen naar het Engels om te zien hoe goed dat gaat! Ik ben benieuwd. Hoi allemaal! Dit is weer een testbestandje. Deze keer om te testen of de Nederlandse taal goed herkend gaat worden. Hierna kunnen we ook proberen deze tekst te laten vertalen naar het Engels om te zien hoe goed dat gaat. Ik ben benieuwd.</pre>



<p>Now I&#8217;m not going to make you read this, but as a Dutch person, I will tell you this output is terrible, and there are several characters and many words here that do not even exist in the Dutch language! So what happened? It&#8217;s the same model, and the audio is exactly the same as before; it&#8217;s just a bit longer and repeats itself. We should have gotten the same output, right?</p>



<p>Well, it is because Whisper&#8217;s machine-learning model is limited to audio segments of only 30 seconds as its input, which makes longer audio files more challenging to transcribe. The <code>.transcribe</code> function took care of cutting the audio into 30-second segments, feeding them through, and sort of stitching them back together, making our life a lot easier, so we didn&#8217;t really notice this extra challenge.</p>



<p>While Whisper does use some clever tricks to improve the quality when transcribing longer audio files that need to be cut into 30-second pieces and put back together again, this is inherently just a bit trickier, so we saw a significant drop in transcription quality even though the audio was exactly the same as before (just repeated 3 times in a row to make it longer).</p>
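


<p>To make that windowing concrete, here is a deliberately naive conceptual sketch of fixed 30-second chunking, reusing the lower-level calls from earlier and the library&#8217;s 16 kHz sample-rate constant. This is a simplification: the real <code>.transcribe</code> implementation is smarter and slides the window based on the timestamps it predicts.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Conceptual sketch only: naive fixed 30-second windowing over a long file
CHUNK = 30 * whisper.audio.SAMPLE_RATE  # 30 s of samples at 16 kHz

audio = whisper.load_audio(str(AUDIO_DIR / "dutch_long_repeat_file.mp3"))
texts = []
for start in range(0, len(audio), CHUNK):
    segment = whisper.pad_or_trim(audio[start : start + CHUNK])
    mel = whisper.log_mel_spectrogram(segment).to(model.device)
    options = whisper.DecodingOptions(language="nl", task="transcribe")
    texts.append(whisper.decode(model, mel, options).text)

print(" ".join(texts))</pre>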



<p>Does this mean Whisper is only good for small files? Not at all! All we need to solve this bigger challenge of a minor language (Dutch) combined with files longer than 30 seconds is to just step up to a bigger model!</p>



<p>When changing the model to <code>small</code> instead of <code>base</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">model = whisper.load_model("small")</pre>



<p>I got an almost perfect output with only a single very minor spelling mistake. When I changed to <code>medium</code> afterward it was absolutely perfect. It&#8217;s just a matter of using a bigger model until it works. Pick the model size that corresponds to the size of your challenge.</p>



<h2 class="wp-block-heading">Translating</h2>



<p>Besides just transcribing, as if that wasn&#8217;t awesome enough, Whisper can also translate pretty much all major languages to English. (If you get very hacky it can even translate English to other languages, but that is not an intended or supported feature).</p>



<p>So now let&#8217;s give it an audio file in a non-English language and then ask it for an English translation. We&#8217;ll feed it the <code>dutch_the_netherlands.mp3</code> file again, but this time ask it for a translation (to English) so you can finally find out what I said in the audio!</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">result = model.transcribe(
    str(AUDIO_DIR / "dutch_the_netherlands.mp3"),
    verbose=True,
    language="nl",
    task="translate",
)
print(result["text"])</pre>



<p>Make sure you comment out any calls above so you don&#8217;t run them by accident. I&#8217;ve already tested this, and you&#8217;ll need at least the <code>medium</code> model size to get a good translation, so make sure you load that BEFORE the call above (if your computer can handle it; otherwise just try a smaller one).</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">model = whisper.load_model("medium")</pre>



<p>The output is:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">Hey everyone, this is a test file again. This time to test whether the Dutch language will be recognized well. After this, we can also try to translate this text into English to see how well that goes. I'm curious.</pre>



<p>It&#8217;s really quite a decent translation, straight from spoken audio. That is very impressive. It also handles sloppy pronunciation quite well &#8211; I tested this using my Korean pronunciation, which is not great, and the results were still pretty good.</p>



<p>So different languages, longer files, or slightly less native pronunciation all benefit a lot from moving to larger versions of the model (as long as you have the VRAM for it). I&#8217;ll be sticking to the lower end of the model spectrum for this series as much as possible, as not everyone will have the GPU to run the larger models, but feel free to use a larger one if you have the VRAM for it.</p>



<p>On the flip side, if you can only run the <code>small</code> or even the <code>base</code> models, do not despair! The next two tutorials will achieve very good accuracy on these smaller models, and again, in the last part, we&#8217;ll look at speeding up, optimizing, or outsourcing the processing altogether.</p>



<p>Now that we&#8217;ve got the more boring basics out of the way, it&#8217;s time to build some cool and fun stuff and look at practical applications and integration in the next couple of parts! See you there!</p>



<h2 class="wp-block-heading">Full Course: OpenAI Whisper &#8211; Building Cutting-Edge Python Apps with OpenAI Whisper</h2>



<p>Check out our full OpenAI Whisper course with video lessons, easy explanations, GitHub, and a downloadable PDF certificate to prove your speech processing skills to your employer and freelancing clients:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://academy.finxter.com/university/openai-whisper/"><img loading="lazy" decoding="async" width="908" height="257" src="https://blog.finxter.com/wp-content/uploads/2024/01/image-154.png" alt="" class="wp-image-1654506" srcset="https://blog.finxter.com/wp-content/uploads/2024/01/image-154.png 908w, https://blog.finxter.com/wp-content/uploads/2024/01/image-154-300x85.png 300w, https://blog.finxter.com/wp-content/uploads/2024/01/image-154-768x217.png 768w" sizes="auto, (max-width: 908px) 100vw, 908px" /></a></figure>
</div>


<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f449.png" alt="👉" class="wp-smiley" style="height: 1em; max-height: 1em;" /> [<strong>Academy</strong>] <a href="https://academy.finxter.com/university/openai-whisper/" data-type="link" data-id="https://academy.finxter.com/university/openai-whisper/">Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper</a></p>



<p>The post <a href="https://blog.finxter.com/openai-whisper-python-installation-setup-first-steps-to-speech-to-text-synthesis/">OpenAI Whisper &#8211; Python Installation, Setup, &#038; First Steps to Speech-to-Text Synthesis</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>OpenAI Text-to-Speech (TTS) &#8211; I Tried the Top Ten Languages with 10 Amazing Speech Samples</title>
		<link>https://blog.finxter.com/openai-text-to-speech-tts-i-tried-the-top-ten-languages-10-speech-samples/</link>
		
		<dc:creator><![CDATA[Chris]]></dc:creator>
		<pubDate>Tue, 14 Nov 2023 10:18:00 +0000</pubDate>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Large Language Model (LLM)]]></category>
		<category><![CDATA[OpenAI]]></category>
		<category><![CDATA[Prompt Engineering]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Speech Recognition and Generation]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1652983</guid>

					<description><![CDATA[<p>Let&#8217;s check out OpenAI&#8217;s fantastic Text-to-Speech (TTS) technology. I was blown away when I first heard these voices; they sound so incredibly human, it&#8217;s almost hard to believe! It&#8217;s like having a friendly chat in different languages, all thanks to OpenAI&#8217;s amazing speech-generation skills in the world&#8217;s top ten languages. I used the following code: ... <a title="OpenAI Text-to-Speech (TTS) &#8211; I Tried the Top Ten Languages with 10 Amazing Speech Samples" class="read-more" href="https://blog.finxter.com/openai-text-to-speech-tts-i-tried-the-top-ten-languages-10-speech-samples/" aria-label="Read more about OpenAI Text-to-Speech (TTS) &#8211; I Tried the Top Ten Languages with 10 Amazing Speech Samples">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/openai-text-to-speech-tts-i-tried-the-top-ten-languages-10-speech-samples/">OpenAI Text-to-Speech (TTS) &#8211; I Tried the Top Ten Languages with 10 Amazing Speech Samples</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Let&#8217;s check out OpenAI&#8217;s fantastic Text-to-Speech (TTS) technology. I was blown away when I first heard these voices; they sound so incredibly human, it&#8217;s almost hard to believe! </p>



<p>It&#8217;s like having a friendly chat in different languages, all thanks to OpenAI&#8217;s amazing speech-generation skills in the world&#8217;s top ten languages.</p>



<p>I used the following code:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import openai

your_openai_key = 'sk-...'
d = {
    'English': 'Finxter helps you stay on the right side of change!',
    'Mandarin Chinese (Simplified)': '...',
    'Hindi': '...',
    'Spanish': '¡Finxter te ayuda a mantenerte del lado correcto del cambio!',
    'French': 'Finxter vous aide à rester du bon côté du changement !',
    'Arabic': '...',
    'Bengali': '...',
    'Russian': 'Финкстер помогает вам оставаться на правильной стороне изменений!',
    'Portuguese': 'Finxter ajuda você a permanecer no lado certo da mudança!',
    'Indonesian': 'Finxter membantu Anda tetap di sisi yang benar dari perubahan!'
}


client = openai.OpenAI(api_key=your_openai_key)
voices = ['alloy', 'echo', 'fable', 'onyx', 'nova', 'shimmer']  # all available voices; we use 'onyx' below

for language in d:
    response = client.audio.speech.create(
        model="tts-1",
        voice='onyx',
        input=d[language]
    )

    response.stream_to_file(f'{language}.mp3')
</pre>



<p>This code snippet uses <a href="https://blog.finxter.com/openai-text-to-speech-tts-minimal-example-in-python/" data-type="post" data-id="1652777">OpenAI&#8217;s Text-to-Speech (TTS)</a> capabilities through the <a href="https://blog.finxter.com/openai-python-api-a-helpful-illustrated-guide-in-5-steps/" data-type="post" data-id="1487700">OpenAI Python</a> library. It begins by importing the OpenAI module and setting up an API key. You should have installed OpenAI:</p>



<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f9d1-200d-1f4bb.png" alt="🧑‍💻" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/how-to-install-openai-in-python/" data-type="post" data-id="1170845" target="_blank" rel="noreferrer noopener">How to Install OpenAI in Python?</a></p>



<p>A dictionary <code>d</code> is defined, containing sentences in various languages, each associated with a language key. I used the world&#8217;s 10 most spoken languages but for formatting reasons skipped some translations &#8212; my blog software cannot display the Unicode symbols.</p>



<p>The code then initializes an OpenAI client with the specified API key. It iterates over the languages in the <a href="https://blog.finxter.com/python-create-dictionary-the-ultimate-guide/" data-type="post" data-id="1651200">dictionary</a> <code>d</code>, using the <code>client.audio.speech.create</code> function to convert the text in each language to speech. </p>



<p>The chosen model for TTS is <code>"tts-1"</code> and the voice is set to &#8216;onyx&#8217; for all languages. </p>
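


<p>If you want to hear how the other voices compare, a small variation of the loop renders the same English sentence once per voice. A minimal sketch, reusing the <code>client</code> and <code>d</code> dictionary from the snippet above:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Render one sentence in each of the six voices for comparison
for voice in ['alloy', 'echo', 'fable', 'onyx', 'nova', 'shimmer']:
    response = client.audio.speech.create(
        model="tts-1",
        voice=voice,
        input=d['English']
    )
    response.stream_to_file(f'English_{voice}.mp3')</pre>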



<p>The audio output for each language is then saved as an MP3 file named after the respective language. Here are the language samples &#8212; look at how amazing these sound: <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f447.png" alt="👇" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>English:</strong> Finxter helps you stay on the right side of change!</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/English-1.mp3"></audio></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>Mandarin Chinese (Simplified):</strong> Finxter 帮助你保持在变化的正确一边！</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/Mandarin-Chinese-Simplified-1.mp3"></audio></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>Hindi:</strong> फिंक्सटर आपको परिवर्तन के सही पक्ष में बने रहने में मदद करता है!</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/Hindi-1.mp3"></audio></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>Spanish:</strong> ¡Finxter te ayuda a mantenerte del lado correcto del cambio!</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/Spanish-2.mp3"></audio></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>French:</strong> Finxter vous aide à rester du bon côté du changement !</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/French-1.mp3"></audio></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>Arabic:</strong> فينكستر يساعدك على البقاء على الجانب الصحيح من التغيير!</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/Arabic-1.mp3"></audio></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>Bengali:</strong> ফিন্ক্সটার আপনাকে পরিবর্তনের সঠিক দিকে থাকতে সাহায্য করে!</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/Bengali-1.mp3"></audio></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>Russian:</strong> Финкстер помогает вам оставаться на правильной стороне изменений!</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/Russian-1.mp3"></audio></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>Portuguese:</strong> Finxter ajuda você a permanecer no lado certo da mudança!</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/Portuguese-1.mp3"></audio></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>Indonesian:</strong> Finxter membantu Anda tetap di sisi yang benar dari perubahan!</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/Indonesian-1.mp3"></audio></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>Bonus &#8211; <strong>German</strong>: Finxter hilft dir, auf der richtigen Seite der Veränderung zu bleiben.</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/German.mp3"></audio></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>Thanks for being an avid Finxter reader! Check out this article next: <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f447.png" alt="👇" class="wp-smiley" style="height: 1em; max-height: 1em;" /> </p>



<figure class="wp-block-image size-large"><a href="https://blog.finxter.com/openai-text-to-speech-tts-minimal-example-in-python/"><img loading="lazy" decoding="async" width="1024" height="585" src="https://blog.finxter.com/wp-content/uploads/2023/11/54aa8411-185a-4204-8159-abcf9bfc0d29-1-1024x585.webp" alt="" class="wp-image-1653017" srcset="https://blog.finxter.com/wp-content/uploads/2023/11/54aa8411-185a-4204-8159-abcf9bfc0d29-1-1024x585.webp 1024w, https://blog.finxter.com/wp-content/uploads/2023/11/54aa8411-185a-4204-8159-abcf9bfc0d29-1-300x171.webp 300w, https://blog.finxter.com/wp-content/uploads/2023/11/54aa8411-185a-4204-8159-abcf9bfc0d29-1-768x439.webp 768w, https://blog.finxter.com/wp-content/uploads/2023/11/54aa8411-185a-4204-8159-abcf9bfc0d29-1-1536x878.webp 1536w, https://blog.finxter.com/wp-content/uploads/2023/11/54aa8411-185a-4204-8159-abcf9bfc0d29-1.webp 1792w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>



<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f9d1-200d-1f4bb.png" alt="🧑‍💻" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/openai-text-to-speech-tts-minimal-example-in-python/" data-type="link" data-id="https://blog.finxter.com/openai-text-to-speech-tts-minimal-example-in-python/">OpenAI Text to Speech (TTS): Minimal Example in Python</a></p>



<p>Feel free to check out our <a href="http://academy.finxter.com/">academy courses</a> to keep mastering prompt engineering, e.g., with Llama 2:</p>



<h2 class="wp-block-heading">Prompt Engineering with Llama 2</h2>



<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> The <strong><a href="https://academy.finxter.com/university/prompt-engineering-with-llama-2/">Llama 2 Prompt Engineering course</a></strong> helps you stay on the right side of change. Our course is meticulously designed to provide you with <em>hands-on experience through genuine projects</em>.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://academy.finxter.com/university/prompt-engineering-with-llama-2/" target="_blank" rel="noreferrer noopener"><img loading="lazy" decoding="async" width="919" height="261" src="https://blog.finxter.com/wp-content/uploads/2023/09/image-101.png" alt="" class="wp-image-1651689" srcset="https://blog.finxter.com/wp-content/uploads/2023/09/image-101.png 919w, https://blog.finxter.com/wp-content/uploads/2023/09/image-101-300x85.png 300w, https://blog.finxter.com/wp-content/uploads/2023/09/image-101-768x218.png 768w" sizes="auto, (max-width: 919px) 100vw, 919px" /></a></figure>
</div>


<p>You&#8217;ll delve into practical applications such as book PDF querying, payroll auditing, and hotel review analytics. These aren&#8217;t just theoretical exercises; they&#8217;re real-world challenges that businesses face daily.</p>



<p>By studying these projects, you&#8217;ll gain a deeper comprehension of how to harness the power of Llama 2 using <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f40d.png" alt="🐍" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Python, <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f517.png" alt="🔗" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f99c.png" alt="🦜" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Langchain, <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f332.png" alt="🌲" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Pinecone, and a whole stack of highly <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2692.png" alt="⚒" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f6e0.png" alt="🛠" class="wp-smiley" style="height: 1em; max-height: 1em;" /> practical tools of exponential coders in a post-ChatGPT world.</p>
<p>The post <a href="https://blog.finxter.com/openai-text-to-speech-tts-i-tried-the-top-ten-languages-10-speech-samples/">OpenAI Text-to-Speech (TTS) &#8211; I Tried the Top Ten Languages with 10 Amazing Speech Samples</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/English-1.mp3" length="64800" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/Mandarin-Chinese-Simplified-1.mp3" length="72000" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/Hindi-1.mp3" length="92640" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/Spanish-2.mp3" length="74400" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/French-1.mp3" length="71040" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/Arabic-1.mp3" length="103680" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/Bengali-1.mp3" length="98880" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/Russian-1.mp3" length="90720" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/Portuguese-1.mp3" length="72480" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/Indonesian-1.mp3" length="77760" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/German.mp3" length="87840" type="audio/mpeg" />

			</item>
		<item>
		<title>OpenAI Text to Speech (TTS): Minimal Example in Python</title>
		<link>https://blog.finxter.com/openai-text-to-speech-tts-minimal-example-in-python/</link>
		
		<dc:creator><![CDATA[Chris]]></dc:creator>
		<pubDate>Wed, 08 Nov 2023 15:30:44 +0000</pubDate>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Large Language Model (LLM)]]></category>
		<category><![CDATA[OpenAI]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Speech Recognition and Generation]]></category>
		<category><![CDATA[Text Processing]]></category>
		<guid isPermaLink="false">https://blog.finxter.com/?p=1652777</guid>

					<description><![CDATA[<p>To use OpenAI&#8217;s amazing Text-to-Speech (TTS) functionality, first install the openai Python library and obtain an API key from OpenAI. Instantiate an OpenAI client with openai.OpenAI(api_key). Call client.audio.speech.create(model='tts-1', voice='alloy', input=your_text) to generate speech with the 'alloy' voice. You can then save the result as an MP3 file using response.stream_to_file('your_file.mp3'). First, install the OpenAI library and set ... <a title="OpenAI Text to Speech (TTS): Minimal Example in Python" class="read-more" href="https://blog.finxter.com/openai-text-to-speech-tts-minimal-example-in-python/" aria-label="Read more about OpenAI Text to Speech (TTS): Minimal Example in Python">Read more</a></p>
<p>The post <a href="https://blog.finxter.com/openai-text-to-speech-tts-minimal-example-in-python/">OpenAI Text to Speech (TTS): Minimal Example in Python</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p class="has-global-color-8-background-color has-background">To use OpenAI&#8217;s amazing Text-to-Speech (TTS) functionality, first install the <code>openai</code> Python library and obtain an API key from OpenAI. <br><br>Instantiate an OpenAI <code>client</code> with <code>openai.OpenAI(api_key)</code>. <br><br>Call <code>client.audio.speech.create(model='tts_1', voice='alloy', input=your_text)</code> to use the <code>'alloy'</code> voice model. <br><br>This generates speech you can save as an MP3 file using <code>response.stream_to_file('your_file.mp3')</code>. </p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>First, install the OpenAI library and set up your OpenAI key. </p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pip install openai # Python 2 or 3
pip3 install openai # Python 3 
!pip install openai # Google Colab</pre>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://blog.finxter.com/how-to-install-openai-in-python/"><img loading="lazy" decoding="async" width="1024" height="575" src="https://blog.finxter.com/wp-content/uploads/2023/11/image-1-2-1024x575.png" alt="" class="wp-image-1652778" srcset="https://blog.finxter.com/wp-content/uploads/2023/11/image-1-2-1024x575.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/11/image-1-2-300x168.png 300w, https://blog.finxter.com/wp-content/uploads/2023/11/image-1-2-768x431.png 768w, https://blog.finxter.com/wp-content/uploads/2023/11/image-1-2.png 1364w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></a></figure>
</div>


<p class="has-base-2-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f9d1-200d-1f4bb.png" alt="🧑‍💻" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/how-to-install-openai-in-python/" data-type="post" data-id="1170845" target="_blank" rel="noreferrer noopener">How to Install OpenAI in Python?</a></p>



<p>Second, copy and paste the following code into your Python script or notebook, replacing the OpenAI API key with your own.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="3,4" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import openai

your_openai_key = 'sk-...'
your_text = 'Finxter helps you stay on the right side of change!'

client = openai.OpenAI(api_key=your_openai_key)

response = client.audio.speech.create(
  model="tts-1",
  voice="alloy", # other voices: 'echo', 'fable', 'onyx', 'nova', 'shimmer'
  input=your_text
)

response.stream_to_file('speech.mp3')
</pre>



<p>You can now find the file <code>'speech.mp3'</code> in the same folder where you ran your Python script. Easy as that!</p>
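<p>If you&#8217;d rather play the file straight from Python instead of opening it manually, a small helper like the third-party <code>playsound</code> package works; this is just one option for illustration, and any media player does the job equally well:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># pip install playsound
from playsound import playsound

# Blocks until the MP3 generated above has finished playing
playsound('speech.mp3')</pre>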



<p>Now have a listen to the amazing result: the voice sounds like a genuine human being, doesn&#8217;t it? <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f447.png" alt="👇" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/speech4.mp3"></audio></figure>



<p>At the time of writing, you can use the following voices:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">voices = ['alloy', 'echo', 'fable', 'onyx', 'nova', 'shimmer']</pre>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="585" src="https://blog.finxter.com/wp-content/uploads/2023/11/895d4972-fc04-42f0-8d4f-8186acab703a-1024x585.webp" alt="" class="wp-image-1652793" srcset="https://blog.finxter.com/wp-content/uploads/2023/11/895d4972-fc04-42f0-8d4f-8186acab703a-1024x585.webp 1024w, https://blog.finxter.com/wp-content/uploads/2023/11/895d4972-fc04-42f0-8d4f-8186acab703a-300x171.webp 300w, https://blog.finxter.com/wp-content/uploads/2023/11/895d4972-fc04-42f0-8d4f-8186acab703a-768x439.webp 768w, https://blog.finxter.com/wp-content/uploads/2023/11/895d4972-fc04-42f0-8d4f-8186acab703a-1536x878.webp 1536w, https://blog.finxter.com/wp-content/uploads/2023/11/895d4972-fc04-42f0-8d4f-8186acab703a.webp 1792w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>


<p>Here are the six different voices in that order:</p>



<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f468.png" alt="👨" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Alloy </strong>(male):</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/speech_alloy.mp3"></audio></figure>



<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f468-200d-1f9b2.png" alt="👨‍🦲" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Echo </strong>(male):</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/speech_echo.mp3"></audio></figure>



<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f984.png" alt="🦄" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Fable </strong>(female?):</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/speech_fable.mp3"></audio></figure>



<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f469.png" alt="👩" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Onyx </strong>(female):</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/speech_nova.mp3"></audio></figure>



<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f9d3.png" alt="🧓" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Nova </strong>(deep male):</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/speech_onyx.mp3"></audio></figure>



<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f483.png" alt="💃" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Shimmer </strong>(female):</p>



<figure class="wp-block-audio"><audio controls src="https://blog.finxter.com/wp-content/uploads/2023/11/speech_shimmer.mp3"></audio></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>Staying up to date in these rapidly changing times is crucial. Feel free to join our free email newsletter by downloading our Python and OpenAI cheat sheets:</p>






<p>You can also take our prompt engineering courses to keep growing your skills:</p>



<h2 class="wp-block-heading">Prompt Engineering with Llama 2</h2>



<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> The <strong><a href="https://academy.finxter.com/university/prompt-engineering-with-llama-2/">Llama 2 Prompt Engineering course</a></strong> helps you stay on the right side of change. Our course is meticulously designed to provide you with <em>hands-on experience through genuine projects</em>.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://academy.finxter.com/university/prompt-engineering-with-llama-2/" target="_blank" rel="noreferrer noopener"><img loading="lazy" decoding="async" width="919" height="261" src="https://blog.finxter.com/wp-content/uploads/2023/09/image-101.png" alt="" class="wp-image-1651689" srcset="https://blog.finxter.com/wp-content/uploads/2023/09/image-101.png 919w, https://blog.finxter.com/wp-content/uploads/2023/09/image-101-300x85.png 300w, https://blog.finxter.com/wp-content/uploads/2023/09/image-101-768x218.png 768w" sizes="auto, (max-width: 919px) 100vw, 919px" /></a></figure>
</div>


<p>You&#8217;ll delve into practical applications such as book PDF querying, payroll auditing, and hotel review analytics. These aren&#8217;t just theoretical exercises; they&#8217;re real-world challenges that businesses face daily.</p>



<p>By studying these projects, you&#8217;ll gain a deeper comprehension of how to harness the power of Llama 2 using <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f40d.png" alt="🐍" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Python, <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f517.png" alt="🔗" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f99c.png" alt="🦜" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Langchain, <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f332.png" alt="🌲" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Pinecone, and a whole stack of highly <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2692.png" alt="⚒" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f6e0.png" alt="🛠" class="wp-smiley" style="height: 1em; max-height: 1em;" /> practical tools of exponential coders in a post-ChatGPT world.</p>
<p>The post <a href="https://blog.finxter.com/openai-text-to-speech-tts-minimal-example-in-python/">OpenAI Text to Speech (TTS): Minimal Example in Python</a> appeared first on <a href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
]]></content:encoded>
					
		
		<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/speech4.mp3" length="63840" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/speech_alloy.mp3" length="63840" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/speech_echo.mp3" length="60000" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/speech_fable.mp3" length="64800" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/speech_nova.mp3" length="64800" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/speech_onyx.mp3" length="65760" type="audio/mpeg" />
<enclosure url="https://blog.finxter.com/wp-content/uploads/2023/11/speech_shimmer.mp3" length="65280" type="audio/mpeg" />

			</item>
	</channel>
</rss>
