OpenAI’s Speech-to-Text API: A Comprehensive Guide

Here’s an obvious statement: audio and video recordings are an essential part of our daily lives.

Recently, I consulted for a young startup in the food AI space on how to convert video to text, and I recommended Google’s speech recognition software to them. Yet Google’s models are not great; they are merely the best in class at this point in time.

Enter… OpenAI. 🚀

To alleviate the problem of either relying on suboptimal speech-to-text Python libraries or manually transcribing audio files (🥴), OpenAI has recently introduced its state-of-the-art speech-to-text API that can transcribe uploaded audio files into text.

In this article, we will provide a comprehensive guide to OpenAI’s speech-to-text API.

Overview

OpenAI’s speech-to-text API provides two endpoints, transcriptions and translations, based on their state-of-the-art open-source large-v2 Whisper model.

The transcriptions endpoint can be used to transcribe audio into whatever language the audio is in.
On the other hand, the translations endpoint translates the audio into English text, regardless of the original language of the audio.

Currently, file uploads are limited to 25 MB, and the following input file types are supported:

mp3,
mp4,
mpeg,
mpga,
m4a,
wav, and
webm.

So, yes, you can transcribe both video and audio!

In case you skimmed over the previous sentence, here it is again in bold:

👉 Whisper API can transcribe both video and audio file formats! 👈
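If you want to catch a file the API would reject before you upload it, a small pre-flight check can help. Here is a minimal sketch: `check_upload`, `MAX_BYTES`, and `SUPPORTED_EXTENSIONS` are hypothetical names, and treating 25 MB as 25 * 1024 * 1024 bytes is an assumption, since the exact byte count is not stated here.

```python
import os

# Limits as quoted in this article; they may change over time.
# Assumption: "25 MB" is interpreted as 25 * 1024 * 1024 bytes.
MAX_BYTES = 25 * 1024 * 1024
SUPPORTED_EXTENSIONS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}


def check_upload(path, size_bytes=None):
    """Return a list of problems that would likely block an upload (empty if none)."""
    problems = []
    # Compare the extension case-insensitively and without the leading dot
    extension = os.path.splitext(path)[1].lstrip(".").lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")
    # Read the size from disk unless the caller already knows it
    if size_bytes is None:
        size_bytes = os.path.getsize(path)
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_BYTES}")
    return problems
```

For example, `check_upload("talk.mp4", size_bytes=1000)` returns an empty list, while a 30 MB `.txt` file would be flagged twice.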

Supported Languages

The Whisper API supports a wide range of languages, including Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.

At the time of writing this, I found the following up-to-date performance scores of the model (lower is better):

[Figure: performance scores of the model by language]

Read the paper if you’re interested in going down this rabbit hole!

Installing OpenAI Library

First, you need to install the openai library before you can use it in your Python code.

pip install openai

You can learn more in our detailed Finxter tutorial:

💡 Recommended: How to Install OpenAI in Python?

Transcriptions Endpoint

The transcriptions endpoint takes as input the audio file you want to transcribe and the desired output file format for the transcription of the audio. OpenAI currently supports multiple input and output file formats.

To transcribe audio, you can use the following Python code:

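import openai
# Open the file in binary mode; the with-block closes it after the upload
with open("/path/to/file/my_audio.mp3", "rb") as audio_file:
    # "whisper-1" is the model name; the result holds the transcribed text
    transcript = openai.Audio.transcribe("whisper-1", audio_file)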

If you’re an avid reader of the Finxter blog, you know the vital role of Python one-liners. With OpenAI’s Whisper you can transcribe an audio or video file in a single line of Python code!

import openai;print(openai.Audio.transcribe("whisper-1", open("godfather.mp3", "rb")))

Fantastic! ♥️

By default, the response type will be JSON, with the raw text included:

{
  "text": "I'm gonna make him an offer he can't refuse."
}
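To get at the plain string, you read the "text" field. Here is a minimal sketch that parses the sample response above with the standard json module, so it runs without an API key. With the library call, indexing the result the same way should work, assuming the returned object behaves like a dictionary.

```python
import json

# The sample response from above, as a raw JSON string
raw = '{"text": "I\'m gonna make him an offer he can\'t refuse."}'

# Parse the JSON and pull out the transcribed text
response = json.loads(raw)
text = response["text"]
print(text)
```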

Using Curl in the Command Line (Alternative)

Additional parameters can be set in the request by adding more --form lines with the relevant options. Here’s an example from the docs using curl, i.e., not Python:

curl --request POST \
  --url https://api.openai.com/v1/audio/transcriptions \
  --header 'Authorization: Bearer TOKEN' \
  --header 'Content-Type: multipart/form-data' \
  --form file=@openai.mp3 \
  --form model=whisper-1 \
  --form response_format=text

Let’s move back to Python: 👇
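In Python, the same optional fields can be passed as keyword arguments. Here is a minimal sketch, assuming the pre-1.0 openai interface used in this article forwards extra keyword arguments as request fields. `transcription_options` and `transcribe_file` are hypothetical helper names, not part of the library.

```python
# Output formats the API documents for response_format
ALLOWED_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}


def transcription_options(response_format="json", language=None, prompt=None):
    """Collect optional request fields, skipping any that were not set."""
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unknown response_format: {response_format}")
    options = {"response_format": response_format}
    if language is not None:
        options["language"] = language
    if prompt is not None:
        options["prompt"] = prompt
    return options


def transcribe_file(path, **kwargs):
    """Upload a file with the chosen options (needs the openai package and an API key)."""
    import openai  # imported lazily so the helper above works without it

    with open(path, "rb") as audio_file:
        # Assumption: extra keyword arguments are sent along as form fields
        return openai.Audio.transcribe(
            "whisper-1", audio_file, **transcription_options(**kwargs)
        )
```

A call such as `transcribe_file("openai.mp3", response_format="text")` would then mirror the curl request above.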

Translations Endpoint

The translations endpoint takes as input an audio file in any of the supported languages and, if necessary, translates the audio into English.

To translate audio, you can use the following Python code:

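import openai
# Open the file in binary mode; the with-block closes it after the upload
with open("my_file.mp3", "rb") as audio_file:
    # The result holds the English text, whatever language was spoken
    transcript = openai.Audio.translate("whisper-1", audio_file)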

Alternatively, you could also pass an .mp4 video file into it!

Longer Inputs

By default, the Whisper API only supports files that are smaller than 25 MB. If you have an audio file that is larger than that, you will need to break it up into chunks of 25 MB or less or use a compressed audio format. To handle longer inputs, you can use the PyDub open-source Python package to split the audio.

from pydub import AudioSegment

song = AudioSegment.from_mp3("good_morning.mp3")

# PyDub handles time in milliseconds
ten_minutes = 10 * 60 * 1000

first_10_minutes = song[:ten_minutes]

first_10_minutes.export("good_morning_10.mp3", format="mp3")
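The snippet above only keeps the first ten minutes. To cover the whole recording, you can loop over fixed-length windows. Here is a minimal sketch: `chunk_ranges` and `export_chunks` are hypothetical helper names, and the ten-minute default is an assumption, since whether a chunk stays under 25 MB depends on the bitrate of your file.

```python
def chunk_ranges(total_ms, chunk_ms):
    """Yield (start, end) millisecond pairs that together cover the whole recording."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    for start in range(0, total_ms, chunk_ms):
        # The last chunk may be shorter than chunk_ms
        yield start, min(start + chunk_ms, total_ms)


def export_chunks(path, chunk_ms=10 * 60 * 1000):
    """Split an MP3 into numbered files and return their names (needs pydub and ffmpeg)."""
    from pydub import AudioSegment  # lazy import: chunk_ranges works without pydub

    audio = AudioSegment.from_mp3(path)
    names = []
    # len(audio) is the duration in milliseconds
    for index, (start, end) in enumerate(chunk_ranges(len(audio), chunk_ms)):
        name = f"{path.rsplit('.', 1)[0]}_{index:03d}.mp3"
        audio[start:end].export(name, format="mp3")
        names.append(name)
    return names
```

Calling `export_chunks("good_morning.mp3")` would write good_morning_000.mp3, good_morning_001.mp3, and so on.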

Now, you may also want to read this Finxter tutorial on a related topic:

💡 Recommended: How to Transcribe Large Audio to Text Python

Prompting

You can use a prompt to improve the quality of the transcripts generated by the Whisper API.

The model will try to match the style of the prompt. For example, if your prompt uses capitalization and punctuation, the model will also do so (or at least try).

The prompting system is limited compared to OpenAI’s other language models, but it still provides some control over the generated transcript.

Here are some examples of how prompts can be used:

Unlikely Words or Acronyms

To correct specific words or acronyms that the model often misrecognizes in the audio (e.g., 'Finxter'), you can include them in the prompt.

For example, the following prompt improves the transcription of the word Finxter: "The transcript is about Finxter, a coding education platform." Most transcription tools would otherwise misspell the word 'Finxter'.

Preserve Context After Splitting Due to Large Input Size

To preserve the context of a file that was split into segments, you can prompt the model with the transcript of the preceding segment.

This will make the transcript more accurate, as the model will use the relevant information from the previous audio. The model will only consider the final 224 tokens of the prompt and ignore anything earlier.
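Chaining the segments this way can be sketched as a simple loop. In the minimal sketch below, `transcribe_segments` and `whisper_transcribe` are hypothetical helper names, and the call assumes the pre-1.0 openai interface used in this article accepts a prompt keyword argument. The whole previous transcript is passed along, leaving it to the model to ignore everything but the final tokens.

```python
def transcribe_segments(segment_paths, transcribe):
    """Transcribe segments in order, feeding each the previous transcript as its prompt."""
    transcripts = []
    previous = ""  # the first segment has no preceding context
    for path in segment_paths:
        text = transcribe(path, prompt=previous)
        transcripts.append(text)
        previous = text
    return " ".join(transcripts)


def whisper_transcribe(path, prompt=""):
    """Transcribe one file (needs the openai package and an API key)."""
    import openai  # imported lazily so the loop above can be tried without it

    # Only send a prompt when there is one
    options = {"prompt": prompt} if prompt else {}
    with open(path, "rb") as audio_file:
        result = openai.Audio.transcribe("whisper-1", audio_file, **options)
    return result["text"]
```

With the chunk files from the previous section, `transcribe_segments(names, whisper_transcribe)` would return one joined transcript.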

Add Punctuation

If the model skips punctuation in the transcript, you can use a prompt that includes punctuation to avoid this.

For example: Hi there, it's me, Chris!

Not Filtering Out Filler Words (‘Ummmmmm’)

If you want to keep the filler words in your transcript for whatever reason 🤨, you can use a prompt that contains them.

For example: Umm, Finxter is like, hmm... Okay, umm, Finxter is like umm an academy teaching umm exponential technologies.

Writing Style

If you want a specific writing style for the transcribed text, such as medieval English, traditional Chinese, or the style of Warren Buffett, you can provide some examples in the prompt or try flat-out telling the model to transcribe in that style.

Conclusion

OpenAI’s speech-to-text API provides an efficient and accurate solution for transcribing audio into text. With support for a wide range of languages and the ability to handle longer inputs, this API can be used in various applications such as subtitling, captioning, and transcription. By using prompts, users can further improve the quality of the generated transcripts.