Project Idea
A good friend and his wife recently founded an AI startup in the lifestyle niche that uses machine learning to discover specific real-world patterns from videos.
For their business system, they need a pipeline that takes a video file, converts it to audio, and transcribes the audio to standard text that is then used for further processing.
I couldn’t help but work on a basic solution to help fix their business problem. In this project, I’ll share my code solution to transcribe an audio file — I hope it can be of some use to you as well!
So, let’s get started!
💪 LARGE AUDIO FILES: This solution will also work for large audio files longer than, say, a few minutes of speech.
In the meantime, a newer tool has become available that may serve you better for this task. Check out this tutorial instead:
💡 Recommended Tutorial: OpenAI’s Speech-to-Text API: A Comprehensive Guide
Solution Overview
To transcribe a large audio file in Python, follow these rough steps:
- Step 1: Import the SpeechRecognition library (which wraps Google’s Web Speech API, among others) and the Pydub library, then create a speech recognition object by instantiating the Recognizer() class.
- Step 2: Define a function transcribe_large_audio() that takes the path to an audio file as an argument.
- Step 3: Inside the function, open the audio file with Pydub and split it into chunks at stretches of silence. Keeping each chunk short avoids the RequestError that is raised when Google rejects a file for being too large.
- Step 4: Create a folder to store the chunks, then recognize each chunk separately using the Speech Recognition library and append the resulting text to a variable.
- Step 5: Finally, return the whole transcription and print it to both the console and a text file.
Preparation
Before you start, make sure to install both the speech_recognition and the pydub module in your programming environment.
Recommended: How to Install a Library in Python?
In particular, run the following two commands in your shell or terminal:
pip3.9 install pydub
pip3.9 install SpeechRecognition
This is for my Python version 3.9 installation, anyways.
I’m sure you have a more recent version installed already, so check your Python version before installation to avoid installing the two libraries for the wrong Python version on your computer — a common mistake of beginners!
Wait for the installation to complete before moving on!
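To double-check that both libraries landed in the interpreter you actually run, a small check like the following can help. It is only a sketch that uses the standard library; the helper name check_modules is made up for illustration.

```python
import importlib.util
import sys


def check_modules(names):
    """Return a dict mapping each module name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # The interpreter running this script is the one pip should target
    print("Python:", sys.version.split()[0], "at", sys.executable)

    # Import names differ from pip names: SpeechRecognition -> speech_recognition
    for name, found in check_modules(["speech_recognition", "pydub"]).items():
        print(f"{name}: {'installed' if found else 'missing'}")

    # Installing through this exact interpreter avoids the wrong-version mistake
    print("Install with:", sys.executable, "-m pip install pydub SpeechRecognition")
```

If one of the modules shows up as missing, running the printed install command uses the same interpreter that executed the check.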
Done? Let’s move on to the code!
Python Code
Without further ado, here’s how to implement a speech recognition pipeline in basic Python code:
# Import libraries
import speech_recognition as sr
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence

# Create a speech recognition object
r = sr.Recognizer()


def transcribe_large_audio(path):
    """Split audio into chunks and apply speech recognition"""
    # Open audio file with pydub
    sound = AudioSegment.from_wav(path)

    # Split audio where silence is 700ms or greater and get chunks
    chunks = split_on_silence(sound, min_silence_len=700, silence_thresh=sound.dBFS-14, keep_silence=700)

    # Create folder to store audio chunks
    folder_name = "audio-chunks"
    if not os.path.isdir(folder_name):
        os.mkdir(folder_name)

    whole_text = ""

    # Process each chunk
    for i, audio_chunk in enumerate(chunks, start=1):
        # Export chunk and save in folder
        chunk_filename = os.path.join(folder_name, f"chunk{i}.wav")
        audio_chunk.export(chunk_filename, format="wav")

        # Recognize chunk
        with sr.AudioFile(chunk_filename) as source:
            audio_listened = r.record(source)
            # Convert to text
            try:
                text = r.recognize_google(audio_listened)
            except sr.UnknownValueError as e:
                print("Error:", str(e))
            else:
                text = f"{text.capitalize()}. "
                print(chunk_filename, ":", text)
                whole_text += text

    # Return text for all chunks
    return whole_text


result = transcribe_large_audio('sample_audio.wav')
print(result)

# Write the transcription to a text file and close the file afterwards
with open('result.txt', 'w') as f:
    print(result, file=f)
Don’t worry if you didn’t get it yet — I’ll give some more explanations next.
However, you can already copy and paste this code into a Python file (e.g., transcribe.py) that resides in the same folder as your sample audio file. Then replace 'sample_audio.wav' with the filename of your own WAV file (the code opens it with AudioSegment.from_wav()) and run the Python script.
Recommended: How to Execute a Python Script?
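The script reads its input with AudioSegment.from_wav(), so a video file has to be converted to WAV first. One way to do that, sketched here under the assumption that the ffmpeg command-line tool is installed and on your PATH, is to call ffmpeg from Python. The helper names and the file name clip.mp4 are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(src, dst=None):
    """Build the ffmpeg call that writes the audio track of `src` to a WAV file."""
    src = Path(src)
    dst = Path(dst) if dst is not None else src.with_suffix(".wav")
    if dst == src:
        raise ValueError("Input and output paths must differ")
    return [
        "ffmpeg",
        "-y",             # overwrite the output file if it already exists
        "-i", str(src),   # input file (video or audio)
        "-vn",            # ignore the video stream, keep audio only
        str(dst),         # the .wav extension selects the WAV format
    ]


def convert_to_wav(src, dst=None):
    """Run ffmpeg (assumed to be installed) and return the WAV file's path."""
    command = build_ffmpeg_command(src, dst)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return command[-1]


if __name__ == "__main__":
    # Only prints the command; call convert_to_wav() to actually run it
    print(" ".join(build_ffmpeg_command("clip.mp4")))
```

The path returned by convert_to_wav() can then be passed straight to transcribe_large_audio().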
Explanation
The code imports the Speech Recognition and Pydub libraries and creates a speech recognition object.
The speech recognition object is used to convert audio to text:
# Import libraries
import speech_recognition as sr
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence

# Create a speech recognition object
r = sr.Recognizer()
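It then defines a function (transcribe_large_audio) that takes in a path to an audio file as an argument.
Inside the function, the audio is opened with Pydub and split into chunks wherever there is silence. Three parameters control the split: min_silence_len is the shortest pause (in milliseconds) that counts as a split point, silence_thresh is the loudness (in dBFS) below which audio counts as silence (here 14 dB below the average loudness of the file), and keep_silence is how much silence (in milliseconds) is kept at the start and end of each chunk: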
def transcribe_large_audio(path):
    """Split audio into chunks and apply speech recognition"""
    # Open audio file with pydub
    sound = AudioSegment.from_wav(path)

    # Split audio where silence is 700ms or greater and get chunks
    chunks = split_on_silence(sound, min_silence_len=700, silence_thresh=sound.dBFS-14, keep_silence=700)
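To build some intuition for the minimum silence length, the silence threshold, and the kept silence, here is a toy version of the idea that works on a plain list of loudness values, one per millisecond. It is not Pydub’s actual algorithm, the name toy_split_on_silence is made up, and it does not deduplicate padding when two neighboring chunks share a short silence.

```python
def toy_split_on_silence(levels, min_silence_len, silence_thresh, keep_silence):
    """Return (start, end) index pairs of the non-silent chunks in `levels`."""
    # 1. Collect runs of consecutive quiet samples that are long enough
    silences = []
    run_start = None
    # The infinite sentinel is never "quiet", so it closes a trailing run
    for i, level in enumerate(list(levels) + [float("inf")]):
        if level < silence_thresh:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_silence_len:
                silences.append((run_start, i))
            run_start = None

    # 2. The chunks are the stretches between those silences
    chunks = []
    position = 0
    for start, end in silences + [(len(levels), len(levels))]:
        if start > position:
            # 3. Pad each chunk with up to `keep_silence` samples per side
            chunks.append((max(0, position - keep_silence),
                           min(len(levels), start + keep_silence)))
        position = end
    return chunks


if __name__ == "__main__":
    # Loud, a 4 ms pause, loud, a 1 ms pause (too short to split), loud
    levels = [-10] * 3 + [-50] * 4 + [-12] * 2 + [-50] * 1 + [-11] * 2
    print(toy_split_on_silence(levels, min_silence_len=3,
                               silence_thresh=-30, keep_silence=1))
    # prints [(0, 4), (6, 12)]
```

Raising min_silence_len yields fewer, longer chunks, while a higher silence_thresh treats more of the audio as silence and therefore splits more often.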
A folder is then created to store the chunks, and each chunk is exported to its own WAV file in that folder.
Each exported chunk is read back and converted to text using the Speech Recognition library. If a chunk cannot be understood, the error is printed and the chunk is skipped.
The recognized text is appended to a variable. After the loop, the function returns the whole transcription, which the script prints to both the console and a text file:
    # Create folder to store audio chunks
    folder_name = "audio-chunks"
    if not os.path.isdir(folder_name):
        os.mkdir(folder_name)

    whole_text = ""

    # Process each chunk
    for i, audio_chunk in enumerate(chunks, start=1):
        # Export chunk and save in folder
        chunk_filename = os.path.join(folder_name, f"chunk{i}.wav")
        audio_chunk.export(chunk_filename, format="wav")

        # Recognize chunk
        with sr.AudioFile(chunk_filename) as source:
            audio_listened = r.record(source)
            # Convert to text
            try:
                text = r.recognize_google(audio_listened)
            except sr.UnknownValueError as e:
                print("Error:", str(e))
            else:
                text = f"{text.capitalize()}. "
                print(chunk_filename, ":", text)
                whole_text += text
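The loop only catches sr.UnknownValueError, so a network hiccup that raises sr.RequestError would still stop the whole run. One possible way to soften that is a small retry helper. This is a sketch, not part of the Speech Recognition library: call_with_retry and its parameters are made-up names.

```python
import time


def call_with_retry(func, *args, retries=3, delay=1.0, retry_on=(Exception,)):
    """Call func(*args); if it raises one of `retry_on`, wait and try again."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(1, retries + 1):
        try:
            return func(*args)
        except retry_on as error:
            if attempt == retries:
                raise  # out of attempts: let the caller see the error
            print(f"Attempt {attempt} failed ({error}); retrying in {delay}s")
            time.sleep(delay)
```

Inside the loop, the recognition call could then become text = call_with_retry(r.recognize_google, audio_listened, retry_on=(sr.RequestError,)). An sr.UnknownValueError is deliberately not retried, because it means the speech in that chunk was unintelligible rather than that the request failed.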
Where to Go From Here?
Thanks for reading the whole article! Make sure to join our email academy of ~150,000 coders, and counting. We have plenty of free stuff and coding projects!
Recommended: Coding Your Own Google Home and Launch Spotify in Python