Large Audio to Text? Here’s My Speech Recognition Solution in Python

Project Idea

A good friend and his wife recently founded an AI startup in the lifestyle niche that uses machine learning to discover specific real-world patterns from videos.

For their business system, they need a pipeline that takes a video file, converts it to audio, and transcribes the audio to standard text that is then used for further processing.

I couldn’t help but work on a basic solution to help fix their business problem. In this project, I’ll share my code solution to transcribe an audio file — I hope it can be of some use to you as well!

So, let’s get started!

💪 LARGE AUDIO FILES: This solution will also work for large audio files longer than, say, a few minutes of speech.

In the meantime, a newer tool has become available that may give you better results. Check out this tutorial as an alternative:

💡 Recommended Tutorial: OpenAI’s Speech-to-Text API: A Comprehensive Guide

Solution Overview

To transcribe a large audio file in Python, follow these rough steps:

  • Step 1: Import the SpeechRecognition and Pydub libraries and create a speech recognition object by instantiating the Recognizer class.
  • Step 2: Define a function transcribe_large_audio() that takes in a path to an audio file as an argument.
  • Step 3: Inside the function, open the audio file with Pydub and split it into chunks wherever there is a long enough stretch of silence. Sending small chunks avoids the RequestError that occurs when Google complains about the file size being too large.
  • Step 4: Create a folder to store the chunks, and recognize each chunk separately using the Speech Recognition library, convert it to text, and store it in a variable.
  • Step 5: Finally, return the whole transcription and print it to both the console and a text file.
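
The steps above start from a WAV file, while the pipeline described earlier starts from a video. The code in this article does not cover that conversion. As a rough sketch, assuming the ffmpeg command-line tool is installed and on your PATH, you could extract the audio track like this. The function names and the 16 kHz mono settings are my own assumptions, not part of the original solution:

```python
import subprocess

def build_ffmpeg_command(video_path, wav_path, sample_rate=16000):
    """Build an ffmpeg command that extracts a mono WAV track from a video."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", video_path,         # input video file
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to a single audio channel
        "-ar", str(sample_rate),  # resample the audio (assumed 16 kHz)
        wav_path,
    ]

def extract_audio(video_path, wav_path):
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(video_path, wav_path), check=True)

# Hypothetical usage:
# extract_audio("interview.mp4", "sample_audio.wav")
```

The resulting WAV file can then be passed to the transcription function.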

Preparation

Before you start, make sure to install both the SpeechRecognition package (imported as speech_recognition) and the pydub package in your programming environment.

👉 Recommended: How to Install a Library in Python?

In particular, run the following two commands in your shell or terminal:

pip3.9 install pydub
pip3.9 install SpeechRecognition

This is for my Python version 3.9 installation, anyways.

I’m sure you have a more recent version installed already, so check your Python version before installation to avoid installing the two libraries for the wrong Python version on your computer — a common mistake of beginners!

Wait for the installation to complete before moving on!
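
To confirm that both libraries landed in the interpreter you actually run, a small check like the following may help. It uses only the standard library, and the helper name missing_packages is my own:

```python
import importlib.util
import sys

def missing_packages(module_names):
    """Return the module names that this interpreter cannot find."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

# Import names differ from pip names: SpeechRecognition -> speech_recognition
missing = missing_packages(["speech_recognition", "pydub"])
if missing:
    print("Missing:", ", ".join(missing))
    # Installing via the running interpreter avoids the wrong-version mistake
    print(f"Try: {sys.executable} -m pip install SpeechRecognition pydub")
else:
    print("Both libraries are available for", sys.executable)
```

Running pip through sys.executable ties the installation to the same Python version that will run your script.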


Done? Let’s move on to the code! 👍

Python Code

Without further ado, here’s how to implement a speech recognition pipeline in basic Python code:

# Import libraries
import speech_recognition as sr
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence

# Create a speech recognition object
r = sr.Recognizer()

def transcribe_large_audio(path):
    """Split audio into chunks and apply speech recognition"""
    # Open audio file with pydub
    sound = AudioSegment.from_wav(path)

    # Split audio where silence is 700ms or greater and get chunks
    chunks = split_on_silence(sound, min_silence_len=700, silence_thresh=sound.dBFS-14, keep_silence=700)
    
    # Create folder to store audio chunks
    folder_name = "audio-chunks"
    if not os.path.isdir(folder_name):
        os.mkdir(folder_name)
    
    whole_text = ""
    # Process each chunk
    for i, audio_chunk in enumerate(chunks, start=1):
        # Export chunk and save in folder
        chunk_filename = os.path.join(folder_name, f"chunk{i}.wav")
        audio_chunk.export(chunk_filename, format="wav")

        # Recognize chunk
        with sr.AudioFile(chunk_filename) as source:
            audio_listened = r.record(source)
            # Convert to text
            try:
                text = r.recognize_google(audio_listened)
            except sr.UnknownValueError as e:
                print("Error:", str(e))
            else:
                text = f"{text.capitalize()}. "
                print(chunk_filename, ":", text)
                whole_text += text

    # Return text for all chunks
    return whole_text

result = transcribe_large_audio('sample_audio.wav')

print(result)
# Write the transcription to a file; the with-block closes the file afterwards
with open('result.txt', 'w') as f:
    print(result, file=f)

Don’t worry if you didn’t get it yet — I’ll give some more explanations next.

However, you can already copy and paste this code into a Python file (e.g., transcribe.py; avoid the name code.py, which shadows a standard-library module) that resides in the same folder as your sample audio file. Then replace 'sample_audio.wav' with your specific audio filename and run the Python script.
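
If you would rather not edit the filename inside the script each time, one option is to read it from the command line. The following is a minimal sketch using the standard argparse module. The argument names audio and --output are my own choices:

```python
import argparse

def build_parser():
    """Create a parser with an input path and an optional output path."""
    parser = argparse.ArgumentParser(description="Transcribe a WAV file to text.")
    parser.add_argument("audio", help="path to the WAV file to transcribe")
    parser.add_argument("--output", default="result.txt",
                        help="text file that receives the transcription")
    return parser

# An explicit list is parsed here for illustration; in a real script,
# call parse_args() without arguments to read the actual command line.
args = build_parser().parse_args(["sample_audio.wav"])
print(args.audio, args.output)
```

You would then pass args.audio to transcribe_large_audio() and write the result to args.output.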

👉 Recommended: How to Execute a Python Script?

Explanation

The code imports the Speech Recognition and Pydub libraries and creates a speech recognition object.

The speech recognition object is used to convert audio to text:

# Import libraries
import speech_recognition as sr
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence

# Create a speech recognition object
r = sr.Recognizer()

It then defines a function (transcribe_large_audio) that takes in a path to an audio file as an argument.

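Inside the function, the audio is opened with Pydub and split into chunks wherever there is silence. Three parameters control the split: min_silence_len is the minimum length of silence (in milliseconds) that counts as a break, silence_thresh is the loudness (in dBFS) below which audio counts as silent (here, 14 dB below the file's average loudness), and keep_silence is the amount of silence (in milliseconds) kept at the edges of each chunk: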

def transcribe_large_audio(path):
    """Split audio into chunks and apply speech recognition"""
    # Open audio file with pydub
    sound = AudioSegment.from_wav(path)

    # Split audio where silence is 700ms or greater and get chunks
    chunks = split_on_silence(sound, min_silence_len=700, silence_thresh=sound.dBFS-14, keep_silence=700)
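
To build intuition for what min_silence_len and silence_thresh do, here is a deliberately simplified sketch in plain Python. It is not Pydub's implementation: it assumes a hypothetical list with one loudness value per millisecond and ignores keep_silence entirely.

```python
def split_on_quiet_runs(levels_db, min_silence_len, silence_thresh):
    """Return (start, end) index pairs of the non-silent spans.

    levels_db: one loudness value per millisecond (hypothetical input).
    A run of at least min_silence_len values below silence_thresh
    ends the current span; shorter quiet runs stay inside the span.
    """
    spans = []
    start = None  # start index of the current non-silent span
    quiet = 0     # length of the current run of quiet values
    for i, level in enumerate(levels_db):
        if level < silence_thresh:
            quiet += 1
            if start is not None and quiet >= min_silence_len:
                # Close the span at the point where the quiet run began
                spans.append((start, i - quiet + 1))
                start = None
        else:
            if start is None:
                start = i
            quiet = 0
    if start is not None:
        spans.append((start, len(levels_db)))
    return spans

# Loud, long pause, loud, short pause, loud
levels = [-10] * 3 + [-50] * 4 + [-10] * 2 + [-50] * 1 + [-10] * 2
print(split_on_quiet_runs(levels, min_silence_len=3, silence_thresh=-30))
```

The long pause splits the audio into two spans, while the short pause is too brief to count and stays inside the second span.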

A folder is then created to store the chunks. Each chunk is exported as a WAV file, loaded with the Speech Recognition library, and sent to Google's speech recognition service to be converted to text. If a chunk cannot be understood, an error is printed and the chunk is skipped.

The text of each chunk is appended to a variable, and the function returns the whole transcription, which is printed to both the console and a text file:

    # Create folder to store audio chunks
    folder_name = "audio-chunks"
    if not os.path.isdir(folder_name):
        os.mkdir(folder_name)
    
    whole_text = ""
    # Process each chunk
    for i, audio_chunk in enumerate(chunks, start=1):
        # Export chunk and save in folder
        chunk_filename = os.path.join(folder_name, f"chunk{i}.wav")
        audio_chunk.export(chunk_filename, format="wav")

        # Recognize chunk
        with sr.AudioFile(chunk_filename) as source:
            audio_listened = r.record(source)
            # Convert to text
            try:
                text = r.recognize_google(audio_listened)
            except sr.UnknownValueError as e:
                print("Error:", str(e))
            else:
                text = f"{text.capitalize()}. "
                print(chunk_filename, ":", text)
                whole_text += text
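
The loop above only catches sr.UnknownValueError. Because recognize_google talks to a web service, it can also raise sr.RequestError, for example when the connection drops. One way to make the loop more robust is a small retry helper. This is a sketch under my own assumptions: the helper name, the number of attempts, and the delay are not part of the original code.

```python
import time

def recognize_with_retry(recognize, audio, retry_on, attempts=3, delay=1.0):
    """Call recognize(audio), retrying when a retry_on exception is raised.

    recognize: any callable, e.g. r.recognize_google
    retry_on:  exception class (or tuple of classes), e.g. sr.RequestError
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return recognize(audio)
        except retry_on as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay)  # wait before the next attempt
    # All attempts failed: re-raise the last error for the caller to handle
    raise last_error

# Hypothetical usage inside the loop:
# text = recognize_with_retry(r.recognize_google, audio_listened,
#                             retry_on=sr.RequestError)
```

Passing the recognizer function in as an argument keeps the helper independent of any particular speech service.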

Where to Go From Here?

Thanks for reading the whole article! Make sure to join our email academy of ~150,000 coders, and counting. We have plenty of free stuff and coding projects! 🙂

👉 Recommended: Coding Your Own Google Home and Launch Spotify in Python