Coding Your Own Google Home and Launching Spotify in Python

Doesn’t this project sound exciting?

Project Goal

Project goal: code your own Google Home with Python and learn how to use speech recognition to launch Spotify and play songs!

Ever wanted to code a powerful yet simple tool that is more bespoke than mainstream devices?

We will learn how to implement it in Python with a bunch of powerful libraries!

Breaking down our problem, there are three tasks ahead of us:

  • processing speech and converting it into text
  • opening a process (here, the Spotify app) based on some string condition in the text
  • interacting with the process

Performing Speech Recognition

Don’t do the heavy lifting yourself!

Speech recognition is the ability to detect and identify words and phrases in spoken language and subsequently convert them into human-readable text.

This field can be very complex, and the top Python libraries are the result of decades of hard work by experts. We will obviously not be building such a library from A to Z; that would be way beyond this tutorial. Instead, we will be using the SpeechRecognition library.

Therefore, we don’t need to build any machine learning model from scratch: this library provides us with wrappers for several well-known public speech recognition APIs (such as Google Cloud Speech API, IBM Speech To Text, etc.).

As usual, we will start by installing the module:

pip install SpeechRecognition pydub

Then, in a new Python file, you can import it the following way:

import speech_recognition as sr

It is now very handy: you have access to several speech recognition engines, each with different use cases.

In this tutorial, we will be using Google Speech Recognition, because it is rather simple to use, efficient, and does not require any API key.

Interpreting Speech from a File

Regular-Sized Audio Files

Before starting, make sure you have placed an audio file containing English speech in the current working directory (for maximum simplicity), or somewhere whose path you know (such as ‘../audio_files/my_audio_file.wav’).

The first step is to initialize your recognizer like so:

# initialize the recognizer
r = sr.Recognizer()

The code below is then responsible for loading the audio file from the designated path and converting the speech into text using Google Speech Recognition:

# open the file
with sr.AudioFile(path_to_audio_file) as source:
    # listen to the data ( = load audio to memory)
    audio_data = r.record(source)
    # recognize (convert from speech to text)
    text = r.recognize_google(audio_data)
    print(text)

This could take a little while. On the other hand, don’t assume the duration of the code execution is in any way related to the speed of human speech: you will commonly see your code spit out the full text in less time than it would take to play the audio file back!

Okay, this kind of script works fine for small to medium-sized audio files, but not so well for larger ones.

Large Audio Files

I won’t go into too much detail here, as our goal is to launch Spotify via voice command, remember? That means we’ll be using the mic.

However, if you do need to convert the content of large audio files, then you should be looking into the pydub library, more specifically its AudioSegment class and split_on_silence function.

Why?

Because, equipped with these two, you will be able to load the audio data and then chunk it based on a preset silence duration detected in the data.

This comes in handy for splitting your audio file.
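To give you a feel for how this looks in practice, here is a minimal sketch, assuming a WAV file named large_audio_file.wav (a placeholder) and silence thresholds that you would tune to your own recording:

from pydub import AudioSegment
from pydub.silence import split_on_silence

# load the large audio file (the file name is a placeholder)
sound = AudioSegment.from_wav("large_audio_file.wav")

# cut the audio wherever there is at least 700 ms of silence quieter than -40 dBFS
chunks = split_on_silence(sound, min_silence_len=700, silence_thresh=-40)

# export each chunk so it can be transcribed separately
for i, chunk in enumerate(chunks):
    chunk.export(f"chunk_{i}.wav", format="wav")

Each exported chunk can then be fed to the recognizer exactly like the single file above.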

👉 Recommended: How to Recognize Speech From Large Audio Files?

Interpreting Speech from the Mic

We are now getting to the core of the tutorial! We will be processing the audio input directly from the mic, moving one step closer to being able to actually make voice commands.

To start with, this requires PyAudio to be installed on your machine, and the installation procedure varies depending on your OS:

Windows

pip install pyaudio

Linux

You need to first install the dependencies:

sudo apt-get install python-pyaudio python3-pyaudio
pip install pyaudio

MacOS

You need to first install portaudio:

brew install portaudio
pip install pyaudio

Warning: you may experience issues installing the module properly, especially on Windows.

For Windows users, if the above doesn’t work, try:

pip install pipwin  # if you don’t already have it
pipwin install pyaudio
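Whichever route you took, a quick sanity check that PyAudio is usable is to list the input devices the SpeechRecognition library can see:

import speech_recognition as sr

# if PyAudio is installed correctly, this prints the names of the available microphones
print(sr.Microphone.list_microphone_names())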

Now we’re ready to start building our Spotify launcher!

with sr.Microphone() as source:
    # read the audio data from the microphone
    audio_data = r.record(source, duration=5)
    print("Analyzing...")
    # convert speech to text
    text = r.recognize_google(audio_data)
    print(text)

This piece of code will open the (default) mic, read the input for 5 seconds (you can obviously tailor this argument), then (try to) convert it, and finally print out the output.

Obviously it still isn’t perfect; for example, it usually struggles with homophonous phrases or words.
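Also, if the engine cannot make sense of the audio at all, recognize_google raises an exception instead of returning text. Here is a minimal sketch of how you could catch this, reusing the recognizer r and audio_data from above:

try:
    text = r.recognize_google(audio_data)
    print(text)
except sr.UnknownValueError:
    # the engine could not understand the audio
    print("Sorry, I could not understand that.")
except sr.RequestError as e:
    # the API was unreachable or returned an error
    print(f"Could not request results: {e}")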

Two arguments are worth mentioning at this point (see the sketch after this list):

  • offset: passed to the record function, it is used to start recording after some delay (default 0)
  • language: passed to the recognize_google function, it changes the target language (e.g. “fr-FR”). More info about supported languages here
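Here is a short sketch combining both arguments, reusing the recognizer r from above (the values are just examples):

with sr.Microphone() as source:
    # skip the first second of input, then record for 5 seconds
    audio_data = r.record(source, offset=1, duration=5)
    # transcribe assuming French speech
    text = r.recognize_google(audio_data, language="fr-FR")
    print(text)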

Opening Processes with Python

Now that we can talk to our speech recognizer and convert speech into text, let’s head towards our second task: opening processes.

As often, there are multiple ways to do this.

We will use the subprocess module, which is built-in.

In particular, we will be using the Popen class (the P stands for process) from this module, like so:

# import the module
import subprocess

subprocess.Popen(path_of_the_executable)

For example, on my machine, opening the Spotify desktop app would be done like so:

subprocess.Popen('C:\\Users\\cleme\\AppData\\Roaming\\Spotify\\Spotify.exe')

Depending on your OS, you may need to adjust the slashes in the path to make sure it is understood correctly. You may want to use a function from the built-in os module to do this for you.
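For instance, os.path.join assembles a path with the separator appropriate for the current OS. A small sketch, reusing my example path (the user folder is obviously specific to my machine):

import os
import subprocess

# build the path with the separator appropriate for the current OS
spotify_path = os.path.join("C:\\", "Users", "cleme", "AppData", "Roaming", "Spotify", "Spotify.exe")
subprocess.Popen(spotify_path)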

Of course, as always feel free to dive deeper into this module, but for now we have what we need to trigger the opening of our Spotify (desktop) app.

Interacting with Windows

OK, let us sum up:

  • we know how to convert speech to text
  • we know how to open processes

From there, we can easily create a condition on the output text from the speech conversion; for example:

if "spotify" in text.lower():
    subprocess.Popen('C:\\Users\\cleme\\AppData\\Roaming\\Spotify\\Spotify.exe')

What is there left to do?

Now that our Spotify app is open on voice command, we need to somehow be able to launch a song.

To do this, dedicated Spotify modules may exist, but we will be using a powerful general-purpose module:

pyautogui

This module is essentially about automating mouse, windows, and keyboard actions!

So, what we’ll do is identify the Spotify app search bar location, click it, clear it if needed, type a song or artist name in it, then press Enter, then press Play, and we’re done!

pip install pyautogui

The first thing is to make sure we are dealing with the Spotify app window.

To do this we’ll loop over pyautogui.getAllWindows(), which yields all the currently open windows, and use an if statement on each window’s title to select the Spotify window.

We will then proceed to the subtasks identified above.

We will use a convention here in our voice command: for the sake of simplicity, we will assume that the name of the wanted artist comes last in the voice command (e.g.: “Please open Spotify and play Madonna”).

Of course this is a dummy example, but you can easily improve the voice command and make it more flexible.

Here is what it looks like:

    for window in pyautogui.getAllWindows():
        if 'spotify' in window.title.lower():
            window.show()
            print('spotify window activated')
            text = text.split()  # break down text into list of single words strings, for later usage
            time.sleep(5.5)
            pyautogui.click(x=480,y=25) # this is the search bar location on my machine, when the window is maximized
            time.sleep(1)
            pyautogui.hotkey('ctrl','a') # clearing the search bar
            pyautogui.press('backspace') # clearing the search bar
            time.sleep(0.5)
            pyautogui.write(text[-1],interval=0.05) # because we assumed that the artist was the last word of the voice command
            time.sleep(3)
            pyautogui.click(x=380,y=250) # this is the play button location on my machine, when the window is maximized
            break

Breaking down this piece of code, we performed sequentially all the steps we identified as mandatory. 

Note the difference between the hotkey (keystroke combination) and press (single key, down then up) methods.

We used Ctrl+a to select all potential text in the search bar, then we removed it before typing our artist’s name. The text[-1] bit refers to the last word of our voice command, see the convention described above.

Please note the interval argument inside the write method: in some cases it is vital for our script to function properly. Why?

Because it is the argument that sets the typing speed, and in some instances, pyautogui just goes too fast for the app, which ends up producing an unwanted result.

You might need to manually fine-tune this argument, like I did before settling on 0.05. In the same vein, the time.sleep() statements are there to make sure our code does not get too far ahead of the app; for example, they give the app time to open properly. This can involve some manual trial and error.
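If you are wondering how to find the right coordinates for your own machine, one simple option is pyautogui’s position function: run a small helper like the one below, hover the mouse over the search bar (and then over the play button), and note the values it prints:

import time
import pyautogui

# hover the mouse over the element you want to click (search bar, play button, ...)
# and note the coordinates printed once per second
for _ in range(10):
    print(pyautogui.position())
    time.sleep(1)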

Last, the break statement is there to make sure we go out of the for loop once we have found our app. Let’s not waste time checking useless windows!

Alright, we’re almost there, I can hear the song now!

Now you may wonder, what if we need to stop the song from playing?

Well, we’ll take care of just that in the below piece of code:

while True:
    try:
        with sr.Microphone() as source:
            # read the audio data from the default microphone
            audio_data = r.record(source, duration=1)
            # convert speech to text
            text = r.recognize_google(audio_data)
            if 'stop' in text.lower():
                pyautogui.click(x=955, y=1000)
                break
    except Exception as e:
        print(f"Encountered error: {e}\n")
        continue

The while True loop is there to keep looping until it hears ‘stop’ (again, you can obviously tailor this criterion).

If ‘stop’ is heard and correctly decoded, then pyautogui clicks the pause button for us. (Do not hesitate to look into parameters that improve mic detection when there is noise around; here, our song playing.)
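One option worth trying is the recognizer’s adjust_for_ambient_noise method, which calibrates the energy threshold against the background noise before listening. A minimal sketch:

with sr.Microphone() as source:
    # sample one second of background noise (our playing song) to calibrate the recognizer
    r.adjust_for_ambient_noise(source, duration=1)
    # then listen and convert as before
    audio_data = r.record(source, duration=1)
    text = r.recognize_google(audio_data)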

Thanks to the try/except clause, we are able to keep the program running without being stopped by potential errors along the way, while still printing them if they appear, for later debugging.

Combining Everything

Want to see my full code? Here it is, below:

import pyautogui, subprocess, os, time
import speech_recognition as sr
 
# initialize the recognizer
r = sr.Recognizer()
 
with sr.Microphone() as source:
    # read the audio data from the default microphone
    audio_data = r.record(source, duration=3)
    print("Recognizing...")
    # convert speech to text
    text = r.recognize_google(audio_data)
    print(f"I think you said: '{text}'\nhmmm, let's see what I can do for you.")
 
if "spotify" in text.lower():
    subprocess.Popen('C:\\Users\\cleme\\AppData\\Roaming\\Spotify\\Spotify.exe')
    for window in pyautogui.getAllWindows():
        if 'spotify' in window.title.lower():
            window.show()
            print('spotify window activated')
            text = text.split()  # break down text list into single words for later usage
            time.sleep(5.5)
            pyautogui.click(x=480,y=25) # this is the search bar location on my machine, when the window is maximized
            time.sleep(1)
            pyautogui.hotkey('ctrl','a') # clearing the search bar
            pyautogui.press('backspace') # clearing the search bar
            time.sleep(0.5)
            pyautogui.write(text[-1],interval=0.05) # because we assumed that the artist was the last word of the voice command
            time.sleep(3)
            pyautogui.click(x=380,y=250) # this is the play button location on my machine, when the window is maximized
            break
    while True:
        try:
            with sr.Microphone() as source:
                # read the audio data from the default microphone
                audio_data = r.record(source, duration=1)
                # convert speech to text
                text = r.recognize_google(audio_data)
                if 'stop' in text.lower():
                    pyautogui.click(x=955 ,y=1000)
                    break
        except Exception as e:
            print(f"Encountered error: {e}\n")
            continue
    

There is room for improvement, but it works. You can be proud of yourself if you manage to code your own music launcher!

Thank you guys! That’s all for today.

Where to go from here?

  • Create a voice command to change the sound level, or go to the next song
  • Combine this script with Selenium to play music from the Internet instead
  • Use machine learning for smarter interpretation of the voice command
  • Convert text to speech
  • Schedule tasks with Python
  • In relation to the above item, look into cloud solutions to run code 24/7