
How to Create Alexa Skills in Python? A Simple Overview for Beginners.

Alexa is on its way to leading the next wave of technological disruption. Have you ever asked yourself which exact steps are needed to create an Alexa Skill? And how do these steps interact with each other?

To answer these questions, this article gives you a quick overview of Alexa Skill development in Python. It serves as a starting point with the best links to more in-depth resources.

What is an Alexa skill from a technical point of view?

What is the most natural mode of communication for humans? Yes, it’s talking. We love to talk at the coffee corner, as we go for a walk with a friend, while we cook, or shower. No matter what you are doing, your voice is always available as a means of communication. Keyboards, touch screens, and computer mice are not.

Alexa is Amazon’s voice assistant for smart home devices such as the Echo. It serves as a speech-recognizing interface between you and the Internet. The idea is to bridge the gap between you (the information seeker) and the various web services (the information providers).

Let’s say you are talking to the Alexa device. The Alexa device automatically translates your speech into textual data such as “What’s the news?” or “Call Alice!”. As a Python coder, you know this data type as a string.

So far so good. But what should Alexa do with this string data? There are millions of likely strings (experts call this the long tail of natural language). Storing the answers to each of these requests in a database would be incredibly expensive for a single company like Amazon. Even worse, some strings (e.g. “Call Alice!”) are requests for services, not for information.

That’s where Alexa Skills come into play. Each Alexa skill offers one defined set of functionality to the Alexa user. The Alexa skill connects the information intent from the user with the billions of possible backend services.

Here is Amazon’s definition of a skill (in benefit-rich marketing speech):

“You can build capabilities, or skills, that make Alexa smarter and make everyday tasks faster, easier, and more delightful for customers.”

Amazon

Why build an Alexa skill?

Many people believe that Alexa skills will create the next batch of tech millionaires, much like app development did in the last decade. Think about it: talking to computers is the most natural way of communication for humans. And it is massively underdeveloped in today’s world. Billions of people are sitting in front of their screens, day after day. Of course, speech cannot replace every application under the sun. Still, there is a universe of emerging applications that will heavily rely on speech. Those applications will be accessible to billions of people worldwide.

Alexa simplifies “skill development” much like Apple simplified “app development”.

“The App Store was opened on July 10, 2008, with an initial 500 applications available. As of 2017, the store features over 2.1 million apps.”

Wikipedia

So if you develop your iPhone app now, you compete with more than 2 million apps. Many of them have a head start of several years in development, user testing, and marketing.

How many skills are there for Alexa? As of 2018, there were reportedly already 40,000 skills, and the number was growing rapidly (source). While this sounds like heavy competition too, it is kindergarten compared to the competition your iPhone app will face on the App Store.

It is still possible for you to become one of those early creators making millions with simple skills like the voice equivalents of iPhone apps such as the “flashlight” or the “compass”. Don’t lose any more time!

A simple overview of how to build Alexa skills with Python

This graphic gives you an overview of what you have to do from a technical point of view when implementing an Alexa skill. Building an Alexa skill is already hard enough for many new programmers, and the lack of graphical support doesn’t make it better. So, I built this graphic about the data flow and the tasks you have to implement when building an Alexa skill.

It all starts with the user who talks to the Alexa device. With the help of the Alexa Voice Service (AVS) hosted by Amazon in the cloud, Alexa converts the speech to text. Then, it packages this text as a JSON file (JSON is a text-based format for efficiently sending attribute-value pairs over the web) and sends this file to the cloud service that hosts your application.

Your cloud service does four things:

  • Request handling
  • Response building
  • Attribute management
  • Alexa API calls

What is request handling?

You implement your cloud service using AWS Lambda, a service of Amazon Web Services (AWS).

“AWS Lambda is an event-driven, serverless computing platform provided by Amazon as a part of the Amazon Web Services. It is a computing service that runs code in response to events and automatically manages the computing resources required by that code.”

Wikipedia

In other words, your Alexa skill sits in the cloud and waits for users to access it. It’s like a dog waiting for you to throw the stick. While waiting, your web service is in a sleep-like mode: it is not doing any work and not consuming any resources on Amazon’s computing servers.

At some point, it (hopefully) gets a request (in AWS Lambda terminology: an event). AWS Lambda now ensures that within milliseconds, your Alexa skill gets executed by a process that runs your specified functionality.

Part of your Alexa skill functionality is request handling: taking the JSON file sent by the Alexa framework and processing it. The JSON request file contains relevant information for your Alexa skill such as the following:
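To make this less abstract, here is a minimal sketch of what such an event-driven function could look like in Python. The function name lambda_handler is only the common default and can be configured differently, and the reply text is made up for illustration. The event argument is the incoming JSON, already converted to a Python dictionary.

```python
def lambda_handler(event, context):
    """Entry point that AWS Lambda calls once for every incoming event."""
    # `event` holds the request JSON as a Python dict.
    # `context` carries runtime information and is not used in this sketch.
    request_type = event.get("request", {}).get("type", "unknown")

    # Illustrative reply only: echo the request type back to the user.
    speech = f"Received a {request_type}."

    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": True,
        },
    }
```

Nothing runs between two calls. The function body is executed only when an event arrives.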

  • The version (meta information)
  • The session: If the user leads a conversation with the Alexa device, session information is needed so that the conversation can make progress from one turn to the next. For example:
    • User: “Alexa, tell me a joke!”
    • Alexa: “Knock, knock”
    • User: “Who’s there?”
    • Alexa: “Nobody”
  • The context: information about the state of the Alexa service and the device at the time of the request.
  • The request itself: can be a launch request, an intent request, or an audio player request.

Here is what your JSON request file may look like (source):

{
  "version": "1.0",
  "session": {
    "new": true,
    "sessionId": "string",
    "application": {
      "applicationId": "string"
    },
    "attributes": {
      "key": "value"
    },
    "user": {
      "userId": "string"
    }
  },
  "context": {
    "System": {
      "device": {
        "deviceId": "string"
      },
      "application": {
        "applicationId": "string"
      },
      "user": {
        "userId": "string"
      },
      "apiEndpoint": "https://api.amazonalexa.com",
      "apiAccessToken": "string"
    }
  },
  "request": {
    "type": "LaunchRequest",
    "requestId": "string",
    "timestamp": "string",
    "locale": "string"
  }
}
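Request handling then boils down to reading the request part of this JSON and deciding what to do. The following is a rough sketch under a few assumptions: the JSON has already been parsed into a Python dictionary, the function name handle_request is my own, and TellJokeIntent is a hypothetical intent that you would define yourself in your skill. The request types LaunchRequest, IntentRequest, and SessionEndedRequest are the standard ones.

```python
def handle_request(event):
    """Return the text Alexa should speak for a parsed request dict."""
    request = event.get("request", {})
    request_type = request.get("type")

    if request_type == "LaunchRequest":
        # The user opened the skill without asking for anything specific.
        return "Welcome! Ask me for a joke."

    if request_type == "IntentRequest":
        intent_name = request.get("intent", {}).get("name", "")
        if intent_name == "TellJokeIntent":  # hypothetical custom intent
            return "Knock, knock."
        return f"Sorry, I cannot handle {intent_name} yet."

    if request_type == "SessionEndedRequest":
        # The session is over, so there is nothing left to say.
        return ""

    return "Sorry, I did not understand that."


sample_event = {
    "version": "1.0",
    "session": {"new": True, "attributes": {}},
    "context": {},
    "request": {"type": "IntentRequest", "intent": {"name": "TellJokeIntent"}},
}
print(handle_request(sample_event))
```

A real skill would usually have one handler per intent. The chain of if statements just keeps the sketch short.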

What is response building?

OK, so your user can now send requests to your web service. If there are requests, there have to be responses as well.

Let’s say your web service does its magic and determines the best response to the user’s request. What exactly does that look like?

The response is a JSON file as well. It contains relevant information such as the following:

  • The output speech: you want Alexa to talk to the user, right? The output speech is the text that is translated into speech by the Alexa framework. This is the most important piece of information in the response file.
  • Image URL: you can also give back images (e.g. “Alexa, what will I look like in 20 years?”. Oh man, I would like to know THAT…).
  • Meta information such as size or version parameters.

Here is what a JSON response will look like (source):

{
  "version": "string",
  "sessionAttributes": {
    "key": "value"
  },
  "response": {
    "outputSpeech": {
      "type": "PlainText",
      "text": "Plain text string to speak",
      "ssml": "<speak>SSML text string to speak</speak>",
      "playBehavior": "REPLACE_ENQUEUED"      
    },
    "card": {
      "type": "Standard",
      "title": "Title of the card",
      "content": "Content of a simple card",
      "text": "Text content for a standard card",
      "image": {
        "smallImageUrl": "https://url-to-small-card-image...",
        "largeImageUrl": "https://url-to-large-card-image..."
      }
    },
    "reprompt": {
      "outputSpeech": {
        "type": "PlainText",
        "text": "Plain text string to speak",
        "ssml": "<speak>SSML text string to speak</speak>",
        "playBehavior": "REPLACE_ENQUEUED"             
      }
    },
    "directives": [
      {
        "type": "InterfaceName.Directive"
        (...properties depend on the directive type)
      }
    ],
    "shouldEndSession": true
  }
}
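You rarely need all of these fields at once. Here is a small helper that builds a response with only the most common parts: the output speech, the session attributes, an optional simple card, and the flag that ends the session. The helper name build_response and its parameters are my own choice and not part of any Amazon library.

```python
import json


def build_response(speech_text, session_attributes=None,
                   end_session=True, card_title=None):
    """Assemble a minimal Alexa response as a Python dict."""
    response = {
        "outputSpeech": {"type": "PlainText", "text": speech_text},
        "shouldEndSession": end_session,
    }
    if card_title is not None:
        # A "Simple" card shows a title and plain text in the Alexa app.
        response["card"] = {
            "type": "Simple",
            "title": card_title,
            "content": speech_text,
        }
    return {
        "version": "1.0",
        "sessionAttributes": session_attributes or {},
        "response": response,
    }


# Keep the session open so that the user can answer "Who's there?".
print(json.dumps(build_response("Knock, knock.", end_session=False), indent=2))
```

Returning this dictionary from your Lambda function is enough. It is converted back to JSON before it reaches the Alexa framework.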

What is attribute management?

In the above request example, we have already touched on the need for sessions. If you want to allow the user to lead a conversation with your Alexa skill, attribute management is a must.

Here is an example from (source):

  • Customer: Alexa, ask space facts to tell me a fact <Session Starts>
  • Alexa: The Sun contains 99.86% of the mass in the Solar System. Would you like another fact?
  • Customer: Yes
  • Alexa: Jupiter has the shortest day of all the planets. Would you like another fact?
  • Customer: No (AMAZON.StopIntent)
  • Alexa: Goodbye <Session Ends>

I have already told you that you will use AWS Lambda functions to host your web service. The problem is that Lambda functions are executed on demand, in response to events. The variables that your code created in one invocation are not reliably available in the next invocation of your service. In other words, your Alexa skill will start doing weird things in its conversations: it forgets everything you tell it.

  • Customer: Alexa, ask space facts to tell me a fact <Session Starts>
  • Alexa: The Sun contains 99.86% of the mass in the Solar System. Would you like another fact?
  • Customer: Yes
  • Alexa: The Sun contains 99.86% of the mass in the Solar System. Would you like another fact?
  • Customer: YOU STUPID $&(/Z!! (throws device out of the window) (AMAZON.StopIntent)<Session Ends>

So how can you keep information across different invocations of the AWS Lambda function?

Use sessions in your implementation of an Alexa skill. You can store session attributes in the attribute management module provided by Amazon.

In subsequent executions, you can access the values stored in your session attributes (e.g. the jokes or facts that you have already returned in this session).
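Here is a sketch of the idea for the space facts conversation, written directly against the JSON instead of Amazon’s attribute management module. It assumes that the attributes you return under sessionAttributes come back under session.attributes in the next request of the same session. The key told_facts and the function name next_fact_response are made up for this example.

```python
FACTS = [
    "The Sun contains 99.86% of the mass in the Solar System.",
    "Jupiter has the shortest day of all the planets.",
]


def next_fact_response(event):
    """Tell a fact that has not been told yet in this session."""
    # Attributes may be missing or null at the start of a session.
    attributes = event.get("session", {}).get("attributes") or {}
    told = attributes.get("told_facts", [])  # hypothetical attribute key

    remaining = [i for i in range(len(FACTS)) if i not in told]
    if remaining:
        index = remaining[0]
        told = told + [index]
        speech = FACTS[index] + " Would you like another fact?"
        end_session = False
    else:
        speech = "That is all I know. Goodbye."
        end_session = True

    return {
        "version": "1.0",
        # Send the updated memory back so the next invocation can read it.
        "sessionAttributes": {"told_facts": told},
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": end_session,
        },
    }
```

The function itself keeps no state between calls. Everything it remembers travels back and forth inside the request and the response.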

What are Alexa API calls (and why do you need them)?

Alexa offers not one but many APIs (application programming interfaces).

Short recap for the newbies reading this article: an application programming interface (API) is a set of specified functionality. The API shows you how to talk with a service. Being a programmer, you want to include complex functionality in your program without implementing it yourself. For example, you can embed a Google Maps script to show an interactive street map on your website. APIs help you to use existing code bases in order to “stand on the shoulders of giants”.

There are two important classes of APIs for Alexa:

  • The Alexa Skills Kit (ASK) helps you to develop your own Alexa Skills. Read more here.
  • The Alexa Voice Service (AVS) helps you to include Alexa in your own applications. Read more here.

Alexa Skills Kit (ASK): If you want to develop your own Alexa Skills, you will use the ASK often. An example of an API call for ASK is speech-to-text translation. The Alexa API will provide you with the JSON file format as explained before. To develop your own skill, you will create a stand-alone application running on an independent server. You have to use the API calls provided by the Alexa Skills Kit to connect your server with the Alexa device. There is no other way.

Alexa Voice Service (AVS): If you want to integrate the Alexa service in your own device or application, you will use the AVS. I think about it this way: you have most likely already integrated Google Maps in your website. Similarly, you will integrate Alexa in your smart home device to enhance its expressive power.

Where to go from here?

  1. Learn Python: It is critical for your success in creating your own Alexa skills that you have a solid Python foundation. To this end, I have created a new learning system where you consume educative Python emails in your coffee break. It’s learning on autopilot. Join my Python email course and get the Python book. It’s 100% free! (And you can unsubscribe any time.)
  2. Create your own Alexa Skill following Amazon’s step-by-step tutorials.
  3. Write me an email (admin@finxter.com) and tell me your biggest coding struggle.