I Read Google’s SoundStorm Paper


Listen to this insane conversation published on Google’s SoundStorm GitHub page:

A male and a female speaker hold a conversation. Only at the end does it become apparent that they are actually neither male nor female: they are a bot called SoundStorm (PDF)!

SoundStorm is a machine learning model that generates audio files. It is non-autoregressive.

“Non-autoregressive approaches aim to improve the inference speed of translation models by only requiring a single forward pass to generate the output sequence instead of iteratively producing each predicted token.” (Apple Machine Learning)

Generating the whole output in parallel, instead of producing it token by token, makes it really fast.
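To make the difference concrete, here is a toy sketch in Python. `CountingModel` is a hypothetical stand-in that only counts forward passes and emits random tokens. It is not SoundStorm's architecture, just an illustration of decoding one token at a time versus predicting every position in a single call.

```python
import random


class CountingModel:
    """Hypothetical stand-in for a neural network; counts forward passes."""

    def __init__(self, vocab_size=8, seed=0):
        self.vocab_size = vocab_size
        self.rng = random.Random(seed)
        self.forward_passes = 0

    def forward(self, context, num_positions):
        # One call = one forward pass, however many positions it predicts.
        self.forward_passes += 1
        return [self.rng.randrange(self.vocab_size) for _ in range(num_positions)]


def decode_autoregressive(model, length):
    """Predict one token per forward pass, feeding earlier tokens back in."""
    tokens = []
    for _ in range(length):
        tokens.extend(model.forward(tokens, 1))
    return tokens


def decode_parallel(model, length):
    """Predict all positions in a single forward pass."""
    return model.forward([], length)


ar_model, par_model = CountingModel(), CountingModel()
decode_autoregressive(ar_model, 100)
decode_parallel(par_model, 100)
print(ar_model.forward_passes, par_model.forward_passes)  # 100 1
```

For a 100-token output, the autoregressive loop needs 100 passes while the parallel decoder needs one, which is where the speedup in the quote comes from.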

Blazingly fast! 🚀

In fact, Google Research highlights that “When synthesizing dialogue segments of 30 seconds, we measured a runtime of 2 seconds on a single TPU-v4”. (source)
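As a quick sanity check on that figure, the speedup over real-time playback is simply the audio duration divided by the runtime. The helper name `real_time_speedup` is made up for this illustration:

```python
def real_time_speedup(audio_seconds, runtime_seconds):
    """How many times faster than real-time playback the synthesis ran."""
    return audio_seconds / runtime_seconds


# Figures quoted above: 30 seconds of dialogue synthesized in 2 seconds.
print(real_time_speedup(30, 2))  # 15.0
```

In other words, the quoted measurement corresponds to generating audio about 15 times faster than it takes to listen to it.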

💡 Note: TPU stands for Tensor Processing Unit. You can think of it as a processor like a CPU, only less general-purpose and specialized for machine learning workloads.

Example Prompt

For example, Google researchers gave it the following dialogue prompt:

Where did you go last summer? | I went to Greece, it was amazing. | Oh, that's great. I've always wanted to go to Greece. What was your favorite part? | Uh it's hard to choose just one favorite part, but yeah I really loved the food. The seafood was especially delicious. | yeah | And the beaches were incredible. | uhhuh | We spent a lot of time swimming, uh sunbathing, and and exploring the islands. | Oh that sounds like a perfect vacation! I'm so jealous. | It was definitely a trip I'll never forget | I really hope I'll get to visit someday!
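The prompt is plain text with `|` marking each change of speaker. A minimal sketch of how you might split such a prompt into turns is shown below. It assumes the two speakers strictly alternate, which holds for this example, and the function name `split_turns` and the speaker labels are made up for illustration:

```python
def split_turns(prompt, speakers=("Speaker A", "Speaker B")):
    """Split a '|'-separated dialogue into (speaker, text) pairs.

    Assumption: the speakers alternate at every '|'.
    """
    turns = [part.strip() for part in prompt.split("|") if part.strip()]
    return [(speakers[i % len(speakers)], text) for i, text in enumerate(turns)]


sample = "Where did you go last summer? | I went to Greece, it was amazing. | Oh, that's great."
for speaker, text in split_turns(sample):
    print(f"{speaker}: {text}")
```

Short backchannel turns such as "yeah" and "uhhuh" become their own entries, which is part of what makes the generated conversation sound natural.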

The impressive output generated by the model (source):

Now think about this for a moment. You could create a simple pipeline like this:

  1. Generate dialogues with ChatGPT or the OpenAI API
  2. Feed the dialogues into the SoundStorm model
  3. Upload to a podcasting platform
  4. Repeat!
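The steps above could be wired together roughly like this. This is only a sketch of the control flow: every function name here is hypothetical, and the three `fake_*` stubs are placeholders that do not call ChatGPT, SoundStorm, or any podcasting service.

```python
def run_pipeline(topics, generate_dialogue, synthesize_audio, upload):
    """Run the three steps once per topic and collect the upload results."""
    results = []
    for topic in topics:                      # 4. repeat for every topic
        script = generate_dialogue(topic)     # 1. generate a dialogue
        audio = synthesize_audio(script)      # 2. turn the dialogue into audio
        results.append(upload(topic, audio))  # 3. upload the episode
    return results


# Placeholder stubs so the sketch runs; none of these call a real service.
def fake_generate_dialogue(topic):
    return f"What do you think about {topic}? | I find it fascinating."


def fake_synthesize_audio(script):
    return script.encode("utf-8")  # pretend these bytes are audio


def fake_upload(topic, audio):
    return f"uploaded '{topic}' ({len(audio)} bytes)"


episodes = run_pipeline(
    ["Greece"], fake_generate_dialogue, fake_synthesize_audio, fake_upload
)
print(episodes)
```

Passing the three steps in as functions keeps the loop independent of any particular service, so each stub could be swapped for a real implementation if one becomes available.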

And most listeners probably wouldn't even notice the difference!

But there are many more applications, such as replacing human audiobook narrators (yet another job description that will be disrupted soon!), creating truly accessible web apps with human-sounding readers, and rapid prototyping for movies and (YouTube) videos.

The race for our collective attention during walks, drives, and cleaning up our kitchens has officially reached the next stage!

💡 Recommended: OpenAI's Speech-to-Text API: A Comprehensive Guide