Listen to this insane conversation published on Google’s SoundStorm GitHub page:
A male and a female speaker lead a conversation. Only at the end does it become apparent that they are actually neither male nor female. Both voices were generated by a model called SoundStorm (PDF)!

SoundStorm is a machine learning model that generates audio files. It is non-autoregressive.
“Non-autoregressive approaches aim to improve the inference speed of translation models by only requiring a single forward pass to generate the output sequence instead of iteratively producing each predicted token.” (Apple Machine Learning)
SoundStorm applies the same idea to audio tokens: it fills in many tokens in parallel over a small number of passes instead of producing them one at a time, which makes it really fast.
Blazingly fast!
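To make the difference concrete, here is a toy sketch that only counts forward passes. It is not SoundStorm's architecture: `CountingModel`, `MASK`, and both decode functions are made-up names for illustration, and the "model" simply predicts each token's position index.

```python
from typing import List

MASK = -1  # hypothetical placeholder for a position that is not generated yet


class CountingModel:
    """Toy stand-in for a neural network that counts its forward passes."""

    def __init__(self) -> None:
        self.forward_passes = 0

    def forward(self, tokens: List[int]) -> List[int]:
        # Pretend prediction: the token at each position is its index.
        self.forward_passes += 1
        return list(range(len(tokens)))


def autoregressive_decode(model: CountingModel, length: int) -> List[int]:
    """Generate one token per forward pass, left to right."""
    tokens: List[int] = []
    for _ in range(length):
        predictions = model.forward(tokens + [MASK])
        tokens.append(predictions[-1])  # keep only the newest token
    return tokens


def parallel_decode(model: CountingModel, length: int) -> List[int]:
    """Fill every masked position in a single forward pass."""
    return model.forward([MASK] * length)


ar_model, par_model = CountingModel(), CountingModel()
autoregressive_decode(ar_model, 100)
parallel_decode(par_model, 100)
print(ar_model.forward_passes, par_model.forward_passes)  # 100 1
```

Both functions return the same sequence here. The only thing that changes is how often the model has to run, and that is where the speedup comes from.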
In fact, Google Research highlights that “When synthesizing dialogue segments of 30 seconds, we measured a runtime of 2 seconds on a single TPU-v4”. (source)
💡 Note: TPU stands for Tensor Processing Unit. You can think of it as a processor like a CPU, only less general-purpose and more specialized for machine learning applications.
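The runtime figure quoted above is easier to appreciate as a real-time factor, that is, compute time divided by audio length. The short calculation below uses only the two numbers from the quote. The function name `real_time_factor` is just a label chosen for this sketch.

```python
def real_time_factor(runtime_seconds: float, audio_seconds: float) -> float:
    """Compute time needed per second of generated audio."""
    return runtime_seconds / audio_seconds


audio_seconds = 30.0   # length of the dialogue segment from the quote
runtime_seconds = 2.0  # reported runtime on a single TPU-v4

rtf = real_time_factor(runtime_seconds, audio_seconds)
speedup = audio_seconds / runtime_seconds

print(f"real-time factor: {rtf:.3f}")               # real-time factor: 0.067
print(f"{speedup:.0f}x faster than real time")      # 15x faster than real time
```

In other words, the model produces audio about 15 times faster than you could listen to it.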
Example Prompt
For example, Google researchers gave it the following dialogue prompt:
Where did you go last summer? | I went to Greece, it was amazing. | Oh, that's great. I've always wanted to go to Greece. What was your favorite part? | Uh it's hard to choose just one favorite part, but yeah I really loved the food. The seafood was especially delicious. | yeah | And the beaches were incredible. | uhhuh | We spent a lot of time swimming, uh sunbathing, and and exploring the islands. | Oh that sounds like a perfect vacation! I'm so jealous. | It was definitely a trip I'll never forget | I really hope I'll get to visit someday!
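If you want to work with a prompt like this in code, you can split it into speaker turns. The sketch below assumes that each `|` marks a change of speaker and that the two speakers strictly alternate, which matches the prompt above but is not documented as a formal input format. The helper name `split_turns` and the speaker labels `A` and `B` are made up for this example.

```python
from typing import List, Sequence, Tuple


def split_turns(prompt: str, speakers: Sequence[str] = ("A", "B")) -> List[Tuple[str, str]]:
    """Split a '|'-separated dialogue into (speaker, text) pairs.

    Assumption: speakers alternate in the order given by `speakers`.
    """
    segments = [segment.strip() for segment in prompt.split("|")]
    segments = [segment for segment in segments if segment]  # drop empty pieces
    return [
        (speakers[index % len(speakers)], text)
        for index, text in enumerate(segments)
    ]


prompt = "Where did you go last summer? | I went to Greece, it was amazing. | Oh, that's great."
for speaker, text in split_turns(prompt):
    print(f"{speaker}: {text}")
```

Short interjections such as "yeah" and "uhhuh" become turns of their own under this rule, which is how they appear in the prompt.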
The model turned this prompt into an impressively natural audio clip (source).
Now think about this for a moment. You could create a simple pipeline like this:
- Step 1: Generate dialogues with ChatGPT or OpenAI API
- Step 2: Feed the dialogues into the SoundStorm model
- Step 3: Upload to a podcasting platform
- Repeat!
And most listeners probably wouldn’t even notice the difference!
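The three steps above can be sketched as a small loop. Everything here is hypothetical: the source does not describe a public SoundStorm API, so `synthesize_audio` is a placeholder you would have to replace with a real text-to-speech backend, and `generate_dialogue` and `publish_episode` are stand-ins for an LLM call and a podcast upload. The stub functions exist only so the sketch runs end to end.

```python
from typing import Callable, Iterable, List


def run_pipeline(
    topics: Iterable[str],
    generate_dialogue: Callable[[str], str],
    synthesize_audio: Callable[[str], bytes],
    publish_episode: Callable[[str, bytes], str],
) -> List[str]:
    """Run steps 1 to 3 once per topic and collect the published episode IDs."""
    published = []
    for topic in topics:
        dialogue = generate_dialogue(topic)              # Step 1: text dialogue
        audio = synthesize_audio(dialogue)               # Step 2: dialogue to audio
        published.append(publish_episode(topic, audio))  # Step 3: upload
    return published


# Stubs for illustration only. None of these call a real service.
def fake_generate_dialogue(topic: str) -> str:
    return f"Tell me about {topic}. | Sure, {topic} is fascinating."


def fake_synthesize_audio(dialogue: str) -> bytes:
    return dialogue.encode("utf-8")  # pretend these bytes are audio


def fake_publish_episode(title: str, audio: bytes) -> str:
    return f"episode:{title}:{len(audio)}-bytes"


episodes = run_pipeline(
    ["Greece", "Python"],
    fake_generate_dialogue,
    fake_synthesize_audio,
    fake_publish_episode,
)
print(episodes)
```

Passing the three steps in as functions keeps the loop independent of any particular provider, so you can swap in real services one step at a time.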

But there are many more applications, such as replacing human narrators of audiobooks (yet another job that will be disrupted soon!), creating truly accessible web apps with human-sounding voice readers, and rapid prototyping for movies and (YouTube) videos.
The race for our collective attention during walks, drives, and cleaning up our kitchens has officially reached the next stage!

💡 Recommended: OpenAI’s Speech-to-Text API: A Comprehensive Guide

While working as a researcher in distributed systems, Dr. Christian Mayer found his love for teaching computer science students.
To help students reach higher levels of Python success, he founded the programming education website Finxter.com that has taught exponential skills to millions of coders worldwide. He’s the author of the best-selling programming books Python One-Liners (NoStarch 2020), The Art of Clean Code (NoStarch 2022), and The Book of Dash (NoStarch 2022). Chris also coauthored the Coffee Break Python series of self-published books. He’s a computer science enthusiast, freelancer, and owner of one of the top 10 largest Python blogs worldwide.
His passions are writing, reading, and coding. But his greatest passion is to serve aspiring coders through Finxter and help them to boost their skills.