NVIDIA, the semiconductor manufacturer and the world’s most valuable company, today shared a preview of Fugatto, an AI-powered audio tool that it describes as “the World’s Most Flexible Sound Machine”.
Fugatto is intended to be a sort of Swiss Army Knife for audio, letting you generate or transform any mix of music, voices and sounds using just text prompts.
“Fugatto is our first step toward a future where unsupervised multitask learning in audio synthesis and transformation emerges from data and model scale,” says composer & NVIDIA researcher Rafael Valle.
Here’s the official teaser video:
As with earlier generative audio demos, many of the audio examples in the promo seem primitive. On the other hand, this is the first generative AI demo we’ve seen that also showcases the tool being used in interesting creative ways.
For example, the video demonstrates how you can use text prompts with Fugatto to extract vocals from a mix, morph one sound into another, generate realistic speech, remix existing audio, and convert MIDI melodies into realistic vocal samples. These are capabilities that could genuinely complement and extend the current generation of digital audio workstations.
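To make that concrete, here’s a purely hypothetical sketch of what a prompt-driven workflow like this might look like from a script. Fugatto has no public API at the time of writing, so the `model.transform(...)` call below is an invented placeholder, not a real interface:

```python
# Purely illustrative sketch -- Fugatto has no public API at the time of writing,
# so the `model.transform(...)` call below is an invented placeholder.
import soundfile as sf  # widely used library for reading/writing audio files


def transform_audio(model, input_path, prompt, output_path):
    """Apply a text-prompted transformation to an audio file (hypothetical workflow)."""
    audio, sample_rate = sf.read(input_path)              # load the source audio
    result = model.transform(audio, sample_rate, prompt)  # imagined prompt-driven call
    sf.write(output_path, result, sample_rate)            # write the transformed audio


# Prompts along the lines of what the demo video shows:
# transform_audio(model, "mix.wav", "isolate the lead vocal", "vocal_stem.wav")
# transform_audio(model, "train.wav", "morph a passing train into an orchestra", "morph.wav")
```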
Here’s what they have to say about the technology behind Fugatto:
“Fugatto is a foundational generative transformer model that builds on the team’s prior work in areas such as speech modeling, audio vocoding and audio understanding.
The full version uses 2.5 billion parameters and was trained on a bank of NVIDIA DGX systems packing 32 NVIDIA H100 Tensor Core GPUs.
Fugatto was made by a diverse group of people from around the world, including India, Brazil, China, Jordan and South Korea. Their collaboration made Fugatto’s multi-accent and multilingual capabilities stronger.
One of the hardest parts of the effort was generating a blended dataset that contains millions of audio samples used for training. The team employed a multifaceted strategy to generate data and instructions that considerably expanded the range of tasks the model could perform, while achieving more accurate performance and enabling new tasks without requiring additional data.
They also scrutinized existing datasets to reveal new relationships among the data. The overall work spanned more than a year.”
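For a rough sense of what that 2.5-billion-parameter figure means in hardware terms, here’s a back-of-envelope sketch of the memory the weights alone would occupy, assuming 16-bit storage (the precision is our assumption; NVIDIA hasn’t published those details):

```python
# Back-of-envelope estimate of the memory footprint of a 2.5B-parameter model.
# Assumes 16-bit (2-byte) weights; the precision actually used for Fugatto is not public.
params = 2.5e9          # 2.5 billion parameters, per NVIDIA's announcement
bytes_per_param = 2     # fp16/bf16 storage (assumption)

weights_gb = params * bytes_per_param / 1e9
print(f"Weights alone: ~{weights_gb:.0f} GB")   # ~5 GB

# Training needs far more than the weights: gradients, optimizer states and
# activations typically multiply that figure several times over, which is why
# the model was trained on a bank of DGX systems with 80 GB H100 GPUs rather
# than on a single consumer card.
```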
We’ve got a lot of questions about Fugatto, ranging from “When will Fugatto become a real thing?” to “Will the data centers needed to power Fugatto generate enough heat to bring ocean-front views to the Midwest?”
But, with this demo, we can see a paradigm shift coming in how musicians work with audio – one where text-based and spoken commands become an important part of musicians’ toolkits.
This is a step towards a future Synthtopia predicted back in 2010, in 10 Predictions For Electronic Music Making In The Next Decade:
“Music software will get smarter – the state of the art in digital audio workstations is amazing. But, by and large, DAW manufacturers are still making virtual versions of traditional hardware studios. Most soft synths still look and act like their hardware predecessors, and that’s what buyers are demanding.
At this point, imitating traditional studios is horseless carriage thinking – letting what we can imagine be defined by the past. In the next decade, music software is going to get smarter and interfaces will make bolder leaps. You’ll tell your computer that you want to make a drum and bass track and your DAW will anticipate the way you’ll want your virtual studio configured. Ready to get started? Say “gimme a beat!” You’ll interact with your DAW to “evolve” new sounds. You’ll hum the bassline and your DAW will notate it. You’ll build the track by saying that you want a 32 measure intro and a drop down to the bass and then bring the kick back in after 16 measures. You’ll draw a curve on a timeline to define the shape of your track, do a run-through and improvise over the rhythm track.
Then you’ll tell your DAW to add a middle eight and double the bassline and to master it with more “zazz” and it will be saved in the cloud for your fans to listen to.”
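As a thought experiment, the kind of arrangement commands described in that prediction could reduce to mapping parsed phrases onto edits in a project. The sketch below is entirely hypothetical – the command patterns and the `Arrangement` class are invented for illustration and don’t correspond to any real DAW API:

```python
# Hypothetical sketch: turning natural-language arrangement commands into project edits.
# Nothing here corresponds to a real DAW API; it only illustrates the idea.
import re


class Arrangement:
    def __init__(self):
        self.sections = []  # ordered list of (name, length_in_measures)

    def add_section(self, name, measures):
        self.sections.append((name, measures))
        print(f"Added {name}: {measures} measures")


def handle_command(arrangement, command):
    """Map a spoken or typed request onto an arrangement edit (illustrative)."""
    m = re.search(r"(\d+)\s*measure\s+intro", command)
    if m:
        arrangement.add_section("intro", int(m.group(1)))
    elif "drop" in command:
        arrangement.add_section("drop", 8)            # arbitrary default length
    elif "middle eight" in command:
        arrangement.add_section("middle eight", 8)


arr = Arrangement()
handle_command(arr, "give me a 32 measure intro")
handle_command(arr, "add a middle eight")
```

The hard parts, of course, are everything the sketch skips: understanding free-form language and generating audio worth keeping – which is exactly where models like Fugatto come in.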
We may have been optimistic on that timetable. But it’s clear that – at least for younger musicians – we’re heading for an era where the ‘virtual studio’ paradigm of current DAWs may no longer be relevant. For someone new to music production, being able to remix audio and arrange music using voice commands will make it much easier to get started.
And, for those of us who have invested years in developing skills with audio software, it’s clear that new audio tools are coming, very quickly, that promise to let us work with audio in new ways. It seems inevitable that some of the capabilities demonstrated in this video will be integrated into the next generation of digital audio workstations.
Is generative audio about to get interesting for creative musicians? Or are you sick of hearing about how AI is going to get awesome? Share your thoughts in the comments!