New Frontiers

Mind-Melting AI Makes Frank Sinatra Sing “Toxic” by Britney Spears

We gave these AI music experts an unusual request — and what they delivered will blow your mind.

By Dan Robitzski

Updated May 29, 2020 10:01 AM EDT

OpenAI recently released Jukebox, an algorithm that can create original songs or mashup-hybrids in the style of over 9,000 artists. We kicked the tires. — Image: William P. Gottlieb Collection/Victor Tangermann

At the end of April, the artificial intelligence development firm OpenAI released a new neural net, Jukebox, which can create mashups and original music in the style of over 9,000 bands and musicians.

Alongside it, OpenAI released a list of sample tracks generated with the algorithm that bend music into new genres or even reinterpret one artist’s song in another’s style — think a jazz-pop hybrid of Ella Fitzgerald and Céline Dion.

It’s an incredible feat of technology, but Futurism’s editorial team was unsatisfied with the tracks OpenAI shared. To really kick the tires, we went to CJ Carr and Zack Zukowski, the musicians and computer science experts behind the algorithmically-generated music group DADABOTS, with a request: We wanted to hear Frank Sinatra sing Britney Spears’ “Toxic.”

And boy, they delivered.

An algorithm that can create original works of music in the style of existing bands and artists raises unexplored legal and creative questions. For instance, can the artists that Jukebox was trained on claim credit for the resulting tracks? Or are we experiencing the beginning of a brand-new era of music?

“There’s so much creativity to explore there,” Zukowski told Futurism.

Below is the resulting song, in all its AI-generated glory, followed by Futurism’s lightly-edited conversation with algorithmic musicians Carr and Zukowski.

Futurism: Thanks for taking the time to chat, CJ and Zack. Before we jump in, I’d love to learn a little bit more about both of you, and how you learned how to do all this. What sort of background do you have that lent itself to AI-generated music?

Zack Zukowski: I think we’re both pretty much musicians first, but also I’ve been involved in tech for quite a while. I approached my machine learning studies from an audio perspective: I wanted to extend what was already being done with synthesis and music technology. It seemed like machine learning was obviously the path that was going to make the most gains, so I started learning about those types of algorithms. SampleRNN is the tool we most like to use — that’s one of our main tools that we’ve been using for our livestreams and our Bandcamp albums over the last couple years.

CJ Carr: Musician first, motivated in computer science to do new things with music. DADABOTS itself comes out of hackathon culture. I’ve done 65 hackathons, and Zack and I together have won 15 or so. That environment inspires people to push what they’re doing in some new way, to do something provocative. That’s the spirit DADABOTS came out of in 2012, and we’ve been pushing it further and further as the tech has progressed.

Why did you make the decision to step up from individual hackathons and stick with DADABOTS? Where did the idea come from for your various projects?

CJ: When we started it, we were both interns at Berklee College of Music working in music tech. When I met Zack — for some reason it felt like I’ve known Zack my whole life. It was a natural collaboration. Zack knew more about signal processing than I did, I knew more about programming, and now we have both brains.

What’s your typical approach? What’s going on behind the scenes?

CJ: SampleRNN has been our main tool. It’s really fast to train — we can train it in a day or two on a new artist. One of the main things we love to do is collaborating with artists, when an artist says “hey I’d love to do a bot album.” But recently, Jukebox trumped the state of the art in music generation. They did a really good job.

SampleRNN and Jukebox, they’re similar in that they’re both sequence generators. It’s reading a sequence of audio at a 44.1 kHz or 16 kHz sample rate, and then it’s trying to predict what the next sample is going to be. This net is making a decision at a fraction of a millisecond to come up with the next sample. This is why it’s called neural synthesis. It’s not copying and pasting audio from the training data, it’s learning to synthesize.
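To make the "predict the next sample" idea concrete, here is a minimal sketch in Python. It is not SampleRNN or Jukebox: the real systems use neural networks that look at long stretches of earlier audio, while this toy only counts which quantized sample value tends to follow the previous one. The function names, the triangle-wave "training audio", and the eight quantization levels are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

SAMPLE_RATE_HZ = 16_000                 # one of the two rates mentioned above
MS_PER_SAMPLE = 1000 / SAMPLE_RATE_HZ   # time between consecutive samples

def train_next_sample_model(samples):
    """Count how often each quantized sample value follows the previous one."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(samples, samples[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, seed, length, rng):
    """Build audio one sample at a time, drawing each from the learned counts."""
    out = [seed]
    while len(out) < length:
        options = counts.get(out[-1])
        if not options:          # context never seen in training: fall back to seed
            out.append(seed)
            continue
        values, weights = zip(*options.items())
        out.append(rng.choices(values, weights=weights, k=1)[0])
    return out

# Toy "training audio": a triangle wave quantized to integer levels 0..7.
training = [0, 1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1] * 50
model = train_next_sample_model(training)
generated = generate(model, seed=0, length=32, rng=random.Random(0))

print(generated)
print(f"{MS_PER_SAMPLE} ms per sample at {SAMPLE_RATE_HZ} Hz")
```

The output is a new sequence that moves the way the training wave moves without being a copy of any stretch of it, which is the distinction Carr draws between synthesizing and copy-pasting. At 16 kHz, each sample covers 0.0625 milliseconds.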

What’s different about them is that SampleRNN uses “Long Short Term Memory” (LSTM) architecture, whereas Jukebox uses a transformer architecture. The transformer has attention. This is a relatively new thing that’s come to popularity in deep learning, after RNN, after LSTM. It especially took over for language models. I don’t know if you remember fake news generators like GPT-2 and Grover. They use transformer architecture. Many of the language researchers left LSTM behind. No one had really applied it to audio music yet — that’s the big enhancement for Jukebox. They’re taking a language architecture and applying it to music.

They’re also doing this extra thing, called a “Vector-Quantized Variational AutoEncoder” (VQ-VAE). They’re trying to turn audio into language. They train a model that creates a codebook, like an alphabet. And they take this alphabet, which is a discrete set of 2048 symbols — each symbol is something about music — and then they train their transformer models on it.
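The codebook step can be sketched as a nearest-neighbor lookup: each short frame of encoded audio is replaced by the index of the closest entry in a table of 2048 vectors. The Python below is a rough illustration under stated assumptions, not OpenAI's implementation. The codebook here is random rather than learned, the 4-number frame width is arbitrary, and `quantize`, `encode`, and `decode` are hypothetical names.

```python
import random

CODEBOOK_SIZE = 2048   # number of discrete symbols, as quoted above
DIM = 4                # hypothetical frame width, chosen only for this toy

rng = random.Random(0)
# Stand-in for a learned codebook: random vectors instead of trained ones.
codebook = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(frame):
    """Return the index of the codebook vector closest to this frame."""
    return min(range(CODEBOOK_SIZE),
               key=lambda i: squared_distance(frame, codebook[i]))

def encode(frames):
    """Turn a sequence of continuous frames into a sequence of symbols."""
    return [quantize(f) for f in frames]

def decode(tokens):
    """Map symbols back to codebook vectors (a lossy reconstruction)."""
    return [codebook[t] for t in tokens]

# Stand-in for encoder output: 8 continuous frames.
frames = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(8)]
tokens = encode(frames)
print(tokens)
```

Once audio is a list of integers between 0 and 2047, it has the same shape as text tokens, which is what lets a language-style transformer be trained on it.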

What does that alphabet look like? What is that “something about music?”

CJ: They didn’t do that analysis at all. We’re really curious. For instance, can we compose with it?

Zack: We have these 2048 characters, and so we wonder which ones are commonly used. Like in the alphabet we don’t use Zs too much. But what are the “vowels?” Which symbols are used frequently? It would be really interesting to see what happens when you start getting rid of some of these symbols and see what the net can do with what remains. The way we have the language of music theory with chords and scales, maybe this is something that we can compose with beyond making deepfakes of an artist.

What can that language tell us about the underlying rules and components of music, and how can we use these as building blocks themselves? They’re much higher-level than chords — maybe they’re genre-related. We really don’t know. It would be really cool to do that analysis and see what happens by using just a subset of the language.
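The analysis Zukowski describes could start with something as simple as counting symbols. The Python sketch below assumes you already have a list of codebook indices for a piece of music; the ten-token sequence is made up for illustration, and `symbol_usage` and `keep_only` are hypothetical helper names, not part of Jukebox.

```python
from collections import Counter

def symbol_usage(tokens, vocab_size=2048):
    """Count how often each symbol appears and how many are never used."""
    counts = Counter(tokens)
    unused = vocab_size - len(counts)
    return counts, unused

def keep_only(tokens, allowed):
    """Drop every symbol outside the allowed subset of the alphabet."""
    return [t for t in tokens if t in allowed]

# Made-up token sequence standing in for an encoded clip.
tokens = [7, 7, 7, 512, 7, 512, 2047, 7, 512, 3]

counts, unused = symbol_usage(tokens)
vowels = {sym for sym, _ in counts.most_common(2)}   # candidate "vowels"

print(counts.most_common(2))      # [(7, 5), (512, 3)]
print(unused)                     # 2044
print(keep_only(tokens, vowels))  # [7, 7, 7, 512, 7, 512, 7, 512]
```

Decoding the filtered sequence back to audio, which this sketch does not attempt, would be the experiment he is proposing: hearing what the net can say with only part of its alphabet.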

CJ: They’ve come up with a new music theory.

Well, it sounds like the three of us have a lot of the same questions about all this. Have you started tinkering with it to learn what’s going on?

CJ: We’ve just got the code running. The first example is this Sinatra thing. But as we use this more, the philosophical implications here are that as musicians, we know intuitively that music is very language-like. It’s not just waves and noise, which is what it looks like at a small scale, but when we’re playing we’re communicating with each other. The bass and the drummer are in step, strings and vocals can be doing call-and-response. And OpenAI was just like “Hey, what if we treated music like language?”

If the sort of alphabet this algorithm uses could be seen as a new music theory, do you think this will be a tool for you two going forward? Or is it more of an oddity to play around with?

CJ: Maybe I should correct myself. Instead of being a music theory, these models can train music theory.

Zack: The theory isn’t something that we can explain right now. We can’t say “This value means this.” It’s not quite as human interpretable, I guess.

CJ: The model just learns probabilistic patterns, and that’s what music theory is. It’s these notes tend to have these patterns and produce these feelings. And those were human-invented. What if we just have a machine try to discover that on its own, and then we ask it to make music? And if it’s good at it, probably it’s learned a good quote-unquote “music theory.”

Zack: An analogy we thought of: Back in the days of Bach, and these composers who were really interested in having counterpoint — many voices moving in their own direction — they had a set of rules for this. The first melodic line the composer builds off is called cantus firmus. There was an educational game new composers would play — if you could follow the notes that were presented in the cantus firmus and guess what harmonizing notes were next, you’d be correct based on the music of the day.

We’re thinking this is kind of the machine version of that, in some ways. Something that can be used to make new music in the style of music that has been heard before.

I know it’s early days and that this is speculative, but do you have any predictions for how people might use Jukebox? Will it be more of these mashups, or do you think people will develop original compositions?

CJ: On the one hand, you have the fear of push-button art. A lot of people think push-button art is very grotesque. But I think push-button art, when a culture can achieve this — it’s a transcendent moment for that culture. It means the communication of that culture has achieved its capacity. Think about meme generators — I can take a picture of Keanu Reeves, put in some inside joke and send it to my friends, and then they can understand and appreciate what I’m communicating. That’s powerful. So it is grotesque, but it’s effectual.

On the other side, you’ll have these virtuosos — these creators — who are gonna do overkill and try to create a medium of art that’s never existed before. What interests us are these 24/7 generators, where it can just keep generating forever.

Zack: I think it’s an interesting tool for artists who have worked on a body of albums. There are artists who don’t even know they can be generated on Jukebox. So, I think many of them would like to know what can be generated in their likeness. It can be a variation tool, it can recreate work for an artist through a perspective they haven’t even heard. It can bend their work through similar artists or even very distantly-stylized artists. It can be a great training tool for artists.

You said you’d heard from some artists who approached you to generate music already — is that something you can talk about?

CJ: When bands approach us, they’ve mostly been staying within the lane of “Hey, use just my training data and let’s see what comes out — I’m really interested.”

Fans though, on YouTube, are like “Here’s a list of my four favorite bands, please make me something out of it.”

So, let’s talk about the actual track you made for us. For this new song, Futurism suggested Britney Spears’ “Toxic” as sung by Frank Sinatra. Did the technical side of pulling that together differ from your usual work?

CJ: This is different. With SampleRNN, we’re retraining it from scratch on usually one artist or one album. And that’s really where it shines — it’s not able to do these fusions very well. What OpenAI was able to do — with a giant multimillion-dollar compute budget — they were able to train these giant neural nets. And they trained them on over 9,000 artists in over 300 genres. You need a mega team with a huge budget just to make this generalizable net.

Zack: There are two options. There’s lyrics and no lyrics. No lyrics is sort of like how SampleRNN has worked. With lyrics it tries to get them all in order, but sometimes it loops or repeats. But it tries to go beginning to end and keep the flow going. If you have too many lyrics, it doesn’t understand. It doesn’t understand that if you have a chorus repeating, the music should repeat as well. So we find that these shorter compositions work better for us.

But you had lyrics in past projects that used SampleRNN, like “Human Extinction Party.” How did that differ?

CJ: That was smoke and mirrors.

Zack: That was kind of an illusion. The album we trained it on had vocals, so some made it through, too. We had a text generator that made up lyrics whenever it heard a sound.

In a lot of these Jukebox mashups, I’ve noticed that the voice sounds sort of strained. Is that just a matter of the AI-generated voice being forced to hit a certain note, or does it have something more to do with the limitations of the algorithm itself?

Zack: Your guess sounds similar to what I’d say. It was probably just really unlikely that those lyrics or the phonemes, the sounds themselves of the words, showed up in a similar way to how we were forcing it to generate those syllables. It probably heard a lot more music that isn’t Frank Sinatra, so it can imagine some things that Frank Sinatra didn’t do. But it just comes down to being somewhat different from any of the original Frank Sinatra texts.

When you were creating this rendition of Toxic, did you hit any snags along the way? Or was it just a matter of giving the algorithm enough time to do its work?

CJ: Part of it is we need a really expensive piece of hardware that we need to rent on Amazon Cloud at three dollars per hour. And it takes — how long did it take to generate, Zack?

Zack: The final one I had generated took about a day, but I had been doing it over and over again for a week. You have so little control that sometimes you just gotta go again. It would get a few phrases and then it would lose track of the lyrics. Sometimes you’d get two lines but not the whole chorus in a row. It came down to luck — waiting for the right one to come along.

It could loop a line, or sometimes it could go into seemingly different songs. It would completely lose track of where it was. There are some pretty wild things that can happen. One time I was generating Frank Sinatra, and it was clearly a chorus of men and women together. It wasn’t even the right voice. It can get pretty ghostly.
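For a rough sense of what those numbers add up to, here is a back-of-the-envelope calculation in Python. It uses only the three-dollars-per-hour figure quoted above and assumes the rented machine runs around the clock, which the interview does not confirm; the weekly total is an upper-bound illustration, not a reported cost.

```python
HOURLY_RATE_USD = 3.0   # "three dollars per hour", as quoted above
HOURS_PER_DAY = 24

def rental_cost(days, hourly_rate=HOURLY_RATE_USD):
    """Cost of keeping one rented machine running continuously for `days` days."""
    return days * HOURS_PER_DAY * hourly_rate

one_take = rental_cost(1)   # a single day-long generation
one_week = rental_cost(7)   # assumption: the machine ran nonstop all week

print(one_take)   # 72.0
print(one_week)   # 504.0
```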

Do you know if there are any legal issues involved in this kind of music? The capability to generate new music in the style or voice of an artist seems like uncharted territory, but are there issues with the mashups that use existing lyrics? Or are those more acceptable under the guise of fair use, sort of like parody songs?

CJ: We’re not legal people, we haven’t studied copyright issues. The vibe is that there’s a strong case for fair use, but artists may not like people creating these deepfakes.

Zack: I think it comes down to intention, and whatever the law decides they’ll decide. But as people using this tool, artists, there’s definitely a code of ethics that people should probably respect. Don’t piss people off. We try our best to cite the people who worked on the tech, the people who it was trained on. It all just depends how you’re putting it out and how respectful you’re being of people’s work.

Before I let you go, what else are you two working on right now?

CJ: Our long-term research is trying to make these models faster and cheaper so bedroom producers and 12-year-olds can be making music no one’s ever thought of. Of course, right now it’s very expensive and it takes days. We’re in a privileged position of being able to do it with the rented hardware.

Specifically, what we’re doing right now — there’s the list of 9,000-plus bands that the model currently supports. But what’s interesting is the bands weren’t asked to be a part of this dataset. Some machine learning researchers on Twitter were debating the ethics of that. There are two sides of that, of course, but we really want to reach out to those bands. If anyone knows these bands, if you are these bands, we will generate music for you. We want to take this technology, which we think is capable of brand-new forms of creativity, and give it back to artists.
