Here's why AI can't make a catchier tune than the worst pop song in the charts right now

DeepMind tries to train a neural network on classical piano

Neural networks are neat at spotting and reproducing patterns in images and text – yet they still struggle when spitting out audio.

There are numerous examples of artificially intelligent software improvising fairly realistic images of people, buildings, and other objects from training material. However, when it comes to composing music, machines are way off. The melodies, if you can call them that, sound nonsensical because they don’t have the structure over time that normal music does.

We expect tunes to sustain a structure over a matter of minutes, whereas computers end up flitting about between styles every few seconds.

Pop songs are roughly split into verses and choruses with a repeating melody, yet that's a pattern machine-learning code cannot seem to grasp. Now, a paper by researchers at DeepMind has had a stab at explaining why.

Most research projects sidestep raw audio, instead training systems on symbolic representations of music, such as MIDI files, which the neural network is expected to recreate. This, it seems, strips away the details and nuances that are important when it comes to crafting music that sounds realistic. So instead, the DeepMind gang trained their model directly on raw audio waveforms, teaching it to produce raw audio in turn – a move other teams are also starting to consider.

Explored

“Models that are capable of generating audio waveforms directly (as opposed to some other representation that can be converted into audio afterwards, such as spectrograms or piano rolls) are only recently starting to be explored,” the researchers explained in a writeup of their study, emitted late last week.

"This was long thought to be infeasible due to the scale of the problem, as audio signals are often sampled at rates of 16 kHz or higher."

Crucially, text-to-speech systems do not suffer from the same creative blocks as AI songwriters because words in human speech are pretty short – with sounds on the order of hundreds of milliseconds – whereas music requires structure stretching over minutes. Text-to-speech bots just have it easier than their music-generating cousins.
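To put rough numbers on that gap – a back-of-the-envelope sketch, where the 300 ms speech sound and the three-minute song length are illustrative figures, not numbers from the paper:

```python
SAMPLE_RATE = 16_000  # samples per second, the rate quoted in DeepMind's paper

speech_sound = int(0.3 * SAMPLE_RATE)  # ~300 ms of speech -> 4,800 samples
full_song = 3 * 60 * SAMPLE_RATE       # a three-minute tune -> 2,880,000 samples

print(full_song // speech_sound)       # the tune spans roughly 600x more timesteps
```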

Popular models used for text-to-speech, such as SampleRNN and WaveNet, have been explored for music generation, but none of them have been successful in capturing melodies and rhythm.

“Music is a complex, highly structured sequential data modality,” the DeepMind paper stated.

"When rendered as an audio signal, this structure manifests itself at various timescales, ranging from the periodicity of the waveforms at the scale of milliseconds, all the way to the musical form of a piece of music, which typically spans several minutes.

"Modeling all of the temporal correlations in the sequence that arise from this structure is challenging, because they span many different orders of magnitude."

Putting it all together

So, what did DeepMind come up with after scrutinizing other music-generating systems? In order to grok patterns in music, they designed their AI software to learn from longer snippets of audio training data. The researchers called this “[enlarging] the receptive fields.”

They did this by adding more convolutional layers to a WaveNet model. The input sound samples were taken from more than 400 hours of recorded solo piano music, from composers such as Chopin and Beethoven. These were then fed into the model via an encoder that converted the raw audio into continuous scalars, or into 256-dimensional one-hot vectors.
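As a rough illustration of what that one-hot step involves – a minimal sketch assuming the 8-bit µ-law companding used in the original WaveNet work, which may not match the exact encoder DeepMind used here – each raw sample is squashed into one of 256 discrete classes, one per timestep:

```python
import numpy as np

def mu_law_encode(audio, quantization_channels=256):
    """Compand a float waveform in [-1, 1] into 8-bit mu-law class ids."""
    mu = quantization_channels - 1
    # mu-law companding: more resolution for quiet samples, less for loud ones
    companded = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    # map [-1, 1] onto integer classes {0, ..., 255}
    return ((companded + 1) / 2 * mu + 0.5).astype(np.int64)

def one_hot(class_ids, quantization_channels=256):
    """Turn class ids into 256-dimensional one-hot vectors, one per timestep."""
    return np.eye(quantization_channels, dtype=np.float32)[class_ids]

# Example: one second of a 440 Hz sine wave "recorded" at 16 kHz
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
waveform = 0.5 * np.sin(2 * np.pi * 440 * t)

class_ids = mu_law_encode(waveform)   # shape (16000,)
vectors = one_hot(class_ids)          # shape (16000, 256)
print(class_ids.shape, vectors.shape)
```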

It’s a computationally demanding process, and the whole training portion taxed as many as 32 GPUs. “We show that it is possible to model structure across roughly 400,000 timesteps, or about 25 seconds of audio sampled at 16 kHz. This allows us to generate samples of piano music that are stylistically consistent,” the paper stated.
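The arithmetic checks out: 400,000 timesteps at 16,000 samples per second is 25 seconds. For comparison, here's a rough sketch of how far a standard WaveNet-style stack of dilated convolutions can see – the layer counts below are illustrative, not DeepMind's actual configuration – which shows why the receptive field had to be enlarged so aggressively:

```python
SAMPLE_RATE = 16_000  # Hz, as quoted in the paper

def receptive_field(kernel_size=2, layers_per_cycle=10, cycles=5):
    """Receptive field, in timesteps, of a WaveNet-style stack of dilated convolutions.

    Dilation doubles each layer within a cycle: 1, 2, 4, ..., 2**(layers_per_cycle - 1).
    """
    dilations = [2 ** i for i in range(layers_per_cycle)] * cycles
    # each layer with dilation d and kernel size k widens the field by (k - 1) * d
    return 1 + sum((kernel_size - 1) * d for d in dilations)

timesteps = receptive_field()
print(f"typical WaveNet stack: {timesteps} timesteps "
      f"= {timesteps / SAMPLE_RATE:.2f} seconds of audio")

# versus the roughly 400,000 timesteps DeepMind reports modelling:
print(f"400,000 timesteps = {400_000 / SAMPLE_RATE:.0f} seconds")  # -> 25 seconds
```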

Since the model's receptive field only spans about 25 seconds, the output it generates is only consistent "across tens of seconds" as well. Here are a few ten-second clips of what DeepMind's machine-made music sounds like.

DeepMind Audio Clip 1

DeepMind Audio Clip 2

Ten seconds isn’t enough to craft a catchy tune, but it’s interesting to see AI try. Slap a banging beat on it, and you've probably got next year's EDM hit. ®
