Making of “Emergent Rhythm” — Behind the scenes of a live performance using real-time AI audio generation

Nao Tokui · Published in Qosmo Lab
Feb 1, 2023 · 11 min read

--

Nao Tokui / Emergent Rhythm — Real-time AI Generative Live Set / MUTEK.JP 2022.12.8 (digest)

Can we use AI to create unprecedented musical experiences?

In this article, we will explain the technical details behind our performance “Emergent Rhythm — AI Generative Live Set” at MUTEK Japan (Shibuya Stream Hall) on December 8, 2022, as well as the process and motivation that led to its realization.

For more information on the visual side of our performance at MUTEK, see the detailed article by Ryosuke Nakajima of Qosmo, who was in charge of the visuals.

The visuals were inspired by the Eameses’ film “Powers of Ten.” We looked for visual patterns common to the past and future of the universe, the earth, and humanity, and to scales ranging from the very small to the very large, and synthesized them using Stable Diffusion. I hope you will enjoy it as an exciting example of the latest VJ expression using AI models that generate images from text.

Visual examples generated with Stable Diffusion from the performance.

Live performance using real-time AI sound generation at MUTEK.JP

I have been an AI researcher and have been releasing dance music and DJing since around 2000. One of the projects I have been working on since around 2015 is the “AI DJ Project,” an attempt to show a possible future of DJing using AI. In its latest form, I performed a set in which I mixed tracks while generating them on the spot with real-time sound generation.

I must emphasize that I’m not interested in imitating/reproducing existing music or automating the process of DJing; rather, I’m interested in creating a unique musical experience. I want to make something different.

Loop generation using GAN

The audio generation model used in this performance was based on an image generation algorithm called a GAN (Generative Adversarial Network). A GAN can be likened to a game of deception between an AI model that imitates the training data (the Generator) and an AI model that scrutinizes the data to tell real samples from generated “fakes” (the Discriminator). GANs were the de facto standard for image generation until diffusion models arrived with Stable Diffusion and DALL·E 2. Although the latest diffusion models are said to be superior in the quality of the images they can generate, we chose a GAN for this project because of the real-time nature of the performance (a diffusion model generates data by repeatedly passing it through denoising networks, which makes generation much slower than with a GAN).

If you are wondering what image generation has to do with audio, here is the trick: our GAN model works on “spectrograms,” a format that represents sound as an image with a time axis and a frequency distribution. You have probably seen morphing animations of GAN-generated faces of people who do not exist. In the same way, we can generate spectrograms of music that might exist and morph between them with the GAN model.

Morphing of faces and spectrograms generated by GAN

More specifically, we trained StyleGAN2 models to generate spectrograms of 2-bar loops at a tempo of 120 BPM (4 seconds of audio). Generating one loop takes around 0.4 seconds, which means the sound is produced roughly ten times faster than real time.

Since phase information is lost when a sound is converted to a spectrogram, the missing phase has to be estimated when the spectrogram is converted back into sound. In our implementation, the spectrogram generated by the StyleGAN2 model is passed through another model, MelGAN, which estimates the phase and converts the spectrogram into an audio file humans can hear (our implementation of this part is available on GitHub).
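To make the two-stage pipeline concrete, here is a minimal sketch, assuming both trained models have been exported as TorchScript modules; the file names and the 512-dimensional latent are illustrative assumptions, not the actual Qosmo implementation (see the GitHub repository for that).

```python
# Minimal sketch of the spectrogram-to-audio chain described above.
# Assumes both trained models are available as TorchScript exports;
# file names and the 512-dim latent size are illustrative.
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
stylegan2 = torch.jit.load("stylegan2_drum_loops.pt", map_location=device).eval()
melgan = torch.jit.load("melgan_vocoder.pt", map_location=device).eval()

with torch.no_grad():
    z = torch.randn(1, 512, device=device)   # random input latent vector
    t0 = time.time()
    mel = stylegan2(z)                        # "image": mel spectrogram of a 2-bar loop
    audio = melgan(mel)                       # MelGAN estimates phase -> waveform
    elapsed = time.time() - t0

# A 2-bar loop at 120 BPM is 4 s of audio; about 0.4 s of generation time
# means the system runs roughly 10x faster than real time.
print(f"{elapsed:.2f}s to generate a 4-second loop")
```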

Loop generation with GANs

For the MUTEK performance, we trained a total of seven StyleGAN2 models. Each model generates a particular kind of sound (spectrogram), such as rhythm patterns of a specific genre, basslines, or synth pads. Following the visual theme mentioned above, we also prepared models trained on organic sounds from daily life and nature (wind, rolling objects, footsteps in a forest, etc.).

In the actual performance, musical changes were created by gradually moving the latent vectors fed into the GAN models. As the DJ, I mixed the different loops generated by multiple models and applied effects to them on a DJ mixer, developing the music in an improvisational, DJ-like manner.
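As an illustration of what “gradually moving the latent vector” could look like in code, here is a sketch that reuses the hypothetical stylegan2 and melgan modules from the example above; the straight-line interpolation and step count are illustrative choices.

```python
# Sketch: creating gradual musical changes by moving the latent vector.
# `stylegan2`, `melgan`, and `device` are the hypothetical objects defined
# in the earlier sketch.
import torch

def morph_loops(z_start, z_end, steps=8):
    """Yield audio loops along a straight line between two latent vectors."""
    with torch.no_grad():
        for i in range(steps):
            t = i / (steps - 1)
            z = (1.0 - t) * z_start + t * z_end   # linear interpolation in latent space
            yield melgan(stylegan2(z))            # one 2-bar loop per interpolation step

z_a = torch.randn(1, 512, device=device)   # loop A (e.g. a sparse rhythm)
z_b = torch.randn(1, 512, device=device)   # loop B (e.g. a busier rhythm)
loops = list(morph_loops(z_a, z_b))         # played back-to-back, the loops morph from A to B
```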

Controlling the system with an iPad

StyleGAN is an architecture with an intermediate latent vector w that is more organized (disentangled) than the usual input latent vector (noise) z. In a model trained on facial images, it is known that moving w in particular directions can make a face smile, or make it look more masculine/feminine or younger/older.

Manipulation of images using the GAN style vector w (InterFaceGAN)

In the same way, we pre-defined w vectors that make the generated rhythm more complex or simpler, or strengthen or weaken the bass. This allows the performer to intentionally increase the number of notes and/or boost the bass when they want to excite the audience.
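A rough sketch of how such pre-defined directions could be applied follows, again on the hypothetical models above; the direction files, dictionary keys, and the InterFaceGAN-style offline analysis used to obtain them are assumptions for illustration, not the actual setup.

```python
# Sketch: steering the generated rhythm with pre-defined w-space directions.
# The direction vectors would be prepared offline (e.g. with an
# InterFaceGAN-style analysis of annotated loops); names are illustrative.
import torch

w_directions = {
    "complexity": torch.load("w_dir_complexity.pt"),  # more notes <-> fewer notes
    "bass": torch.load("w_dir_bass.pt"),              # stronger <-> weaker low end
}

def steer(w, complexity=0.0, bass=0.0):
    """Shift the intermediate latent w along interpretable directions."""
    return w + complexity * w_directions["complexity"] + bass * w_directions["bass"]

# During a build-up, push toward a busier rhythm with a heavier bass:
# w_live = steer(w_current, complexity=1.5, bass=1.0)
# mel = synthesis_network(w_live)   # hypothetical StyleGAN2 synthesis sub-network
```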

More technically, for each GAN model we generated four slightly different loops simultaneously from slightly perturbed latent vectors (batch size = 4). These four loops are mixed down with slightly different balances on the left and right channels, producing a wide stereo image that is another distinctive aspect of this AI-driven performance.
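Here is a minimal sketch of that batch-of-four stereo trick, continuing with the same hypothetical models; the jitter scale and the left/right mix weights are illustrative.

```python
# Sketch: four slightly different latents -> four similar loops, mixed with
# different left/right balances for a wide stereo image.
import torch

def stereo_loop(z, jitter=0.05):
    zs = z + jitter * torch.randn(4, z.shape[-1], device=z.device)  # 4 nearby latents
    with torch.no_grad():
        loops = melgan(stylegan2(zs))               # assumed shape: (4, samples)
    left_mix = torch.tensor([0.4, 0.3, 0.2, 0.1], device=z.device).unsqueeze(1)
    right_mix = torch.tensor([0.1, 0.2, 0.3, 0.4], device=z.device).unsqueeze(1)
    left = (left_mix * loops).sum(dim=0)            # weighted sum of the 4 loops
    right = (right_mix * loops).sum(dim=0)          # same loops, different balance
    return torch.stack([left, right])               # (2, samples) stereo loop
```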

System diagram

Real-time tone conversion by Neutone plug-in

Another important element of this performance was Neutone, a real-time AI audio synthesis and processing plug-in that we at Qosmo continue to develop. Neutone has been in development since early 2022, with the goal of providing the latest AI models as AudioUnit/VST plug-ins that can easily be used in common DAWs (Ableton Live, Logic, etc.). It was also very moving for me to see that, after a year of development, Neutone had finally reached a level of maturity where it could be used in a performance.

At MUTEK, we mainly used RAVE models inside Neutone. RAVE is a timbre transfer model that converts any input sound into a specific type of sound, say a violin or a human voice.
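To show what timbre transfer looks like outside the plug-in, here is a minimal offline sketch. The checkpoint and file names are assumptions; exported RAVE models are TorchScript files that typically expose encode()/decode(), which is roughly what Neutone wraps for real-time use in the DAW.

```python
# Offline sketch of RAVE-style timbre transfer, approximating what the
# Neutone plug-in does in real time. File names are hypothetical.
import torch
import torchaudio

rave = torch.jit.load("rave_bulgarian_choir.ts").eval()

audio, sr = torchaudio.load("gan_drum_loop.wav")   # assumed mono loop from the GAN stage
with torch.no_grad():
    x = audio.unsqueeze(0)     # (batch=1, channels=1, samples)
    z = rave.encode(x)         # compress the input into RAVE's latent space
    y = rave.decode(z)         # resynthesize with the model's learned timbre
torchaudio.save("drum_loop_as_choir.wav", y.squeeze(0), sr)
```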

Ableton Live project used in the performance. Multiple Neutone plug-ins were launched and used.

Following the theme of human history and the movement back and forth between cultural contexts, we used RAVE models trained on Buddhist chanting, Christian church choirs, Bulgarian female choirs (you may have heard them in the theme music of the anime Akira), the kora (an African folk instrument), and so on. During the performance, the rhythms generated by the GANs were converted into human voices (instant vocal percussion), and synth sounds were converted into hymn-like sounds in real time.

My favorite scene was when I half-accidentally created a feedback loop on the DJ mixer: I fed the output of the Bulgarian female choir model into the Buddhist chant model and fed that back into the Bulgarian model. It was surprising to hear it start generating intriguing sounds with an organic, uniquely floating quality. I believe it was something only real-time AI could have produced.

Other scenes from the performance. We recorded a dancer (Masumi Endo) dancing at a given tempo and used the footage as an input image to the Stable Diffusion model.

Background to the Performance

This “AI DJ Project” is, of course, not about automating DJing with AI (I’m not generous enough to hand such an interesting activity over to AI!). Rather, by delegating part of the DJ process to AI, we aim to realize song selections that could not be imagined by humans alone, as well as new kinds of DJ performance never seen before. (See this article on Medium.)

AI DJ Project at YCAM (2017)

The AI DJ Project started with Back to Back with AI (multiple DJs selecting songs alternately), but as AI technology progressed, it evolved into a live performance in which the AI generates rhythms and basslines on stage in real time and the DJ plays the generated MIDI data on drum machines and synthesizers (AI DJ Project ver2 — Ubiquitous Rhythm in 2021).

Because the sounds you choose on the drum machine and synthesizer affect the next generated rhythm or bassline patterns, it is impossible to predict what kind of music will come next. When everything clicked and the music started grooving, it was a thrilling, rewarding experience that gave me goosebumps. But it was too challenging for the DJ: there were too many parameters to control, across synths, drum machines, and the AI, and it was almost impossible to maintain musical quality.

AI DJ Project ver2 — Ubiquitous Rhythm (2018)

Last year, 2022, marked a breakthrough in AI music technology that has remained relatively invisible in the shadow of image-generating models such as DALL-E and Stable Diffusion, as well as ChatGPT: it was the year real-time AI audio generation models arrived, such as RAVE (November 2021, to be exact) and MUSIKA. (“Real-time generation” here means that sound is generated in the same amount of time as, or less time than, the actual performance. For example, one measure of a song at 120 BPM lasts two seconds, so the boundary is whether at least one measure of music can be generated within those two seconds.)
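As a quick numeric check of that definition (the 0.4-second figure is the GAN loop timing mentioned earlier):

```python
# One 4/4 bar at 120 BPM lasts 4 beats * 60/120 s = 2 s, so a model counts
# as "real-time" here if it can render at least 2 s of audio within 2 s.
bpm, beats_per_bar = 120, 4
bar_seconds = beats_per_bar * 60 / bpm       # 2.0 seconds of music per bar
generation_seconds = 0.4                     # e.g. the 4-second GAN loop above
realtime_factor = 4.0 / generation_seconds   # ~10x faster than real time
print(bar_seconds, realtime_factor)          # 2.0 10.0
```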

AI-based sound synthesis has been actively researched, especially for speech synthesis, where models that generate audio faster than real time have been proposed. In music, several techniques have been proposed that synthesize audio directly, rather than working with MIDI or other notation. SampleRNN and OpenAI’s Jukebox are particularly well known, but they are far from real time, taking several hours to generate 30 seconds of sound (in the case of Jukebox), and the quality of the output was not sufficient for live performance. This situation changed drastically from the end of 2021 through 2022.

Seeing this trend, we thought that using AI to synthesize sounds on the spot (instead of generating scores) would reduce the burden of controlling synthesizers and the arbitrariness described above. Suppose DJs could synthesize music as they imagined it on stage and combine it as they wished, creating a one-time, one-off musical experience. It would give the audience a musical experience that had never existed before and present a new way of performing (and a new challenge) for the DJ.

It’s safe to say that modern-day DJs, who mix digital data using software features such as hot cues, looping, and even sound source separation, are already more inclined toward live performance than traditional DJs mixing two vinyl records. If AI accelerates this trend, what kind of live performance becomes possible? We were also conscious of such speculative viewpoints.

MUTEK and Future Prospects

After a long preparation period (preparations began in May 2022, and the first live performance using this system was held at Club METRO in Kyoto in July), the show was scheduled for December 8, 2022. I was quite nervous about how the audience would react, but as it turned out, the response was more positive than I had imagined. I was relieved from the bottom of my heart when I heard the cheers that rose immediately after the live performance.

Controlling AI models to generate music on the spot and make the audience dance: rather than a disc jockey, a DJ who rides/jockeys discs, you could feel the possibility of a new form of performance, an “AI jockey” who tames and rides an AI.

There were new musical experiences that could only have been realized with AI, such as rhythms and sounds morphing in real time, and musical developments with an unexpectedness unique to AI. On the other hand, there are technical limitations, such as the quality of the generated sound, especially for synthesizer melodies and riffs. It must also be acknowledged that, judged purely as music, it is not yet at a level where we can proudly say we have created new and cool music that has never existed before.

How can we create novel musical experiences and sounds while having AI learn from existing data? There is still a lot of trial and error to be done (one direction is shown in my 2019 study presented in the video below).

Creative Adversarial Networks for Rhythms

Around the time of the MUTEK performance, a system called Riffusion, which generates music by producing spectrograms with Stable Diffusion, was released and created a lot of buzz. The system is also very attractive because, unlike GAN-based systems, it can generate music from text. Whether text input is the best way to manipulate music is debatable. Still, many new experiments are underway to generate waveforms directly with diffusion models, so this field is likely to become even more interesting.

We will continue to incorporate the latest research in this area and carry on our trial-and-error efforts to create unique music and musical experiences that have never been heard before.

(Full performance video is available here.)

Thanks for coming out to the show!

Credits

  • Concept/Machine Learning/Performance: Nao Tokui (Qosmo)
  • Visual Programming: Ryosuke Nakajima (Qosmo)
  • Visual Programming: Keito Takaishi (Qosmo)
  • Dancer: Masumi Endo
  • Movement Director: Maiko Kuno

References

  • Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2020. “Generative Adversarial Networks.” Communications of the ACM 63 (11): 139–44.
  • Hung, Tun-Min, Bo-Yu Chen, Yen-Tung Yeh, and Yi-Hsuan Yang. 2021. “A Benchmarking Initiative for Audio-Domain Music Generation Using the Freesound Loop Dataset.” arXiv [cs.SD]. arXiv. http://arxiv.org/abs/2108.01576.
  • Karras, Tero, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2019. “Analyzing and Improving the Image Quality of StyleGAN.” arXiv [cs.CV]. arXiv. http://arxiv.org/abs/1912.04958.
  • Kumar, Kundan, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson, Yoshua Bengio, and Aaron Courville. 2019. “MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis.” arXiv [eess.AS]. arXiv. http://arxiv.org/abs/1910.06711.
  • Caillon, Antoine, and Philippe Esling. 2021. “RAVE: A Variational Autoencoder for Fast and High-Quality Neural Audio Synthesis.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/2111.05011.
  • Pasini, Marco, and Jan Schlüter. 2022. “Musika! Fast Infinite Waveform Music Generation.” arXiv [cs.SD]. arXiv. http://arxiv.org/abs/2208.08706.
  • Tokui, Nao. 2020. “Can GAN Originate New Electronic Dance Music Genres? — Generating Novel Rhythm Patterns Using GAN with Genre Ambiguity Loss.” arXiv [cs.SD]. arXiv. http://arxiv.org/abs/2011.13062.
