On Music Generation AI Models
![On Music Generation AI Models](/images/dylanelectric.jpg)
Introduction:
Recently, I took on a side project that involved packaging a new music generation AI model into a nice user experience (www.text-to-sample.com). By simply describing a sound, such as “African drums” or “electric guitar feedback”, and specifying a duration, the AI generates that sound, which can then be used as part of the song creation process.
This involved creating a plugin for digital audio workstations (DAWs) like Logic and Ableton, which lets music producers use the AI directly in their workflow. Traditionally, producers hunt for short snippets of audio called samples, either pulling from their own collections or purchasing “sample packs” through marketplaces such as Splice. This can be a cumbersome process, and it’s not always easy to find exactly the sound you’re looking for. With AI and a text prompt as input, the process becomes much easier. A minimal sketch of what this looks like in code is shown below.
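To make the text-plus-duration workflow concrete, here is a minimal sketch using Meta’s MusicGen through the open-source audiocraft library. This is purely an illustrative stand-in: text-to-sample is not necessarily built on this model, and the prompt and duration values are just examples.

```python
# Minimal text-to-sample sketch using Meta's MusicGen via the audiocraft
# library (pip install audiocraft). This is an illustrative stand-in, not
# necessarily the model behind text-to-sample.com.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load a small pretrained checkpoint (downloads weights on first run).
model = MusicGen.get_pretrained("facebook/musicgen-small")

# The user-facing inputs: a text description and a duration in seconds.
prompt = "African drums"                 # example prompt
model.set_generation_params(duration=4)  # example duration

# Generate one waveform per prompt; the result has shape [batch, channels, samples].
wav = model.generate([prompt])

# Write the sample to disk so a DAW (or a plugin) can load it.
audio_write("african_drums", wav[0].cpu(), model.sample_rate, strategy="loudness")
```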
While building this project, I did a short survey of the AI landscape related to music.
A very brief history of music technology:
Technology has always played an important part in music creation as well as music listening. Going back before electricity, craftsmen made instruments through woodworking and other skills: pipes were used in organs; strings in pianos, violins, and cellos; and wood amplified the sound of drums. However, changes in technology have always been met with resistance and controversy, especially in music. A famous example, portrayed in the new Bob Dylan movie, is when “Dylan goes electric” (Electric Dylan controversy). In more recent history, autotune and synthesized instruments were first met with contempt by certain artists and listeners, but artists who embraced them went on to top the charts and win Grammys.
I think tech in music can be broken down into four distinct eras:
- Pre-electric: Before electricity, technology took the form of woodworking and other crafting techniques used to amplify sound, create different timbres and tones, and so on. Even amphitheaters and concert halls were engineered with acoustics in mind, to make the listening experience optimal.
- Analog/electric: Technology that made use of electricity gave way to great changes in music, from microphones and amplifiers to electric guitars, basses, and synthesizers. On the listening side, we had transistor radios, tape cassettes, and record players.
- The digital era: The computer brought even more radical technological changes to music. CD players, digital audio workstations (DAWs), sampling, and a plethora of algorithms opened up new ways of making music. Limits were broken, such as the length of a song and the number of distinct tracks in it. Digital algorithms gave rise to unique and controversial effects, such as autotune.
- The new era, the AI era: I think we are currently living through a new era of music technology, which I would call the AI music era. Machine learning models are enabling ways to make music that were impossible before, and lowering the skill level needed to make fully produced songs. Tools like Suno allow anybody who knows how to type to create full-length songs, while models like SoftVC VITS (sovits vocal transfer) let anyone sing a song and make their voice sound like somebody else (see the fake Drake rap song that went viral). Google's MusicLM (musicLM) and MusicFX (musicFX) are further examples of this new era of music creation.
For me, this new AI era is very exciting. As both a musician and a software engineer, I am eager to use these new tools to augment and speed up my workflow, and to achieve the creative visions I have in my mind. Just as I use Copilot and ChatGPT to help me create software, I think these AI models will greatly help me make music.
On the other hand, I suspect others may feel differently. Some might feel threatened by these new models, or think that they render the music inauthentic. Arguments that were used against electric guitars, autotune, and hip hop are sure to be used once again against these AI music models.
So what are the new models, and what can they do?
Currently, there are AI tools available for:
- Full song generation: Tools like Suno allow anybody to type in a prompt, even passing in lyrics and song structure, and have the AI create the entire song - yes, even the singing. Riffusion is another AI model that can generate full songs from text prompts, and YuE, developed by a Chinese lab, also offers full song generation (Link).
- Stem splitting: This lets you take an existing song and split the audio into distinct tracks, such as vocals, guitar, drums, and bass. It is useful for musicians who want to take parts of other songs and mix them into something new. Demucs by Meta is an AI model for stem splitting (a minimal usage sketch appears after this list), while Lalala.ai and Audioshake are online services that also provide stem splitting.
- Vocal synthesis: This lets you generate new audio that sounds like someone singing. You can write lyrics, as well as a melody, and have the AI generate audio that matches them. You can even specify how you want the voice to sound. Google has also presented models capable of vocal synthesis, such as googlesingsong.
- Instrumental generation/sample generation: AI models can generate new instrumental sounds and samples from text prompts. For example, Stability AI's Stable Audio 2.0 (Stability AI, stable audio 1.0) lets users create samples of various lengths and styles by simply typing a text prompt. My own project, text-to-sample (web app, blog post), also falls into this category, allowing users to generate specific sound samples for use in DAWs. Meta's AudioGen (meta audiogen) is another example of a model focused on generating audio samples from text.
- Infilling: AI can be used to fill in missing sections of an existing audio track. This could be useful for restoring damaged recordings or extending existing musical ideas. For example, if a section of a drum track is missing, an AI model could be used to generate a realistic drum fill that seamlessly integrates with the rest of the song.
- Song continuation: AI models can analyze an existing song and generate a continuation, creating a longer piece of music. Meta's AudioGen also offers song continuation capabilities (Link); a continuation sketch appears after this list.
- Tonal transfer: This refers to the ability of AI to transfer the tonal characteristics of one piece of music to another. For example, you could take a simple melody and have an AI model render it in the style of Bach or Jimi Hendrix. So-vits-svc (sovits singing voice conversion) is an AI model specifically for singing voice conversion, enabling tonal transfer for vocals.
- Mastering: AI is starting to be used in music mastering, the final stage of audio production that optimizes the overall sound of a track for playback on different systems. AI mastering tools can automatically adjust levels, EQ, and compression to achieve a professional and polished sound. Kits AI (kits ai), LANDR (landr), and CryoMix (cryomix) are examples of AI mastering services available.
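As a concrete example of the stem-splitting workflow described above, here is a small sketch that calls Demucs from Python. It assumes Demucs is installed (pip install demucs), and the filename “song.mp3” and the chosen options are placeholders, not a recommendation from the Demucs project.

```python
# Stem-splitting sketch using Meta's Demucs (pip install demucs).
# "song.mp3" is a placeholder for your own track.
import demucs.separate

# Split the track into two stems: vocals and everything else ("no_vocals").
# Results are written under ./separated/<model_name>/song/ as audio files.
demucs.separate.main(["--mp3", "--two-stems", "vocals", "song.mp3"])

# Dropping --two-stems gives the default four stems:
# drums, bass, vocals, and other.
# demucs.separate.main(["--mp3", "song.mp3"])
```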
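And for song continuation, here is a sketch of extending an existing clip. The list item above cites AudioGen; this sketch uses the sibling MusicGen model from the same audiocraft library as an illustrative stand-in, and the file name, duration, and text description are all just example values.

```python
# Song-continuation sketch using MusicGen from Meta's audiocraft library.
# "riff.wav" is a placeholder for an existing clip you want to extend.
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=12)  # generation length parameter, in seconds

# Load the existing audio to use as the prompt for the continuation.
prompt_wav, prompt_sr = torchaudio.load("riff.wav")

# Generate audio that picks up where the prompt leaves off, optionally
# steered by a text description of where the music should go.
continued = model.generate_continuation(
    prompt_wav, prompt_sr, descriptions=["laid-back funk groove"]
)

audio_write("riff_continued", continued[0].cpu(), model.sample_rate, strategy="loudness")
```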
What Comes Next?
Just like other areas of AI, I believe we will continue to see extremely rapid development in this space. Market analysis projects the AI-in-music market to reach $2.8 billion by 2031, with a compound annual growth rate of 30.2% from 2022 to 2031 (market data on ai music). I predict we will soon see songs hitting the top charts that use AI-generated music in some form, whether it’s a sample, a vocal style transfer, or the mastering.
That being said, I don’t believe AI will replace the artist. Tools like Suno will surely be used to replace music that is primarily functional, like elevator music and background music in commercials. Purists will continue to ignore these tools, and will likely continue to be successful in certain niches. However, the vast majority of music will no doubt use these tools in one way or another, just as the vast majority of code is now written with the help of LLMs, and almost all graphic design uses Photoshop.
If the progress of LLMs, image generation models, and video models offers any hint, we will continue to see audio generation models improve as datasets become more refined and model parameter counts grow. At some point, AI-generated music will be nearly indistinguishable from traditionally created music.
Perhaps, and this might be wishful thinking, the rise of AI-generated media will push people to desire more authentic forms of media and to appreciate real art. As a musician, I hope this leads to more appreciation for live music, and more space for it.