I don’t have to learn guitar to impress people, thanks to Meta’s new generative AI tool…
Imagine a world where professional musicians can explore new compositions without playing a single note, or indie game developers can populate their virtual worlds with realistic sound effects on a shoestring budget.
This is the world that AudioCraft promises to create. AudioCraft is a simple framework that generates high-quality, realistic audio and music from text-based user inputs after training on raw audio signals. It consists of three models: MusicGen, AudioGen, and EnCodec, each serving a unique purpose in the generation of audio.
The Three Pillars of AudioCraft: MusicGen, AudioGen, and EnCodec
MusicGen, trained on Meta-owned and specifically licensed music, generates music from text-based user inputs. In contrast, AudioGen, which was trained on public sound effects, generates environmental audio from text-based user inputs.
EnCodec, a neural audio codec, compresses raw audio into discrete tokens that the generation models work with; an improved version of its decoder allows for higher-quality music generation with fewer artifacts. These models are available for research purposes and to further people’s understanding of the technology, giving researchers and practitioners access to train their own models on their own datasets.
The Challenges and Solutions in Audio Generation
Generating high-fidelity audio of any kind requires modelling complex signals and patterns at varying scales. Music, with its local and long-range patterns, is arguably the most challenging type of audio to generate.
Traditional approaches using symbolic representations like MIDI or piano rolls often fail to fully grasp the expressive nuances and stylistic elements found in music. AudioCraft addresses these challenges by leveraging self-supervised audio representation learning and a number of hierarchical or cascaded models to generate music.
The Promise of AudioCraft
AudioCraft promises to simplify the overall design of generative models for audio compared to prior work in the field. It works for music and sound generation and compression — all in the same place.
It’s easy to build on and reuse, allowing people who want to build better sound generators, compression algorithms, or music generators to do it all in the same code base and build on top of what others have done. The models are designed to be easily extended and adapted to various use cases for research, offering nearly limitless possibilities.
The Future of AudioCraft
The team behind AudioCraft is continuously working on improving the models by boosting their speed and efficiency from a modelling perspective and by improving how the models can be controlled. This will open up new use cases and possibilities.
The simple approach developed to generate robust, coherent, and high-quality audio samples is expected to have a meaningful impact on the development of advanced human-computer interaction models that incorporate auditory and multi-modal interfaces.
The Importance of Open Source and Responsible AI
Open-sourcing the research and resulting models helps ensure that everyone has equal access. By sharing the code for AudioCraft, the team hopes other researchers can more easily test new approaches to limit or eliminate potential bias in and misuse of generative models.
The team recognizes the importance of being open about their work so the research community can build on it and continue the important conversations about how to build AI responsibly.
Conclusion
AudioCraft is an important step forward in generative AI research. It’s not just a tool for musicians and sound designers but a stepping stone towards a future where generative AI could help people vastly improve iteration time by allowing them to get feedback faster during the early prototyping stages.
Whether you’re a large AAA developer building worlds for the metaverse, a musician working on your next composition, or a small or medium-sized business owner looking to up-level your creative assets, AudioCraft is set to revolutionize the way we interact with audio.
To learn more about Future Tech, make sure to follow.
My Blog — Future technology simplified to read and implement.
My LinkedIn — Short form opinions on future tech.