To compile ideas for the next fancy tool on your music journey, we have conducted a state-of-the-art review of how researchers currently propose to use AI in the field of music and audio.
Compared to NLP or computer vision, the integration of AI and ML appears less advanced in music and audio processing. Nonetheless, recent developments suggest substantial interest, in particular in generative AI. This can be observed in several recent successful launches such as Stable Audio, Soundry AI, Udio, and Suno. However, there are many more applications of AI in this field. In this blog post, we have assembled a list of application areas where AI might be used, or is already being used under the hood, in audio processing.
Some of the following AI use cases are quite well-known or even commonly recognized by the general public; some might be new to you. We also searched for available software implementing the relevant tasks to show which fields have matured to a deployable level. Of course, in many cases it is not clear whether these applications internally use AI, hand-crafted algorithms, or a mix of both.
Here’s our list of AI application areas related to music or audio:
- Automatic Music Transcription / Recognition: This technology converts a music audio signal into a corresponding symbolic notation, such as MIDI. It utilizes algorithms to identify pitches, rhythms, and sometimes even instrumentation from recorded music. Automatic music transcription (AMT) is crucial for various applications including music education, archiving, and retrieval.
Existing software: Klangio AI, AnthemScore
- Sheet Music Difficulty Estimation: This topic involves developing methods to assess the difficulty level of sheet music. Algorithms analyze the musical content, such as note density, rhythmic complexity, and technical requirements, to provide educators and musicians with an objective measure of a piece’s difficulty. This aids in appropriate repertoire selection based on skill level.
- Music Structure Analysis / Formal Analysis: Music structure analysis examines the layout of a piece of music to understand and map its form and structure. This includes identifying repeated sections, variations, and the overall arrangement of themes. It’s essential for musicologists, composers, and performers to interpret compositions and enhance performance strategies.
- Optical Music Recognition (OMR): Scans printed or handwritten sheet music and converts it into editable or playable formats like MIDI or MusicXML. This facilitates the digital preservation of music scores, easier transposition, and integration with various music software.
Existing software: E.g., Maestria, Soundslice
- (Symbolic) Music Generation / Text-to-Music: Symbolic music generation involves creating new music in symbolic form (like MIDI) using algorithms, often based on specific rules or learned musical styles. Text-to-music extends this by converting descriptive text into music, reflecting the mood, tempo, and instrumentation suggested by the text.
Existing software: AIVA, MuseNet
- Generation of Choir Voices: Focuses on synthesizing realistic choir sounds using digital audio technology. It combines elements of voice synthesis and acoustic modeling, creating rich, layered choir effects from single or multiple vocal inputs.
- Music Performance Analysis: Analyzing musical performance involves studying audio recordings or live performances to evaluate timing, dynamics, and expression. This helps in understanding performance practices. In education, developers use it to enhance musical training and feedback systems.
Existing software: Practice Bird’s IPM, Yousician
- Music Source Separation: This process involves extracting individual sounds or instruments from a mixed audio track. It is fundamental in music editing, remixing, and also in educational contexts where students can practice with individual components of a composition.
Existing software: iZotope RX, lalal.ai
- Speaker Diarization: Partitions a dialogue into segments according to who speaks when. It has substantial applications in the automatic transcription of dialogues.
Existing software: E.g., Google Cloud Speech-to-Text
- Audio Classification / Music Classification / Music Auto-Tagging / Music Genre Recognition: These processes involve analyzing audio tracks to categorize them by genre, mood, instrumentation, or other metadata. This is crucial for music recommendation systems, digital libraries, and for organizing large music collections.
- Mechanical Fault Diagnosis Based on Audio Signal Analysis (MFDA): Detects mechanical failures in machinery through the analysis of audio signals. By identifying unusual sounds or vibrations, it helps in preventive maintenance and fault detection without disassembly. This task therefore constitutes a typical anomaly detection problem, which might be approached using autoencoder-based architectures (see the sketch after this list).
- Content-based Retrieval: This technique allows users to find multimedia content (like audio or video) based on the content itself rather than metadata. It utilizes features extracted from the content, such as tempo, melody, or harmony in music, to facilitate the search.
Existing software: SoundHound
- Voice Cloning: Voice cloning technology creates digital replicas of a person’s voice from audio samples. AI engineers might use this technique in personalized virtual assistants, accessibility applications, and entertainment.
Existing software: Respeecher, Descript
- Audio DeepFake Detection: Aimed at exposing voice cloning and similar manipulations, this is the identification and mitigation of artificially generated or manipulated audio clips designed to mimic real recordings. It’s critical in combating misinformation and ensuring the authenticity of communication.
Existing software: Pindrop
- Audio Inpainting: Similar to filling in missing parts of an image, audio inpainting restores or reconstructs missing or corrupted parts of an audio signal. It can thereby improve the listening experience or aid in audio restoration projects.
Existing software: Udio
- Audio Super Resolution: Enhances the resolution of an audio signal, improving the clarity and detail of low-quality recordings.
- Audio Denoising / Audio Declipping: These techniques are used to clean up audio recordings by removing noise and repairing distorted sounds. They are crucial for post-production in music and film, as well as in forensic audio analysis.
- Polyphonic Audio Editing: Involves editing audio that contains multiple sounds or voices at the same time, allowing for complex adjustments and manipulations that can enhance the overall audio experience.
Existing software: Melodyne Studio
- Speech Synthesis & Speech-to-Text: Speech synthesis, also known as Text-to-Speech (TTS), converts text to spoken voice output, while speech-to-text (STT) does the reverse, transcribing spoken language into written text. These technologies are fundamental in accessibility tools, virtual assistants, and automated transcription services.
Existing software: E.g., Google Text-to-Speech
- Language Identification: Identifies the language spoken in an audio clip. This task is essential for multilingual applications, automated translation services, and global communication platforms.
- Emotion Recognition: This field involves analyzing vocal expressions to detect emotional states. It’s used in customer service, security systems, and health diagnostics to assess and respond to human emotions effectively.
Existing software: Beyond Verbal
- Sound Event Localization and Detection: Localizes and identifies specific sounds within an audio environment. This is useful in surveillance, wildlife monitoring, and smart home systems.
Existing software: Audio Analytic
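To make the autoencoder idea from the MFDA item above concrete, here is a minimal PyTorch sketch. It assumes the machine sounds have already been converted into fixed-size spectrogram feature vectors; all layer sizes and names are illustrative rather than taken from a specific system:

```python
# Minimal sketch: autoencoder-based anomaly detection for machine sounds.
import torch
import torch.nn as nn

class SpectrogramAutoencoder(nn.Module):
    def __init__(self, n_features: int = 640, latent_dim: int = 8):
        super().__init__()
        # Compress the flattened spectrogram patch into a small latent code ...
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # ... and reconstruct the input from that code.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, n_features),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

def anomaly_score(model: SpectrogramAutoencoder, x: torch.Tensor) -> torch.Tensor:
    # Reconstruction error serves as the anomaly score: a model trained only on
    # healthy-machine recordings reconstructs normal sounds well, faulty ones poorly.
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=-1)
```

In practice, such a model is trained to reconstruct only recordings of healthy machines; at inference time, inputs whose reconstruction error exceeds a threshold calibrated on held-out healthy data are flagged as potential faults.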
On the technological side, as in other domains of machine learning, deep learning clearly dominates these fields. The architectural principles of the deployed models are quite similar to those known from computer vision: VAEs, transformers, diffusion models, and flow matching models are all being used, in particular for generative tasks. However, as opposed to vision tasks, in the audio domain we typically use spectrogram representations as inputs and, when needed, employ phase reconstruction at the output to create a realistic audio signal, as the sketch below illustrates.
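As a minimal sketch of this round trip, the following snippet uses librosa to convert a waveform into a mel spectrogram (discarding phase) and then reconstructs a plausible phase with Griffin-Lim to obtain audio again. The synthetic sine wave merely stands in for a real recording, and all parameter values are illustrative:

```python
import numpy as np
import librosa

# Stand-in for a real recording: two seconds of a 440 Hz sine tone.
sr = 22050
t = np.linspace(0, 2.0, int(2.0 * sr), endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 440.0 * t)

# Analysis: magnitude-only mel spectrogram; the phase is discarded here.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=128)
log_mel = librosa.power_to_db(mel)  # models typically see the log-compressed version

# Synthesis: invert the mel spectrogram; Griffin-Lim iteratively estimates
# a plausible phase so that the output is an audible waveform again.
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024,
                                             hop_length=256)
print(log_mel.shape, y_hat.shape)
```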
There are substantial differences in the amount of research effort that has gone into these application areas to date. For instance, researchers from Spotify and Magenta (Google AI) have recently developed powerful models in the AMT field that reliably process audio recordings of various instruments. The TTS and STT domains have also seen significant advancements recently: the synthesis of natural-sounding voices and the reliable transcription of voice recordings (e.g., using OpenAI’s Whisper, as sketched below) have become possible.
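To give an impression of how accessible STT has become, transcribing a recording with the open-source openai-whisper package takes only a few lines. Note that the file name below is a placeholder and that the package additionally requires ffmpeg to be installed:

```python
import whisper  # pip install openai-whisper

# Load one of the pretrained Whisper checkpoints; weights are downloaded on first use.
model = whisper.load_model("base")

# Transcribe a speech recording ("recording.mp3" is a placeholder path).
result = model.transcribe("recording.mp3")
print(result["text"])  # the full transcript as a single string
```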
What does the future hold in the field of music and audio? On the one hand, the above list contains quite a few tasks using generative techniques, and generative music might become a significant application area in industries such as film. On the other hand, we see a convergence of different modalities: for instance, OpenAI’s latest GPT-4o model accepts textual, visual, and acoustic inputs. The growing reliance on foundation models across deep learning suggests that transfer learning could play a crucial role in enhancing predictive tasks in audio technology. Even for less popular tasks, it’s therefore likely that we can improve accuracy while minimizing the need for training data.