Building a text-to-ambient-sound generator with AudioLDM2
A curated collection of key papers and resources in generative AI for audio and music. Papers I've read in depth are marked "My notes" in the Status column and linked to my blog post about them.
| Paper | Year | Key Idea | Status |
|---|---|---|---|
| AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining | 2023 | Latent diffusion for audio with a shared "Language of Audio" representation | My notes |
| AudioLDM: Text-to-Audio Generation with Latent Diffusion Models | 2023 | First latent diffusion model for text-to-audio, uses CLAP | To read |
| Make-An-Audio 2 | 2023 | Temporal-enhanced text-to-audio with LLM-augmented captions | To read |
| AudioGen: Textually Guided Audio Generation | 2022 | Autoregressive audio generation from Meta | To read |
| Stable Audio Open | 2024 | Latent diffusion with timing conditioning from Stability AI | To read |

| Paper | Year | Key Idea | Status |
|---|---|---|---|
| CLAP: Learning Audio Concepts from Natural Language Supervision | 2022 | Contrastive learning to align audio and text (like CLIP for audio) | To read |
| Audio Spectrogram Transformer (AST) | 2021 | Pure attention model for audio classification | To read |
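
The "like CLIP for audio" idea in the CLAP row can be sketched as a symmetric contrastive loss over a batch of paired audio and text embeddings. This is a minimal pure-Python illustration, not the CLAP implementation: the function names, the toy 2-D embeddings, and the temperature of 0.07 are assumptions chosen for the example.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Average of audio-to-text and text-to-audio cross-entropy.

    Row i of audio_embs is assumed to be paired with row i of text_embs.
    The temperature value is an illustrative assumption.
    """
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    n = len(a)
    # logits[i][j] = scaled cosine similarity between audio i and text j
    logits = [[sum(x * y for x, y in zip(a[i], t[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Audio-to-text: each row should put its mass on the diagonal entry
    loss_a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-audio: same thing over the columns
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2a = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_a2t + loss_t2a)

# Toy batch: correctly paired embeddings vs. the same texts in swapped order
audio = [[1.0, 0.0], [0.0, 1.0]]
matched = symmetric_contrastive_loss(audio, [[1.0, 0.1], [0.1, 1.0]])
shuffled = symmetric_contrastive_loss(audio, [[0.1, 1.0], [1.0, 0.1]])
```

The loss is low when each audio clip is most similar to its own caption and high when the pairing is wrong, which is what pushes the two encoders toward a shared embedding space.
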

| Paper | Year | Key Idea | Status |
|---|---|---|---|
| MusicGen: Simple and Controllable Music Generation | 2023 | Single-stage transformer for music from Meta | To read |
| MusicLM: Generating Music From Text | 2023 | Hierarchical music generation from Google | To read |
| Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion | 2024 | U-Net-based latent diffusion with timing control | To read |

| Paper | Year | Key Idea | Status |
|---|---|---|---|
| XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model | 2024 | Multilingual TTS with voice cloning (Coqui) | To read |
| Bark | 2023 | GPT-style text-to-audio with speech, music, sound effects | To read |
| StyleTTS 2 | 2023 | Diffusion-based style modeling for natural TTS | To read |

| Paper | Year | Key Idea | Status |
|---|---|---|---|
| Denoising Diffusion Probabilistic Models (DDPM) | 2020 | The foundational diffusion model paper | To read |
| High-Resolution Image Synthesis with Latent Diffusion Models | 2022 | Latent Diffusion (Stable Diffusion); the same principle is used in AudioLDM | To read |
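
As a quick reminder of what the two foundation papers share, here is a minimal sketch of the DDPM forward (noising) process, $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, using the linear beta schedule from the DDPM paper (1e-4 to 0.02 over 1000 steps). In a latent diffusion model, `x0` would be a latent from the autoencoder rather than raw data; the function names and the toy 4-value `x0` here are assumptions for illustration only.

```python
import math
import random

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1 .. beta_T."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t directly from x_0 in one step; also return the noise used."""
    ab = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e
          for x, e in zip(x0, noise)]
    return xt, noise

alpha_bars = cumulative_alphas(linear_beta_schedule())
rng = random.Random(0)          # seeded so the sketch is reproducible
x0 = [0.5, -0.25, 1.0, 0.0]     # stand-in for a (latent) data vector
x_early, eps_early = q_sample(x0, 0, alpha_bars, rng)    # almost clean
x_late, eps_late = q_sample(x0, 999, alpha_bars, rng)    # almost pure noise
```

During training, the denoising network is given `x_t` and `t` and asked to predict the returned noise; sampling then runs the process in reverse, starting from pure noise.
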
Last updated: February 2025