AmbientGen

Building a text-to-ambient-sound generator with AudioLDM2

View the Project on GitHub my-sonicase/ambientgen

📚 Papers & Reading List

A curated collection of key papers and resources in generative AI for audio and music. Papers I’ve read in depth are marked with ✅ and linked to my blog post about them.

Text-to-Audio Generation

Paper	Year	Key Idea	Status
AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining	2023	Latent diffusion for audio with “Language of Audio” shared representation	✅ My notes
AudioLDM: Text-to-Audio Generation with Latent Diffusion Models	2023	First latent diffusion model for text-to-audio, uses CLAP	📋 To read
Make-An-Audio 2	2023	Temporal-enhanced text-to-audio with LLM-augmented captions	📋 To read
AudioGen: Textually Guided Audio Generation	2022	Autoregressive audio generation from Meta	📋 To read
Stable Audio Open	2024	Latent diffusion with timing conditioning from Stability AI	📋 To read

Audio Understanding & Representation

Paper	Year	Key Idea	Status
CLAP: Learning Audio Concepts from Natural Language Supervision	2022	Contrastive learning to align audio and text (like CLIP for audio)	📋 To read
Audio Spectrogram Transformer (AST)	2021	Pure attention model for audio classification	📋 To read

Text-to-Music

Paper	Year	Key Idea	Status
MusicGen: Simple and Controllable Music Generation	2023	Single-stage transformer for music from Meta	📋 To read
MusicLM: Generating Music From Text	2023	Hierarchical music generation from Google	📋 To read
Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion	2024	U-Net-based latent diffusion with timing conditioning (the DiT backbone arrived in later Stable Audio models)	📋 To read

Voice & Speech Synthesis

Paper	Year	Key Idea	Status
XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model	2024	Multilingual TTS with voice cloning (Coqui)	📋 To read
Bark	2023	GPT-style text-to-audio with speech, music, sound effects	📋 To read
StyleTTS 2	2023	Diffusion-based style modeling for natural TTS	📋 To read

Foundational (Diffusion Models)

Paper	Year	Key Idea	Status
Denoising Diffusion Probabilistic Models (DDPM)	2020	The foundational diffusion model paper	📋 To read
High-Resolution Image Synthesis with Latent Diffusion Models	2022	Latent Diffusion (Stable Diffusion) — same principle used in AudioLDM	📋 To read

🔗 Other Resources

HuggingFace Audio Course — free course on audio ML
Valerio Velardo - Audio Signal Processing for ML — YouTube series on audio fundamentals
Papers With Code - Audio Generation — benchmarks and leaderboards
DCASE Challenge — Detection and Classification of Acoustic Scenes

Last updated: February 2025
