Building a text-to-ambient-sound generator with AudioLDM2
This is post #1 in a series about building AmbientGen, a text-to-ambient-sound generator.
I want to type “rain on a tin roof with distant thunder and crickets” and get back an audio file that sounds like exactly that. Not a recording pulled from a library — a generated soundscape that never existed before.
This is a real product category now. Apps like myNoise and Noisli have millions of users who want ambient soundscapes for focus, sleep, or relaxation. But they all rely on pre-recorded loops. What if you could describe any atmosphere and have it created on the fly?
To build this, I need a model that takes text and produces audio. I chose AudioLDM2. This post is about what it is, how it works, and why I picked it over the alternatives.
If you’ve been following AI image generation — Stable Diffusion, DALL-E, Midjourney — you might think: “surely the same approach works for audio?” And broadly, yes. But there are some important differences.
Images are spatial. Audio is temporal. A 10-second clip at 16kHz is 160,000 samples — that’s a lot of raw data to generate. And unlike images where “a dog sitting on a couch” has a fairly obvious visual interpretation, the connection between language and sound is often less direct. What does “cozy” sound like? What does “distant” mean in audio terms?
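To put numbers on that, a quick back-of-the-envelope comparison. The image size is the standard Stable Diffusion resolution; the latent shape is a made-up illustration of the compression idea, not AudioLDM2's actual latent:

```python
# 10 seconds of mono audio at 16 kHz, as stated above
raw_samples = 16_000 * 10
print(raw_samples)  # 160000

# For scale: a 512x512 RGB image, the kind Stable Diffusion generates
image_values = 512 * 512 * 3
print(image_values)  # 786432

# Latent diffusion works in a much smaller compressed space,
# e.g. a hypothetical (8, 250, 16) latent:
latent_values = 8 * 250 * 16
print(latent_values)  # 32000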
There’s also a data problem. We have billions of image-text pairs scraped from the internet (every image has an alt tag or caption). Audio-text pairs are much scarcer. Most audio datasets have simple labels like “dog barking” rather than rich descriptions like “a small dog barking excitedly in a large echoey hallway.”
So the challenge is: generate high-dimensional temporal data from text, with relatively limited paired training data. AudioLDM2’s clever trick is finding a way around this data scarcity.
If you understand Stable Diffusion for images, you’re 70% of the way to understanding AudioLDM2. The key insight of latent diffusion is: don’t generate in pixel/sample space — generate in a compressed latent space.
Here’s the pipeline:
Text → [Text Encoders] → [Diffusion U-Net in latent space] → [VAE Decoder] → [Vocoder] → Audio
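The pipeline can be sketched as a chain of shape transformations. Everything below is a stand-in: the function bodies do no real work, and every shape is an illustrative assumption, not AudioLDM2's actual dimensions:

```python
import numpy as np

def text_encoders(prompt: str) -> np.ndarray:
    # CLAP + FLAN-T5 would produce text embeddings; fake a (seq, dim) array
    return np.random.randn(16, 512)

def diffusion_unet(cond: np.ndarray, steps: int = 50) -> np.ndarray:
    # Denoise a compressed "image" of the mel-spectrogram, shape (ch, T, F)
    latent = np.random.randn(8, 250, 16)
    for _ in range(steps):
        latent = latent - 0.01 * latent  # stand-in for one denoising step
    return latent

def vae_decoder(latent: np.ndarray) -> np.ndarray:
    # Upsample the latent back to a full mel-spectrogram (frames, mel bins)
    return np.random.randn(1000, 64)

def vocoder(mel: np.ndarray) -> np.ndarray:
    # A HiFi-GAN-style vocoder turns mel frames into waveform samples
    return np.random.randn(160_000)

audio = vocoder(vae_decoder(diffusion_unet(text_encoders("rain on a tin roof"))))
print(audio.shape)  # (160000,)
```

The point of the sketch: only the vocoder ever touches the 160,000-sample waveform; the expensive diffusion loop runs entirely in the small latent space.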
The diffusion process itself is the same as in image generation: gradually remove noise, step by step, with the text conditioning telling the model what to generate at each step. If you want to understand diffusion models deeply, the DDPM paper is the reference, but for now the intuition is: the model learns to reverse a noising process, and at generation time you give it noise and let it “imagine” its way to clean audio.
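That intuition can be shown numerically on a toy 1-D signal (not real audio). The forward noising process has a closed form, and if you knew the injected noise exactly, one algebraic step would recover the clean signal; the trained U-Net's whole job is to predict that noise:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
betas = np.linspace(1e-4, 0.02, T)          # noise schedule
alphas_bar = np.cumprod(1.0 - betas)        # cumulative product, abar_t

x0 = np.sin(np.linspace(0, 2 * np.pi, 64))  # a "clean" signal

# Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
t = T - 1
eps = rng.standard_normal(x0.shape)
xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1 - alphas_bar[t]) * eps
# At t = T-1 the signal is nearly pure noise; generation runs this in
# reverse, with a trained network predicting eps at each step.

# An "oracle" reverse step: knowing eps exactly recovers x0
x0_hat = (xt - np.sqrt(1 - alphas_bar[t]) * eps) / np.sqrt(alphas_bar[t])
print(np.allclose(x0_hat, x0))  # True
```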
Here’s where it gets interesting. AudioLDM (v1) used CLAP embeddings to condition the diffusion model. CLAP — which I’ll explain in a moment — basically gives you a way to say “this text and this audio mean the same thing.” That worked, but it was limited.
AudioLDM2 introduces a concept called “Language of Audio” (LOA) — a shared intermediate representation that any type of audio can be mapped to. The LOA is based on AudioMAE (Audio Masked Autoencoder), a self-supervised model trained on large amounts of unlabeled audio.
Why does this matter? Because AudioMAE learns from audio alone — no text labels needed. This means the model can learn what audio sounds like from huge unlabeled datasets, sidestepping the scarcity of rich text-audio pairs described above.
The generation process works in two stages:
Stage 1: Text → GPT-2 → LOA features (translating from language to the “language of audio”)
Stage 2: LOA features → Latent Diffusion Model → Audio (generating audio conditioned on LOA)
This two-stage approach is the key innovation. By having an intermediate audio representation (LOA), the model separates “understanding what to generate” from “actually generating it.” And because the LOA is learned from raw audio, the generation stage can be trained self-supervised.
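To make that data advantage concrete, here's a toy sketch of how stage 2's training pairs can be mined from unlabeled audio. `audiomae_encode` is a hypothetical stand-in for the real AudioMAE encoder, and all shapes are made up:

```python
import numpy as np

def audiomae_encode(waveform: np.ndarray) -> np.ndarray:
    # The real AudioMAE encodes mel-spectrogram patches; fake a
    # fixed-size (tokens, dim) feature sequence.
    return np.random.randn(32, 768)

# Any unlabeled clip becomes a (conditioning, target) training pair:
unlabeled_clips = [np.random.randn(160_000) for _ in range(3)]
training_pairs = [(audiomae_encode(clip), clip) for clip in unlabeled_clips]

# Only stage 1 (text -> LOA) needs scarce text-audio pairs;
# stage 2 can train on pairs like these, mined from raw audio at scale.
print(len(training_pairs))  # 3
```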
I mentioned CLAP above. Let me explain it properly because it’s a foundational piece.
CLAP stands for Contrastive Language-Audio Pretraining. If you know CLIP (for images), CLAP is the audio equivalent. The idea is simple and powerful: train a text encoder and an audio encoder jointly so that matching text-audio pairs land close together in a shared embedding space, while mismatched pairs are pushed apart.
In AudioLDM (v1), CLAP embeddings were the main way to condition the diffusion model — the text got encoded by CLAP’s text encoder, and this embedding guided the audio generation.
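A toy numerical sketch of the contrastive idea. The embeddings below are fabricated to behave the way trained ones do; real CLAP learns them from large paired datasets:

```python
import numpy as np

rng = np.random.default_rng(1)

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Pretend the two encoders map each caption and its matching clip
# near the same point in a shared 128-d space:
base = rng.standard_normal((3, 128))
text_emb = normalize(base + 0.1 * rng.standard_normal((3, 128)))
audio_emb = normalize(base + 0.1 * rng.standard_normal((3, 128)))

sim = text_emb @ audio_emb.T   # (3, 3) cosine similarity matrix
matches = sim.argmax(axis=1)   # best-matching audio for each caption
print(matches.tolist())        # each caption retrieves its own clip
```

This retrieval behavior is exactly what makes CLAP embeddings useful as conditioning: "close in embedding space" approximates "means the same thing".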
In AudioLDM2, CLAP is still part of the picture but it’s joined by FLAN-T5 (a large language model) as an additional text encoder. Both feed into the GPT-2 module that produces LOA features. The multiple text encoders give the model a richer understanding of the text prompt — CLAP provides audio-semantic alignment while FLAN-T5 provides deeper language understanding.
I considered several alternatives:
AudioGen (Meta): Autoregressive model — generates audio token by token, like GPT generates text. Good for sound effects but slower at inference and less flexible for long-form ambient generation.
Stable Audio Open (Stability AI): Very capable, especially for music. Uses a DiT (diffusion transformer) architecture with timing conditioning. It’s newer and arguably higher quality, but more oriented toward music production than ambient sound generation.
Make-An-Audio 2: Interesting approach with temporal enhancement, but less community support and harder to get running.
I picked AudioLDM2 because it's oriented toward general sound and ambient generation rather than music, it's fast enough at inference to iterate with, and it has solid community support that makes it easy to get running.
I might benchmark it against Stable Audio later. But for getting started and learning the fundamentals, AudioLDM2 is the right choice.
In the next post, I’ll set up the environment, run AudioLDM2 for the first time, and generate my first ambient sounds. I’ll try different prompts and start building intuition for what works and what doesn’t.
The goal: go from paper to sound in one session.
Paper: AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining (Liu et al., 2023)
Also referenced:
Denoising Diffusion Probabilistic Models (Ho et al., 2020)
CLAP: Learning Audio Concepts from Natural Language Supervision (Elizalde et al., 2023)
Masked Autoencoders that Listen (Huang et al., 2022)