Building a text-to-ambient-sound generator with AudioLDM2
Building AmbientGen — a text-to-ambient-sound generator. This is post #3.
A quick correction from Day 2. I initially reported that AudioLDM2’s output quality was poor — lots of artifacts, distortion, disappointing results. Turns out the problem was largely me, not the model.
I was using 50 inference steps and a guidance scale of 7.0. The settings that actually work well:
num_waveforms_per_prompt: didn’t make a practical difference on my setup
Lesson learned: read the documentation before concluding a model doesn’t work. The guidance scale was the biggest culprit: at 7.0 the model was essentially “trying too hard” and producing overcooked, distorted audio. At 3.0, things smoothed out considerably.
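For reference, here is a minimal sketch of how settings like these could be passed to the `AudioLDM2Pipeline` from Hugging Face `diffusers`. `GenSettings` and `generate` are hypothetical names for illustration, not code from the experiment notebook. The step count is deliberately left without a default so you have to choose it yourself, and the clip length is an assumption, since this post doesn't discuss it.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class GenSettings:
    """Sampling settings for one AudioLDM2 call (hypothetical helper)."""
    num_inference_steps: int            # no default on purpose: choose it for your setup
    guidance_scale: float = 3.0         # 7.0 sounded overcooked; 3.0 smoothed things out
    audio_length_in_s: float = 10.0     # assumption: clip length is not discussed here
    num_waveforms_per_prompt: int = 1   # made no practical difference on my setup

    def __post_init__(self):
        if self.num_inference_steps <= 0:
            raise ValueError("num_inference_steps must be positive")
        if self.guidance_scale <= 0:
            raise ValueError("guidance_scale must be positive")


def generate(prompt: str, settings: GenSettings, seed: int = 0):
    """Run AudioLDM2 once and return the first generated waveform."""
    # Imported lazily so the settings object works without the heavy dependencies.
    import torch
    from diffusers import AudioLDM2Pipeline

    # Loading the pipeline on every call is slow; a real app would load it once.
    pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2")
    generator = torch.Generator().manual_seed(seed)
    return pipe(prompt, generator=generator, **asdict(settings)).audios[0]
```

Keeping the settings in one frozen object makes it harder to drift back to a bad guidance scale halfway through a batch of experiments.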
With that fixed, I ran a systematic set of experiments to understand how prompts affect audio generation quality.
I tested three sound categories — water, wind, and fire — with prompts ranging from one word to a full descriptive sentence.
Water:
| Prompt | Result |
|---|---|
| “water” | Vague, ambiguous — could be a faucet, rain, anything |
| “water flowing” | Slightly better, but still generic |
| “a stream of water flowing over rocks” | Clear improvement — spatial context helps |
| “a clear mountain stream flowing over smooth rocks in a quiet forest” | Best result. The forest context added a convincing background |
Wind:
| Prompt | Result |
|---|---|
| “wind” | Generic white noise |
| “wind blowing” | Still generic |
| “wind blowing through trees” | Noticeable texture, more natural |
| “gentle wind blowing through pine trees on a mountain, leaves rustling” | Best, but not dramatically better than the 3-word version |
Fire:
| Prompt | Result |
|---|---|
| “fire” | Surprisingly decent even with one word |
| “fire crackling” | Good — “crackling” is a very specific acoustic descriptor |
| “a campfire crackling at night” | Good, similar to above |
| “a warm campfire crackling and popping at night with crickets in the background” | Best — the crickets added a nice high-frequency complement |
Key finding: There’s a sweet spot of specificity. The biggest quality jump happens when you add spatial context (where the sound happens) and acoustic descriptors (crackling, flowing, rustling). Going beyond 3-4 meaningful elements doesn’t improve things much.
Fire was interesting because it worked well even with short prompts. My theory: “crackling” maps to a very specific spectral pattern in the training data, while “water” is hugely ambiguous.
I took a mediocre-sounding prompt (“rain falling on a window”) and prepended different quality modifiers to see which ones improve the output.
Results, roughly ranked:
Why do these work? The model learned from audio datasets that come with metadata and descriptions. Audio tagged as “field recording” in the training data is likely to be high-quality nature recordings from sites like Freesound, and “ASMR” content has a very specific sonic signature. The modifier doesn’t change which sound the model generates; it steers which part of the learned distribution the sample comes from.
This is the same principle as in image generation, where “professional photograph” or “8K” produces better results than just describing the subject. You’re not just describing the content, you’re describing the recording quality and context.
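The modifier test is easy to script. The sketch below builds the baseline prompt plus one variant per modifier for A/B listening. Only the two modifiers named in the text are included, and the comma separator is an assumption about formatting; `with_modifier` and `modifier_variants` are hypothetical helper names.

```python
BASE_PROMPT = "rain falling on a window"

# Only the modifiers named in the text; extend the tuple with your own candidates.
MODIFIERS = ("field recording", "ASMR")


def with_modifier(modifier: str, prompt: str, sep: str = ", ") -> str:
    """Prepend a quality modifier; an empty modifier returns the prompt unchanged."""
    modifier = modifier.strip()
    return f"{modifier}{sep}{prompt}" if modifier else prompt


def modifier_variants(prompt: str, modifiers=MODIFIERS) -> list[str]:
    """Baseline prompt first, then one variant per modifier."""
    return [prompt] + [with_modifier(m, prompt) for m in modifiers]


for variant in modifier_variants(BASE_PROMPT):
    print(variant)
```

Keeping the unmodified prompt as the first entry gives every comparison a baseline to listen against.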
Ambient soundscapes are usually layered — rain plus thunder plus wind. How well does AudioLDM2 handle multi-element scenes?
| Elements | Prompt | Result |
|---|---|---|
| 1 | “rain falling” | Clean but flat, one-dimensional |
| 2 | “rain falling with distant thunder” | Good improvement, the thunder adds dynamics |
| 3 | “rain falling with distant thunder and wind blowing” | Still OK, but volume balance between elements starts to feel off |
| 4 | “rain falling on a cabin roof with distant thunder, gentle wind, and a fireplace crackling inside” | The model struggled here. Indoor/outdoor mixing was inconsistent — it couldn’t clearly separate the spatial logic of inside vs outside |
Finding: 2-3 elements is the sweet spot for a single generation. Beyond that, the model tends to either drop elements or blend them unnaturally.
This has a direct implication for AmbientGen: instead of trying to generate complex scenes in a single pass, we should generate individual layers and mix them. Generate rain, generate thunder, generate a fireplace, then composite them with proper volume control. This will likely produce better results and give the user more control.
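As a rough sketch of what “mix them with proper volume control” could look like: sum the layers with a per-layer gain, then scale the result down if it would clip. This assumes mono layers at the same sample rate and uses plain Python lists to stay dependency-free; a real version would more likely operate on the NumPy arrays the pipeline returns. `mix_layers` and the `peak` parameter are hypothetical names.

```python
def mix_layers(
    layers: dict[str, list[float]],
    gains: dict[str, float],
    peak: float = 0.9,
) -> list[float]:
    """Sum same-rate mono layers with per-layer gain, scaling down if the mix would clip."""
    if not layers:
        raise ValueError("need at least one layer")
    # Trim every layer to the shortest one so the sums line up sample by sample.
    length = min(len(samples) for samples in layers.values())
    mix = [0.0] * length
    for name, samples in layers.items():
        gain = gains.get(name, 1.0)  # layers without an explicit gain are mixed at unity
        for i in range(length):
            mix[i] += gain * samples[i]
    # Only attenuate: quiet mixes are left as they are.
    loudest = max((abs(x) for x in mix), default=0.0)
    if loudest > peak:
        scale = peak / loudest
        mix = [x * scale for x in mix]
    return mix
```

For example, `mix_layers({"rain": rain, "thunder": thunder}, {"thunder": 0.5})` would keep the rain at full level and put the thunder at half amplitude.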
I ran the same prompt (“ocean waves on a sandy beach with seagulls”) with five different seeds. The variation was significant: of the 5 generations, 2 were clearly good, 2 were mediocre, and 1 was poor.
The differences are clearly audible, not just sample-level noise. Different seeds produce different starting points in the latent space, and some starting points lead to better denoising paths than others. For the app, this means we should either:
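Whichever route the app takes, the seed comparison itself can be sketched like this. The `generate` callable stands in for whatever function wraps the model call, and `score` is a hypothetical quality function: my rankings above were made by ear, so treat an automatic scorer as an assumption, not something these experiments validated.

```python
from typing import Callable, Sequence, TypeVar

Audio = TypeVar("Audio")


def seed_sweep(
    generate: Callable[[str, int], Audio],
    prompt: str,
    seeds: Sequence[int],
) -> dict[int, Audio]:
    """Generate one candidate per seed so they can be compared side by side."""
    if len(set(seeds)) != len(seeds):
        raise ValueError("seeds must be unique")
    return {seed: generate(prompt, seed) for seed in seeds}


def best_candidate(
    candidates: dict[int, Audio],
    score: Callable[[Audio], float],
) -> tuple[int, Audio]:
    """Return the (seed, audio) pair with the highest score."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    return max(candidates.items(), key=lambda item: score(item[1]))
```

Returning the seed along with the audio matters: once a good take is found, the same seed and prompt can be used to reproduce it.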
Based on all experiments, here’s the pattern that consistently produces the best ambient sound results:
[quality modifier] + [specific sound] + [acoustic descriptor] + [spatial context] + [1-2 complementary elements]
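A small helper can enforce this pattern in code. This is a sketch under assumptions: the template doesn't say how the parts are joined, so the comma after the modifier and the “with … and …” phrasing for complementary elements are my choices here, and `build_prompt` is a hypothetical name.

```python
def build_prompt(
    sound: str,
    descriptor: str = "",
    context: str = "",
    complements: tuple[str, ...] = (),
    modifier: str = "",
) -> str:
    """Assemble modifier + sound + descriptor + context + up to two complementary elements."""
    if not sound.strip():
        raise ValueError("a specific sound is required")
    if len(complements) > 2:
        raise ValueError("use at most 2 complementary elements per generation")
    # Core phrase: skip any part that was left empty.
    core = " ".join(part.strip() for part in (sound, descriptor, context) if part.strip())
    if complements:
        core += " with " + " and ".join(c.strip() for c in complements)
    return f"{modifier.strip()}, {core}" if modifier.strip() else core


print(build_prompt(
    sound="a campfire",
    descriptor="crackling",
    context="at night",
    complements=("crickets in the background",),
    modifier="field recording",
))
```

The hard cap on complementary elements mirrors the composition finding: anything beyond that is better generated as a separate layer.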
Examples:
And always use:
The composition limit (2-3 elements per generation) actually shapes the architecture of our app. Instead of one big “generate everything” button, AmbientGen should let you build scenes in layers. In the next post, I’ll build the Gradio interface with this layered approach — generate individual ambient elements and mix them together.
Experiment notebook: Available in the experiments folder.