Dasheng-AudioGen
Dasheng-AudioGen is a unified audio generation model that can jointly synthesize intelligible speech, music, sound effects, and environmental acoustics from text descriptions.
Models
| Model | HuggingFace | Text Encoder | Language |
|---|---|---|---|
| Dasheng-AudioGen | mispeech/Dasheng-AudioGen | google/flan-t5-large | English |
| Dasheng-AudioGen-Multilingual | mispeech/Dasheng-AudioGen-Multilingual | google/mt5-large | Multilingual |
Installation
pip install torch torchaudio "transformers<5" einops
Tested with Python 3.10, torch 2.8.0+cu128, transformers 4.57. Not compatible with transformers 5.x.
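Since transformers 5.x is reported above as incompatible, a small startup guard can fail early with a readable message. This is a minimal sketch: the helper names `is_supported_transformers` and `check_installed` are hypothetical and not part of the model repository, and only the major version number is inspected.

```python
from importlib import metadata


def is_supported_transformers(version: str) -> bool:
    """Return True when the major version is below 5 (hypothetical helper)."""
    major = int(version.split(".")[0])
    return major < 5


def check_installed() -> None:
    """Raise a readable error if transformers is missing or too new."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError as exc:
        raise RuntimeError(
            'transformers is not installed; run: pip install "transformers<5"'
        ) from exc
    if not is_supported_transformers(installed):
        raise RuntimeError(
            f"transformers {installed} found; this model card requires transformers<5"
        )
```

Calling `check_installed()` before loading the model surfaces a version mismatch before any remote code is executed.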
Quick Start
Basic Usage
import torchaudio
from transformers import AutoModel
model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen", trust_remote_code=True).cuda()
audio = model.generate("A dog barking in a park")
torchaudio.save("output.wav", audio.cpu(), 16000)
Aspect-wise Prompt
Use compose_prompt to describe different audio aspects separately:
prompt = model.compose_prompt(
    caption="A gritty detective narrating over the sound of heavy rain and a melancholic solo jazz saxophone.",
    speech="gritty deep male voice",
    music="melancholic solo saxophone",
    env="distant urban ambience",
    sfx="heavy rain hitting pavement",
    asr="The city never sleeps, but it sure knows how to cry.",
)
audio = model.generate(prompt)
torchaudio.save("output.wav", audio.cpu(), 16000)
You can also pass a pre-formatted string with tags directly:
audio = model.generate(
    "<|caption|> A helicopter passing overhead. <|sfx|> Rhythmic helicopter blade sounds. <|env|> Open sky ambience."
)
Batch Inference
prompts = [
    model.compose_prompt(caption="A cat meowing softly.", sfx="Soft cat meow."),
    model.compose_prompt(caption="Thunder rolling in the distance.", env="Stormy night ambience."),
    model.compose_prompt(caption="A piano playing a gentle melody.", music="Soft piano ballad."),
]
audios = model.generate(prompts)
for i, audio in enumerate(audios):
    torchaudio.save(f"output_{i}.wav", audio.unsqueeze(0).cpu(), 16000)
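For longer prompt lists it may help to split the work into fixed-size batches so that a single call does not exhaust GPU memory. The sketch below is an illustration only: `chunked` and `generate_in_batches` are hypothetical helpers, the default batch size of 4 is arbitrary, and `generate_fn` is assumed to accept a list of prompts and return one output per prompt, as `model.generate` does in the example above.

```python
from typing import Any, Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def generate_in_batches(
    generate_fn: Callable[[List[str]], Any],
    prompts: Sequence[str],
    batch_size: int = 4,
) -> List[Any]:
    """Call `generate_fn` once per batch and collect outputs in prompt order."""
    outputs: List[Any] = []
    for batch in chunked(prompts, batch_size):
        # Assumption: iterating the result yields one item per prompt.
        outputs.extend(generate_fn(batch))
    return outputs
```

With a loaded model, this would be called as `generate_in_batches(model.generate, prompts, batch_size=2)`.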
Generation Parameters
audio = model.generate(
    prompts="A dog barking in a park",
    num_steps=25,             # number of denoising steps (default: 25)
    guidance_scale=5.0,       # classifier-free guidance scale (default: 5.0)
    sway_sampling_coef=-1.0,  # sway sampling coefficient (default: -1.0, 0 for linear)
)
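The model card does not define these parameters further. As a rough illustration, the sketch below assumes conventions that are common in flow-matching samplers: the sway sampling warp t = u + s * (cos(pi/2 * u) - 1 + u) as described for F5-TTS, and one common form of the classifier-free guidance blend. Whether Dasheng-AudioGen uses exactly these formulas is an assumption, and both function names are hypothetical.

```python
import math
from typing import List


def sway_timesteps(num_steps: int, sway_sampling_coef: float) -> List[float]:
    """Warp a uniform grid u in [0, 1] with t = u + s * (cos(pi/2 * u) - 1 + u).

    Assumed formula; s = 0 leaves the grid linear.
    """
    timesteps = []
    for i in range(num_steps + 1):
        u = i / num_steps
        warp = math.cos(math.pi / 2 * u) - 1 + u
        timesteps.append(u + sway_sampling_coef * warp)
    return timesteps


def cfg_combine(cond: float, uncond: float, guidance_scale: float) -> float:
    """Blend conditional and unconditional predictions (assumed convention)."""
    return uncond + guidance_scale * (cond - uncond)
```

Under this assumed formula, a negative coefficient such as -1.0 places more of the steps near t = 0, while 0 reproduces evenly spaced steps, which matches the "0 for linear" note in the parameter comment.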
Prompt Format
Dasheng-AudioGen uses structured tags to describe different audio aspects:
| Tag | Description |
|---|---|
| `<\|caption\|>` | Overall audio scene description |
| `<\|speech\|>` | Speaker identity and speaking style |
| `<\|asr\|>` | Spoken transcript / dialogue |
| `<\|sfx\|>` | Sound effects |
| `<\|music\|>` | Background music |
| `<\|env\|>` | Environmental ambience |
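To make the tag layout concrete, the sketch below assembles a tagged string from a dictionary of aspects and rejects tags that are not in the table. It is not the model's own `compose_prompt`: `build_tagged_prompt` is a hypothetical helper, and the tag ordering and single-space separator are assumptions modelled on the pre-formatted example string shown earlier.

```python
from typing import Dict

# Tags from the table above; this ordering is an assumption.
KNOWN_TAGS = ("caption", "speech", "asr", "sfx", "music", "env")


def build_tagged_prompt(aspects: Dict[str, str]) -> str:
    """Join non-empty aspects as '<|tag|> text' segments (hypothetical helper)."""
    unknown = set(aspects) - set(KNOWN_TAGS)
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")
    parts = []
    for tag in KNOWN_TAGS:
        text = aspects.get(tag, "").strip()
        if text:  # skip aspects that are missing or blank
            parts.append(f"<|{tag}|> {text}")
    return " ".join(parts)
```

A string built this way can be passed directly to `model.generate`, in the same way as the pre-formatted example in the Quick Start.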
Acknowledgments
Dasheng-AudioGen was developed with contributions from XIAOMI LLM PLUS and SJTU X-LANCE.
Citation
@article{dasheng-audiogen,
title={Dasheng-AudioGen},
author={},
journal={arXiv preprint arXiv:2505.XXXXX},
year={2025}
}
License
This project is released under the Apache License 2.0.