Audio Model Type Support In SwarmUI

Audio Duration: defaults to 120 seconds (2 minutes), but is designed to support any duration from a few seconds up to 10 minutes.
Audio Style: write a short description of the music style.
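
As a rough illustration of the duration bounds described above, the following Python sketch resolves a requested duration against the 120-second default and the 10-minute ceiling. The function name `resolve_duration` and both constants are hypothetical names chosen for this example. They are not part of SwarmUI's API, and clamping over-long requests is an assumption rather than documented behavior. No minimum is enforced because the source only says "a few seconds".

```python
from typing import Optional

# Values taken from the parameter description above; names are hypothetical.
DEFAULT_DURATION_SECONDS = 120.0   # documented default (2 minutes)
MAX_DURATION_SECONDS = 600.0       # documented upper end (10 minutes)


def resolve_duration(requested: Optional[float]) -> float:
    """Return a usable audio duration in seconds.

    Falls back to the default when nothing is requested, rejects
    non-positive values, and clamps anything above the maximum.
    Clamping (rather than raising) is an assumption for this sketch.
    """
    if requested is None:
        return DEFAULT_DURATION_SECONDS
    if requested <= 0:
        raise ValueError("duration must be a positive number of seconds")
    return min(float(requested), MAX_DURATION_SECONDS)
```

For example, `resolve_duration(None)` falls back to the default, while a request of 900 seconds is clamped to the 10-minute maximum.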

Model	Year	Author	Scale	Type	Quality/Status
Ace Step 1.5	2026	StepFun	2B DiT	Music	No

Support for image models and technical formats is documented in the Model Support doc, which also explains the table columns above.

Audio models vary in intention and purpose. Some examples include:

Sound effect models create general-purpose sound effects, useful for adding sound to videos (see e.g. MMAudio).
Music models create full songs. These often support instruments, styles, and lyrics (see e.g. ACE-Step).
Speech models create speech. These often take voice references and a text prompt for what to say (see e.g. VibeVoice).

Ace Step 1.5