Since my last update here, we have added CosyVoice3 to the suite (the nice thing about it is that it is finally an alternative to Chatterbox zero-shot VC - Voice Changer). And now I just added the new Qwen3-TTS!

The most interesting feature is by far the Voice Designer node. You can now finally create your own AI voice: just type a description like “calm female voice with British accent” and it generates a voice for you - no audio sample needed. It’s useful when you don’t have a reference audio you like, when you don’t want to use a real person’s voice, or when you want to quickly prototype character voices. The best thing about our implementation is that if you give the voice a name, the node saves it as a character in your models/voices folder, and then you can use it with literally all the other TTS Engines through the 🎭 Character Voices node.

The Qwen3 engine itself comes with three different model types:

1. CustomVoice - 9 preset speakers (hardcoded), with support for instructions to change and guide the voice emotion (Base unfortunately doesn’t)
2. VoiceDesign - the text-to-voice creation model we talked about
3. Base - traditional zero-shot cloning from audio samples

It supports 10 languages and has both 0.6B (for lower VRAM) and 1.7B (better quality) variants.

Very recently an ASR (Automatic Speech Recognition) model was released, and I intend to support it very soon with a new ASR node - something we are still missing in the suite: Qwen/Qwen3-ASR-1.7B · Hugging Face

I also integrated it with the Step Audio EditX inline tags system, so you can add a second pass with other emotions and effects to the output.

Of course, like any new engine we add, it comes with all our project features: character switching through tags in the text, language switching, parameter switching, pause tags, caching of generated segments, and of course full SRT support with all the timing modes. Overall it’s a solid addition to the 10 TTS engines we now have in the suite.

Now that we’re at 10 engines, I decided to add some comparison tables for easy reference - one for language support across all engines and another for their special features. Makes it easier to pick the right engine for what you need.

🛠️ GitHub: Get it Here 📊 Engine Comparison: Language Support | Feature Comparison 💬 Discord: https://discord.gg/EwKE8KBDqD

Below is the full LLM description of the update (revised by me):


🎨 Qwen3-TTS Engine - Create Voices from Text!

Major new engine addition! Qwen3-TTS brings a unique Voice Designer feature that lets you create custom voices from natural language descriptions. Plus three distinct model types for different use cases!

✨ New Features

Qwen3-TTS Engine

🎨 Voice Designer - Create custom voices from text descriptions! “A calm female voice with British accent” → instant voice generation

Three model types with different capabilities:

- CustomVoice: 9 high-quality preset speakers (Vivian, Serena, Dylan, Eric, Ryan, etc.)
- VoiceDesign: Text-to-voice creation - describe your ideal voice and generate it
- Base: Zero-shot voice cloning from audio samples

- 10 language support - Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
- Model sizes: 0.6B (low VRAM) and 1.7B (high quality) variants
- Character voice switching with [CharacterName] syntax - automatic preset mapping
- SRT subtitle timing support with all timing modes (stretch_to_fit, pad_with_silence, etc.)
- Inline edit tags - Apply Step Audio EditX post-processing (emotions, styles, paralinguistic effects)
- Sage attention support - Improved VRAM efficiency with sageattention backend
- Smart caching - Prevents duplicate voice generation, skips model loading for existing voices
- Per-segment parameters - Control [seed:42], [temperature:0.8] inline
- Auto-download system - All 6 model variants downloaded automatically when needed
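To illustrate how the tags above combine, here is a sketch of a tagged script (the character name is a made-up example; the [seed:…] and [temperature:…] tags are the ones named above, and the exact syntax may differ in your installed version):

```
[Alice] Hello there, and welcome back. [seed:42]
[Alice] [temperature:0.8] This segment uses a higher temperature for a livelier read.
```

Each tagged segment is generated and cached independently, so changing one segment’s parameters doesn’t regenerate the rest.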

🎙️ Voice Designer Node

The standout feature of this release! Create voices without audio samples:

- Natural language input - Describe voice characteristics in plain English
- Disk caching - Saved voices load instantly without regeneration
- Standard format - Works seamlessly with Character Voices system
- Unified output - Compatible with all TTS nodes via NARRATOR_VOICE format
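A minimal sketch of the disk-caching idea, for the curious: key saved voices on a hash of the description so the same prompt always resolves to the same file and is never regenerated. Everything here (paths, file format, function names) is an illustrative assumption, not the suite’s actual implementation:

```python
import hashlib
from pathlib import Path

# Hypothetical cache location, mirroring the models/voices folder mentioned above
VOICE_DIR = Path("models/voices")

def voice_cache_path(description: str) -> Path:
    # Key the cache on a hash of the normalized description, so the same
    # prompt (modulo case/whitespace) always maps to the same file.
    key = hashlib.sha256(description.strip().lower().encode("utf-8")).hexdigest()[:16]
    return VOICE_DIR / f"designed_{key}.bin"

def get_or_design_voice(description: str, design_fn) -> Path:
    """Return a cached voice file, calling the expensive model only on a miss."""
    path = voice_cache_path(description)
    if path.exists():
        return path  # cache hit: skip model loading and generation entirely
    VOICE_DIR.mkdir(parents=True, exist_ok=True)
    path.write_bytes(design_fn(description))  # expensive model call
    return path
```

The payoff is the “load instantly without regeneration” behavior: a second run with the same description is a pure file read.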

Example descriptions:

- “A calm female voice with British accent”
- “Deep male voice, authoritative and professional”
- “Young cheerful woman, slightly high-pitched”

📚 Documentation

- YAML-driven engine tables - Auto-generated comparison tables
- Condensed engine overview in README
- Portuguese accent guidance - Clear documentation of model limitations and workarounds

🎯 Technical Highlights

- Official Qwen3-TTS implementation bundled for stability
- 24kHz mono audio output
- Progress bars with real-time token generation tracking
- VRAM management with automatic model reload and device checking
- Full unified architecture integration
- Interrupt handling for cancellation support

Qwen3-TTS brings a total of 10 TTS engines to the suite, each with unique capabilities. Voice Designer is a first-of-its-kind feature in ComfyUI TTS extensions!


💬 Discussion: r/StableDiffusion (111 points, 40 comments) 🔗 Source