Since my last update here, we have added CosyVoice3 to the suite (the nice thing about it is that it's finally an alternative to Chatterbox zero-shot VC, i.e. Voice Changer). And now I just added the new Qwen3-TTS!
The most interesting feature is by far the Voice Designer node. You can now finally create your own AI voice: just type a description like “calm female voice with British accent” and it generates a voice for you. No audio sample needed. It’s useful when you don’t have a reference audio you like, when you don’t want to use a real person’s voice, or when you want to quickly prototype character voices. The best thing about our implementation is that if you give it a name, the node will save it as a character in your models/voices folder, and then you can use it with literally all the other TTS engines through the 🎭 Character Voices node.
The Qwen3 engine itself comes with three different model types:
1- CustomVoice has 9 preset speakers (hardcoded) and supports instructions to change and guide the voice emotion (Base unfortunately doesn’t)
2- VoiceDesign is the text-to-voice creation one we talked about
3- Base does traditional zero-shot cloning from audio samples

It supports 10 languages and has both 0.6B (for lower VRAM) and 1.7B (better quality) variants.
Very recently, an ASR (Automatic Speech Recognition) model was also released, and I intend to support it very soon with a new ASR node, which is something we are still missing in the suite: Qwen/Qwen3-ASR-1.7B · Hugging Face
I also integrated Qwen3-TTS with the Step Audio EditX inline tags system, so you can add a second pass with other emotions and effects to the output.
Of course, as with any new engine, it comes with all our project features: character switching through the text with tags, language switching, parameter switching, pause tags, caching of generated segments, and of course full SRT support with all the timing modes. Overall, it’s a solid addition that brings the suite to 10 TTS engines.
Now that we’re at 10 engines, I decided to add some comparison tables for easy reference - one for language support across all engines and another for their special features. Makes it easier to pick the right engine for what you need.
🛠️ GitHub: Get it Here
📊 Engine Comparison: Language Support | Feature Comparison
💬 Discord: https://discord.gg/EwKE8KBDqD
Below is the full LLM description of the update (revised by me):
🎨 Qwen3-TTS Engine - Create Voices from Text!
Major new engine addition! Qwen3-TTS brings a unique Voice Designer feature that lets you create custom voices from natural language descriptions. Plus three distinct model types for different use cases!
✨ New Features
Qwen3-TTS Engine
- 🎨 Voice Designer - Create custom voices from text descriptions! “A calm female voice with British accent” → instant voice generation
- Three model types with different capabilities:
  - CustomVoice: 9 high-quality preset speakers (Vivian, Serena, Dylan, Eric, Ryan, etc.)
  - VoiceDesign: Text-to-voice creation - describe your ideal voice and generate it
  - Base: Zero-shot voice cloning from audio samples
- 10 language support - Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
- Model sizes: 0.6B (low VRAM) and 1.7B (high quality) variants
- Character voice switching with [CharacterName] syntax - automatic preset mapping
- SRT subtitle timing support with all timing modes (stretch_to_fit, pad_with_silence, etc.)
- Inline edit tags - Apply Step Audio EditX post-processing (emotions, styles, paralinguistic effects)
- Sage attention support - Improved VRAM efficiency with sageattention backend
- Smart caching - Prevents duplicate voice generation, skips model loading for existing voices
- Per-segment parameters - Control [seed:42], [temperature:0.8] inline
- Auto-download system - All 6 model variants downloaded automatically when needed
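To give a feel for how the inline syntax above ([CharacterName], [seed:42], [temperature:0.8]) can drive generation, here is a minimal sketch of a tag parser. It is not the suite's actual parser: the function name `parse_segments`, the default `narrator` character, and the rule that a parameter tag applies until the next character tag are all assumptions made for illustration.

```python
import re

# Hypothetical sketch, NOT the suite's real parser.
# Matches any [...] tag that contains no nested brackets.
TAG_RE = re.compile(r"\[([^\[\]]+)\]")


def _coerce(value):
    """Turn '42' into 42 and '0.8' into 0.8; leave anything else as a string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def parse_segments(text, default_character="narrator"):
    """Split tagged text into segments with a character and parameter overrides.

    Assumed rules: [Name] switches the character and clears overrides;
    [key:value] sets an override for the text that follows.
    """
    segments = []
    character = default_character
    params = {}
    pos = 0

    def emit(chunk):
        chunk = chunk.strip()
        if chunk:
            segments.append(
                {"character": character, "params": dict(params), "text": chunk}
            )

    for match in TAG_RE.finditer(text):
        emit(text[pos:match.start()])  # text before this tag uses the current state
        tag = match.group(1).strip()
        if ":" in tag:
            key, value = tag.split(":", 1)
            params[key.strip()] = _coerce(value.strip())
        else:
            character = tag
            params = {}
        pos = match.end()

    emit(text[pos:])  # trailing text after the last tag
    return segments


if __name__ == "__main__":
    script = "Hello there. [Alice] Hi! [seed:42] Same voice, fixed seed. [Bob] [temperature:0.8] Hey."
    for segment in parse_segments(script):
        print(segment)
```

Each resulting segment could then be sent to the engine on its own, which is also what makes per-segment caching possible.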
🎙️ Voice Designer Node
The standout feature of this release! Create voices without audio samples:
- Natural language input - Describe voice characteristics in plain English
- Disk caching - Saved voices load instantly without regeneration
- Standard format - Works seamlessly with Character Voices system
- Unified output - Compatible with all TTS nodes via NARRATOR_VOICE format
Example descriptions:
- “A calm female voice with British accent”
- “Deep male voice, authoritative and professional”
- “Young cheerful woman, slightly high-pitched”
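The disk caching idea for designed voices can be pictured as a lookup keyed by the description text. The sketch below is only an illustration under assumptions: the file naming, the hashing scheme, and the `generate` callable (a stand-in for the actual model call) are hypothetical and not taken from the node's code.

```python
import hashlib
from pathlib import Path


def cache_path(voices_dir, description, model="1.7B"):
    """Build a stable file path from the model size and a normalized description."""
    normalized = description.strip().lower()
    key = hashlib.sha256(f"{model}|{normalized}".encode("utf-8")).hexdigest()[:16]
    return Path(voices_dir) / f"designed_{key}.wav"


def get_or_create_voice(voices_dir, description, generate, model="1.7B"):
    """Return (path, was_cached).

    `generate` is a hypothetical callable that takes the description and
    returns audio bytes; it is only invoked on a cache miss.
    """
    path = cache_path(voices_dir, description, model)
    if path.exists():
        return path, True  # cache hit: no model load, no regeneration
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(generate(description))
    return path, False
```

Normalizing the description before hashing means trivial differences in case or surrounding whitespace map to the same saved voice, while a different model size gets its own entry.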
📚 Documentation
- YAML-driven engine tables - Auto-generated comparison tables
- Condensed engine overview in README
- Portuguese accent guidance - Clear documentation of model limitations and workarounds
🎯 Technical Highlights
- Official Qwen3-TTS implementation bundled for stability
- 24kHz mono audio output
- Progress bars with real-time token generation tracking
- VRAM management with automatic model reload and device checking
- Full unified architecture integration
- Interrupt handling for cancellation support
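Since the output is 24 kHz mono, subtitle timings translate directly into sample counts. As a rough illustration of what a timing mode named pad_with_silence could do, here is a small sketch. The behavior is inferred from the mode's name only, and the function `pad_to_slot` is hypothetical, not the suite's implementation.

```python
SAMPLE_RATE = 24_000  # 24 kHz mono, as listed in the highlights


def pad_to_slot(samples, start_s, end_s, sample_rate=SAMPLE_RATE):
    """Fit one generated segment into its subtitle slot by padding with silence.

    Returns (padded_samples, overflow), where overflow is the number of samples
    by which the audio exceeds the slot. This sketch never trims or stretches.
    """
    target = round((end_s - start_s) * sample_rate)
    if len(samples) >= target:
        return list(samples), len(samples) - target
    silence = [0.0] * (target - len(samples))
    return list(samples) + silence, 0


if __name__ == "__main__":
    one_second = [0.1] * SAMPLE_RATE
    padded, overflow = pad_to_slot(one_second, start_s=0.0, end_s=1.5)
    print(len(padded) / SAMPLE_RATE, overflow)
```

A segment that overflows its slot would need a different strategy, such as time-stretching, which is presumably where a mode like stretch_to_fit comes in.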
With Qwen3-TTS, the suite now has a total of 10 TTS engines, each with unique capabilities. Voice Designer is a first-of-its-kind feature in ComfyUI TTS extensions!
💬 Discussion: r/StableDiffusion (111 points, 40 comments) 🔗 Source