Audio Flamingo 3 (AF3) is a fully open, state-of-the-art Large Audio-Language Model (LALM) that advances reasoning and understanding across speech, sounds, and music. AF3 builds on previous work with innovations in:
Unified audio representation learning (speech, sound, music)
Flexible, on-demand chain-of-thought reasoning
Long-context audio comprehension (up to 10 minutes)
Multi-turn, multi-audio conversational dialogue (AF3-Chat)
Voice-to-voice interaction (AF3-Chat)
Extensive evaluations confirm AF3’s effectiveness: it sets new state-of-the-art results on over 20 public audio understanding and reasoning tasks.
This model is for non-commercial research purposes only.
Model Architecture:
Audio Flamingo 3 combines four components: the AF-Whisper unified audio encoder, an MLP-based audio adaptor, a decoder-only LLM backbone (Qwen2.5-7B), and a streaming TTS module (AF3-Chat). It accepts up to 10 minutes of audio input.
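The data flow through these components can be illustrated with a minimal, standard-library-only sketch. It is not the released implementation. The 16 kHz sample rate and 30-second windows are borrowed from the original Whisper and are assumptions for AF3. The two-layer GELU adaptor, the toy dimensions, and every name (`split_into_windows`, `MLPAdaptor`, `build_llm_input`) are hypothetical. The audio encoder and the LLM are not modelled: the frames passed to the adaptor stand in for encoder output.

```python
import math
import random

SAMPLE_RATE = 16_000    # assumption: 16 kHz mono, as in the original Whisper
WINDOW_SECONDS = 30     # assumption: Whisper-style 30-second windows
MAX_SECONDS = 10 * 60   # from the model card: up to 10 minutes of audio


def split_into_windows(num_samples):
    """Return (start, end) sample ranges covering the clip in fixed windows."""
    if num_samples > MAX_SECONDS * SAMPLE_RATE:
        raise ValueError("clip exceeds the 10-minute limit")
    window = WINDOW_SECONDS * SAMPLE_RATE
    return [(s, min(s + window, num_samples))
            for s in range(0, num_samples, window)]


def make_matrix(rows, cols, rng):
    """Create a small random weight matrix (rows x cols)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(x, w):
    """Multiply a vector x (length n) by a weight matrix w (n x m)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x))
            for j in range(len(w[0]))]


def gelu(v):
    """Exact GELU applied element-wise (the activation is an assumption)."""
    return [0.5 * t * (1.0 + math.erf(t / math.sqrt(2.0))) for t in v]


class MLPAdaptor:
    """Hypothetical two-layer MLP mapping encoder features to the LLM width."""

    def __init__(self, d_audio, d_hidden, d_llm, seed=0):
        rng = random.Random(seed)
        self.w1 = make_matrix(d_audio, d_hidden, rng)
        self.w2 = make_matrix(d_hidden, d_llm, rng)

    def __call__(self, frames):
        # Project each audio frame independently into the LLM embedding space.
        return [linear(gelu(linear(f, self.w1)), self.w2) for f in frames]


def build_llm_input(audio_frames, text_embeddings, adaptor):
    """Prepend projected audio frames to the text token embeddings."""
    return adaptor(audio_frames) + text_embeddings


if __name__ == "__main__":
    # A 95-second clip: three full 30 s windows plus a 5 s remainder.
    print(len(split_into_windows(95 * SAMPLE_RATE)))

    adaptor = MLPAdaptor(d_audio=4, d_hidden=8, d_llm=6)
    frames = [[0.1] * 4 for _ in range(3)]   # stand-in for encoder output
    text = [[0.0] * 6 for _ in range(2)]     # stand-in for token embeddings
    sequence = build_llm_input(frames, text, adaptor)
    print(len(sequence), len(sequence[0]))
```

The point of the adaptor is that audio features and text embeddings end up with the same width, so a decoder-only LLM can consume them as one sequence. Under the windowing assumption above, a full 10-minute clip would be split into 20 windows.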
Paper: https://arxiv.org/abs/2507.08128
Voice-chat finetune: https://huggingface.co/nvidia/audio-flamingo-3-chat