Audio Flamingo 3 (AF3) is a fully open, state-of-the-art Large Audio-Language Model (LALM) that advances reasoning and understanding across speech, sounds, and music. AF3 builds on previous work with innovations in:

Unified audio representation learning (speech, sound, music)

Flexible, on-demand chain-of-thought reasoning

Long-context audio comprehension (up to 10 minutes)

Multi-turn, multi-audio conversational dialogue (AF3-Chat)

Voice-to-voice interaction (AF3-Chat)

Extensive evaluations confirm AF3’s effectiveness: it sets new state-of-the-art results on more than 20 public audio understanding and reasoning benchmarks.

This model is for non-commercial research purposes only.

Model Architecture:

Audio Flamingo 3 combines four components: the AF-Whisper unified audio encoder, an MLP-based audio adaptor, a decoder-only LLM backbone (Qwen2.5-7B), and a streaming TTS module (AF3-Chat only). It accepts up to 10 minutes of audio input.
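To make the data flow between the encoder, adaptor, and LLM backbone concrete, here is a minimal toy sketch in plain Python. It is not NVIDIA's implementation. Every name in it (check_duration, toy_audio_encoder, ToyMLPAdaptor, build_llm_input) is hypothetical, as are the tiny dimensions, the frame rate, the 16 kHz sample rate, the ReLU activation, and the choice to place audio embeddings before the text embeddings. Only the component order and the 10-minute limit come from the description above.

```python
import math
import random
from dataclasses import dataclass

MAX_AUDIO_SECONDS = 10 * 60  # the 10-minute input limit stated above

# Hypothetical toy sizes, NOT the real AF3 dimensions.
ENCODER_DIM = 4         # width of the audio encoder's output features
LLM_DIM = 6             # width of the LLM's token embeddings
FRAMES_PER_SECOND = 2   # feature frames emitted per second of audio


def check_duration(num_samples: int, sample_rate: int) -> float:
    """Return the clip length in seconds; reject clips over the limit."""
    seconds = num_samples / sample_rate
    if seconds > MAX_AUDIO_SECONDS:
        raise ValueError(f"clip is {seconds:.1f}s; limit is {MAX_AUDIO_SECONDS}s")
    return seconds


def toy_audio_encoder(seconds: float) -> list:
    """Stand-in for the audio encoder: one random feature vector per frame."""
    num_frames = max(1, math.ceil(seconds * FRAMES_PER_SECOND))
    rng = random.Random(0)
    return [[rng.uniform(-1, 1) for _ in range(ENCODER_DIM)]
            for _ in range(num_frames)]


def vec_mat(vec: list, mat: list) -> list:
    """Multiply a row vector (len = rows of mat) by a rows x cols matrix."""
    return [sum(v * row[j] for v, row in zip(vec, mat))
            for j in range(len(mat[0]))]


@dataclass
class ToyMLPAdaptor:
    """Two-layer MLP projecting encoder features into the LLM embedding space."""
    w1: list  # ENCODER_DIM x LLM_DIM
    w2: list  # LLM_DIM x LLM_DIM

    @classmethod
    def random_init(cls, seed: int = 1) -> "ToyMLPAdaptor":
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

        return cls(mat(ENCODER_DIM, LLM_DIM), mat(LLM_DIM, LLM_DIM))

    def __call__(self, frames: list) -> list:
        projected = []
        for frame in frames:
            hidden = [max(0.0, v) for v in vec_mat(frame, self.w1)]  # ReLU (assumed)
            projected.append(vec_mat(hidden, self.w2))
        return projected


def build_llm_input(audio_embeds: list, text_embeds: list) -> list:
    """Join audio and text embeddings into one sequence (ordering is an assumption)."""
    return audio_embeds + text_embeds


# A 3-second clip at an assumed 16 kHz sample rate.
seconds = check_duration(num_samples=16000 * 3, sample_rate=16000)
frames = toy_audio_encoder(seconds)
adaptor = ToyMLPAdaptor.random_init()
audio_embeds = adaptor(frames)
text_embeds = [[0.0] * LLM_DIM for _ in range(5)]  # placeholder prompt embeddings
llm_input = build_llm_input(audio_embeds, text_embeds)
print(len(llm_input), len(llm_input[0]))
```

The point of the adaptor step is that the sequence handed to the decoder-only LLM has a single embedding width, so audio frames and text tokens can sit in the same context. In the real model the LLM would then generate text from this sequence, and AF3-Chat would pass that text to the streaming TTS module.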

Paper: https://arxiv.org/abs/2507.08128

Voice-chat finetune: https://huggingface.co/nvidia/audio-flamingo-3-chat


💬 Discussion: r/LocalLLaMA (85 points, 10 comments) 🔗 Source