Hey everyone,
I’ve been testing the new Qwen 3.5 35B (the A3B MoE version) and noticed a massive performance gap depending on how I run it.
My setup:
GPU: RTX 5070 Ti (16GB VRAM)
RAM: 96GB
OS: Windows 11
When I load the exact same GGUF in LM Studio, I’m only pulling around 16 tok/s. But when I drop into the terminal and run it directly through llama.cpp, it shoots up to 40 tok/s.
Has anyone else noticed this kind of overhead with the new Qwen 3.5 MoE models? Are there advanced settings in LM Studio I’m missing to bridge this gap, or is terminal llama.cpp just the undisputed king for MoE efficiency right now?
For context, here is the exact command I’m using to run the server:
llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL \
  --alias "qwen3.5-35b-a3b" --host 0.0.0.0 \
  --port 1234 -c 65536 \
  --temp 0.6 --top-p 0.95 \
  --top-k 20 \
  --min-p 0.00
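For anyone who wants to compare the two setups with the same measurement, here is a minimal sketch of how I time generation against the server's OpenAI-compatible endpoint. It assumes the server is running with the command above (port 1234, alias `qwen3.5-35b-a3b`); the prompt and `max_tokens` value are arbitrary choices for the benchmark, not anything special.

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Throughput in generated tokens per wall-clock second."""
    return completion_tokens / elapsed_s


def benchmark(base_url: str = "http://localhost:1234",
              prompt: str = "Explain MoE routing in one paragraph.") -> float:
    # POST a non-streaming chat completion and time the full round trip.
    payload = json.dumps({
        "model": "qwen3.5-35b-a3b",  # the --alias set in the command above
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode()
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - start
    # llama-server reports token counts in the OpenAI-style "usage" field.
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)


if __name__ == "__main__":
    print(f"{benchmark():.1f} tok/s")
```

Wall-clock timing over a non-streaming request includes prompt processing, so run a few iterations and a warm-up pass before trusting the number; LM Studio exposes the same OpenAI-style API, so the identical script works against both backends.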
💬 Discussion r/LocalLLaMA (151 points, 55 comments) 🔗 Source