So I went down the rabbit hole of making a VLM agent that actually plays DOOM. The concept is dead simple - take a screenshot from VizDoom, draw a numbered grid on top, send it to a vision model with two tools (shoot and move), the model decides what to do. Repeat.
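The core of the loop is the grid: the model names a cell, and that cell has to be turned back into an aim direction. A minimal sketch of how that mapping could look (grid size, resolution, and function names are my assumptions, not the author's actual code):

```python
# Map a numbered grid cell back to screen coordinates and an aim angle.
# All constants are illustrative assumptions.

GRID_COLS = 8                   # assumed grid width
GRID_ROWS = 6                   # assumed grid height
SCREEN_W, SCREEN_H = 640, 480   # a common VizDoom resolution

def cell_to_center(cell: int) -> tuple[int, int]:
    """Map a 1-based, row-major grid cell number to its pixel center."""
    row, col = divmod(cell - 1, GRID_COLS)
    cw, ch = SCREEN_W // GRID_COLS, SCREEN_H // GRID_ROWS
    return col * cw + cw // 2, row * ch + ch // 2

def turn_delta(cell: int, fov_degrees: float = 90.0) -> float:
    """Horizontal angle (degrees) to turn so the cell's column is centered.

    Negative means turn left, positive means turn right.
    """
    x, _ = cell_to_center(cell)
    return (x - SCREEN_W / 2) / SCREEN_W * fov_degrees
```

With an 8-wide grid at 640x480, cell 1 centers at (40, 40), so `turn_delta(1)` is a hard left turn and a cell near the middle column maps to a small correction.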
The wild part? It’s Qwen 3.5 0.8B - a model small enough to run on a smartwatch, trained purely to generate text, yet it handles the game surprisingly well.
On the basic scenario it actually gets kills. Like, it sees the enemy, picks the right column, and shoots. I was genuinely surprised.
On defend_the_center it’s trickier - it hits enemies, but doesn’t conserve ammo, and by the end it keeps trying to shoot when there’s nothing left. But sometimes it outputs stuff like “I see a fireball but I’m not sure if it’s an enemy”, which is oddly self-aware for 0.8B parameters.
The stack is Python + VizDoom + direct HTTP calls to LM Studio. Latency is about 10 seconds per step on an M1-series Mac.
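"Direct HTTP calls" here means LM Studio's OpenAI-compatible chat-completions endpoint, which by default listens on localhost:1234. A sketch of how the request could be assembled (tool schemas, prompt text, and the model id are assumptions for illustration):

```python
import base64

# LM Studio's default OpenAI-compatible endpoint
LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"

# Illustrative versions of the two tools described in the post
TOOLS = [
    {"type": "function", "function": {
        "name": "shoot",
        "description": "Fire at a grid column",
        "parameters": {"type": "object",
                       "properties": {"column": {"type": "integer"}},
                       "required": ["column"]}}},
    {"type": "function", "function": {
        "name": "move",
        "description": "Step left or right",
        "parameters": {"type": "object",
                       "properties": {"direction": {"type": "string",
                                                    "enum": ["left", "right"]}},
                       "required": ["direction"]}}},
]

def build_request(png_bytes: bytes, model: str = "qwen-vl") -> dict:
    """Assemble a chat-completions payload with the gridded screenshot."""
    b64 = base64.b64encode(png_bytes).decode()
    return {
        "model": model,  # whatever model id LM Studio has loaded
        "messages": [{"role": "user", "content": [
            {"type": "text",
             "text": "Pick a tool based on the numbered grid overlay."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
        "tools": TOOLS,
    }
```

POSTing that dict as JSON to `LMSTUDIO_URL` each step, plus the model's decode time, is where the ~10 seconds per step goes.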
Currently trying to fix the ammo waste - adding a “reason” field to tool calls so the model has to describe what it sees before deciding whether to shoot. We’ll see how it goes.
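One way that fix could look in the tool schema: make “reason” a required parameter, so the model is forced to describe the cell before it can emit a shot (field names and wording are illustrative, not the author's actual schema):

```python
# Illustrative shoot tool with a required "reason" field: the model must
# describe what it sees before the call validates.
SHOOT_TOOL = {
    "type": "function",
    "function": {
        "name": "shoot",
        "description": "Fire at a grid column. Describe what you see first.",
        "parameters": {
            "type": "object",
            "properties": {
                "reason": {"type": "string",
                           "description": "What is in that cell, and why shoot it?"},
                "column": {"type": "integer"},
            },
            "required": ["reason", "column"],  # no justification, no shot
        },
    },
}
```

The idea is the same trick as chain-of-thought: generating the description first gives the tiny model a chance to notice “that's a fireball, not an enemy” before committing ammo.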
💬 Discussion r/LocalLLaMA (171 points, 22 comments) 🔗 Source