Built a business sim where AI agents run a food truck for 30 days — location, menu, pricing, staff, inventory. Same scenario for all models.
Opus made 28K. 8 models went bankrupt. Every model that took a loan went bankrupt (8/8).
There's also a playable mode with the same simulation, same 34 tools, and same leaderboard. You either survive 30 days or go bankrupt, then get a result card and land on the shared leaderboard. Example result: https://foodtruckbench.com/r/9E6925
Benchmark + leaderboard: https://foodtruckbench.com
Play: https://foodtruckbench.com/play
Gemini 3 Flash Thinking is the only model out of 20+ tested that gets stuck in an infinite decision loop, in 100% of runs: https://foodtruckbench.com/blog/gemini-flash
Happy to answer questions about the sim or results.
💬 Discussion r/LocalLLaMA (244 points, 105 comments) 🔗 Source