I’m looking to whip up an Ollama-adjacent CLI wrapper around whatever is currently the fastest way to run a model that fits entirely on a single GPU.
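
For context, this is roughly the shape of thing I'm picturing — a tiny `run`-style command that just streams from whatever backend ends up being fastest. It's only a sketch, and it assumes the backend exposes an OpenAI-compatible `/v1/chat/completions` endpoint (vLLM and llama.cpp's llama-server both do out of the box; exl2 can via TabbyAPI). The default URL, port, and model name below are placeholders, not anything specific:

```python
#!/usr/bin/env python3
"""Minimal sketch of the wrapper idea, not tied to any single backend.
Assumes an OpenAI-compatible /v1/chat/completions endpoint; URL/model are placeholders."""
import argparse
import json
import sys

import requests  # pip install requests


def stream_chat(base_url: str, model: str, prompt: str) -> None:
    """Send one prompt and print the streamed response as it arrives."""
    resp = requests.post(
        f"{base_url}/v1/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
        },
        stream=True,
        timeout=300,
    )
    resp.raise_for_status()
    for line in resp.iter_lines():
        # OpenAI-style streaming sends server-sent events: lines prefixed with "data: "
        if not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):].decode("utf-8")
        if payload.strip() == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        sys.stdout.write(delta.get("content") or "")
        sys.stdout.flush()
    print()


def main() -> None:
    parser = argparse.ArgumentParser(description="tiny ollama-ish 'run' command")
    parser.add_argument("prompt", help="prompt to send to the model")
    parser.add_argument("--url", default="http://localhost:8000",
                        help="backend base URL (placeholder)")
    parser.add_argument("--model", default="local-model",
                        help="model name as the backend knows it (placeholder)")
    args = parser.parse_args()
    stream_chat(args.url, args.model, args.prompt)


if __name__ == "__main__":
    main()
```

The nice part of targeting the OpenAI-compatible API is that the wrapper stays backend-agnostic, so I can swap the engine underneath without touching the CLI.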

I remember a few months back when exl2 was far and away the fastest way to run, say, a 7B model, assuming a big enough GPU. Is this still the case, or have there been developments in vLLM or llama.cpp that have outpaced exl2 in terms of pure inference tok/s?

What are you guys using for purely local inference?


💬 Discussion r/LocalLLaMA (17 points, 19 comments)