I’m looking to whip up an Ollama-style CLI wrapper over whatever is currently the fastest way to run a model that fits entirely on a single GPU.
I remember a few months back when exl2 was far and away the fastest way to run, say, a 7B model, assuming a big enough GPU. Is this still the case, or have there been developments in vLLM or llama.cpp that have outpaced exl2 in pure inference tok/s?
What are you guys using for purely local inference?
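For the wrapper part, one way to stay backend-agnostic is to target the OpenAI-compatible HTTP API that llama.cpp's `llama-server`, vLLM, and TabbyAPI (for exl2) all expose, so you can swap engines without touching the CLI. A minimal sketch, assuming a local server on port 8000 and a placeholder model name (both are assumptions, not anything from the post):

```python
# Minimal backend-agnostic chat CLI sketch. Assumes the inference backend
# (llama-server, vLLM, TabbyAPI, ...) serves an OpenAI-compatible
# /v1/chat/completions endpoint. BASE_URL and the model name are placeholders.
import json
import sys
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: local inference server


def build_chat_request(prompt: str, model: str = "local-model") -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-style response dict."""
    return response["choices"][0]["message"]["content"]


def chat(prompt: str) -> str:
    """Send one prompt to the local server and return the reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))


if __name__ == "__main__":
    print(chat(" ".join(sys.argv[1:]) or "Hello"))
```

Since all three backends speak the same wire format, benchmarking them for tok/s just means pointing `BASE_URL` at each one in turn.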
💬 Discussion r/LocalLLaMA (17 points, 19 comments)