I’m using OpenRouter to manage multiple LLM subscriptions in one place for a research project where I need to benchmark responses across different models. However, I’ve noticed some discrepancies between responses when calling the same model (like GPT-4) through OpenRouter’s API versus OpenAI’s native API.
I’ve verified that:
- temperature and top_p parameters are identical
- no caching is occurring on either side
- the same prompts are being used
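For reference, here's roughly the harness I'm using to hit both endpoints with identical settings (the prompt is a placeholder, and `seed` is best-effort determinism, not a guarantee):

```python
from openai import OpenAI

PROMPT = "Explain the difference between top_p and temperature."  # placeholder

# Native OpenAI endpoint (reads OPENAI_API_KEY from the environment)
openai_client = OpenAI()

# OpenRouter exposes an OpenAI-compatible endpoint, so the same SDK works
openrouter_client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

def ask(client, model):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0.0,  # identical sampling settings on both sides
        top_p=1.0,
        seed=42,          # best-effort reproducibility; OpenAI doesn't guarantee it
    )
    return resp.choices[0].message.content

native = ask(openai_client, "gpt-4")
routed = ask(openrouter_client, "openai/gpt-4")  # OpenRouter's model slug
print(native == routed)
```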
The differences aren’t huge, but they’re noticeable enough to potentially affect my benchmark results.
Has anyone else run into this issue? I’m wondering if:
- OpenRouter adds any middleware processing that could affect outputs
- there are default parameters being set differently
- there's some other configuration I'm missing
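One thing I'm planning to try next is pinning OpenRouter's provider routing so requests can only go to OpenAI itself, and reading back which upstream actually served the request. The `provider` request field and the provider name in the response are my reading of OpenRouter's docs, so treat the exact fields below as assumptions:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # OpenRouter key
)

resp = client.chat.completions.create(
    model="openai/gpt-4",
    messages=[{"role": "user", "content": "Explain top_p vs temperature."}],
    temperature=0.0,
    top_p=1.0,
    # extra_body passes OpenRouter-specific fields the OpenAI SDK doesn't model
    extra_body={
        "provider": {
            "order": ["OpenAI"],       # route only to OpenAI itself
            "allow_fallbacks": False,  # error out instead of silently rerouting
        }
    },
)

# OpenRouter includes the upstream provider in its response JSON; the SDK
# keeps unrecognized fields, so this is my best guess at reading it back.
print(resp.model_dump().get("provider"))
```

If the routed responses still drift from the native API with the provider pinned, that would at least rule out routing to a different upstream as the cause.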
Any insights would be appreciated. I'm trying to determine whether this is expected behavior or whether there's something I can adjust to get more consistent results.