
Why LLMs Aren't Deterministic (Even at Temperature 0) - And How to Fix It

Reproducibility is the foundation of science. Yet LLMs remain non-deterministic even at temperature 0.

2025-09-22


Reproducibility is the foundation of science. You run the experiment, you get the result, someone else runs the same experiment, they get the same result. That's how we know something is real rather than noise.

LLMs make this hard in ways that aren't obvious at first. Most practitioners know that high temperature means more randomness. Fewer know that temperature 0 - the "deterministic" setting - doesn't actually give you determinism.

Why Temperature 0 Isn't Deterministic

Setting temperature to 0 makes the model greedily select the highest-probability token at each step, which should be deterministic. In practice, it often isn't, for several reasons:

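Floating-point non-determinism. Floating-point addition is commutative but not associative: (a + b) + c can differ from a + (b + c), because every intermediate result is rounded. GPU matrix operations sum many partial products in parallel, so the order in which those sums are accumulated affects the result. When you scale computation across multiple GPUs or change batch size, you change the order of operations, which changes the floating-point accumulation errors, which can change which token gets the highest computed probability. The differences are tiny - fractions of a percent in probability - but at temperature 0, where you're taking the argmax, a tiny difference between two near-tied tokens can flip the selected token. Once one token flips, every later token is conditioned on a different prefix, so the outputs can diverge completely from that point on.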
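A minimal sketch of the effect in plain Python, with no GPU involved. The numbers are deliberately exaggerated so the rounding is easy to see, and the function names, the rival logit, and the two-token setup are illustrative assumptions rather than anything taken from a real inference stack. It shows the same three values summed in two orders, and the greedy pick changing as a result.

```python
def accumulate(values):
    """Sum left to right, rounding after every addition."""
    total = 0.0
    for value in values:
        total += value
    return total


def greedy_pick(logits):
    """Temperature-0 decoding: return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# The same three partial sums, accumulated in two different orders.
order_one = [1e16, 1.0, -1e16]
order_two = [1e16, -1e16, 1.0]

logit_one = accumulate(order_one)  # the 1.0 is lost when added to 1e16
logit_two = accumulate(order_two)  # the large terms cancel first, so 1.0 survives

RIVAL_LOGIT = 0.5  # hypothetical competing token with a fixed score

pick_one = greedy_pick([logit_one, RIVAL_LOGIT])
pick_two = greedy_pick([logit_two, RIVAL_LOGIT])

print("logits:", logit_one, logit_two)
print("greedy picks:", pick_one, pick_two)
```

Real logits differ by far less than this, but the mechanism is the same: identical inputs, a different accumulation order, and a different argmax.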

Infrastructure non-determinism. Cloud LLM APIs run on dynamic infrastructure. Your request might be served by different GPU configurations on different calls. The model weights are the same, but the hardware-level computation path differs, producing floating-point variation.

Model version drift. Providers can update a model behind a stable name without announcing it. The endpoint named gpt-4 today may have different weights than the endpoint named gpt-4 last month. Pinning a specific version helps, but a pin is only as strong as the provider's guarantee, and those guarantees are not always honored.

Context window effects. For long contexts, different implementations handle attention differently. Chunking strategies, KV cache implementations, and attention approximations can all introduce variation that's invisible at the API level.

Why This Matters

For most use cases, non-determinism at temperature 0 is a minor annoyance. For regulated use cases, it's a serious problem.

If you're building an AI system for financial analysis, medical decision support, or legal document review, you may be required to demonstrate reproducibility. "We ran this analysis and got this result" needs to mean something. If running it again can produce a different result, your audit trail is broken.

More practically: if your evaluation suite gives different results on different runs, you can't tell signal from noise. You don't know if a model change improved performance or if you just landed on a different side of the run-to-run variation.

What You Can Actually Do

Seed control where available. Some APIs and local model implementations expose a random seed parameter. Use it. It doesn't address the other sources of non-determinism listed above, and some providers describe seeding as best-effort, but where it is honored it removes the stochastic sampling component when you run above temperature 0.
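What a seed does and does not control can be sketched with the standard library alone. The vocabulary, the probabilities, and the `sample_tokens` helper below are hypothetical stand-ins for a model's sampling step, not any provider's API. The point is only that a seeded generator replays the same draws from the same distribution.

```python
import random

TOKENS = ["approve", "reject", "escalate"]  # hypothetical vocabulary
PROBS = [0.5, 0.3, 0.2]                     # hypothetical next-token probabilities


def sample_tokens(seed, count=8):
    """Draw `count` tokens from a private, seeded generator."""
    rng = random.Random(seed)  # isolated from the global random state
    return rng.choices(TOKENS, weights=PROBS, k=count)


run_a = sample_tokens(seed=1234)
run_b = sample_tokens(seed=1234)

print(run_a)
print(run_b)
print("identical:", run_a == run_b)
```

The seed fixes the draws, not the probabilities. If floating-point variation upstream changes the distribution itself, the same seed can still yield a different token.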

Output fingerprinting. For production systems where reproducibility matters, record a cryptographic hash of the output along with the input, timestamp, and model version. This lets you detect when outputs drift even if you can't prevent it.
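A minimal sketch of such a record, assuming SHA-256 and a plain dictionary as the log entry. The field names and the `fingerprint` and `has_drifted` helpers are illustrative choices, not a standard schema. The prompt is hashed here as well so that the record can be compared without storing sensitive input; store the raw input instead if your audit requirements call for it.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_hex(text):
    """Hex digest of the UTF-8 encoding of `text`."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def fingerprint(prompt, output, model_version):
    """Build one audit record for a single model call."""
    return {
        "model_version": model_version,
        "prompt_sha256": sha256_hex(prompt),
        "output_sha256": sha256_hex(output),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def has_drifted(earlier, later):
    """True when the same prompt and model version produced a different output."""
    same_request = (
        earlier["prompt_sha256"] == later["prompt_sha256"]
        and earlier["model_version"] == later["model_version"]
    )
    return same_request and earlier["output_sha256"] != later["output_sha256"]


# Hypothetical outputs standing in for two calls with the same prompt.
first = fingerprint("Summarize Q3 revenue.", "Revenue rose 4%.", "model-2025-01")
second = fingerprint("Summarize Q3 revenue.", "Revenue increased 4%.", "model-2025-01")

print(json.dumps(first, indent=2))
print("drift detected:", has_drifted(first, second))
```

An exact hash flags any change, including harmless rewording, so it works best as a tripwire that triggers a closer comparison rather than as a verdict on its own.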

Deterministic components for critical calculations. The VeNRA architecture takes this to its logical conclusion: route numerical calculations through deterministic Python execution, not through the LLM. The LLM handles language; the deterministic layer handles arithmetic. If reproducibility of specific outputs is required, don't trust a probabilistic system to produce them.
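The split between language and arithmetic can be sketched as follows. This is not VeNRA's implementation; it assumes, purely for illustration, that the LLM has been asked to return an arithmetic expression as a string, and it evaluates that string with a small whitelist-based evaluator built on Python's `ast` module. The `evaluate` function and the supported operator set are hypothetical.

```python
import ast
import operator

# Only these operators are allowed; anything else is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def evaluate(expression):
    """Deterministically evaluate a plain arithmetic expression string."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant):
            # Accept numbers only; bool is excluded even though it subclasses int.
            if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
                return node.value
        elif isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return walk(ast.parse(expression, mode="eval"))


# Hypothetical expression produced by the language layer.
llm_expression = "(1200 - 450) * 2"
print(evaluate(llm_expression))
```

The model can still pick the wrong expression, so the expression itself should be logged. What this design guarantees is narrower: the same expression always evaluates to the same number, and function calls or names in the string are rejected rather than executed.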

Replicate-then-compare for high-stakes decisions. For decisions where an LLM error would be costly, run the same query multiple times and check for consistency. If outputs differ meaningfully across runs, flag for human review. Inconsistency is a useful signal that the model is operating near a decision boundary.
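One way to sketch this check, assuming a `generate` callable that wraps whatever model call you use. The callable, the `replicate_and_compare` helper, the whitespace-and-case normalization, and the 0.8 agreement threshold are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import cycle


def replicate_and_compare(generate, prompt, runs=5, min_agreement=0.8):
    """Call `generate` several times and flag the result if the runs disagree."""
    outputs = [generate(prompt) for _ in range(runs)]
    # Crude normalization: collapse whitespace and ignore case.
    normalized = [" ".join(text.split()).lower() for text in outputs]
    answer, votes = Counter(normalized).most_common(1)[0]
    agreement = votes / runs
    return {
        "answer": answer,
        "agreement": agreement,
        "needs_review": agreement < min_agreement,
        "outputs": outputs,
    }


# Stub generators standing in for a real model call.
def stable_model(prompt):
    return "Approve"


_flaky_answers = cycle(["approve", "reject"])


def flaky_model(prompt):
    return next(_flaky_answers)


stable = replicate_and_compare(stable_model, "Is this claim valid?", runs=4)
flaky = replicate_and_compare(flaky_model, "Is this claim valid?", runs=4)

print(stable["agreement"], stable["needs_review"])
print(flaky["agreement"], flaky["needs_review"])
```

Exact matching after normalization only suits short, categorical answers. For free-form text, "differ meaningfully" needs a task-specific comparison, such as comparing extracted fields or final decisions instead of whole strings.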

Design for auditability, not determinism. In many cases, perfect determinism is the wrong goal. The right goal is: can I understand, explain, and verify this output? A non-deterministic system that produces outputs with clear source citations and explicit reasoning is more auditable than a deterministic black box. Design for auditability first; determinism is one path to it, but not the only one.

The Honest Framing

Temperature 0 gives you more reproducibility, not complete reproducibility. If your system has hard requirements on determinism - regulatory, scientific, or practical - you need to design those requirements explicitly into the architecture rather than relying on a parameter setting to guarantee them.

The teams that handle this well don't try to make the LLM deterministic. They architect around the LLM's non-determinism so that the parts of the system that need to be reproducible are handled by systems that actually are.
