Detecting LLM Hallucinations in Under 200ms: A Production Architecture
How to instrument real-time hallucination detection without adding perceptible latency to your LLM workflows.
The common objection to real-time hallucination detection is latency. If verifying an LLM output takes 3 seconds, you have not improved your AI workflow — you have just added a slow step before the broken output reaches your users.
The objection is valid. The 3-second assumption is not.
In production, Fact AI Lab's verification layer adds a median of 180ms to a verified request. P95 latency is under 320ms. That is imperceptible in the workflows we target: research summarization, document Q&A, compliance screening, report generation. These are tasks where users expect a 1–4 second response time. Adding 180ms does not change the user experience.
This post explains the architecture that makes sub-200ms verification possible.
The problem with naive verification
The obvious implementation of hallucination detection is a second LLM call: produce the output, then ask a second model "is this faithful to the source?" At 1–2 seconds per call, the two sequential calls take 2–4 seconds in total, so the verification call alone roughly doubles response time. This is the approach most teams attempt first, and it explains why they conclude real-time verification is impractical.
Two problems with this approach:
First, the verifying model can hallucinate in exactly the same way as the original. You are not measuring faithfulness — you are measuring whether one stochastic process agrees with another.
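Second, latency is a product of the architecture, not the task. Two sequential LLM calls are slow because each is a full generative pass and the second cannot start until the first finishes. The solution is to stop using general-purpose LLM calls for verification and use small, task-specific models instead.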
The two-stage verification architecture
Stage 1: Claim extraction (30–50ms)
Extract atomic factual claims from the LLM output. A claim is a unit of information that is independently verifiable: "The company reported $4.2B in revenue in Q3 2024" is a claim. "The company performed well last quarter" is not — it is an interpretation.
We use a fine-tuned small language model (under 1B parameters) running on dedicated inference infrastructure for claim extraction. Small models are fast. At 30–50ms, extraction does not dominate the latency budget.
Stage 2: Source cross-reference (80–120ms)
For each extracted claim, retrieve the most relevant passages from the source document set using vector similarity search, then score the claim against those passages using a specialized entailment model.
The entailment model is not a general-purpose LLM. It is a fine-tuned NLI (natural language inference) model optimized for the entailment task: given a claim and a passage, does the passage support, contradict, or fail to address the claim? These models run in 5–15ms per claim, and claims can be scored in parallel.
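The vector retrieval step (50–80ms) and the entailment scoring step (20–40ms total for typical outputs of 5–15 claims) run concurrently where possible. Scoring a claim depends on that claim's retrieved passages, so the overlap comes from working across claims: scoring can start on the first claims while retrieval for later ones is still in flight.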
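The per-claim fan-out described above can be sketched with asyncio. This is a minimal illustration, not the production implementation: `retrieve` and `entail` are hypothetical stand-ins for a vector store query and an NLI model call, the two-dimensional vectors are toy data, and the substring check only mimics an entailment decision.

```python
import asyncio
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

async def retrieve(claim, passages, k=1):
    """Stand-in for a vector store query: top-k passages by cosine similarity."""
    await asyncio.sleep(0)  # yields to the event loop, as a real I/O call would
    ranked = sorted(passages, key=lambda p: cosine(claim["vec"], p["vec"]), reverse=True)
    return ranked[:k]

async def entail(claim, passage):
    """Hypothetical NLI call. This stub uses substring matching and never
    returns "contradicted"; a real model would return one of three labels."""
    await asyncio.sleep(0)
    hit = claim["text"].lower() in passage["text"].lower()
    return "supported" if hit else "unaddressed"

async def verify_claim(claim, passages):
    # Retrieval must finish before this claim can be scored...
    top = await retrieve(claim, passages)
    labels = await asyncio.gather(*(entail(claim, p) for p in top))
    return "supported" if "supported" in labels else "unaddressed"

async def verify_all(claims, passages):
    # ...but separate claims do not wait on each other.
    return await asyncio.gather(*(verify_claim(c, passages) for c in claims))

# Toy data, invented for illustration only.
passages = [
    {"text": "The company reported $4.2B in revenue in Q3 2024.", "vec": [1.0, 0.0]},
    {"text": "Headcount grew 5% year over year.", "vec": [0.0, 1.0]},
]
claims = [
    {"text": "reported $4.2B in revenue", "vec": [0.9, 0.1]},
    {"text": "headcount fell 10%", "vec": [0.1, 0.9]},
]

results = asyncio.run(verify_all(claims, passages))
print(results)
```

Each claim runs its own retrieve-then-score sequence, and `asyncio.gather` lets those sequences interleave, which is where the concurrency comes from.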
Output: verification record (< 5ms)
Aggregate the per-claim scores into a document-level verification score, flag any unsupported or contradicted claims, and write the result to the audit log with a cryptographic hash. This step is fast because it is computation, not inference.
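A minimal sketch of the aggregation step, assuming each claim has already been assigned one of three labels. The `ClaimResult` type and `aggregate` function are illustrative names, and the handling of an output with no extracted claims is an assumption, not something this post specifies.

```python
from dataclasses import dataclass

@dataclass
class ClaimResult:
    text: str
    label: str  # "supported", "contradicted", or "unaddressed"

def aggregate(results):
    """Roll per-claim labels up into a document-level score and a flag list."""
    if not results:
        # Assumption: with no claims there is nothing to score.
        return {"score": None, "flagged": []}
    supported = sum(1 for r in results if r.label == "supported")
    flagged = [r for r in results if r.label != "supported"]
    return {"score": supported / len(results), "flagged": flagged}

summary = aggregate([
    ClaimResult("Revenue was $4.2B in Q3 2024", "supported"),
    ClaimResult("Operating margin was 18%", "supported"),
    ClaimResult("Headcount grew 5%", "supported"),
    ClaimResult("Debt was fully repaid", "contradicted"),
])
print(summary["score"], [r.text for r in summary["flagged"]])
```

Nothing here calls a model, which is why this step fits in a few milliseconds.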
Total median time: 180ms. The multi-second verification you were afraid of was an artifact of the wrong architecture.
What the verification record contains
For each verified output, the record includes:
- Verification score (0.0–1.0): The proportion of extracted claims that are supported by source documents. A score under 0.7 triggers a compliance flag in most configurations.
- Claim list: Each atomic claim with its individual entailment score and the source passage(s) used.
- Flagged claims: Claims scored as unsupported or contradicted, with the relevant source context.
- Metadata: Model version, timestamp, document set hash, verification model version.
- Chain hash: SHA-256 hash linking this record to the previous one in the audit log.
This record is what makes the audit trail useful. Not just "this output was verified" but "here is exactly which claims were verified, against which source passages, with which scores."
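The chain hash can be sketched with the standard library. This is one plausible construction under stated assumptions, not the production format: the all-zero genesis value, the canonical JSON serialization, and the function names are all illustrative.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed placeholder "previous hash" for the first record

def chain_hash(record, prev_hash):
    """SHA-256 over the previous hash plus a canonical JSON form of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

def append_record(log, record):
    """Append a record, linking it to the last entry in the log."""
    prev_hash = log[-1]["chain_hash"] if log else GENESIS_HASH
    entry = {
        "record": record,
        "prev_hash": prev_hash,
        "chain_hash": chain_hash(record, prev_hash),
    }
    log.append(entry)
    return entry

def verify_chain(log):
    """Recompute every link; any edited or reordered record breaks the chain."""
    prev = GENESIS_HASH
    for entry in log:
        if entry["prev_hash"] != prev:
            return False
        if entry["chain_hash"] != chain_hash(entry["record"], prev):
            return False
        prev = entry["chain_hash"]
    return True

log = []
append_record(log, {"score": 0.93, "flagged": []})
append_record(log, {"score": 0.60, "flagged": ["Debt was fully repaid"]})
print(verify_chain(log))
```

Because each hash covers the previous one, changing any earlier record invalidates every later link, which is what makes after-the-fact edits detectable.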
Accuracy in production
On SEC filing summarization tasks, our claim extraction model identifies 94% of verifiable factual claims in a typical output. The 6% miss rate consists primarily of implicit claims and comparative statements that are difficult to decompose into atomic form.
Of claims that are extracted and scored, the entailment model achieves 96.2% accuracy on our evaluation set (human-labeled claim-passage pairs from financial documents). The combined system — extraction plus entailment — produces a hallucination detection rate of approximately 91% on our full evaluation benchmark.
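As a rough sanity check, the two stage-level figures can be multiplied together. This assumes that extraction misses and entailment errors are independent, and that entailment accuracy carries over directly to hallucinated claims; both are simplifications, and the benchmark figure is measured rather than derived.

```python
extraction_recall = 0.94     # share of verifiable claims the extractor finds
entailment_accuracy = 0.962  # accuracy on claims that are extracted and scored

# A hallucinated claim is caught only if it is extracted AND scored correctly.
estimate = extraction_recall * entailment_accuracy
print(f"{estimate:.1%}")  # 90.4%
```

The estimate lands within about a point of the reported benchmark figure.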
The 9% gap is an honest acknowledgment that no verification system catches everything. What it does catch (91% of hallucinations, at a median under 200ms, with a complete audit record) is a fundamentally different risk posture than catching none.
The latency budget in practice
For a typical research summarization workflow:
- LLM inference: 1,200ms (P50)
- Verification: 180ms (P50)
- Total user-facing time: 1,380ms
The verification step accounts for 13% of total latency, a 15% increase over inference alone. For the compliance benefit (a post-verification hallucination rate under 0.3% and a complete audit record), that is a straightforward trade.
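The arithmetic behind the budget, using the P50 figures listed above. The two ratios answer different questions, so both are shown.

```python
llm_ms = 1200    # LLM inference, P50
verify_ms = 180  # verification, P50

total_ms = llm_ms + verify_ms
share_of_total = verify_ms / total_ms   # verification as a share of the full request
increase_over_llm = verify_ms / llm_ms  # added time relative to inference alone

print(total_ms)                     # 1380
print(f"{share_of_total:.1%}")      # 13.0%
print(f"{increase_over_llm:.1%}")   # 15.0%
```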
The teams we work with do not experience the verification step as latency. They experience it as a guarantee.
The architecture described here is what Fact AI Lab deploys in customer environments. If you want to understand how it applies to your specific LLM workflows, the conversation takes 20 minutes.