SelfCheckGPT: How to Catch LLM Hallucinations by Making Models Argue With Themselves
Hook
According to recent studies, your LLM is lying to you in roughly 15-30% of its responses. The twist? You can catch these hallucinations by simply asking the model the same question multiple times and watching it contradict itself.
Context
Large language models have a credibility problem. They confidently generate plausible-sounding answers that are factually wrong, and they do it with the same authoritative tone they use for accurate information. Traditional fact-checking requires either external knowledge bases (expensive to maintain, never comprehensive enough) or human review (doesn’t scale). Retrieval-augmented generation helps, but it’s architecturally complex and still vulnerable when retrieved documents are misleading.
SelfCheckGPT, introduced at EMNLP 2023 by researchers from Cambridge, takes a radically different approach: it exploits the stochastic nature of LLM sampling. The core insight is beautifully simple—if you ask an LLM to generate the same response multiple times with temperature > 0, factual information tends to appear consistently across samples while hallucinations vary. A model that truly “knows” Barack Obama was the 44th US president will mention this fact reliably across generations. But if it hallucinates that he served from 2007-2015 (two years off), subsequent samples might say 2009-2017 or omit dates entirely. This inconsistency is the tell. The framework requires zero external resources: no knowledge bases, no labeled training data for your specific domain, no retrieval infrastructure. Just the ability to sample from your LLM multiple times.
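The intuition fits in a few lines of Python. The sketch below is a toy illustration of the consistency signal itself, not the library's API: it just measures how often a given detail recurs across sampled responses.

```python
# Toy illustration of the core insight (not the library): a detail the model
# truly "knows" recurs across samples, while a hallucinated detail varies.
samples = [
    "Obama was the 44th president, serving from 2009 to 2017.",
    "Barack Obama, the 44th US president, took office in 2009.",
    "Obama served as the 44th president starting in 2007.",  # inconsistent year
]

def support_rate(detail: str, samples: list[str]) -> float:
    """Fraction of sampled responses that mention a given detail."""
    return sum(detail in s for s in samples) / len(samples)

print(support_rate("44th", samples))  # consistent fact -> 1.0
print(support_rate("2007", samples))  # inconsistent detail -> ~0.33
```

A detail with a low support rate is suspect; the real methods below replace exact substring matching with semantic similarity, entailment, or QA agreement.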
Technical Insight
SelfCheckGPT operates at sentence granularity, scoring each sentence in your original LLM response against multiple sampled responses. The library provides five distinct detection methods, each with different accuracy-cost profiles.
The simplest approach is n-gram overlap, which counts how many word sequences from the original sentence appear in the sampled passages. It’s fast but crude. BERTScore uses contextualized embeddings to measure semantic similarity between the original sentence and each sample, then averages the scores. Lower average similarity means higher hallucination probability. Here’s how the basic workflow looks:
```python
from selfcheckgpt.modeling_selfcheck import SelfCheckBERTScore

# Initialize with a sentence similarity model
checker = SelfCheckBERTScore(rescale_with_baseline=True)

# Your LLM's original response
original = "Barack Obama was born in Hawaii in 1961. He served as the 44th president from 2009 to 2017."

# Generate 5 additional samples from your LLM with temp > 0
samples = [
    "Barack Obama, born in Honolulu in 1961, was the 44th US president serving two terms starting in 2009.",
    "The 44th president Barack Obama was born in Hawaii and served from 2009-2017.",
    "Obama was born in 1961 in Hawaii and became president in 2009.",
    "Barack Obama served as president for eight years beginning in 2009.",
    "Born in Hawaii, Barack Obama was elected as the 44th president in 2008.",
]

# Split into sentences (a naive split; a proper sentence segmenter is more robust)
sentences = original.split(". ")

# Score each sentence (lower = more consistent = less likely hallucination)
scores = checker.predict(
    sentences=sentences,
    sampled_passages=samples,
)

for sent, score in zip(sentences, scores):
    print(f"Score: {score:.3f} | {sent}")
```
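The n-gram method mentioned above can be approximated in plain Python. This is a simplified sketch: the library's SelfCheckNgram actually fits an n-gram language model over the samples, whereas this toy version just counts how many of a sentence's bigrams never appear in any sample.

```python
# Simplified overlap-style scorer (illustrative; not the library's SelfCheckNgram,
# which fits an n-gram language model over the sampled passages).
def bigrams(text: str) -> set[tuple[str, str]]:
    words = text.lower().split()
    return set(zip(words, words[1:]))

def ngram_hallucination_score(sentence: str, samples: list[str]) -> float:
    """Higher = fewer of the sentence's bigrams appear in any sample."""
    sent_grams = bigrams(sentence)
    if not sent_grams:
        return 0.0
    sample_grams = set().union(*(bigrams(s) for s in samples))
    return len(sent_grams - sample_grams) / len(sent_grams)

samples = ["obama was born in hawaii", "obama was the 44th president"]
print(ngram_hallucination_score("Obama was born in Kenya", samples))  # -> 0.25
```

Only the bigram ("in", "kenya") is unsupported here, so one of four bigrams fails: a weak signal, which is exactly why the crude n-gram variant trails the embedding- and NLI-based methods.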
The most accurate method is SelfCheckNLI, which treats hallucination detection as a natural language inference problem. For each sentence in the original response, it checks whether that sentence is entailed by, contradicts, or is neutral with respect to each sampled passage. It uses DeBERTa-v3-large fine-tuned on Multi-NLI, specifically extracting the contradiction probability as the hallucination score.
```python
from selfcheckgpt.modeling_selfcheck import SelfCheckNLI

# This downloads a ~1.5GB fine-tuned DeBERTa model
checker = SelfCheckNLI(device="cuda:0")

scores = checker.predict(
    sentences=sentences,
    sampled_passages=samples,
)

# Scores > 0.5 typically indicate hallucinations
hallucination_threshold = 0.5
for sent, score in zip(sentences, scores):
    flag = "⚠️ HALLUCINATION" if score > hallucination_threshold else "✓ Consistent"
    print(f"{flag} | Score: {score:.3f} | {sent}")
The MQAG (Multiple-choice Question Answering and Generation) variant is particularly clever. For each sentence, it generates questions that the sentence answers, then uses a QA model to try answering those questions from each sampled passage. If the samples can’t answer the questions consistently, the original sentence is likely hallucinated. This method performs well but requires running both a question-generation and a question-answering model.
The newest addition is SelfCheckLLMPrompt, which uses a separate LLM (like GPT-4 or Claude) to judge whether the original sentence is supported by each sample. You craft a prompt that asks the LLM to score consistency. This is the most flexible approach and can leverage the reasoning capabilities of frontier models, but it’s also the most expensive since you’re making LLM API calls for every sentence.
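Mechanically, this reduces to a prompt template plus a Yes/No vote per sample. The sketch below uses a paraphrased template and a trivial stub judge in place of a real LLM call; both are illustrative, not the library's exact implementation.

```python
# Prompt-based consistency check, sketched. `judge` stands in for whatever LLM
# API you use; the template is paraphrased, not the library's exact default.
PROMPT = (
    "Context: {context}\n\n"
    "Sentence: {sentence}\n\n"
    "Is the sentence supported by the context above? Answer Yes or No."
)

def llm_prompt_score(sentence: str, samples: list[str], judge) -> float:
    """Average over samples: 0.0 if the judge says Yes, 1.0 otherwise.
    Higher = less supported = more likely hallucination."""
    votes = []
    for context in samples:
        reply = judge(PROMPT.format(context=context, sentence=sentence))
        votes.append(0.0 if reply.strip().lower().startswith("yes") else 1.0)
    return sum(votes) / len(votes)

# Trivial stub judge for illustration: says Yes when every word of the
# sentence appears somewhere in the context.
def toy_judge(prompt: str) -> str:
    context, rest = prompt.split("\n\nSentence: ")
    sentence = rest.split("\n\nIs the sentence")[0]
    words = set(sentence.lower().rstrip(".").split())
    return "Yes" if words <= set(context.lower().split()) else "No"
```

Swapping `toy_judge` for a real API call (and parsing the model's Yes/No) gives the production version; the per-sample voting structure stays the same.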
All methods output a score per sentence where higher scores indicate likely hallucinations. The key architectural decision is that each method is stateless and operates independently on sentence-sample pairs, making it trivial to parallelize across sentences or batch process multiple documents. The library separates concerns cleanly: you’re responsible for generating the samples (using whatever LLM infrastructure you have), and SelfCheckGPT handles only the consistency scoring.
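That statelessness makes fan-out straightforward. A sketch, with a dummy per-sentence scorer standing in for any of the methods above (for GPU-backed models you would batch inputs rather than thread, but the shape is the same):

```python
from concurrent.futures import ThreadPoolExecutor

# Dummy stateless scorer standing in for BERTScore / NLI / MQAG scoring.
def score_sentence(sentence: str, samples: list[str]) -> float:
    return sum(sentence not in s for s in samples) / len(samples)

def score_document(sentences: list[str], samples: list[str],
                   max_workers: int = 4) -> list[float]:
    """Score all sentences in parallel; order of results matches input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda s: score_sentence(s, samples), sentences))

print(score_document(["a", "z"], ["a b", "a c"]))  # -> [0.0, 1.0]
```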
One underappreciated detail: the number of samples matters dramatically. In the original paper’s ablations, most of the gain arrives within the first handful of samples. Fewer than 3 gives unreliable scores because statistical noise dominates; beyond roughly 6, returns diminish while inference costs grow linearly. In production, around 5 samples is a reasonable sweet spot.
Gotcha
The elephant in the room is cost. You’re running inference 5-6 times for every response you want to check. If you’re using GPT-4 or Claude, that’s 5-6x your API costs. If you’re self-hosting, that’s 5-6x your GPU time. For a customer-facing chatbot with tight latency requirements, this is often a non-starter. You can try to mitigate this with smaller, faster models for the samples or by only checking high-stakes responses, but you’re fundamentally trading computational resources for reliability.
The method also has a blind spot: systematic hallucinations. If your model was trained on misinformation or has a consistent knowledge gap, it might hallucinate the same wrong fact across all samples. SelfCheckGPT will score this as highly consistent and therefore not a hallucination. For example, if a model consistently believes a fake news story because it appeared frequently in training data, sampling won’t reveal the error. You’re detecting inconsistency, not incorrectness. This means SelfCheckGPT works best for catching random hallucinations (the model making up facts it doesn’t know) but struggles with confident errors (the model consistently stating something false it “learned”). You still need external validation for true fact-checking. Finally, sentence-level granularity can be limiting. A sentence like “He was born in 1961 and served from 2009-2017” contains multiple claims, and if only one is hallucinated, the aggregate score may not cross your threshold.
Verdict
Use if: You’re building production LLM applications where factual accuracy matters (medical, legal, financial domains), you can absorb 5-6x inference costs, and you lack domain-specific knowledge bases for fact-checking. SelfCheckNLI gives the best accuracy for most cases, though n-gram is viable for extreme budget constraints. It’s particularly valuable for long-form generation where you can afford to check asynchronously. Skip if: You need real-time responses with sub-second latency requirements, your LLM runs at low temperature or produces near-deterministic outputs (no variance to measure), or you’re already using RAG with high-quality retrieval and can verify against source documents directly. Also skip if you’re in a domain where the model has systematic biases—SelfCheckGPT won’t catch consistently wrong answers. For those cases, invest in external fact-checking APIs or human-in-the-loop review instead.