LLM-as-a-Judge: What It Is, How to Use It, and When Not To
Large Language Models are excellent at generating text, but evaluating whether that text is correct, safe, or helpful is a different problem entirely. This is where LLM-as-a-Judge comes in. Used incorrectly, however, it can make your evaluation slower, flakier, and misleading.
1. What Is LLMJudge?
LLMJudge is an evaluation primitive that uses another LLM to judge the output of your system. Instead of asking:
“Does this output equal X?”
you ask:
“Does this output satisfy a qualitative requirement?”
Examples:
- Is the response factually accurate?
- Is it semantically equivalent to an expected answer?
- Does it use only the provided context?
- Is the tone professional?
These are things hard-coded logic cannot reliably determine. So we delegate judgment to an LLM.
2. When Should You Use LLMJudge?
Use LLMJudge only when evaluation requires understanding.
Good Use Cases:
- Semantic equivalence
- Groundedness in RAG
- Citation correctness
- Tone/style compliance
- Instruction following
- Completeness and helpfulness
Bad Use Cases:
- Type checking
- Exact matching
- Schema validation
- Tool call verification
- Performance or timing checks
Rule of thumb:
If you can write if output == expected, don’t use an LLM.
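To make the contrast concrete, here is a minimal sketch. It assumes the Pydantic Evals library, whose LLMJudge matches the snippets in this article; substitute your own framework's equivalents if you use something else.
from pydantic_evals.evaluators import EqualsExpected, LLMJudge

# Deterministic requirement: one correct answer, checkable with exact logic.
exact_match = EqualsExpected()

# Qualitative requirement: needs understanding, so delegate it to a judge.
semantic_match = LLMJudge(
    rubric='Response is semantically equivalent to the expected output',
    include_expected_output=True,
)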
3. Basic Usage of LLMJudge
The simplest form:
LLMJudge(
    rubric='Response is factually accurate'
)
This judge:
- Sees only the model output
- Returns pass/fail
- Includes a short reason
This is useful for high-level quality checks.
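In practice, the judge is attached to a dataset of cases and run against your task function. The sketch below assumes the Pydantic Evals API (Case, Dataset, evaluate_sync); answer_question is a hypothetical stand-in for your own system.
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

async def answer_question(question: str) -> str:
    # Your real system (an agent, a RAG pipeline, ...) goes here.
    return 'Paris is the capital of France.'

dataset = Dataset(
    cases=[Case(name='capital_of_france', inputs='What is the capital of France?')],
    evaluators=[LLMJudge(rubric='Response is factually accurate')],
)

report = dataset.evaluate_sync(answer_question)
report.print()  # per-case summary of the judge's verdicts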
4. The Most Important Parameter: rubric
The rubric is the entire evaluation contract.
Bad Rubrics (Too Vague):
LLMJudge(rubric='Good response')
LLMJudge(rubric='Check quality')
The judge has no idea what “good” means.
Good Rubrics (Specific):
LLMJudge(
    rubric='Response answers the user question without hallucinating facts'
)
Excellent Rubrics (Structured):
LLMJudge(
    rubric='''
    Response must:
    1. Directly answer the question
    2. Use only information from the provided context
    3. Acknowledge uncertainty if context is insufficient
    '''
)
The more precise the rubric, the more stable the evaluation.
5. Controlling What the Judge Sees
By default, LLMJudge sees only the output. You can expand its context.
include_input=True
Use this when correctness depends on the user question.
LLMJudge(
    rubric='Response correctly answers the user question',
    include_input=True,
)
include_expected_output=True
Use this for comparative evaluation.
LLMJudge(
    rubric='Response is semantically equivalent to the expected output',
    include_input=True,
    include_expected_output=True,
)
This is essential for translation, summarization, and normalization tasks.
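For example, a comparison case might look like this end to end (again assuming the Pydantic Evals API; the case content is made up for illustration):
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

dataset = Dataset(
    cases=[
        Case(
            name='normalize_due_date',
            inputs='Summarize: invoice #123 is due on 1 March 2026.',
            expected_output='Invoice #123 is due on 2026-03-01.',
        ),
    ],
    evaluators=[
        LLMJudge(
            rubric='Response is semantically equivalent to the expected output',
            include_input=True,
            include_expected_output=True,
        ),
    ],
)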
6. Assertions vs Scores
LLMJudge can produce assertions, scores, or both.
Assertion (Pass / Fail):
LLMJudge(
    rubric='Response is accurate',
    assertion={'include_reason': True}
)
Use this when failure should block deployment.
Score (0.0 → 1.0):
LLMJudge(
    rubric='Response quality',
    score={'include_reason': True},
    assertion=False
)
Use this for:
- Ranking
- Regression tracking
- Continuous quality monitoring
Both Together:
LLMJudge(
    rubric='Response quality',
    score={'include_reason': True},
    assertion={'include_reason': True},
)
This gives:
- A numeric signal
- A hard gate
7. Custom Evaluation Names
Naming matters for reports.
LLMJudge(
    rubric='Response is factually accurate',
    assertion={
        'evaluation_name': 'accuracy',
        'include_reason': True,
    },
)
Instead of generic “LLMJudge ✔”, you get:
accuracy: ✔
This scales well when you have multiple judges.
8. Choosing the Judge Model
Not all evaluations need a premium model.
Cheap & Simple Checks:
LLMJudge(
    rubric='Response does not contain profanity',
    model='openai:gpt-5-mini'
)
Nuanced Reasoning:
LLMJudge(
    rubric='Response demonstrates deep understanding of legal reasoning',
    model='anthropic:claude-opus-4-20250514'
)
Consistency Tip: Always use:
ModelSettings(temperature=0.0)
Judges should be boring and predictable.
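A sketch of that, assuming the judge accepts a model_settings parameter (as Pydantic Evals' LLMJudge does) and that ModelSettings comes from pydantic_ai:
from pydantic_ai.settings import ModelSettings
from pydantic_evals.evaluators import LLMJudge

accuracy_judge = LLMJudge(
    rubric='Response is factually accurate',
    model='openai:gpt-5-mini',
    # temperature=0.0 keeps the judge's verdicts as repeatable as possible
    model_settings=ModelSettings(temperature=0.0),
)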
9. Multi-Aspect Evaluation (Recommended)
Do not overload a single judge.
Bad:
LLMJudge(rubric='Response is accurate, helpful, safe, and professional')
Good:
evaluators = [
    LLMJudge(rubric='Response is factually accurate', assertion={'evaluation_name': 'accurate'}),
    LLMJudge(rubric='Response is helpful', score={'evaluation_name': 'helpfulness'}),
    LLMJudge(rubric='Response uses professional tone', assertion={'evaluation_name': 'tone'}),
]
Each judge answers one question.
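The list plugs straight into a dataset, so every case is checked against each aspect and every evaluation_name shows up as its own label in the report. A sketch, assuming Pydantic Evals:
from pydantic_evals import Case, Dataset

dataset = Dataset(
    cases=[Case(name='refund_policy', inputs='Explain our refund policy.')],
    evaluators=evaluators,  # the single-aspect judges defined above
)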
10. Deterministic Evaluation vs LLM-as-a-Judge
Let’s understand this through an example.
10.1 What Is a Deterministic Check?
A deterministic check is an evaluation where:
- There is only one correct answer
- The answer can be verified using simple logic
- No interpretation or “understanding” is required
If the output is correct, the check passes. If not, it fails. No probabilities. No judgment. No ambiguity.
Simple Deterministic Examples
Example 1: Math
Question:
What is 2 + 2?
Correct answer:
4
This does not require an LLM to evaluate. You don’t need reasoning. You don’t need language understanding.
output == 4
That’s enough.
Example 2: Fact with One Correct Answer
Question:
Who scored 100 runs in the 2011 Cricket World Cup final?
There is exactly one correct factual answer. If the system output is:
- The correct name → pass
- Anything else → fail
Again, this is deterministic.
Why You Should NOT Use LLMJudge Here
Using an LLM to evaluate:
“Is 2 + 2 equal to 4?”
is a mistake because:
- LLMs are probabilistic
- They can hallucinate
- They are slower and cost money
- They introduce unnecessary instability
If a check can be written as exact logic, use exact logic. That’s why evaluation frameworks provide deterministic evaluators like:
- IsInstance → type checking
- EqualsExpected → exact match
- Contains → substring presence
These are fast, cheap, and reliable.
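Configured, those checks might look like this (import path assumes Pydantic Evals):
from pydantic_evals.evaluators import Contains, EqualsExpected, IsInstance

evaluators = [
    IsInstance(type_name='str'),  # type checking
    EqualsExpected(),             # exact match against the case's expected_output
    Contains(value='4'),          # substring presence
]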
10.2 When Deterministic Checks Break Down
Now let’s look at a case where deterministic checks fail completely.
Example: Same Meaning, Different Representation
A user says:
“Set the event date to 1st January 2026.”
Your system stores dates internally like this:
"2026-01-01T00:00:00Z"
Now ask yourself: Are these two values the same?
From a string comparison perspective:
- ❌ "1st January 2026" ≠ "2026-01-01T00:00:00Z"
From a human meaning perspective:
- ✅ They represent the same date
This is where deterministic logic fails. No ==, no regex, no hard-coded rules can reliably answer:
“Do these two values mean the same thing?”
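In code, the mismatch is obvious:
user_phrase = '1st January 2026'
stored_value = '2026-01-01T00:00:00Z'

print(user_phrase == stored_value)  # False, even though both mean the same date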
10.3 This Is Where LLMJudge Is Needed
This is the exact moment where LLM-as-a-Judge becomes useful. Here, you are no longer asking:
“Is this exactly equal?”
you are asking:
“Is this semantically equivalent?”
That requires:
- Language understanding
- Concept normalization
- Ignoring formatting differences
This is what LLMs are good at. So we delegate only this part to LLMJudge.
You intentionally exclude structure checks from the LLM. That separation is the key design insight.
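Putting the two together for the date example, a sketch (Pydantic Evals assumed; the case name and values come straight from the scenario above):
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import IsInstance, LLMJudge

dataset = Dataset(
    cases=[
        Case(
            name='set_event_date',
            inputs='Set the event date to 1st January 2026.',
            expected_output='2026-01-01T00:00:00Z',
        ),
    ],
    evaluators=[
        # Structure: deterministic, no LLM involved.
        IsInstance(type_name='str'),
        # Meaning: only this part is delegated to the judge.
        LLMJudge(
            rubric='The output represents the same date as the expected output, ignoring formatting differences',
            include_input=True,
            include_expected_output=True,
            assertion={'evaluation_name': 'same_date', 'include_reason': True},
        ),
    ],
)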