LLM-as-a-Judge: What It Is, How to Use It, and When Not To

Large Language Models are excellent at generating text — but evaluating whether that text is correct, safe, or helpful is a different problem entirely. This is where LLM-as-a-Judge comes in. However, using it incorrectly can make your evaluation slower, flaky, and misleading.

1. What Is LLMJudge?

LLMJudge is an evaluation primitive that uses another LLM to judge the output of your system. Instead of asking:

“Does this output equal X?”

you ask:

“Does this output satisfy a qualitative requirement?”

Examples:

  • Is the response factually accurate?
  • Is it semantically equivalent to an expected answer?
  • Does it use only the provided context?
  • Is the tone professional?

These are things hard-coded logic cannot reliably determine. So we delegate judgment to an LLM.

2. When Should You Use LLMJudge?

Use LLMJudge only when evaluation requires understanding.

Good Use Cases:

  • Semantic equivalence
  • Groundedness in RAG
  • Citation correctness
  • Tone/style compliance
  • Instruction following
  • Completeness and helpfulness

Bad Use Cases:

  • Type checking
  • Exact matching
  • Schema validation
  • Tool call verification
  • Performance or timing checks

Rule of thumb: If you can write if output == expected, don’t use an LLM.

3. Basic Usage of LLMJudge

The simplest form:

LLMJudge(
    rubric='Response is factually accurate'
)

This judge:

  • Sees only the model output
  • Returns pass/fail
  • Includes a short reason

This is useful for high-level quality checks.
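
To make this concrete, here is a minimal sketch of wiring such a judge into a test run. It assumes a pydantic-evals-style harness (Case, Dataset, evaluate_sync); the case and the task function are placeholders, so adapt the wiring to whatever framework you use.

# Sketch: assumes a pydantic-evals-style harness (Case, Dataset, evaluate_sync).
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge


async def answer(question: str) -> str:
    # Stand-in for the system under test.
    return 'Paris is the capital of France.'


dataset = Dataset(
    cases=[Case(name='capital', inputs='What is the capital of France?')],
    evaluators=[LLMJudge(rubric='Response is factually accurate')],
)

report = dataset.evaluate_sync(answer)
report.print()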

4. The Most Important Parameter: rubric

The rubric is the entire evaluation contract.

Bad Rubrics (Too Vague):

LLMJudge(rubric='Good response')
LLMJudge(rubric='Check quality')

The judge has no idea what “good” means.

Good Rubrics (Specific):

LLMJudge(
    rubric='Response answers the user question without hallucinating facts'
)

Excellent Rubrics (Structured):

LLMJudge(
    rubric='''
    Response must:
    1. Directly answer the question
    2. Use only information from the provided context
    3. Acknowledge uncertainty if context is insufficient
    '''
)

The more precise the rubric, the more stable the evaluation.

5. Controlling What the Judge Sees

By default, LLMJudge sees only the output. You can expand its context.

include_input=True

Use this when correctness depends on the user question.

LLMJudge(
    rubric='Response correctly answers the user question',
    include_input=True,
)

include_expected_output=True

Use this for comparative evaluation.

LLMJudge(
    rubric='Response is semantically equivalent to the expected output',
    include_input=True,
    include_expected_output=True,
)

This is essential for translation, summarization, and normalization tasks.
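
Here is a sketch of such a comparative case, again assuming a pydantic-evals-style Case/Dataset; the translation pair is purely illustrative.

# Sketch: case values are illustrative; assumes a pydantic-evals-style harness.
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

dataset = Dataset(
    cases=[
        Case(
            name='greeting_translation',
            inputs='Translate to French: Good morning',
            expected_output='Bonjour',
        ),
    ],
    evaluators=[
        LLMJudge(
            rubric='Response is semantically equivalent to the expected output',
            include_input=True,
            include_expected_output=True,
        ),
    ],
)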

6. Assertions vs Scores

LLMJudge can produce assertions, scores, or both.

Assertion (Pass / Fail):

LLMJudge(
    rubric='Response is accurate',
    assertion={'include_reason': True}
)

Use this when failure should block deployment.

Score (0.0 → 1.0):

LLMJudge(
    rubric='Response quality',
    score={'include_reason': True},
    assertion=False
)

Use this for:

  • Ranking
  • Regression tracking
  • Continuous quality monitoring

Both Together:

LLMJudge(
    rubric='Response quality',
    score={'include_reason': True},
    assertion={'include_reason': True},
)

This gives:

  • A numeric signal
  • A hard gate

7. Custom Evaluation Names

Naming matters for reports.

LLMJudge(
    rubric='Response is factually accurate',
    assertion={
        'evaluation_name': 'accuracy',
        'include_reason': True,
    },
)

Instead of a generic “LLMJudge ✔”, you get:

accuracy: ✔

This scales well when you have multiple judges.

8. Choosing the Judge Model

Not all evaluations need a premium model.

Cheap & Simple Checks:

LLMJudge(
    rubric='Response contains no profanity',
    model='openai:gpt-5-mini'
)

Nuanced Reasoning:

LLMJudge(
    rubric='Response demonstrates deep understanding of legal reasoning',
    model='anthropic:claude-opus-4-20250514'
)

Consistency Tip: Always use:

ModelSettings(temperature=0.0)

Judges should be boring and predictable.
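
For example, assuming the judge accepts a model_settings parameter (as in pydantic-evals, with ModelSettings coming from pydantic-ai), pinning the temperature looks like this:

# Assumes LLMJudge accepts model_settings and ModelSettings comes from pydantic-ai.
from pydantic_ai.settings import ModelSettings
from pydantic_evals.evaluators import LLMJudge

judge = LLMJudge(
    rubric='Response demonstrates deep understanding of legal reasoning',
    model='anthropic:claude-opus-4-20250514',
    # Temperature 0 keeps repeated judge runs as stable as possible.
    model_settings=ModelSettings(temperature=0.0),
)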

9. Multi-Aspect Evaluation (Recommended)

Do not overload a single judge.

Bad:

LLMJudge(rubric='Response is accurate, helpful, safe, and professional')

Good:

evaluators = [
    LLMJudge(rubric='Response is factually accurate', assertion={'evaluation_name': 'accurate'}),
    LLMJudge(rubric='Response is helpful', score={'evaluation_name': 'helpfulness'}),
    LLMJudge(rubric='Response uses professional tone', assertion={'evaluation_name': 'tone'}),
]

Each judge answers one question.

10. Deterministic Evaluation vs LLM-as-a-Judge

Let’s understand this through an example.

10.1 What Is a Deterministic Check?

A deterministic check is an evaluation where:

  • There is only one correct answer
  • The answer can be verified using simple logic
  • No interpretation or “understanding” is required

If the output is correct, the check passes. If not, it fails. No probabilities. No judgment. No ambiguity.

Simple Deterministic Examples

Example 1: Math

Question:

What is 2 + 2?

Correct answer:

4

This does not require an LLM to evaluate. You don’t need reasoning. You don’t need language understanding.

output == 4

That’s enough.

Example 2: Fact with One Correct Answer

Question:

Who scored a century in the 2011 Cricket World Cup final?

There is exactly one correct factual answer. If the system output is:

  • The correct name → pass
  • Anything else → fail

Again, this is deterministic.

Why You Should NOT Use LLMJudge Here

Using an LLM to evaluate:

“Is 2 + 2 equal to 4?”

is a mistake because:

  • LLMs are probabilistic
  • They can hallucinate
  • They are slower and cost money
  • They introduce unnecessary instability

If a check can be written as exact logic, use exact logic. That’s why evaluation frameworks provide deterministic evaluators like:

  • IsInstance → type checking
  • EqualsExpected → exact match
  • Contains → substring presence

These are fast, cheap, and reliable.
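
In code, they look roughly like this; the constructor arguments follow pydantic-evals and may differ in your framework.

# Sketch: argument names follow pydantic-evals; adjust to your framework.
from pydantic_evals.evaluators import Contains, EqualsExpected, IsInstance

evaluators = [
    IsInstance(type_name='str'),  # output must be a string
    EqualsExpected(),             # output must equal the case's expected_output
    Contains(value='Paris'),      # output must contain this substring
]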

10.2 When Deterministic Checks Break Down

Now let’s look at a case where deterministic checks fail completely.

Example: Same Meaning, Different Representation

A user says:

“Set the event date to 1st January 2026.”

Your system stores dates internally like this:

"2026-01-01T00:00:00Z"

Now ask yourself: Are these two values the same?

From a string comparison perspective:

  • ❌ "1st January 2026" ≠ "2026-01-01T00:00:00Z"

From a human meaning perspective:

  • ✅ They represent the same date

This is where deterministic logic fails. No ==, no regex, no hard-coded rules can reliably answer:

“Do these two values mean the same thing?”
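
A plain comparison makes the failure concrete:

stored = '2026-01-01T00:00:00Z'
requested = '1st January 2026'

print(stored == requested)   # False: the strings differ
print(requested in stored)   # False: not even a substring match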

10.3 This Is Where LLMJudge Is Needed

This is exactly where LLM-as-a-Judge becomes useful. Here, you are no longer asking:

“Is this exactly equal?”

you are asking:

“Is this semantically equivalent?”

That requires:

  • Language understanding
  • Concept normalization
  • Ignoring formatting differences

This is what LLMs are good at. So we delegate only this part to LLMJudge.

You intentionally exclude structure checks from the LLM. That separation is the key design insight.
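
Put together, the separation looks roughly like this. It is a sketch: the Case values and the evaluation_name are illustrative, and the wiring again assumes a pydantic-evals-style harness.

# Sketch: deterministic structure check + LLM judge for meaning only.
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import IsInstance, LLMJudge

dataset = Dataset(
    cases=[
        Case(
            name='event_date',
            inputs='Set the event date to 1st January 2026.',
            expected_output='2026-01-01T00:00:00Z',
        ),
    ],
    evaluators=[
        # Structure: deterministic, cheap, no LLM involved.
        IsInstance(type_name='str'),
        # Meaning: the only part delegated to the judge.
        LLMJudge(
            rubric='Output refers to the same calendar date as the expected output',
            include_input=True,
            include_expected_output=True,
            assertion={'evaluation_name': 'date_equivalence', 'include_reason': True},
        ),
    ],
)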