LLM Judge Output Evaluators

LLM Judge Output Evaluators use Language Models to assess the quality and semantic similarity of agent outputs. These evaluators are ideal for scenarios where deterministic comparison is insufficient and human-like judgment is needed.

Overview

There are two variants of LLM Judge Output Evaluators:

  1. LLM Judge Output Evaluator (llm-judge-output-semantic-similarity): General semantic similarity evaluation
  2. LLM Judge Strict JSON Similarity Output Evaluator (llm-judge-output-strict-json-similarity): Strict JSON structure comparison with LLM judgment

Use Cases:

  • Evaluate natural language outputs
  • Assess semantic similarity beyond exact matching
  • Judge output quality based on intent and meaning
  • Validate structured outputs with flexible criteria

Returns: Continuous score from 0.0 to 1.0 with justification

LLM Service Integration

LLM Judge evaluators require an LLM service to perform evaluations. By default, the evaluators use the UiPathLlmService to handle LLM requests, which automatically integrates with your configured LLM providers through the UiPath platform.

Custom LLM Service

You can supply a custom LLM service that supports the following request format:

{
    "model": "model-name",
    "messages": [
        {"role": "system", "content": "system prompt"},
        {"role": "user", "content": "evaluation prompt"}
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "evaluation_response",
            "schema": {
                # JSON schema for structured output
            }
        }
    },
    "max_tokens": 1000,  # or None
    "temperature": 0.0
}

The LLM service must:

  • Accept messages with system and user roles
  • Support structured output via response_format with JSON schema
  • Return responses conforming to the specified schema
  • Handle temperature and max_tokens parameters
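
For illustration, a custom service could be a thin adapter over any OpenAI-compatible chat-completions endpoint. The class name, method name, and response parsing below are assumptions made for this sketch; align them with the interface your evaluator configuration actually expects.

import json
from typing import Any, Optional

import httpx


class CustomLlmService:
    """Hypothetical adapter forwarding evaluator requests to an
    OpenAI-compatible chat-completions endpoint."""

    def __init__(self, base_url: str, api_key: str):
        self._client = httpx.AsyncClient(
            base_url=base_url,
            headers={"Authorization": f"Bearer {api_key}"},
        )

    async def chat_completions(
        self,
        model: str,
        messages: list[dict[str, str]],
        response_format: dict[str, Any],
        max_tokens: Optional[int] = None,
        temperature: float = 0.0,
    ) -> dict[str, Any]:
        # Forward the request in the format shown above; the endpoint must
        # honor response_format so the reply conforms to the JSON schema.
        response = await self._client.post(
            "/chat/completions",
            json={
                "model": model,
                "messages": messages,
                "response_format": response_format,
                "max_tokens": max_tokens,
                "temperature": temperature,
            },
        )
        response.raise_for_status()
        # Assumes an OpenAI-style response envelope; adjust for your endpoint
        return json.loads(response.json()["choices"][0]["message"]["content"])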

Model Selection

When configuring the evaluator, specify the model name according to your LLM service's conventions:

evaluator = LLMJudgeOutputEvaluator(
    id="llm-judge-1",
    config={
        "name": "LLMJudgeOutputEvaluator",
        "model": "gpt-4o-2024-11-20",  # Use your service's model naming
        "temperature": 0.0
    }
)

UiPathLlmService

The default UiPathLlmService supports multiple LLM providers configured through the UiPath platform. Model names follow the provider's conventions (e.g., gpt-4o-2024-11-20 for OpenAI, claude-3-5-sonnet-20241022 for Anthropic).

LLM Judge Output Evaluator

Configuration

LLMJudgeOutputEvaluatorConfig

| Parameter | Type | Default | Description |
|---|---|---|---|
| name | str | "LLMJudgeOutputEvaluator" | The evaluator's name |
| prompt | str | Default user prompt | Custom evaluation prompt |
| model | str | "" | LLM model to use for judgment |
| temperature | float | 0.0 | LLM temperature (0.0 for minimal non-determinism) |
| max_tokens | int or None | None | Maximum tokens for the LLM response |
| target_output_key | str | "*" | Specific key to extract from the output |
| default_evaluation_criteria | OutputEvaluationCriteria or None | None | Default evaluation criteria |
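
For reference, a configuration that sets every documented field might look like the sketch below. Treat it as illustrative rather than prescriptive; in particular, whether default_evaluation_criteria accepts a plain dict or an OutputEvaluationCriteria instance may depend on your SDK version.

evaluator = LLMJudgeOutputEvaluator(
    id="llm-judge-full-config",
    config={
        "name": "LLMJudgeOutputEvaluator",
        # Optional custom prompt; placeholders are described in the next section
        "prompt": "Compare {{ActualOutput}} to {{ExpectedOutput}}; score 0-100.",
        "model": "gpt-4o-2024-11-20",
        "temperature": 0.0,
        "max_tokens": 1000,              # or None for no explicit limit
        "target_output_key": "answer",   # "*" (default) keeps the whole output
        # Assumed dict form; may need to be an OutputEvaluationCriteria instance
        "default_evaluation_criteria": {"expected_output": {"answer": "Paris"}},
    }
)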

Evaluation Criteria

OutputEvaluationCriteria

| Parameter | Type | Description |
|---|---|---|
| expected_output | dict[str, Any] or str | The expected output for comparison |

Prompt Placeholders

The prompt template supports these placeholders:

  • {{ActualOutput}}: Replaced with the agent's actual output
  • {{ExpectedOutput}}: Replaced with the expected output from criteria

Examples

Basic Semantic Similarity

from uipath.eval.evaluators import LLMJudgeOutputEvaluator
from uipath.eval.models import AgentExecution

agent_execution = AgentExecution(
    agent_input={"query": "What is the capital of France?"},
    agent_output={"answer": "Paris is the capital city of France."},
    agent_trace=[]
)

evaluator = LLMJudgeOutputEvaluator(
    id="llm-judge-1",
    config={
        "name": "LLMJudgeOutputEvaluator",
        # Model name follows the default UiPath LLM service's convention; adjust it for your selected LLM service
        "model": "gpt-4o-2024-11-20",
        "temperature": 0.0,
        "target_output_key": "answer"  # Extract the "answer" field
    }
)

result = await evaluator.validate_and_evaluate_criteria(
    agent_execution=agent_execution,
    evaluation_criteria={
        "expected_output": {"answer": "The capital of France is Paris."}
    }
)

print(f"Score: {result.score}")  # e.g., 0.95
print(f"Justification: {result.details}")  # LLM's reasoning

Custom Evaluation Prompt

custom_prompt = """
Compare the actual output with the expected output.
Focus on semantic meaning and intent rather than exact wording.

Actual Output: {{ActualOutput}}
Expected Output: {{ExpectedOutput}}

Provide a score from 0-100 based on semantic similarity.
"""

agent_execution = AgentExecution(
    agent_input={},
    agent_output={"message": "The product has been successfully added to your cart."},
    agent_trace=[]
)

evaluator = LLMJudgeOutputEvaluator(
    id="llm-judge-custom",
    config={
        "name": "LLMJudgeOutputEvaluator",
        "model": "gpt-4o-2024-11-20",
        "prompt": custom_prompt,
        "temperature": 0.0,
        "target_output_key": "message"  # Extract the "message" field
    }
)

result = await evaluator.validate_and_evaluate_criteria(
    agent_execution=agent_execution,
    evaluation_criteria={
        "expected_output": {"message": "Item added to shopping cart."}
    }
)

print(f"Score: {result.score}")
print(f"Justification: {result.details}")

Evaluating Natural Language Quality

agent_execution = AgentExecution(
    agent_input={"task": "Write a professional email"},
    agent_output={"email": """Dear Customer,

Thank you for your inquiry. We have reviewed your request
and are pleased to inform you that we can accommodate your
needs. Please let us know if you have any questions.

Best regards,
Support Team"""},
    agent_trace=[]
)

evaluator = LLMJudgeOutputEvaluator(
    id="llm-judge-quality",
    config={
        "name": "LLMJudgeOutputEvaluator",
        "model": "gpt-4o-2024-11-20",
        "temperature": 0.0,
        "target_output_key": "email"  # Extract the "email" field
    }
)

result = await evaluator.validate_and_evaluate_criteria(
    agent_execution=agent_execution,
    evaluation_criteria={
        "expected_output": {"email": "A professional, courteous response addressing the customer's inquiry"}
    }
)

print(f"Score: {result.score}")
print(f"Justification: {result.details}")

LLM Judge Strict JSON Similarity Output Evaluator

This variant performs per-key matching on JSON structures with penalty-based scoring. The LLM evaluates each top-level key individually and calculates a final score based on key-level matches.

How It Works

  1. Key Inventory: Identifies all top-level keys in expected and actual outputs
  2. Per-Key Matching: For each expected key, checks if it exists in actual output
  3. Content Assessment: For matching keys, evaluates content similarity (identical/similar/different)
  4. Penalty-Based Scoring: Calculates score using these penalties per key:
  • Missing key (not in actual): 100/N penalty
  • Wrong key (exists but significantly different content): 100/N penalty
  • Similar key (exists with similar content): 50/N penalty
  • Identical key (exists with identical content): 0 penalty
  • Extra key (in actual but not expected): 10/N penalty

Where N = total number of expected keys

Final Score: 100 - total_penalty (scale 0-100)
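
To make the formula concrete, the sketch below mechanically reproduces the penalty math for a hypothetical per-key assessment. It only illustrates the scoring arithmetic, not the evaluator's internal implementation.

# Hypothetical per-key assessment for an expected output with N = 4 keys
assessments = {
    "status": "identical",   # 0 penalty
    "user_id": "similar",    # 50/N penalty
    "name": "missing",       # 100/N penalty
    "email": "wrong",        # 100/N penalty
}
extra_keys = 1               # key in actual output but not expected: 10/N penalty

PENALTIES = {"identical": 0, "similar": 50, "missing": 100, "wrong": 100}

n = len(assessments)
total_penalty = sum(PENALTIES[a] / n for a in assessments.values())
total_penalty += extra_keys * 10 / n

score_0_100 = 100 - total_penalty      # 100 - (0 + 12.5 + 25 + 25 + 2.5) = 35.0
normalized_score = score_0_100 / 100   # 0.35, matching the 0.0-1.0 result scale
print(normalized_score)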

Why "Strict"?

Unlike the standard LLMJudgeOutputEvaluator which evaluates semantic similarity holistically, this evaluator:

  • Enforces structural matching: Each expected key must be present
  • Penalizes missing keys heavily: Same as wrong content (100/N penalty)
  • Evaluates per key: Each key is assessed independently of the others
  • Deterministic scoring formula: Mechanical calculation based on key-level assessments

Configuration

LLMJudgeStrictJSONSimilarityOutputEvaluatorConfig

Same as LLMJudgeOutputEvaluatorConfig but with:

  • name: "LLMJudgeStrictJSONSimilarityOutputEvaluator"
  • prompt: Specialized prompt enforcing per-key matching and penalty calculations

Examples

Strict JSON Structure Evaluation

from uipath.eval.evaluators import LLMJudgeStrictJSONSimilarityOutputEvaluator
from uipath.eval.models import AgentExecution

evaluator = LLMJudgeStrictJSONSimilarityOutputEvaluator(
    id="llm-json-strict",
    config={
        "name": "LLMJudgeStrictJSONSimilarityOutputEvaluator",
        "model": "gpt-4o-2024-11-20",
        "temperature": 0.0
    }
)

agent_execution = AgentExecution(
    agent_input={},
    agent_output={
        "status": "success",
        "user_id": 12345,
        "name": "John Doe",
        "email": "john@example.com"
    },
    agent_trace=[]
)

result = await evaluator.validate_and_evaluate_criteria(
    agent_execution=agent_execution,
    evaluation_criteria={
        "expected_output": {
            "status": "success",
            "user_id": 12345,
            "name": "John Doe",
            "email": "john@example.com"
        }
    }
)

print(f"Score: {result.score}")
print(f"Justification: {result.details}")

Understanding LLM Judge Response

Each evaluation returns a structured result (the LLM's 0-100 judgment is normalized to 0.0-1.0):

# Result structure
{
    "score": 0.85,  # 0.0 to 1.0 (normalized from 0-100)
    "details": "The outputs convey the same meaning..."  # LLM justification
}
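
In practice you typically gate on the score and log the justification, along the lines of the sketch below (the 0.8 threshold is only an example; tune it per use case).

result = await evaluator.validate_and_evaluate_criteria(
    agent_execution=agent_execution,
    evaluation_criteria={
        "expected_output": {"answer": "The capital of France is Paris."}
    }
)

MIN_ACCEPTABLE_SCORE = 0.8  # example threshold

if result.score >= MIN_ACCEPTABLE_SCORE:
    print(f"PASS ({result.score:.2f})")
else:
    # The justification explains which aspects the judge considered different
    print(f"FAIL ({result.score:.2f}): {result.details}")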

Best Practices

  1. Use temperature 0.0 for deterministic evaluations
  2. Craft clear prompts - Be specific about evaluation criteria
  3. Include both placeholders - Always use {{ActualOutput}} and {{ExpectedOutput}}
  4. Set score thresholds - Define minimum acceptable scores (e.g., ≥ 0.8)
  5. Review justifications - Use LLM explanations to understand scores
  6. Cost awareness - LLM evaluations make API calls; factor in token costs

When to Use vs Other Evaluators

Use LLM Judge Output when:

  • Semantic meaning matters more than exact wording
  • Natural language outputs need human-like judgment
  • Context and intent are important
  • Flexible evaluation criteria needed

Use Deterministic Evaluators when:

  • Exact matches are required
  • Output format is predictable
  • Speed and cost are priorities
  • There is no ambiguity in correctness

Configuration Tips

Temperature Settings

  • 0.0: Deterministic, consistent results (recommended)
  • 0.1: Slight variation for nuanced judgment
  • >0.3: Not recommended (too inconsistent)

Error Handling

The evaluator will raise UiPathEvaluationError if:

  • LLM service is unavailable
  • Prompt doesn't contain required placeholders
  • LLM response cannot be parsed
  • Model returns invalid JSON
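
A defensive caller can wrap the evaluation as in the sketch below. The import path for UiPathEvaluationError is an assumption and may differ in your SDK version.

# Assumed import location; adjust to where UiPathEvaluationError lives in your SDK
from uipath.eval.models import UiPathEvaluationError

try:
    result = await evaluator.validate_and_evaluate_criteria(
        agent_execution=agent_execution,
        evaluation_criteria={"expected_output": {"answer": "Paris"}}
    )
except UiPathEvaluationError as exc:
    # Raised when the LLM service is unavailable, the prompt is missing
    # placeholders, or the model's response cannot be parsed as valid JSON
    print(f"Evaluation failed: {exc}")
else:
    print(f"Score: {result.score}")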