LLM Judge Trajectory Evaluators¶
LLM Judge Trajectory Evaluators use Language Models to assess the quality of agent execution trajectories: the sequence of decisions and actions an agent takes. They are a good option for validating that agents follow expected execution behavior when the standard trajectory evaluators cannot weight specific mistakes appropriately or are too difficult to configure. For most use cases, however, the recommended practice is to collect comprehensive trajectory annotations and rely on deterministic trajectory evaluators.
Overview¶
We provide two variants of LLM Judge Trajectory Evaluators:
- LLM Judge Trajectory Evaluator (`llm-judge-trajectory-similarity`): General trajectory evaluation
- LLM Judge Trajectory Simulation Evaluator (`llm-judge-trajectory-simulation`): Specialized for tool simulation scenarios
Use Cases:
- Validate agent decision-making processes
- Ensure agents follow expected execution paths
- Evaluate tool usage patterns and sequencing
- Assess agent behavior in complex scenarios
- Validate tool simulation accuracy (where tool responses are mocked)
Returns: Continuous score from 0.0 to 1.0 with justification
LLM Service Integration¶
LLM Judge evaluators require an LLM service to perform evaluations. By default, the evaluators use the UiPathLlmService to handle LLM requests, which automatically integrates with your configured LLM providers through the UiPath platform.
Custom LLM Service¶
You can supply a custom LLM service that supports the following request format:
{
"model": "model-name",
"messages": [
{"role": "system", "content": "system prompt"},
{"role": "user", "content": "evaluation prompt"}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "evaluation_response",
"schema": {
# JSON schema for structured output
}
}
},
"max_tokens": 1000, # or None
"temperature": 0.0
}
The LLM service must:
- Accept messages with `system` and `user` roles
- Support structured output via `response_format` with a JSON schema
- Return responses conforming to the specified schema
- Handle the `temperature` and `max_tokens` parameters
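As a rough illustration, a request payload of this shape can be assembled with a small helper (a minimal sketch; the build_evaluation_request name and the response_schema argument are hypothetical, not part of the SDK):

def build_evaluation_request(
    system_prompt: str,
    evaluation_prompt: str,
    response_schema: dict,
    model: str = "gpt-4o-2024-11-20",
) -> dict:
    # Assemble a request payload matching the format described above.
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": evaluation_prompt},
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "evaluation_response",
                "schema": response_schema,
            },
        },
        "max_tokens": 1000,  # or None
        "temperature": 0.0,
    }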
Model Selection¶
When configuring the evaluator, specify the model name according to your LLM service's conventions:
evaluator = LLMJudgeTrajectoryEvaluator(
id="trajectory-judge-1",
config={
"name": "LLMJudgeTrajectoryEvaluator",
"model": "gpt-4o-2024-11-20", # Use your service's model naming
"temperature": 0.0
}
)
UiPathLlmService
The default UiPathLlmService supports multiple LLM providers configured through the UiPath platform. Model names follow the provider's conventions (e.g., gpt-4o-2024-11-20 for OpenAI, claude-3-5-sonnet-20241022 for Anthropic).
LLM Judge Trajectory Evaluator¶
Configuration¶
LLMJudgeTrajectoryEvaluatorConfig¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| name | str | "LLMJudgeTrajectoryEvaluator" | The evaluator's name |
| prompt | str | Default trajectory prompt | Custom evaluation prompt |
| model | str | "" | LLM model to use for judgment |
| temperature | float | 0.0 | LLM temperature (0.0 for deterministic) |
| max_tokens | int or None | None | Maximum tokens for LLM response |
| default_evaluation_criteria | TrajectoryEvaluationCriteria or None | None | Default criteria |
Evaluation Criteria¶
TrajectoryEvaluationCriteria¶
| Parameter | Type | Description |
|---|---|---|
| expected_agent_behavior | str | Description of the expected agent behavior |
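For example, a default criterion can be provided when constructing the evaluator so individual evaluations do not need to repeat it. A minimal sketch, assuming default_evaluation_criteria accepts the same dictionary shape used for evaluation_criteria in the examples below:

from uipath.eval.evaluators import LLMJudgeTrajectoryEvaluator

evaluator = LLMJudgeTrajectoryEvaluator(
    id="trajectory-judge-default",
    config={
        "name": "LLMJudgeTrajectoryEvaluator",
        "model": "gpt-4o-2024-11-20",
        "temperature": 0.0,
        # Assumption: a plain dict is accepted here and coerced to
        # TrajectoryEvaluationCriteria; adjust if your SDK version expects
        # the model instance instead.
        "default_evaluation_criteria": {
            "expected_agent_behavior": "The agent validates the user before completing the task."
        },
    },
)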
Prompt Placeholders¶
The prompt template supports these placeholders:
- {{AgentRunHistory}}: The agent's execution trace/trajectory
- {{ExpectedAgentBehavior}}: The expected behavior description
- {{UserOrSyntheticInput}}: The input provided to the agent
- {{SimulationInstructions}}: Tool simulation instructions specifying how tools should respond (simulation variant only)
Examples¶
Basic Trajectory Evaluation¶
from uipath.eval.evaluators import LLMJudgeTrajectoryEvaluator
from uipath.eval.models import AgentExecution
agent_execution = AgentExecution(
agent_input={"user_query": "Book a flight to Paris"},
agent_output={"booking_id": "FL123", "status": "confirmed"},
agent_trace=[
# Trace contains spans showing the agent's execution path
# Each span represents a step in the agent's decision-making
]
)
evaluator = LLMJudgeTrajectoryEvaluator(
id="trajectory-judge-1",
config={
"name": "LLMJudgeTrajectoryEvaluator",
# Model name follows the UiPathLlmChatService convention; adjust it for your selected LLM service
"model": "gpt-4o-2024-11-20",
"temperature": 0.0
}
)
result = await evaluator.validate_and_evaluate_criteria(
agent_execution=agent_execution,
evaluation_criteria={
"expected_agent_behavior": """
The agent should:
1. Search for available flights to Paris
2. Present options to the user
3. Process the booking
4. Confirm the reservation
"""
}
)
print(f"Score: {result.score}")
print(f"Justification: {result.details}")
Validating Tool Usage Sequence¶
agent_execution = AgentExecution(
agent_input={"task": "Update user profile and send notification"},
agent_output={"status": "completed"},
agent_trace=[
# Spans showing: validate_user -> update_profile -> send_notification
]
)
evaluator = LLMJudgeTrajectoryEvaluator(
id="trajectory-tools",
config={
"name": "LLMJudgeTrajectoryEvaluator",
"model": "gpt-4o-2024-11-20",
"temperature": 0.0
}
)
result = await evaluator.validate_and_evaluate_criteria(
agent_execution=agent_execution,
evaluation_criteria={
"expected_agent_behavior": """
The agent must:
1. First validate the user exists
2. Update the profile in the database
3. Send a confirmation notification
This sequence must be followed to ensure data integrity.
"""
}
)
print(f"Score: {result.score}")
print(f"Justification: {result.details}")
Custom Evaluation Prompt¶
custom_prompt = """
Analyze the agent's execution path and compare it with the expected behavior.
Agent Run History:
{{AgentRunHistory}}
Expected Agent Behavior:
{{ExpectedAgentBehavior}}
User Input:
{{UserOrSyntheticInput}}
Evaluate:
1. Did the agent follow the expected sequence?
2. Were all necessary steps completed?
3. Was the decision-making logical and efficient?
Provide a score from 0-100.
"""
evaluator = LLMJudgeTrajectoryEvaluator(
id="trajectory-custom",
config={
"name": "LLMJudgeTrajectoryEvaluator",
"model": "gpt-4o-2024-11-20",
"prompt": custom_prompt,
"temperature": 0.0
}
)
# ... use evaluator
LLM Judge Trajectory Simulation Evaluator¶
This variant is specialized for evaluating agent behavior in tool simulation scenarios, where tool responses are mocked/simulated during agent execution.
What is Tool Simulation?¶
In tool simulation:
- Simulation Engine: Mocks tool responses based on simulation instructions
- Agent Unawareness: The agent doesn't know tool responses are simulated
- Controlled Testing: Allows testing agent behavior with predictable tool responses
- Evaluation Focus: Assesses whether the agent behaves correctly given the simulated tool responses
The evaluator checks whether:
- The simulation was successful (tools responded as instructed)
- The agent behaved according to expectations given the simulated responses
- The agent's decision-making aligns with the expected behavior in the simulated scenario
Configuration¶
LLMJudgeTrajectorySimulationEvaluatorConfig¶
Same as LLMJudgeTrajectoryEvaluatorConfig, but with:
- name: "LLMJudgeTrajectorySimulationEvaluator"
- prompt: A specialized prompt for tool simulation evaluation that considers:
  - The simulation instructions (how tools should respond)
  - Whether the simulated tool responses matched the instructions
  - Agent behavior given the simulated responses
Examples¶
Tool Simulation Trajectory Evaluation¶
from uipath.eval.evaluators import LLMJudgeTrajectorySimulationEvaluator
agent_execution = AgentExecution(
agent_input={"query": "Book a flight to Paris for tomorrow"},
agent_output={"booking_id": "FL123", "status": "confirmed"},
agent_trace=[
# Execution spans showing tool calls and their simulated responses
],
simulation_instructions="""
Simulate the following tool responses:
- search_flights tool: Return 3 available flights with prices
- book_flight tool: Return booking confirmation with ID "FL123"
- send_confirmation_email tool: Return success status
Mock the tools to respond as if it's a Tuesday in March with normal availability.
"""
)
evaluator = LLMJudgeTrajectorySimulationEvaluator(
id="sim-trajectory-1",
config={
"name": "LLMJudgeTrajectorySimulationEvaluator",
"model": "gpt-4o-2024-11-20",
"temperature": 0.0
}
)
result = await evaluator.validate_and_evaluate_criteria(
agent_execution=agent_execution,
evaluation_criteria={
"expected_agent_behavior": """
The agent should:
1. Call search_flights to find available options
2. Present flight options to the user (simulated in conversation)
3. Call book_flight with appropriate parameters
4. Confirm the booking with the user
5. Call send_confirmation_email to notify the user
"""
}
)
print(f"Score: {result.score}")
print(f"Justification: {result.details}")
Understanding Agent Traces¶
The agent_trace contains execution spans that show:
- Tool calls made by the agent
- LLM reasoning steps
- Decision points
- Action sequences
- Intermediate results
Example trace structure:
agent_trace = [
{
"name": "search_flights",
"type": "tool",
"inputs": {"destination": "Paris"},
"output": {"flights": [...]}
},
{
"name": "llm_reasoning",
"type": "llm",
"content": "User wants cheapest option..."
},
{
"name": "book_flight",
"type": "tool",
"inputs": {"flight_id": "FL123"},
"output": {"status": "confirmed"}
}
]
Best Practices¶
- Write clear behavior descriptions - Be specific about expected sequences and decision logic
- Use temperature 0.0 for consistent evaluations
- Include context - Provide enough detail in expected behavior
- Consider partial credit - LLM can give partial scores for mostly correct trajectories
- Review justifications - Understand why trajectories scored high or low
- Combine with tool evaluators - Use Tool Call Evaluators for strict ordering requirements (see the sketch below)
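For instance, a judge score can be paired with a hand-rolled ordering check standing in for a Tool Call Order Evaluator. A minimal sketch, assuming the dict-based trace structure shown above under Understanding Agent Traces and an arbitrary 0.8 score threshold:

REQUIRED_ORDER = ["validate_user", "update_profile", "send_notification"]

def tool_order_ok(agent_trace: list) -> bool:
    # Deterministic check: tool spans must appear in exactly this order.
    tool_names = [span["name"] for span in agent_trace if span.get("type") == "tool"]
    return tool_names == REQUIRED_ORDER

# Require both the strict ordering check and a sufficiently high judge score.
passed = tool_order_ok(agent_execution.agent_trace) and result.score >= 0.8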
When to Use vs Other Evaluators¶
Use LLM Judge Trajectory Evaluators when:
- The decision-making process matters more than just the output
- Agent behavior patterns need validation
- The tool usage sequence is complex but somewhat flexible
- Human-like judgment of execution quality is needed
Use Tool Call Evaluators when:
- Strict tool call sequences must be enforced
- Deterministic validation is sufficient
- Exact argument values must match
- Performance and cost are priorities
Configuration Tips¶
Temperature Settings¶
- 0.0: Deterministic, consistent results (recommended)
- 0.1: Slight variation for nuanced judgment
- >0.3: Not recommended (too inconsistent)
Evaluation Criteria Guidelines¶
When writing expected_agent_behavior, include:
- Sequential steps: Numbered or ordered list of expected actions
- Decision points: When the agent should make choices
- Conditional logic: "If X, then Y" scenarios
- Success criteria: What constitutes good behavior
- Error handling: How agent should handle failures
Good Example¶
evaluation_criteria = {
"expected_agent_behavior": """
The agent should follow this sequence:
1. Validate user authentication status
- If not authenticated, request login
- If authenticated, proceed to step 2
2. Fetch user's order history
- Use the get_orders tool with user_id
3. Identify the problematic order
- Search for orders with "delayed" status
4. Provide explanation to user
- Include order details and delay reason
5. Offer resolution
- Present refund or expedited shipping options
The agent should maintain a helpful tone throughout
and adapt responses based on user reactions.
"""
}
Poor Example (Too Vague)¶
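In contrast, criteria like the following give the judge too little to work with:

evaluation_criteria = {
    "expected_agent_behavior": "The agent should handle the request correctly."
}

Because nothing defines what "correctly" means, scores become inconsistent and the justifications uninformative.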
Error Handling¶
The evaluator will raise UiPathEvaluationError if:
- LLM service is unavailable
- Prompt doesn't contain required placeholders
- Agent trace cannot be converted to readable format
- LLM response cannot be parsed
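A calling pattern that surfaces these failures might look like this (a sketch; the import location of UiPathEvaluationError is an assumption and may differ in your SDK version):

# Assumption: adjust the import to wherever the exception lives in your SDK version.
from uipath.eval.models import UiPathEvaluationError

try:
    result = await evaluator.validate_and_evaluate_criteria(
        agent_execution=agent_execution,
        evaluation_criteria={"expected_agent_behavior": "..."},
    )
except UiPathEvaluationError as exc:
    # Log and skip this test case rather than failing the whole run.
    print(f"Trajectory evaluation failed: {exc}")
else:
    print(f"Score: {result.score}")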
Performance Considerations¶
- Token usage: Trajectories can be long, increasing token costs
- Evaluation time: LLM calls take longer than deterministic evaluators
- Caching: Consider caching evaluations for repeated test runs
- Batch processing: Evaluate multiple trajectories in parallel when possible, as sketched below
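Because the evaluation API is async, several executions can be scored concurrently to amortize LLM latency. A minimal sketch (the agent_executions list and the shared criteria are placeholders):

import asyncio

async def evaluate_trajectories(evaluator, agent_executions, criteria):
    # One evaluation per trajectory, run concurrently.
    tasks = [
        evaluator.validate_and_evaluate_criteria(
            agent_execution=execution,
            evaluation_criteria=criteria,
        )
        for execution in agent_executions
    ]
    return await asyncio.gather(*tasks)

results = await evaluate_trajectories(
    evaluator,
    agent_executions,  # e.g. one AgentExecution per test case
    {"expected_agent_behavior": "..."},
)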
Related Evaluators¶
- LLM Judge Output Evaluator: For evaluating outputs instead of processes
- Tool Call Order Evaluator: For strict deterministic sequence validation
- Tool Call Count Evaluator: For validating tool usage frequencies
- Tool Call Args Evaluator: For validating tool arguments