Tool Call Order Evaluator¶
The Tool Call Order Evaluator validates that an agent calls tools in the expected sequence. This is crucial for workflows where the order of operations matters for correctness, data integrity, or business logic.
Overview¶
Evaluator ID: tool-call-order
Use Cases:
- Validate critical operation sequences (e.g., authenticate before access)
- Ensure workflow steps follow business logic
- Test that agents follow prescribed procedures
- Verify proper data pipeline execution order
Returns: Continuous score from 0.0 to 1.0 based on sequence match
Configuration¶
ToolCallOrderEvaluatorConfig¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | `"ToolCallOrderEvaluator"` | The evaluator's name |
| `strict` | `bool` | `False` | If `True`, requires an exact sequence match; if `False`, uses LCS (Longest Common Subsequence) matching |
| `default_evaluation_criteria` | `ToolCallOrderEvaluationCriteria` or `None` | `None` | Default criteria, used when `evaluation_criteria` is not provided |
Strict vs Non-Strict Mode¶
- Strict mode (`strict=True`): requires an exact sequence match; the score is either 1.0 or 0.0
- Non-strict mode (`strict=False`): uses the LCS algorithm, allowing partial credit for matching subsequences
Evaluation Criteria¶
ToolCallOrderEvaluationCriteria¶
| Parameter | Type | Description |
|---|---|---|
| `tool_calls_order` | `list[str]` | Ordered list of expected tool names |
Scoring Algorithm¶
Non-Strict Mode (LCS-based)¶
The score is calculated as:

`score = len(LCS(expected, actual)) / len(expected_tool_calls_order)`

where LCS is the Longest Common Subsequence between the expected and actual tool call sequences.
Example:
- Expected: ["A", "B", "C", "D"]
- Actual: ["A", "X", "B", "D"]
- LCS: ["A", "B", "D"] (length 3)
- Score: 3/4 = 0.75
Strict Mode¶
- Returns `1.0` if the sequences match exactly
- Returns `0.0` if there is any difference
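To make the two modes concrete, here is a minimal, illustrative sketch of the scoring logic. It is not the library's implementation; in particular, normalizing by the length of the expected sequence is inferred from the worked example above.

```python
def lcs_length(expected: list[str], actual: list[str]) -> int:
    """Classic O(n*m) dynamic-programming LCS length."""
    dp = [[0] * (len(actual) + 1) for _ in range(len(expected) + 1)]
    for i, e in enumerate(expected, start=1):
        for j, a in enumerate(actual, start=1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if e == a else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def order_score(expected: list[str], actual: list[str], strict: bool) -> float:
    if strict:
        # Strict mode: all-or-nothing exact match.
        return 1.0 if expected == actual else 0.0
    # Non-strict mode: partial credit for the longest common subsequence.
    return lcs_length(expected, actual) / len(expected) if expected else 0.0


print(order_score(["A", "B", "C", "D"], ["A", "X", "B", "D"], strict=False))  # 0.75
print(order_score(["A", "B", "C", "D"], ["A", "X", "B", "D"], strict=True))   # 0.0
```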
Examples¶
Basic Tool Call Order Validation¶
from opentelemetry.sdk.trace import ReadableSpan
from uipath.eval.evaluators import ToolCallOrderEvaluator
from uipath.eval.models import AgentExecution
# Create mock spans representing tool calls in execution trace
mock_spans = [
ReadableSpan(
name="validate_user",
start_time=0,
end_time=1,
attributes={"tool.name": "validate_user"},
),
ReadableSpan(
name="check_inventory",
start_time=1,
end_time=2,
attributes={"tool.name": "check_inventory"},
),
ReadableSpan(
name="create_order",
start_time=2,
end_time=3,
attributes={"tool.name": "create_order"},
),
]
agent_execution = AgentExecution(
agent_input={"task": "Process user order"},
agent_output={"status": "completed"},
agent_trace=mock_spans,
)
evaluator = ToolCallOrderEvaluator(
id="order-check-1",
config={
"name": "ToolCallOrderEvaluator",
"strict": False
}
)
result = await evaluator.validate_and_evaluate_criteria(
agent_execution=agent_execution,
evaluation_criteria={
"tool_calls_order": [
"validate_user",
"check_inventory",
"create_order"
]
}
)
print(f"Score: {result.score}") # 1.0 (perfect match)
print(f"Details: {result.details}")
# Details includes: actual order, expected order, and LCS
Strict Order Validation¶
from opentelemetry.sdk.trace import ReadableSpan
# Critical security sequence
mock_spans = [
ReadableSpan(
name="authenticate_user",
start_time=0,
end_time=1,
attributes={"tool.name": "authenticate_user"},
),
ReadableSpan(
name="verify_permissions",
start_time=1,
end_time=2,
attributes={"tool.name": "verify_permissions"},
),
ReadableSpan(
name="access_resource",
start_time=2,
end_time=3,
attributes={"tool.name": "access_resource"},
),
]
agent_execution = AgentExecution(
agent_input={"task": "Access secured resource"},
agent_output={"status": "completed"},
agent_trace=mock_spans,
)
evaluator = ToolCallOrderEvaluator(
id="order-strict",
config={
"name": "ToolCallOrderEvaluator",
"strict": True
}
)
result = await evaluator.validate_and_evaluate_criteria(
agent_execution=agent_execution,
evaluation_criteria={
"tool_calls_order": [
"authenticate_user",
"verify_permissions",
"access_resource"
]
}
)
# Score is either 1.0 (perfect match) or 0.0 (any mismatch)
print(f"Score: {result.score}") # 1.0
Partial Credit with LCS¶
from opentelemetry.sdk.trace import ReadableSpan
# Actual execution (missing "sort")
mock_spans = [
ReadableSpan(
name="search",
start_time=0,
end_time=1,
attributes={"tool.name": "search"},
),
ReadableSpan(
name="filter",
start_time=1,
end_time=2,
attributes={"tool.name": "filter"},
),
ReadableSpan(
name="display",
start_time=2,
end_time=3,
attributes={"tool.name": "display"},
),
]
agent_execution = AgentExecution(
agent_input={"task": "Search and display"},
agent_output={"status": "completed"},
agent_trace=mock_spans,
)
evaluator = ToolCallOrderEvaluator(
id="order-lcs",
config={
"name": "ToolCallOrderEvaluator",
"strict": False # Use LCS for partial credit
}
)
# Expected sequence
expected = ["search", "filter", "sort", "display"]
result = await evaluator.validate_and_evaluate_criteria(
agent_execution=agent_execution,
evaluation_criteria={
"tool_calls_order": expected
}
)
# Score: 3/4 = 0.75 (3 tools in correct order out of 4 expected)
print(f"Score: {result.score}") # 0.75
print(f"LCS: {result.details.lcs}") # ["search", "filter", "display"]
Database Transaction Sequence¶
from opentelemetry.sdk.trace import ReadableSpan
mock_spans = [
ReadableSpan(
name="begin_transaction",
start_time=0,
end_time=1,
attributes={"tool.name": "begin_transaction"},
),
ReadableSpan(
name="validate_data",
start_time=1,
end_time=2,
attributes={"tool.name": "validate_data"},
),
ReadableSpan(
name="update_records",
start_time=2,
end_time=3,
attributes={"tool.name": "update_records"},
),
ReadableSpan(
name="commit_transaction",
start_time=3,
end_time=4,
attributes={"tool.name": "commit_transaction"},
),
]
agent_execution = AgentExecution(
agent_input={"task": "Update database"},
agent_output={"status": "completed"},
agent_trace=mock_spans,
)
evaluator = ToolCallOrderEvaluator(
id="db-transaction",
config={
"name": "ToolCallOrderEvaluator",
"strict": True # Transactions must be exact
}
)
result = await evaluator.validate_and_evaluate_criteria(
agent_execution=agent_execution,
evaluation_criteria={
"tool_calls_order": [
"begin_transaction",
"validate_data",
"update_records",
"commit_transaction"
]
}
)
# Must match exactly for data integrity
print(f"Score: {result.score}") # 1.0
API Integration Workflow¶
from opentelemetry.sdk.trace import ReadableSpan
mock_spans = [
ReadableSpan(
name="get_api_token",
start_time=0,
end_time=1,
attributes={"tool.name": "get_api_token"},
),
ReadableSpan(
name="fetch_user_data",
start_time=1,
end_time=2,
attributes={"tool.name": "fetch_user_data"},
),
ReadableSpan(
name="enrich_data",
start_time=2,
end_time=3,
attributes={"tool.name": "enrich_data"},
),
ReadableSpan(
name="post_to_webhook",
start_time=3,
end_time=4,
attributes={"tool.name": "post_to_webhook"},
),
ReadableSpan(
name="log_result",
start_time=4,
end_time=5,
attributes={"tool.name": "log_result"},
),
]
agent_execution = AgentExecution(
agent_input={"task": "API integration"},
agent_output={"status": "completed"},
agent_trace=mock_spans,
)
evaluator = ToolCallOrderEvaluator(
id="api-workflow",
config={
"name": "ToolCallOrderEvaluator",
"strict": False
}
)
result = await evaluator.validate_and_evaluate_criteria(
agent_execution=agent_execution,
evaluation_criteria={
"tool_calls_order": [
"get_api_token",
"fetch_user_data",
"enrich_data",
"post_to_webhook",
"log_result"
]
}
)
print(f"Score: {result.score}") # 1.0
Using Default Criteria¶
from opentelemetry.sdk.trace import ReadableSpan
mock_spans = [
ReadableSpan(
name="init",
start_time=0,
end_time=1,
attributes={"tool.name": "init"},
),
ReadableSpan(
name="process",
start_time=1,
end_time=2,
attributes={"tool.name": "process"},
),
ReadableSpan(
name="cleanup",
start_time=2,
end_time=3,
attributes={"tool.name": "cleanup"},
),
]
agent_execution = AgentExecution(
agent_input={"task": "Standard workflow"},
agent_output={"status": "completed"},
agent_trace=mock_spans,
)
evaluator = ToolCallOrderEvaluator(
id="order-default",
config={
"name": "ToolCallOrderEvaluator",
"strict": False,
"default_evaluation_criteria": {
"tool_calls_order": ["init", "process", "cleanup"]
}
}
)
# Use default criteria
result = await evaluator.validate_and_evaluate_criteria(
agent_execution=agent_execution,
evaluation_criteria=None # Uses default
)
print(f"Score: {result.score}") # 1.0
Justification Details¶
The evaluator returns a ToolCallOrderEvaluatorJustification with:
{
"actual_tool_calls_order": ["tool1", "tool2", "tool3"], # What the agent called
"expected_tool_calls_order": ["tool1", "tool2", "tool3", "tool4"], # What was expected
"lcs": ["tool1", "tool2", "tool3"] # Longest common subsequence (non-strict mode)
}
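A short usage sketch, assuming `result` comes from one of the evaluations above and that these fields are exposed as attributes (as in the `result.details.lcs` access shown earlier):

```python
# Assumes `result` was produced by one of the evaluations above.
details = result.details
print(details.actual_tool_calls_order)    # tools the agent actually called, in order
print(details.expected_tool_calls_order)  # tools the criteria expected
print(details.lcs)                        # longest common subsequence (non-strict mode)
```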
Best Practices¶
- Use strict mode for critical sequences - Authentication, transactions, data integrity workflows
- Use non-strict mode for flexible workflows - When some steps might be optional or reorderable
- Combine with other evaluators - Use with Tool Call Count and Tool Call Args
- Be specific with tool names - Ensure tool names in criteria match exactly (case-sensitive)
- Consider repeated calls - If a tool should be called multiple times, list it multiple times in the criteria (see the sketch after this list)
- Test partial workflows - Use non-strict mode during development, strict in production
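For the repeated-calls practice above, list the tool once per expected invocation. The sketch below uses hypothetical tool names (`fetch_page`, `merge_results`) and mirrors the API calls from the examples above:

```python
from opentelemetry.sdk.trace import ReadableSpan
from uipath.eval.evaluators import ToolCallOrderEvaluator
from uipath.eval.models import AgentExecution

# Hypothetical workflow: the agent fetches two pages, then merges them.
called_tools = ["fetch_page", "fetch_page", "merge_results"]

mock_spans = [
    ReadableSpan(
        name=tool,
        start_time=i,
        end_time=i + 1,
        attributes={"tool.name": tool},
    )
    for i, tool in enumerate(called_tools)
]

agent_execution = AgentExecution(
    agent_input={"task": "Paginated fetch"},
    agent_output={"status": "completed"},
    agent_trace=mock_spans,
)

evaluator = ToolCallOrderEvaluator(
    id="order-repeated",
    config={"name": "ToolCallOrderEvaluator", "strict": False},
)

result = await evaluator.validate_and_evaluate_criteria(
    agent_execution=agent_execution,
    evaluation_criteria={
        # The repeated tool appears once per expected call.
        "tool_calls_order": ["fetch_page", "fetch_page", "merge_results"]
    },
)
print(f"Score: {result.score}")  # Expected: 1.0, since both calls are present in order
```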
When to Use vs Other Evaluators¶
Use Tool Call Order when:
- Sequence of operations is important
- Workflow has defined steps
- Business logic requires specific order
- Data dependencies exist between steps
Use Tool Call Count when:
- Order doesn't matter, only frequency
- Testing tool usage patterns
- Ensuring tools are called correct number of times
Use LLM Judge Trajectory when:
- More flexible, semantic evaluation needed
- Decision-making process is complex
- Exact sequences are less important than overall behavior
Limitations¶
- Case-sensitive matching - Tool names must match exactly
- No argument validation - Only checks tool names, not arguments (use Tool Call Args Evaluator)
- Position-based - Doesn't consider parallel execution
- Multiple calls - Each call is treated as a separate position in the sequence
Error Handling¶
The evaluator handles these scenarios gracefully:
- Empty actual calls: Score 0.0 (see the sketch after this list)
- Empty expected calls: Error (invalid criteria)
- Missing tools in trace: Extracts only tool calls found
- Extra tools called: In non-strict mode, uses LCS; in strict mode, score 0.0
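As an illustration of the first case, an execution whose trace contains no tool-call spans scores 0.0. This is a minimal sketch; the task text, evaluator id, and criteria are hypothetical, and it assumes an empty `agent_trace` list is accepted:

```python
from uipath.eval.evaluators import ToolCallOrderEvaluator
from uipath.eval.models import AgentExecution

# No tool calls were recorded for this execution.
agent_execution = AgentExecution(
    agent_input={"task": "Do nothing"},
    agent_output={"status": "completed"},
    agent_trace=[],  # empty trace: no tool-call spans
)

evaluator = ToolCallOrderEvaluator(
    id="order-empty",
    config={"name": "ToolCallOrderEvaluator", "strict": False},
)

result = await evaluator.validate_and_evaluate_criteria(
    agent_execution=agent_execution,
    evaluation_criteria={"tool_calls_order": ["init", "process"]},
)
print(f"Score: {result.score}")  # 0.0 (no actual tool calls to match)
```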
Performance Considerations¶
- Fast evaluation: O(n*m) for the LCS algorithm
- No LLM calls: Deterministic and quick
- Low memory: Efficient for large sequences
Related Evaluators¶
- Tool Call Count Evaluator: Validates tool usage frequencies
- Tool Call Args Evaluator: Validates tool arguments
- Tool Call Output Evaluator: Validates tool outputs
- LLM Judge Trajectory Evaluator: For semantic trajectory evaluation