# Tool Call Args Evaluator

The Tool Call Args Evaluator validates that an agent calls tools with the correct arguments. This is essential for ensuring proper data flow and correct API usage, and for verifying that agents pass the right information to functions.
## Overview

**Evaluator ID:** `tool-call-args`

**Use Cases:**

- Validate tool parameters are correct
- Ensure proper data flow between tools
- Test argument transformation logic
- Verify API parameter usage
- Check data type correctness

**Returns:** Continuous score from 0.0 to 1.0 based on argument accuracy
## Configuration

### ToolCallArgsEvaluatorConfig

| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | `"ToolCallArgsEvaluator"` | The evaluator's name |
| `strict` | `bool` | `False` | Controls scoring when multiple tools are expected: `True` = all-or-nothing (1.0 or 0.0), `False` = proportional (ratio of matched tools) |
| `subset` | `bool` | `False` | If `True`, the expected args may be a subset of the actual args (extra actual args are allowed); if `False`, all expected args must match exactly |
| `default_evaluation_criteria` | `ToolCallArgsEvaluationCriteria` or `None` | `None` | Default evaluation criteria |
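For reference, the configuration is passed as a dictionary when constructing the evaluator, the same pattern used throughout the examples below:

```python
from uipath.eval.evaluators import ToolCallArgsEvaluator

# Proportional scoring with exact argument matching (the defaults shown above).
evaluator = ToolCallArgsEvaluator(
    id="args-check",
    config={
        "name": "ToolCallArgsEvaluator",
        "strict": False,   # proportional scoring across expected tool calls
        "subset": False,   # every expected argument must match
    },
)
```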
### Evaluation Modes

| Mode | Description |
|---|---|
| `strict=False, subset=False` | Proportional scoring (e.g., 2/3 tools matched ≈ 0.67), exact arg matching required |
| `strict=True, subset=False` | All-or-nothing scoring (1.0 or 0.0), exact arg matching required |
| `strict=False, subset=True` | Proportional scoring, expected args must be a subset of the actual args (extra args allowed) |
| `strict=True, subset=True` | All-or-nothing scoring, expected args must be a subset of the actual args (extra args allowed) |
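To see how the flags interact, the hedged sketch below scores the same trace under proportional (`strict=False`) and all-or-nothing (`strict=True`) modes. The tool names (`lookup_user`, `send_email`) are hypothetical; only one of the two expected calls appears in the trace, so proportional scoring should land around 0.5 while all-or-nothing scoring drops to 0.0.

```python
from opentelemetry.sdk.trace import ReadableSpan
from uipath.eval.evaluators import ToolCallArgsEvaluator
from uipath.eval.models import AgentExecution

# The trace contains only one of the two expected tool calls.
mock_spans = [
    ReadableSpan(
        name="lookup_user",
        start_time=0,
        end_time=1,
        attributes={
            "tool.name": "lookup_user",
            "input.value": "{'user_id': 123}",
        },
    ),
]
agent_execution = AgentExecution(
    agent_input={"task": "notify user"},
    agent_output={"status": "done"},
    agent_trace=mock_spans,
)

criteria = {
    "tool_calls": [
        {"name": "lookup_user", "args": {"user_id": 123}},           # present in the trace
        {"name": "send_email", "args": {"to": "user@example.com"}},  # missing from the trace
    ]
}

proportional = ToolCallArgsEvaluator(
    id="mode-proportional",
    config={"name": "ToolCallArgsEvaluator", "strict": False, "subset": False},
)
all_or_nothing = ToolCallArgsEvaluator(
    id="mode-strict",
    config={"name": "ToolCallArgsEvaluator", "strict": True, "subset": False},
)

result = await proportional.validate_and_evaluate_criteria(
    agent_execution=agent_execution,
    evaluation_criteria=criteria,
)
print(f"strict=False: {result.score}")  # roughly 0.5 (1 of 2 expected calls matched)

result = await all_or_nothing.validate_and_evaluate_criteria(
    agent_execution=agent_execution,
    evaluation_criteria=criteria,
)
print(f"strict=True: {result.score}")   # 0.0 (all-or-nothing)
```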
## Evaluation Criteria

### ToolCallArgsEvaluationCriteria

| Parameter | Type | Description |
|---|---|---|
| `tool_calls` | `list[ToolCall]` | List of expected tool calls with their arguments |
### ToolCall Structure
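Each expected tool call pairs a tool `name` with the `args` dictionary that tool is expected to receive, matching the `evaluation_criteria` payloads used in the examples below. A minimal sketch of that shape:

```python
# Shape of a single expected tool call, as passed inside evaluation_criteria.
# Field names follow the examples on this page; nested dicts and lists are
# compared as part of the argument match.
expected_tool_call = {
    "name": "update_user",                       # tool name to match in the trace
    "args": {                                    # arguments the tool should receive
        "user_id": 123,
        "fields": {"email": "user@example.com"},
        "notify": True,
    },
}

evaluation_criteria = {"tool_calls": [expected_tool_call]}
```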
## Scoring Algorithm

For each expected tool call:

1. Match it to an actual tool call by name
2. Compare the arguments according to the configured mode (exact match or JSON similarity)
3. Calculate a score based on match quality

**Final score:** Average across all expected tool calls
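The following is an illustrative, simplified sketch of this procedure, not the library's implementation; the `similarity` placeholder stands in for the JSON-similarity comparison used in non-strict mode:

```python
from typing import Any


def similarity(expected: Any, actual: Any) -> float:
    """Placeholder for the JSON-similarity comparison used in non-strict mode."""
    return 1.0 if expected == actual else 0.0  # the real comparison gives partial credit


def score_tool_calls(
    expected_calls: list[dict],
    actual_calls: list[dict],
    strict: bool = False,
    subset: bool = False,
) -> float:
    """Simplified sketch of the scoring procedure described above."""
    per_call_scores = []
    for expected in expected_calls:
        # 1. Match by name (first actual call with the same tool name).
        match = next((a for a in actual_calls if a["name"] == expected["name"]), None)
        if match is None:
            per_call_scores.append(0.0)  # a missing tool scores 0.0
            continue

        expected_args = expected["args"]
        actual_args = match["args"]
        if subset:
            # Only the expected keys are compared; extra actual args are allowed.
            actual_args = {k: v for k, v in actual_args.items() if k in expected_args}

        # 2. Compare arguments: exact equality in strict mode, similarity otherwise.
        if strict:
            per_call_scores.append(1.0 if actual_args == expected_args else 0.0)
        else:
            per_call_scores.append(similarity(expected_args, actual_args))

    if not per_call_scores:
        return 0.0
    # 3. Aggregate: all-or-nothing in strict mode, average otherwise.
    if strict:
        return 1.0 if all(score == 1.0 for score in per_call_scores) else 0.0
    return sum(per_call_scores) / len(per_call_scores)
```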
## Examples

### Basic Argument Validation

```python
from opentelemetry.sdk.trace import ReadableSpan
from uipath.eval.evaluators import ToolCallArgsEvaluator
from uipath.eval.models import AgentExecution
# Sample agent execution with tool calls and arguments
mock_spans = [
ReadableSpan(
name="update_user",
start_time=0,
end_time=1,
attributes={
"tool.name": "update_user",
"input.value": "{'user_id': 123, 'fields': {'email': 'user@example.com'}, 'notify': True}",
},
),
]
agent_execution = AgentExecution(
agent_input={"user_id": 123, "action": "update"},
agent_output={"status": "success"},
agent_trace=mock_spans,
)
evaluator = ToolCallArgsEvaluator(
id="args-check-1",
config={
"name": "ToolCallArgsEvaluator",
"strict": False,
"subset": False
}
)
result = await evaluator.validate_and_evaluate_criteria(
agent_execution=agent_execution,
evaluation_criteria={
"tool_calls": [
{
"name": "update_user",
"args": {
"user_id": 123,
"fields": {"email": "user@example.com"},
"notify": True
}
}
]
}
)
print(f"Score: {result.score}") # 1.0 (exact match)
print(f"Details: {result.details}")
```

### Strict Mode - Exact Matching

```python
from opentelemetry.sdk.trace import ReadableSpan
mock_spans = [
ReadableSpan(
name="api_request",
start_time=0,
end_time=1,
attributes={
"tool.name": "api_request",
"input.value": "{'endpoint': '/api/users', 'method': 'GET', 'headers': {'Authorization': 'Bearer token123'}}",
},
),
]
agent_execution = AgentExecution(
agent_input={"action": "fetch_users"},
agent_output={"status": "success"},
agent_trace=mock_spans,
)
evaluator = ToolCallArgsEvaluator(
id="args-strict",
config={
"name": "ToolCallArgsEvaluator",
"strict": True,
"subset": False
}
)
# Arguments must match exactly
result = await evaluator.validate_and_evaluate_criteria(
agent_execution=agent_execution,
evaluation_criteria={
"tool_calls": [
{
"name": "api_request",
"args": {
"endpoint": "/api/users",
"method": "GET",
"headers": {"Authorization": "Bearer token123"}
}
}
]
}
)
# Score is 1.0 only if arguments match exactly
print(f"Score: {result.score}") # 1.0
```

### Non-Strict Mode - Proportional Scoring

```python
from opentelemetry.sdk.trace import ReadableSpan
from uipath.eval.evaluators import ToolCallArgsEvaluator
from uipath.eval.models import AgentExecution
# Agent called 3 tools, but only 2 match the expected args
mock_spans = [
ReadableSpan(
name="validate_input",
start_time=0,
end_time=1,
attributes={
"tool.name": "validate_input",
"input.value": "{'data': {'user_id': 123}}", # ✓ Matches
},
),
ReadableSpan(
name="fetch_user",
start_time=1,
end_time=2,
attributes={
"tool.name": "fetch_user",
"input.value": "{'user_id': 999}", # ✗ Wrong ID!
},
),
ReadableSpan(
name="update_profile",
start_time=2,
end_time=3,
attributes={
"tool.name": "update_profile",
"input.value": "{'user_id': 123, 'updates': {'name': 'John Doe'}}", # ✓ Matches
},
),
]
agent_execution = AgentExecution(
agent_input={"task": "Update user profile"},
agent_output={"status": "updated"},
agent_trace=mock_spans,
)
evaluator = ToolCallArgsEvaluator(
id="args-proportional",
config={
"name": "ToolCallArgsEvaluator",
"strict": False, # Proportional scoring
"subset": False
}
)
result = await evaluator.validate_and_evaluate_criteria(
agent_execution=agent_execution,
evaluation_criteria={
"tool_calls": [
{
"name": "validate_input",
"args": {"data": {"user_id": 123}} # ✓ Matches
},
{
"name": "fetch_user",
"args": {"user_id": 123} # ✗ Actual was 999
},
{
"name": "update_profile",
"args": {
"user_id": 123,
"updates": {"name": "John Doe"} # ✓ Matches
}
}
]
}
)
# Score is 2/3 ≈ 0.67 (2 out of 3 tools matched)
print(f"Score: {result.score}")  # ~0.67 (proportional!)
```

### Subset Mode - Partial Validation

```python
from opentelemetry.sdk.trace import ReadableSpan
# Agent included extra arguments beyond what we're validating
mock_spans = [
ReadableSpan(
name="send_email",
start_time=0,
end_time=1,
attributes={
"tool.name": "send_email",
"input.value": "{'to': 'user@example.com', 'subject': 'Welcome', 'cc': 'admin@example.com', 'body': 'Welcome to our platform!'}",
},
),
]
agent_execution = AgentExecution(
agent_input={"action": "send_welcome"},
agent_output={"status": "sent"},
agent_trace=mock_spans,
)
evaluator = ToolCallArgsEvaluator(
id="args-subset",
config={
"name": "ToolCallArgsEvaluator",
"strict": False,
"subset": True
}
)
# Only validate specific arguments, allow extras
result = await evaluator.validate_and_evaluate_criteria(
agent_execution=agent_execution,
evaluation_criteria={
"tool_calls": [
{
"name": "send_email",
"args": {
"to": "user@example.com",
"subject": "Welcome"
# Agent can include additional args like "cc", "bcc", "body", etc.
}
}
]
}
)
print(f"Score: {result.score}") # 1.0 (subset matches)
```

### Multiple Tool Calls

```python
from opentelemetry.sdk.trace import ReadableSpan
# Agent called multiple tools in sequence
mock_spans = [
ReadableSpan(
name="validate_input",
start_time=0,
end_time=1,
attributes={
"tool.name": "validate_input",
"input.value": "{'data': {'user_id': 123}}",
},
),
ReadableSpan(
name="fetch_user",
start_time=1,
end_time=2,
attributes={
"tool.name": "fetch_user",
"input.value": "{'user_id': 123}",
},
),
ReadableSpan(
name="update_profile",
start_time=2,
end_time=3,
attributes={
"tool.name": "update_profile",
"input.value": "{'user_id': 123, 'updates': {'name': 'John Doe'}}",
},
),
]
agent_execution = AgentExecution(
agent_input={"task": "Update user profile"},
agent_output={"status": "updated"},
agent_trace=mock_spans,
)
evaluator = ToolCallArgsEvaluator(
id="args-multiple",
config={
"name": "ToolCallArgsEvaluator",
"strict": False,
"subset": False
}
)
result = await evaluator.validate_and_evaluate_criteria(
agent_execution=agent_execution,
evaluation_criteria={
"tool_calls": [
{
"name": "validate_input",
"args": {"data": {"user_id": 123}}
},
{
"name": "fetch_user",
"args": {"user_id": 123}
},
{
"name": "update_profile",
"args": {
"user_id": 123,
"updates": {"name": "John Doe"}
}
}
]
}
)
print(f"Score: {result.score}") # 1.0 (all tools match)
```

### Nested Arguments

```python
from opentelemetry.sdk.trace import ReadableSpan
mock_spans = [
ReadableSpan(
name="create_order",
start_time=0,
end_time=1,
attributes={
"tool.name": "create_order",
"input.value": "{'customer': {'id': 123, 'name': 'John Doe', 'address': {'street': '123 Main St', 'city': 'New York', 'zip': '10001'}}, 'items': [{'product_id': 1, 'quantity': 2}, {'product_id': 2, 'quantity': 1}], 'total': 99.99}",
},
),
]
agent_execution = AgentExecution(
agent_input={"task": "Create order"},
agent_output={"status": "created"},
agent_trace=mock_spans,
)
evaluator = ToolCallArgsEvaluator(
id="args-nested",
config={
"name": "ToolCallArgsEvaluator",
"strict": False,
"subset": False
}
)
# Validate complex nested structures
result = await evaluator.validate_and_evaluate_criteria(
agent_execution=agent_execution,
evaluation_criteria={
"tool_calls": [
{
"name": "create_order",
"args": {
"customer": {
"id": 123,
"name": "John Doe",
"address": {
"street": "123 Main St",
"city": "New York",
"zip": "10001"
}
},
"items": [
{"product_id": 1, "quantity": 2},
{"product_id": 2, "quantity": 1}
],
"total": 99.99
}
}
]
}
)
print(f"Score: {result.score}") # 1.0 (nested structure matches)
```

## Justification Details
The evaluator returns a `ToolCallArgsEvaluatorJustification` with:

```json
{
"explained_tool_calls_args": {
"update_user_0": "Actual: {'user_id': 123, 'status': 'active'}, Expected: {'user_id': 123, 'status': 'active'}, Score: 1.0",
"send_email_0": "Actual: {'to': 'user@example.com', 'subject': 'Hello'}, Expected: {'to': 'user@example.com'}, Score: 1.0",
"log_event_0": "Actual: {'type': 'info', 'message': 'Success'}, Expected: {'type': 'info'}, Score: 1.0"
}
}
```

## Best Practices
- **Use non-strict mode by default** - Allows for minor numeric or string variations
- **Use subset mode for flexibility** - When some arguments are optional or variable
- **Combine with order validation** - Use together with the Tool Call Order Evaluator
- **Test critical parameters in strict mode** - For security, authentication, or business-critical params
- **Validate data types** - Ensure strings, numbers, and booleans are used correctly
- **Test edge cases** - Empty strings, null values, zero, negative numbers (see the sketch after this list)
- **Consider partial matches** - Non-strict mode gives partial credit for similar values
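For the data-type and edge-case practices above, it helps to pin down exact types and boundary values in the expected arguments. A hedged sketch (the `update_record` tool and its fields are hypothetical):

```python
# Expected args that spell out types and edge-case values explicitly.
# With subset=False, every one of these must match the actual call.
edge_case_criteria = {
    "tool_calls": [
        {
            "name": "update_record",   # hypothetical tool name
            "args": {
                "record_id": 123,      # int, not the string "123"
                "amount": 0,           # zero is a legitimate value
                "note": "",            # empty string, not None
                "archived": False,     # boolean, not the string "false"
                "parent_id": None,     # explicit null
            },
        }
    ]
}
```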
## When to Use vs Other Evaluators

**Use Tool Call Args when:**

- Argument values matter
- Testing data flow correctness
- Validating parameter transformation
- Ensuring API contract compliance

**Use Tool Call Order when:**

- Sequence matters more than arguments
- Testing workflow steps

**Use Tool Call Output when:**

- Validating what tools return
- Testing tool response handling

**Use JSON Similarity when:**

- Comparing complex outputs (not tool args)
- General data structure comparison
## Limitations

- **Tool name matching** - Must match exact tool names from the trace
- **Order matters for multiple calls** - The first expected call matches the first actual call of that name
- **No cross-tool validation** - Each tool call is evaluated independently
- **Case-sensitive** - Keys and values are case-sensitive
## Error Handling

The evaluator handles the following situations (see the sketch after this list):

- **Missing tools**: Score 0.0 for that tool
- **Extra tools not in criteria**: Ignored
- **Missing arguments**: Scored based on mode
- **Type mismatches**: Scored based on mode (strict vs non-strict)
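As an illustration of the "extra tools are ignored" behavior, the hedged sketch below adds a `log_event` call (hypothetical) to the trace that the criteria never mention; it should not lower the score:

```python
from opentelemetry.sdk.trace import ReadableSpan
from uipath.eval.evaluators import ToolCallArgsEvaluator
from uipath.eval.models import AgentExecution

# The trace contains an extra "log_event" call that the criteria do not list.
mock_spans = [
    ReadableSpan(
        name="fetch_user",
        start_time=0,
        end_time=1,
        attributes={"tool.name": "fetch_user", "input.value": "{'user_id': 123}"},
    ),
    ReadableSpan(
        name="log_event",
        start_time=1,
        end_time=2,
        attributes={"tool.name": "log_event", "input.value": "{'type': 'info'}"},
    ),
]
agent_execution = AgentExecution(
    agent_input={"task": "fetch user"},
    agent_output={"status": "ok"},
    agent_trace=mock_spans,
)

evaluator = ToolCallArgsEvaluator(
    id="args-extra-tools",
    config={"name": "ToolCallArgsEvaluator", "strict": False, "subset": False},
)

result = await evaluator.validate_and_evaluate_criteria(
    agent_execution=agent_execution,
    evaluation_criteria={
        "tool_calls": [
            {"name": "fetch_user", "args": {"user_id": 123}},
        ]
    },
)
print(f"Score: {result.score}")  # 1.0 -- the extra "log_event" call is ignored
```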
## Performance Considerations

- **Fast evaluation**: O(n*m), where n = expected calls and m = actual calls
- **No LLM calls**: Deterministic and quick
- **JSON similarity overhead**: Slightly slower than exact matching
## Related Evaluators
- Tool Call Order Evaluator: Validates tool call sequences
- Tool Call Count Evaluator: Validates tool usage frequencies
- Tool Call Output Evaluator: Validates tool outputs
- JSON Similarity Evaluator: For general JSON comparison