Agent Evaluations

The UiPath SDK provides a comprehensive evaluation framework for assessing agent performance and behavior. This framework enables you to systematically measure and validate agent outputs, execution trajectories, and tool usage patterns.

Overview

The evaluation framework consists of two main categories of evaluators, organized by what they evaluate:

Output-Based Evaluators

These evaluators assess the final output or result produced by an agent.

Trajectory-Based Evaluators

These evaluators assess the execution path, decision-making process, and tool usage patterns during agent execution.

See the individual evaluator pages for the evaluators available in each category.

Custom Evaluators

When built-in evaluators don't meet your specific needs, you can create custom evaluators with your own logic.

Custom evaluators enable:

  • Domain-specific validation: Implement validation logic tailored to your industry or use case
  • Complex scoring algorithms: Use specialized algorithms like Jaccard similarity, Levenshtein distance, or custom metrics (see the sketch after this list)
  • Tool call inspection: Extract and validate data from specific tool calls in the agent trace
  • Integration with external systems: Connect to databases, APIs, or other validation services
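
For example, the "complex scoring algorithms" point above often boils down to a plain scoring function. The sketch below computes a Jaccard similarity score over word sets; it is standalone illustration code and does not use the SDK's evaluator base classes, whose exact API is covered in the Custom Python Evaluators guide.

def jaccard_similarity(expected: str, actual: str) -> float:
    """Score in [0, 1]: size of the word-set intersection divided by the union."""
    expected_words = set(expected.lower().split())
    actual_words = set(actual.lower().split())
    if not expected_words and not actual_words:
        return 1.0
    return len(expected_words & actual_words) / len(expected_words | actual_words)

print(jaccard_similarity("hello world", "hello there world"))  # 0.666...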

See Custom Python Evaluators for a detailed implementation guide, including:

  • Creating evaluator classes with proper type annotations
  • Implementing custom evaluation criteria and configuration
  • Extracting data from agent traces and tool calls
  • Registering evaluators with the CLI
  • Complete examples and best practices

Core Concepts

Evaluation Criteria

Each evaluator uses specific criteria to define what should be evaluated. Criteria can be specified per test case or set as defaults in the evaluator configuration.

Evaluation Results

Evaluators return a score (typically between 0 and 1) along with structured justification details. The justification type is determined by the evaluator's generic type parameter J and is always a BaseEvaluatorJustification subclass. The base class provides expected and actual fields, while specific evaluators extend it with additional fields (e.g., LLMJudgeJustification adds a justification field, JsonSimilarityJustification adds matched_leaves/total_leaves).
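
For instance, given a result returned by an evaluator (such as the one produced in the Getting Started example below), inspecting it might look roughly like this; the attribute names used for the justification are assumptions to verify against the SDK's result models:

# Illustrative only: the exact attribute exposing the justification may differ.
print(result.score)                 # e.g. 1.0
print(result.details.expected)      # expected value from the criteria
print(result.details.actual)        # actual value produced by the agent
# An LLM judge result would additionally expose a justification field,
# and a JSON similarity result would expose matched_leaves / total_leaves.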

Configuration

Each evaluator has a configuration class that defines:

  • name: The evaluator's identifier
  • default_evaluation_criteria: Default criteria if not specified per test
  • Evaluator-specific settings (e.g., case_sensitive, strict, temperature)
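
For example, a configuration in the shape used by the Getting Started example below, with a default criterion added (the exact keys each evaluator accepts vary, so treat this as a sketch):

config = {
    "name": "ExactMatchEvaluator",
    # Used when a test case does not provide its own criteria
    "default_evaluation_criteria": {"expected_output": {"result": "Hello, World!"}},
    # Evaluator-specific setting
    "case_sensitive": False,
}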

Getting Started

To use an evaluator, you typically:

  1. Import the evaluator class
  2. Create an evaluator instance with configuration
  3. Call the evaluate() method with agent execution data and criteria

from uipath.eval.evaluators import ExactMatchEvaluator
from uipath.eval.models import AgentExecution

# Sample agent execution (this should be replaced with your agent run data)
agent_execution = AgentExecution(
    agent_input={"query": "Greet the world"},
    agent_output={"result": "hello, world!"},
    agent_trace=[],
)

# Create evaluator
evaluator = ExactMatchEvaluator(
    id="exact-match-1",
    config={
        "name": "ExactMatchEvaluator",
        "case_sensitive": False,
        "target_output_key": "result",
    }
)

# Evaluate (await must run inside an async context, e.g. an async function executed with asyncio.run)
result = await evaluator.validate_and_evaluate_criteria(
    agent_execution=agent_execution,
    evaluation_criteria={"expected_output": {"result": "Hello, World!"}}
)

print(f"Score: {result.score}")

Best Practices

  1. Choose the right category:
  • Use Output-Based Evaluators to validate what the agent produces (final results)
  • Use Trajectory-Based Evaluators to validate how the agent achieves results (decision-making and tool usage)

  2. Select appropriate evaluators within categories:
  • For outputs: Use deterministic evaluators (exact match, contains, JSON similarity) for predictable outputs and LLM judges for semantic/quality assessment
  • For trajectories: Use tool call evaluators for specific validations and LLM judges for holistic behavior assessment

  3. Combine multiple evaluators: Use different evaluators together for comprehensive evaluation (e.g., exact match for output + tool call order for trajectory); see the sketch after this list

  4. Set appropriate thresholds: Define minimum acceptable scores based on your use case

  5. Evaluate both outputs and trajectories: For complex agents, validate both what they produce and how they produce it

  6. Create custom evaluators when needed: If built-in evaluators don't cover your use case, implement custom evaluators with domain-specific logic
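
As a rough sketch of combining evaluators, you can run several configured evaluator instances against the same agent execution and collect their scores. The helper name below is illustrative; it relies only on the validate_and_evaluate_criteria method shown above.

# Illustrative helper: run several evaluators against one agent execution.
# `pairs` is a list of (evaluator, evaluation_criteria) tuples you assemble yourself.
async def run_evaluators(agent_execution, pairs):
    scores = {}
    for evaluator, criteria in pairs:
        result = await evaluator.validate_and_evaluate_criteria(
            agent_execution=agent_execution,
            evaluation_criteria=criteria,
        )
        scores[evaluator.id] = result.score  # id as passed when constructing the evaluator
    return scores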

Running Evaluations

The UiPath SDK provides a CLI command to run evaluations against your agents. The evaluation framework automatically discovers your agent and evaluation sets, or you can specify them explicitly.

Basic Usage

# Auto-discover entrypoint and evaluation set
uipath eval

# Specify entrypoint and evaluation set
uipath eval <entrypoint> <eval-set-path>

# Run with parallel workers
uipath eval --workers 4

# Save results to file
uipath eval --output-file results.json

# Run specific evaluation IDs
uipath eval --eval-ids '["eval-1", "eval-2"]'

# Disable reporting to Studio Web
uipath eval --no-report

Command Options

  • entrypoint (positional): Path to agent script (optional, auto-discovered if not specified)
  • eval_set (positional): Path to evaluation set JSON file (optional, auto-discovered if not specified)
  • --eval-ids (list): Specific evaluation IDs to run from the eval set
  • --eval-set-run-id (string): Custom evaluation run ID (UUID generated if not provided)
  • --workers (integer): Number of parallel workers (default: 1)
  • --output-file (path): File path to save evaluation results
  • --no-report (flag): Disable reporting results to Studio Web
  • --input-overrides (JSON): Overrides any input parameter; most often used for file inputs (see below)

Running Multimodal Evaluations with Input Overrides

You can use the --input-overrides parameter to override any input parameter at runtime. This is most commonly used for file attachment inputs.

Input Overrides Format

The --input-overrides parameter accepts a JSON object where:

  • Keys are evaluation IDs
  • Values are objects containing input parameter names mapped to their override values

How File Attachment Overrides Work

At Design Time: Evaluation sets define file attachments using relative local paths within the Studio Web project:

{
  "filePath": {
    "ID": "evaluationFiles/document.pdf",
    "FullName": "Document.pdf",
    "MimeType": "application/pdf"
  }
}

At Runtime (with --input-overrides): When running evaluations, you override the local path with an attachmentId obtained by uploading the file to your personal workspace:

{
  "filePath": {
    "ID": "550e8400-e29b-41d4-a716-446655440000"
  }
}

The system automatically merges this override with the original definition, preserving FullName and MimeType fields. The final input passed to your agent becomes:

{
  "filePath": {
    "ID": "550e8400-e29b-41d4-a716-446655440000",
    "FullName": "Document.pdf",
    "MimeType": "application/pdf"
  }
}

Key Points:

  • Only the ID field needs to be specified in the override
  • FullName and MimeType are automatically preserved from the evaluation set definition
  • You can override any field, but typically only ID changes at runtime
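
Conceptually, the merge behaves like a recursive dictionary merge in which override fields win and fields absent from the override are kept from the evaluation set definition. The function below only illustrates that behavior; it is not the SDK's actual implementation.

def merge_override(original: dict, override: dict) -> dict:
    """Illustrative merge: override fields win, missing fields are kept."""
    merged = dict(original)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_override(merged[key], value)
        else:
            merged[key] = value
    return merged

design_time = {
    "filePath": {
        "ID": "evaluationFiles/document.pdf",
        "FullName": "Document.pdf",
        "MimeType": "application/pdf",
    }
}
runtime_override = {"filePath": {"ID": "550e8400-e29b-41d4-a716-446655440000"}}
print(merge_override(design_time, runtime_override))
# -> ID overridden, FullName and MimeType preserved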

Example: Single File Attachment

uipath eval agent.py eval_set.json \
  --eval-ids '["eval-001"]' \
  --eval-set-run-id e4d8f9a2-1234-5678-9abc-def012345678 \
  --input-overrides '{
    "eval-001": {
      "filePath": {
        "ID": "550e8400-e29b-41d4-a716-446655440000"
      }
    }
  }'

Example: Multiple Files (Array)

uipath eval agent.py eval_set.json \
  --eval-ids '["eval-002"]' \
  --eval-set-run-id e4d8f9a2-1234-5678-9abc-def012345678 \
  --input-overrides '{
    "eval-002": {
      "documents": [
        {
          "ID": "550e8400-e29b-41d4-a716-446655440001"
        },
        {
          "ID": "550e8400-e29b-41d4-a716-446655440002"
        }
      ]
    }
  }'

Example: Multiple Evaluations

uipath eval agent.py eval_set.json \
  --eval-ids '["eval-001", "eval-002", "eval-003"]' \
  --eval-set-run-id e4d8f9a2-1234-5678-9abc-def012345678 \
  --input-overrides '{
    "eval-001": {
      "image": {
        "ID": "550e8400-e29b-41d4-a716-446655440003"
      }
    },
    "eval-002": {
      "document": {
        "ID": "550e8400-e29b-41d4-a716-446655440004"
      }
    },
    "eval-003": {
      "files": [
        {"ID": "550e8400-e29b-41d4-a716-446655440005"},
        {"ID": "550e8400-e29b-41d4-a716-446655440006"}
      ]
    }
  }'

Evaluation Sets

Evaluation sets are JSON files that define test cases and specify which evaluators to use:

{
  "version": "1.0",
  "id": "my-eval-set",
  "evaluatorRefs": ["exact-match-1", "MyCustomEvaluator"],
  "evaluationItems": [
    {
      "id": "test-1",
      "agentInput": {"query": "What is 2+2?"},
      "evaluations": [
        {
          "evaluatorId": "exact-match-1",
          "evaluationCriteria": {
            "expectedOutput": {"result": "4"}
          }
        },
        {
          "evaluatorId": "MyCustomEvaluator",
          "evaluationCriteria": {
            "expectedValues": ["value1", "value2"]
          }
        }
      ]
    }
  ]
}

Results

Evaluation results include:

  • Score: Numeric score (typically 0.0 to 1.0) or boolean pass/fail
  • Details: Structured justification for the evaluation (e.g., BaseEvaluatorJustification with expected/actual for deterministic evaluators, JsonSimilarityJustification with matched_leaves/total_leaves, or LLMJudgeJustification with expected/actual/justification for LLM judges)
  • Metrics: Token usage, latency, and other execution metrics
  • Trace: Full execution trace including tool calls and outputs

Results can be viewed in:

  • Console output: Real-time progress and summary
  • Output file: JSON file with detailed results (use --output-file)
  • Studio Web: Automatically reported if running in a Studio project (unless --no-report is specified)

Reference Documentation

See the individual evaluator pages for detailed information on configuration, usage, and examples.