Arize gives you the observability framework plus the evaluation toolkit for both local debugging and large-scale performance monitoring. As your agent architecture grows more complex—with additional tools, APIs, or specialized sub-agents—Arize remains the single place to trace every step and assess whether your system is delivering on its goals.


1. Why Agent-Based Systems Matter


2. Observability: Seeing Inside Your Agent


3. Evaluations: Measuring Your Agent’s Performance

There are three primary evaluation methods:

  1. Code-Based Evaluations
  2. LLM-as-a-Judge Evaluations
  3. Human Annotations
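Of the three, code-based evaluations are the simplest to run: deterministic checks written as ordinary functions over an agent's output, with no LLM call required. The sketch below illustrates the idea with two hypothetical checks (the function names and the regex are illustrative, not part of any Arize API): one verifies that an output parses as JSON, the other flags outputs that leak an email address.

```python
import json
import re

def eval_valid_json(output: str) -> bool:
    """Code-based check: did the agent answer in valid JSON, as instructed?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def eval_contains_no_email(output: str) -> bool:
    """Code-based check: flag outputs that leak an email address."""
    return re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", output) is None

# Run the checks over a batch of (hypothetical) agent outputs.
outputs = ['{"answer": 42}', "Contact me at alice@example.com"]
for out in outputs:
    print(eval_valid_json(out), eval_contains_no_email(out))
```

LLM-as-a-judge and human annotations cover the subjective qualities (helpfulness, tone, correctness of free-form answers) that checks like these cannot express.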

4. Measuring the Path: Convergence & Trajectory
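A common way to score convergence is to compare each run's path length against the shortest path observed for the same task: a score of 1.0 means every run took the minimal number of steps, while scores near 0 indicate long detours. A minimal sketch of that metric (the function name and the step counts are illustrative assumptions):

```python
def convergence(path_lengths: list[int]) -> float:
    """Average of (shortest observed path / run's path length) over runs.

    1.0 means every run converged on the minimal trajectory; lower
    values mean the agent took unnecessary steps on some runs.
    """
    optimal = min(path_lengths)
    return sum(optimal / n for n in path_lengths) / len(path_lengths)

# e.g. three runs of the same task taking 4, 4, and 8 steps
print(convergence([4, 4, 8]))  # ≈ 0.83
```

Trajectory evaluation goes a step further than this scalar score, asking whether the *sequence* of tool calls matches an expected or acceptable ordering, not just how many steps were taken.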