Agent Trace Evaluation with TruLens Scorers in MLflow

March 4, 2026 · 7 min read

Lead Specialist Solutions Architect at Databricks

MLflow's third-party scorer framework already supports LLM-as-a-judge evaluations from DeepEval, RAGAS, and Phoenix, an ecosystem with 18M+ monthly PyPI downloads. We're excited to announce the TruLens integration, the latest step in expanding MLflow's support for third-party evaluation frameworks.

An agent doesn't just produce an answer. It makes a plan, picks tools, executes a multi-step workflow, and adapts when steps fail. A correct final answer can mask a flawed plan, redundant tool calls, or broken reasoning along the way. To catch those problems, you need to evaluate what happened inside the execution trace, not just what came out the other end.

The integration adds 10 scorers: 4 RAG metrics and 6 agent trace scorers that bring the Agent GPA framework to MLflow. You pass an MLflow trace, and each agent scorer reads the full span tree (plans, tool calls, intermediate outputs) to evaluate agent behavior. MLflow already supports trace-based judges and agentic metrics from DeepEval and RAGAS. TruLens adds a structured three-dimensional lens (Goal, Plan, Action) developed by the TruLens team at Snowflake.

The Agent GPA Framework

GPA stands for Goal-Plan-Action, and it evaluates three alignment dimensions in an agent's execution:

Goal-Plan alignment asks: did the agent make a good strategy? An agent that gets the right answer by brute-forcing every available tool has a Goal-Plan problem even if the output looks fine.

PlanQuality checks whether the plan decomposes the goal into feasible subtasks.
ToolSelection checks whether the agent picked the right tools for each subtask.

Plan-Action alignment asks: did the agent follow through? Did it skip steps, reorder things, or repeat work?

PlanAdherence checks whether the agent's actual actions match its stated plan.
ToolCalling checks whether function calls are valid, with correct parameters and complete inputs.

Holistic alignment looks at the trajectory as a whole.

LogicalConsistency checks whether each step is coherent with prior context and reasoning.
ExecutionEfficiency checks whether the agent reached the goal without redundant calls.

On the TRAIL/GAIA benchmark, GPA judges identify 95% of human-labeled agent errors (267/281), compared to 55% for standard judges that only look at final outputs. That 40-percentage-point gap is what you leave on the table when you only evaluate the answer.
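The arithmetic behind those figures is easy to check. The short sketch below uses only the numbers quoted above; the variable names are illustrative.

```python
# Figures quoted in the paragraph above.
gpa_caught = 267        # human-labeled errors identified by GPA judges
total_errors = 281      # all human-labeled errors
baseline_rate = 0.55    # quoted rate for judges that only see final outputs

gpa_rate = gpa_caught / total_errors
missed = total_errors - gpa_caught
gap_points = (round(gpa_rate, 2) - baseline_rate) * 100

print(f"GPA detection rate: {gpa_rate:.1%}")            # 95.0%
print(f"Errors missed by GPA judges: {missed}")          # 14
print(f"Gap vs. output-only judges: {gap_points:.0f} percentage points")  # 40
```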

The integration exposes six agent scorers covering the three GPA dimensions:

Scorer	Alignment	What it checks
`PlanQuality`	Goal-Plan	Does the plan decompose the goal into feasible subtasks?
`ToolSelection`	Goal-Plan	Did the agent pick the right tools for each subtask?
`PlanAdherence`	Plan-Action	Did the agent follow its plan, or skip and reorder steps?
`ToolCalling`	Plan-Action	Are tool calls valid with correct parameters and complete inputs?
`LogicalConsistency`	Holistic	Is each step coherent with prior context and reasoning?
`ExecutionEfficiency`	Holistic	Did the agent reach the goal without redundant calls?
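One way to use this grouping when triaging results is to bucket failing scorers by alignment dimension, so you can see whether a problem sits in planning, execution, or the overall trajectory. The sketch below is plain Python; `GPA_DIMENSIONS` and `group_by_dimension` are hypothetical helpers written for this post, not part of MLflow or TruLens.

```python
from collections import defaultdict

# Scorer-to-dimension mapping copied from the table above.
GPA_DIMENSIONS = {
    "PlanQuality": "Goal-Plan",
    "ToolSelection": "Goal-Plan",
    "PlanAdherence": "Plan-Action",
    "ToolCalling": "Plan-Action",
    "LogicalConsistency": "Holistic",
    "ExecutionEfficiency": "Holistic",
}


def group_by_dimension(failed_scorers):
    """Group failing scorer names by GPA dimension (hypothetical helper)."""
    grouped = defaultdict(list)
    for name in failed_scorers:
        grouped[GPA_DIMENSIONS[name]].append(name)
    return dict(grouped)


print(group_by_dimension(["ToolCalling", "PlanAdherence", "ExecutionEfficiency"]))
# {'Plan-Action': ['ToolCalling', 'PlanAdherence'], 'Holistic': ['ExecutionEfficiency']}
```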

Pass a trace and the scorer handles the rest. Under the hood, the integration serializes your MLflow trace to JSON and passes the full span tree to TruLens' provider, which evaluates each dimension with chain-of-thought reasoning. You get back a score and a rationale explaining what it found.
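To make "serializes the span tree to JSON" concrete, here is a minimal standard-library sketch. The `Span` dataclass and `serialize_trace` function are hypothetical stand-ins for illustration; they are not MLflow's trace classes and do not reproduce the integration's actual serialization format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class Span:
    """Hypothetical stand-in for a trace span (not MLflow's Span class)."""
    name: str
    inputs: Dict[str, Any]
    outputs: Any = None
    children: List["Span"] = field(default_factory=list)


def serialize_trace(root: Span) -> str:
    """Serialize the whole span tree, nested children included, to JSON."""
    return json.dumps(asdict(root), indent=2)


root = Span(
    name="travel_agent",
    inputs={"goal": "book a trip from NYC to LAX"},
    children=[
        Span("search_flights", {"origin": "NYC", "dest": "LAX"}, outputs="3 results"),
        Span("book_flight", {"flight_id": "FL123"}, outputs="confirmed"),
    ],
)
print(serialize_trace(root))
```

The point of handing the judge the nested structure, rather than only the root inputs and outputs, is that every intermediate call stays visible to it.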

Figure: Architecture of the TruLens trace evaluation pipeline. An MLflow agent trace with spans is serialized to JSON and passed to the TruLens GPA Provider, which is backed by a model provider. The provider evaluates six scorer dimensions grouped by Goal-Plan, Plan-Action, and Holistic alignment, and produces scores and rationales that flow into MLflow Feedback and the assessment table UI.

How Trace Evaluation Catches What Output Evaluation Misses

Here's a concrete scenario. Say you have a travel-planning agent that should: (1) search for flights, (2) check hotel availability, (3) book both. The agent returns "Your trip is booked!" and it looks correct. But the trace tells a different story:

Span 1: search_flights("NYC", "LAX", "2026-04-01") -> 3 results
Span 2: search_flights("NYC", "LAX", "2026-04-01") -> 3 results  <- duplicate
Span 3: book_flight(flight_id="FL123") -> confirmed
Span 4: search_hotels("LAX", "2026-04-01") -> 2 results
Span 5: book_hotel(hotel_id=None) -> error
Span 6: book_hotel(hotel_id="H456") -> confirmed

Output-only evaluation gives this a pass. The trip got booked. Trace-level evaluation catches three problems:

ExecutionEfficiency: redundant flight search (Span 2 duplicates Span 1)
ToolCalling: book_hotel called with None before retry (Span 5)
PlanAdherence: the agent booked a flight before searching for hotels
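To show why these problems are visible only in the trace, the sketch below encodes the six spans as plain tuples and flags the same three issues with simple deterministic checks. This is an illustration only: the TruLens scorers use an LLM judge with chain-of-thought reasoning, not rules like these, and every function name here is hypothetical.

```python
# The six spans from the scenario above, as (tool, arguments) pairs.
spans = [
    ("search_flights", ("NYC", "LAX", "2026-04-01")),
    ("search_flights", ("NYC", "LAX", "2026-04-01")),
    ("book_flight", ("FL123",)),
    ("search_hotels", ("LAX", "2026-04-01")),
    ("book_hotel", (None,)),
    ("book_hotel", ("H456",)),
]


def find_duplicate_calls(spans):
    """Return 1-based span numbers that repeat an earlier identical call."""
    seen, duplicates = set(), []
    for number, call in enumerate(spans, start=1):
        if call in seen:
            duplicates.append(number)
        seen.add(call)
    return duplicates


def find_missing_arguments(spans):
    """Return 1-based span numbers where any argument is None."""
    return [
        number
        for number, (_, args) in enumerate(spans, start=1)
        if any(arg is None for arg in args)
    ]


def find_early_bookings(spans):
    """Return 1-based span numbers of bookings made before the last search."""
    numbered = list(enumerate(spans, start=1))
    last_search = max(n for n, (tool, _) in numbered if tool.startswith("search_"))
    return [n for n, (tool, _) in numbered if tool.startswith("book_") and n < last_search]


print(find_duplicate_calls(spans))    # [2]  redundant flight search
print(find_missing_arguments(spans))  # [5]  book_hotel called with None
print(find_early_bookings(spans))     # [3]  flight booked before hotel search
```

None of these checks can run against the final output "Your trip is booked!", which is the whole argument for evaluating the trace.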

Combining Agent and RAG Evaluation

You can mix agent trace scorers, RAG scorers, and scorers from other frameworks in a single mlflow.genai.evaluate() call. The trace scorers read the span tree, while RAG scorers like Groundedness extract context from retrieval spans in the trace automatically. All scorers support a model parameter for choosing your LLM provider (OpenAI, Anthropic, or any LiteLLM-compatible provider).

import mlflow
from mlflow.genai.scorers.trulens import (
    Groundedness,
    PlanAdherence,
    ExecutionEfficiency,
)
from mlflow.genai.scorers.phoenix import Hallucination

traces = mlflow.search_traces(locations=["..."])

results = mlflow.genai.evaluate(
    data=traces,
    scorers=[
        # Agent behavior (reads the full span tree)
        PlanAdherence(model="openai:/gpt-5-mini"),
        ExecutionEfficiency(model="openai:/gpt-5-mini"),
        # RAG quality (extracts context from retrieval spans)
        Groundedness(model="openai:/gpt-5-mini"),
        # Content quality (Phoenix)
        Hallucination(model="openai:/gpt-5-mini"),
    ],
)

Each scorer runs independently and writes results to the same experiment. Results land in the MLflow assessment table alongside any other evaluation results.

Figure: MLflow trace detail for the travel-planning agent, with spans on the left and TruLens GPA assessments on the right; execution_efficiency is expanded to show its rationale.

Figure: MLflow Traces list showing one agent trace with assessment columns execution_efficiency (0.67), logical_consistency (0.67), and plan_adherence (0.67), plus average-score bar charts for each scorer.

Getting Started

pip install "mlflow>=3.10.0" trulens trulens-providers-litellm

from mlflow.genai.scorers.trulens import PlanAdherence, Groundedness

# my_agent_trace and my_rag_trace are MLflow Trace objects you have already loaded
# Agent trace scorer
scorer = PlanAdherence(model="openai:/gpt-5-mini")
feedback = scorer(trace=my_agent_trace)
print(feedback.value)      # "yes" or "no" based on threshold
print(feedback.rationale)  # Chain-of-thought reasoning

# RAG scorer (extracts context from retrieval spans in trace)
scorer = Groundedness(model="openai:/gpt-5-mini", threshold=0.6)
feedback = scorer(trace=my_rag_trace)
print(feedback.value)              # "yes" or "no"
print(feedback.rationale)          # Why it passed or failed
print(feedback.metadata["score"])  # 0.85
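The relationship between the numeric score and the "yes"/"no" value can be pictured as a simple threshold comparison. The sketch below is an assumption about that mapping, not the integration's code: the `to_verdict` name, the inclusive `>=` comparison, and the 0.5 default are all hypothetical.

```python
def to_verdict(score: float, threshold: float = 0.5) -> str:
    """Map a judge score in [0, 1] to "yes" or "no".

    Assumption: a score equal to the threshold passes. The real
    integration may compare differently or use another default.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    return "yes" if score >= threshold else "no"


# Using the example numbers from the snippet above.
print(to_verdict(0.85, threshold=0.6))  # yes
print(to_verdict(0.40, threshold=0.6))  # no
```

Keeping the raw score in metadata alongside the verdict lets you retune the threshold later without rerunning the judge.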

Resources

Provenance

I contributed the TruLens integration (PR #19492) to MLflow's open-source third-party scorer framework, adding 10 scorers: 4 RAG metrics and 6 agent trace evaluators based on the Agent GPA framework. The integration went through four review rounds with Samraj Moorjani (Software Engineer at Databricks, MLflow maintainer), with final approval from Avesh C. Singh (Software Engineer at Databricks). It follows the scorer pattern Moorjani established in the DeepEval and RAGAS integrations and extends it to agent trace evaluation, a category that requires reading the full span tree rather than just inputs and outputs.

Josh Reini (TruLens maintainer, Snowflake) reviewed the integration's scorer semantics and validated the trace-aware evaluation behavior. Reini published a companion post on the Snowflake Engineering Blog covering the Agent GPA research and TRAIL benchmark results in depth. A cross-project documentation PR was also merged into the TruLens repository.

Related artifacts:

Upstream MLflow TruLens PR #19492 (merged)
TruLens documentation PR #2344 (merged, cross-project)
Introducing DeepEval, RAGAS, and Phoenix Judges in MLflow (companion blog)
