
# cli_eval bypasses App.plugins, breaking observability plugins (e.g., BigQueryAgentAnalyticsPlugin) during eval runs #5503

@saifer82


## Summary

When an agent is wrapped in an `App(root_agent=..., plugins=[...])` and you run evals via `adk eval` (or `agents-cli eval run`), the registered plugins do not fire. The eval CLI accesses `agent_module.agent.root_agent` directly and runs eval sessions against the bare agent, bypassing the `App` object and its plugin chain.

This means observability plugins like `BigQueryAgentAnalyticsPlugin` capture interactive (`adk web` / `adk run`) sessions but produce no rows for eval runs — leaving cost, latency, and trajectory telemetry blind during the development loop where evals run most often.

## Reproduction

ADK version: 1.31.1

```python
# app/agent.py
from google.adk.agents import LlmAgent
from google.adk.apps import App
from google.adk.plugins.bigquery_agent_analytics_plugin import (
    BigQueryAgentAnalyticsPlugin,
    BigQueryLoggerConfig,
)

root_agent = LlmAgent(name="root_agent", model="gemini-2.5-flash", ...)

app = App(
    root_agent=root_agent,
    name="app",
    plugins=[
        BigQueryAgentAnalyticsPlugin(
            project_id="your-project",
            dataset_id="telemetry",
            table_id="agent_events",
            config=BigQueryLoggerConfig(log_session_metadata=True),
        ),
    ],
)
```

1. `adk web` → run a few turns interactively → events appear in `telemetry.agent_events`
2. `adk eval ./app path/to/case.evalset.json` → eval runs complete successfully, but no rows appear in `telemetry.agent_events` for the eval session ❌

## Root cause

In `google/adk/cli/cli_eval.py`:

```python
def _get_agent_module(agent_module_file_path: str):
    file_path = os.path.join(agent_module_file_path, "__init__.py")
    module_name = "agent"
    return _import_from_path(module_name, file_path)


def get_root_agent(agent_module_file_path: str) -> Agent:
    """Returns root agent given the agent module."""
    agent_module = _get_agent_module(agent_module_file_path)
    root_agent = agent_module.agent.root_agent
    return root_agent
```

The eval flow imports the agent module and reaches into `agent_module.agent.root_agent`. The `App` instance (and its `plugins=[...]` list) is never resolved or used, so plugin lifecycle hooks (`before_agent_callback`, `on_event_callback`, etc.) are never wired to the eval runner.

By contrast, `adk web` / `adk run` go through `AdkWebServer`, which constructs sessions via the `App`, so plugins fire correctly.

## Why this matters

- **Cost monitoring during development is blind.** Eval runs are typically the largest chunk of LLM cost during prompt iteration (we measured ~€6 per full 29-case run, mostly Pro tokens), but this cost is invisible to dashboards built on `BigQueryAgentAnalyticsPlugin` or any other observability plugin.
- **Eval-specific observability is the most useful kind.** Knowing per-case latency, token breakdown, and tool-trajectory drift across eval runs is exactly what you want when iterating on prompts. Today users have to fall back to Cloud Billing exports, which are much coarser.
- **The `BigQueryAgentAnalyticsPlugin` doc actively advertises eval analytics as a use case** ("LLM-as-judge evals — structured data for evaluation pipelines"). The current behavior contradicts that.

## Proposed fix

Make `get_root_agent` (and the surrounding eval flow) prefer the `App` object when it exists in the agent module, and run eval sessions through the `App`'s runner (or equivalent) so plugins fire.

Sketch:

```python
def get_app_or_root_agent(agent_module_file_path: str):
    """Returns (app, root_agent); falls back to the bare root_agent if no App is exported."""
    agent_module = _get_agent_module(agent_module_file_path)
    app = getattr(agent_module.agent, "app", None)
    if app is not None:
        return app, app.root_agent
    return None, agent_module.agent.root_agent
```
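For illustration, the duck-typed fallback can be exercised against stub modules (using `types.SimpleNamespace` as a stand-in for the imported agent package — the shapes below are hypothetical, not ADK internals):

```python
from types import SimpleNamespace


def get_app_or_root_agent_from(agent_module):
    """Same fallback logic as the sketch above, taking the module object directly.

    Prefers a module-level `app` export; otherwise returns the bare root_agent.
    """
    app = getattr(agent_module.agent, "app", None)
    if app is not None:
        return app, app.root_agent
    return None, agent_module.agent.root_agent


# Stubs standing in for imported agent packages (hypothetical shapes).
root = SimpleNamespace(name="root_agent")
with_app = SimpleNamespace(
    agent=SimpleNamespace(
        root_agent=root,
        app=SimpleNamespace(root_agent=root, plugins=["bq_plugin"]),
    )
)
bare = SimpleNamespace(agent=SimpleNamespace(root_agent=root))

# Module exporting an App: the App (and its plugin list) is preferred.
app, agent = get_app_or_root_agent_from(with_app)
assert app is not None and app.plugins == ["bq_plugin"] and agent is root

# Module without an App: falls back to the bare root_agent.
app, agent = get_app_or_root_agent_from(bare)
assert app is None and agent is root
```

Because `getattr(..., "app", None)` is used rather than a hard attribute access, projects that never define an `App` keep working unchanged.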

Then update the eval runner (`evaluation_generator.py` etc.) to use the `App` for session creation when present, so plugin callbacks are invoked the same way `adk web` invokes them. The bare-root-agent path remains available for projects that don't define an `App`.

Happy to draft a PR if there's interest and the maintainers agree on the approach. I'd also want feedback on whether the current bypass is intentional (e.g., to avoid plugin side effects polluting eval runs); if so, an opt-in flag like `adk eval --use-app-plugins` would also close the gap without changing the default.

## Workaround for now

For users hitting this:

- Estimate eval cost from Cloud Billing → Vertex AI line items in the eval window, minus the interactive-session cost reported by your dashboard.
- Or: write a custom eval runner that uses the `App`'s runner directly instead of going through `adk eval`.
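The second workaround amounts to driving each eval case through the app's runner rather than invoking the bare agent. A framework-agnostic toy model of why that restores telemetry (all class and method names below are stand-ins, not ADK APIs):

```python
class RecordingPlugin:
    """Toy observability plugin: records one event per agent turn."""

    def __init__(self):
        self.events = []

    def on_event(self, event):
        self.events.append(event)


class ToyApp:
    """Stand-in for App: a root agent plus a plugin chain."""

    def __init__(self, root_agent, plugins):
        self.root_agent = root_agent
        self.plugins = plugins

    def run(self, prompt):
        result = self.root_agent(prompt)  # the agent turn itself
        for plugin in self.plugins:       # plugin hooks fire around it
            plugin.on_event({"prompt": prompt, "result": result})
        return result


agent = lambda prompt: prompt.upper()
plugin = RecordingPlugin()
app = ToyApp(agent, [plugin])

# Calling the bare agent (what cli_eval does today): no telemetry recorded.
app.root_agent("eval case 1")
assert plugin.events == []

# Routing the same case through the app: plugins fire.
app.run("eval case 1")
assert len(plugin.events) == 1
```

The same eval results come back either way; only the plugin side channel differs, which is exactly the gap this issue describes.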

## Environment

- `google-adk` 1.31.1
- Python 3.13
- `ContextCacheConfig` and `BigQueryAgentAnalyticsPlugin` both registered on the `App`

Labels: eval [Component] (This issue is related to evaluation)