
feat(server): pluggable request-auth framework (management + runtime)#204

Open
abhinav-galileo wants to merge 11 commits into abhi/data-model-v1 from abhi/management-auth-framework


Collaborator

@abhinav-galileo abhinav-galileo commented Apr 28, 2026

Summary

Pluggable request-auth framework that handles both auth flows the system needs:

  • Management. Online check on every request. The default authorizer authenticates the credential and authorizes the operation; in production this is HttpUpstreamAuthProvider forwarding to a configurable upstream service.
  • Runtime. Two-phase exchange-then-verify. A target-bearing call presents a long-lived credential plus (target_type, target_id) to a token exchange endpoint; the server mints a short-lived HS256 JWT bound to that target. Subsequent runtime calls verify the JWT locally — no upstream round-trip on the hot path.

Both flows route through the same primitives (Operation vocabulary on endpoints, Principal returned, RequestAuthorizer Protocol installed); a per-operation registry lets a deployment point management ops at one provider and runtime ops at another.

Migrates the /control-bindings endpoint family onto the framework and ships the runtime token exchange endpoint. The runtime resolution path itself (/evaluation etc.) is wired in a follow-up — its provider override (LocalJwtVerifyProvider) is already in place when the runtime secret is configured.

Module layout

server/src/agent_control_server/auth_framework/
  __init__.py                   # public API
  core.py                       # Operation, Principal, RequestAuthorizer, require_operation, registry
  config.py                     # configure_auth_from_env (env-driven setup, both flows)
  runtime_token.py              # HS256 mint / verify helpers
  providers/
    __init__.py
    header.py                   # HeaderAuthProvider + DEFAULT_OPERATION_ACCESS
    http_upstream.py            # HttpUpstreamAuthProvider (forward + parse grant)
    local_jwt.py                # LocalJwtVerifyProvider (hot-path JWT verify)

server/src/agent_control_server/endpoints/
  auth.py                       # POST /api/v1/auth/runtime-token-exchange

auth.py (legacy local credential check) is unchanged; HeaderAuthProvider re-uses _validate_api_key from it. Non-binding routes still go through the legacy router-level gate; their migration happens in follow-up PRs.

Operation vocabulary

class Operation(StrEnum):
    # Wired on endpoints in this PR.
    CONTROL_BINDINGS_READ = "control_bindings.read"
    CONTROL_BINDINGS_WRITE = "control_bindings.write"
    RUNTIME_TOKEN_EXCHANGE = "runtime.token_exchange"

    # Reserved; not yet wired on endpoints.
    CONTROLS_READ = "controls.read"
    CONTROLS_CREATE = "controls.create"
    CONTROLS_UPDATE = "controls.update"
    CONTROLS_DELETE = "controls.delete"
    RUNTIME_USE = "runtime.use"

Per-operation authorizer registry

set_authorizer(authorizer, operation=...) overrides the default for one operation. Without operation=, it becomes the default for every operation that does not have a specific binding. Used to route management ops through one provider and Operation.RUNTIME_USE through LocalJwtVerifyProvider:

set_authorizer(HttpUpstreamAuthProvider(...))                 # default
set_authorizer(LocalJwtVerifyProvider(secret=...),             # override
               operation=Operation.RUNTIME_USE)

require_operation(op) consults the override first, falls back to the default. The OSS path (no override installed) routes everything to HeaderAuthProvider — the no-auth flow (api_key_enabled=False) is preserved end-to-end.
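The lookup order described above can be sketched roughly as follows. This is an illustration only, not the code in core.py: the module-level storage, the clear_authorizers helper, and the string stand-ins for provider instances are assumptions made for the example. Only the set_authorizer / get_authorizer names and the override-then-default order come from the description.

```python
from enum import Enum
from typing import Optional


class Operation(str, Enum):
    # Subset of the PR's vocabulary, enough for the example.
    CONTROL_BINDINGS_READ = "control_bindings.read"
    RUNTIME_USE = "runtime.use"


# Hypothetical storage; the real registry may be structured differently.
_default_authorizer: Optional[object] = None
_overrides: dict = {}


def set_authorizer(authorizer: object, operation: Optional[Operation] = None) -> None:
    """Install the default authorizer, or an override for one operation."""
    global _default_authorizer
    if operation is None:
        _default_authorizer = authorizer
    else:
        _overrides[operation] = authorizer


def get_authorizer(operation: Operation) -> object:
    """Return the per-operation override if present, else the default."""
    if operation in _overrides:
        return _overrides[operation]
    if _default_authorizer is None:
        raise RuntimeError("no authorizer configured")
    return _default_authorizer


def clear_authorizers() -> None:
    """Drop the default and every override (assumed helper)."""
    global _default_authorizer
    _default_authorizer = None
    _overrides.clear()


# Strings stand in for provider instances.
set_authorizer("upstream-provider")                              # default
set_authorizer("jwt-provider", operation=Operation.RUNTIME_USE)  # override
```

With this shape, a management operation resolves to the default while Operation.RUNTIME_USE resolves to its override, and clearing the registry makes every lookup raise until something is installed again.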

Providers (three ship in-tree)

HeaderAuthProvider — local-credential path, single namespace.

  • Maps each Operation to one of three access levels (PUBLIC, AUTHENTICATED, ADMIN); single source of truth in DEFAULT_OPERATION_ACCESS.
  • Reuses the existing local API-key + session-cookie credential check from auth.py, so behavior matches the previous require_admin_key path verbatim.
  • The no-auth flow (api_key_enabled=False) is preserved: every operation succeeds with a non-admin Principal. Pinned by a regression test.
  • Always returns DEFAULT_NAMESPACE_KEY. The namespace header lookup branch is preserved but inert until non-binding write endpoints are threaded.
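A rough sketch of how the access-level decision could look. The specific levels assigned below, the "default" namespace value, the authorize function, and the use of PermissionError are assumptions for illustration; the PR only states that each Operation maps to PUBLIC, AUTHENTICATED, or ADMIN, that no-auth mode succeeds with a non-admin Principal, and that an unknown operation raises.

```python
from dataclasses import dataclass
from enum import Enum


class AccessLevel(Enum):
    PUBLIC = "public"
    AUTHENTICATED = "authenticated"
    ADMIN = "admin"


# Illustrative levels only; the real DEFAULT_OPERATION_ACCESS may differ.
DEFAULT_OPERATION_ACCESS = {
    "control_bindings.read": AccessLevel.AUTHENTICATED,
    "control_bindings.write": AccessLevel.ADMIN,
}

DEFAULT_NAMESPACE_KEY = "default"  # placeholder value for the example


@dataclass(frozen=True)
class Principal:
    namespace_key: str
    is_admin: bool


def authorize(operation: str, *, api_key_enabled: bool,
              key_valid: bool = False, key_is_admin: bool = False) -> Principal:
    """Decide access for one operation using the local-credential rules."""
    level = DEFAULT_OPERATION_ACCESS.get(operation)
    if level is None:
        raise KeyError(f"unknown operation: {operation}")
    if not api_key_enabled or level is AccessLevel.PUBLIC:
        # No-auth mode and PUBLIC operations succeed with a non-admin Principal.
        return Principal(DEFAULT_NAMESPACE_KEY, is_admin=False)
    if not key_valid:
        raise PermissionError("credential missing or invalid")
    if level is AccessLevel.ADMIN and not key_is_admin:
        raise PermissionError("admin credential required")
    return Principal(DEFAULT_NAMESPACE_KEY, is_admin=key_is_admin)
```

Keeping the mapping in one dictionary is what makes the "every Operation member has a default access mapping" regression guard cheap to write.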

HttpUpstreamAuthProvider — generic upstream-delegating provider.

  • Forwards caller credentials (X-API-Key, Authorization, Cookie) on a POST to a configurable URL with {operation, context?}.
  • Optional service-to-service token header for upstream→authorization-service trust.
  • Parses the upstream response into a Principal: namespace_key, is_admin, caller_id, plus optional grant fields (target_type, target_id, scopes, expires_at) so the runtime token exchange can mint from the same response.
  • Maps 200 → Principal; 401/403/404 → matching error; 5xx, network errors, and malformed payloads fail closed (503/502).
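The status mapping in the last bullet can be summarized as a small pure function. This is a sketch under assumptions: map_upstream_outcome and its parameters are hypothetical names, and the real provider works on an HTTP response object rather than a bare status code. Treating any other unexpected status as 503 follows a commit message in this PR that describes an upstream 400 failing closed with 503.

```python
from typing import Optional


def map_upstream_outcome(status: Optional[int], payload_ok: bool = True) -> int:
    """Return the HTTP status the server would answer with.

    `status` is the upstream status code, or None when the request failed at
    the network level. `payload_ok` says whether a 200 body parsed into a
    valid grant.
    """
    if status is None:
        return 503  # network error: fail closed
    if status == 200:
        return 200 if payload_ok else 502  # malformed grant: fail closed
    if status in (401, 403, 404):
        return status  # pass the upstream decision through
    return 503  # 5xx and any other unexpected status: fail closed
```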

LocalJwtVerifyProvider — hot-path runtime verifier.

  • Reads a Bearer token from Authorization, verifies signature against the runtime secret, checks domain == "runtime", the issuer, expiry, and that the token's scope covers the requested Operation.
  • Returns a Principal with the bound (namespace_key, target_type, target_id) so runtime endpoints inherit the namespace and target binding without re-deriving them.
  • When the dependency surfaces target_type / target_id via context_builder, the provider also enforces that they match the token's binding — runtime endpoints get the request-target check for free.

Runtime token shape

HS256, dedicated secret (AGENT_CONTROL_RUNTIME_TOKEN_SECRET), issuer agent-control/server. Claims:

| Claim | Purpose |
| --- | --- |
| domain | Pinned to runtime; tokens minted here MUST not be accepted on management endpoints. |
| namespace_key | The namespace the token authorizes within. Required for mint and verify; preserved end-to-end so a token minted for org A cannot be used to resolve controls in the default namespace. |
| actor_id | Caller identity surfaced from the upstream grant. |
| scopes | Granted runtime capabilities (e.g., ["runtime.use"]). The exchange endpoint refuses to mint when the upstream's explicit grant omits runtime.use. |
| target_type / target_id | Bind the token to one target. |
| iat / exp | Bounded lifetime. The local TTL is capped by the upstream grant's expires_at so the local token can never outlive its grant. |
| jti | Random identifier; reserved for future revocation. |
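To make the claim set concrete, here is a minimal HS256 mint/verify sketch built only on the standard library so it stays self-contained. It is not the code in runtime_token.py, which may well use a JWT library; the function names mint and verify, their parameters, the iss claim name, and the use of ValueError are assumptions for the example.

```python
import base64
import hashlib
import hmac
import json
import secrets
import time
from typing import List, Optional

ISSUER = "agent-control/server"


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: bytes, signing_input: str) -> str:
    digest = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return _b64url(digest)


def mint(secret: bytes, *, namespace_key: str, actor_id: str, scopes: List[str],
         target_type: str, target_id: str, ttl_seconds: int,
         grant_expires_at: Optional[float] = None,
         now: Optional[float] = None) -> str:
    issued_at = int(time.time() if now is None else now)
    expires_at = issued_at + ttl_seconds
    if grant_expires_at is not None:
        # The local token must never outlive the upstream grant.
        expires_at = min(expires_at, int(grant_expires_at))
    if expires_at <= issued_at:
        raise ValueError("upstream grant already expired")
    claims = {
        "iss": ISSUER, "domain": "runtime", "namespace_key": namespace_key,
        "actor_id": actor_id, "scopes": scopes,
        "target_type": target_type, "target_id": target_id,
        "iat": issued_at, "exp": expires_at, "jti": secrets.token_hex(16),
    }
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{body}"
    return f"{signing_input}.{_sign(secret, signing_input)}"


def verify(secret: bytes, token: str, *, required_scope: str,
           now: Optional[float] = None) -> dict:
    header, body, signature = token.split(".")
    if not hmac.compare_digest(signature, _sign(secret, f"{header}.{body}")):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(body))
    current = time.time() if now is None else now
    if claims.get("domain") != "runtime" or claims.get("iss") != ISSUER:
        raise ValueError("wrong domain or issuer")
    if not claims.get("namespace_key"):
        raise ValueError("missing namespace_key")
    if claims["exp"] <= current:
        raise ValueError("expired")
    if required_scope not in claims.get("scopes", []):
        raise ValueError("scope not granted")
    return claims
```

The signature is checked before any claim is read, and the verifier always recomputes HMAC-SHA256 itself rather than trusting the header's alg field.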

Runtime token exchange endpoint

POST /api/v1/auth/runtime-token-exchange
{ "target_type": "...", "target_id": "..." }
  • Authenticated and authorized via Operation.RUNTIME_TOKEN_EXCHANGE through the default authorizer (typically HttpUpstreamAuthProvider in production). The authorizer's context_builder forwards the requested target to the upstream so it can authorize against the right resource.
  • Refuses with 503 when AGENT_CONTROL_RUNTIME_TOKEN_SECRET is not configured.
  • Mints a local token from Principal.scopes / Principal.grant_expires_at, capped by the configured TTL (default 300s).
  • When the provider's Principal carries a target binding, the endpoint verifies it matches the requested target before minting.

Response: { token, expires_at, target_type, target_id, scopes }.
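The checks the endpoint runs before minting might be ordered roughly like this. It is a sketch, not the handler in endpoints/auth.py: the Principal subset, ExchangeRefused, check_exchange, and the reason strings are hypothetical. The 503 (secret not configured) and 400 (target mismatch) statuses come from this PR's description and test plan; the status for a grant without runtime.use is not stated there, so it is left out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Principal:
    # Only the fields the pre-mint checks need.
    scopes: Tuple[str, ...] = ()
    target_type: Optional[str] = None
    target_id: Optional[str] = None


class ExchangeRefused(Exception):
    """Raised with a short reason code when the exchange must not mint."""


def check_exchange(principal: Principal, requested_type: str,
                   requested_id: str, secret: Optional[bytes]) -> None:
    if not secret:
        # Described as a 503 in the PR.
        raise ExchangeRefused("secret_not_configured")
    if principal.target_type is not None or principal.target_id is not None:
        bound = (principal.target_type, principal.target_id)
        if bound != (requested_type, requested_id):
            # Described as a 400 in the test plan.
            raise ExchangeRefused("target_mismatch")
    if "runtime.use" not in principal.scopes:
        # Never add runtime.use to a grant that does not carry it.
        raise ExchangeRefused("missing_runtime_use")
```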

Migrated endpoints

All seven /api/v1/control-bindings* endpoints now use Depends(require_operation(...)):

| Method | Path | Operation |
| --- | --- | --- |
| PUT | /control-bindings | control_bindings.write |
| GET | /control-bindings | control_bindings.read |
| GET | /control-bindings/{binding_id} | control_bindings.read |
| PATCH | /control-bindings/{binding_id} | control_bindings.write |
| DELETE | /control-bindings/{binding_id} | control_bindings.write |
| PUT | /control-bindings/by-key | control_bindings.write |
| POST | /control-bindings/by-key:delete | control_bindings.write |

New: POST /api/v1/auth/runtime-token-exchange (operation runtime.token_exchange).

Env vars

| Var | Default | Purpose |
| --- | --- | --- |
| AGENT_CONTROL_AUTH_MODE | header | Default authorizer: header or http_upstream. |
| AGENT_CONTROL_AUTH_UPSTREAM_URL | | Required when mode is http_upstream. |
| AGENT_CONTROL_AUTH_UPSTREAM_TIMEOUT_SECONDS | 5.0 | Per-request timeout. |
| AGENT_CONTROL_AUTH_UPSTREAM_SERVICE_TOKEN | | Optional upstream service token. |
| AGENT_CONTROL_AUTH_UPSTREAM_SERVICE_TOKEN_HEADER | X-Agent-Control-Service-Token | Header name for the service token. |
| AGENT_CONTROL_RUNTIME_TOKEN_SECRET | | Required to enable runtime auth + the exchange endpoint. |
| AGENT_CONTROL_RUNTIME_TOKEN_TTL_SECONDS | 300 | Local token TTL ceiling (capped further by the upstream grant). |

Out of scope (follow-ups)

  • Migrate /controls CRUD onto require_operation using the reserved CONTROLS_* operations.
  • Wire Operation.RUNTIME_USE on the runtime resolution path (/evaluation, etc.) and the SDK side of the runtime exchange. The provider override is already in place when the runtime secret is configured. With PR #203's merged-resolver contract on /evaluation, the JWT-verified target binding now narrows the effective set the resolver returns; the verifier's match check is load-bearing for correctness, not just for authorization.
  • Migrate /agents/initAgent onto require_operation. The HttpUpstreamAuthProvider's context_builder should forward the request's target_type / target_id (added in PR #203) to the upstream so the upstream can authorize against the requested resource.
  • Thread namespace resolution through the rest of the API so the namespace header lookup in HeaderAuthProvider can be turned on safely.
  • Drop auth.py's require_admin_key once every management endpoint is migrated.

Stacking

Stacked on PR #203 (abhi/data-model-v1); rebased onto its current head 8f806a3 so the merged effective-controls contract (target bindings unioned with direct + policy controls, namespace_key threaded through every join) is the base this PR builds on. GET /control-bindings/effective is gone in #203, so the migration of that route went away with it; the seven surviving binding endpoints are migrated as before. Will rebase onto main once #203 merges.

Test plan

  • 51 framework + endpoint tests:
    • Default coverage: every Operation member has a default access mapping (regression guard).
    • HeaderAuthProvider: PUBLIC bypass, AUTHENTICATED + ADMIN paths route to the legacy validator with the right require_admin flag, no-auth mode passes admin operations, namespace-header lookup currently inert, unknown operation raises.
    • HttpUpstreamAuthProvider: 200 happy path with realistic JSON wire shapes (ISO datetime + JSON array scopes round-trip), service token forwarding, 401/403/404 mapping, 5xx fail-closed, network-error fail-closed, strict-grant rejection on wrong-typed is_admin / malformed scopes / bad expires_at / non-string target fields, partial target grant (target_type only or target_id only) rejected, naive expires_at rejected (no tz info → fail-closed 502 at the parser instead of TypeError later in the mint path).
    • require_operation factory: routes through the installed authorizer, per-operation overrides take precedence, clearing an override falls back to the default, get_authorizer raises when nothing is set.
    • Lifecycle: reconfiguring without the runtime secret drops the previous LocalJwtVerifyProvider override; teardown clears every authorizer.
    • Runtime token mint / verify: round-trip, wrong-secret rejection, expiry rejection, TTL capped by upstream grant, management-domain token refused on runtime verify, missing-namespace rejection, already-expired upstream grant raises UpstreamGrantExpiredError instead of minting a token with an exp in the past (also covers the boundary case where expires_at == issued_at).
    • LocalJwtVerifyProvider: target-bound Principal, namespace carried from token, missing token → 401, wrong scope → 403, non-Bearer header → 401, target-context match enforcement (mismatch on type or id → 403).
    • Exchange endpoint: 503 without secret, mint when configured, target mismatch rejected (400), missing target rejected (422), grant-without-runtime-use rejected (no privilege escalation), target context forwarded to authorizer, non-default namespace propagates into the token, full exchange-then-verify round trip, already-expired upstream grant surfaces as 502 (distinct from the 503 misconfigured-server path) so the public status reflects which side the operator should investigate.
  • Full server suite: 672 passed (was 621 on the PR #203 head; +51 from new tests).
  • make lint clean.
  • make typecheck clean.
  • make sdk-ts-generate-check clean.
  • TS SDK regenerated alongside the new endpoint (auth-runtime-token-exchange, request/response models).


codecov Bot commented Apr 28, 2026

@abhinav-galileo abhinav-galileo changed the title feat(server): pluggable request-auth framework + migrate control bindings feat(server): pluggable request-auth framework (management + runtime) Apr 28, 2026
@abhinav-galileo abhinav-galileo marked this pull request as ready for review April 28, 2026 21:46
@abhinav-galileo abhinav-galileo force-pushed the abhi/management-auth-framework branch from b87b27f to 8ecb871 Compare April 29, 2026 18:56
)

actor_id = principal.caller_id or "anonymous"
if principal.scopes:
Contributor


Reject explicit empty upstream scopes

When the HTTP upstream returns an explicit empty scopes array, _UpstreamGrant becomes principal.scopes == (), so this falsey check falls into the local default and mints a token with runtime.use. That is the privilege escalation the comment is trying to avoid when an upstream grant omits runtime.use; the exchange needs to distinguish an unscoped local provider from an explicit upstream grant with no scopes before defaulting.
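The failure mode is easy to reproduce in isolation. In this sketch the function names and the use of None as the "no grant at all" marker are assumptions; they show one way to separate the two cases, not how the PR resolves it (a later commit on this branch removes the fallback entirely).

```python
from typing import Optional, Tuple

DEFAULT_SCOPES = ("runtime.use",)


def scopes_by_truthiness(scopes: Optional[Tuple[str, ...]]) -> Tuple[str, ...]:
    # Buggy: an explicit empty grant () is falsey, so it receives the default.
    return scopes if scopes else DEFAULT_SCOPES


def scopes_by_presence(scopes: Optional[Tuple[str, ...]]) -> Tuple[str, ...]:
    # Only a provider that returned no grant at all (None) receives the default.
    return DEFAULT_SCOPES if scopes is None else scopes
```

An upstream that answers with an empty scopes array gets runtime.use from the first function and nothing from the second.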

dependencies=[Depends(require_api_key)],
)
app.include_router(
# The auth framework on each endpoint owns authentication and
Contributor


Mounting these framework-protected routers without any FastAPI Security dependency removes the APIKeyHeader requirement from generated OpenAPI; require_operation only accepts Request, so make openapi-spec emits no security entry for /api/v1/control-bindings and the auth exchange route below. API docs and downstream generators will treat these protected operations as unauthenticated even though runtime still requires credentials.

return (this._agents ??= new Agents(this._options));
}

private _auth?: Auth;
Contributor


The new Auth group is only added to the generated AgentControlSDK, but the package export uses AgentControlClient from src/index.ts, and that wrapper still exposes only the existing groups. Consumers importing agent-control cannot call runtimeTokenExchange through the public client even though this generated getter exists; add the matching wrapper getter/type export.

abhinav-galileo added a commit that referenced this pull request Apr 29, 2026
…g endpoints

The seven /control-bindings endpoints were migrated onto require_operation
in #204, but none supplied a context_builder. Upstream authorizers that
resolve the target's owning project (e.g., Galileo's
check_management_access) need (target_type, target_id) to make a
project-level decision; without them the upstream returns 400 and the
provider fails closed with 503.

Two builders, four endpoints wired:

- _binding_body_context — reads target_type/target_id from the request
  body. Wired on PUT "", PUT "/by-key", POST "/by-key:delete".
- _binding_list_context — reads target_type/target_id from query params
  when the GET list endpoint is target-scoped. Wired on GET "".

The header provider's behavior is unchanged because it ignores context.
Validated end-to-end against the live api PR #6350 + authz PR #145
stack: GET with target filter, PUT with owned target, foreign-target
404, no-auth 401 all behave correctly.

Out of scope (separate follow-up): the binding_id-based endpoints
(GET/PATCH/DELETE /{binding_id}) need a 2-phase auth — look up the
binding by namespace+id to discover its target, then auth-check with
target context. That's a deeper change to the require_operation contract
and is tracked separately.
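A context builder of the kind described in this commit message could look roughly like the sketch below. The signatures are hypothetical: the real builders presumably receive the FastAPI request, whereas this version takes a plain mapping of query parameters so it runs standalone. Returning no context when either field is absent is also an assumption. The {operation, context?} payload shape is the one the PR description gives for HttpUpstreamAuthProvider.

```python
from typing import Mapping, Optional


def binding_list_context(query: Mapping[str, str]) -> Optional[dict]:
    """Build authorizer context for a target-scoped list request."""
    target_type = query.get("target_type")
    target_id = query.get("target_id")
    if target_type is None or target_id is None:
        return None  # not target-scoped: nothing to forward
    return {"target_type": target_type, "target_id": target_id}


def upstream_payload(operation: str, context: Optional[dict]) -> dict:
    """Assemble the body forwarded upstream; context is optional."""
    payload: dict = {"operation": operation}
    if context is not None:
        payload["context"] = context
    return payload
```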
@abhinav-galileo abhinav-galileo force-pushed the abhi/management-auth-framework branch from 70c8229 to e5f9654 Compare April 29, 2026 22:42
@abhinav-galileo abhinav-galileo force-pushed the abhi/management-auth-framework branch from e5f9654 to 84db093 Compare April 29, 2026 23:14
Endpoints declare a generic Operation; an installed RequestAuthorizer
decides whether the request is allowed and returns the resolved
Principal (namespace + admin flag + caller id). Two providers ship
in-tree:

- HeaderAuthProvider: OSS / single-namespace default. Maps each
  Operation to one of three access levels (PUBLIC / AUTHENTICATED /
  ADMIN) and reuses the legacy local credential check; behavior matches
  the previous require_admin_key path verbatim. V1 ignores the
  X-Namespace-Key header and always returns the default namespace
  because non-binding write endpoints still hardcode it; the branch is
  preserved for a follow-up that lifts the lock.
- HttpUpstreamAuthProvider: forwards caller credentials to a
  configurable upstream URL. Maps 401/403/404 directly; fail-closed
  (503) on 5xx and network errors; rejects malformed principals (502).

Control-binding endpoints now declare CONTROL_BINDINGS_READ /
CONTROL_BINDINGS_WRITE via require_operation(...) and read the
resolved namespace from the returned Principal. The router is mounted
without the legacy router-level gate so the framework owns
authentication and authorization end-to-end.

Reserved Operation members for controls.* and runtime.use are defined
but not yet wired; their migrations land in follow-up PRs.
Rename so the framework's vocabulary is factual:

- OssAccessLevel -> AccessLevel
- OSS_OPERATION_ACCESS -> DEFAULT_OPERATION_ACCESS
- Comments / docstrings: replace "OSS / single-namespace" framing with
  factual descriptions of the local-credential path.

Drop the unjustified MANAGEMENT_ prefix on environment variables;
this PR only configures one auth flow:

- AGENT_CONTROL_MANAGEMENT_AUTH_MODE -> AGENT_CONTROL_AUTH_MODE
- AGENT_CONTROL_MANAGEMENT_AUTH_UPSTREAM_URL -> AGENT_CONTROL_AUTH_UPSTREAM_URL
- AGENT_CONTROL_MANAGEMENT_AUTH_UPSTREAM_TIMEOUT_SECONDS -> AGENT_CONTROL_AUTH_UPSTREAM_TIMEOUT_SECONDS
- AGENT_CONTROL_MANAGEMENT_AUTH_UPSTREAM_SERVICE_TOKEN -> AGENT_CONTROL_AUTH_UPSTREAM_SERVICE_TOKEN
- AGENT_CONTROL_MANAGEMENT_AUTH_UPSTREAM_SERVICE_TOKEN_HEADER -> AGENT_CONTROL_AUTH_UPSTREAM_SERVICE_TOKEN_HEADER

Add a regression test for the no-auth flow: when api_key_enabled is
False, even admin operations succeed with a non-admin Principal,
matching the pre-framework local-auth behavior.
Completes the framework's auth coverage. Management and runtime are
genuinely different protocols, and they now route through different
authorizers via the per-operation registry:

- Per-operation override on the registry. set_authorizer(authorizer,
  operation=...) overrides the default for one operation; calls
  without operation= become the default for everything else. Used to
  point Operation.RUNTIME_USE at LocalJwtVerifyProvider while leaving
  the default authorizer (header or http_upstream) for management.

- Runtime token mint/verify. HS256 JWT, dedicated secret
  (AGENT_CONTROL_RUNTIME_TOKEN_SECRET), short TTL capped by the
  upstream grant's expiry. domain="runtime" claim pins the token to
  the runtime path. Issuer is agent-control/server.

- LocalJwtVerifyProvider verifies the Bearer token, checks the scope
  covers the requested Operation, and returns a Principal with the
  bound (target_type, target_id) so endpoints can match the request
  target.

- POST /api/v1/auth/runtime-token-exchange. Authenticates via the
  default authorizer (typically HttpUpstreamAuthProvider in
  production, which forwards the credential to the configured
  upstream) and mints a local runtime token from the resulting
  Principal. Refuses with 503 when the runtime secret is not
  configured.

- Principal grew target_type, target_id, scopes, grant_expires_at
  fields so providers can surface the upstream grant's binding and
  the exchange endpoint can mint a token from it. HttpUpstreamAuthProvider
  parses the matching optional fields from the upstream JSON response.

- Configuration: AGENT_CONTROL_AUTH_* configures the default authorizer;
  AGENT_CONTROL_RUNTIME_TOKEN_SECRET (+ optional
  AGENT_CONTROL_RUNTIME_TOKEN_TTL_SECONDS) enables the runtime override.
  Without the secret, runtime endpoints fall through to the default
  authorizer.

Tests: 18 new unit + integration tests covering the registry overrides,
token round-trip / wrong-secret / expired / wrong-domain rejection,
JWT-verify provider behavior (target binding, missing token, wrong
scope, non-Bearer header), and the exchange endpoint (503 without
secret, mint when configured, target mismatch, missing target,
context forwarded to authorizer, full exchange-then-verify round trip).

The TypeScript SDK regenerates with the new endpoint surface
(runtime-token-exchange) — committed alongside.
…es/grant

Five hardening changes prompted by review:

- Runtime tokens carry namespace_key. mint_runtime_token now requires
  it; the JWT payload includes it; verify_runtime_token rejects tokens
  without it; LocalJwtVerifyProvider returns the token's namespace on
  the resulting Principal instead of always defaulting. Otherwise a
  token minted for org A would resolve runtime controls in the default
  namespace once /evaluation is wired to RUNTIME_USE.

- Exchange endpoint refuses to add runtime.use to a grant that omits
  it. If the upstream returned an explicit scope set without
  runtime.use, the credential is not authorized for runtime use on
  this target — minting one anyway would be privilege escalation.
  Defaulting to runtime.use is preserved only when the provider
  returned no scoped grant (e.g., local header path).

- HttpUpstreamAuthProvider parses the upstream response with a strict
  Pydantic model (strict=True). Wrong-typed is_admin, malformed
  scopes, bad expires_at, and non-string target fields fail closed
  with 502 instead of being silently coerced or dropped. Unknown
  fields are still tolerated so the upstream can evolve.

- LocalJwtVerifyProvider enforces target context match when the
  dependency surfaces it. Future runtime endpoints can declare a
  context_builder that extracts target_type/target_id from the
  request; the provider verifies the token's binding matches and
  rejects with 403 otherwise.

- Auth provider lifecycle. configure_auth_from_env tracks installed
  providers; teardown_auth (called from FastAPI lifespan shutdown)
  closes any aclose-able providers — releases the
  HttpUpstreamAuthProvider's owned httpx.AsyncClient.

Tests: nine new cases covering token-namespace round-trip, target
context mismatch on type and id, strict grant rejection across each
malformed field, the privilege-escalation guard, and a full
non-default-namespace round trip through the exchange endpoint.
… on reconfigure

Two follow-up fixes from review:

- HttpUpstreamAuthProvider validates against the raw response bytes via
  _UpstreamGrant.model_validate_json instead of round-tripping through
  response.json() and model_validate. Pydantic's JSON parser accepts
  ISO datetimes and JSON arrays (the actual wire shapes any HTTP
  service produces) while strict=True still rejects type-coercion
  bugs like "false" -> True or non-string entries in scopes. Adds a
  regression test that pins the JSON wire shape: ISO expires_at +
  array scopes now round-trip correctly.

- configure_auth_from_env clears any prior default and operation
  overrides before installing fresh ones; teardown_auth clears them
  too. Without this, removing the runtime token secret between two
  configure calls left the previous LocalJwtVerifyProvider override
  installed on Operation.RUNTIME_USE — silent inconsistency where the
  config path said runtime should fall through but the registry
  disagreed. Adds a regression test that exercises the full
  configure-then-reconfigure path.
A target binding is only meaningful as a (target_type, target_id)
pair. The previous schema allowed each field independently, so a
malformed grant carrying only target_type would pass type validation
and the exchange endpoint's per-field equality check would fall
through (the upstream's None never trips the != against the request
body), letting the endpoint mint a token bound to whatever target_id
the request asked for.

Add a model validator on _UpstreamGrant that fails closed when exactly
one of the two fields is set; both supplied or both omitted is the
only acceptable shape. Pydantic's ValidationError surfaces as 502 like
every other malformed-grant case.

Tests cover both half-supplied shapes (target_type only and target_id
only). Also drop two stale comments referring to upstream-specific
implementation choices that bled in earlier — the framework is
generic.
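The both-or-neither rule can be illustrated without Pydantic. This sketch uses a frozen dataclass with a hypothetical name (TargetGrant) and raises ValueError; the PR itself describes a model validator on _UpstreamGrant whose ValidationError surfaces as a 502.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TargetGrant:
    target_type: Optional[str] = None
    target_id: Optional[str] = None

    def __post_init__(self) -> None:
        # Exactly one field set is the malformed shape; both or neither is fine.
        if (self.target_type is None) != (self.target_id is None):
            raise ValueError("target_type and target_id must be supplied together")
```

Comparing the two "is None" results catches both half-supplied shapes with a single condition.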
Two distinct timing-related fail-closed gaps:

1. Pydantic with strict=True still accepts a naive ISO datetime for the
   upstream's expires_at because strict only enforces types, not tz.
   Comparing the resulting naive datetime against datetime.now(UTC) at
   mint time raises TypeError and surfaces as a 500. Add a field
   validator on _UpstreamGrant.expires_at that rejects naive datetimes,
   so a malformed grant fails closed with a 502 alongside the rest of
   the strict-grant rejections.

2. mint_runtime_token would happily mint when upstream_expires_at <=
   issued_at, returning a 200 with an exp claim already in the past.
   Introduce UpstreamGrantExpiredError(RuntimeTokenError) and raise it
   in that case. The exchange endpoint maps this distinct error class
   to a 502 (upstream returned bad data) rather than the existing 503
   (server misconfigured), so the public status reflects which side
   the operator should investigate.

Tests:

- _UpstreamGrant rejects naive expires_at -> 502 (parser fail-closed).
- mint_runtime_token raises UpstreamGrantExpiredError when the grant is
  already past or exactly at issued_at.
- Exchange endpoint surfaces the expired grant as 502 (vs 503 for the
  misconfigured-server path).
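Both gaps can be shown with the standard library alone. The helper names below (require_aware, capped_expiry) and the use of ValueError are assumptions for the sketch; the PR implements the first check as a field validator on _UpstreamGrant.expires_at and the second by raising UpstreamGrantExpiredError from mint_runtime_token.

```python
from datetime import datetime, timedelta


def require_aware(value: datetime) -> datetime:
    """Reject datetimes that carry no usable UTC offset."""
    if value.tzinfo is None or value.utcoffset() is None:
        raise ValueError("expires_at must include a UTC offset")
    return value


def capped_expiry(issued_at: datetime, ttl: timedelta,
                  grant_expires_at: datetime) -> datetime:
    """Return the token expiry, never later than the upstream grant."""
    grant_expires_at = require_aware(grant_expires_at)
    if grant_expires_at <= issued_at:
        # Covers the boundary case where the grant expires exactly at issue time.
        raise ValueError("upstream grant already expired")
    return min(issued_at + ttl, grant_expires_at)
```

Without the awareness check, the `<=` comparison between a naive and an aware datetime raises TypeError, which is the unhandled error the first fix turns into a controlled rejection.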
…g endpoints

The seven /control-bindings endpoints were migrated onto require_operation
in #204, but none supplied a context_builder. Upstream authorizers that
resolve the target's owning project (e.g., Galileo's
check_management_access) need (target_type, target_id) to make a
project-level decision; without them the upstream returns 400 and the
provider fails closed with 503.

Two builders, four endpoints wired:

- _binding_body_context — reads target_type/target_id from the request
  body. Wired on PUT "", PUT "/by-key", POST "/by-key:delete".
- _binding_list_context — reads target_type/target_id from query params
  when the GET list endpoint is target-scoped. Wired on GET "".

The header provider's behavior is unchanged because it ignores context.
Validated end-to-end against the live api PR #6350 + authz PR #145
stack: GET with target filter, PUT with owned target, foreign-target
404, no-auth 401 all behave correctly.

Out of scope (separate follow-up): the binding_id-based endpoints
(GET/PATCH/DELETE /{binding_id}) need a 2-phase auth — look up the
binding by namespace+id to discover its target, then auth-check with
target context. That's a deeper change to the require_operation contract
and is tracked separately.
… startup, advertise APIKeyHeader

Five review issues against the auth framework:

1. Empty upstream scopes: the exchange endpoint previously fell back to
   minting a runtime.use token whenever principal.scopes was falsey,
   which is the same shape an upstream produces by returning an explicit
   ``"scopes": []``. The fallback is removed; the endpoint now requires
   runtime.use to be present in principal.scopes for every provider.
   HeaderAuthProvider explicitly grants runtime.use only when authorizing
   Operation.RUNTIME_TOKEN_EXCHANGE, so the local path keeps its V1
   behavior while upstream privilege escalation is closed off.

2. Runtime config consolidation: AGENT_CONTROL_RUNTIME_TOKEN_SECRET and
   the TTL are now parsed once at startup into a frozen RuntimeAuthConfig
   that the mint side and the LocalJwtVerifyProvider verify side both
   read. configure_auth_from_env raises at startup on misconfiguration
   instead of producing a runtime 500 from an invalid TTL or a too-short
   secret.

3. Runtime token secret strength: HS256 needs >= 32 bytes of secret
   material; values shorter than that are rejected at startup.

4. RUNTIME_USE fallback warning: when no runtime secret is configured
   the LocalJwtVerifyProvider override is not installed (V1 behavior
   unchanged), but the startup log now warns that RUNTIME_USE will fall
   through to the default authorizer, giving operators a clear signal
   to either configure the secret or accept the long-lived-credential
   trust model.

5. OpenAPI security entries: the framework-protected routers
   (/control-bindings, /auth) are now mounted with the existing
   non-validating get_api_key_from_header Security extractor as a
   router-level dependency. require_operation still owns runtime
   authentication and authorization; the Security dependency exists
   purely so the generated OpenAPI spec advertises X-API-Key on these
   routes for downstream SDK generation. Confirmed: server/.generated/
   openapi.json now lists ``security: [{APIKeyHeader: []}]`` on every
   framework-protected operation.
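
Items 2 and 3 amount to fail-fast parsing at startup. The following is a minimal sketch of that idea, not the code in this PR: the helper name `load_runtime_auth_config` and the 300-second default TTL are assumptions for illustration. Only the two environment variable names, the `RuntimeAuthConfig(secret=..., ttl_seconds=...)` shape, and the 32-byte floor come from the text above.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

MIN_SECRET_BYTES = 32        # HS256 floor stated in item 3
DEFAULT_TTL_SECONDS = 300    # assumption: the real default is not shown in this PR text


@dataclass(frozen=True)
class RuntimeAuthConfig:
    """Parsed once at startup; read by both the mint side and the verify side."""
    secret: str
    ttl_seconds: int


def load_runtime_auth_config(env: Mapping[str, str]) -> Optional[RuntimeAuthConfig]:
    """Hypothetical helper: return None when runtime auth is not configured,
    and raise ValueError at startup on any misconfiguration."""
    secret = env.get("AGENT_CONTROL_RUNTIME_TOKEN_SECRET")
    if not secret:
        return None  # no secret: the local JWT verify override is not installed
    if len(secret.encode("utf-8")) < MIN_SECRET_BYTES:
        raise ValueError(
            f"runtime token secret must be at least {MIN_SECRET_BYTES} bytes"
        )
    raw_ttl = env.get(
        "AGENT_CONTROL_RUNTIME_TOKEN_TTL_SECONDS", str(DEFAULT_TTL_SECONDS)
    )
    try:
        ttl_seconds = int(raw_ttl)
    except ValueError as exc:
        raise ValueError(f"runtime token TTL must be an integer, got {raw_ttl!r}") from exc
    if ttl_seconds <= 0:
        raise ValueError("runtime token TTL must be positive")
    return RuntimeAuthConfig(secret=secret, ttl_seconds=ttl_seconds)
```

Because the dataclass is frozen, neither side can mutate the secret or TTL after startup, which is what keeps mint and verify reading the same values.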

The TypeScript wrapper AgentControlClient is also extended with an
``auth`` getter so the runtimeTokenExchange method generated under the
Auth group is reachable through the public client.

A new fixture (``runtime_config_enabled``) replaces the previous
os.environ patching in test_runtime_token_exchange_endpoint.py so tests
exercise the same config singleton production uses; one new test pins
the empty-scope rejection.
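
The empty-scope rule from item 1 can be illustrated with a small sketch. The `Principal` stand-in, `ScopeDenied`, and `ensure_runtime_use` below are hypothetical names for illustration only; the real `Principal` lives in `core.py` and may carry more fields.

```python
from dataclasses import dataclass
from typing import Tuple

RUNTIME_USE = "runtime.use"


@dataclass(frozen=True)
class Principal:
    """Simplified stand-in for the framework's Principal (assumption)."""
    scopes: Tuple[str, ...] = ()


class ScopeDenied(Exception):
    """Hypothetical error raised when the exchange must be refused."""


def ensure_runtime_use(principal: Principal) -> None:
    # No fallback: an empty scope list is treated exactly like a scope list
    # that lacks runtime.use, so an upstream returning "scopes": [] is refused.
    if RUNTIME_USE not in principal.scopes:
        raise ScopeDenied("principal lacks the runtime.use scope")
```

The key design point is the absence of a falsy-scopes branch: the check is a plain membership test, so every provider has to grant `runtime.use` explicitly.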
…ding routes as namespace-wide

Two review issues:

1. ``mint_runtime_token`` now rejects a naive ``upstream_expires_at``
   with ``RuntimeTokenError`` instead of letting the comparison against
   ``datetime.now(UTC)`` raise a raw ``TypeError`` (which surfaces as a
   500). The HTTP-upstream parser already rejects timezone-less
   ``expires_at`` on the wire, but custom authorizers and tests can
   still call the helper directly; the lower-level API is now
   self-contained.

2. The three binding-id-based routes (GET/PATCH/DELETE
   ``/control-bindings/{binding_id}``) are documented as namespace-wide
   in the OpenAPI summary and docstrings. Per-target authorization is
   not possible on these routes today because ``require_operation`` is
   single-pass and the target identifiers are only discoverable after
   the binding row is loaded. Clients whose authorization model needs
   per-target permissions are explicitly steered to the natural-key
   endpoints (``PUT /by-key``, ``POST /by-key:delete``) and the
   target-filtered list, all of which forward
   ``(target_type, target_id)`` to the authorizer. Two-phase auth for
   the by-id routes is tracked as a separate follow-up.

Also: TypeScript SDK regenerated to pick up the new endpoint summaries.
…ten tzinfo guard

Two review issues:

1. Binding endpoints previously used ``principal.namespace_key`` for
   the row's storage namespace. With HeaderAuthProvider this was always
   the default namespace, so the V1 contract held; with
   HttpUpstreamAuthProvider returning an org-scoped namespace, binding
   writes would land in that namespace while initAgent / GET
   /agents/{name}/controls / /evaluation still resolved through
   ``get_namespace_key`` (V1 default), making target-bound controls
   invisible to runtime resolution. The seven binding endpoints now
   read storage namespace from ``get_namespace_key`` so writes and
   reads stay in lockstep until auth-derived namespace resolution
   lands across every endpoint. The auth chain still runs via
   ``require_operation`` for authentication and authorization; the
   resolved Principal is no longer used to pick the storage namespace.

2. The ``mint_runtime_token`` tzinfo guard now also checks
   ``utcoffset() is None`` so a custom ``tzinfo`` subclass that returns
   None from ``utcoffset()`` is rejected at the helper boundary
   instead of raising a raw ``TypeError`` from the comparison below.
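
A sketch of the guard described in item 2, assuming a hypothetical `effective_expiry` helper in place of the real `mint_runtime_token` internals. It shows why both checks are needed: a `tzinfo` subclass can be attached to a datetime and still report no offset, in which case Python treats the value as naive.

```python
from datetime import datetime, timedelta
from typing import Optional


class RuntimeTokenError(Exception):
    """Stand-in for the framework's error type."""


def effective_expiry(
    now: datetime,
    ttl_seconds: int,
    upstream_expires_at: Optional[datetime] = None,
) -> datetime:
    """Hypothetical helper: cap the token expiry at the upstream expiry.

    `now` is assumed to be timezone-aware (e.g. datetime.now(timezone.utc)).
    """
    expires_at = now + timedelta(seconds=ttl_seconds)
    if upstream_expires_at is None:
        return expires_at
    # A datetime is only aware when tzinfo is set AND utcoffset() is not None.
    if upstream_expires_at.tzinfo is None or upstream_expires_at.utcoffset() is None:
        raise RuntimeTokenError("upstream_expires_at must be timezone-aware")
    # Safe: both operands are aware, so the comparison cannot raise TypeError.
    return min(expires_at, upstream_expires_at)
```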

TypeScript SDK regenerated to pick up the binding-endpoint docstring
updates.
@abhinav-galileo abhinav-galileo force-pushed the abhi/management-auth-framework branch from 84db093 to 7698c07 Compare April 29, 2026 23:31
return RuntimeAuthConfig(secret=secret, ttl_seconds=_load_runtime_ttl_seconds())


def _load_runtime_ttl_seconds() -> int:

P1 — Major
No upper cap on AGENT_CONTROL_RUNTIME_TOKEN_TTL_SECONDS

_load_runtime_ttl_seconds validates > 0 but sets no maximum. A misconfigured AGENT_CONTROL_RUNTIME_TOKEN_TTL_SECONDS=999999999 or a copy-paste accident mints tokens valid for decades; the point of short-lived tokens is defeated. Enforce a sane cap at startup (e.g., 86400 s = 1 day). The upstream_expires_at ceiling in mint_runtime_token only helps when the upstream surfaces an expiry.
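
One possible shape for the cap, as a sketch only: the function below takes the environment as a parameter for testability, and the 300-second default is an assumption. The 86400-second ceiling is the value suggested in this comment, not a value from the PR.

```python
from typing import Mapping

TTL_ENV_VAR = "AGENT_CONTROL_RUNTIME_TOKEN_TTL_SECONDS"
DEFAULT_TTL_SECONDS = 300   # assumption: the real default is not shown here
MAX_TTL_SECONDS = 86_400    # suggested cap: one day


def load_runtime_ttl_seconds(env: Mapping[str, str]) -> int:
    """Parse the TTL and reject values outside (0, MAX_TTL_SECONDS] at startup."""
    raw = env.get(TTL_ENV_VAR)
    if raw is None or raw.strip() == "":
        return DEFAULT_TTL_SECONDS
    try:
        ttl = int(raw)
    except ValueError as exc:
        raise ValueError(f"{TTL_ENV_VAR} must be an integer, got {raw!r}") from exc
    if not 0 < ttl <= MAX_TTL_SECONDS:
        raise ValueError(
            f"{TTL_ENV_VAR} must be between 1 and {MAX_TTL_SECONDS}, got {ttl}"
        )
    return ttl
```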

resource="Resource",
hint="Verify the resource exists in the requested namespace.",
)
# Fail closed on 5xx and unexpected statuses.

HttpUpstreamAuthProvider silently maps all non-200/401/403/404 to 503

A 400 (bad request in the auth call), a 422, or a 429 (upstream rate-limited) from the upstream all become a 503 with the message "Authorization service returned an unexpected response". Rate-limit errors in particular are hidden from the operator. At minimum, a 429 should map to a distinct error or be surfaced as a hint in the 503 body.
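
A sketch of what a more informative mapping could look like. `map_upstream_failure` is a hypothetical function, not the provider's actual code, and the hint strings are illustrative. It keeps the fail-closed 503 for everything unexpected but records the upstream status and singles out 429.

```python
from typing import Optional, Tuple

# Statuses the provider already translates one-to-one, per the title above.
PASS_THROUGH = frozenset({401, 403, 404})


def map_upstream_failure(upstream_status: int) -> Tuple[int, Optional[str]]:
    """Return (response_status, hint) for a non-200 upstream auth response."""
    if upstream_status in PASS_THROUGH:
        return upstream_status, None
    if upstream_status == 429:
        # Still fail closed, but tell the operator why.
        return 503, "Authorization service is rate-limiting requests (upstream 429)."
    return 503, (
        "Authorization service returned an unexpected response "
        f"(upstream {upstream_status})."
    )
```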
