Skip to content

Cache /.well-known/databricks-config lookups in the CLI#5011

Draft
simonfaltum wants to merge 20 commits intomainfrom
simonfaltum/hostmetadata-cache
Draft

Cache /.well-known/databricks-config lookups in the CLI#5011
simonfaltum wants to merge 20 commits intomainfrom
simonfaltum/hostmetadata-cache

Conversation

@simonfaltum
Copy link
Copy Markdown
Member

@simonfaltum simonfaltum commented Apr 17, 2026

Why

Every CLI command (databricks auth profiles, bundle validate, every workspace or account call) goes through Config.EnsureResolved, which triggers an unauthenticated GET to {host}/.well-known/databricks-config to populate host metadata. That round trip is ~700ms against production and gets paid on every invocation, doubling the latency of otherwise single-request commands.

Changes

Before: every CLI invocation hits the well-known endpoint once (or more when multiple configs get constructed).
Now: the first invocation populates a local disk cache under ~/.cache/databricks/<version>/host-metadata/; subsequent invocations read from it. Failures are negatively cached for 60s (except for context.Canceled / context.DeadlineExceeded, which are transient and never cached).

  • New libs/hostmetadata package. NewResolver(fetch) is the primary API — takes an injected fetch function and returns a config.HostMetadataResolver. Attach(cfg) is a one-line convenience that wires cfg.DefaultHostMetadataResolver() as the fetch. SDK API from Add HostMetadataResolver for customizable host metadata fetching databricks-sdk-go#1572, shipped in v0.127.0 which is already bumped on main.
  • Positive cache wraps the miss path — the hit path is a single disk read. Negative cache is only consulted when positive misses.
  • Attach wired at every non-allowlisted *config.Config construction site: cmd/root/auth.go (4 sites), bundle/config/workspace.go, cmd/api/api.go, cmd/auth/{env,login,profiles}.go (3 sites across 2 files), cmd/labs/project/entrypoint.go, libs/auth/arguments.go.
  • Test-cleanup env now pins DATABRICKS_CACHE_DIR to a temp dir so tests don't leak cache files into HOME.
  • Injection guardrail test at libs/hostmetadata/injection_guardrail_test.go walks the tree and flags any new config.Config{ construction site that lacks a nearby hostmetadata.Attach call (allowlist for the handful of legitimately cfg-less-resolve sites).

Collateral cleanups

  • libs/cache/file_cache.go: drop the failed to stat cache file debug log when the file is simply missing (fs.ErrNotExist). It was pure noise (the next line, cache miss, computing, conveys the same info) and its OS-specific error text diverged between Unix (no such file or directory) and Windows (The system cannot find the file specified.), breaking cross-platform acceptance goldens. Genuine stat failures (permission, corruption) still log.
  • libs/testdiff/replacement.go: devVersionRegex now accepts either +SHA or -SHA after 0.0.0-dev. build.GetSanitizedVersion() swaps + to - for filesystem safety when the version is used in cache paths, and the old regex only covered the + form.

Test plan

  • make checks clean
  • make lint clean (0 issues)
  • go test ./libs/hostmetadata/... -race — 7 tests (smoke + cache hit + fetch error + cancellation-not-cached + host isolation + end-to-end integration + injection guardrail), all unit tests use an injected mock fetch so no httptest.Server required
  • go test ./libs/cache/... -race clean
  • go test ./cmd/root/... -race clean
  • go test ./bundle/config/... -race clean
  • End-to-end acceptance test acceptance/auth/host-metadata-cache/ asserts exactly ONE /.well-known/databricks-config GET across two auth profiles invocations sharing a DATABRICKS_CACHE_DIR
  • Existing acceptance tests regenerated: fewer well-known GETs in out.requests.txt (caching works), new [Local Cache] debug lines in cache/telemetry tests, two Warn: Failed to resolve host metadata lines removed (intentional: the resolver returns (nil, nil) on fetch errors, which is how the SDK interprets "no metadata available"), stat-not-found lines removed (see Collateral cleanups)

Live validation against dogfood

Built locally (go build -o /tmp/databricks-cache-test .) and ran databricks -p e2-dogfood current-user me with and without a warm cache:

Scenario Elapsed well-known time Cache log output
Cold cache (fresh DATABRICKS_CACHE_DIR) ~713ms fetch cache miss, computingGET /.well-known/databricks-configcomputed and stored result
Warm cache (second invocation) ~1ms single [Local Cache] cache hit line

Net per-command savings: ~700ms, matching the Why. Cache dir after one auth profiles run contained five JSON files (one per host in .databrickscfg). Inspecting one:

{"oidc_endpoint":"https://db-deco-test.databricks.com/oidc/accounts/{account_id}","account_id":"...","workspace_id":"","cloud":"","host_type":"UNIFIED_HOST","token_federation_default_oidc_audiences":["..."]}

Verifies that two CLI invocations sharing DATABRICKS_CACHE_DIR produce only
one /.well-known/databricks-config GET: the first populates the on-disk
cache, the second reads from it.

Co-authored-by: Isaac
Cached /.well-known/databricks-config lookups persist across CLI
invocations now, so recorded request logs drop duplicate GETs and debug
output shows the new host-metadata cache keys. Silenced SDK warnings on
failed well-known fetches (the resolver returns nil,nil) also remove a
couple of Warn lines from auth test outputs.

Co-authored-by: Isaac
Inverts the internal newResolver(cfg, ...) into an exported
NewResolver(fetch) that takes an injected fetch function. Attach stays
as a one-liner convenience. Unit tests for the caching logic no longer
need httptest servers or PAT-authed configs; one integration test
retains the end-to-end SDK wiring.

Co-authored-by: Isaac
Flips the resolver so the happy path is one disk read: positive cache
wraps the miss flow, which now probes negative and falls through to
fetch only on true miss. Context cancellation and deadline errors are
no longer written to the negative cache because they say nothing about
the host's long-term availability.

Regenerates cache/telemetry acceptance outputs — the synthetic
negative-cache probe no longer runs on cache hits.

Co-authored-by: Isaac
@simonfaltum simonfaltum marked this pull request as ready for review April 17, 2026 12:06
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 17, 2026

Approval status: pending

/acceptance/auth/ - needs approval

6 files changed
Suggested: @tanmay-db
Also eligible: @mihaimitrea-db, @tejaskochar-db, @hectorcast-db, @renaudhartert-db, @parthban-db, @Divyansh-db, @chrisst, @rauchy

/acceptance/bundle/ - needs approval

4 files changed
Suggested: @pietern
Also eligible: @denik, @anton-107, @andrewnester, @lennartkats-db, @shreyas-goenka, @janniklasrose

/bundle/ - needs approval

Files: bundle/config/workspace.go
Suggested: @pietern
Also eligible: @denik, @anton-107, @andrewnester, @lennartkats-db, @shreyas-goenka, @janniklasrose

/cmd/api/ - needs approval

Files: cmd/api/api.go
Suggested: @tanmay-db
Also eligible: @mihaimitrea-db, @tejaskochar-db, @hectorcast-db, @renaudhartert-db, @parthban-db, @Divyansh-db, @chrisst, @rauchy

/cmd/auth/ - needs approval

Files: cmd/auth/env.go, cmd/auth/login.go, cmd/auth/profiles.go
Suggested: @tanmay-db
Also eligible: @mihaimitrea-db, @tejaskochar-db, @hectorcast-db, @renaudhartert-db, @parthban-db, @Divyansh-db, @chrisst, @rauchy

/cmd/labs/ - needs approval

Files: cmd/labs/project/entrypoint.go
Eligible: @alexott, @nfx

/cmd/root/ - needs approval

Files: cmd/root/auth.go
Suggested: @tanmay-db
Also eligible: @mihaimitrea-db, @tejaskochar-db, @hectorcast-db, @renaudhartert-db, @parthban-db, @Divyansh-db, @chrisst, @rauchy

/internal/ - needs approval

Files: internal/testutil/env.go
Suggested: @tanmay-db
Also eligible: @mihaimitrea-db, @tejaskochar-db, @hectorcast-db, @renaudhartert-db, @parthban-db, @Divyansh-db, @chrisst, @rauchy

/libs/auth/ - needs approval

Files: libs/auth/arguments.go
Suggested: @tanmay-db
Also eligible: @mihaimitrea-db, @tejaskochar-db, @hectorcast-db, @renaudhartert-db, @parthban-db, @Divyansh-db, @chrisst, @rauchy

General files (require maintainer)

20 files changed
Based on git history:

  • @pietern -- recent work in cmd/auth/, internal/testutil/, cmd/labs/project/

Any maintainer (@andrewnester, @anton-107, @denik, @pietern, @shreyas-goenka, @renaudhartert-db) can approve all areas.
See OWNERS for ownership rules.

simonfaltum and others added 3 commits April 17, 2026 14:31
GetSanitizedVersion replaces + with - in build version metadata for
filesystem safety, but the [DEV_VERSION] replacement regex only
covered the + form. Cache paths use the sanitized form, so telemetry
tests failed across machines with different git HEAD SHAs. Regex now
accepts either + or - before the SHA suffix.

Co-authored-by: Isaac
os.Stat on a missing cache file returns an OS-specific error message
(Unix: "no such file or directory"; Windows: "The system cannot find
the file specified."), causing acceptance-test goldens to diverge
between platforms. The error is also pure noise — the follow-up
"cache miss, computing" line conveys the same information. Drop the
log for fs.ErrNotExist; keep it for genuine stat failures (permissions,
corruption).

Co-authored-by: Isaac
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant