Cache /.well-known/databricks-config lookups in the CLI #5011
simonfaltum wants to merge 20 commits into main
Verifies that two CLI invocations sharing DATABRICKS_CACHE_DIR produce only one /.well-known/databricks-config GET: the first populates the on-disk cache, the second reads from it. Co-authored-by: Isaac
Cached /.well-known/databricks-config lookups persist across CLI invocations now, so recorded request logs drop duplicate GETs and debug output shows the new host-metadata cache keys. Silenced SDK warnings on failed well-known fetches (the resolver returns nil,nil) also remove a couple of Warn lines from auth test outputs. Co-authored-by: Isaac
Inverts the internal newResolver(cfg, ...) into an exported NewResolver(fetch) that takes an injected fetch function. Attach stays as a one-liner convenience. Unit tests for the caching logic no longer need httptest servers or PAT-authed configs; one integration test retains the end-to-end SDK wiring. Co-authored-by: Isaac
Flips the resolver so the happy path is one disk read: positive cache wraps the miss flow, which now probes negative and falls through to fetch only on true miss. Context cancellation and deadline errors are no longer written to the negative cache because they say nothing about the host's long-term availability. Regenerates cache/telemetry acceptance outputs — the synthetic negative-cache probe no longer runs on cache hits. Co-authored-by: Isaac
Approval status: pending
GetSanitizedVersion replaces + with - in build version metadata for filesystem safety, but the [DEV_VERSION] replacement regex only covered the + form. Cache paths use the sanitized form, so telemetry tests failed across machines with different git HEAD SHAs. Regex now accepts either + or - before the SHA suffix. Co-authored-by: Isaac
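A minimal sketch of the regex change described in this commit, assuming a dev version of the form 0.0.0-dev followed by + or - and a hex SHA. The pattern and the helper name replaceDevVersion are hypothetical; the actual devVersionRegex in libs/testdiff may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// devVersionPattern is an illustrative pattern, not the one in libs/testdiff.
// The character class [+-] accepts both the raw (+SHA) and sanitized (-SHA) forms.
var devVersionPattern = regexp.MustCompile(`0\.0\.0-dev[+-][0-9a-f]+`)

// replaceDevVersion swaps any dev version for the [DEV_VERSION] placeholder.
func replaceDevVersion(s string) string {
	return devVersionPattern.ReplaceAllString(s, "[DEV_VERSION]")
}

func main() {
	fmt.Println(replaceDevVersion("cli/0.0.0-dev+1a2b3c4d"))
	fmt.Println(replaceDevVersion("/cache/0.0.0-dev-1a2b3c4d/host-metadata"))
}
```

With a pattern that only allowed +, the second path would keep its machine-specific SHA and the golden output would differ per checkout.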
os.Stat on a missing cache file returns an OS-specific error message (Unix: "no such file or directory"; Windows: "The system cannot find the file specified."), causing acceptance-test goldens to diverge between platforms. The error is also pure noise — the follow-up "cache miss, computing" line conveys the same information. Drop the log for fs.ErrNotExist; keep it for genuine stat failures (permissions, corruption). Co-authored-by: Isaac
Why
Every CLI command (databricks auth profiles, bundle validate, every workspace or account call) goes through Config.EnsureResolved, which triggers an unauthenticated GET to {host}/.well-known/databricks-config to populate host metadata. That round trip is ~700ms against production and gets paid on every invocation, doubling the latency of otherwise single-request commands.
Changes
Before: every CLI invocation hits the well-known endpoint once (or more when multiple configs get constructed).
Now: the first invocation populates a local disk cache under ~/.cache/databricks/<version>/host-metadata/; subsequent invocations read from it. Failures are negatively cached for 60s (except for context.Canceled / context.DeadlineExceeded, which are transient and never cached).
- libs/hostmetadata package. NewResolver(fetch) is the primary API: it takes an injected fetch function and returns a config.HostMetadataResolver. Attach(cfg) is a one-line convenience that wires cfg.DefaultHostMetadataResolver() as the fetch. The SDK API comes from "Add HostMetadataResolver for customizable host metadata fetching" (databricks-sdk-go#1572), shipped in v0.127.0, which is already bumped on main.
- Attach wired at every non-allowlisted *config.Config construction site: cmd/root/auth.go (4 sites), bundle/config/workspace.go, cmd/api/api.go, cmd/auth/{env,login,profiles}.go (3 sites across 2 files), cmd/labs/project/entrypoint.go, libs/auth/arguments.go.
- DATABRICKS_CACHE_DIR is set to a temp dir so tests don't leak cache files into HOME.
- libs/hostmetadata/injection_guardrail_test.go walks the tree and flags any new config.Config{ construction site that lacks a nearby hostmetadata.Attach call (with an allowlist for the handful of legitimately cfg-less-resolve sites).
Collateral cleanups
- libs/cache/file_cache.go: drop the "failed to stat cache file" debug log when the file is simply missing (fs.ErrNotExist). It was pure noise (the next line, "cache miss, computing", conveys the same info) and its OS-specific error text diverged between Unix ("no such file or directory") and Windows ("The system cannot find the file specified."), breaking cross-platform acceptance goldens. Genuine stat failures (permission, corruption) still log.
- libs/testdiff/replacement.go: devVersionRegex now accepts either +SHA or -SHA after 0.0.0-dev. build.GetSanitizedVersion() swaps + to - for filesystem safety when the version is used in cache paths, and the old regex only covered the + form.
Test plan
- make checks clean
- make lint clean (0 issues)
- go test ./libs/hostmetadata/... -race: 7 tests (smoke + cache hit + fetch error + cancellation-not-cached + host isolation + end-to-end integration + injection guardrail); all unit tests use an injected mock fetch, so no httptest.Server is required
- go test ./libs/cache/... -race clean
- go test ./cmd/root/... -race clean
- go test ./bundle/config/... -race clean
- acceptance/auth/host-metadata-cache/ asserts exactly ONE /.well-known/databricks-config GET across two auth profiles invocations sharing a DATABRICKS_CACHE_DIR
- Acceptance outputs regenerated: duplicate GETs dropped from out.requests.txt (caching works), new [Local Cache] debug lines in cache/telemetry tests, two "Warn: Failed to resolve host metadata" lines removed (intentional: the resolver returns (nil, nil) on fetch errors, which is how the SDK interprets "no metadata available"), stat-not-found lines removed (see Collateral cleanups)
Live validation against dogfood
Built locally (go build -o /tmp/databricks-cache-test .) and ran databricks -p e2-dogfood current-user me with and without a warm cache:
- Cold cache (DATABRICKS_CACHE_DIR): cache miss, computing → GET /.well-known/databricks-config → computed and stored result
- Warm cache: [Local Cache] cache hit line
Net per-command savings: ~700ms, matching the Why. Cache dir after one auth profiles run contained five JSON files (one per host in .databrickscfg). Inspecting one:
{"oidc_endpoint":"https://db-deco-test.databricks.com/oidc/accounts/{account_id}","account_id":"...","workspace_id":"","cloud":"","host_type":"UNIFIED_HOST","token_federation_default_oidc_audiences":["..."]}
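A cache entry with the keys shown above can be decoded with encoding/json, for example when inspecting the cache directory by hand. This is a sketch only: the struct name cachedHostMetadata, the helper parseCacheEntry, and the Go field names are hypothetical, the sample values are placeholders, and the SDK's own type may carry more fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cachedHostMetadata mirrors the JSON keys of the cache entry shown above.
// The Go names are illustrative; only the JSON tags come from the sample.
type cachedHostMetadata struct {
	OIDCEndpoint string   `json:"oidc_endpoint"`
	AccountID    string   `json:"account_id"`
	WorkspaceID  string   `json:"workspace_id"`
	Cloud        string   `json:"cloud"`
	HostType     string   `json:"host_type"`
	Audiences    []string `json:"token_federation_default_oidc_audiences"`
}

// parseCacheEntry decodes one cache file's contents.
func parseCacheEntry(data []byte) (*cachedHostMetadata, error) {
	var m cachedHostMetadata
	if err := json.Unmarshal(data, &m); err != nil {
		return nil, err
	}
	return &m, nil
}

func main() {
	// Placeholder values; real entries contain host-specific data.
	raw := []byte(`{"oidc_endpoint":"https://example.test/oidc/accounts/{account_id}",` +
		`"account_id":"placeholder","workspace_id":"","cloud":"",` +
		`"host_type":"UNIFIED_HOST","token_federation_default_oidc_audiences":["placeholder"]}`)
	m, err := parseCacheEntry(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(m.HostType, len(m.Audiences)) // prints "UNIFIED_HOST 1"
}
```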