feat(databricks-skills): add databricks-mlflow-ml skill for classic ML#474

Open
dgokeeffe wants to merge 3 commits into databricks-solutions:main from dgokeeffe:feat/databricks-mlflow-ml-skill

Conversation

@dgokeeffe

Why

The existing MLflow-related skills leave a gap for classic ML practitioners:

| Skill | Scope | Covers classic ML UC registration? |
| --- | --- | --- |
| databricks-mlflow-evaluation | GenAI agent evaluation (mlflow.genai.evaluate, scorers, judges) | ❌ Different audience |
| databricks-model-serving | Real-time serving endpoints | ❌ Serving, not training/registration |
| databricks-unity-catalog | Tables, volumes, system tables | ❌ Data primitives, not model registry |
| databricks-mlflow-ml (this PR) | Classic ML training + UC registration + batch inference | ✅ |

A data scientist training a forecasting model, registering it to Unity Catalog, and scoring predictions in a notebook or Lakeflow pipeline has no skill to trigger on. This PR fills that gap.

What's in the skill

SKILL.md — workflow index (Train → Register → Score, Retrain + Promote A/B, Debugging), quick-start, runtime compatibility note, and trigger description.

7 reference files:

  • GOTCHAS.md — 14 common mistakes with symptoms + fixes
  • CRITICAL-interfaces.md — exact API signatures + the models:/catalog.schema.model@alias URI format
  • patterns-experiment-setup.md — UC volume artifact_location (required in UC-enforced workspaces)
  • patterns-training.md — logging with signature + input_example, sklearn.Pipeline wrapping, autologging
  • patterns-uc-registration.md — three-level names, @champion/@challenger aliases, verification via DESCRIBE MODEL, A/B promotion
  • patterns-batch-inference.md — notebook pyfunc.load_model (Tier 1), Lakeflow SDP pyfunc.spark_udf (Tier 2), champion-vs-challenger validation, explicit warning against ai_query on custom UC models
  • user-journeys.md — 7 end-to-end workflows including debugging scenarios
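The Train → Register → Score workflow those files index looks roughly like the sketch below. This is an illustrative outline, not the skill's verbatim quick-start: the catalog/schema/model names and the experiment path are placeholders, and actually running `train_and_register` requires a Databricks workspace with `mlflow[databricks]` and scikit-learn installed.

```python
def uc_model_name(catalog: str, schema: str, model: str) -> str:
    # Three-level name required for Unity Catalog registration.
    return f"{catalog}.{schema}.{model}"


def train_and_register(X_train, y_train, experiment_path: str):
    # Imports kept local: this function only works against a live
    # Databricks workspace (mlflow[databricks] + scikit-learn).
    import mlflow
    from mlflow.models import infer_signature
    from sklearn.ensemble import GradientBoostingRegressor

    # Without this, register_model silently targets the legacy
    # workspace registry instead of Unity Catalog.
    mlflow.set_registry_uri("databricks-uc")

    # Assumes the experiment was created with a UC volume
    # artifact_location (DBFS root is rejected in UC-enforced
    # workspaces).
    mlflow.set_experiment(experiment_path)

    with mlflow.start_run():
        model = GradientBoostingRegressor().fit(X_train, y_train)
        mlflow.sklearn.log_model(
            model,
            name="model",  # artifact_path= is deprecated
            signature=infer_signature(X_train, model.predict(X_train)),
            input_example=X_train[:5],
            registered_model_name=uc_model_name("main", "ml", "forecast"),
        )
```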

Key gotchas this skill teaches that other guides miss

  1. UC volume artifact_location on experiment creation — DBFS root is rejected in UC-enforced workspaces. Every log_model call fails with opaque errors until artifact_location points at a UC volume.
  2. mlflow.set_registry_uri('databricks-uc') — without this, register_model silently routes to the legacy workspace registry. The #1 "my model isn't showing up in Catalog Explorer" support question.
  3. ai_query on custom UC models — doesn't work. Requires a serving endpoint. Correct primitive is mlflow.pyfunc.load_model (notebook) or mlflow.pyfunc.spark_udf (Lakeflow).
  4. @champion / @challenger aliases — replace deprecated transition_model_version_stage() stages. The legacy API still exists but is a no-op on UC-registered models (no error, no effect).
  5. mlflow.pyfunc.spark_udf in Lakeflow SDP — must be constructed at module scope, not inside @dp.materialized_view. Otherwise deserialization repeats on every pipeline evaluation.
  6. pip install 'mlflow[databricks]' — required for UC registration outside Databricks clusters. Plain pip install mlflow omits the cloud-storage SDKs (azure-core / boto3 / google.cloud) MLflow needs to stage UC artifacts. Clusters ship the extras pre-installed.
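Gotcha 5 in particular is easy to get wrong, so here is a rough sketch of the correct shape. Table names are placeholders, `spark` stands for the ambient SparkSession in a Databricks pipeline, and the `from pyspark import pipelines as dp` import reflects my reading of the Lakeflow SDP Python API; treat the details as assumptions, not the skill's verbatim pattern.

```python
def champion_uri(catalog: str, schema: str, model: str) -> str:
    # The models:/catalog.schema.model@alias URI format from
    # CRITICAL-interfaces.md.
    return f"models:/{catalog}.{schema}.{model}@champion"


def build_pipeline(spark):
    # Requires a Databricks Lakeflow SDP environment; `spark` is the
    # pipeline's SparkSession.
    import mlflow.pyfunc
    from pyspark import pipelines as dp  # assumed SDP import path
    from pyspark.sql import functions as F

    # Constructed ONCE, outside the view function, so the model is
    # deserialized a single time. Building it inside forecast_scored()
    # would repeat the load on every pipeline evaluation.
    predict = mlflow.pyfunc.spark_udf(
        spark, champion_uri("main", "ml", "forecast")
    )

    @dp.materialized_view
    def forecast_scored():
        return spark.read.table("main.ml.features").withColumn(
            "prediction", predict(F.struct("*"))
        )
```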

Testing

Field-tested end-to-end against a live Databricks workspace:

  • Feature table seeded, trained a GradientBoostingRegressor
  • Registered to UC with @champion alias — verified in Catalog Explorer UI
  • Loaded via mlflow.pyfunc.load_model — predictions within ~2% of actuals
  • Two additional gotchas surfaced during the test (mlflow[databricks] install + artifact_path deprecation) and added to GOTCHAS.md
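The load-and-score step above, extended to the champion-vs-challenger validation the batch-inference patterns describe, roughly follows this shape. A sketch under assumptions: the model name is a placeholder, and running it needs a workspace plus `mlflow[databricks]`; only the URI helper is pure.

```python
def alias_uri(model_name: str, alias: str) -> str:
    # models:/catalog.schema.model@alias
    return f"models:/{model_name}@{alias}"


def compare_champion_challenger(X_eval, y_eval, model_name: str):
    # Notebook (Tier 1) path — fine up to roughly 10k rows.
    import mlflow.pyfunc
    from sklearn.metrics import mean_absolute_percentage_error as mape

    scores = {}
    for alias in ("champion", "challenger"):
        model = mlflow.pyfunc.load_model(alias_uri(model_name, alias))
        scores[alias] = mape(y_eval, model.predict(X_eval))
    return scores  # promote the challenger only if it wins


def promote(model_name: str, version: str):
    # Reassigning @champion is the UC replacement for the deprecated
    # transition_model_version_stage() stages.
    from mlflow import MlflowClient

    MlflowClient().set_registered_model_alias(model_name, "champion", version)
```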

Runtime verified: MLflow 3.11 on Lakeflow SDP serverless compute v5 (current default). Patterns are compatible with MLflow 2.16+, so pairs on older classic DBRs still get correct behaviour. 2.x/3.x divergences are called out in GOTCHAS.md (e.g., the artifact_path → name= rename).

Structure parity

File layout matches databricks-mlflow-evaluation (same SKILL.md + references/ + GOTCHAS.md + CRITICAL-interfaces.md + patterns-*.md convention). Installable via the existing install_skills.sh:

./install_skills.sh databricks-mlflow-ml

Not in scope

  • Model Serving endpoints (databricks-model-serving covers that)
  • GenAI agent evaluation (databricks-mlflow-evaluation covers that)
  • Generic UC primitives like volumes and tables (databricks-unity-catalog covers those)

Deliberately narrow — classic ML + UC registration + batch inference only.

Origin

Built to fill a gap encountered during the Coles Vibe Workshop (airgapped Databricks field-engineer hackathon). DS pairs needed UC-scoped MLflow guidance that wasn't covered by any existing skill. Content battle-tested in the workshop before being contributed upstream.

David O'Keeffe added 3 commits April 19, 2026 22:01
Fills the gap between databricks-mlflow-evaluation (GenAI agent eval) and
databricks-model-serving (real-time endpoints). Covers:

- Classic ML model training with MLflow tracking
  (sklearn / XGBoost / PyTorch)
- Experiment creation with UC volume artifact_location
  (required in UC-enforced workspaces)
- Unity Catalog model registration with three-level names
- @champion / @challenger alias management
- Batch inference via mlflow.pyfunc.load_model (notebook, up to ~10k rows)
- Distributed batch via mlflow.pyfunc.spark_udf in Lakeflow SDP pipelines

Structure mirrors databricks-mlflow-evaluation:
- SKILL.md: workflows + trigger description + quick start
- references/GOTCHAS.md: 12 common mistakes with symptoms + fixes
- references/CRITICAL-interfaces.md: exact API signatures + models:/ URI format
- references/patterns-experiment-setup.md: UC volume artifact_location setup
- references/patterns-training.md: logging with signature + input_example
- references/patterns-uc-registration.md: register + alias + verify + A/B
- references/patterns-batch-inference.md: pyfunc.load_model + spark_udf + ai_query anti-pattern
- references/user-journeys.md: 7 end-to-end workflows including debugging

Key gotchas covered that other MLflow guides miss:
- Experiment creation now requires UC volume artifact_location in UC-enforced
  workspaces (DBFS root writes are rejected)
- mlflow.set_registry_uri('databricks-uc') is required; silent workspace
  registry fallback is the #1 support question
- ai_query does NOT work on custom UC-registered models unless they're
  deployed to a serving endpoint; use pyfunc.load_model or spark_udf instead
- UC aliases (@champion/@challenger) replace deprecated stage transitions
  (transition_model_version_stage is a no-op on UC models)
- mlflow.pyfunc.spark_udf must be constructed at module scope in Lakeflow
  SDP pipelines, not inside the function body

Tested against MLflow 2.16+ on Databricks Runtime 15.4 LTS. Content battle-
tested in the Coles Vibe Workshop (classic-ML track running in an airgapped
environment where online MLflow docs aren't reachable).
Field-tested the skill end-to-end from a local Python environment against
a live Databricks workspace. Surfaced two gotchas not in the original set:

#12 mlflow[databricks] extras missing when running outside Databricks:
plain `pip install mlflow` omits azure-core / boto3 / google.cloud SDKs
that UC registration needs to stage artifacts. Training + log_model work;
register_model fails with opaque "No module named 'azure'". Databricks
clusters ship the extras pre-installed, so this only bites laptops / CI.

#13 artifact_path= deprecated in favour of name= (MLflow 2.16+): emits
warning on every log_model call. Non-blocking, but worth flagging since
most online tutorials + training courses still use the old param.

Both verified against the workshop's test run — skill workflow 1 now
completes cleanly with these fixes documented.
Original SKILL.md didn't state a runtime target. Adds a "Runtime compatibility"
section anchored on what the skill was actually tested against — MLflow 3.11
on Lakeflow SDP serverless compute v5 — with a compat note for MLflow 2.16+
(classic DBR 15.4 LTS still ships 2.x). Points at GOTCHAS.md for the 3.x-vs-2.x
divergence (artifact_path deprecation, etc.).