feat(train): Auto-detect subscription recipe hyperparameters in SFTTr…#5844

Merged: mollyheamazon merged 1 commit into aws:master from haardm:feat/datamix-subscription-recipe-hp-support on May 11, 2026


@haardm haardm commented May 11, 2026

Auto-detect subscription recipe hyperparameters in SFTTrainer

Issue

When using SFTTrainer with models that have subscription-gated recipes (e.g., Nova Forge datamix), the trainer's hyperparameter schema (_specs) only includes keys from the standard recipe. Setting subscription recipe hyperparameters such as customer_data_percent or nova_*_percent therefore raises AttributeError.

This is because the subscription recipe's SmtjOverrideParamsS3Uri points to an S3 access point that requires the customer to be subscribed. The SDK currently only downloads override_params from the standard (publicly accessible) recipe.

Solution

After loading the standard recipe's override_params, detect whether any recipe in RecipeCollection has IsSubscriptionModel: true. If one is found, attempt to download its override_params from the S3 access point using the customer's credentials, then merge the additional keys into _specs with default: None.
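
The detection step can be sketched roughly as follows. This is an illustration only, not the code from the PR: it assumes RecipeCollection is available as a list of dicts, and the function name `find_subscription_recipe` and the `Name` field are hypothetical. Only `IsSubscriptionModel` and `SmtjOverrideParamsS3Uri` come from the description above.

```python
from typing import Any, Dict, List, Optional


def find_subscription_recipe(
    recipe_collection: List[Dict[str, Any]],
) -> Optional[Dict[str, Any]]:
    """Return the first recipe flagged as subscription-gated, or None.

    Assumption: each recipe is a plain dict parsed from hub metadata.
    """
    for recipe in recipe_collection:
        # Only consider recipes that are flagged AND carry an override_params URI.
        if recipe.get("IsSubscriptionModel") is True and recipe.get(
            "SmtjOverrideParamsS3Uri"
        ):
            return recipe
    return None
```

When no recipe carries the flag, the function returns None and no extra download would be attempted, which matches the stated goal of adding latency only when subscription recipes exist.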

Key behaviors:

  • Subscription recipe keys are added to _specs (settable) but with default: None
  • Keys with None default are NOT serialized in to_dict() unless explicitly set by the user
  • This ensures non-datamix jobs don't accidentally send datamix HPs
  • For non-subscribed users, the download fails silently (AccessDenied) — no behavior change
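
The merge and serialization behaviors listed above might look like the sketch below. `HyperparameterSketch` and `merge_subscription_specs` are hypothetical stand-ins, not the SDK's classes, and the spec layout (a dict with a `default` entry per key) is an assumption made for illustration.

```python
from typing import Any, Dict


def merge_subscription_specs(
    specs: Dict[str, Dict[str, Any]],
    subscription_params: Dict[str, Dict[str, Any]],
) -> Dict[str, Dict[str, Any]]:
    """Add subscription-only keys with default None; never overwrite standard keys."""
    merged = dict(specs)
    for key, spec in subscription_params.items():
        if key not in merged:
            merged[key] = {**spec, "default": None}
    return merged


class HyperparameterSketch:
    """Toy stand-in for the trainer's hyperparameter object (hypothetical)."""

    def __init__(self, specs: Dict[str, Dict[str, Any]]) -> None:
        # Bypass our own __setattr__ while setting up internal state.
        object.__setattr__(self, "_specs", dict(specs))
        object.__setattr__(
            self, "_values", {k: v.get("default") for k, v in specs.items()}
        )

    def __setattr__(self, name: str, value: Any) -> None:
        # Keys outside the schema are rejected, as described in the Issue section.
        if name not in self._specs:
            raise AttributeError(f"Unknown hyperparameter: {name}")
        self._values[name] = value

    def to_dict(self) -> Dict[str, Any]:
        # Keys still at None were never set by the user, so they are omitted.
        return {k: v for k, v in self._values.items() if v is not None}
```

In this sketch a subscription key becomes settable after the merge, but it stays out of `to_dict()` until the user assigns it a value.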

Changes

  • sagemaker-train/src/sagemaker/train/common_utils/finetune_utils.py:

    • After standard recipe override_params are loaded, find subscription recipes
    • Resolve {customer_id} placeholder in the S3 URI with caller's account ID
    • Handle S3 access point ARN URI format for GetObject
    • Merge subscription keys with default: None (non-destructive, won't serialize unless set)
    • Silent fallback on any exception
  • sagemaker-train/tests/unit/train/common_utils/test_finetune_utils.py:

    • Test: subscription recipe HPs available when user is subscribed (default is None)
    • Test: subscription recipe HPs NOT available when no subscription recipe exists
    • Test: graceful fallback when user is not subscribed (AccessDenied)
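
The URI handling and silent fallback described in the change list can be sketched as below. All function names here are hypothetical, and the access point URI shape (`s3://<access-point-ARN>/<key>`) is an assumption about the hub metadata. The download itself is injected as a `fetch` callable so the sketch runs without AWS credentials.

```python
import re
from typing import Any, Callable, Dict, Tuple

# Assumed shape: s3://arn:aws:s3:<region>:<account>:accesspoint/<name>/<key>
_ACCESS_POINT_URI = re.compile(
    r"^s3://(arn:aws[\w-]*:s3:[\w-]*:\d{12}:accesspoint/[^/]+)/(.+)$"
)


def resolve_customer_id(uri: str, account_id: str) -> str:
    """Substitute the caller's account ID for the {customer_id} placeholder."""
    return uri.replace("{customer_id}", account_id)


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an S3 URI into (bucket-or-access-point-ARN, key)."""
    match = _ACCESS_POINT_URI.match(uri)
    if match:
        return match.group(1), match.group(2)
    if not uri.startswith("s3://"):
        raise ValueError(f"Not an S3 URI: {uri}")
    bucket, _, key = uri[len("s3://"):].partition("/")
    return bucket, key


def load_subscription_params(
    uri_template: str,
    account_id: str,
    fetch: Callable[[str, str], Dict[str, Any]],
) -> Dict[str, Any]:
    """Return the subscription override_params, or {} on any failure."""
    try:
        bucket, key = split_s3_uri(resolve_customer_id(uri_template, account_id))
        return fetch(bucket, key)
    except Exception:
        # Silent fallback: non-subscribed callers (e.g. AccessDenied) see no change.
        return {}
```

In a real client, `fetch` would presumably wrap boto3's `get_object`, which accepts an access point ARN as its `Bucket` argument, and the account ID would come from the caller's STS identity.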

Testing

Validated end-to-end in IAD gamma with both cases:

Datamix job (user sets datamix HPs):

  • trainer.hyperparameters.customer_data_percent = 70 works natively
  • Only datamix HPs appear in the API request
  • API selects datamix recipe (nova_lite_2_0_p5_gpu_sft_text_with_datamix)

Non-datamix job (user only sets standard HPs):

  • No datamix HPs in the API request
  • API selects standard recipe (nova_lite_2_0_p5_gpu_sft)
  • No regression

All 58 unit tests passing.

Customer Experience

Subscribed user wanting datamix:

trainer = SFTTrainer(model="nova-textgeneration-lite-v2", ...)
trainer.hyperparameters.customer_data_percent = 70
trainer.hyperparameters.nova_code_percent = 30
# ... set all nova categories summing to 100
trainer.train()  # → datamix recipe selected

Subscribed user NOT wanting datamix:

trainer = SFTTrainer(model="nova-textgeneration-lite-v2", ...)
trainer.hyperparameters.max_steps = 4
trainer.train()  # → standard recipe selected, no datamix HPs sent

Non-subscribed user:

trainer = SFTTrainer(model="nova-textgeneration-lite-v2", ...)
# Datamix HPs not in schema (access point download failed silently)
# Standard HPs work as before — no regression

@haardm haardm temporarily deployed to manual-approval May 11, 2026 20:27 — with GitHub Actions Inactive
…ainer

When a model has subscription-gated recipes (IsSubscriptionModel: true
in RecipeCollection), automatically attempt to fetch the recipe's
override_params from the S3 access point and merge additional
hyperparameter keys into the trainer's _specs schema.

This allows subscribed users to natively set datamix hyperparameters
(e.g. customer_data_percent, nova_*_percent) via trainer.hyperparameters
without any explicit flag or workaround.

For non-subscribed users, the fetch fails silently (AccessDenied) and
only standard recipe hyperparameters are available. The extra latency
only occurs when subscription recipes exist in the hub metadata.

Changes:
- After loading standard override_params, check if any recipe has
  IsSubscriptionModel: true
- If found: resolve {customer_id} placeholder with caller's account ID,
  download override_params from access point, merge extra keys
- Handle S3 access point ARN URI format for GetObject
- Silent fallback on failure (non-subscribed users unaffected)
- Add unit tests for positive, negative, and fallback cases
@haardm haardm force-pushed the feat/datamix-subscription-recipe-hp-support branch from bba6d97 to 5489954 Compare May 11, 2026 21:09
@haardm haardm temporarily deployed to manual-approval May 11, 2026 21:09 — with GitHub Actions Inactive
@haardm haardm temporarily deployed to manual-approval May 11, 2026 21:10 — with GitHub Actions Inactive
@mollyheamazon mollyheamazon merged commit bc81f0b into aws:master May 11, 2026
24 of 32 checks passed

2 participants