
cloud: multigpu training options via Replicate #2802

Merged
bghira merged 3 commits into main from feature/multigpu-cloud-training
Jun 27, 2026

@bghira bghira commented Jun 27, 2026

Owner

This pull request adds support for specifying a hardware profile when submitting cloud jobs, especially for the "replicate" provider. It introduces a new --hardware-profile CLI option, ensures hardware profile information is passed through all relevant backend layers, and updates provider configuration and metadata accordingly. The changes improve flexibility and traceability for job submissions involving different hardware configurations.

CLI and API Enhancements:

  • Added a --hardware-profile option to the simpletuner cloud jobs submit command, allowing users to select from supported Replicate hardware profiles (e.g., h100, h100-x4, l40s-x2). This value is passed through the CLI, API, and backend job submission logic.
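A minimal sketch of how an option like this could be declared with argparse. It is not the PR's actual CLI code: the helper names `build_parser` and `parse_hardware_profile` are hypothetical, and `EXAMPLE_PROFILES` contains only the three profiles named in the description, so the real list may be longer.

```python
import argparse
from typing import List, Optional

# Assumption: only the three profiles named in the PR description.
EXAMPLE_PROFILES = ("h100", "h100-x4", "l40s-x2")


def build_parser() -> argparse.ArgumentParser:
    """Build a stand-in parser for the submit subcommand (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="simpletuner cloud jobs submit")
    parser.add_argument(
        "--hardware-profile",
        choices=EXAMPLE_PROFILES,
        default=None,  # omitted flag means "let the provider pick its default"
        help="Replicate hardware profile to run the job on (optional).",
    )
    return parser


def parse_hardware_profile(argv: List[str]) -> Optional[str]:
    """Return the selected profile, or None when the flag is not given."""
    return build_parser().parse_args(argv).hardware_profile
```

With `choices` set, argparse rejects an unknown profile before any job is submitted, so a typo fails at the command line rather than at the provider.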

Backend and Data Model Changes:

  • Updated backend data models and schemas to include an optional hardware_profile field for job submission and provider configuration. This ensures the hardware profile is validated, normalized, and recorded throughout the job lifecycle.
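To illustrate what an optional `hardware_profile` field on a submission model might look like, here is a sketch using a plain dataclass. The class `JobSubmitRequest`, the function `to_job_record`, and the `provider` and `config_name` fields are hypothetical. Only the `hardware_profile` field name comes from the description, and the project's real schemas may use a different modelling library.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class JobSubmitRequest:
    """Hypothetical request model; only hardware_profile is named in the PR."""

    provider: str
    config_name: str
    hardware_profile: Optional[str] = None  # None means "use the provider default"


def to_job_record(request: JobSubmitRequest) -> Dict[str, Any]:
    """Flatten the request into a dict that a job store could persist."""
    record = asdict(request)
    # Drop unset optional values so older records and new ones look alike.
    return {key: value for key, value in record.items() if value is not None}
```

Keeping the field optional with a `None` default means existing callers that never pass a profile keep working unchanged.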

Provider and Profile Management:

  • Integrated hardware profile normalization and validation for the "replicate" provider, preventing invalid profiles and ensuring consistent configuration. Provider configuration now supports storing and updating the default hardware profile, with validation for Replicate profiles.
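The description does not spell out the normalization rules, so the following is only a sketch under stated assumptions: normalization here means trimming whitespace and lowercasing, an empty value is treated as unset, and only the "replicate" provider is checked against a known set. The function name `normalize_hardware_profile` and the set `KNOWN_REPLICATE_PROFILES` are hypothetical.

```python
from typing import Optional

# Assumption: the three profiles named in the PR description; the real set may differ.
KNOWN_REPLICATE_PROFILES = frozenset({"h100", "h100-x4", "l40s-x2"})


def normalize_hardware_profile(provider: str, value: Optional[str]) -> Optional[str]:
    """Return a canonical profile string, None if unset, or raise on an invalid one."""
    if value is None:
        return None
    normalized = value.strip().lower()  # assumed normalization rule
    if not normalized:
        return None  # treat blank input the same as "not provided"
    if provider == "replicate" and normalized not in KNOWN_REPLICATE_PROFILES:
        allowed = ", ".join(sorted(KNOWN_REPLICATE_PROFILES))
        raise ValueError(
            f"unknown Replicate hardware profile {value!r}; expected one of: {allowed}"
        )
    return normalized
```

Running the same function when saving a provider's default profile and when submitting a job would keep both paths consistent.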

Provider Registry and Metadata Reporting:

  • Enhanced the provider registry and metadata reporting to include hardware profile details and the list of available hardware profiles, and to use the selected profile’s model for job submission and display.

Replicate Client Defaults:

  • Updated the default model for Replicate jobs to be determined by the selected hardware profile, ensuring correct model selection for each hardware configuration.
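One way to picture profile-driven model selection is a lookup table from profile to model identifier. Everything in this sketch is illustrative: the model names are placeholders rather than real Replicate models, the choice of `h100` as the fallback profile is an assumption, and `resolve_model` is a hypothetical helper.

```python
from typing import Optional

# Placeholder model identifiers; the real mapping is not given in the PR description.
PROFILE_MODELS = {
    "h100": "example-owner/trainer-h100",
    "h100-x4": "example-owner/trainer-h100-x4",
    "l40s-x2": "example-owner/trainer-l40s-x2",
}

# Assumption: which profile acts as the default is not stated in the source.
DEFAULT_PROFILE = "h100"


def resolve_model(hardware_profile: Optional[str] = None) -> str:
    """Pick the model for a profile, falling back to the default profile."""
    profile = hardware_profile or DEFAULT_PROFILE
    try:
        return PROFILE_MODELS[profile]
    except KeyError:
        raise ValueError(
            f"no model registered for hardware profile {profile!r}"
        ) from None
```

Deriving the model from the profile in one place avoids a mismatch where a job requests multi-GPU hardware but is submitted against a single-GPU model.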

This comment was marked as outdated.

@bghira bghira merged commit 4201383 into main Jun 27, 2026
4 checks passed
@bghira bghira deleted the feature/multigpu-cloud-training branch June 27, 2026 16:36
