fix(slurm): support co-located proxy in sglang-disagg (min 2 nodes) by mkuznet1 · Pull Request #149 · ROCm/madengine

mkuznet1 · 2026-06-25T17:12:13Z

Motivation

The SGLang disaggregated launcher hard-required a dedicated proxy node and a
minimum of 3 nodes (1 proxy + 1 prefill + 1 decode). This blocked the minimal
and more flexible co-located layout, where the proxy/router runs on the first
prefill node (started by the model run.sh), as already done by the vllm-disagg
launcher. The goal is to support both topologies and lower the minimum cluster
size to 2 nodes.

Technical Details

All changes are in src/madengine/deployment/slurm.py:

Relax custom-split validation to accept two topologies instead of one:
- dedicated proxy node: 1 + prefill + decode == nnodes
- co-located proxy on the first prefill node: prefill + decode == nnodes
  xP/yD and the exported topology env are unchanged; the proxy is launched by
  the model script, so both layouts are valid. Error message updated to state the
  expected node count for each topology.
Lower the minimum node-count guard from < 3 to < 2, since the validation
above already permits prefill + decode == nnodes. Updated the docstring
(Minimum cluster: 2 nodes, nnodes must be >= 2) and the corresponding
ValueError message accordingly.

The sglang-disagg launcher hard-required a dedicated proxy node and a minimum of 3 nodes (1 proxy + 1 prefill + 1 decode), unlike the vllm-disagg launcher which leaves topology to the model script. Support a co-located proxy/router on the first prefill node and lower the minimum cluster size to 2 nodes. - Relax custom-split validation to accept both topologies: dedicated proxy (1 + xP + yD == nnodes) or co-located proxy (xP + yD == nnodes). xP/yD and the exported topology env are unchanged; the proxy is started by the model run.sh, so both layouts are valid. Error message updated accordingly. - Lower the node-count guard from < 3 to < 2 and update the docstring and error message to reflect the co-located minimum.

Copilot

Pull request overview

This PR updates the SLURM SGLang-disaggregated launcher setup to support a co-located proxy/router topology (proxy runs on the first prefill node) in addition to the existing dedicated proxy-node topology, reducing the minimum cluster size to 2 nodes.

Changes:

Relaxes custom-split validation to accept either 1 + xP + yD == nnodes (dedicated proxy) or xP + yD == nnodes (co-located proxy).
Lowers the minimum-node guard and updates related docstrings/error messages accordingly.
Updates the generated “Cluster Configuration” comment block to describe the two supported topologies.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

mkuznet1 · 2026-06-26T11:18:32Z

Test run (8a166ba)

707 passed / 17 failed / 2 skipped (unit 522/2, integration 139/1, e2e 46/14).

Cat	#	Root cause
A	13	user not in `render`/`video` → `amd-smi` warnings break GPU count (17 vs 8) → empty product name
B	2	stale keepAlive tests expect container name without `_node0` suffix
C	2	node has `/nonexistent` dir (0750) → `PermissionError` not `FileNotFoundError`

A (13): integration test_renderD_count_matches_gpu_count; e2e Discover test_dynamic/test_multiple; 9 profiling tests; test_gpu_product_name_matches_arch.
B (2): test_keepAlive_keeps_docker_alive, test_keepAlive_preserves_model_dir.
C (2): unit test_upload_file_to_mongodb_file_not_found, test_validate_additional_context_file_not_found.

Bottom line: 15/17 fixable without code (groups or root-in-container); 2 keepAlive need a test fix.

…roxy The default (no custom prefill/decode) split assumed a dedicated proxy and computed yD = nnodes - 1 - xP, which yields yD=0 for nnodes=2 -- an invalid topology that contradicts the new 2-node co-located minimum. Special-case nnodes==2 to a co-located 1 prefill + 1 decode split. Add unit tests for the default-split path: nnodes=2 is co-located (xP=1, yD=1), every supported nnodes yields >=1 prefill and >=1 decode, and nnodes<2 raises ValueError.

mkuznet1 · 2026-06-26T11:31:55Z

The default-split path now special-cases nnodes == 2 to a co-located 1 prefill + 1 decode split (xP=1, yD=1) instead of yD=0. For nnodes >= 3 the dedicated-proxy default already yields yD >= 1.

Added test coverage: new TestSglangDisaggDefaultSplit exercises the default-split path — nnodes=2 is co-located (xP=1, yD=1), parametrized nnodes in [2,3,4,5,6,8] all yield >=1 prefill and >=1 decode, and nnodes<2 raises ValueError. Local unit run: 28 passed.

mkuznet1 self-assigned this Jun 25, 2026

mkuznet1 requested a review from Copilot June 25, 2026 17:12

Copilot started reviewing on behalf of mkuznet1 June 25, 2026 17:12 View session

Copilot AI reviewed Jun 25, 2026

View reviewed changes

Comment thread src/madengine/deployment/slurm.py

Comment thread src/madengine/deployment/slurm.py

mkuznet1 marked this pull request as ready for review June 26, 2026 11:15

mkuznet1 requested review from Cemberk, Rohan138, coketaste, gargrahul and leconcio as code owners June 26, 2026 11:15

i-kosarev approved these changes Jun 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(slurm): support co-located proxy in sglang-disagg (min 2 nodes)#149

fix(slurm): support co-located proxy in sglang-disagg (min 2 nodes)#149
mkuznet1 wants to merge 2 commits into
ROCm:developfrom
mkuznet1:fix/slurm-sglang-disagg-co-located-proxy

mkuznet1 commented Jun 25, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

mkuznet1 commented Jun 26, 2026

Uh oh!

mkuznet1 commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

mkuznet1 commented Jun 25, 2026

Motivation

Technical Details

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

mkuznet1 commented Jun 26, 2026

Uh oh!

mkuznet1 commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants