Skip to content

fix(slurm): support co-located proxy in sglang-disagg (min 2 nodes)#149

Open
mkuznet1 wants to merge 2 commits into
ROCm:developfrom
mkuznet1:fix/slurm-sglang-disagg-co-located-proxy
Open

fix(slurm): support co-located proxy in sglang-disagg (min 2 nodes)#149
mkuznet1 wants to merge 2 commits into
ROCm:developfrom
mkuznet1:fix/slurm-sglang-disagg-co-located-proxy

Conversation

@mkuznet1

Copy link
Copy Markdown
Contributor

Motivation

The SGLang disaggregated launcher hard-required a dedicated proxy node and a
minimum of 3 nodes (1 proxy + 1 prefill + 1 decode). This blocked the minimal
and more flexible co-located layout, where the proxy/router runs on the first
prefill node (started by the model run.sh), as already done by the vllm-disagg
launcher. The goal is to support both topologies and lower the minimum cluster
size to 2 nodes.

Technical Details

All changes are in src/madengine/deployment/slurm.py:

  • Relax custom-split validation to accept two topologies instead of one:
    • dedicated proxy node: 1 + prefill + decode == nnodes
    • co-located proxy on the first prefill node: prefill + decode == nnodes
      xP/yD and the exported topology env are unchanged; the proxy is launched by
      the model script, so both layouts are valid. Error message updated to state the
      expected node count for each topology.
  • Lower the minimum node-count guard from < 3 to < 2, since the validation
    above already permits prefill + decode == nnodes. Updated the docstring
    (Minimum cluster: 2 nodes, nnodes must be >= 2) and the corresponding
    ValueError message accordingly.

The sglang-disagg launcher hard-required a dedicated proxy node and a minimum
of 3 nodes (1 proxy + 1 prefill + 1 decode), unlike the vllm-disagg launcher
which leaves topology to the model script. Support a co-located proxy/router on
the first prefill node and lower the minimum cluster size to 2 nodes.

- Relax custom-split validation to accept both topologies: dedicated proxy
  (1 + xP + yD == nnodes) or co-located proxy (xP + yD == nnodes). xP/yD and the
  exported topology env are unchanged; the proxy is started by the model run.sh,
  so both layouts are valid. Error message updated accordingly.
- Lower the node-count guard from < 3 to < 2 and update the docstring and error
  message to reflect the co-located minimum.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the SLURM SGLang-disaggregated launcher setup to support a co-located proxy/router topology (proxy runs on the first prefill node) in addition to the existing dedicated proxy-node topology, reducing the minimum cluster size to 2 nodes.

Changes:

  • Relaxes custom-split validation to accept either 1 + xP + yD == nnodes (dedicated proxy) or xP + yD == nnodes (co-located proxy).
  • Lowers the minimum-node guard and updates related docstrings/error messages accordingly.
  • Updates the generated “Cluster Configuration” comment block to describe the two supported topologies.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/madengine/deployment/slurm.py
Comment thread src/madengine/deployment/slurm.py
@mkuznet1 mkuznet1 marked this pull request as ready for review June 26, 2026 11:15
@mkuznet1

Copy link
Copy Markdown
Contributor Author

Test run (8a166ba)

707 passed / 17 failed / 2 skipped (unit 522/2, integration 139/1, e2e 46/14).

Cat # Root cause
A 13 user not in render/videoamd-smi warnings break GPU count (17 vs 8) → empty product name
B 2 stale keepAlive tests expect container name without _node0 suffix
C 2 node has /nonexistent dir (0750) → PermissionError not FileNotFoundError

A (13): integration test_renderD_count_matches_gpu_count; e2e Discover test_dynamic/test_multiple; 9 profiling tests; test_gpu_product_name_matches_arch.
B (2): test_keepAlive_keeps_docker_alive, test_keepAlive_preserves_model_dir.
C (2): unit test_upload_file_to_mongodb_file_not_found, test_validate_additional_context_file_not_found.

Bottom line: 15/17 fixable without code (groups or root-in-container); 2 keepAlive need a test fix.

…roxy

The default (no custom prefill/decode) split assumed a dedicated proxy and
computed yD = nnodes - 1 - xP, which yields yD=0 for nnodes=2 -- an invalid
topology that contradicts the new 2-node co-located minimum. Special-case
nnodes==2 to a co-located 1 prefill + 1 decode split.

Add unit tests for the default-split path: nnodes=2 is co-located (xP=1,
yD=1), every supported nnodes yields >=1 prefill and >=1 decode, and
nnodes<2 raises ValueError.
@mkuznet1

Copy link
Copy Markdown
Contributor Author

The default-split path now special-cases nnodes == 2 to a co-located 1 prefill + 1 decode split (xP=1, yD=1) instead of yD=0. For nnodes >= 3 the dedicated-proxy default already yields yD >= 1.

Added test coverage: new TestSglangDisaggDefaultSplit exercises the default-split path — nnodes=2 is co-located (xP=1, yD=1), parametrized nnodes in [2,3,4,5,6,8] all yield >=1 prefill and >=1 decode, and nnodes<2 raises ValueError. Local unit run: 28 passed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants