fix(slurm): support co-located proxy in sglang-disagg (min 2 nodes)#149
fix(slurm): support co-located proxy in sglang-disagg (min 2 nodes)#149mkuznet1 wants to merge 2 commits into
Conversation
The sglang-disagg launcher hard-required a dedicated proxy node and a minimum of 3 nodes (1 proxy + 1 prefill + 1 decode), unlike the vllm-disagg launcher which leaves topology to the model script. Support a co-located proxy/router on the first prefill node and lower the minimum cluster size to 2 nodes. - Relax custom-split validation to accept both topologies: dedicated proxy (1 + xP + yD == nnodes) or co-located proxy (xP + yD == nnodes). xP/yD and the exported topology env are unchanged; the proxy is started by the model run.sh, so both layouts are valid. Error message updated accordingly. - Lower the node-count guard from < 3 to < 2 and update the docstring and error message to reflect the co-located minimum.
There was a problem hiding this comment.
Pull request overview
This PR updates the SLURM SGLang-disaggregated launcher setup to support a co-located proxy/router topology (proxy runs on the first prefill node) in addition to the existing dedicated proxy-node topology, reducing the minimum cluster size to 2 nodes.
Changes:
- Relaxes custom-split validation to accept either
1 + xP + yD == nnodes(dedicated proxy) orxP + yD == nnodes(co-located proxy). - Lowers the minimum-node guard and updates related docstrings/error messages accordingly.
- Updates the generated “Cluster Configuration” comment block to describe the two supported topologies.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Test run ( 707 passed / 17 failed / 2 skipped (unit 522/2, integration 139/1, e2e 46/14).
A (13): integration Bottom line: 15/17 fixable without code (groups or root-in-container); 2 keepAlive need a test fix. |
…roxy The default (no custom prefill/decode) split assumed a dedicated proxy and computed yD = nnodes - 1 - xP, which yields yD=0 for nnodes=2 -- an invalid topology that contradicts the new 2-node co-located minimum. Special-case nnodes==2 to a co-located 1 prefill + 1 decode split. Add unit tests for the default-split path: nnodes=2 is co-located (xP=1, yD=1), every supported nnodes yields >=1 prefill and >=1 decode, and nnodes<2 raises ValueError.
|
The default-split path now special-cases nnodes == 2 to a co-located 1 prefill + 1 decode split (xP=1, yD=1) instead of yD=0. For nnodes >= 3 the dedicated-proxy default already yields yD >= 1. Added test coverage: new TestSglangDisaggDefaultSplit exercises the default-split path — nnodes=2 is co-located (xP=1, yD=1), parametrized nnodes in [2,3,4,5,6,8] all yield >=1 prefill and >=1 decode, and nnodes<2 raises ValueError. Local unit run: 28 passed. |
Motivation
The SGLang disaggregated launcher hard-required a dedicated proxy node and a
minimum of 3 nodes (1 proxy + 1 prefill + 1 decode). This blocked the minimal
and more flexible co-located layout, where the proxy/router runs on the first
prefill node (started by the model
run.sh), as already done by the vllm-disagglauncher. The goal is to support both topologies and lower the minimum cluster
size to 2 nodes.
Technical Details
All changes are in
src/madengine/deployment/slurm.py:1 + prefill + decode == nnodesprefill + decode == nnodesxP/yDand the exported topology env are unchanged; the proxy is launched bythe model script, so both layouts are valid. Error message updated to state the
expected node count for each topology.
< 3to< 2, since the validationabove already permits
prefill + decode == nnodes. Updated the docstring(
Minimum cluster: 2 nodes,nnodes must be >= 2) and the correspondingValueErrormessage accordingly.