ccl: hierarchical cross-node AllGather (intra-node SDMA + inter-node RDMA)#441

Open
inkcherry wants to merge 7 commits into main from inkcherry/sdma-hier-allgather

@inkcherry

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

…RDMA)

Add mori.ccl.HierAllGather: an all_gather_into_tensor-compatible collective that
keeps intra-node traffic on the SDMA copy engines (XGMI) and moves inter-node
traffic over RDMA. A fused ring||local-gather kernel runs the inter-node RDMA
ring concurrently with the ring-independent local node-block SDMA gather in one
grid (stream-ordered, direct-to-output, no staging copy).
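The data movement can be illustrated with a small pure-Python simulation. This is only a sketch of the hierarchical idea, not the PR's kernel: the ring topology (ranks with the same local index form one ring across nodes) and the final intra-node exchange of remote shards are assumptions made for illustration, and the names `hier_all_gather` and `flat_all_gather` are hypothetical.

```python
def flat_all_gather(shards):
    """Reference result: every rank holds all shards in global-rank order."""
    return [list(shards) for _ in shards]


def hier_all_gather(shards, ranks_per_node):
    """Simulate a two-level all-gather; returns each rank's output buffer."""
    world = len(shards)
    assert world % ranks_per_node == 0, "nodes must have equal rank counts"
    num_nodes = world // ranks_per_node
    # out[r][g] is rank r's output slot for the shard of global rank g.
    out = [[None] * world for _ in range(world)]

    # Local node-block gather (SDMA in the PR): independent of the ring,
    # written directly into the output slots of the rank's own node block.
    for r in range(world):
        node = r // ranks_per_node
        for peer in range(node * ranks_per_node, (node + 1) * ranks_per_node):
            out[r][peer] = shards[peer]

    # Inter-node ring (RDMA in the PR). Assumed topology: one ring per local
    # index; at step s a rank receives the shard that originated s nodes back.
    for local in range(ranks_per_node):
        ring = [n * ranks_per_node + local for n in range(num_nodes)]
        for step in range(1, num_nodes):
            for n, r in enumerate(ring):
                src = ((n - step) % num_nodes) * ranks_per_node + local
                out[r][src] = shards[src]

    # Assumed final step: ranks on a node exchange the remote shards each of
    # them received, filling the remaining slots.
    for r in range(world):
        node = r // ranks_per_node
        for peer in range(node * ranks_per_node, (node + 1) * ranks_per_node):
            for g in range(peer % ranks_per_node, world, ranks_per_node):
                if out[r][g] is None:
                    out[r][g] = out[peer][g]
    return out
```

With `ranks_per_node` equal to the world size there is a single node, the ring loop does nothing, and only the local gather runs.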

Bit-exact vs torch.distributed.all_gather_into_tensor for {bf16,fp16,fp32,int32}.
On 2 nodes x 4 GPUs (MI355X) with fp32, standalone bandwidth is at or above
RCCL's for sizes >= 8 MB (1.19-1.35x). Under a concurrent GEMM the SDMA path
overlaps with compute and is 16-20% faster than RCCL at 128-512 MB, since the
copy engines do not contend with the GEMM for CUs.

Includes tests (test_hier_allgather*), size-sweep + gemm-overlap benches, a
plot script, and the measured result charts/CSVs under benchmarks/.
…ignature

HierAllGather now auto-detects the node-local rank count (LOCAL_WORLD_SIZE, else
hostname grouping, else npes) so callers use the same constructor/call signature
as the flat AllgatherSdma with no new required argument. ranks_per_node is now
optional and keyword-only; added transit_buffer_size for signature parity.
Single node still degenerates to the pure intra-node SDMA path.
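
The fallback order described above (LOCAL_WORLD_SIZE, then hostname grouping, then npes) could look roughly like the following. This is a hedged sketch rather than the code in the PR: the function name `detect_ranks_per_node`, the `hostnames` argument (assumed to be one hostname per rank, collected by the caller), and the handling of invalid or uneven values are all assumptions.

```python
import os
from collections import Counter


def detect_ranks_per_node(npes, hostnames=None, env=None):
    """Guess the node-local rank count using three fallbacks, in order."""
    env = os.environ if env is None else env

    # 1) LOCAL_WORLD_SIZE, as exported by launchers such as torchrun.
    value = env.get("LOCAL_WORLD_SIZE")
    if value is not None and value.isdigit():
        n = int(value)
        if n > 0 and npes % n == 0:
            return n

    # 2) Hostname grouping: accept only if every host runs the same number
    #    of ranks (assumption: uneven layouts are not supported).
    if hostnames:
        counts = set(Counter(hostnames).values())
        if len(counts) == 1:
            n = counts.pop()
            if npes % n == 0:
                return n

    # 3) Treat the whole job as a single node.
    return npes
```

Falling through to `npes` matches the single-node behaviour noted above, where the collective reduces to the pure intra-node SDMA path.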

Resolve conflicts with the upstream param-contiguous SDMA allgather:
- oneshot_sdma_kernel.hpp: keep both the hierarchical sub-group/broadcast SDMA
  kernels and the upstream param-contiguous kernel (additive).
- symmetric_memory.cpp: keep deviceHandles_d indexing by global pe (the array is
  worldSize-sized and all SDMA kernels index by global pe); adopt the upstream
  non-fatal GPU-metadata teardown.

Add HierAllGather.all_gather(tensor_list, tensor) matching
torch.distributed.all_gather (list output), built on the same hierarchical
intra-node SDMA / inter-node RDMA path as the contiguous all_gather_into_tensor.
Bit-exact vs torch across {bf16,fp16,fp32,int32}; adds test_hier_allgather_list.
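
The list form and the contiguous form hold the same data: entry i of the list equals the i-th equal-sized chunk of the contiguous output. The sketch below shows that relationship with plain Python buffers. It does not claim how the PR fills `tensor_list` (views over one buffer or per-tensor writes); `split_gathered` is a hypothetical helper name.

```python
def split_gathered(flat, world_size):
    """Split a contiguous all-gather result into per-rank chunks.

    `flat` is any sliceable buffer whose length is a multiple of world_size;
    chunk i corresponds to the data contributed by rank i.
    """
    assert len(flat) % world_size == 0, "output must divide evenly by rank"
    chunk = len(flat) // world_size
    return [flat[i * chunk:(i + 1) * chunk] for i in range(world_size)]


# Using a memoryview keeps the chunks as zero-copy views of the same storage.
gathered = bytearray(range(8))               # stand-in for the gathered output
per_rank = split_gathered(memoryview(gathered), 4)
```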
…A + inter RDMA)

Adds a single drop-in FSDP2 AllGather backend, MoriAllGather, used identically
for single-node and cross-node runs via the stock
FSDPModule.set_custom_all_gather API. It routes intra-node traffic over SDMA
copy engines (XGMI) and, when the process group spans multiple nodes, inter-node
traffic over RDMA. The same object handles both, so user code is unchanged
between one node and many. (MoriHierAllGather is kept as a backward-compatible alias.)

Includes the Qwen-7B FSDP2 step benchmark, a 2-node driver, and the chart
script. No mori source change needed; the HierAllGather primitive already exists
and handles the single-node case as pure intra-node SDMA.
@inkcherry force-pushed the inkcherry/sdma-hier-allgather branch from c8d4eca to 4188668 on July 3, 2026 12:20