ccl: hierarchical cross-node AllGather (intra-node SDMA + inter-node RDMA) by inkcherry · Pull Request #441 · ROCm/mori

inkcherry · 2026-07-01T03:49:39Z

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

…RDMA) Add mori.ccl.HierAllGather: an all_gather_into_tensor-compatible collective that keeps intra-node traffic on the SDMA copy engines (XGMI) and moves inter-node traffic over RDMA. A fused ring||local-gather kernel runs the inter-node RDMA ring concurrently with the ring-independent local node-block SDMA gather in one grid (stream-ordered, direct-to-output, no staging copy).

Bit-exact vs torch.distributed.all_gather_into_tensor for {bf16,fp16,fp32,int32}. On 2 nodes x 4 GPUs (MI355X), fp32: standalone bandwidth >= RCCL for sizes >=8MB (1.19-1.35x); under a concurrent GEMM the SDMA path overlaps with compute and is 16-20% faster than RCCL at 128-512MB (copy engines vs CU contention).

Includes tests (test_hier_allgather*), size-sweep + gemm-overlap benches, a plot script, and the measured result charts/CSVs under benchmarks/.
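To make the two-level data flow concrete, here is a minimal pure-Python model of a hierarchical all-gather. It is only a sketch under stated assumptions, not the kernel in this PR: the function name `hier_all_gather` is hypothetical, the ring is assumed to run between ranks that share the same local index on each node, and remote shards are assumed to be redistributed inside the node afterwards. The phases run sequentially here, whereas the PR describes the ring and the local gather running concurrently in one grid.

```python
def hier_all_gather(shards, ranks_per_node):
    """Simulate a hierarchical all-gather; returns one output list per rank.

    shards[r] is the input contributed by global rank r. This is a data-flow
    model only (assumed schedule), with plain list copies standing in for
    SDMA and RDMA transfers.
    """
    world = len(shards)
    if world % ranks_per_node != 0:
        raise ValueError("world size must be a multiple of ranks_per_node")
    num_nodes = world // ranks_per_node

    # out[r][s] is rank r's copy of the shard contributed by rank s.
    out = [[None] * world for _ in range(world)]

    # Local node-block gather (stands in for the intra-node SDMA copies):
    # every rank fills the output slots that belong to its own node.
    for r in range(world):
        base = (r // ranks_per_node) * ranks_per_node
        for peer in range(base, base + ranks_per_node):
            out[r][peer] = shards[peer]

    # Inter-node ring (stands in for RDMA): ranks with the same local index
    # form a ring across nodes; at each step a rank receives from its
    # predecessor the shard that originated `step` nodes behind it.
    for step in range(1, num_nodes):
        for r in range(world):
            node, local = divmod(r, ranks_per_node)
            prev = ((node - 1) % num_nodes) * ranks_per_node + local
            src = ((node - step) % num_nodes) * ranks_per_node + local
            out[r][src] = out[prev][src]

    # Intra-node redistribution of the remote shards: each rank reads the
    # shards its local peers received over their own rings.
    for r in range(world):
        node = r // ranks_per_node
        base = node * ranks_per_node
        for peer_local in range(ranks_per_node):
            peer = base + peer_local
            for remote_node in range(num_nodes):
                if remote_node == node:
                    continue
                slot = remote_node * ranks_per_node + peer_local
                out[r][slot] = out[peer][slot]

    return out


# 2 nodes x 4 ranks, matching the topology benchmarked in the PR.
shards = [f"shard{r}" for r in range(8)]
outputs = hier_all_gather(shards, ranks_per_node=4)
```

With a single node the ring and redistribution loops do nothing, so the model reduces to the local gather alone, which mirrors the single-node behaviour the PR describes.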

…ignature HierAllGather now auto-detects the node-local rank count (LOCAL_WORLD_SIZE, else hostname grouping, else npes) so callers use the same constructor/call signature as the flat AllgatherSdma with no new required argument. ranks_per_node is now optional and keyword-only; added transit_buffer_size for signature parity. Single node still degenerates to the pure intra-node SDMA path.
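The fallback order described above (LOCAL_WORLD_SIZE, then hostname grouping, then npes) could be sketched roughly as follows. This is an illustration, not the code in the PR: `detect_ranks_per_node` and its parameters are hypothetical names, and `all_hostnames` is assumed to be a list of every rank's hostname that the caller has already collected.

```python
import os
from collections import Counter


def detect_ranks_per_node(npes, my_hostname=None, all_hostnames=None, env=None):
    """Guess how many ranks share this node (hypothetical helper).

    Order of preference, following the commit message:
      1. the LOCAL_WORLD_SIZE environment variable,
      2. the number of ranks reporting the same hostname as this one,
      3. npes, i.e. treat the whole job as a single node.
    """
    env = os.environ if env is None else env

    value = env.get("LOCAL_WORLD_SIZE", "")
    if value.isdigit() and int(value) > 0:
        return int(value)

    if my_hostname is not None and all_hostnames:
        count = Counter(all_hostnames)[my_hostname]
        if count > 0:
            return count

    return npes


# Example: 8 ranks spread over two hosts, no launcher-provided variable.
hosts = ["node-a"] * 4 + ["node-b"] * 4
ranks_per_node = detect_ranks_per_node(8, "node-a", hosts, env={})
```

Passing `env` explicitly keeps the sketch testable without touching the real process environment.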

Resolve conflicts with the upstream param-contiguous SDMA allgather:
- oneshot_sdma_kernel.hpp: keep both the hierarchical sub-group/broadcast SDMA kernels and the upstream param-contiguous kernel (additive).
- symmetric_memory.cpp: keep deviceHandles_d indexing by global pe (the array is worldSize-sized and all SDMA kernels index by global pe); adopt the upstream non-fatal GPU-metadata teardown.

Add HierAllGather.all_gather(tensor_list, tensor) matching torch.distributed.all_gather (list output), built on the same hierarchical intra-node SDMA / inter-node RDMA path as the contiguous all_gather_into_tensor. Bit-exact vs torch across {bf16,fp16,fp32,int32}; adds test_hier_allgather_list.
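The two output shapes relate simply: all_gather_into_tensor writes one contiguous buffer holding every rank's shard in rank order, while all_gather fills a list with one entry per rank. The sketch below shows that relationship on plain Python lists; `split_gathered` is a hypothetical helper, and the PR does not say whether the list variant is implemented this way.

```python
def split_gathered(flat, world_size):
    """Split a rank-ordered contiguous gather result into per-rank chunks.

    Hypothetical helper: chunk i corresponds to the shard contributed by
    rank i, which is what a list-output all_gather would place in entry i.
    """
    if world_size <= 0 or len(flat) % world_size != 0:
        raise ValueError("buffer length must be a positive multiple of world_size")
    shard_len = len(flat) // world_size
    return [flat[i * shard_len:(i + 1) * shard_len] for i in range(world_size)]


# Four ranks, two elements each, laid out contiguously in rank order.
contiguous = [0, 0, 1, 1, 2, 2, 3, 3]
per_rank = split_gathered(contiguous, world_size=4)
```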


…A + inter RDMA) Adds a single drop-in FSDP2 AllGather backend, MoriAllGather, used identically for single-node and cross-node runs via the stock FSDPModule.set_custom_all_gather API. It routes intra-node traffic over the SDMA copy engines (XGMI) and, when the process group spans multiple nodes, inter-node traffic over RDMA. The same object handles both cases, so user code is unchanged between one node and many. (MoriHierAllGather is kept as a backward-compatible alias.) Includes the Qwen-7B FSDP2 step benchmark, a 2-node driver, and the chart script. No mori source change is needed; the HierAllGather primitive already exists and handles the single-node case as pure intra-node SDMA.

inkcherry added 7 commits June 30, 2026 07:50

docs/bench: use English wording (drop non-ASCII in README and gemm-overlap bench)

7fb4db3

bench: annotate result charts as cross-node (2 nodes x 4 GPU, intra-SDMA + inter-RDMA)

e14abec

inkcherry force-pushed the inkcherry/sdma-hier-allgather branch from c8d4eca to 4188668 Compare July 3, 2026 12:20

inkcherry wants to merge 7 commits into main from inkcherry/sdma-hier-allgather
