fix(rollout): stop controller-managed workers from dp-scaling staleness capacity#1471
Conversation
There was a problem hiding this comment.
Code Review
This pull request updates the engine initialization in rollout_controller.py to set train_data_parallel_size to 1 by default, preventing controller-managed workers from incorrectly applying data-parallel scaled staleness constraints. The reviewer recommended replacing kwargs.setdefault with an explicit check for None to handle cases where the key is explicitly passed as None instead of being entirely absent.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| # Workers are controller-managed: the controller handles staleness | ||
| # globally, so workers must NOT apply their own dp-scaled staleness | ||
| # constraints. Pass train_data_parallel_size=1 to prevent workers | ||
| # from dividing capacity by dist.get_world_size(). | ||
| kwargs.setdefault("train_data_parallel_size", 1) |
There was a problem hiding this comment.
Using kwargs.setdefault only sets the default value if the key "train_data_parallel_size" is entirely absent from kwargs. However, if train_data_parallel_size is explicitly passed as None (for example, due to configuration parsing defaults), setdefault will keep the None value. This would cause the worker's WorkflowExecutor to fall back to the data parallel world size, re-introducing the capacity-shrinking bug.
To make this more robust and defensively handle both missing keys and explicit None values, check if the value is None using kwargs.get(...) is None.
| # Workers are controller-managed: the controller handles staleness | |
| # globally, so workers must NOT apply their own dp-scaled staleness | |
| # constraints. Pass train_data_parallel_size=1 to prevent workers | |
| # from dividing capacity by dist.get_world_size(). | |
| kwargs.setdefault("train_data_parallel_size", 1) | |
| # Workers are controller-managed: the controller handles staleness | |
| # globally, so workers must NOT apply their own dp-scaled staleness | |
| # constraints. Pass train_data_parallel_size=1 to prevent workers | |
| # from dividing capacity by dist.get_world_size(). | |
| if kwargs.get("train_data_parallel_size") is None: | |
| kwargs["train_data_parallel_size"] = 1 |
…ss capacity RolloutController manages staleness globally, but workers also applied their own dp-scaled staleness constraint, dividing capacity by dist.get_world_size(). Force train_data_parallel_size=1 at engine initialization (unless a caller explicitly configures a non-None value) so controller-managed workers do not re-scale capacity; an explicit None must not survive either, or workers fall back to the world-size scaling.
53ae3a7 to
eaa9054
Compare
Description
RolloutControllermanages staleness globally, but controller-managed workers also applied their own dp-scaled staleness constraint, dividing capacity bydist.get_world_size(). With N workers this shrinks effective rollout capacity by N and can stall generation.Force
train_data_parallel_size=1at engine initialization so workers under the controller do not re-scale capacity. The guard treats both a missing key and an explicitNoneas "not configured" (kwargs.get(...) is None), since an explicitNonereaching the worker would fall back to world-size scaling inWorkflowExecutor; an explicitly configured non-None value is left untouched. Standalone (non-controller) usage is unaffected.Related Issue
N/A.
Type of Change
Checklist
Contributing Guide
pre-commit run --all-files)./docs/build_all.sh)main/review-prcommand/create-prAdditional Context
No current caller passes
train_data_parallel_sizeintoRolloutController.initialize(the controller-pathinit_kwargsnever contains the key, and the non-controller path passes anintproperty directly to the engine), so theis Noneguard is defensive hardening for future callers that may forward anOptional[int]config value.