Skip to content

fix(rollout): stop controller-managed workers from dp-scaling staleness capacity#1471

Open
Le8r0nJames wants to merge 1 commit into
areal-project:mainfrom
Le8r0nJames:fix/rollout-worker-staleness-dp-scale
Open

fix(rollout): stop controller-managed workers from dp-scaling staleness capacity#1471
Le8r0nJames wants to merge 1 commit into
areal-project:mainfrom
Le8r0nJames:fix/rollout-worker-staleness-dp-scale

Conversation

@Le8r0nJames

@Le8r0nJames Le8r0nJames commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

Description

RolloutController manages staleness globally, but controller-managed workers also applied their own dp-scaled staleness constraint, dividing capacity by dist.get_world_size(). With N workers this shrinks effective rollout capacity by N and can stall generation.

Force train_data_parallel_size=1 at engine initialization so workers under the controller do not re-scale capacity. The guard treats both a missing key and an explicit None as "not configured" (kwargs.get(...) is None), since an explicit None reaching the worker would fall back to world-size scaling in WorkflowExecutor; an explicitly configured non-None value is left untouched. Standalone (non-controller) usage is unaffected.

Related Issue

N/A.

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📝 Documentation update
  • ♻️ Refactoring
  • ⚡ Performance improvement
  • ✅ Test coverage improvement

Checklist

  • I have read the
    Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Additional Context

No current caller passes train_data_parallel_size into RolloutController.initialize (the controller-path init_kwargs never contains the key, and the non-controller path passes an int property directly to the engine), so the is None guard is defensive hardening for future callers that may forward an Optional[int] config value.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the engine initialization in rollout_controller.py to set train_data_parallel_size to 1 by default, preventing controller-managed workers from incorrectly applying data-parallel scaled staleness constraints. The reviewer recommended replacing kwargs.setdefault with an explicit check for None to handle cases where the key is explicitly passed as None instead of being entirely absent.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +270 to +274
# Workers are controller-managed: the controller handles staleness
# globally, so workers must NOT apply their own dp-scaled staleness
# constraints. Pass train_data_parallel_size=1 to prevent workers
# from dividing capacity by dist.get_world_size().
kwargs.setdefault("train_data_parallel_size", 1)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using kwargs.setdefault only sets the default value if the key "train_data_parallel_size" is entirely absent from kwargs. However, if train_data_parallel_size is explicitly passed as None (for example, due to configuration parsing defaults), setdefault will keep the None value. This would cause the worker's WorkflowExecutor to fall back to the data parallel world size, re-introducing the capacity-shrinking bug.

To make this more robust and defensively handle both missing keys and explicit None values, check if the value is None using kwargs.get(...) is None.

Suggested change
# Workers are controller-managed: the controller handles staleness
# globally, so workers must NOT apply their own dp-scaled staleness
# constraints. Pass train_data_parallel_size=1 to prevent workers
# from dividing capacity by dist.get_world_size().
kwargs.setdefault("train_data_parallel_size", 1)
# Workers are controller-managed: the controller handles staleness
# globally, so workers must NOT apply their own dp-scaled staleness
# constraints. Pass train_data_parallel_size=1 to prevent workers
# from dividing capacity by dist.get_world_size().
if kwargs.get("train_data_parallel_size") is None:
kwargs["train_data_parallel_size"] = 1

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

…ss capacity

RolloutController manages staleness globally, but workers also applied
their own dp-scaled staleness constraint, dividing capacity by
dist.get_world_size(). Force train_data_parallel_size=1 at engine
initialization (unless a caller explicitly configures a non-None value)
so controller-managed workers do not re-scale capacity; an explicit None
must not survive either, or workers fall back to the world-size scaling.
@Le8r0nJames Le8r0nJames force-pushed the fix/rollout-worker-staleness-dp-scale branch from 53ae3a7 to eaa9054 Compare July 2, 2026 11:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant