[WIP] single controller: add rollout controller #415
Conversation
Summary of Changes

Hello @dingzhiqiang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request lays crucial groundwork for enabling scalable and fault-tolerant distributed machine learning workflows. It introduces dedicated controllers for managing both distributed model rollouts and training, alongside a comprehensive refactoring of the underlying API definitions. These changes streamline the orchestration of distributed tasks, improve inter-worker communication robustness, and provide greater flexibility in deploying distributed components, ultimately paving the way for more complex and efficient distributed ML systems.
Code Review

This pull request introduces new `DistributedRolloutController` and `DistributedTrainController` classes, along with significant refactoring of the controller, scheduler, and engine APIs. The changes aim to establish a more robust single-controller architecture. While the overall direction is good, I've identified several critical issues, primarily concerning violations of the Liskov Substitution Principle in the `DistributedRolloutController`, which break the established API contracts. Additionally, there are concerns regarding resource management, code quality (e.g., code duplication, magic strings, foreign-language comments), and robustness in error handling. Addressing these points will be crucial for the stability and maintainability of the new controller framework.
```python
def wait(self, counts: List[int], timeout: float | None = None) -> DistributedBatch:
    assert len(counts) == len(self.dp_head_workers)
    results = self.custom_function_call("wait", counts, timeout)
    return DistributedBatch.concat(results)
```
The signature of this `wait` method, `(self, counts: List[int], ...)`, is incompatible with the base class `RolloutController.wait`, which has the signature `(self, count: int, ...)`. This violates the Liskov Substitution Principle. Subclass methods should have signatures compatible with their parent classes to ensure polymorphism works as expected.
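For illustration, a signature-compatible override could accept the aggregate `count` and derive the per-worker counts internally. This is a minimal sketch assuming the base contract is `wait(self, count: int, timeout: float | None = None) -> DistributedBatch`; the even-split policy is a placeholder assumption:

```python
def wait(self, count: int, timeout: float | None = None) -> DistributedBatch:
    # Keep the base-class signature; derive per-worker counts internally
    # instead of pushing that burden onto callers.
    n = len(self.dp_head_workers)
    counts = [count // n + (1 if i < count % n else 0) for i in range(n)]
    results = self.custom_function_call("wait", counts, timeout)
    return DistributedBatch.concat(results)
```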
```python
def update_weights(self, meta: WeightUpdateMeta) -> None:
    """Update weights in the inference engine."""
    self.custom_function_call("update_weights", None, meta)
    return None
```
This implementation of `update_weights` violates the Liskov Substitution Principle. The base class `RolloutController.update_weights` is defined as a non-blocking method that returns a `Future`, but this implementation is blocking (due to `rpc_call`) and returns `None`. This breaks the API contract and can lead to deadlocks or unexpected behavior in client code that expects an asynchronous operation.
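A minimal sketch of a contract-conforming version, assuming the class owns a background executor (the `self._rpc_executor` attribute is a hypothetical addition, e.g. `ThreadPoolExecutor(max_workers=1)` created in `__init__`):

```python
from concurrent.futures import Future, ThreadPoolExecutor

def update_weights(self, meta: WeightUpdateMeta) -> Future:
    """Update weights in the inference engine without blocking the caller."""
    # Offload the blocking fan-out RPC to a worker thread so the method
    # returns a Future immediately, as the base class promises.
    return self._rpc_executor.submit(
        self.custom_function_call, "update_weights", None, meta
    )
```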
```python
def prepare_batch(self, data: DistributedBatch, workflow: RolloutWorkflow) -> None:
    """Asynchronously submit a request to the inference engine. Exits immediately."""
    batches = data.chunk(self.alloc_mode.gen.dp_size)
    self.custom_function_call("prepare_batch", batches, workflow)
    return None
```
This `prepare_batch` method returns `None`, whereas the base class `RolloutController.prepare_batch` is type-hinted to return a `DistributedBatch`. This violates the Liskov Substitution Principle and breaks the API contract. The implementation should be updated to conform to the base class signature and behavior.
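A sketch of a conforming implementation, assuming `custom_function_call` returns the per-worker results in order (as the `wait` implementation above implies):

```python
def prepare_batch(
    self, data: DistributedBatch, workflow: RolloutWorkflow
) -> DistributedBatch:
    """Submit a request to the inference engine and return the resulting
    batch, matching the base-class return type."""
    batches = data.chunk(self.alloc_mode.gen.dp_size)
    results = self.custom_function_call("prepare_batch", batches, workflow)
    return DistributedBatch.concat(results)
```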
```python
if exit_on_exception:
    logger.info("Exiting due to exception in future.")
    os.kill(os.getpid(), signal.SIGTERM)
```
Using `os.kill(os.getpid(), signal.SIGTERM)` to handle an exception is a very aggressive approach that can be dangerous. It terminates the entire process abruptly, preventing any cleanup code (in `finally` blocks or `atexit` handlers) from running. This can lead to corrupted state or resource leaks. A better approach is to re-raise the exception and let the caller decide on the appropriate action, which might include a graceful shutdown:
```python
if exit_on_exception:
    logger.info("Exiting due to exception in future.")
    raise e
```
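With the exception re-raised, a caller that owns resources can still clean up before exiting. A hypothetical call site (the `executor` object and its `shutdown` call are assumptions, not part of this PR):

```python
try:
    results = wait_future_ordered(futures, exit_on_exception=True)
except Exception:
    logger.exception("A worker future failed; cleaning up before exit.")
    executor.shutdown(wait=False)  # hypothetical resource cleanup
    raise
```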
```python
    except TimeoutError:
        pass
else:
    self.data_list_index = 0
```
The instance attribute `self.data_list_index` is initialized here for the first time. It's a best practice to declare all instance attributes in the `__init__` method of the class (`WorkflowExecutor`). This prevents potential `AttributeError` exceptions if other methods are called before this one, and it improves code readability by providing a single place to see all attributes of an object.
```diff
- self.data_list_index = 0
+ # This should be initialized in WorkflowExecutor.__init__
+ if not hasattr(self, "data_list_index"):
+     self.data_list_index = 0
```
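Alternatively, declaring the attribute in the constructor removes the need for the `hasattr` guard entirely. A minimal sketch, with the rest of `WorkflowExecutor.__init__` elided (the `*args, **kwargs` signature is a placeholder assumption):

```python
class WorkflowExecutor:
    def __init__(self, *args, **kwargs):
        ...
        # Read cursor into List-style dataloaders; declared here so any
        # method can safely reference it before prepare_batch runs.
        self.data_list_index = 0
```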
```python
def custom_function_call(self, method: str, batches, *args, **kwargs):
    return rpc_call(self.scheduler, self.dp_head_workers, method, batches, args, kwargs)
```
Using magic strings like `"update_weights"` and `"prepare_batch"` for RPC method names is brittle. Typos or changes in the remote method names will only be caught at runtime. It would be more robust to define these method names as constants in a shared API module; this allows static analysis and reduces the risk of runtime errors.
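A sketch of such a shared module; the module name and members are illustrative assumptions (the `str` mixin keeps existing call sites that expect plain strings working):

```python
# rpc_methods.py -- hypothetical shared API module
import enum

class RpcMethod(str, enum.Enum):
    """RPC method names shared by controllers and remote workers."""
    WAIT = "wait"
    UPDATE_WEIGHTS = "update_weights"
    PREPARE_BATCH = "prepare_batch"

# Call sites then reference the constant instead of a bare string, e.g.:
#     self.custom_function_call(RpcMethod.UPDATE_WEIGHTS.value, None, meta)
```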
```python
    results = wait_future_ordered(futures, exit_on_exception=True)
except Exception as e:
    raise RuntimeError(f"{method} failed, error: {e}")
```
When an exception occurs, it is wrapped in a `RuntimeError` without preserving the original exception's context. This can make debugging more difficult, as the original traceback is lost. It's better to use `raise RuntimeError(...) from e` to chain the exceptions and preserve the full context:
```diff
      results = wait_future_ordered(futures, exit_on_exception=True)
  except Exception as e:
-     raise RuntimeError(f"{method} failed, error: {e}")
+     raise RuntimeError(f"{method} failed, error: {e}") from e
```
```python
    return self.workflow_executor.resume()


def get_scheduling_config(self) -> List[Scheduling]:
    # 部署 launcher/sglang_server.py, local_scheduler 注入一个ENGINE_PORTS的端口环境变量,里面有两个端口
```
This comment is in Chinese. For consistency and maintainability of the codebase, please write all comments in English.
```diff
- # 部署 launcher/sglang_server.py, local_scheduler 注入一个ENGINE_PORTS的端口环境变量,里面有两个端口
+ # Deploy launcher/sglang_server.py; local_scheduler injects an ENGINE_PORTS environment variable that contains two ports
```
```python
Returns
-------
Scheduling
    The scheduling configuration for the engine
```
```python
    # Handle the StatefulDataLoader type -- keep the original logic unchanged
    if not hasattr(self, "data_generator"):
        self.data_generator = cycle_dataloader(dataloader)
    assert dataloader.batch_size is not None
    batch_size = dataloader.batch_size

    while True:
        # Submit at least two batches to allow maximum overlap
        if (
            self.get_capacity() + batch_size > 0
            and self.input_queue.qsize() + batch_size
            < self.input_queue.maxsize
        ):
            data = next(self.data_generator)
            for item in data:
                self.submit(
                    item,
                    workflow=workflow,
                    workflow_builder=workflow_builder,
                    should_accept=should_accept,
                )
        try:
            return self.wait(batch_size, timeout=1)
        except TimeoutError:
            pass
else:
    self.data_list_index = 0

    # For the List type, use a fixed batch_size of 1
    batch_size = 1

    while True:
        # Submit at least two batches to allow maximum overlap
        if (
            self.get_capacity() + batch_size > 0
            and self.input_queue.qsize() + batch_size
            < self.input_queue.maxsize
        ):
            # Fetch items from the List, with wrap-around for cyclic access
            if self.data_list_index >= len(dataloader):
                self.data_list_index = 0  # wrap around to the start
```