[Bugfix][P/D] Fix throughput stats in disaggregated setup #27569
Changes to `EngineCoreOutput`:

```diff
@@ -121,6 +121,8 @@ class EngineCoreOutput(
     trace_headers: Mapping[str, str] | None = None
     # The number of tokens with prefix cache hits.
     num_cached_tokens: int = 0
+    # The number of tokens that have been computed remotely.
+    num_external_computed_tokens: int = 0

     @property
     def finished(self) -> bool:
```

Review thread on the `num_cached_tokens` comment:

> Yeah, this comment looks incorrect ... assuming "prefix cache" refers to the local cache?

> I am not familiar with it cc @chaunceyjiang

Review thread on `num_external_computed_tokens`:

> I'd be tempted to refactor these two into a …

> I don't have a strong opinion on this tbh; we can probably wait to have a few more things to bundle before executing the suggestion.
Changes to the stat logger's token tracking:

```diff
@@ -113,7 +113,7 @@ def _reset(self, now):

     def _track_iteration_stats(self, iteration_stats: IterationStats):
         # Save tracked stats for token counters.
-        self.num_prompt_tokens += iteration_stats.num_prompt_tokens
+        self.num_prompt_tokens += iteration_stats.num_local_prompt_tokens
         self.num_generation_tokens += iteration_stats.num_generation_tokens

     def _get_throughput(self, tracked_stats: int, now: float) -> float:
```

Review thread on `_track_iteration_stats`:

> Presumably you want to update the Prometheus metric too?

> @markmc which one? I intentionally left …
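For context, here is a minimal sketch of how these counters feed the periodic throughput log. The method names and signatures match the diff above; the class names, the `_reset` body, and the `_get_throughput` body are assumptions about the surrounding logger, not the actual vLLM implementation.

```python
import time
from dataclasses import dataclass


@dataclass
class IterationStatsLike:
    """Stand-in for vLLM's IterationStats, with just the fields used here."""
    num_local_prompt_tokens: int = 0
    num_generation_tokens: int = 0


class ThroughputTrackerSketch:
    """Hedged sketch of a logging stat logger's throughput tracking."""

    def __init__(self) -> None:
        self._reset(time.monotonic())

    def _reset(self, now: float) -> None:
        # Assumed: counters restart at each logging interval.
        self.last_log_time = now
        self.num_prompt_tokens = 0
        self.num_generation_tokens = 0

    def _track_iteration_stats(self, iteration_stats: IterationStatsLike) -> None:
        # After this PR: only locally computed prompt tokens are counted,
        # so tokens prefilled on a remote instance no longer inflate the
        # decode instance's reported prompt throughput.
        self.num_prompt_tokens += iteration_stats.num_local_prompt_tokens
        self.num_generation_tokens += iteration_stats.num_generation_tokens

    def _get_throughput(self, tracked_stats: int, now: float) -> float:
        # Assumed body: tokens accumulated since the last log, divided
        # by the elapsed time.
        delta = now - self.last_log_time
        return tracked_stats / delta if delta > 0 else 0.0
```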
Changes to `IterationStats`:

```diff
@@ -221,6 +221,8 @@ def __init__(self):
         self.num_generation_tokens = 0
         self.num_prompt_tokens = 0
         self.num_preempted_reqs = 0
+        # Num of prompt tokens that have been computed locally.
+        self.num_local_prompt_tokens = 0
         self.finished_requests: list[FinishedRequestStats] = []
         self.max_num_generation_tokens_iter: list[int] = []
         self.n_params_iter: list[int] = []
@@ -251,6 +253,9 @@ def update_from_output(
         self.num_generation_tokens += num_new_generation_tokens
         if is_prefilling:
             self.num_prompt_tokens += prompt_len
+            self.num_local_prompt_tokens += (
+                prompt_len - output.num_external_computed_tokens
+            )

         first_token_latency = self._time_since(req_stats.arrival_time)
         self.time_to_first_tokens_iter.append(first_token_latency)
```

Review thread on `num_local_prompt_tokens`:

> Is the naming here a bit confusing? By "computed locally" here we mean both computed and locally cached? If you just tracked …

> Yes, the behavior is unchanged: cached tokens would still result in higher throughput, even in a regular aggregated setup. Looking at the diff, I think it's pretty clear that I just want to rule out the remote tokens, i.e. I assume this semantic was the intended one from the beginning; it's just that "local" used to be redundant.
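To make the new accounting concrete, a small worked example using the names from the diff above (the numbers are hypothetical):

```python
# Hypothetical request in a disaggregated P/D setup.
prompt_len = 100                   # total prompt tokens for the request
num_external_computed_tokens = 60  # tokens prefilled on the remote instance

# As in update_from_output above:
num_prompt_tokens = prompt_len                                       # 100
num_local_prompt_tokens = prompt_len - num_external_computed_tokens  # 40

# Before this PR the logger derived prompt throughput from all 100
# tokens; with the fix, the decode instance only reports the 40 tokens
# it actually computed locally.
```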
Discussion on the scheduler's handling of failed external-block loads:

> @sdavidbd can you please double check this? My understanding is that we have to re-compute the whole prefill now, so we can track prompt throughput.
> Even from the docstring, it seems clear we don't always re-compute the whole prefill.
>
> A few things:
>
> 1. `num_external_computed_tokens = 0` should only happen inside the `not marked_invalid_block` clause - here we're saying all externally computed blocks are invalid.
> 2. `request.num_computed_tokens = request.num_cached_tokens` on line 1489 doesn't make sense to me - since `num_cached_tokens` includes both local and external computed tokens?
> 3. We should probably update `num_external_computed_tokens` at `# Truncate the computed tokens at the first failed block` - to something like `request.num_computed_tokens - local_computed_tokens` (but not obvious how we calculated `local_computed_tokens`).
> @NickLucche — as @markmc noted, we don't recompute the entire prefill. Only the externally computed tokens starting from the first failed block are recomputed.
>
> To correctly update `num_external_computed_tokens`, we should first determine how many externally computed tokens are affected. This can be derived from the delta between the original and truncated `num_computed_tokens` — the same tokens already aggregated in `total_affected_tokens` (lines 1473–1477):
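The suggested change itself is not reproduced on this page; the following is a hedged sketch of the idea. The helper name `update_after_failed_load` and the `truncated_num_computed_tokens` parameter are placeholders, not the actual vLLM code:

```python
def update_after_failed_load(request, truncated_num_computed_tokens: int) -> None:
    """Sketch: adjust the external-token count after truncating the
    computed tokens at the first failed block.

    `request` is assumed to expose num_computed_tokens and
    num_external_computed_tokens as in the diffs above.
    """
    # Delta between the original and truncated counts = the affected
    # tokens (the same quantity aggregated into total_affected_tokens).
    affected = request.num_computed_tokens - truncated_num_computed_tokens

    # Drop the affected tokens from the external count; the clamp to
    # zero is an assumption for safety, not part of the suggestion.
    request.num_external_computed_tokens = max(
        0, request.num_external_computed_tokens - affected
    )
    request.num_computed_tokens = truncated_num_computed_tokens
```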
> @markmc — regarding your points:
>
> 1. The `not marked_invalid_block` condition covers the sync-loading edge case where a request is affected by externally computed tokens that failed to load but are shared with preceding requests that will handle their recomputation. In this situation, the affected request still treats those tokens as locally computed, so its `num_external_computed_tokens` remains unchanged.
>
>    For example, assume `block_size = 1` and the following prompts (with `R1` preceding `R2` in the batch). Suppose `t1` is locally computed, `t2` and `t4` are externally computed, and `t2` fails to load while `t4` succeeds. Then both `R1` and `R2` are affected and will recompute `t2`, `t3` and `t5` respectively, but `R2`'s total number of computed tokens remains unchanged.
>
> 2. Correct — `num_cached_tokens` represents the total number of computed tokens (both local and external). Setting `num_computed_tokens = num_cached_tokens` ensures that all new tokens are recomputed in the current iteration, since the previous `num_computed_tokens` value already included them.
>
> 3. Agreed — see my suggested code changes above for how we update `num_external_computed_tokens` accordingly.
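The concrete prompts for the example in point 1 are not shown above; the following is a hypothetical assignment consistent with the described behavior, offered only as an illustration:

```python
# Hypothetical prompts, block_size = 1, R1 scheduled before R2.
R1 = ["t1", "t2", "t3"]        # t1 local; t2 external, fails to load
R2 = ["t1", "t2", "t4", "t5"]  # shares t1/t2 with R1; t4 external, loads OK

# - R1 truncates at the failed block and recomputes t2 and t3.
# - R2's failed token t2 sits in a block shared with the preceding R1,
#   which will recompute it, so R2 keeps treating t2 as locally
#   computed, leaves num_external_computed_tokens unchanged, and only
#   computes t5; its total computed-token count is unchanged.
```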