[V1][Spec Decode] Async scheduling integration with spec decode #22262

Open
wants to merge 4 commits into
base: main

zixi-qi
Collaborator

@zixi-qi zixi-qi commented Aug 5, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Support async scheduling with speculative decoding with two changes:

  1. Increase the number of output placeholders per request from 1 to 1 + len(spec tokens), and simplify the logic so that the number of output placeholders stays static.
  2. When async scheduling is enabled, the scheduler output that reaches the model runner is stale: it does not reflect the speculative tokens proposed, or the number of tokens rejected, in the latest model runner execution. To resolve this, the PR adds a cache inside the model runner for the speculative token ids and the number of computed tokens per request, and uses it to overwrite the corresponding fields of the incoming scheduler output.
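The two changes above can be sketched roughly as follows. This is an illustrative sketch only, not the PR's actual code: `num_output_placeholders`, `RunnerSpecCache`, `record_step`, and `patch` are hypothetical names, and the bookkeeping rule (computed tokens advance by scheduled minus rejected tokens) is an assumption drawn from the description.

```python
from dataclasses import dataclass, field


def num_output_placeholders(num_spec_tokens: int) -> int:
    # One slot for the regularly sampled token plus one per speculative token.
    return 1 + num_spec_tokens


@dataclass
class RunnerSpecCache:
    """Hypothetical per-request cache held by the model runner."""

    draft_token_ids: dict[str, list[int]] = field(default_factory=dict)
    num_computed_tokens: dict[str, int] = field(default_factory=dict)

    def record_step(self, req_id: str, num_computed_before: int,
                    num_scheduled: int, num_rejected: int,
                    new_draft_ids: list[int]) -> None:
        # Assumption: rejected draft tokens were scheduled but do not count
        # as computed, so they are subtracted after the step.
        self.num_computed_tokens[req_id] = (
            num_computed_before + num_scheduled - num_rejected)
        # Remember the drafts proposed in this step for the next step.
        self.draft_token_ids[req_id] = list(new_draft_ids)

    def patch(self, req_id: str, sched_num_computed: int,
              sched_draft_ids: list[int]) -> tuple[int, list[int]]:
        # Prefer cached values; fall back to the scheduler's view for
        # requests the runner has not executed yet.
        return (
            self.num_computed_tokens.get(req_id, sched_num_computed),
            self.draft_token_ids.get(req_id, list(sched_draft_ids)),
        )


cache = RunnerSpecCache()
# A step that scheduled 1 + 5 tokens and rejected 3 of the drafts.
cache.record_step("req-0", num_computed_before=10,
                  num_scheduled=num_output_placeholders(5),
                  num_rejected=3, new_draft_ids=[7, 8, 9, 10, 11])
# The stale scheduler output assumed all 6 tokens were kept and has no drafts.
patched = cache.patch("req-0", sched_num_computed=16, sched_draft_ids=[])
```

In this sketch the runner, rather than the scheduler, is the source of truth for the latest step, which is why `patch` ignores the scheduler's values whenever a cached entry exists.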

Test Plan

  • Added unit test
  • Ran e2e tests for acceptance rate, output quality and throughput.

Test Results

  • existing async scheduler unit tests pass
pytest -v tests/v1/core/test_async_scheduler.py

tests/v1/core/test_async_scheduler.py::test_stop_by_max_tokens[1] PASSED                                                                                                             [ 12%]
tests/v1/core/test_async_scheduler.py::test_stop_by_max_tokens[2] PASSED                                                                                                             [ 25%]
tests/v1/core/test_async_scheduler.py::test_stop_by_max_tokens[3] PASSED                                                                                                             [ 37%]
tests/v1/core/test_async_scheduler.py::test_stop_by_max_tokens[5] PASSED                                                                                                             [ 50%]
tests/v1/core/test_async_scheduler.py::test_abort PASSED                                                                                                                             [ 62%]
tests/v1/core/test_async_scheduler.py::test_preempt PASSED                                                                                                                           [ 75%]
tests/v1/core/test_async_scheduler.py::test_prefix_caching_for_prefill_dedup PASSED                                                                                                  [ 87%]
tests/v1/core/test_async_scheduler.py::test_prefix_caching_for_multi_turn PASSED
  • sanity checked output quality
VLLM_USE_V1=1 python examples/offline_inference/spec_decode.py --num_spec_tokens 5 --num_prompts 10 --dataset-name hf --dataset-path philschmid/mt-bench --async-scheduling --print-output

Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 9118.05it/s]
Processed prompts: 100%|███████████████████████████████| 10/10 [00:02<00:00,  4.58it/s, est. speed input: 505.48 toks/s, output: 968.79 toks/s]
--------------------------------------------------
prompt: None
generated text: To evaluate the movie reviews, I will analyze the language and tone used in each review. Here's the evaluation:

1. This review is extremely positive, using words like "phenomenal" and "top-notch" to describe the movie. 
2. This review is extremely negative, using words like "disappointed", "predictable", and "worst" to describe the movie. However, the release year mentioned in the review is 2022, but the movie mentioned in the review was released in 2019. This inconsistency suggests that the reviewer is actually referring to a different movie. I will assume that the reviewer is referring to the movie released in 2019.
3. This review is neutral, using words like "okay" and "ordinary" to describe the movie.

Here's the evaluation as a JSON array of integers:

json
[
  5,
  1,
  3
]

Note: The second review's evaluation is based on the assumption that the reviewer is referring to the movie released in 2019, not 2022.
--------------------------------------------------
--------------------------------------------------
prompt: None
generated text: As you step into the vibrant marketplace, the cacophony of sounds envelops you - the chatter of vendors, the clanging of pots and pans, and the melodic calls of street performers fill the air. The scent of exotic spices wafts through the crowds, mingling with the sweet aroma of fresh fruit and the savory smell of sizzling street food. The visual feast is just as overwhelming, with vibrant colors and patterns on display: intricately woven textiles, gleaming silver jewelry, and pyramids of juicy produce. The air is thick with the smell of roasting coffee and the sound of vendors hawking their wares - "Fresh coconuts, only 50 cents!" and "Get your handmade crafts here!" The sun beats down on the crowded stalls, casting a warm glow over the scene. As you navigate through the throngs of people, the sensation of the sun on your skin and the sounds of the market create a sensory overload that is both exhilarating and exhausting. Amidst the chaos, a group of musicians begins to play a lively tune on their instruments, drawing a crowd of onlookers who sway to the rhythm, adding to the infectious energy of the marketplace.
--------------------------------------------------
--------------------------------------------------
prompt: None
generated text: The memories I hold within my ancient, gnarled heart. I've seen generations come and go, seasons rise and fall, and the world around me change in ways both big and small. But nothing could have prepared me for the feeling of being threatened with destruction.

As the deforesters approach, I feel a sense of unease, like a gentle rustling of leaves that grows into a fierce storm. My roots, which have dug deep into the earth for a century, begin to tremble with anxiety. I've seen so many of my friends and companions fall to the axe, their trunks shattered, their limbs torn asunder. The thought of joining them is a prospect too terrible to bear.

As the first blows strike, I feel a searing pain, like a thousand knives slicing through my bark. The sound of the chainsaw is a cacophony of terror, a deafening roar that threatens to consume me whole. I try to stand tall, to resist the onslaught, but it's no use. The weight of the axe, the force of the chainsaw, is too much for me to bear.

I feel my strength waning, my branches weakening, my trunk beginning to splinter. I'm being torn apart, piece by piece,
--------------------------------------------------
--------------------------------------------------
prompt: None
generated text: Given the seismically active area and the high-magnitude earthquakes experienced in the region, the best approach to constructing the bridge would be to prioritize seismic resilience and safety. Here's a comprehensive plan to ensure the bridge's stability and durability:

1. **Seismic Design and Analysis**: Engage a team of experienced structural engineers and seismologists to conduct a thorough seismic analysis of the site. This will involve:
	* Evaluating the local seismic hazard, including the frequency and magnitude of earthquakes.
	* Conducting a site-specific seismic hazard assessment to determine the expected ground motion.
	* Developing a seismic design that accounts for the site's specific conditions.
2. **Bridge Type and Design**: Choose a bridge type that is well-suited for seismic areas, such as:
	* A cable-stayed or suspension bridge, which can be designed to be more flexible and resistant to seismic forces.
	* A reinforced concrete or steel structure with a robust foundation system.
3. **Foundation Design**: Design a foundation system that can transfer seismic forces to the ground safely. This may include:
	* A deep foundation system, such as piles or caissons, to transfer loads to a stable rock or soil layer.
	* A raft foundation or a mat foundation to distribute loads
--------------------------------------------------
--------------------------------------------------
prompt: None
generated text: To identify the company with the highest profit in 2021, we need to compare the profit values of each company.

The profit values are:
- Company X: $3 billion
- Company Y: $6 billion
- Company Z: $7 billion
- Company W: $21 billion
- Company V: $25 billion
- Company U: $20 billion

The company with the highest profit in 2021 is Company V, with a profit of $25 billion. The CEO of Company V is Lisa Brown.
--------------------------------------------------
--------------------------------------------------
prompt: None
generated text: Here's the corrected paragraph:

She didn't remember where her purse was, so I thought it might be in the car, but he said it was on the kitchen table, but he wasn't sure. Then they asked me to look for it. She said, "Can you?" and I responded with, "Maybe, but I'm not sure." He didn't hear me, and he asked, "What?" Then he asked, "Did you find it?"

Corrected errors:

- "remembre" -> "remember"
- "where is" -> "where her purse was" (added "her purse" for clarity)
- "I thinks" -> "I thought" (correct verb form)
- "its" -> "it" (correct possessive form)
- "he's" -> "he said" (correct verb form)
- "he are" -> "he wasn't" (correct verb form)
- "looking for it" -> "look for it" (correct verb form)
- "she's say" -> "she said" (correct verb form)
- "I responds" -> "I responded" (correct verb form)
- "ain't" -> "I'm not" (correct contraction)
- "he not"
--------------------------------------------------
--------------------------------------------------
prompt: None
generated text: **Process:**

The reaction between solid calcium carbonate (CaCO3) and hydrochloric acid (HCl) is a type of acid-base reaction, also known as a neutralization reaction. In this reaction, the acid (HCl) reacts with the base (CaCO3) to form a salt (CaCl2), water (H2O), and carbon dioxide (CO2).

Here's a step-by-step description of the process:

1. The solid calcium carbonate (CaCO3) is placed in a container.
2. Hydrochloric acid (HCl) is slowly added to the calcium carbonate while stirring.
3. As the acid reacts with the base, the mixture starts to fizz and bubble, indicating the release of carbon dioxide gas.
4. The reaction mixture becomes warm, and a white precipitate of calcium chloride (CaCl2) may form.
5. The reaction is complete when the acid has been fully neutralized, and no more bubbles are produced.

**Balanced Chemical Equation:**

CaCO3 (s) + 2HCl (aq) → CaCl2 (aq) + H2O (l) + CO2 (g)

**Type of Reaction:**

This reaction is a type of acid-base
--------------------------------------------------
--------------------------------------------------
prompt: None
generated text: To solve the inequality |x + 5| < 10, we need to consider two cases:

Case 1: x + 5 ≥ 0
|x + 5| = x + 5
x + 5 < 10
x < 5

Case 2: x + 5 < 0
|x + 5| = -(x + 5)
-(x + 5) < 10
-x - 5 < 10
-x < 15
x > -15

Combining the two cases, we get:
-15 < x < 5

Now, we need to find the integers in this range. The integers in this range are:
-14, -13, -12, -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4

There are 19 integers in the solution of the inequality |x + 5| < 10.
--------------------------------------------------
--------------------------------------------------
prompt: None
generated text: If I have just overtaken the second person, that means I have moved up to the second position. 

The person I just overtook is now in the third position.
--------------------------------------------------
--------------------------------------------------
prompt: None
generated text: Hand dryers.  A most intriguing topic.  Now, I've given this considerable thought, and I must say, I find them to be a most... efficient use of technology.  However, I do have some reservations regarding their effectiveness.  You see, I've conducted experiments, and I've found that hand dryers often fail to completely dry one's hands, particularly in colder climates.  This, of course, leads to a higher risk of bacterial and fungal infections.

Furthermore, I've noticed that many hand dryers are not designed with the optimal air flow in mind.  They often blow air at an angle, rather than directly at the hands, which can lead to a less-than-desirable drying experience.  And don't even get me started on the noise level.  Some of these things are as loud as a jet engine taking off.

Now, I know what you're thinking: "Sheldon, why not just use paper towels?"  Well, my friend, paper towels are a far more hygienic option, but they're also a waste of resources.  I mean, think about it: all that paper, just being used once and then discarded.  It's a travesty, really.

In conclusion, while hand dry
--------------------------------------------------
--------------------------------------------------
total_num_output_tokens: 2114
num_drafts: 944
num_draft_tokens: 4720
num_accepted_tokens: 1147
mean acceptance length: 2.22
--------------------------------------------------
acceptance at token 0: 0.63
acceptance at token 1: 0.33
acceptance at token 2: 0.15
acceptance at token 3: 0.07
acceptance at token 4: 0.03
  • acceptance rate & throughput with async scheduling
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████| 80/80 [00:00<00:00, 16683.79it/s]
Processed prompts: 100%|█████████████████████████████| 80/80 [00:02<00:00, 26.88it/s, est. speed input: 2706.33 toks/s, output: 5741.97 toks/s]
--------------------------------------------------
total_num_output_tokens: 17086
num_drafts: 6880
num_draft_tokens: 34400
num_accepted_tokens: 10026
mean acceptance length: 2.46
--------------------------------------------------
acceptance at token 0: 0.69
acceptance at token 1: 0.40
acceptance at token 2: 0.21
acceptance at token 3: 0.11
acceptance at token 4: 0.05
  • acceptance rate & throughput without async scheduling
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████| 80/80 [00:00<00:00, 16828.54it/s]
Processed prompts: 100%|█████████████████████████████| 80/80 [00:03<00:00, 26.12it/s, est. speed input: 2629.76 toks/s, output: 5567.79 toks/s]
--------------------------------------------------
total_num_output_tokens: 17050
num_drafts: 6980
num_draft_tokens: 34900
num_accepted_tokens: 10102
mean acceptance length: 2.45
--------------------------------------------------
acceptance at token 0: 0.68
acceptance at token 1: 0.40
acceptance at token 2: 0.21
acceptance at token 3: 0.11
acceptance at token 4: 0.05
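
As a cross-check, the reported mean acceptance lengths are consistent with 1 + num_accepted_tokens / num_drafts (one regularly sampled token per draft step plus the accepted draft tokens). The sketch below recomputes them from the counters printed above and compares output throughput between the two 80-prompt runs. The formula is inferred from the numbers rather than quoted from the script, and the helper name is hypothetical.

```python
def mean_acceptance_length(num_accepted_tokens: int, num_drafts: int) -> float:
    # Assumed formula: each draft step yields one sampled token plus
    # however many draft tokens were accepted.
    if num_drafts == 0:
        return 1.0
    return 1 + num_accepted_tokens / num_drafts


# Counters copied from the two 80-prompt runs reported above.
runs = {
    "async": {"num_drafts": 6880, "num_accepted_tokens": 10026,
              "output_toks_per_s": 5741.97},
    "sync": {"num_drafts": 6980, "num_accepted_tokens": 10102,
             "output_toks_per_s": 5567.79},
}

lengths = {
    name: round(mean_acceptance_length(r["num_accepted_tokens"],
                                       r["num_drafts"]), 2)
    for name, r in runs.items()
}

# Relative output-throughput gain of the async run over the sync run.
gain_pct = 100 * (runs["async"]["output_toks_per_s"]
                  / runs["sync"]["output_toks_per_s"] - 1)

print(lengths, f"{gain_pct:.1f}%")
```

Under that formula the recomputed lengths match the reported 2.46 and 2.45, and the async run's output throughput is about 3.1% higher on this single measurement.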

github-actions bot commented Aug 5, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@zixi-qi zixi-qi force-pushed the async-scheduler-with-spec-decode branch from 9a6f640 to 82deff1 Compare August 5, 2025 16:39
@mergify mergify bot added documentation Improvements or additions to documentation speculative-decoding v1 labels Aug 5, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to integrate asynchronous scheduling with speculative decoding. The changes involve updating how output placeholders are handled and caching speculative decoding results in the model runner to cope with the one-step delay in the async scheduler. However, there's a critical issue in vllm/v1/core/sched/async_scheduler.py that leads to an incorrect calculation of the number of tokens to cache, causing an AssertionError as described in the PR description. My review provides a fix for this issue.

@zixi-qi zixi-qi removed the documentation Improvements or additions to documentation label Aug 8, 2025
@mergify mergify bot added the documentation Improvements or additions to documentation label Aug 8, 2025
@zixi-qi zixi-qi force-pushed the async-scheduler-with-spec-decode branch 2 times, most recently from 3b9ddec to 6f64741 Compare August 10, 2025 22:39
@zixi-qi zixi-qi changed the title [WIP] Async scheduling integration with spec decode [V1][Spec Decode] Async scheduling integration with spec decode Aug 10, 2025
@zixi-qi zixi-qi marked this pull request as ready for review August 10, 2025 23:03
mergify bot commented Aug 11, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zixi-qi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 11, 2025
@zixi-qi zixi-qi force-pushed the async-scheduler-with-spec-decode branch from 6f64741 to 489c91d Compare August 11, 2025 17:16
@mergify mergify bot removed the needs-rebase label Aug 11, 2025