
feat: Add optional prompt processing progress streaming #14731


Open

baonudesifeizhai wants to merge 4 commits into master from feature/prompt-progress-v2

Conversation

baonudesifeizhai

- Add include_prompt_progress parameter to slot_params (default: false)
- Extend server_task_result_cmpl_partial with progress fields
- Implement send_progress_response() function with 1% progress intervals
- Add progress response in prompt processing loop
- Update JSON response to include prompt_processing field when requested
- Add comprehensive documentation to README.md
- Ensure full backward compatibility with existing clients

Closes ggml-org#14685
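For illustration, here is a minimal Python sketch of how a client might consume the streamed progress. The endpoint path and SSE framing are standard llama-server behavior; the `include_prompt_progress` request field (as named in this description; the review below settles on a different name) and the shape of the `prompt_processing` chunk are assumptions based on this PR's description, not a definitive API.

```python
# Hypothetical client sketch. The flag name and the prompt_processing field
# are taken from this PR's description and may differ in the merged version.
import json
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Summarize the following document: ...",
        "n_predict": 64,
        "stream": True,
        "include_prompt_progress": True,  # opt-in; default is false
    },
    stream=True,
)

for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue                      # skip SSE keep-alives and empty lines
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        break
    chunk = json.loads(payload)
    if "prompt_processing" in chunk:
        print("prompt progress:", chunk["prompt_processing"])
    elif "content" in chunk:
        print(chunk["content"], end="", flush=True)
```

With the flag left at its default of false, no `prompt_processing` chunks are emitted, which is what keeps existing streaming clients unaffected.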
@BradHutchings

Is there a chance this could get approved? If it's not a welcome addition, I'll put it in the mmojo-server fork. Being able to display prompt evaluation progress is a must for servers running on a slow CPU, e.g. a Raspberry Pi 5.

Member

@ggerganov ggerganov left a comment


We can use a more compact send_progress instead of include_prompt_progress. Otherwise this seems like a good change.

params.stream = json_value(data, "stream", false);
params.cache_prompt = json_value(data, "cache_prompt", true);
params.return_tokens = json_value(data, "return_tokens", false);
params.include_prompt_progress = json_value(data, "include_prompt_progress", false);
Member


Suggested change
params.include_prompt_progress = json_value(data, "include_prompt_progress", false);
params.send_progress = json_value(data, "send_progress", false);


Thanks, Georgi! And for all you do with llama.cpp. Nobody says thank you enough!

Collaborator


Earlier in the code we already had a param called return_tokens, so maybe return_progress is a better name.

@ngxson
Collaborator

ngxson commented Jul 27, 2025

Is there a chance this could get approved?

There are multiple PRs already open for this particular feature, so it's very hard for maintainers to keep track. It's better to look at the list of open PRs before working on a feature.

@ngxson
Collaborator

ngxson commented Jul 27, 2025

At least one test case is also required for this feature, maybe with a long prompt and a small batch size so we can clearly see the effect.

@BradHutchings

Hold up on this a moment. I've been testing this implementation today.

        // Send progress if:
        // 1. This is the first progress update (last_progress == -1)
        // 2. Progress increased by at least 1% or processed at least 10 tokens
        // 3. We've completed processing (current_progress >= 1.0)
        bool should_send = (last_progress < 0.0f) || 
                          (current_progress - last_progress >= 0.01f) || 
                          (current_progress >= 1.0f && last_progress < 1.0f);

        if (!should_send) {
            return;
        }

This logic should be eliminated. We should send progress only when a batch is complete. Otherwise, a bunch of progress messages blasts the client as the batch completes. Most of the time spent processing a batch is not spent in the steps that emit this progress. It's spent, for example, here, as shown with LLAMA_PROFIL=1 and the gprof profile data:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 77.45    312.58   312.58     9920    31.51    31.51  void (anonymous namespace)::tinyBLAS_Q0_AVX<block_q8_0, block_q8_0, float>::gemm4xN<2>(long, long, long, long)
 13.34    366.40    53.82 115047805     0.00     0.00  void (anonymous namespace)::tinyBLAS<8, float __vector(8), float __vector(8), unsigned short, unsigned short, float>::gemm_b…
  2.06    374.71     8.31     3482     2.39     2.39  ggml_compute_forward_glu
  2.03    382.92     8.21     4688     1.75     2.29  ggml_compute_forward_soft_max
  1.24    387.92     5.00 23187277     0.00     0.00  ggml_vec_dot_q8_0_q8_0
  0.63    390.45     2.53  1925309     0.00     0.00  ggml_vec_soft_max_f32

If I bypass the should_send logic and just call send_progress_response() when a batch is complete, I can set e.g. --batch-size 64 on the server command line and get very reasonable progress behavior on a Raspberry Pi 5 running Gemma 4B, which is what I'm after here.

If some other PR has done this better, I'm happy to go try one. This PR almost gets the behavior perfect.

@BradHutchings

This probably doesn't qualify as a test case, but the video shows that this code, with the change to the send-progress logic described above, works as it should: 19K tokens, batch size 64, Raspberry Pi 5, generic CPU code, Gemma 1B.

Mmojo.Progress.mp4

-Brad

@baonudesifeizhai
Author

This probably doesn't qualify as a test case, but the video shows that this code, with the change to the send-progress logic described above, works as it should: 19K tokens, batch size 64, Raspberry Pi 5, generic CPU code, Gemma 1B.

Mmojo.Progress.mp4
-Brad

I'm just fixing it now.

- Add return_progress parameter to slot_params (default: false)
- Extend server_task_result_cmpl_partial with progress fields
- Implement send_progress_response() function with batch completion logic
- Add progress response in prompt processing loop
- Update JSON response to include prompt_processing field when requested
- Add comprehensive documentation to README.md
- Add C++ test suite for progress feature validation
- Ensure full backward compatibility with existing clients
- Fix chat completions endpoint progress support

Closes ggml-org#14685
@github-actions github-actions bot added the testing (Everything test related) label on Jul 28, 2025
Collaborator


The server test is tools/server/tests/unit/test_chat_completion.py, not ctest.
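As a starting point, here is a hedged pytest-style sketch of such a test. It assumes a llama-server instance is already listening on localhost:8080 and was started with a small --batch-size so a long prompt is processed in many batches; the actual suite in tools/server/tests manages the server through its own fixtures and helpers, so this is only an outline of the request and assertions, using the return_progress name suggested above.

```python
# Hypothetical test outline, not a drop-in for tools/server/tests.
# Assumes a server already running on localhost:8080 with a small --batch-size.
import json
import requests

BASE_URL = "http://localhost:8080"

def stream_chunks(body):
    """POST a streaming /completion request and yield decoded JSON chunks."""
    resp = requests.post(f"{BASE_URL}/completion", json=body, stream=True)
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        yield json.loads(payload)

def test_prompt_progress_enabled():
    # long prompt + small server batch size -> several progress updates
    body = {
        "prompt": "lorem ipsum dolor sit amet " * 500,
        "n_predict": 4,
        "stream": True,
        "return_progress": True,
    }
    progress = [c["prompt_processing"] for c in stream_chunks(body)
                if "prompt_processing" in c]
    assert len(progress) > 0

def test_prompt_progress_disabled_by_default():
    body = {"prompt": "hello", "n_predict": 4, "stream": True}
    assert all("prompt_processing" not in c for c in stream_chunks(body))
```

The second test is what guards backward compatibility: clients that never ask for progress should never see the new field.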

…incrementally

- Remove incremental progress sending logic to avoid 'blasting the client'
- Send progress only when prompt processing is complete (100%)
- Add comprehensive test case with long prompt and small batch size
- Test shows clear progress from 2.3% to 99.9% with 45 progress responses
- Verify progress disabled functionality works correctly
- Fixes GitHub issue ggml-org#14685
@github-actions github-actions bot added the python (python script changes) label on Jul 28, 2025
@ngxson
Collaborator

ngxson commented Jul 28, 2025

Honestly, by this point I'm spending more time reviewing this PR than just fixing it myself.

You clearly haven't even read the existing code. A pytest-compatible test is required.

@baonudesifeizhai
Author

Honestly, by this point I'm spending more time reviewing this PR than just fixing it myself.

You clearly haven't even read the existing code. A pytest-compatible test is required.

Sorry, my bad. Won't happen again.

@baonudesifeizhai baonudesifeizhai force-pushed the feature/prompt-progress-v2 branch from 57d841f to 3969a8d on July 29, 2025 at 09:03
Successfully merging this pull request may close the following issue:

Feature Request: Server stream response for "prompt processing progress" (#14685)