
feat: Add optional prompt processing progress streaming #14731


Open

baonudesifeizhai wants to merge 4 commits into master from feature/prompt-progress-v2

Conversation

baonudesifeizhai

- Add include_prompt_progress parameter to slot_params (default: false)
- Extend server_task_result_cmpl_partial with progress fields
- Implement send_progress_response() function with 1% progress intervals
- Add progress response in prompt processing loop
- Update JSON response to include prompt_processing field when requested
- Add comprehensive documentation to README.md
- Ensure full backward compatibility with existing clients

Closes ggml-org#14685
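For illustration, here is a minimal Python sketch of how a client might consume the streamed progress. The endpoint path and SSE framing are standard llama-server behavior; the `include_prompt_progress` request field (as named in this description; the review below settles on a different name) and the shape of the `prompt_processing` chunk are assumptions based on this PR's description, not a definitive API.

```python
# Hypothetical client sketch. The flag name and the prompt_processing field
# are taken from this PR's description and may differ in the merged version.
import json
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Summarize the following document: ...",
        "n_predict": 64,
        "stream": True,
        "include_prompt_progress": True,  # opt-in; default is false
    },
    stream=True,
)

for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue                      # skip SSE keep-alives and empty lines
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        break
    chunk = json.loads(payload)
    if "prompt_processing" in chunk:
        print("prompt progress:", chunk["prompt_processing"])
    elif "content" in chunk:
        print(chunk["content"], end="", flush=True)
```

With the flag left at its default of false, no `prompt_processing` chunks are emitted, which is what keeps existing streaming clients unaffected.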
@BradHutchings

Is there a chance this could get approved? If it's not a welcome addition, I'll put it in the mmojo-server fork. Being able to display prompt evaluation progress is a must for servers running on a slow CPU, e.g. a Raspberry Pi 5.

Member

@ggerganov ggerganov left a comment


We can use a more compact send_progress instead of include_prompt_progress. Otherwise this seems like a good change.

params.stream = json_value(data, "stream", false);
params.cache_prompt = json_value(data, "cache_prompt", true);
params.return_tokens = json_value(data, "return_tokens", false);
params.include_prompt_progress = json_value(data, "include_prompt_progress", false);
Member


Suggested change
params.include_prompt_progress = json_value(data, "include_prompt_progress", false);
params.send_progress = json_value(data, "send_progress", false);


Thanks, Georgi! And for all you do with llama.cpp. Nobody says thank you enough!

Collaborator


Earlier in the code we already had a param called return_tokens, so maybe return_progress is a better name.

@ngxson
Collaborator

ngxson commented Jul 27, 2025

Is there a chance this could get approved?

There are multiple PRs already open for this particular feature, so it's very hard for maintainers to keep track. It's better to look at the list of open PRs before working on a feature.

@ngxson
Collaborator

ngxson commented Jul 27, 2025

At least one test case is also required for this feature, maybe with a long prompt and a small batch size so we can clearly see the effect.

@BradHutchings

Hold up on this a moment. I've been testing this implementation today.

        // Send progress if:
        // 1. This is the first progress update (last_progress == -1)
        // 2. Progress increased by at least 1% or processed at least 10 tokens
        // 3. We've completed processing (current_progress >= 1.0)
        bool should_send = (last_progress < 0.0f) || 
                          (current_progress - last_progress >= 0.01f) || 
                          (current_progress >= 1.0f && last_progress < 1.0f);

        if (!should_send) {
            return;
        }

This logic should be eliminated. We should send progress only when a batch is complete. Otherwise, a bunch of progress messages blasts the client as the batch completes. Most of the time spent processing a batch is not spent in the steps that emit this progress. It's spent, for example, here, as shown with LLAMA_PROFIL=1 and the gprof profile data:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 77.45    312.58   312.58     9920    31.51    31.51  void (anonymous namespace)::tinyBLAS_Q0_AVX<block_q8_0, block_q8_0, float>::gemm4xN<2>(long, long, long, long)
 13.34    366.40    53.82 115047805     0.00     0.00  void (anonymous namespace)::tinyBLAS<8, float __vector(8), float __vector(8), unsigned short, unsigned short, float>::gemm_b…
  2.06    374.71     8.31     3482     2.39     2.39  ggml_compute_forward_glu
  2.03    382.92     8.21     4688     1.75     2.29  ggml_compute_forward_soft_max
  1.24    387.92     5.00 23187277     0.00     0.00  ggml_vec_dot_q8_0_q8_0
  0.63    390.45     2.53  1925309     0.00     0.00  ggml_vec_soft_max_f32

If I bypass the should_send logic and just call send_progress_response() when a batch is complete, I can set e.g. --batch-size 64 on the server command line and get very reasonable progress behavior on a Raspberry Pi 5 running Gemma 4B, which is what I'm after here.

If some other PR has done this better, I'm happy to go try one. This PR almost gets the behavior perfect.

@BradHutchings

This probably doesn't qualify as a test case, but the video shows that this code, with the change to the send-progress logic described above, works as it should: 19K tokens, batch size 64, Raspberry Pi 5, generic CPU code, Gemma 1B.

Mmojo.Progress.mp4

-Brad

@baonudesifeizhai
Author

This probably doesn't qualify as a test case, but the video shows that this code, with the change to the send-progress logic described above, works as it should: 19K tokens, batch size 64, Raspberry Pi 5, generic CPU code, Gemma 1B.

Mmojo.Progress.mp4
-Brad

I'm just fixing it now.

- Add return_progress parameter to slot_params (default: false)
- Extend server_task_result_cmpl_partial with progress fields
- Implement send_progress_response() function with batch completion logic
- Add progress response in prompt processing loop
- Update JSON response to include prompt_processing field when requested
- Add comprehensive documentation to README.md
- Add C++ test suite for progress feature validation
- Ensure full backward compatibility with existing clients
- Fix chat completions endpoint progress support

Closes ggml-org#14685
@github-actions github-actions bot added the testing (Everything test related) label on Jul 28, 2025
Collaborator


The server test is tools/server/tests/unit/test_chat_completion.py, not ctest.
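As a starting point, here is a hedged pytest-style sketch of such a test. It assumes a llama-server instance is already listening on localhost:8080 and was started with a small --batch-size so a long prompt is processed in many batches; the actual suite in tools/server/tests manages the server through its own fixtures and helpers, so this is only an outline of the request and assertions, using the return_progress name suggested above.

```python
# Hypothetical test outline, not a drop-in for tools/server/tests.
# Assumes a server already running on localhost:8080 with a small --batch-size.
import json
import requests

BASE_URL = "http://localhost:8080"

def stream_chunks(body):
    """POST a streaming /completion request and yield decoded JSON chunks."""
    resp = requests.post(f"{BASE_URL}/completion", json=body, stream=True)
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        yield json.loads(payload)

def test_prompt_progress_enabled():
    # long prompt + small server batch size -> several progress updates
    body = {
        "prompt": "lorem ipsum dolor sit amet " * 500,
        "n_predict": 4,
        "stream": True,
        "return_progress": True,
    }
    progress = [c["prompt_processing"] for c in stream_chunks(body)
                if "prompt_processing" in c]
    assert len(progress) > 0

def test_prompt_progress_disabled_by_default():
    body = {"prompt": "hello", "n_predict": 4, "stream": True}
    assert all("prompt_processing" not in c for c in stream_chunks(body))
```

The second test is what guards backward compatibility: clients that never ask for progress should never see the new field.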

…incrementally

- Remove incremental progress sending logic to avoid 'blasting the client'
- Send progress only when prompt processing is complete (100%)
- Add comprehensive test case with long prompt and small batch size
- Test shows clear progress from 2.3% to 99.9% with 45 progress responses
- Verify progress disabled functionality works correctly
- Fixes GitHub issue ggml-org#14685
@github-actions github-actions bot added the python (python script changes) label on Jul 28, 2025
@ngxson
Collaborator

ngxson commented Jul 28, 2025

Honestly, by this point I'm spending more time reviewing this PR than just fixing it myself.

You clearly haven't even read the existing code. A pytest-compatible test is required.

@baonudesifeizhai
Author

Honestly, by this point I'm spending more time reviewing this PR than just fixing it myself.

You clearly haven't even read the existing code. A pytest-compatible test is required.

Sorry, my bad. Won't happen again.

@baonudesifeizhai baonudesifeizhai force-pushed the feature/prompt-progress-v2 branch from 57d841f to 3969a8d on July 29, 2025 at 09:03
Successfully merging this pull request may close the following issue:

Feature Request: Server stream response for "prompt processing progress" (#14685)