
Conversation

@KevinOConnor
Collaborator

During normal printing the host software would attempt to stay ahead of the micro-controller by 2 full seconds. Change that time to 1 second. This should make the software more responsive to user requests (such as pause requests).

This is a fairly notable change in that it should make the software feel more responsive to user input during prints - things like pausing a print, changing the extruder ratio, changing the Z offset, and so on. These types of mid-print commands should now take effect in about one second, where previously they took about two seconds.

The recent motion queuing work makes this change more practical because the host software is now more capable of rapidly generating steps (PR #7038, #7034, #7014).

@pedrolamas , @meteyou , @Sineos , @Arksine - I suspect this change may impact "load graphing tools" - any tool that previously looked at the log "buffer_time" stat. Previously that value would hover around 2 seconds, but now should hover around 1 second.

It is also possible that other tools may not expect the faster response time of mid-print commands.

-Kevin

@KevinOConnor KevinOConnor changed the title "Reduce "lookahead" time from 1 second to 2 seconds" to "Reduce "lookahead" time from 2 seconds to 1 second" Sep 18, 2025
@thijstriemstra
Contributor

thijstriemstra commented Sep 20, 2025

cc @alfrix (klipperscreen)

@dmbutyugin
Collaborator

Hi Kevin. While I do not disagree with the change, I think we need to be careful not to increase the number of print_stall events. I do believe that 2s currently is overkill, but I'm unsure what a safe, reasonable threshold should be.

@KevinOConnor
Collaborator Author

While I do not disagree with the change, I think we need to be careful not to increase the number of print_stall events. I do believe that 2s currently is overkill, but I'm unsure what a safe, reasonable threshold should be.

Agreed. Getting more test results would be good. I don't think there will be issues when using virtual_sdcard, but there is a potential for more print_stall issues for those printing using the /tmp/printer pseudo-tty interface.

Before and after this change, the host computer needs to be able to submit moves at least as fast as the printer moves. Prior to the recent changes, a submitter could fall behind for up to about a second and still recover without causing a "print stall" event. Now with this change, a submitter may fall behind for up to about 300ms before a "print stall" is likely.

Cheers,
-Kevin

@meteyou
Contributor

meteyou commented Sep 22, 2025

Would it be possible to add a config setting in the printer section with a default? Then a user who needs a higher buffer could add this value to the config.

@KevinOConnor
Collaborator Author

Would it be possible to add a config setting in the printer section with a default? Then a user who needs a higher buffer could add this value to the config.

The timing is not something that is easily configured by an end-user. Valid values are dependent on the timing constants at the top of klippy/extras/motion_queuing.py and similar code constants. So, technically we could read BUFFER_TIME_HIGH from the config, but setting it to something other than an expected value would likely just cause poor results.

If we do run into issues we should be able to tweak the timing, but I think we'll need to get those test results to see if and what tweaks are needed.

Cheers,
-Kevin

@dmbutyugin
Collaborator

Agreed. Getting more test results would be good. I don't think there will be issues when using virtual_sdcard, but there is a potential for more print_stall issues for those printing using the /tmp/printer pseudo-tty interface.

Well, even with virtual_sdcard Klipper must be able to enqueue enough moves for the back-tracking pass of LookAheadQueue to work correctly. AFAICT, the bare minimum of the queue time is min_queue_buf_time = 2 * max_velocity / (min_accel * minimum_cruise_ratio), and let's say with max_velocity = 300, min_accel = 3000, minimum_cruise_ratio = 0.5, we get min_queue_buf_time = 0.4s. So, there's still some margin left with BUFFER_TIME_HIGH = 1s, but as the parameters in that equation change, min_queue_buf_time changes accordingly, and the safety margin can decrease.
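The formula above is easy to verify numerically. A minimal sketch (plain Python, using the symbol names from the comment; not Klipper code):

```python
# Minimum queue time needed for the lookahead back-tracking pass,
# per the formula above: time to accelerate to max_velocity and
# decelerate back, at the lowest effective acceleration.
def min_queue_buf_time(max_velocity, min_accel, minimum_cruise_ratio):
    return 2.0 * max_velocity / (min_accel * minimum_cruise_ratio)

# Example values from the comment above:
t = min_queue_buf_time(max_velocity=300., min_accel=3000.,
                       minimum_cruise_ratio=0.5)
print(t)  # 0.4 seconds - some margin left under BUFFER_TIME_HIGH = 1.0
```

As the comment notes, the 0.6s margin shrinks as max_velocity rises or min_accel falls.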

@KevinOConnor
Collaborator Author

Well, even with virtual_sdcard Klipper must be able to enqueue enough moves for the back-tracking pass of LookAheadQueue to work correctly.

Yes - virtual_sdcard needs to be able to rapidly fill upcoming moves - enough moves to fill the lookahead buffer to the point that it flushes out enough movement to avoid a "print stall".

Specifically, whenever toolhead._check_pause() detects that the planner is more than 1 second (BUFFER_TIME_HIGH) ahead of the robot, it'll sleep until it is only 1 second ahead. At that point, virtual_sdcard can start submitting more moves. It needs to submit a sufficient number of moves that the lookahead buffer starts a "lazy flush" and that lazy flush produces enough moves that we stay more than ~700ms ahead of the robot. So, virtual_sdcard has about 300ms to submit at least 300ms of movement fully through lookahead lazy flushing.

AFAICT, the bare minimum of the queue time is min_queue_buf_time = 2 * max_velocity / (min_accel * minimum_cruise_ratio), and let's say with max_velocity = 300, min_accel = 3000, minimum_cruise_ratio = 0.5, we get min_queue_buf_time = 0.4s. So, there's still some margin left with BUFFER_TIME_HIGH = 1s, but as the parameters in that equation change, min_queue_buf_time changes accordingly, and the safety margin can decrease.

Hrmm. I'm not sure I understand what you are saying here. The lookahead code performs a "lazy flush" every 250ms of nominal movement (LOOKAHEAD_FLUSH_TIME). That is, it sums the minimum possible move time for each move (move.min_move_t which is move_d / requested_velocity) and does a "lazy flush" when that sum is over 250ms. The lazy flush code just checks to see if there is sufficient information to flush out some moves from the lookahead queue. If there isn't sufficient information it'll just try again in another 250ms of nominal move time.
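The lazy-flush trigger described above can be modeled with a short sketch (simplified Python; the class and method names here are illustrative, the real logic lives in klippy/toolhead.py):

```python
LOOKAHEAD_FLUSH_TIME = 0.250  # lazy flush per ~250ms of nominal movement

class LazyFlushSketch:
    # Simplified model of the lookahead "lazy flush" trigger: sum the
    # minimum possible move time of each queued move (move_d divided by
    # the requested velocity) and attempt a flush once that sum exceeds
    # LOOKAHEAD_FLUSH_TIME.
    def __init__(self):
        self.junction_flush = LOOKAHEAD_FLUSH_TIME
        self.flush_count = 0
    def add_move(self, move_d, requested_velocity):
        min_move_t = move_d / requested_velocity
        self.junction_flush -= min_move_t
        if self.junction_flush <= 0.:
            # Attempt a lazy flush; if there isn't enough information
            # to flush any moves yet, the real code simply tries again
            # after another ~250ms of nominal move time.
            self.flush_count += 1
            self.junction_flush = LOOKAHEAD_FLUSH_TIME

q = LazyFlushSketch()
for _ in range(100):
    q.add_move(move_d=10., requested_velocity=160.)  # 62.5ms nominal each
print(q.flush_count)  # 25 lazy-flush attempts over 6.25s of movement
```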

The only time the lookahead code would not apply normal junction speeds would be if it is fully flushed due to a "print stall" event (toolhead._handle_step_flush() -> toolhead._flush_lookahead() -> toolhead._process_lookahead() -> toolhead.lookahead.flush(lazy=False)). So, the lookahead junction speed assignments are not directly tied to BUFFER_TIME_HIGH.

I'm not sure if I'm missing something or misunderstood what you were reporting.

Cheers,
-Kevin

@dmbutyugin
Collaborator

Specifically, whenever toolhead._check_pause() detects that the planner is more than 1 second (BUFFER_TIME_HIGH) ahead of the robot, it'll sleep until it is only 1 second ahead. At that point, virtual_sdcard can start submitting more moves. It needs to submit a sufficient number of moves that the lookahead buffer starts a "lazy flush" and that lazy flush produces enough moves that we stay more than ~700ms ahead of the robot. So, virtual_sdcard has about 300ms to submit at least 300ms of movement fully through lookahead lazy flushing.

Yes, this is the problematic part - virtual_sdcard may get blocked before it submits enough moves for a lazy flush to flush anything. Then, if virtual_sdcard is blocked, Klipper will eventually be forced to perform a full flush of the queue, causing a print stall and a short pause. To illustrate the effect, I sliced a test gcode that just prints a cylinder of 300 mm diameter at 200 mm/sec in vase mode, and sets the kinematic limits as

SET_VELOCITY_LIMIT VELOCITY=300 ACCEL=1000 MINIMUM_CRUISE_RATIO=0.5

so the acceleration is on the lower side, but nothing super extreme. If I run this print on the current Klipper 6118525, then it works as expected - the toolhead moves with a constant speed of 200 mm/sec, slowly going up. But if I reduce

--- a/klippy/toolhead.py
+++ b/klippy/toolhead.py
@@ -190,7 +190,7 @@ class LookAheadQueue:
         # Check if enough moves have been queued to reach the target flush time.
         return self.junction_flush <= 0.
 
-BUFFER_TIME_HIGH = 2.0
+BUFFER_TIME_HIGH = 0.5
 BUFFER_TIME_START = 0.250
 
 # Main code to track events (and their timing) on the printer toolhead

which is a bit on the extreme side currently, then I start getting print stalls on the same print. I attached both the klippy.log (including a successful run on clean Klipper and a bad run with only this modification) and the corresponding GCode file. You may notice that the MCUs are running modified Klipper MCU firmware, but I can assure you it should not cause any effects regarding this issue - I was too lazy to reflash them back and forth, I'm sorry.

buffer-time.zip
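Plugging the limits from this test into the min_queue_buf_time formula from the earlier comment shows why 0.5s is too tight here (plain Python arithmetic; my own working, not taken from the attached log):

```python
# min_queue_buf_time = 2 * max_velocity / (min_accel * minimum_cruise_ratio)
# with the SET_VELOCITY_LIMIT values used in this test:
# VELOCITY=300, ACCEL=1000, MINIMUM_CRUISE_RATIO=0.5
min_queue_buf_time = 2 * 300. / (1000. * 0.5)
print(min_queue_buf_time)  # 1.2 seconds

# That exceeds the reduced BUFFER_TIME_HIGH of 0.5s (and even the 1.0s
# of this PR), consistent with the print stalls observed in the test.
print(min_queue_buf_time > 0.5)  # True
```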

@KevinOConnor
Collaborator Author

Thanks for testing.

Yes, this is a problematic part - virtual_sdcard may get blocked before it submits enough moves for a lazy flush to flush anything. Then, if a virtual_sdcard is blocked eventually Klipper will be forced to perform a full flush of the queue, causing a print stall and a short pause.

Yes, I think at a high level we are "on the same page". There's a fundamental trade-off in the setting of BUFFER_TIME_HIGH - too high and the software is not responsive to user requests; too low and we risk "print stall" events.

Just "thinking out loud" we could change to something like LOOKAHEAD_FLUSH_TIME=0.150 as part of setting BUFFER_TIME_HIGH = 1.0. It should only be a minor increase in processor time and may reduce the chance of a print stall in some corner cases.

BUFFER_TIME_HIGH = 0.5

Thanks for testing. FYI, it wouldn't be valid to set BUFFER_TIME_HIGH=0.5 without also changing the constants at the top of motion_queuing.py. With the current BGFLUSH_SG_HIGH_TIME=0.700 a BUFFER_TIME_HIGH=0.5 would result in incrementing print_stall reports even under very low load. That is, the toolhead code would be sleeping past the point where the flushing code would force full flushes.

Cheers,
-Kevin

@KevinOConnor KevinOConnor force-pushed the work-toolhead-1second-20250907 branch 3 times, most recently from cb5b397 to 3529011 Compare October 1, 2025 00:47
@KevinOConnor
Collaborator Author

FYI, I rebased this branch and added a couple of additional minor timing tweaks (including tuning of LOOKAHEAD_FLUSH_TIME as discussed above).

-Kevin

@dmbutyugin
Collaborator

Thanks for testing. FYI, it wouldn't be valid to set BUFFER_TIME_HIGH=0.5 without also changing the constants at the top of motion_queuing.py. With the current BGFLUSH_SG_HIGH_TIME=0.700 a BUFFER_TIME_HIGH=0.5 would result in incrementing print_stall reports even under very low load. That is, the toolhead code would be sleeping past the point where the flushing code would force full flushes.

Thanks, Kevin. Then I guess it was my misunderstanding of the blocking behavior of Klipper. I was under the impression that if buffer_time exceeds BUFFER_TIME_HIGH, the toolhead would stop accepting more moves until the buffer decreases further, but I guess I missed some branch that specifically handles the case where there is nothing to flush and the buffer can grow beyond BUFFER_TIME_HIGH. Then this should be OK, at least with virtual_sdcard.

@KevinOConnor
Collaborator Author

I was under the impression that if buffer_time exceeds BUFFER_TIME_HIGH, the toolhead would stop accepting more moves until the buffer decreases further,

That is correct. Moves are added to the lookahead (toolhead.move()) until the amount of planned movement generated from lookahead flushes exceeds BUFFER_TIME_HIGH (print_time - estimated_current_time_on_mcu > BUFFER_TIME_HIGH). When that happens, the code delays (reactor.pause()) until there is only that much data buffered (toolhead._check_pause()). On a lookahead flush the moves are added to the trapq (toolhead._process_moves()).
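The flow control described above can be sketched as a toy simulation (simplified Python, illustrative names only; the real logic is in toolhead._check_pause()):

```python
BUFFER_TIME_HIGH = 1.0  # target lead over the micro-controller (this PR)

def submit_moves_sketch(move_times):
    # Simplified model of the toolhead flow control: each move advances
    # print_time; whenever print_time gets more than BUFFER_TIME_HIGH
    # ahead of the (simulated) MCU clock, the submitter is paused
    # (reactor.pause() in the real code) until the MCU catches up.
    print_time = 0.0
    mcu_time = 0.0
    pauses = 0
    for t in move_times:
        print_time += t
        if print_time - mcu_time > BUFFER_TIME_HIGH:
            # While the submitter sleeps, the MCU "clock" catches up to
            # the point where only BUFFER_TIME_HIGH remains buffered.
            mcu_time = print_time - BUFFER_TIME_HIGH
            pauses += 1
    return pauses

# Submitting 3 seconds of movement in 0.5s chunks: the submitter is
# paused once the buffered lead first exceeds 1 second.
print(submit_moves_sketch([0.5] * 6))  # 4 pauses
```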

It used to be that steps were immediately generated and flushed from the trapq once a move was released from the lookahead queue. However, that behavior recently changed (PR #7038). Now, steps are generated from a separate timer in motion_queuing.py (_flush_handler()). That code checks whether steps are pending and tries to generate those steps in advance, separately from the toolhead (BGFLUSH_SG_LOW_TIME > last_step_gen_time - estimated_current_time_on_mcu > BGFLUSH_SG_HIGH_TIME). To cover the case where steps are flushed faster than the toolhead can fill the trapq, toolhead.py registers itself as a flush callback (motion_queuing.register_flush_callback(self._handle_step_flush)), and it will flush the lookahead if it somehow falls behind (and note a possible "print stall" event).

So, toolhead.py tries to keep the trapq entries 2 seconds ahead, and motion_queuing.py tries to keep the step generation 450-700ms ahead. This PR adjusts toolhead.py to keep just 1 second ahead. We could choose other times, but it wouldn't make sense for the toolhead to try to stay only 500ms ahead if the motion_queuing code is trying to stay 700ms ahead.
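The consistency constraint between the two timers can be written down explicitly (plain Python; constant values are those quoted in this thread, not read from the source):

```python
# Timing constants as quoted in this discussion:
BUFFER_TIME_HIGH = 1.0        # toolhead: target trapq lead (this PR)
BGFLUSH_SG_LOW_TIME = 0.450   # motion_queuing: step-gen lead window low
BGFLUSH_SG_HIGH_TIME = 0.700  # motion_queuing: step-gen lead window high

# The toolhead must aim to stay further ahead than the background step
# generator - otherwise the flushing code would force full flushes (and
# print_stall events) even under very low load, which is why setting
# BUFFER_TIME_HIGH = 0.5 against BGFLUSH_SG_HIGH_TIME = 0.7 is invalid.
assert BUFFER_TIME_HIGH > BGFLUSH_SG_HIGH_TIME

# The margin a submitter has to refill the lookahead before a stall
# becomes likely - the "about 300ms" mentioned earlier in the thread:
print(BUFFER_TIME_HIGH - BGFLUSH_SG_HIGH_TIME)  # ~0.3 seconds
```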

It may seem like it's more complex to have two different timers (lookahead vs step generation). However, the previous behavior had some problems: steps weren't only generated from the toolhead (pwm_tool and manual_stepper could be active separate from normal G1 type moves); generating steps directly from lookahead queue flushes meant having to support steps generated far into the future (eg, 60+ seconds); and the close coordination between lookahead processing and step generation made the toolhead.py code very complex. So, I think the recent separation of the low-level step generation and flushing logic to the new motion_queuing.py code has benefits. It does add complexity when tweaking timing though.

Maybe that helps a little,
-Kevin

@KevinOConnor KevinOConnor force-pushed the work-toolhead-1second-20250907 branch 4 times, most recently from 094ca72 to a537e1f Compare October 9, 2025 22:08
Make sure each command gets an additional 100ms before flushing via
the priming timer.

Signed-off-by: Kevin O'Connor <[email protected]>
…ng moves

Normally the toolhead code will flush the lookahead buffer every
~250ms and will briefly pause to avoid buffering too much data.  That
pause allows other tasks to run.  Make sure to periodically yield to
other tasks on each lookahead buffer flush even if a delay isn't
needed.

Signed-off-by: Kevin O'Connor <[email protected]>
The current code is likely to perform a lazy flush of the lookahead
queue around 4 times a second.  Increase that to around 6-7 times a
second.  This change may slightly improve the responsiveness to user
requests mid-print (eg, changing extrusion ratio) and may make a
"print stall" less likely in some corner cases.

Signed-off-by: Kevin O'Connor <[email protected]>
During normal printing the host software would attempt to stay ahead
of the micro-controller by 2 full seconds.  Change that time to 1
second.  This should make the software more responsive to user
requests (such as pause requests).

Signed-off-by: Kevin O'Connor <[email protected]>
If there are other users of the gcode mutex then pause for 50ms
(instead of 100ms).

Signed-off-by: Kevin O'Connor <[email protected]>
Signed-off-by: Kevin O'Connor <[email protected]>
@KevinOConnor KevinOConnor force-pushed the work-toolhead-1second-20250907 branch from a537e1f to 9117c09 Compare October 10, 2025 15:12
@KevinOConnor KevinOConnor merged commit 9117c09 into master Oct 10, 2025
2 checks passed
@KevinOConnor KevinOConnor deleted the work-toolhead-1second-20250907 branch October 10, 2025 15:13