
feat(logger): add rate limiter#5799

Open
kalyazin wants to merge 5 commits into firecracker-microvm:main from kalyazin:log_rate_limiter

Conversation

Contributor

@kalyazin kalyazin commented Mar 27, 2026

Changes

Add per-callsite rate limiting for guest-triggered logging paths, following the Linux kernel printk_ratelimited pattern. The error_rate_limited! macro gives each callsite its own independent, preconfigured rate limiter set to 10 messages per 5-second window. When messages are suppressed, a summary is emitted once the callsite resumes logging. A new rate_limited_log_count metric tracks total suppressions.

I was not able to build an integration test demonstrating that the rate limiting is effective in a real end-to-end scenario, because that would have required a custom guest kernel. Instead I ran an ad hoc experiment: I inserted an extra error_rate_limited! line into the balloon inflate descriptor processing loop (a hot path) and saw it rate-limited from 128 lines to 10, as expected.
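For readers skimming the description, the per-callsite pattern can be sketched as below. This is a simplified, self-contained illustration: the names (`CallsiteLimiter`, `try_acquire`) and the fixed-window bookkeeping are stand-ins, not the PR's actual TokenBucket-backed `LogRateLimiter`.

```rust
use std::sync::Mutex;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

/// Simplified stand-in for the PR's limiter: allow `burst` messages
/// per fixed `period` window, then drop until the window rolls over.
pub struct CallsiteLimiter {
    state: Mutex<(Option<Instant>, u64)>, // (window start, messages used)
    burst: u64,
    period: Duration,
}

impl CallsiteLimiter {
    pub const fn new(burst: u64, period: Duration) -> Self {
        Self { state: Mutex::new((None, 0)), burst, period }
    }

    pub fn try_acquire(&self) -> bool {
        let mut s = self.state.lock().unwrap();
        let now = Instant::now();
        match s.0 {
            Some(start) if now.duration_since(start) < self.period => {
                if s.1 < self.burst {
                    s.1 += 1;
                    true
                } else {
                    false
                }
            }
            // First use, or the window elapsed: start a fresh window.
            _ => {
                *s = (Some(now), 1);
                true
            }
        }
    }
}

/// Each expansion site gets its own `static` limiter and suppression
/// counter, mirroring the kernel's printk_ratelimited approach.
macro_rules! error_rate_limited {
    ($($arg:tt)*) => {{
        static LIMITER: CallsiteLimiter =
            CallsiteLimiter::new(10, Duration::from_secs(5));
        static SUPPRESSED: AtomicU64 = AtomicU64::new(0);
        if LIMITER.try_acquire() {
            let n = SUPPRESSED.swap(0, Ordering::Relaxed);
            if n > 0 {
                eprintln!("({n} messages were suppressed)");
            }
            eprintln!($($arg)*);
        } else {
            SUPPRESSED.fetch_add(1, Ordering::Relaxed);
        }
    }};
}

fn main() {
    // Only the first 10 iterations print; the other 10 are counted.
    for i in 0..20 {
        error_rate_limited!("guest-triggered error #{i}");
    }
}
```

Because the `static`s live inside the macro body, every expansion site gets independent state, so a flooded callsite cannot starve unrelated log messages.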

Reason

Guest VMs can trigger repeated error!() calls through various virtio device paths (balloon, net, block, PCI, MMIO). Under sustained error conditions, this leads to excessive disk I/O and CPU consumption on the host from synchronous log writes.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkbuild --all to verify that the PR passes
    build checks on all supported architectures.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.


codecov bot commented Mar 27, 2026

Codecov Report

❌ Patch coverage is 40.58355% with 224 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.04%. Comparing base (d181eda) to head (d6c7be0).

Files with missing lines Patch % Lines
src/vmm/src/persist.rs 22.22% 14 Missing ⚠️
...rc/vmm/src/devices/virtio/balloon/event_handler.rs 0.00% 12 Missing ⚠️
src/vmm/src/devices/virtio/vsock/event_handler.rs 20.00% 12 Missing ⚠️
src/vmm/src/devices/virtio/vsock/unix/muxer.rs 7.69% 12 Missing ⚠️
src/vmm/src/devices/legacy/serial.rs 8.33% 11 Missing ⚠️
src/vmm/src/devices/virtio/net/event_handler.rs 0.00% 11 Missing ⚠️
src/vmm/src/devices/virtio/vsock/device.rs 28.57% 10 Missing ⚠️
src/vmm/src/devices/virtio/block/virtio/device.rs 30.76% 9 Missing ⚠️
...m/src/devices/virtio/block/virtio/event_handler.rs 0.00% 9 Missing ⚠️
src/vmm/src/pci/msix.rs 47.05% 9 Missing ⚠️
... and 37 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5799      +/-   ##
==========================================
- Coverage   83.08%   83.04%   -0.04%     
==========================================
  Files         275      276       +1     
  Lines       29459    29494      +35     
==========================================
+ Hits        24476    24494      +18     
- Misses       4983     5000      +17     
Flag Coverage Δ
5.10-m5n.metal 83.37% <40.16%> (-0.04%) ⬇️
5.10-m6a.metal 82.70% <40.16%> (-0.05%) ⬇️
5.10-m6g.metal 79.96% <40.92%> (-0.03%) ⬇️
5.10-m6i.metal 83.37% <40.16%> (-0.05%) ⬇️
5.10-m7a.metal-48xl 82.69% <40.16%> (-0.04%) ⬇️
5.10-m7g.metal 79.96% <40.92%> (-0.03%) ⬇️
5.10-m7i.metal-24xl 83.34% <40.16%> (-0.06%) ⬇️
5.10-m7i.metal-48xl 83.34% <40.16%> (-0.04%) ⬇️
5.10-m8g.metal-24xl 79.96% <40.92%> (-0.03%) ⬇️
5.10-m8g.metal-48xl 79.96% <40.92%> (-0.03%) ⬇️
5.10-m8i.metal-48xl 83.34% <40.16%> (-0.05%) ⬇️
5.10-m8i.metal-96xl 83.35% <40.16%> (-0.04%) ⬇️
6.1-m5n.metal 83.41% <39.88%> (-0.03%) ⬇️
6.1-m6a.metal 82.73% <39.88%> (-0.05%) ⬇️
6.1-m6g.metal 79.96% <40.92%> (-0.03%) ⬇️
6.1-m6i.metal 83.40% <39.88%> (-0.04%) ⬇️
6.1-m7a.metal-48xl 82.73% <39.88%> (-0.04%) ⬇️
6.1-m7g.metal 79.95% <40.92%> (-0.04%) ⬇️
6.1-m7i.metal-24xl 83.41% <39.88%> (-0.04%) ⬇️
6.1-m7i.metal-48xl 83.42% <39.88%> (-0.04%) ⬇️
6.1-m8g.metal-24xl 79.95% <40.92%> (-0.04%) ⬇️
6.1-m8g.metal-48xl 79.96% <40.92%> (-0.03%) ⬇️
6.1-m8i.metal-48xl 83.42% <39.88%> (-0.04%) ⬇️
6.1-m8i.metal-96xl 83.42% <39.88%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown.


@kalyazin kalyazin force-pushed the log_rate_limiter branch 2 times, most recently from 0240225 to eb60521 Compare March 27, 2026 12:33
@kalyazin kalyazin marked this pull request as ready for review March 27, 2026 12:33
@kalyazin kalyazin requested review from Manciukic and pb8o as code owners March 27, 2026 12:33
@kalyazin kalyazin self-assigned this Mar 27, 2026
@kalyazin kalyazin added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Mar 27, 2026
@kalyazin kalyazin force-pushed the log_rate_limiter branch 3 times, most recently from 531998b to 80580f3 Compare March 30, 2026 14:40
use crate::rate_limiter::TokenBucket;

/// Maximum number of messages allowed per refill period.
pub const DEFAULT_BURST: u64 = 10;
Contributor

@ilstam ilstam Apr 1, 2026

Is 10 messages per 5 seconds overly conservative?

Contributor

Ah, it's per callsite

@kalyazin kalyazin force-pushed the log_rate_limiter branch 2 times, most recently from 0514643 to d5835aa Compare April 2, 2026 11:49
use crate::cpu_config::aarch64::custom_cpu_template::VcpuFeatures;
use crate::cpu_config::templates::CpuConfiguration;
use crate::logger::{IncMetric, METRICS, error};
use crate::logger::{IncMetric, METRICS, error_rate_limited};
Contributor

can we make crate::logger::error be the rate limited one and ensure we're not using log::error directly?

Clippy can be configured to check it with clippy::disallowed_macros and

disallowed-macros = [
    { path = "log::error", reason = "use crate::logger::error! instead" },
    { path = "log::warn", reason = "use crate::logger::warn! instead" },
    { path = "log::info", reason = "use crate::logger::info! instead" },
]

Contributor

I think it is worth keeping the _rate_limited suffix for all rate limited logs for consistency
disallowed_macros looks interesting though

Contributor Author

I thought about dropping the _rate_limited part but couldn't fully convince myself of it. Also, debug and trace are not rate-limited, so there would be an extra mental burden to remember the difference.
I like the clippy::disallowed_macros idea, will give it a go.

Contributor Author

@Manciukic @ShadowCurse I added the clippy check. Please have another look.

Comment on lines +59 to +62
static LIMITER: $crate::logger::rate_limited::LogRateLimiter =
$crate::logger::rate_limited::LogRateLimiter::new();
static SUPPRESSED: std::sync::atomic::AtomicU64 =
std::sync::atomic::AtomicU64::new(0);
Contributor

this seems to be 88 bytes per callsite, so roughly 14KiB (with roughly 150 callsites) in .bss. I think it's acceptable, just noting it.

Contributor

In the future we can shrink TokenBucket to 48 bytes if we replace all u64 with u32, and then we would get a nice 64 bytes per callsite here.
But switching to AtomicU32/U16 here may be worth it anyway, since suppressing 64K logs would be pretty uncommon, and we only track this value as metadata, so even overflow is not an issue.

Comment on lines +39 to +43
pub const fn new() -> Self {
Self {
inner: OnceLock::new(),
}
}
Contributor

can this take burst and refill_time as args? And then Default can be implemented with new(DEFAULT_BURST, ...)
This way it can be configured if needed, like in the unit test that waits for 5 seconds for no reason.
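A sketch of that suggestion, with hypothetical names and a caller-supplied millisecond timestamp so unit tests can drive time explicitly instead of sleeping through the real 5-second window:

```rust
const DEFAULT_BURST: u64 = 10;
const DEFAULT_REFILL_MS: u64 = 5_000;

/// Hypothetical illustration: `burst` and the refill period become
/// constructor arguments, and the clock is injected for testability.
pub struct LogRateLimiter {
    burst: u64,
    refill_ms: u64,
    window_start_ms: u64,
    used: u64,
}

impl LogRateLimiter {
    pub const fn new(burst: u64, refill_ms: u64) -> Self {
        Self { burst, refill_ms, window_start_ms: 0, used: 0 }
    }

    /// Returns true if a message may be logged at time `now_ms`.
    pub fn try_acquire(&mut self, now_ms: u64) -> bool {
        if now_ms.saturating_sub(self.window_start_ms) >= self.refill_ms {
            self.window_start_ms = now_ms; // window elapsed: refill
            self.used = 0;
        }
        if self.used < self.burst {
            self.used += 1;
            true
        } else {
            false
        }
    }
}

impl Default for LogRateLimiter {
    fn default() -> Self {
        Self::new(DEFAULT_BURST, DEFAULT_REFILL_MS)
    }
}

fn main() {
    // A tiny window (100 ms) makes the refill testable without sleeping.
    let mut l = LogRateLimiter::new(2, 100);
    assert!(l.try_acquire(0));
    assert!(l.try_acquire(10));
    assert!(!l.try_acquire(20)); // burst exhausted inside the window
    assert!(l.try_acquire(100)); // window refilled
}
```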



@kalyazin kalyazin force-pushed the log_rate_limiter branch 5 times, most recently from 0e11369 to 18d9c30 Compare April 8, 2026 09:25
kalyazin added 5 commits April 8, 2026 13:37
Add a per-callsite rate limiter for logging that wraps the
existing TokenBucket in OnceLock<Mutex<...>>. Each macro
invocation site gets its own independent LogRateLimiter via
a static, so flooding one callsite does not suppress
unrelated log messages.

Default configuration: 10 messages per 5-second refill
period, matching the Linux kernel printk_ratelimited
defaults.

Include unit tests for burst enforcement, callsite
independence, and token refill after the configured
period.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Add error_rate_limited, warn_rate_limited, and
info_rate_limited macros that wrap the LogRateLimiter.
A shared helper macro deduplicates the logic across all
three macros.

Each macro checks log_enabled before touching the rate
limiter to avoid overhead for filtered-out log levels.
Per-callsite suppression counting via a static AtomicU64
reports the number of suppressed messages at warn level
when logging resumes.

Add rate_limited_log_count metric to LoggerSystemMetrics
and update fcmetrics.py accordingly.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Replace every error, warn, and info macro invocation in
the vmm crate with its rate-limited counterpart. This
covers all subsystems: virtio devices, legacy devices,
PCI, ACPI, vCPU handling, signal handlers, device
manager, snapshot paths, RPC interface, GDB support,
and the rate limiter itself.

debug-level calls are left unchanged as they are
filtered out in production builds.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Add clippy.toml with disallowed-macros configuration
that prevents direct use of log::error, log::warn, and
log::info. Enable the lint as deny in workspace clippy
config.

The rate-limited macro helper uses
allow(clippy::disallowed_macros) internally since it
must call the underlying log macros. The
log_dev_preview_warning function is also allowed since
it is not guest-triggerable.

Non-vmm crates (firecracker, cpu-template-helper,
log-instrument examples) are allowed since they do not
have access to the rate-limited macros and their log
calls are not guest-triggerable.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Document the new per-callsite rate-limited logging
feature in the changelog.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
unexpected_cfgs = { level = "warn", check-cfg = ['cfg(kani)'] }

[workspace.lints.clippy]
disallowed_macros = "deny"
Contributor

can we move this to vmm crate itself?

iow, something like the following

# src/vmm/Cargo.toml
[lints.clippy]
disallowed_macros = "deny"

/// When logging resumes after suppression, a warn-level summary reports
/// the number of suppressed messages.
#[macro_export]
macro_rules! error_rate_limited {
Contributor

I still think we should maybe just keep these as the standard error, warn, info so we don't have the extra mental burden of remembering every time to use error_rate_limited (which I'm also lazy to type out). I think it makes sense since we want almost all call sites to use it.
We could also introduce _unrestricted variants that are explicitly allowed for clippy and that we can use in places we deem safe (like the dev preview warning).

Contributor Author

If we keep the original names, do you suggest that clippy check makes sure we don't introduce accidental _unrestricted?

Contributor

I mean that clippy would block direct usages of log::error and suggest to use crate::logger::error instead. error_unrestricted would be allowed, but then it would be clear at code review time that we need to pay attention to it.

Contributor Author

I can't see how I can make clippy disallow log::error and allow crate::logger::error at the same time because it expands the macro and sees the former inside the latter.


Labels

Status: Awaiting review Indicates that a pull request is ready to be reviewed


4 participants