feat(logger): add rate limiter#5799
feat(logger): add rate limiter#5799kalyazin wants to merge 5 commits intofirecracker-microvm:mainfrom
Conversation
2325c61 to
3ddd1f5
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #5799 +/- ##
==========================================
- Coverage 83.08% 83.04% -0.04%
==========================================
Files 275 276 +1
Lines 29459 29494 +35
==========================================
+ Hits 24476 24494 +18
- Misses 4983 5000 +17
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
0240225 to
eb60521
Compare
531998b to
80580f3
Compare
80580f3 to
b795a7b
Compare
| use crate::rate_limiter::TokenBucket; | ||
|
|
||
| /// Maximum number of messages allowed per refill period. | ||
| pub const DEFAULT_BURST: u64 = 10; |
There was a problem hiding this comment.
Is 10 message per 5 seconds overly conservative?
0514643 to
d5835aa
Compare
| use crate::cpu_config::aarch64::custom_cpu_template::VcpuFeatures; | ||
| use crate::cpu_config::templates::CpuConfiguration; | ||
| use crate::logger::{IncMetric, METRICS, error}; | ||
| use crate::logger::{IncMetric, METRICS, error_rate_limited}; |
There was a problem hiding this comment.
can we make crate::logger::error be the rate limited one and ensure we're not using log::error directly?
Clippy can be configured to check it with clippy::disallowed_macros and
disallowed-macros = [
{ path = "log::error", reason = "use crate::logger::error! instead" },
{ path = "log::warn", reason = "use crate::logger::warn! instead" },
{ path = "log::info", reason = "use crate::logger::info! instead" },
]
There was a problem hiding this comment.
I think it is worth keeping the _rate_limited suffix for all rate limited logs for consistency
disallowed_macros looks interesting though
There was a problem hiding this comment.
I thought about dropping _rate_limited part but couldn't fully convince myself in it. Also, debug and trace are not rate-limited, and there will be extra mental burden to remember the difference.
I like the clippy::disallowed_macros idea, will give it a go.
There was a problem hiding this comment.
@Manciukic @ShadowCurse I added the clippy check. Please have another look.
| static LIMITER: $crate::logger::rate_limited::LogRateLimiter = | ||
| $crate::logger::rate_limited::LogRateLimiter::new(); | ||
| static SUPPRESSED: std::sync::atomic::AtomicU64 = | ||
| std::sync::atomic::AtomicU64::new(0); |
There was a problem hiding this comment.
this seems to be 88 bytes per callsite, so roughly 14KiB (with roughly 150 callsites) in .bss. I think it's acceptable, just noting it.
There was a problem hiding this comment.
In the future we can shrink TokenBucket to 48 bytes if we replace all u64 to u32 and then here we will get nice 64 bytes per call.
But here maybe switching to AtomicU32/U16 maybe worth since suppressing 64K logs wold be pretty uncommon + we only track this value as a metadata, so even overflow is not an issue.
| pub const fn new() -> Self { | ||
| Self { | ||
| inner: OnceLock::new(), | ||
| } | ||
| } |
There was a problem hiding this comment.
can this take burst and refil_time as args? And then the Default can be implement with new(DEFAULT_BURST, ...)
This way it can be configured if needed, like in the unit test that waits for 5 seconds for no reason.
| static LIMITER: $crate::logger::rate_limited::LogRateLimiter = | ||
| $crate::logger::rate_limited::LogRateLimiter::new(); | ||
| static SUPPRESSED: std::sync::atomic::AtomicU64 = | ||
| std::sync::atomic::AtomicU64::new(0); |
There was a problem hiding this comment.
In the future we can shrink TokenBucket to 48 bytes if we replace all u64 to u32 and then here we will get nice 64 bytes per call.
But here maybe switching to AtomicU32/U16 maybe worth since suppressing 64K logs wold be pretty uncommon + we only track this value as a metadata, so even overflow is not an issue.
| use crate::cpu_config::aarch64::custom_cpu_template::VcpuFeatures; | ||
| use crate::cpu_config::templates::CpuConfiguration; | ||
| use crate::logger::{IncMetric, METRICS, error}; | ||
| use crate::logger::{IncMetric, METRICS, error_rate_limited}; |
There was a problem hiding this comment.
I think it is worth keeping the _rate_limited suffix for all rate limited logs for consistency
disallowed_macros looks interesting though
0e11369 to
18d9c30
Compare
Add a per-callsite rate limiter for logging that wraps the existing TokenBucket in OnceLock<Mutex<...>>. Each macro invocation site gets its own independent LogRateLimiter via a static, so flooding one callsite does not suppress unrelated log messages. Default configuration: 10 messages per 5-second refill period, matching the Linux kernel printk_ratelimited defaults. Include unit tests for burst enforcement, callsite independence, and token refill after the configured period. Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Add error_rate_limited, warn_rate_limited, and info_rate_limited macros that wrap the LogRateLimiter. A shared helper macro deduplicates the logic across all three macros. Each macro checks log_enabled before touching the rate limiter to avoid overhead for filtered-out log levels. Per-callsite suppression counting via a static AtomicU64 reports the number of suppressed messages at warn level when logging resumes. Add rate_limited_log_count metric to LoggerSystemMetrics and update fcmetrics.py accordingly. Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Replace every error, warn, and info macro invocation in the vmm crate with its rate-limited counterpart. This covers all subsystems: virtio devices, legacy devices, PCI, ACPI, vCPU handling, signal handlers, device manager, snapshot paths, RPC interface, GDB support, and the rate limiter itself. debug-level calls are left unchanged as they are filtered out in production builds. Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Add clippy.toml with disallowed-macros configuration that prevents direct use of log::error, log::warn, and log::info. Enable the lint as deny in workspace clippy config. The rate-limited macro helper uses allow(clippy::disallowed_macros) internally since it must call the underlying log macros. The log_dev_preview_warning function is also allowed since it is not guest-triggerable. Non-vmm crates (firecracker, cpu-template-helper, log-instrument examples) are allowed since they do not have access to the rate-limited macros and their log calls are not guest-triggerable. Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Document the new per-callsite rate-limited logging feature in the changelog. Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
18d9c30 to
d6c7be0
Compare
| unexpected_cfgs = { level = "warn", check-cfg = ['cfg(kani)'] } | ||
|
|
||
| [workspace.lints.clippy] | ||
| disallowed_macros = "deny" |
There was a problem hiding this comment.
can we move this to vmm crate itself?
iow, something like the following
# src/vmm/Cargo.toml
[lints.clippy]
disallowed_macros = "deny"
| /// When logging resumes after suppression, a warn-level summary reports | ||
| /// the number of suppressed messages. | ||
| #[macro_export] | ||
| macro_rules! error_rate_limited { |
There was a problem hiding this comment.
I still think we should maybe just keep these as the standard error, warn, info so we don't have the extra mental burden of remembering everytime to use error_rate_limited (which I'm also lazy to type out). I think it makes sense since we want almost all the codeplaces to use it.
We could also introduce _unrestricted variants that are explicitly allowed for clippy and we can use in places we deem safe (like the dev preview warning).
There was a problem hiding this comment.
If we keep the original names, do you suggest that clippy check makes sure we don't introduce accidental _unrestricted?
There was a problem hiding this comment.
I mean that clippy would block direct usages of log::error and suggest to use crate::logger::error instead. error_unrestricted would be allowed, but then it would be clear at code review time that we need to pay attention to it.
There was a problem hiding this comment.
I can't see how I can make clippy disallow log::error and allow crate::logger::error at the same time because it expands the macro and sees the former inside the latter.
Changes
Add per-callsite rate limiting for guest-triggered logging paths, following the Linux kernel printk_ratelimited pattern. The error_rate_limited! macro gives each callsite its own independent, preconfigured rate limiter set to 10 messages per 5-second window. When messages are suppressed, a summary is emitted once the callsite resumes logging. A new rate_limited_log_count metric tracks total suppressions.
I was not able to build an integration test that demonstrates that the rate limiting is effective against a real end-to-end scenario because it would've required a custom guest kernel, but I ran an ad hoc experiment by inserting an extra
error_rate_limited!line into the ballooninflate descriptor processing loop (hot path) and saw that it was rate-limited from 128 lines to 10 as expected.
Reason
Guest VMs can trigger repeated error!() calls through various virtio device paths (balloon, net, block, PCI, MMIO). Under sustained error conditions, this leads to excessive disk I/O and CPU consumption on the host from synchronous log writes.
License Acceptance
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.PR Checklist
tools/devtool checkbuild --allto verify that the PR passesbuild checks on all supported architectures.
tools/devtool checkstyleto verify that the PR passes theautomated style checks.
how they are solving the problem in a clear and encompassing way.
in the PR.
CHANGELOG.md.Runbook for Firecracker API changes.
integration tests.
TODO.rust-vmm.