Problem
A problem we've been facing involves the pruning of old AppUsageEvent and ServiceUsageEvent records. Often, these records are removed before the corresponding Apps or Services have actually stopped, making it difficult to determine how long those resources have been running. If a consumer starts polling after the start record is pruned, it won't know the true start time of that App or Service.
Challenges and Use of Purge/Seed
A specific pain point relates to the destructively_purge_all_and_reseed endpoints for App and Service Usage Events. These endpoints are often used by a consumer when they initially start consuming event records or when they realize they have missed event records that have now been pruned. While destructively_purge_all_and_reseed recreates running resources in the database, it assigns new start timestamps that do not reflect actual creation or launch times. As a result, usage metrics can become misleading.
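To make the timestamp problem concrete, here is a minimal sketch of the effect. It is not Cloud Controller code: the `UsageEvent` class, the `reseed` function, and the dates are hypothetical stand-ins that only model "drop history, then write one start event per running app stamped with the current time".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class UsageEvent:
    """Hypothetical, simplified usage event (not the real model)."""
    app_guid: str
    state: str  # "STARTED" or "STOPPED" in this simplified model
    created_at: datetime


def reseed(running_app_guids, now):
    """Assumed behaviour of purge-and-reseed: all history is dropped and
    each running app gets a fresh STARTED event stamped with `now`."""
    return [UsageEvent(guid, "STARTED", now) for guid in running_app_guids]


# Illustrative dates only.
actual_start = datetime(2024, 1, 1, tzinfo=timezone.utc)
reseed_time = datetime(2024, 3, 1, tzinfo=timezone.utc)
query_time = datetime(2024, 3, 2, tzinfo=timezone.utc)

events = reseed(["app-1"], reseed_time)

# A consumer can only measure from the reseeded event.
apparent_runtime = query_time - events[0].created_at
actual_runtime = query_time - actual_start

print(apparent_runtime.days, actual_runtime.days)
```

In this toy case the consumer would report one day of usage for an app that has in fact been running for 61 days.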
Core Problems
- Pruning Before Completion
  - The system prunes old records to manage database growth. However, if an App/Service remains running for a long period, its `start` record may be deleted before the `stop` record exists
  - A newly added or recovering consumer will not see accurate start times
- Extended Downtime Leading to Missed Events
  - Sometimes, a usage-event polling service may go offline for an extended period (e.g. an unnoticed crash). By the time it resumes polling, older events may have been pruned, leaving gaps in historical data
- Accurate State Visibility
  - It becomes challenging to piece together which Apps or Services are still running when critical events have already been removed, forcing reliance on destructively_purge_all_and_reseed to reset the data (where we lose accurate historical start times)
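The problems above can be illustrated with a small simulation. This is a sketch under assumptions, not the actual pruning job: `UsageEvent`, `prune_by_age`, and `running_since` are hypothetical names, and pruning is modelled as a plain age cutoff.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class UsageEvent(NamedTuple):
    """Hypothetical, simplified usage event."""
    app_guid: str
    state: str  # "STARTED" or "STOPPED"
    created_at: datetime


def day(month, d):
    return datetime(2024, month, d, tzinfo=timezone.utc)


events = [
    UsageEvent("app-long", "STARTED", day(1, 1)),   # still running
    UsageEvent("app-short", "STARTED", day(3, 1)),
    UsageEvent("app-short", "STOPPED", day(3, 2)),
    UsageEvent("app-new", "STARTED", day(3, 3)),    # still running
]


def prune_by_age(events, cutoff):
    """Assumed age-only pruning: drop everything older than `cutoff`."""
    return [e for e in events if e.created_at >= cutoff]


def running_since(events):
    """What a consumer can reconstruct: apps with a STARTED event and no
    later STOPPED event, mapped to their start time."""
    running = {}
    for e in sorted(events, key=lambda e: e.created_at):
        if e.state == "STARTED":
            running[e.app_guid] = e.created_at
        elif e.state == "STOPPED":
            running.pop(e.app_guid, None)
    return running


before = running_since(events)
after = running_since(prune_by_age(events, day(2, 1)))

print(sorted(before))  # consumer that saw every event
print(sorted(after))   # consumer that starts polling after pruning
```

A consumer that starts polling after the cutoff never learns that `app-long` exists, even though it is still running, because its only event was the pruned start record.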
Potential Approaches
After running into this issue repeatedly, I’ve created a set of code changes for addressing some of these issues:
- Keep `start` Records for Active Apps/Services
  - Records remain in place until the corresponding `stop` event exists, preventing the loss of essential lifecycle information.
- Consumer Registration
  - By including `consumer_guid` and `after_guid` in usage-event requests, consumers can register themselves, allowing the Cloud Controller to avoid pruning events they have not yet processed
- Threshold-Based Pruning
  - A configurable limit (`threshold_for_keeping_unprocessed_records`) ensures the database does not grow indefinitely if a registered consumer stays offline. If the record count exceeds this threshold, older entries can still be pruned
- Endpoints for Managing Consumers
  - Operators or automated systems can view, remove, or otherwise manage registered consumers. This enables consumers to deregister themselves and make more informed decisions about when to request `destructively_purge_all_and_reseed`
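One way the first three rules could combine is sketched below. This is not the proposed implementation: `Event`, `select_prunable`, and the `consumer_positions` mapping are hypothetical, event order is reduced to an integer `id`, and the threshold is assumed to cap only the records held back for consumers, while start records of running resources are always kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical, simplified usage event."""
    id: int             # increasing id, stands in for event order
    resource_guid: str
    state: str          # "STARTED" or "STOPPED"
    age_days: int


def select_prunable(events, cutoff_days, consumer_positions, threshold):
    """Return ids of events that may be pruned.

    consumer_positions maps a consumer guid to the id of the last event
    that consumer has processed (an assumed bookkeeping format).
    """
    stopped = {e.resource_guid for e in events if e.state == "STOPPED"}
    min_processed = min(consumer_positions.values(), default=None)

    prunable, held_for_consumers = [], []
    for e in sorted(events, key=lambda e: e.id):
        if e.age_days <= cutoff_days:
            continue  # not old enough to prune
        if e.state == "STARTED" and e.resource_guid not in stopped:
            continue  # keep start records of still-running resources
        if min_processed is not None and e.id > min_processed:
            held_for_consumers.append(e)  # some consumer has not seen it
        else:
            prunable.append(e)

    # Threshold rule: if too many records are held back for consumers,
    # release the oldest ones for pruning.
    excess = len(held_for_consumers) - threshold
    if excess > 0:
        prunable.extend(held_for_consumers[:excess])
    return sorted(e.id for e in prunable)


events = [
    Event(1, "app-a", "STARTED", 40),  # still running
    Event(2, "app-b", "STARTED", 39),
    Event(3, "app-b", "STOPPED", 38),
    Event(4, "app-c", "STARTED", 37),
    Event(5, "app-c", "STOPPED", 36),
    Event(6, "app-d", "STARTED", 5),   # recent
]

print(select_prunable(events, 31, {"billing": 3}, threshold=10))
print(select_prunable(events, 31, {"billing": 3}, threshold=1))
print(select_prunable(events, 31, {}, threshold=10))
```

With a generous threshold only the events the `billing` consumer has already processed are pruned. With a threshold of 1, the oldest held record is released as well. With no registered consumers, every old event except the start record of the still-running `app-a` is pruned.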
Questions for the Community
- Have folks run into a similar challenge with `start` events being pruned prematurely, leading to confusion about how long resources have been running?
- Have you had to use `destructively_purge_all_and_reseed` in a similar manner?
- Does retaining usage events of running Apps and Services sound like a beneficial idea?
- Do consumer registration and threshold-based pruning strike a reasonable balance between data retention and database size management?
- Are there alternative approaches that could better manage event pruning while preserving critical usage data?