## Issue Summary

- **Severity:** CRITICAL
- **Status:** Recurring (3+ occurrences in 24 hours)
- **Affected Services:** currencyservice (primary), store-checkoutservice, store-frontend, store-cartservice, store-productcatalogservice
- **Environment:** Kubernetes cluster store-kubecluster, namespace store
- **First Occurrence:** January 20, 2026 at 18:49 UTC
## Problem Description

The currencyservice is experiencing recurring CPU throttling events (100% CPU utilization) that cause cascading performance degradation across dependent microservices. During these events, checkout service response times increased ~135x (from 0.02s to 2.7s), triggering CRITICAL alerts across the platform.
## Root Cause Analysis

### Timeline of Events
| Time (UTC) | Event | CPU Usage | Impact |
|---|---|---|---|
| 18:49 | Incident Start | 3% → 100% | currencyservice CPU spike |
| 18:49-18:59 | Peak Degradation | 100% | Response time 2.7s (was 0.02s) |
| 18:59-19:09 | Continued Throttling | 90-100% | Traffic dropped 75% |
| 19:09-19:29 | Recovery | 3% | Normal operations resumed |
| 19:29+ | Recurrence | 100% | Issue repeating |
### Evidence

1. **currencyservice CPU Metrics**
   - Normal: ~3% CPU utilization
   - Incident: 100% CPU utilization (33x increase)
   - Pod: currencyservice-79b494fbd5-8dxln
2. **Checkout Service Impact**
   - Average duration (normal): 0.019s
   - Average duration (incident): 2.71s
   - P95 duration: 0.297s (was 0.10s)
   - Throughput drop: 340 → 290 requests/hour
3. **Traffic Pattern Anomaly**
   - Normal: ~7,900 requests/30 min (productcatalog), ~4,400 (currency)
   - During incident: dropped to 900-1,400 requests (~75% reduction)
   - After recovery: returned to normal levels

   This indicates the incident was NOT a traffic spike; rather, performance degradation caused timeouts/failures that suppressed throughput.
4. **Infrastructure Context**
   - Server ip-172-31-13-57: ~11.5% average CPU (within normal range)
   - Memory: data unavailable; a memory-related cause cannot be ruled out
   - No deployment events in the last 48 hours
## Likely Root Causes

1. **Runaway Process/Thread** (⚠️ MOST LIKELY)
   - CPU pegged at 100% without a corresponding traffic increase
   - Suggests an infinite loop, deadlock, or resource leak
   - The Node.js v12.13.0 runtime may have GC issues
2. **Memory Leak**
   - Long-running pod accumulating memory over time
   - Triggering GC thrashing in Node.js
3. **Bad Request Pattern**
   - A specific currency conversion triggering an expensive operation
   - Edge case in the currency calculation logic
4. **External Dependency**
   - If the currency service calls an external API for exchange rates, rate limiting or timeouts could be a factor
5. **Container Resource Limits**
   - Kubernetes CFS quota enforcement
   - CPU throttling at the container level
## Impact Analysis

### Directly Affected Services

- ✅ store-checkoutservice - response time anomaly alerts
- ✅ store-frontend - response time > 80ms threshold
- ✅ store-currencyservice - primary affected service
- ✅ store-cartservice - response time anomalies
### Alert Summary (Last 24 Hours)

- Total issues: 7 for checkout service (all auto-resolved after CPU recovery)
- Currently active: 4 CRITICAL issues still open
- Pattern: recurring every ~2-3 hours
### Business Impact

- 75% drop in successful transactions during incidents
- User-facing checkout failures
- Revenue loss during the ~30-minute incident windows
## Recommended Actions

### Immediate (P0)

- ⚡ Restart the currencyservice pod to clear a potential memory leak
- 🔍 Collect a heap dump/CPU profile before restarting, for analysis
- 📊 Review currencyservice logs around 18:49 UTC for anomalous requests
### Short-term (P1)

1. ✅ Add/verify resource requests and limits for currencyservice:

   ```yaml
   resources:
     requests:
       cpu: 100m
       memory: 256Mi
     limits:
       cpu: 500m
       memory: 512Mi
   ```

2. 📈 Enable a Horizontal Pod Autoscaler (HPA)
   - Target: 50% CPU utilization
   - Min replicas: 2
   - Max replicas: 5
3. 🚨 Add detailed monitoring
   - Per-thread CPU metrics
   - Heap size and GC metrics
   - External API latency, if applicable
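As a sketch, the HPA targets above (50% CPU, 2-5 replicas) could be expressed with the standard `autoscaling/v2` API; this assumes the service runs as a Deployment named `currencyservice` in the `store` namespace:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: currencyservice
  namespace: store
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: currencyservice
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
```

Note that CPU-utilization HPAs require the resource requests from step 1 to be set, since utilization is computed relative to the request.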
### Long-term (P2)

1. 🔬 Code Review
   - Identify expensive currency conversion operations
   - Add caching for frequently requested currency pairs
   - Review the Node.js version (12.13.0 → latest LTS)
2. 🎯 Circuit Breaker Pattern
   - Implement timeouts and fallbacks for currency conversions
   - Prevent cascading failures in dependent services
3. 📝 Load Testing
   - Reproduce the issue with specific currency conversion patterns
   - Identify memory leaks through sustained load
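The caching idea in item 1 could be sketched as a small TTL cache keyed by currency pair, so repeated conversions skip the expensive lookup. This is illustrative only; `RateCache` and `fetchRate` are hypothetical names, not the service's actual code.

```typescript
// Hypothetical TTL cache for exchange rates: repeated lookups for the
// same currency pair within the TTL skip the expensive rate fetch.
class RateCache {
  private cache = new Map<string, { rate: number; expires: number }>();

  constructor(private ttlMs: number) {}

  async getRate(
    from: string,
    to: string,
    fetchRate: (from: string, to: string) => Promise<number>,
  ): Promise<number> {
    const key = `${from}->${to}`;
    const hit = this.cache.get(key);
    if (hit && hit.expires > Date.now()) return hit.rate; // cache hit
    const rate = await fetchRate(from, to); // expensive path
    this.cache.set(key, { rate, expires: Date.now() + this.ttlMs });
    return rate;
  }
}
```

A short TTL (e.g. 60s) keeps rates fresh while absorbing the hot-path load that would otherwise hit the conversion logic on every request.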
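The circuit-breaker pattern in item 2 could look like the following minimal sketch: a wrapper that times out slow conversions and, after repeated failures, fails fast to a fallback instead of letting slow calls cascade into checkout. The `CircuitBreaker` class and its parameters are assumptions for illustration, not the service's actual implementation.

```typescript
// Minimal circuit breaker: after maxFailures consecutive failures the
// circuit opens for cooldownMs, during which calls return the fallback
// immediately instead of invoking the (possibly hung) dependency.
class CircuitBreaker {
  private failures = 0;
  private openUntil = 0;

  constructor(
    private maxFailures: number,
    private cooldownMs: number,
    private timeoutMs: number,
  ) {}

  async call<T>(fn: () => Promise<T>, fallback: () => T): Promise<T> {
    if (Date.now() < this.openUntil) return fallback(); // circuit open: fail fast
    try {
      const result = await Promise.race([
        fn(),
        new Promise<never>((_, reject) =>
          setTimeout(() => reject(new Error("timeout")), this.timeoutMs),
        ),
      ]);
      this.failures = 0; // success closes the circuit
      return result;
    } catch {
      if (++this.failures >= this.maxFailures) {
        this.openUntil = Date.now() + this.cooldownMs; // trip the breaker
      }
      return fallback();
    }
  }
}
```

For currency conversion, a reasonable fallback is a cached (possibly stale) exchange rate, which keeps checkout responsive while the currencyservice recovers.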
## Additional Context

### No Recent Deployments

- Zero code deployments or configuration changes in the last 48 hours
- The issue is runtime/operational; it was not triggered by a recent code change

### Server Metrics (ip-172-31-13-57)

- Overall CPU: 11.5% (healthy)
- Memory data: unavailable
- Multiple services running on the same node

### Related Issues

- Multiple similar "App response time anomaly" patterns
- Container instability on the currencyservice pod
## Monitoring Links

- Checkout Service Dashboard
- Issue ID: f0d9d5bb-a687-4d4e-8401-9fc0b5e2da12
- Account: MSFT_BUILD_2025 (6751398)

**Priority:** P0 - Recurring production incident causing customer impact
**Next Review:** Monitor for recurrence within the next 4 hours