CRITICAL: currencyservice CPU throttling causing cascading failures across checkout, frontend, and cart services #3222

@cswita

Issue Summary

Severity: CRITICAL
Status: Recurring (3+ occurrences in 24 hours)
Affected Services: currencyservice (primary), store-checkoutservice, store-frontend, store-cartservice, store-productcatalogservice
Environment: Kubernetes cluster store-kubecluster, namespace store
First Occurrence: January 20, 2026 at 18:49 UTC

Problem Description

The currencyservice is experiencing recurring CPU throttling events (100% CPU utilization) causing cascading performance degradation across dependent microservices. During these events, checkout service response times increased 135x (from 0.02s to 2.7s), triggering CRITICAL alerts across the platform.

Root Cause Analysis

Timeline of Events

Time (UTC)   | Event                | CPU Usage | Impact
18:49        | Incident Start       | 3% → 100% | currencyservice CPU spike
18:49-18:59  | Peak Degradation     | 100%      | Response time 2.7s (was 0.02s)
18:59-19:09  | Continued Throttling | 90-100%   | Traffic dropped 75%
19:09-19:29  | Recovery             | 3%        | Normal operations resumed
19:29+       | Recurrence           | 100%      | Issue repeating

Evidence

1. currencyservice CPU Metrics

Normal:   ~3% CPU utilization
Incident: 100% CPU utilization (33x increase)
Pod:      currencyservice-79b494fbd5-8dxln

2. Checkout Service Impact

Average Duration (Normal): 0.019s
Average Duration (Incident): 2.71s
P95 Duration: 0.297s (was 0.10s)
Throughput Drop: 340 → 290 requests/hour

3. Traffic Pattern Anomaly

  • Normal: ~7,900 requests/30min (productcatalog), ~4,400 (currency)
  • During Incident: Dropped to roughly 900-1,400 requests/30min (~75% reduction)
  • After Recovery: Returned to normal levels

This pattern points to performance degradation causing timeouts and failures, not to a traffic spike.
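As a quick sanity check, the ratios quoted in the evidence can be recomputed from the raw figures. The sketch below only uses numbers already listed above; the helper names are illustrative. The 135x figure follows from the rounded values (0.02s → 2.7s), while the unrounded averages (0.019s → 2.71s) give roughly 143x. The checkout throughput drop (340 → 290 requests/hour) works out to about 15%.

```python
# Figures copied from the Evidence section above.
cpu_normal_pct, cpu_incident_pct = 3.0, 100.0
avg_normal_s, avg_incident_s = 0.019, 2.71
throughput_normal, throughput_incident = 340, 290  # requests/hour


def ratio(before: float, after: float) -> float:
    """Return how many times larger `after` is than `before`."""
    return after / before


def percent_drop(before: float, after: float) -> float:
    """Return the relative decrease from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


cpu_factor = ratio(cpu_normal_pct, cpu_incident_pct)
latency_factor = ratio(avg_normal_s, avg_incident_s)
latency_factor_rounded = ratio(0.02, 2.7)  # rounded values from the summary
throughput_drop = percent_drop(throughput_normal, throughput_incident)

print(f"CPU increase:            {cpu_factor:.1f}x")
print(f"Latency (unrounded):     {latency_factor:.1f}x")
print(f"Latency (rounded):       {latency_factor_rounded:.1f}x")
print(f"Checkout throughput drop: {throughput_drop:.1f}%")
```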

4. Infrastructure Context

  • Server: ip-172-31-13-57
  • CPU: ~11.5% average (within normal range)
  • Memory: Data unavailable, so a memory-related cause can be neither confirmed nor ruled out
  • No deployment events in last 48 hours

Likely Root Causes

  1. Runaway Process/Thread ⚠️ MOST LIKELY

    • CPU pegged at 100% without traffic increase
    • Suggests an infinite loop, busy-wait, or resource leak (a true deadlock would typically leave the CPU idle rather than pegged)
    • Node.js v12.13.0 is an end-of-life runtime and may have GC issues
  2. Memory Leak

    • Long-running pod accumulating memory over time
    • Triggering GC thrashing in Node.js
  3. Bad Request Pattern

    • Specific currency conversion triggering expensive operation
    • Edge case in currency calculation logic
  4. External Dependency

    • If currency service calls external API for exchange rates
    • Potential rate limiting or timeout issues
  5. Container Resource Limits

    • Kubernetes CFS quota enforcement
    • CPU throttling at container level
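To tell cause 5 (CFS quota enforcement) apart from a genuinely runaway process, one option is to read the container's cgroup `cpu.stat` counters and compute how often the container was throttled. The sketch below assumes the standard `nr_periods` and `nr_throttled` fields, which exist in both cgroup v1 and v2; the sample values are hypothetical and not taken from this incident.

```python
from typing import Dict

# Hypothetical sample of a cgroup v2 cpu.stat file (not real incident data).
# Inside the container this is typically /sys/fs/cgroup/cpu.stat (v2)
# or /sys/fs/cgroup/cpu/cpu.stat (v1).
SAMPLE_CPU_STAT = """\
usage_usec 98000000
nr_periods 1200
nr_throttled 900
throttled_usec 54000000
"""


def parse_cpu_stat(text: str) -> Dict[str, int]:
    """Parse 'key value' lines from cpu.stat into a dict of integers."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_fraction(stats: Dict[str, int]) -> float:
    """Fraction of CFS enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods


stats = parse_cpu_stat(SAMPLE_CPU_STAT)
print(f"Throttled in {throttled_fraction(stats):.0%} of periods")
```

A high fraction would suggest the container is hitting its CPU limit; a fraction near zero alongside 100% reported CPU would point back toward the process itself.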

Impact Analysis

Directly Affected Services

  • store-checkoutservice - Response time anomaly alerts
  • store-frontend - Response time > 80ms threshold
  • store-currencyservice - Primary affected service
  • store-cartservice - Response time anomalies

Alert Summary (Last 24 Hours)

  • Total Issues: 7 for checkout service (all auto-resolved after CPU recovery)
  • Currently Active: 4 CRITICAL issues still open
  • Pattern: Recurring every ~2-3 hours

Business Impact

  • 75% drop in successful transactions during incidents
  • User-facing checkout failures
  • Revenue loss during ~30-minute windows

Recommended Actions

Immediate (P0)

  1. 🔍 Collect a heap dump and CPU profile from the running pod before restarting it, so the evidence is not lost
  2. Restart the currencyservice pod to clear the potential memory leak
  3. 📊 Review logs from currencyservice around 18:49 UTC for anomalous requests

Short-term (P1)

  1. Add/verify resource limits and requests for currencyservice

    resources:
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 512Mi
  2. 📈 Enable Horizontal Pod Autoscaler (HPA)

    • Target: 50% CPU utilization
    • Min replicas: 2
    • Max replicas: 5
  3. 🚨 Add detailed monitoring

    • CPU per-thread metrics
    • Heap size and GC metrics
    • External API latency if applicable
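To see how the proposed HPA settings would behave, the sketch below applies the scaling rule documented for the Kubernetes HPA, `desired = ceil(current * currentUtilization / targetUtilization)`, clamped to the min/max replicas proposed above. It is a simplification that ignores the HPA tolerance band and stabilization windows, and the function name is illustrative. HPA CPU utilization is measured against the pod's CPU request, so with a 100m request and a 500m limit a pod running at its limit would report about 500%.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization_pct: float,
                     target_utilization_pct: float = 50.0,
                     min_replicas: int = 2,
                     max_replicas: int = 5) -> int:
    """Simplified HPA rule: scale proportionally to utilization, then clamp."""
    raw = math.ceil(current_replicas * current_utilization_pct / target_utilization_pct)
    return max(min_replicas, min(max_replicas, raw))


# Normal load (~3% utilization): stays at the minimum.
print(desired_replicas(2, 3.0))
# Utilization at 100% of the CPU request: scales out.
print(desired_replicas(2, 100.0))
# Pod pinned at a 500m limit with a 100m request (~500%): capped at the maximum.
print(desired_replicas(2, 500.0))
```

If the root cause is a runaway loop in each pod rather than load, extra replicas would add headroom but would not stop individual pods from saturating.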

Long-term (P2)

  1. 🔬 Code Review

    • Identify expensive currency conversion operations
    • Add caching for frequently requested currency pairs
    • Review Node.js version (12.13.0 → latest LTS)
  2. 🎯 Circuit Breaker Pattern

    • Implement timeout and fallback for currency conversions
    • Prevent cascade failures to dependent services
  3. 📝 Load Testing

    • Reproduce issue with specific currency conversion patterns
    • Identify memory leak through sustained load
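The circuit breaker item can be sketched as follows. This is a minimal, language-agnostic illustration written in Python, not the service's actual code (currencyservice runs on Node.js); the class name, thresholds, and fallback behavior are all assumptions. It treats any exception from the wrapped call, including a timeout raised by the caller's RPC client, as a failure. After a configurable number of consecutive failures it stops calling the dependency and returns the fallback until a cool-down period has passed.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Minimal circuit breaker: closed -> open after N failures -> trial call after cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._failure_threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        # While open and still cooling down, skip the dependency entirely.
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                return fallback()
        try:
            result = fn()
        except Exception:
            # Count the failure; open (or re-open) the circuit at the threshold.
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result


# Usage with a simulated failing conversion call (hypothetical).
breaker = CircuitBreaker(failure_threshold=2, reset_after_s=10.0)


def convert_failing() -> str:
    raise TimeoutError("simulated currencyservice timeout")


print(breaker.call(convert_failing, lambda: "use last known rate"))
```

For checkout, a plausible fallback would be a last known exchange rate or a default currency, so that a slow currencyservice degrades the result rather than blocking the request.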

Additional Context

No Recent Deployments

  • Zero code deployments or configuration changes in last 48 hours
  • No recent change triggered the issue; it manifests at runtime, though a latent code defect is not ruled out

Server Metrics (ip-172-31-13-57)

  • Overall CPU: 11.5% (healthy)
  • Memory data: Unavailable
  • Multiple services running on same node

Related Issues

  • Multiple similar "App response time anomaly" patterns
  • Container instability on currencyservice pod

Monitoring Links


Priority: P0 - Recurring production incident causing customer impact
Next Review: Monitor for recurrence within next 4 hours
