This is expected behavior — LiteLLM is not single-threaded despite using asyncio.

### Why you see multi-core CPU usage

**1. ThreadPoolExecutor with 100 workers**

LiteLLM has a global thread pool:

```python
MAX_THREADS = 100
executor = ThreadPoolExecutor(max_workers=MAX_THREADS)
```

This is used by the

**2. Additional thread sources**
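One way to see these extra threads for yourself is to submit a few tasks to a pool like the one above and then enumerate the live threads. This is a standalone sketch — the `MAX_THREADS`/`executor` names mirror the snippet above, everything else is illustrative:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_THREADS = 100  # mirrors the constant above; illustrative value
executor = ThreadPoolExecutor(max_workers=MAX_THREADS)

def task():
    # Each submitted task runs on a pool worker thread, not the main thread.
    return threading.current_thread().name

futures = [executor.submit(task) for _ in range(4)]
names = [f.result() for f in futures]

# Besides MainThread, the process now holds lazily created pool workers;
# an HTTP client's connection pool or logging handlers can add more.
worker_threads = [t.name for t in threading.enumerate()
                  if t.name.startswith("ThreadPoolExecutor")]
print(names, worker_threads)
executor.shutdown()
```

Workers are created lazily, so a mostly idle proxy may hold far fewer than 100 threads — but under load the pool can grow toward `max_workers`.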
**3. GIL doesn't prevent CPU throttling**

The GIL prevents parallel Python bytecode execution, but:
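A quick way to convince yourself of this point: many C-level operations — blocking syscalls, and for example `hashlib` on large buffers — release the GIL, so two Python threads really can occupy two cores at once. A minimal sketch (not LiteLLM code):

```python
import hashlib
import threading
import time

data = b"x" * (32 * 1024 * 1024)  # 32 MiB of input

def digest():
    # CPython's hashlib releases the GIL while hashing large buffers,
    # so two of these threads can run on two cores simultaneously.
    return hashlib.sha256(data).hexdigest()

t0 = time.perf_counter()
d1 = digest()
d2 = digest()
sequential = time.perf_counter() - t0

t0 = time.perf_counter()
threads = [threading.Thread(target=digest) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - t0

# On a multi-core machine the threaded run approaches half the sequential
# wall time: that parallel C-level work is what shows up as multi-core
# CPU usage, and what the kernel throttles against the cgroup limit.
print(f"sequential {sequential:.3f}s, threaded {threaded:.3f}s")
```

The same applies to TLS, compression, and socket I/O in the HTTP stack a proxy spends most of its time in — none of that is serialized by the GIL.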
### Recommendations
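The recommendations were cut off above, so take this as one generic mitigation rather than the maintainers' guidance: bound the event loop's default executor so the thread count tracks the container's CPU limit instead of defaulting to a large pool.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

async def main():
    loop = asyncio.get_running_loop()
    # Assumption: with a CPU limit of 4, a small bounded pool (instead of
    # 100 workers) keeps the runnable-thread count near the cgroup quota.
    loop.set_default_executor(ThreadPoolExecutor(max_workers=8))
    # loop.run_in_executor(None, ...) now routes through the bounded pool.
    return await loop.run_in_executor(None, sum, range(10))

print(asyncio.run(main()))  # 45
```

This only caps threads the loop's default executor spawns; separately created pools (like the global one above) need their own limits.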
Haven't deep-dived into the architecture myself yet. Deployed LiteLLM in k8s with decent traffic going through each instance. The CPU limit is 4, and there are 4 pods.
I don't quite understand why I am seeing kernel CPU throttling of the LiteLLM pods. Each pod runs FastAPI + asyncio + uvicorn, with 1 uvicorn worker configured per pod/container. Python has the GIL, and no multiprocessing is used.
I am not intimately familiar with asyncio / fastapi. Does it spawn multiple threads? I thought it was an event loop in single thread. Are the multiple threads somehow not subject to the GIL? What's going on?
Grateful for any thoughts from the devs / folks who have deep-dived into the architecture 🙏