[WIP][HiCache][HA 2/N] feat: Support HiCache warmup for cold start acceleration #20105

alphabetc1 wants to merge 5 commits into sgl-project:main
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a global warmup mechanism for HiCache, designed to pre-populate the host-level KV cache from shared storage during startup. It aims to significantly accelerate cold starts by reducing recomputation, particularly for frequently accessed or "pinned" prefixes, thereby mitigating the "cache stampede" problem.
Code Review
This pull request introduces a HiCache warmup mechanism to accelerate cold starts, a valuable feature implemented with well-structured background threads and queues for I/O operations. While the implementation correctly handles cache entry metadata and high-priority loading, several critical issues were identified. A distributed deadlock can occur during warmup if TP ranks have inconsistent manifests, potentially hanging the server during startup. The default use of a shared temporary directory (/tmp/hicache) for storage risks data leakage and cache poisoning. Additionally, the manifest file's lack of size limits or rotation could be exploited for denial of service via memory exhaustion. General improvements are also suggested for the scalability of the warmup manifest file and for ensuring warmed-up entry priority informs eviction policies.
```python
torch.distributed.all_reduce(
    t,
    op=torch.distributed.ReduceOp.MIN,
    group=tp_group,
)
```
The warmup thread performs a torch.distributed.all_reduce operation inside a loop over entries retrieved from the storage backend. Since each rank in a Tensor Parallel (TP) group may have a different set of entries in its local manifest (due to non-deterministic timestamps or previous partial failures), the number of iterations can differ between ranks. This will cause ranks with fewer entries to exit the loop while others are still waiting for an all_reduce, resulting in a permanent deadlock of the warmup process and potentially the entire server during startup.
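One way to avoid the mismatch is to agree on a common iteration count across all TP ranks before entering the per-entry loop. The sketch below models that idea with plain `min()` in place of the real `torch.distributed.all_reduce(ReduceOp.MIN)` so the logic is testable without a process group; the names `agree_on_entry_count` and `plan_warmup_iterations` are illustrative, not the PR's actual API.

```python
def agree_on_entry_count(local_counts):
    """Stand-in for a single all_reduce(MIN) over per-rank manifest sizes.

    In the real code this would be one torch.distributed.all_reduce with
    ReduceOp.MIN over tp_group, executed exactly once before the loop.
    """
    return min(local_counts)

def plan_warmup_iterations(per_rank_entries):
    """Truncate every rank's entry list to the agreed minimum count.

    All ranks then execute the same number of collective calls inside the
    warmup loop, so no rank exits early while peers wait in an all_reduce.
    """
    common = agree_on_entry_count([len(entries) for entries in per_rank_entries])
    return [entries[:common] for entries in per_rank_entries]
```

Truncating to the minimum trades some warmup coverage on the ranks with larger manifests for a guarantee that the collective call counts match.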
```python
self._manifest_path = os.path.join(
    self.file_path, f"__warmup_manifest__{self.config_suffix}.jsonl"
)
```
The HiCache storage backend defaults to using /tmp/hicache for storing KV cache data and the warmup manifest. On multi-user systems, the /tmp directory is shared. This allows other users on the same machine to read sensitive KV cache data (which may contain PII from prompts) or poison the cache by modifying the manifest file or injecting malicious cache entries. Using a shared temporary directory for sensitive data without restricted permissions is a significant security risk.
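A minimal hardening sketch (hypothetical helper, not the backend's code): create the storage directory with owner-only permissions so other local users cannot read KV data or tamper with the manifest.

```python
import os
import stat

def make_private_cache_dir(path):
    """Create the HiCache storage directory readable only by its owner.

    0o700 denies group/other access entirely; os.makedirs honors the
    process umask, so the mode is enforced again with an explicit chmod.
    """
    os.makedirs(path, mode=0o700, exist_ok=True)
    os.chmod(path, 0o700)
    return path
```

A per-user path (e.g. derived from `tempfile.mkdtemp` or the user's home directory) rather than a fixed shared `/tmp/hicache` would remove the contention entirely.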
```python
with open(self._manifest_path, "r") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        chain_key = tuple(record["hash_chain"])
        # Later entries overwrite earlier (dedup by recency)
        seen[chain_key] = record
```
The HiCacheFile backend appends records to the JSONL manifest file without size limits or rotation. This can lead to the file growing indefinitely, causing high memory consumption and slow startup times during warmup. Critically, an attacker with write access to the storage directory (defaulting to /tmp/hicache) could exploit this by appending a massive number of entries, leading to excessive memory consumption (OOM) and CPU usage, resulting in a Denial of Service. Consider implementing a mechanism to manage the manifest file's size, such as periodic truncation or rotation, or processing only the most recent portion of the file during warmup to limit memory usage.
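One bounded-memory variant (an illustrative sketch, not the PR's implementation): keep the dedup-by-recency behavior but cap the number of retained records, evicting the oldest when the cap is exceeded, so a maliciously bloated manifest cannot exhaust memory.

```python
import json
from collections import OrderedDict

def load_manifest_bounded(path, max_records=10000):
    """Load a JSONL warmup manifest while retaining at most max_records
    distinct hash chains; later records for the same chain still win."""
    seen = OrderedDict()
    with open(path, "r") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue
            chain_key = tuple(record["hash_chain"])
            # Re-insert so a duplicate chain moves to the recent end.
            seen.pop(chain_key, None)
            seen[chain_key] = record
            if len(seen) > max_records:
                # Evict the least recently seen chain to bound memory.
                seen.popitem(last=False)
    return seen
```

Periodic truncation or log rotation on the write path would complement this read-side bound.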
```python
token_ids, hash_chain, host_indices, priority = item
key = RadixKey(token_ids=token_ids)
self._insert_helper_host(self.root_node, key, host_indices, hash_chain)
```
The priority of the warmup entry is retrieved from the queue but is not used when inserting the entry into the radix tree via _insert_helper_host. This priority information should be propagated to the TreeNode to influence priority-based eviction strategies. Consider modifying _insert_helper_host to accept and utilize this priority value.
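The suggested change can be sketched with a minimal tree model. The real `_insert_helper_host` and `TreeNode` differ; this only illustrates the shape of threading `priority` through the insertion path so eviction can consult it.

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    # Highest warmup priority seen on any entry passing through this node;
    # a priority-aware eviction policy would prefer keeping high values.
    priority: int = 0
    children: dict = field(default_factory=dict)

def insert_with_priority(node, key_tokens, priority):
    """Walk/extend the tree, recording the max priority on each visited node."""
    node.priority = max(node.priority, priority)
    for tok in key_tokens:
        node = node.children.setdefault(tok, TreeNode())
        node.priority = max(node.priority, priority)
    return node
```

Taking the max (rather than overwriting) ensures a later low-priority insert cannot demote a prefix that a pinned entry shares.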
Motivation
New sglang instances start with empty KV cache, causing all requests to recompute from scratch. Global warmup enables pre-populating host-level cache from shared storage on startup, prioritizing pinned prefixes, eliminating the cold start "cache stampede" problem.
Modifications
Extended storage write path to persist priority and token_ids metadata via record_warmup_metadata. Added WarmupEntry dataclass and list_warmup_entries to storage ABC with HiCacheFile JSONL manifest reference implementation. Implemented background warmup thread in HiCacheController with queue-based, thread-safe radix tree insertion on the main thread. Added warmup_ratio config. Mirrored changes for Mamba cache.
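The metadata described above could take roughly the following shape. Only `priority`, `token_ids`, and `hash_chain` are named in the PR text; the serialization helpers below are illustrative guesses at the JSONL manifest record format, not the actual `WarmupEntry` definition.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class WarmupEntry:
    """One warmup manifest record: the hash chain identifying the cached
    prefix, its token ids, and a priority hint for load ordering/eviction."""
    hash_chain: list
    token_ids: list
    priority: int = 0

    def to_jsonl(self) -> str:
        # One JSON object per line; the manifest is append-only.
        return json.dumps(asdict(self))

    @classmethod
    def from_jsonl(cls, line: str) -> "WarmupEntry":
        return cls(**json.loads(line))
```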
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci