2 changes: 1 addition & 1 deletion docs/configuration/tpu.md
@@ -46,7 +46,7 @@ This initial compilation time ranges significantly and is impacted by many of th

#### max model len vs. most model len

-![most_model_len](../assets/design/v1/tpu/most_model_len.png)
+![most_model_len](../assets/design/tpu/most_model_len.png)

If most of your requests are shorter than the maximum model length but you still need to accommodate occasional longer requests, setting a high maximum model length can negatively impact performance. In these cases, you can try introducing most model len by specifying the `VLLM_TPU_MOST_MODEL_LEN` environment variable.
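As a minimal sketch of how this might be wired up, assuming a hypothetical deployment with an 8192-token hard limit where most traffic stays under 2048 tokens (the model name and the values are placeholders, not recommendations):

```python
import os

# Hypothetical values: the engine still accepts requests up to
# max_model_len, but optimizes for the common <=2048-token case.
# The variable must be set before the engine is initialized.
os.environ["VLLM_TPU_MOST_MODEL_LEN"] = "2048"

from vllm import LLM

# Placeholder model name; substitute any TPU-supported model.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_model_len=8192)
```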

6 changes: 3 additions & 3 deletions docs/design/metrics.md
@@ -223,7 +223,7 @@ And the calculated intervals are:

Put another way:

-![Interval calculations - common case](../../assets/design/v1/metrics/intervals-1.png)
+![Interval calculations - common case](../assets/design/metrics/intervals-1.png)

We explored the possibility of having the frontend calculate these
intervals using the timing of events visible to the frontend. However,

@@ -238,13 +238,13 @@ When a preemption occurs during decode, since any already generated
tokens are reused, we consider the preemption as affecting the
inter-token, decode, and inference intervals.

-![Interval calculations - preempted decode](../../assets/design/v1/metrics/intervals-2.png)
+![Interval calculations - preempted decode](../assets/design/metrics/intervals-2.png)
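As a toy illustration, with invented timestamps rather than vLLM's actual stats objects, the preemption gap simply shows up as one stretched inter-token interval, which in turn lengthens the decode and inference intervals:

```python
# Token emission times for one request (invented values): the request was
# preempted after t=10.2 and resumed decoding at t=13.5.
first_scheduled = 9.5
token_times = [10.0, 10.1, 10.2, 13.5, 13.6]

inter_token = [round(b - a, 1) for a, b in zip(token_times, token_times[1:])]
decode = round(token_times[-1] - token_times[0], 1)
inference = round(token_times[-1] - first_scheduled, 1)

print(inter_token)  # [0.1, 0.1, 3.3, 0.1] -> the 3.3s gap is the preemption
print(decode)       # 3.6 -> decode interval stretched by the preemption
print(inference)    # 4.1 -> inference interval stretched as well
```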

When a preemption occurs during prefill (assuming such an event
is possible), we consider the preemption as affecting the
time-to-first-token and prefill intervals.

-![Interval calculations - preempted prefill](../../assets/design/v1/metrics/intervals-3.png)
+![Interval calculations - preempted prefill](../assets/design/metrics/intervals-3.png)

### Frontend Stats Collection

16 changes: 8 additions & 8 deletions docs/design/prefix_caching.md
@@ -122,7 +122,7 @@ There are two design points to highlight:

As a result, we will have the following components when the KV cache manager is initialized:

-![Component Overview](../../assets/design/v1/prefix_caching/overview.png)
+![Component Overview](../assets/design/prefix_caching/overview.png)

* Block Pool: A list of KVCacheBlock.
* Free Block Queue: Stores only the head and tail block pointers for manipulation, as sketched below.
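As a loose sketch (not vLLM's actual classes), the free block queue can be modeled as a doubly linked list over simplified `KVCacheBlock` objects; the O(1) `remove` is what makes "touching" a cached block cheap later on:

```python
class KVCacheBlock:
    def __init__(self, block_id: int):
        self.block_id = block_id
        self.ref_cnt = 0        # number of requests using this block
        self.block_hash = None  # set once the block is full and cached
        self.prev = None
        self.next = None


class FreeBlockQueue:
    """Doubly linked list over blocks; stores only head and tail pointers."""

    def __init__(self, blocks):
        for a, b in zip(blocks, blocks[1:]):
            a.next, b.prev = b, a
        self.head, self.tail = blocks[0], blocks[-1]

    def popleft(self):
        """Take the eviction candidate from the head (least recently used)."""
        block = self.head
        self.remove(block)
        return block

    def remove(self, block):
        """Unlink a block anywhere in O(1), e.g. when a cached block is touched."""
        if block.prev:
            block.prev.next = block.next
        else:
            self.head = block.next
        if block.next:
            block.next.prev = block.prev
        else:
            self.tail = block.prev
        block.prev = block.next = None

    def append(self, block):
        """Freed blocks join at the tail; they are evicted last."""
        block.prev, block.next = self.tail, None
        if self.tail:
            self.tail.next = block
        else:
            self.head = block
        self.tail = block
```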
@@ -192,7 +192,7 @@ As can be seen, block 3 is a new full block and is cached. However, it is redund

When a request is finished, we free all its blocks if no other requests are using them (reference count = 0). In this example, we free request 1 and blocks 2, 3, 4, and 8 associated with it. We can see that the freed blocks are added to the tail of the free queue in the *reverse* order. This is because the last block of a request must hash more tokens and is less likely to be reused by other requests. As a result, it should be evicted first.

-![Free queue after a request is freed](../../assets/design/v1/prefix_caching/free.png)
+![Free queue after a request is freed](../assets/design/prefix_caching/free.png)

### Eviction (LRU)

@@ -208,24 +208,24 @@ In this example, we assume the block size is 4 (each block can cache 4 tokens),

**Time 1: The cache is empty and a new request comes in.** We allocate 4 blocks. 3 of them are already full and cached. The fourth block is partially full with 3 of 4 tokens.

-![Example Time 1](../../assets/design/v1/prefix_caching/example-time-1.png)
+![Example Time 1](../assets/design/prefix_caching/example-time-1.png)

**Time 3: Request 0 makes block 3 full and asks for a new block to keep decoding.** We cache block 3 and allocate block 4.

-![Example Time 3](../../assets/design/v1/prefix_caching/example-time-3.png)
+![Example Time 3](../assets/design/prefix_caching/example-time-3.png)

**Time 4: Request 1 comes in with 14 prompt tokens, where the first 10 tokens are the same as request 0's.** We can see that only the first 2 blocks (8 tokens) hit the cache, because the 3rd block only matches 2 of its 4 tokens.

-![Example Time 4](../../assets/design/v1/prefix_caching/example-time-4.png)
+![Example Time 4](../assets/design/prefix_caching/example-time-4.png)
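A toy illustration of why only full blocks can hit the cache; the hashing scheme here is simplified (vLLM chains each full block's hash on its parent but includes more metadata), and the token values are invented:

```python
import hashlib

BLOCK_SIZE = 4

def full_block_hashes(tokens: list[int]) -> list[str]:
    """Hash each *full* block, chaining on the previous block's hash."""
    hashes: list[str] = []
    parent = ""
    for i in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
        parent = hashlib.sha256(
            (parent + str(tokens[i:i + BLOCK_SIZE])).encode()
        ).hexdigest()
        hashes.append(parent)
    return hashes

req0 = list(range(14))                     # request 0: 14 prompt tokens
req1 = list(range(10)) + [99, 98, 97, 96]  # request 1: first 10 tokens shared

hits = sum(h0 == h1 for h0, h1 in zip(full_block_hashes(req0),
                                      full_block_hashes(req1)))
print(hits)  # 2 -> only the first 2 blocks (8 tokens) hit the cache
```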

**Time 5: Request 0 is finished and freed.** Blocks 2, 3 and 4 are added to the free queue in reverse order (but blocks 2 and 3 are still cached). Blocks 0 and 1 are not added to the free queue because they are being used by request 1.

-![Example Time 5](../../assets/design/v1/prefix_caching/example-time-5.png)
+![Example Time 5](../assets/design/prefix_caching/example-time-5.png)

**Time 6: Request 1 is finished and freed.**

-![Example Time 6](../../assets/design/v1/prefix_caching/example-time-6.png)
+![Example Time 6](../assets/design/prefix_caching/example-time-6.png)

**Time 7: Request 2 comes in with 29 prompt tokens, where the first 12 tokens are the same as request 0's.** Note that even though the block order in the free queue was `7 - 8 - 9 - 4 - 3 - 2 - 6 - 5 - 1 - 0`, the cache-hit blocks (i.e., 0, 1, 2) are touched and removed from the queue before allocation, so the free queue becomes `7 - 8 - 9 - 4 - 3 - 6 - 5`. As a result, the allocated blocks are 0 (cached), 1 (cached), 2 (cached), 7, 8, 9, 4, 3 (evicted).

-![Example Time 7](../../assets/design/v1/prefix_caching/example-time-7.png)
+![Example Time 7](../assets/design/prefix_caching/example-time-7.png)
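Using the toy `FreeBlockQueue` from earlier, the "touch before allocate" step of this example can be reproduced directly:

```python
blocks = {i: KVCacheBlock(i) for i in range(10)}
queue = FreeBlockQueue([blocks[i] for i in (7, 8, 9, 4, 3, 2, 6, 5, 1, 0)])

for hit in (0, 1, 2):          # request 2's prefix cache hits
    queue.remove(blocks[hit])  # touch: drop them from the eviction order

order, node = [], queue.head
while node:
    order.append(node.block_id)
    node = node.next
print(order)  # [7, 8, 9, 4, 3, 6, 5]
```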