[Compute] Add llm.c request encoder kernel reference #151

<!-- markdownlint-disable-file MD033, MD001 -->

# Kernels for positional encoder forward pass in GPT-2

<!-- Header -->

<!-- Main Body -->

## Introduction

After tokenization, batches of input texts are turned into arrays of token
ids. The GPT-2 encoder turns these input ids into hidden representations that
can be processed by the remainder of the transformer blocks:

- Input:
  - "input" `inp`, produced by the tokenizer: an integer array of shape $(B,\, T)$,
    where $B$ denotes the batch size (a total of $B$ sequences in the batch) while
    $T$ denotes the number of tokens in each sequence (each sequence contains a
    maximum of $T$ tokens).
  - token embeddings `wte`, from the model parameters: a float array of shape
    (vocab_size, $C$), where $C$ is the latent dimension of this transformer model.
    The $n$th row of this matrix is the embedding vector for the $n$th token.
  - position embeddings `wpe`, provided: a float array of shape (max_length, $C$),
    where $C$ is the latent dimension of the transformer model. The $n$th row of
    this matrix is the position embedding vector for a token at position $n$.
- Output: a float tensor of shape $(B, T, C)$, where in addition to the batch ($B$)
  and sequence length ($T$) dimensions, a "channel" dimension $C$ is added ($C$ is
  the same as the latent embedding dimension of the transformer).

> **Reviewer comment:** in-line math in the mdbook's render of markdown math doesn't
> use single "$" as delimiters. Rather it uses '\( \)'. Please replace this and all
> instances of '$ ... $' with '\( ... \)'.

Unlike most other parts of a transformer, the embedding kernel involves notably
few calculations: for the forward pass, both the word embeddings and the position
embeddings are **pre-computed** and available as arrays in memory. The only job
of the encoder kernel is to load the appropriate rows from both matrices, add
them together, and write the sum to the appropriate location in memory.
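
For concreteness, the whole forward pass can be written as a few nested loops. The
sketch below is a minimal CPU-style reference of this load-add-store operation; the
function name and exact signature are illustrative rather than taken verbatim from
the llm.c source linked in the references.

```cuda
// Reference sketch of the encoder forward pass as plain nested loops
// (illustrative; see the llm.c source in the references for the real code).
void encoder_forward_reference(float* out, const int* inp,
                               const float* wte, const float* wpe,
                               int B, int T, int C) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            int ix = inp[b * T + t];                  // token id at position (b, t)
            const float* wte_row = wte + ix * C;      // row of the token embedding table
            const float* wpe_row = wpe + t * C;       // row of the position embedding table
            float* out_bt = out + (b * T + t) * C;    // destination row in the output
            for (int c = 0; c < C; c++) {
                out_bt[c] = wte_row[c] + wpe_row[c];  // elementwise sum of the two rows
            }
        }
    }
}
```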

## Kernel 1

This kernel uses a total of $B \times T$ threads, where each thread handles one token
in the $(B,\, T)$ input array. Within each thread, a for-loop iterates over the
hidden dimension, reading one float from `wte` and one from `wpe`, taking the sum,
and writing it to the output tensor.
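
A sketch of this one-thread-per-token layout is shown below; it follows the description
above, and the kernel name and launch details are illustrative rather than copied from
the llm.c file in the references.

```cuda
// Kernel 1 sketch: one thread per (b, t) token; each thread loops over all C channels.
__global__ void encoder_forward_kernel1(float* out, const int* inp,
                                        const float* wte, const float* wpe,
                                        int B, int T, int C) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    int N = B * T;                                    // one thread per token
    if (idx < N) {
        int b = idx / T;
        int t = idx % T;
        int ix = inp[b * T + t];                      // token id at this position
        const float* wte_row = wte + ix * C;
        const float* wpe_row = wpe + t * C;
        float* out_bt = out + (b * T + t) * C;
        for (int i = 0; i < C; i++) {                 // serial loop over the hidden dimension
            out_bt[i] = wte_row[i] + wpe_row[i];
        }
    }
}
// Launched with roughly ceil(B * T / block_size) blocks of block_size threads.
```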

While this approach might work fine on a CPU because of its predictable memory
access pattern, the per-token for-loop does not fully unleash the parallel-processing
potential of the GPU.

## Kernel 2

This kernel uses substantially more threads than kernel 1, a total of $B \times T \times C$,
one for each element in the output. (Recall that in modern GPTs, the hidden dimension
$C$ is usually in the thousands, if not larger.) Each thread handles one float
in the output tensor: it reads one float from the `wte` token embedding array and
one float from the `wpe` position embedding array, sums the two floating-point
numbers, and writes the result to the appropriate location in the output tensor.
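
The sketch below illustrates this one-thread-per-output-element layout; as before, the
kernel name and indexing details are illustrative, and the exact code lives in the
linked llm.c file.

```cuda
// Kernel 2 sketch: one thread per output element (B * T * C threads in total).
__global__ void encoder_forward_kernel2(float* out, const int* inp,
                                        const float* wte, const float* wpe,
                                        int B, int T, int C) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    int N = B * T * C;
    if (idx < N) {
        int bt = idx / C;                             // which token, flattened over (B, T)
        int b = bt / T;
        int t = bt % T;
        int c = idx % C;                              // which channel within that token
        int ix = inp[b * T + t];                      // token id at this position
        out[idx] = wte[ix * C + c] + wpe[t * C + c];  // one load from each table, one add, one store
    }
}
```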

This approach improves over kernel 1 by using many more threads in parallel. However,
since each thread performs its own memory reads and writes, the memory access pattern
becomes a lot less predictable, and scattered memory access can quickly become a
bottleneck for this approach.

## Kernel 3

On NVIDIA GPUs, reading and writing consecutive chunks of memory at once can be
significantly more efficient than doing so one value at a time. This kernel leverages
this feature to make the memory traffic far more efficient.

Within the kernel, the additions are still written as a for-loop over $\texttt{x128::size}$,
one floating-point number at a time. The kernel uses `#pragma unroll` so that the
compiler automatically unrolls and optimizes this part of the code during compilation.

> **Reviewer comment:** New sentence: "The kernel uses
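
The actual llm.c kernel builds on a small `x128` / `Packed128` helper type for 128-bit
vector loads and stores. The sketch below conveys the same idea using CUDA's built-in
`float4` type instead, so the details (and the type names) differ from the real kernel;
it assumes `C` is divisible by 4 and that the arrays are 16-byte aligned.

```cuda
// Kernel 3 sketch: each thread handles four consecutive floats (128 bits) of the output.
__global__ void encoder_forward_kernel3(float4* out, const int* inp,
                                        const float4* wte, const float4* wpe,
                                        int B, int T, int C) {
    int C4 = C / 4;                                   // number of float4 groups per row
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int N = B * T * C4;
    if (idx < N) {
        int bt = idx / C4;
        int t = bt % T;
        int c4 = idx % C4;
        int ix = inp[bt];                             // token id at this (b, t) position
        float4 wte_val = wte[ix * C4 + c4];           // one 128-bit load from each table
        float4 wpe_val = wpe[t * C4 + c4];
        float4 sum;
        #pragma unroll
        for (int k = 0; k < 4; k++) {                 // small loop over the 4 lanes,
            ((float*)&sum)[k] =                       // unrolled by the compiler
                ((const float*)&wte_val)[k] + ((const float*)&wpe_val)[k];
        }
        out[idx] = sum;                               // single 128-bit store
    }
}
```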

#### References

1. Code for encoder forward kernels from [llm.c](https://github.com/karpathy/llm.c/blob/master/dev/cuda/encoder_forward.cu)

> **Reviewer comment:** Please also change this to MLA style.

<!-- Contributors -->

{{#author VectorInstitute}} <!-- replace VectorInstitute with your github user -->
{{#author jacobthebanana}}

> **Reviewer comment:** nit but can we change the title to "# Forward Pass Kernels of Positional Embeddings within GPT-2"