This project is a Rust implementation of the hand-crafted, decoder-only transformer described in the blog post "I made a transformer by hand (no training!)" by Theia Vogel.
The goal is to replicate the original Python/NumPy implementation using Rust and the `nalgebra` crate for matrix operations, demonstrating the core mechanics of a minimal transformer architecture without any training.
The transformer is designed for a very specific task: predicting the next character in the repeating sequence `aabaabaabaab...`. That is, it predicts the `(aab)*` pattern.
The prediction rule is:
- If the previous two tokens are `aa`, predict `b`.
- If the previous two tokens are `ab`, predict `a`.
- If the previous two tokens are `ba`, predict `a`.
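
For reference, the same rule can be written out directly as a tiny Rust function. This is purely illustrative; in this project the rule is encoded in hand-set weights, not in control flow:

```rust
/// The prediction rule above, written explicitly for the reader's reference.
/// The transformer reproduces it via attention over hand-crafted embeddings,
/// not via branching code like this.
fn rule_next(prev: char, curr: char) -> char {
    match (prev, curr) {
        ('a', 'a') => 'b', // "aa" -> 'b'
        _ => 'a',          // "ab" and "ba" -> 'a'
    }
}
```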
The model follows a simplified, GPT-2-like architecture:
- Decoder-Only: Predicts the next token based on previous tokens.
- Single Transformer Block: Contains one causal self-attention layer.
- Single Attention Head: No multi-head attention.
- No Layer Normalization: Removed for simplicity in hand-crafting weights.
- No MLP/Feed-Forward Layer: The block only contains the attention mechanism and residual connection.
- Hand-Crafted Embeddings: Uses separate, hand-set embeddings for tokens (`wte`) and positions (`wpe`).
- Causal Masking: Prevents attention to future tokens.
Key constants:

- `N_CTX = 5`: Maximum context length (the number of previous tokens considered).
- `N_VOCAB = 2`: Vocabulary size ('a', 'b').
- `N_EMBED = 8`: Embedding dimension size.
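
In Rust these can be declared as plain constants. A minimal sketch (the exact declarations in `src/main.rs` may differ):

```rust
// Hand-chosen model hyperparameters (values from the blog post).
const N_CTX: usize = 5;   // maximum number of previous tokens the model sees
const N_VOCAB: usize = 2; // vocabulary: 'a' and 'b'
const N_EMBED: usize = 8; // length of each embedding vector
```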
- Language: Rust (Stable)
- Matrix Library: `nalgebra` is used for matrix/vector operations (multiplication, addition, transpose, softmax, etc.).
- Weights: All model weights (`wte`, `wpe`, the attention weights `c_attn`, and the projection weights `c_proj`) are hardcoded directly in the `load_model_weights` function, mirroring the values from the original blog post.
- Embeddings: Use a one-hot scheme within the 8 embedding dimensions:
  - Dims 0-4: Position (0 to 4)
  - Dims 5-6: Token ('a' or 'b')
  - Dim 7: Scratch space used by the attention mechanism
- Attention (`c_attn`): Designed to do three things (see the attention sketch after this list):
  - Create a query (`q`) that looks for the two most recent positions.
  - Extract the position one-hot encoding as the key (`k`).
  - Encode the token type ('a' -> 1, 'b' -> -1) in the scratch space (dim 7) as the value (`v`).
- Projection (`c_proj`): Transforms the attention output (the 0 or 1 in dim 7, representing a predicted 'a' or 'b') back into a scaled one-hot embedding (using dims 5 and 6), adding a bias so the default prediction is 'a'.
- Residual Connection: The output of the attention mechanism (after projection) is added back to the original input embedding (`x = x + attention_output`). The large scaling factor (`LG = 1024.0`) ensures the attention output dominates the final prediction.
- Final Output: The final embedding is projected back to vocabulary space using the transposed token embedding matrix (`wte.T`).
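
A minimal sketch of the attention math on top of `nalgebra`, using the standard softmax(q·kᵀ/√d + mask)·v formulation that picoGPT and the blog post follow. The function names mirror the ones listed in the code-structure section below, but the bodies here are illustrative rather than the exact code in `src/main.rs`:

```rust
use nalgebra::DMatrix;

/// Row-wise softmax over a score matrix.
fn softmax(m: &DMatrix<f64>) -> DMatrix<f64> {
    let mut out = m.clone();
    for mut row in out.row_iter_mut() {
        // Subtract the row max for numerical stability before exponentiating.
        let max = row.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
        for x in row.iter_mut() {
            *x = (*x - max).exp();
        }
        let sum: f64 = row.iter().sum();
        for x in row.iter_mut() {
            *x /= sum;
        }
    }
    out
}

/// Causal mask: 0 on and below the diagonal, a large negative value above it,
/// so softmax assigns (almost) zero weight to future positions.
fn causal_mask(n: usize) -> DMatrix<f64> {
    DMatrix::from_fn(n, n, |i, j| if j > i { -1e10 } else { 0.0 })
}

/// Scaled dot-product attention: softmax(q k^T / sqrt(d) + mask) v.
fn attention(
    q: &DMatrix<f64>,
    k: &DMatrix<f64>,
    v: &DMatrix<f64>,
    mask: &DMatrix<f64>,
) -> DMatrix<f64> {
    let scale = (q.ncols() as f64).sqrt();
    let scores = q * k.transpose() / scale + mask;
    softmax(&scores) * v
}
```

In the hand-crafted model, `q`, `k`, and `v` come from the `c_attn` weights applied to the embeddings, and the mask is square in the current sequence length (at most `N_CTX`).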
- Install Rust: If you don't have it, install the Rust toolchain from rustup.rs.
- Clone the Repository (optional): Skip this step if you already have the code in a local directory.

  ```sh
  # git clone <repository-url>
  # cd handmade_transformer_rust
  ```

- Build: `cargo build --release`
- Run: `cargo run --release`
You should see output similar to this:
```text
--- Running Completions ---
a :: baabaabaab
aa :: baabaabaab
aab :: aabaabaaba
ba :: abaabaabaa
abaab :: aabaabaaba
ababa :: abaabaabaa
bbbbb :: aabaabaaba

--- Running Accuracy Test ---
ACCURACY: 100.0% (27 / 27)
```
All the logic is contained within `src/main.rs`:

- Constants definition (`N_CTX`, `N_VOCAB`, etc.)
- Tokenization functions (`tokenize`, `untok`)
- Math operations (`softmax`, `linear`, `attention`) built on `nalgebra`
- Transformer components (`causal_self_attention`, `transformer_block`)
- GPT model function (`gpt`)
- Weight loading (`load_model_weights`)
- Inference functions (`predict`, `complete`)
- `main` function with examples and accuracy test
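
As a rough illustration of how `predict` and `complete` fit together, here is a self-contained sketch of the greedy completion loop. The token mapping ('a' -> 0, 'b' -> 1) and the stand-in rule closure are assumptions made for this snippet; the real `predict` presumably runs the `gpt` forward pass and picks the most likely next token.

```rust
// A sketch of greedy completion on top of a `predict` step. `tokenize` and
// `untok` are re-declared here (assuming 'a' -> 0, 'b' -> 1) so the snippet
// stands alone; the real versions live in src/main.rs.

const N_CTX: usize = 5;

fn tokenize(s: &str) -> Vec<usize> {
    s.chars().map(|c| if c == 'b' { 1 } else { 0 }).collect()
}

fn untok(tok: usize) -> char {
    if tok == 1 { 'b' } else { 'a' }
}

/// Generate `n_new` tokens after `prompt`, feeding at most the last N_CTX
/// tokens back into `predict` each step, and return only the new characters.
fn complete(prompt: &str, n_new: usize, predict: impl Fn(&[usize]) -> usize) -> String {
    let mut tokens = tokenize(prompt);
    let prompt_len = tokens.len();
    for _ in 0..n_new {
        let start = tokens.len().saturating_sub(N_CTX);
        tokens.push(predict(&tokens[start..]));
    }
    tokens[prompt_len..].iter().map(|&t| untok(t)).collect()
}

fn main() {
    // Stand-in for the model-backed `predict`: applies the aab rule directly
    // (predict 'b' only when the context ends in "aa").
    let rule = |ctx: &[usize]| usize::from(ctx.ends_with(&[0, 0]));
    println!("aab :: {}", complete("aab", 10, rule)); // prints "aab :: aabaabaaba"
}
```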
This code is based on concepts and weights from Theia Vogel's blog post and is inspired by the picoGPT implementation. It is provided under the MIT License; see the LICENSE file (or assume MIT if none is present).
- Jay Mody for the clear picoGPT implementation which inspired the structure.