This project is a Rust implementation of the hand-crafted, decoder-only transformer described in the blog post "I made a transformer by hand (no training!)" by Theia Vogel.
The goal is to replicate the original Python/NumPy implementation using Rust and the `nalgebra` crate for matrix operations, demonstrating the core mechanics of a minimal transformer architecture without any training.
The transformer is designed for a very specific task: predicting the next character in the repeating sequence `aabaabaabaab...`. That is, it predicts the `(aab)*` pattern.
The prediction rule is:
- If the previous two tokens are `aa`, predict `b`.
- If the previous two tokens are `ab`, predict `a`.
- If the previous two tokens are `ba`, predict `a`.
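
For reference, the same rule can be written out directly as a tiny Rust function. This is purely illustrative; in this project the rule is encoded in hand-set weights, not in control flow:

```rust
/// The prediction rule above, written explicitly for the reader's reference.
/// The transformer reproduces it via attention over hand-crafted embeddings,
/// not via branching code like this.
fn rule_next(prev: char, curr: char) -> char {
    match (prev, curr) {
        ('a', 'a') => 'b', // "aa" -> 'b'
        _ => 'a',          // "ab" and "ba" -> 'a'
    }
}
```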
The model follows a simplified, GPT-2-like architecture:
- Decoder-Only: Predicts the next token based on previous tokens.
- Single Transformer Block: Contains one causal self-attention layer.
- Single Attention Head: No multi-head attention.
- No Layer Normalization: Removed for simplicity in hand-crafting weights.
- No MLP/Feed-Forward Layer: The block only contains the attention mechanism and residual connection.
- Hand-Crafted Embeddings: Uses separate, hand-set embeddings for tokens (`wte`) and positions (`wpe`).
- Causal Masking: Prevents attention to future tokens.
Key constants:

- `N_CTX = 5`: Maximum context length (the number of previous tokens considered).
- `N_VOCAB = 2`: Vocabulary size ('a', 'b').
- `N_EMBED = 8`: Embedding dimension size.
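
In Rust these can be declared as plain constants. A minimal sketch (the exact declarations in `src/main.rs` may differ):

```rust
// Hand-chosen model hyperparameters (values from the blog post).
const N_CTX: usize = 5;   // maximum number of previous tokens the model sees
const N_VOCAB: usize = 2; // vocabulary: 'a' and 'b'
const N_EMBED: usize = 8; // length of each embedding vector
```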
- Language: Rust (Stable)
- Matrix Library: `nalgebra` is used for matrix/vector operations (multiplication, addition, transpose, softmax, etc.).
- Weights: All model weights (`wte`, `wpe`, the attention weights `c_attn`, and the projection weights `c_proj`) are hardcoded directly in the `load_model_weights` function, mirroring the values from the original blog post.
- Embeddings: Use a one-hot scheme within the 8 embedding dimensions:
  - Dims 0-4: Position (0 to 4)
  - Dims 5-6: Token ('a' or 'b')
  - Dim 7: Scratch space used by the attention mechanism
- Attention (`c_attn`): Designed to do three things (see the attention sketch after this list):
  - Create a query (`q`) that looks for the two most recent positions.
  - Extract the position one-hot encoding as the key (`k`).
  - Encode the token type ('a' -> 1, 'b' -> -1) in the scratch space (dim 7) as the value (`v`).
- Projection (`c_proj`): Transforms the attention output (the 0 or 1 in dim 7, representing a predicted 'a' or 'b') back into a scaled one-hot embedding (using dims 5 and 6), adding a bias so the default prediction is 'a'.
- Residual Connection: The output of the attention mechanism (after projection) is added back to the original input embedding (`x = x + attention_output`). The large scaling factor (`LG = 1024.0`) ensures the attention output dominates the final prediction.
- Final Output: The final embedding is projected back to vocabulary space using the transposed token embedding matrix (`wte.T`).
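
A minimal sketch of the attention math on top of `nalgebra`, using the standard softmax(q·kᵀ/√d + mask)·v formulation that picoGPT and the blog post follow. The function names mirror the ones listed in the code-structure section below, but the bodies here are illustrative rather than the exact code in `src/main.rs`:

```rust
use nalgebra::DMatrix;

/// Row-wise softmax over a score matrix.
fn softmax(m: &DMatrix<f64>) -> DMatrix<f64> {
    let mut out = m.clone();
    for mut row in out.row_iter_mut() {
        // Subtract the row max for numerical stability before exponentiating.
        let max = row.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
        for x in row.iter_mut() {
            *x = (*x - max).exp();
        }
        let sum: f64 = row.iter().sum();
        for x in row.iter_mut() {
            *x /= sum;
        }
    }
    out
}

/// Causal mask: 0 on and below the diagonal, a large negative value above it,
/// so softmax assigns (almost) zero weight to future positions.
fn causal_mask(n: usize) -> DMatrix<f64> {
    DMatrix::from_fn(n, n, |i, j| if j > i { -1e10 } else { 0.0 })
}

/// Scaled dot-product attention: softmax(q k^T / sqrt(d) + mask) v.
fn attention(
    q: &DMatrix<f64>,
    k: &DMatrix<f64>,
    v: &DMatrix<f64>,
    mask: &DMatrix<f64>,
) -> DMatrix<f64> {
    let scale = (q.ncols() as f64).sqrt();
    let scores = q * k.transpose() / scale + mask;
    softmax(&scores) * v
}
```

In the hand-crafted model, `q`, `k`, and `v` come from the `c_attn` weights applied to the embeddings, and the mask is square in the current sequence length (at most `N_CTX`).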
- Install Rust: If you don't have it, install the Rust toolchain from rustup.rs.
- Clone the Repository (optional): Skip this step if you already have the code in a local directory.

  ```sh
  # git clone <repository-url>
  # cd handmade_transformer_rust
  ```

- Build: `cargo build --release`
- Run: `cargo run --release`
You should see output similar to this:
```text
--- Running Completions ---
a :: baabaabaab
aa :: baabaabaab
aab :: aabaabaaba
ba :: abaabaabaa
abaab :: aabaabaaba
ababa :: abaabaabaa
bbbbb :: aabaabaaba

--- Running Accuracy Test ---
ACCURACY: 100.0% (27 / 27)
```
All the logic is contained within `src/main.rs`:

- Constants definition (`N_CTX`, `N_VOCAB`, etc.)
- Tokenization functions (`tokenize`, `untok`)
- Math operations (`softmax`, `linear`, `attention`) built on `nalgebra`
- Transformer components (`causal_self_attention`, `transformer_block`)
- GPT model function (`gpt`)
- Weight loading (`load_model_weights`)
- Inference functions (`predict`, `complete`)
- `main` function with examples and accuracy test
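
As a rough illustration of how `predict` and `complete` fit together, here is a self-contained sketch of the greedy completion loop. The token mapping ('a' -> 0, 'b' -> 1) and the stand-in rule closure are assumptions made for this snippet; the real `predict` presumably runs the `gpt` forward pass and picks the most likely next token.

```rust
// A sketch of greedy completion on top of a `predict` step. `tokenize` and
// `untok` are re-declared here (assuming 'a' -> 0, 'b' -> 1) so the snippet
// stands alone; the real versions live in src/main.rs.

const N_CTX: usize = 5;

fn tokenize(s: &str) -> Vec<usize> {
    s.chars().map(|c| if c == 'b' { 1 } else { 0 }).collect()
}

fn untok(tok: usize) -> char {
    if tok == 1 { 'b' } else { 'a' }
}

/// Generate `n_new` tokens after `prompt`, feeding at most the last N_CTX
/// tokens back into `predict` each step, and return only the new characters.
fn complete(prompt: &str, n_new: usize, predict: impl Fn(&[usize]) -> usize) -> String {
    let mut tokens = tokenize(prompt);
    let prompt_len = tokens.len();
    for _ in 0..n_new {
        let start = tokens.len().saturating_sub(N_CTX);
        tokens.push(predict(&tokens[start..]));
    }
    tokens[prompt_len..].iter().map(|&t| untok(t)).collect()
}

fn main() {
    // Stand-in for the model-backed `predict`: applies the aab rule directly
    // (predict 'b' only when the context ends in "aa").
    let rule = |ctx: &[usize]| usize::from(ctx.ends_with(&[0, 0]));
    println!("aab :: {}", complete("aab", 10, rule)); // prints "aab :: aabaabaaba"
}
```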
This code is based on concepts and weights from Theia Vogel's blog post and is inspired by the picoGPT implementation. It is provided under the MIT License; see the LICENSE file (or assume MIT if none is present).
- Jay Mody for the clear picoGPT implementation which inspired the structure.