Hand-Crafted Transformer in Rust

This project is a Rust implementation of the hand-crafted, decoder-only transformer described in the blog post "I made a transformer by hand (no training!)" by Theia Vogel.

The goal is to replicate the Python/NumPy implementation using Rust and the nalgebra crate for matrix operations, demonstrating the core mechanics of a minimal transformer architecture without any training.

Task

The transformer is designed for a single, very specific task: predicting the next character in the repeating sequence aabaabaabaab..., that is, continuing the (aab)* pattern.

The prediction rule is:

  • If the previous two tokens are aa, predict b.
  • If the previous two tokens are ab, predict a.
  • If the previous two tokens are ba, predict a.
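
As a plain-Rust sketch (the function name is illustrative, not taken from src/main.rs), the rule amounts to:

    // The target behaviour, written directly; the model has to recover this
    // purely from hand-crafted attention over embeddings.
    fn rule(prev: char, last: char) -> char {
        match (prev, last) {
            ('a', 'a') => 'b',
            _ => 'a', // "ab" and "ba" both predict 'a'; "bb" never occurs in (aab)*
        }
    }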

Architecture

The model follows a simplified GPT-2 like architecture:

  • Decoder-Only: Predicts the next token based on previous tokens.
  • Single Transformer Block: Contains one causal self-attention layer.
  • Single Attention Head: No multi-head attention.
  • No Layer Normalization: Removed for simplicity in hand-crafting weights.
  • No MLP/Feed-Forward Layer: The block only contains the attention mechanism and residual connection.
  • Hand-Crafted Embeddings: Uses separate embeddings for tokens (wte) and positions (wpe), set by hand rather than learned.
  • Causal Masking: Prevents attention to future tokens.
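
The attention step itself is standard scaled dot-product attention with a causal mask. Below is a minimal, self-contained sketch using nalgebra; names and exact signatures are assumptions, not necessarily those used in src/main.rs.

    use nalgebra::DMatrix;

    // Row-wise softmax: each row of the score matrix becomes a probability
    // distribution over the positions it may attend to.
    fn softmax_rows(m: &DMatrix<f64>) -> DMatrix<f64> {
        let mut out = m.clone();
        for mut row in out.row_iter_mut() {
            let max = row.max(); // subtract the row max for numerical stability
            for j in 0..row.len() {
                row[j] = (row[j] - max).exp();
            }
            let sum: f64 = row.sum();
            for j in 0..row.len() {
                row[j] /= sum;
            }
        }
        out
    }

    // Causal self-attention: softmax(q k^T / sqrt(d) + mask) v, where the mask
    // sets every score above the diagonal to -inf so position i cannot see j > i.
    fn causal_attention(q: &DMatrix<f64>, k: &DMatrix<f64>, v: &DMatrix<f64>) -> DMatrix<f64> {
        let d = q.ncols() as f64;
        let mut scores = (q * k.transpose()) / d.sqrt();
        for i in 0..scores.nrows() {
            for j in (i + 1)..scores.ncols() {
                scores[(i, j)] = f64::NEG_INFINITY;
            }
        }
        softmax_rows(&scores) * v
    }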

Key Parameters

  • N_CTX = 5: Maximum context length (number of previous tokens considered).
  • N_VOCAB = 2: Vocabulary size ('a', 'b').
  • N_EMBED = 8: Embedding dimension size.
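
In Rust these are plain constants, roughly as follows (a sketch; src/main.rs may declare them differently):

    const N_CTX: usize = 5;   // maximum context length
    const N_VOCAB: usize = 2; // two tokens: 'a' and 'b'
    const N_EMBED: usize = 8; // embedding dimension (5 position + 2 token + 1 scratch)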

Implementation Details

  • Language: Rust (Stable)
  • Matrix Library: nalgebra is used for matrix/vector operations (multiplication, addition, transpose, softmax, etc.).
  • Weights: All model weights (wte, wpe, attention weights c_attn, projection weights c_proj) are hardcoded directly in the load_model_weights function, mirroring the values from the original blog post.
    • Embeddings: Use a one-hot scheme within the 8 embedding dimensions.
      • Dims 0-4: Position (0 to 4)
      • Dims 5-6: Token ('a' or 'b')
      • Dim 7: Scratch space used by the attention mechanism.
    • Attention (c_attn): Designed to:
      • Create a query (q) that looks for the two most recent positions.
      • Extract the position one-hot encoding as the key (k).
      • Encode the token type ('a'->1, 'b'->-1) in the scratch space (dim 7) as the value (v).
    • Projection (c_proj): Transforms the attention output (the 0 or 1 in dim 7, representing predicted 'a' or 'b') back into a scaled one-hot embedding (using dims 5 and 6), adding a bias to default to 'a'.
    • Residual Connection: The output of the attention mechanism (after projection) is added back to the original input embedding (x = x + attention_output). The large scaling factor (LG = 1024.0) ensures the attention output dominates the final prediction.
    • Final Output: The final embedding is projected back to vocabulary space using the transposed token embedding matrix (wte.T).
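
Under this layout, the combined input embedding wte[token] + wpe[position] is simply a pair of one-hots with an empty scratch slot. The helper below illustrates the idea; it is an assumption for exposition, not code from load_model_weights.

    use nalgebra::DVector;

    // Build the 8-dimensional input embedding for a token at a given position.
    fn embed(token: usize, position: usize) -> DVector<f64> {
        let mut x = DVector::zeros(8);
        x[position] = 1.0;  // dims 0-4: position one-hot (position < N_CTX = 5)
        x[5 + token] = 1.0; // dims 5-6: token one-hot ('a' = 0, 'b' = 1)
        // dim 7 stays 0.0: scratch space written by the attention value/projection
        x
    }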

How to Run

  1. Install Rust: If you don't have it, install the Rust toolchain from rustup.rs.
  2. Clone the Repository (optional, if you don't already have the code locally):
    # git clone 
    # cd handmade_transformer_rust
  3. Build:
    cargo build --release
  4. Run:
    cargo run --release

You should see output similar to this:

--- Running Completions ---
a :: baabaabaab
aa :: baabaabaab
aab :: aabaabaaba
ba :: abaabaabaa
abaab :: aabaabaaba
ababa :: abaabaabaa
bbbbb :: aabaabaaba
--- Running Accuracy Test ---
ACCURACY: 100.0% (27 / 27)

Code Structure

All the logic is contained within src/main.rs:

  • Constants definition (N_CTX, N_VOCAB, etc.)
  • Tokenization functions (tokenize, untok)
  • Math operations (softmax, linear, attention) built on nalgebra.
  • Transformer components (causal_self_attention, transformer_block)
  • GPT model function (gpt)
  • Weight loading (load_model_weights)
  • Inference functions (predict, complete)
  • main function with examples and accuracy test.
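
For orientation, the inference functions amount to greedy decoding: run the model, take the argmax of the logits at the last position, append that token, and repeat. A hedged sketch follows; the gpt closure, names, and signature are assumptions, not the exact code in src/main.rs.

    // Greedy decoding in the spirit of `complete`. `gpt` is assumed to map a
    // slice of token ids to one [logit_a, logit_b] row per input position.
    fn complete_sketch(
        gpt: impl Fn(&[usize]) -> Vec<[f64; 2]>,
        prompt: &[usize],
        n_new: usize,
    ) -> Vec<usize> {
        let mut out = prompt.to_vec();
        for _ in 0..n_new {
            // feed at most the last N_CTX = 5 tokens to the model
            let start = out.len().saturating_sub(5);
            let logits = gpt(&out[start..]);
            let last = logits.last().expect("one logit row per input token");
            // greedy argmax over the two-token vocabulary: 'a' = 0, 'b' = 1
            out.push(if last[0] >= last[1] { 0 } else { 1 });
        }
        out
    }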

License

This code is based on concepts and weights from Theia Vogel's blog post and inspired by the picoGPT implementation. It is provided under the MIT License. See the LICENSE file (or assume MIT if not present).

Acknowledgements

  • Jay Mody for the clear picoGPT implementation which inspired the structure.
