AI-Hypercomputer
diff --git a/‎MaxText/utils/ckpt_conversion/README.md
Lines changed: 177 additions & 0 deletions b/‎MaxText/utils/ckpt_conversion/README.md
diff --git a/‎MaxText/utils/ckpt_conversion/to_huggingface.py
Lines changed: 6 additions & 5 deletions b/‎MaxText/utils/ckpt_conversion/to_huggingface.py
diff --git a/‎MaxText/utils/ckpt_conversion/utils/shape_mapping.py renamed to ‎MaxText/utils/ckpt_conversion/utils/hf_shape.py
Lines changed: 15 additions & 15 deletions b/‎MaxText/utils/ckpt_conversion/utils/shape_mapping.py renamed to ‎MaxText/utils/ckpt_conversion/utils/hf_shape.py
diff --git a/‎MaxText/utils/ckpt_conversion/utils/param_mapping.py
Lines changed: 7 additions & 3 deletions b/‎MaxText/utils/ckpt_conversion/utils/param_mapping.py
diff --git a/‎end_to_end/tpu/gemma2/2b/test_gemma2_to_hf.sh
Lines changed: 51 additions & 0 deletions b/‎end_to_end/tpu/gemma2/2b/test_gemma2_to_hf.sh
@@ -0,0 +1,177 @@
+# Checkpoint conversion utilities
+
+This guide provides instructions for using the scripts that convert model checkpoints bidirectionally between Hugging Face and MaxText formats.
+
+## Supported models
+
+The following models are supported:
+
+- Gemma2 (2B, 9B, 27B).
+- Gemma3 multimodal (4B, 12B, 27B).
+- Qwen3 (0.6B, 4B, 8B, 14B, 32B).
+
+## Prerequisites
+- Hugging Face requires PyTorch.
+- Hugging Face model checkpoints require local disk space.
+  - The model files are always downloaded to a disk cache before being loaded into memory (for more information, see the Hugging Face [docs](https://huggingface.co/docs/accelerate/en/concept_guides/big_model_inference)). The default local storage path for Hugging Face models is `$HOME/.cache/huggingface/hub`.
+
+## Hugging Face to MaxText
+
+Use the `to_maxtext.py` script to convert a Hugging Face model into a MaxText checkpoint. The script automatically downloads the specified model from the Hugging Face Hub, performs the conversion, and saves the converted checkpoint to the given output directory.
+
+\*\**For complete examples, see the test scripts at [`end_to_end/tpu/qwen3/4b/test_qwen3.sh`](../../../end_to_end/tpu/qwen3/4b/test_qwen3.sh) and [`end_to_end/tpu/gemma3/4b/test_gemma3_unified.sh`](../../../end_to_end/tpu/gemma3/4b/test_gemma3_unified.sh).*
+
+### Usage
+
+The following command demonstrates how to run the conversion. You must provide your Hugging Face token, either in the `MaxText/configs/base.yml` file (`hf_access_token`) or as the `hf_access_token` command-line argument.
+
+```bash
+python -m MaxText.utils.ckpt_conversion.to_maxtext MaxText/configs/base.yml \
+    model_name=<model-name> \
+    base_output_directory=<gcs-path-to-save-checkpoint> \
+    hf_access_token=<your-hf-token> \
+    use_multimodal=false \
+    scan_layers=false
+```
+
+**Key arguments:**
+
+  * `model_name`: The model identifier, which should be defined in `HF_IDS` in `MaxText/utils/ckpt_conversion/utils/utils.py`.
+  * `scan_layers`: Indicates if the output checkpoint is [scanned](https://github.com/AI-Hypercomputer/maxtext/blob/main/getting_started/checkpoints.md) (scan_layers=true) or unscanned (scan_layers=false).
+  * `use_multimodal`: Indicates if multimodality is used, important for Gemma3.
+  * `hf_access_token`: Your Hugging Face token.
+  * `base_output_directory`: The path where the converted Orbax checkpoint will be stored; it can be a Google Cloud Storage (GCS) path or a local path. If not set, the default output directory is `Maxtext/tmp`.
+
+\*\**Only the official versions of Hugging Face models are supported. The supported versions are listed in `HF_IDS` in `MaxText/utils/ckpt_conversion/utils/utils.py`.*
+
+## MaxText to Hugging Face
+
+Use the `to_huggingface.py` script to convert a MaxText checkpoint into the Hugging Face format. This is useful for sharing your models or integrating them with the Hugging Face ecosystem.
+
+\*\**For a complete example, see the test script at [`end_to_end/tpu/qwen3/4b/test_qwen3_to_hf.sh`](../../../end_to_end/tpu/qwen3/4b/test_qwen3_to_hf.sh).*
+
+### Usage
+
+The following command converts a MaxText checkpoint and saves it locally, to GCS, or uploads it directly to the Hugging Face Hub.
+
+```bash
+python -m MaxText.utils.ckpt_conversion.to_huggingface MaxText/configs/base.yml \
+    model_name=<MODEL_NAME> \
+    load_parameters_path=<path-to-maxtext-checkpoint> \
+    base_output_directory=<path-to-save-converted-checkpoint> \
+    scan_layers=false \
+    use_multimodal=false \
+    hf_access_token=<your-hf-token>
+```
+
+**Key arguments:**
+
+  * `load_parameters_path`: The path to the source MaxText Orbax checkpoint (e.g., `gs://your-bucket/maxtext-checkpoint/0/items`).
+  * `model_name`: The corresponding model name in the MaxText configuration (e.g., `qwen3-4b`).
+  * `scan_layers`: Indicates if the source MaxText checkpoint is [scanned](https://github.com/AI-Hypercomputer/maxtext/blob/main/getting_started/checkpoints.md) (scan_layers=true) or unscanned (scan_layers=false).
+  * `hf_access_token`: Your Hugging Face token.
+  * `use_multimodal`: Indicates if multimodality is used, important for Gemma3.
+  * `base_output_directory`: The path where the converted Hugging Face checkpoint will be stored; it can be Google Cloud Storage (GCS), the Hugging Face Hub, or a local path. If not set, the default output directory is `Maxtext/tmp`.
+
+
+## Verifying conversion correctness
+
+To verify that a conversion succeeded, use the `MaxText/tests/forward_pass_logit_checker.py` script. It runs a forward pass on both the original and the converted model and compares the output logits. The same check applies to both conversion directions.
+
+### Usage
+
+```bash
+python3 -m MaxText.tests.forward_pass_logit_checker MaxText/configs/base.yml \
+    tokenizer_path=assets/<tokenizer> \
+    load_parameters_path=<path-to-maxtext-checkpoint> \
+    model_name=<MODEL_NAME> \
+    scan_layers=false \
+    use_multimodal=false \
+    --run_hf_model=True \
+    --hf_model_path=<path-to-HF-checkpoint> \
+    --max_kl_div=0.015
+```
+
+**Key arguments:**
+
+  * `load_parameters_path`: The path to the source MaxText Orbax checkpoint (e.g., `gs://your-bucket/maxtext-checkpoint/0/items`).
+  * `model_name`: The corresponding model name in the MaxText configuration (e.g., `qwen3-4b`).
+  * `scan_layers`: Indicates if the MaxText checkpoint is scanned (scan_layers=true) or unscanned (scan_layers=false).
+  * `use_multimodal`: Indicates if multimodality is used.
+  * `--run_hf_model`: Indicates whether to load the Hugging Face model from `--hf_model_path`. If not set, the MaxText logits are compared with pre-saved golden logits.
+  * `--hf_model_path`: The path to the Hugging Face checkpoint.
+  * `--max_kl_div`: Max KL divergence tolerance during comparisons.
+
+**Example successful conversion verification:**
+
+Here is part of the `forward_pass_logit_checker` output for `gemma2-2b`.
+
+```
+--- Prompt: What is the ---
+
+--- MaxText model top 10 tokens ---
+| Token ID   | Token                | Score      |
+|------------|----------------------|------------|
+| 5830       | difference           | 27.2500    |
+| 1963       | best                 | 26.6250    |
+| 5316       | average              | 26.3750    |
+| 2669       | change               | 26.1250    |
+| 12070      | percentage           | 26.1250    |
+| 1618       | value                | 25.8750    |
+| 1546       | most                 | 25.7500    |
+| 66202      | molar                | 25.5000    |
+| 3051       | total                | 25.5000    |
+| 1503       | name                 | 25.3750    |
+
+
+--- HF model top 10 tokens ---
+| Token ID   | Token                | Score      |
+|------------|----------------------|------------|
+| 5830       | difference           | 27.2500    |
+| 1963       | best                 | 26.6250    |
+| 5316       | average              | 26.3750    |
+| 12070      | percentage           | 26.1250    |
+| 2669       | change               | 26.1250    |
+| 1618       | value                | 25.8750    |
+| 1546       | most                 | 25.7500    |
+| 66202      | molar                | 25.5000    |
+| 3051       | total                | 25.5000    |
+| 6187       | purpose              | 25.3750    |
+
+
+--- Similarity Metrics of Top Tokens ---
+| Metric                         | Value                |
+|--------------------------------|----------------------|
+| overlap_count                  | 9/10                 |
+| jaccard_similarity             | 0.8181818181818182   |
+| rank_agreement_percentage      | 70.0                 |
+
+
+Average KL divergence per token (D_KL(P_golden || Q_model)): 0.000409
+
+Max KL divergence for a single token in the set: 0.003497
+```
+-----
+
+## Adding support for new models
+To extend conversion support to a new model architecture, you must define its specific parameter and configuration mappings. The conversion logic is decoupled, so you only need to modify the mapping files.
+
+1.  **Add parameter mappings**:
+    - In [`utils/param_mapping.py`](./utils/param_mapping.py), add the parameter name mappings (`def {MODEL}_MAXTEXT_TO_HF_PARAM_MAPPING`). This is the one-to-one mapping of parameter names per layer.
+    - In [`utils/param_mapping.py`](./utils/param_mapping.py), add the `hook_fn` logic (`def {MODEL}_MAXTEXT_TO_HF_PARAM_HOOK_FN`). This is the transformation needed per layer.
+2.  **Add Hugging Face weight shapes**: In [`utils/hf_shape.py`](./utils/hf_shape.py), define the tensor shapes in Hugging Face format (`def {MODEL}_HF_WEIGHTS_TO_SHAPE`). These are used to check that the tensor shapes match after the `to_huggingface` conversion.
+3.  **Register model key**: In [`utils/utils.py`](./utils/utils.py), add the new model key in `HF_IDS`.
+4.  **Add transformer config**: In [`utils/hf_model_configs.py`](./utils/hf_model_configs.py), add the `transformers` config object describing the Hugging Face model configuration. **Note**: This configuration must precisely match the MaxText model's architecture (defined in [`MaxText/configs/models`](../../configs/models)).
+
+Here is an example [PR to add support for the Gemma3 multimodal model](https://github.com/AI-Hypercomputer/maxtext/pull/1983).
+
+## Debugging tips
+
+If a converted checkpoint loads without errors but produces incorrect output, consider these common issues:
+
+  * **Symptom**: The model generates garbage or nonsensical tokens.
+
+      * **Potential Cause**: The query/key/value (Q/K/V) or output projection weights were likely reshaped incorrectly during conversion.
+
+  * **Symptom**: The model generates repetitive text sequences.
+
+      * **Potential Cause**: The layer normalization parameters may have been converted incorrectly.
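The first symptom above often comes down to how the head axis is split. The following is a minimal NumPy sketch, not the hook logic from `param_mapping.py`: it assumes a Hugging Face query weight stored as a 2-D `[num_heads * head_dim, hidden]` matrix and a MaxText kernel stored as `[hidden, num_heads, head_dim]`. The function names and the tiny sizes are hypothetical and chosen only for illustration.

```python
import numpy as np


def hf_to_maxtext_query(w_hf, num_heads, head_dim):
    """Assumed HF layout [num_heads * head_dim, hidden] -> [hidden, num_heads, head_dim]."""
    hidden = w_hf.shape[1]
    # Transpose first so the fused head axis is last, then split it.
    return w_hf.T.reshape(hidden, num_heads, head_dim)


def maxtext_to_hf_query(w_mt):
    """Inverse of the function above: merge the head axes, then transpose back."""
    hidden, num_heads, head_dim = w_mt.shape
    return w_mt.reshape(hidden, num_heads * head_dim).T


# Hypothetical toy sizes, far smaller than any real model.
hidden, num_heads, head_dim = 8, 2, 3
w_hf = np.arange(num_heads * head_dim * hidden, dtype=np.float32).reshape(
    num_heads * head_dim, hidden
)

w_mt = hf_to_maxtext_query(w_hf, num_heads, head_dim)
round_trip = maxtext_to_hf_query(w_mt)

# Skipping the transpose gives the right shape but scrambled values:
# the checkpoint loads cleanly and then produces nonsensical tokens.
wrong = w_hf.reshape(hidden, num_heads, head_dim)

print(w_mt.shape, np.array_equal(round_trip, w_hf), np.array_equal(wrong, w_mt))
```

A round-trip check like this catches a wrong inverse, while comparing against a deliberately wrong reshape shows why shape checks alone are not enough.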
@@ -67,7 +67,7 @@
     HOOK_FNS,
     PARAM_MAPPING,
 )
-from MaxText.utils.ckpt_conversion.utils.shape_mapping import SHAPE_MAPPING
+from MaxText.utils.ckpt_conversion.utils.hf_shape import HF_SHAPE
 from MaxText.utils.ckpt_conversion.utils.hf_model_configs import HF_MODEL_CONFIGS
 from MaxText.utils.ckpt_conversion.utils.utils import (process_leaf_param, save_model_files, HF_IDS)
 
@@ -90,12 +90,12 @@ def _get_model_mappings(model_name: str, scan_layers: bool, config_dict: dict):
   Raises:
     ValueError: If mappings for the specified `model_name` are not found.
   """
-  if model_name not in PARAM_MAPPING or model_name not in SHAPE_MAPPING or model_name not in HOOK_FNS:
+  if model_name not in PARAM_MAPPING or model_name not in HF_SHAPE or model_name not in HOOK_FNS:
     raise ValueError(f"Mappings not found for model: {model_name}. Available PARAM_MAPPING keys: {PARAM_MAPPING.keys()}")
 
   return {
       "param_mapping": PARAM_MAPPING[model_name](config_dict, scan_layers),
-      "shape_mapping": SHAPE_MAPPING[model_name](config_dict),
+      "shape_mapping": HF_SHAPE[model_name](config_dict),
       "hook_fn_mapping": HOOK_FNS[model_name](config_dict, scan_layers, saving_to_hf=True),
   }
 
@@ -140,11 +140,12 @@ def main(argv: Sequence[str]) -> None:
   # 2. Load Tokenizer
   if model_key not in HF_IDS:
     raise ValueError(f"HF Tokenizer ID not found for model key: {model_key}")
+  hf_token = config.hf_access_token
   hf_tokenizer_id = HF_IDS[model_key]
-  tokenizer = AutoTokenizer.from_pretrained(hf_tokenizer_id)
+  tokenizer = AutoTokenizer.from_pretrained(hf_tokenizer_id, token=hf_token)
 
   # For multi-modal case:
-  processor = AutoProcessor.from_pretrained(hf_tokenizer_id) if config.use_multimodal else None
+  processor = AutoProcessor.from_pretrained(hf_tokenizer_id, token=hf_token) if config.use_multimodal else None
 
   # 3. Get parameter mappings
   mappings = _get_model_mappings(model_key, config.scan_layers, hf_config_obj.to_dict())
 
@@ -15,7 +15,7 @@
  """
 
 
-def GEMMA3_HF_WEIGHTS_TO_SHAPE_MAPPING(config):
+def GEMMA3_HF_WEIGHTS_TO_SHAPE(config):
   """Generates a shape mapping for Hugging Face Gemma3 parameters.
 
   This function computes the expected shapes for all parameters in a Hugging
@@ -153,7 +153,7 @@ def GEMMA3_HF_WEIGHTS_TO_SHAPE_MAPPING(config):
   return shapes
 
 
-def GEMMA2_HF_WEIGHTS_TO_SHAPE_MAPPING(config):
+def GEMMA2_HF_WEIGHTS_TO_SHAPE(config):
   """Returns mapping between HuggingFace weights path and weights shape.
 
   Args:
@@ -208,7 +208,7 @@ def GEMMA2_HF_WEIGHTS_TO_SHAPE_MAPPING(config):
   return mapping
 
 
-def QWEN3_HF_WEIGHTS_TO_SHAPE_MAPPING(config):
+def QWEN3_HF_WEIGHTS_TO_SHAPE(config):
   """Returns mapping between HuggingFace Qwen3 weights path and the HuggingFace weights shape.
 
   To check this mapping, dump the huggingface model shapes:
@@ -308,16 +308,16 @@ def QWEN3_HF_WEIGHTS_TO_SHAPE_MAPPING(config):
   return mapping
 
 
-SHAPE_MAPPING = {
-    "gemma2-2b": GEMMA2_HF_WEIGHTS_TO_SHAPE_MAPPING,
-    "gemma2-9b": GEMMA2_HF_WEIGHTS_TO_SHAPE_MAPPING,
-    "gemma2-27b": GEMMA2_HF_WEIGHTS_TO_SHAPE_MAPPING,
-    "gemma3-4b": GEMMA3_HF_WEIGHTS_TO_SHAPE_MAPPING,
-    "gemma3-12b": GEMMA3_HF_WEIGHTS_TO_SHAPE_MAPPING,
-    "gemma3-27b": GEMMA3_HF_WEIGHTS_TO_SHAPE_MAPPING,
-    "qwen3-0.6b": QWEN3_HF_WEIGHTS_TO_SHAPE_MAPPING,
-    "qwen3-4b": QWEN3_HF_WEIGHTS_TO_SHAPE_MAPPING,
-    "qwen3-8b": QWEN3_HF_WEIGHTS_TO_SHAPE_MAPPING,
-    "qwen3-14b": QWEN3_HF_WEIGHTS_TO_SHAPE_MAPPING,
-    "qwen3-32b": QWEN3_HF_WEIGHTS_TO_SHAPE_MAPPING,
+HF_SHAPE = {
+    "gemma2-2b": GEMMA2_HF_WEIGHTS_TO_SHAPE,
+    "gemma2-9b": GEMMA2_HF_WEIGHTS_TO_SHAPE,
+    "gemma2-27b": GEMMA2_HF_WEIGHTS_TO_SHAPE,
+    "gemma3-4b": GEMMA3_HF_WEIGHTS_TO_SHAPE,
+    "gemma3-12b": GEMMA3_HF_WEIGHTS_TO_SHAPE,
+    "gemma3-27b": GEMMA3_HF_WEIGHTS_TO_SHAPE,
+    "qwen3-0.6b": QWEN3_HF_WEIGHTS_TO_SHAPE,
+    "qwen3-4b": QWEN3_HF_WEIGHTS_TO_SHAPE,
+    "qwen3-8b": QWEN3_HF_WEIGHTS_TO_SHAPE,
+    "qwen3-14b": QWEN3_HF_WEIGHTS_TO_SHAPE,
+    "qwen3-32b": QWEN3_HF_WEIGHTS_TO_SHAPE,
 }
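The `HF_SHAPE` registry maps each model key to a function that returns the expected Hugging Face tensor shapes. The sketch below shows one way such a registry could be used to flag mismatched tensors after conversion. `toy_shapes`, `TOY_HF_SHAPE`, `find_shape_mismatches`, the `toy-1b` key, and the parameter names are all hypothetical stand-ins, not code from this PR.

```python
def toy_shapes(config):
    """Hypothetical shape function: HF parameter name -> expected shape."""
    hidden = config["hidden_size"]
    vocab = config["vocab_size"]
    shapes = {"model.embed_tokens.weight": (vocab, hidden)}
    for i in range(config["num_hidden_layers"]):
        shapes[f"model.layers.{i}.input_layernorm.weight"] = (hidden,)
    return shapes


# Hypothetical registry with the same structure as HF_SHAPE: key -> shape function.
TOY_HF_SHAPE = {"toy-1b": toy_shapes}


def find_shape_mismatches(model_name, config, converted_shapes):
    """Return {name: (expected, actual)} for every missing or mismatched tensor."""
    if model_name not in TOY_HF_SHAPE:
        raise ValueError(f"Shape mapping not found for model: {model_name}")
    expected = TOY_HF_SHAPE[model_name](config)
    mismatches = {}
    for name, shape in expected.items():
        actual = converted_shapes.get(name)
        if actual != shape:
            mismatches[name] = (shape, actual)
    return mismatches


config = {"hidden_size": 4, "vocab_size": 10, "num_hidden_layers": 2}
converted = {
    "model.embed_tokens.weight": (10, 4),
    "model.layers.0.input_layernorm.weight": (4,),
    "model.layers.1.input_layernorm.weight": (8,),  # deliberately wrong
}
mismatches = find_shape_mismatches("toy-1b", config, converted)
print(mismatches)
```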
@@ -280,11 +280,15 @@ def pos_embed(x, target_shape):
   # Vision layers
   vc = config.get("vision_config", {})
   nvis = vc.get("num_hidden_layers", 0)
-  for i in list(range(nvis)):
-    base = f"params-vision_encoder-Gemma3VisionEncoderLayer_0-" f"Transformer-encoderblock_{i}-"
+  vision_layer_ids = [None] if scan_layers else list(range(nvis))
+  for i in vision_layer_ids:
+    base = (
+        f"params-vision_encoder-Gemma3VisionEncoderLayer_0-Transformer-encoderblock_{i}-"
+        if i is not None
+        else "params-vision_encoder-Gemma3VisionEncoderLayer_0-Transformer-encoderblock-"
+    )
     # Attention kernels & biases
     for qkv in ["query", "key", "value"]:
-      # key is [1152, 1152]-> [1152, 16, 72]
       hooks[base + f"MultiHeadDotProductAttention_0-{qkv}-kernel"] = reshape_kernel
       hooks[base + f"MultiHeadDotProductAttention_0-{qkv}-bias"] = vis_bias
     # [1152, 1152] -> [16, 72, 1152]
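The hunk above selects a single `encoderblock-` prefix for scanned checkpoints and one `encoderblock_{i}-` prefix per layer otherwise. A small standalone sketch of that selection follows; `vision_block_prefixes` is a hypothetical helper, and only the prefix strings are copied from the diff.

```python
def vision_block_prefixes(num_layers, scan_layers):
    """Return the parameter-key prefixes for the vision encoder blocks."""
    stem = "params-vision_encoder-Gemma3VisionEncoderLayer_0-Transformer-encoderblock"
    if scan_layers:
        # Scanned checkpoints stack all layers under one un-indexed block.
        return [f"{stem}-"]
    # Unscanned checkpoints keep one indexed block per layer.
    return [f"{stem}_{i}-" for i in range(num_layers)]


for prefix in vision_block_prefixes(2, scan_layers=False):
    print(prefix + "MultiHeadDotProductAttention_0-query-kernel")
```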
 
@@ -0,0 +1,51 @@
+#!/bin/bash
+
+# This file is both an integration test that runs once a day on a v4-8 and documentation for how to get started with Gemma2-2B.
+
+# The flow of this file is as follows:
+# 1. Convert a MaxText checkpoint to the Hugging Face format
+# 2. Run a forward pass check to compare the logits and KL divergence between the MaxText checkpoint and the converted Hugging Face checkpoint
+
+
+set -ex
+idx=$(date +%Y-%m-%d-%H-%M)
+MODEL_NAME='gemma2-2b'
+export MODEL_VARIATION='2b'
+HF_TOKEN='' # Important!!! Save your hf access token here
+TOKENIZER_PATH='assets/tokenizer.gemma'
+
+# Installing torch for deps in forward_pass_logit_checker.py
+python3 -m pip install torch --index-url https://download.pytorch.org/whl/cpu
+
+# After downloading checkpoints, copy them to GCS bucket at $CHKPT_BUCKET
+# Non-Googlers please remember to use separate GCS paths for uploading model weights from kaggle ($CHKPT_BUCKET) and MaxText compatible weights ($MODEL_BUCKET).
+# Non-Googlers please remember to point these variables to GCS buckets that you own, this script uses internal buckets for testing.
+export MODEL_BUCKET=gs://maxtext-gemma/unified/gemma2
+# Here is an example of gemma2-2b maxtext checkpoint, converted from google/gemma-2-2b
+export CKPT_PATH=gs://maxtext-gemma/unified/gemma2/2b/unscanned/2025-08-05-18-06/0/items
+
+# You can upload to huggingface hub or GCS using the HF_CKPT_PATH as base_output_directory
+# export HF_CKPT_PATH=${MODEL_BUCKET}/${MODEL_VARIATION}/hf/${idx}
+export LOCAL_PATH=./tmp/hf/${MODEL_NAME}/${idx}
+
+python3 -m MaxText.utils.ckpt_conversion.to_huggingface MaxText/configs/base.yml \
+    model_name=${MODEL_NAME} \
+    hf_access_token=${HF_TOKEN} \
+    load_parameters_path=${CKPT_PATH} \
+    base_output_directory=${LOCAL_PATH} \
+    scan_layers=false
+
+# Alternatively, if you uploaded the converted checkpoint to GCS, copy it to local storage first, because HF loads models from local disk
+# mkdir -p "${LOCAL_PATH}"
+# gcloud storage cp -r ${HF_CKPT_PATH} ${LOCAL_PATH}
+
+# We also test whether the forward pass logits match the original HF model.
+# To get higher precision (e.g., float32), run on CPU with `JAX_PLATFORMS=cpu`
+python3 -m MaxText.tests.forward_pass_logit_checker MaxText/configs/base.yml \
+    tokenizer_path=${TOKENIZER_PATH} \
+    load_parameters_path=${CKPT_PATH} \
+    model_name=${MODEL_NAME} \
+    scan_layers=false \
+    --hf_model_path=${LOCAL_PATH} \
+    --max_kl_div=0.015 \
+    --run_hf_model=true
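The `--max_kl_div=0.015` argument sets a tolerance on the KL divergence between the two models' next-token distributions. The following standard-library sketch illustrates the two kinds of metric reported by the checker. It is not the code in `forward_pass_logit_checker.py`; the helper names and the toy logits are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(golden_logits, model_logits):
    """D_KL(P_golden || Q_model) for a single token position."""
    p = softmax(golden_logits)
    q = softmax(model_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def top_k_ids(logits, k):
    """Indices of the k largest logits."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return set(order[:k])


def top_k_jaccard(golden_logits, model_logits, k):
    """Jaccard similarity between the two top-k token sets."""
    a = top_k_ids(golden_logits, k)
    b = top_k_ids(model_logits, k)
    return len(a & b) / len(a | b)


golden = [2.0, 1.0, 0.5, -1.0]  # toy logits, not real model output
model = [2.0, 1.0, 0.4, -1.0]
kl = kl_divergence(golden, model)
print(f"KL divergence: {kl:.6f}, within tolerance: {kl <= 0.015}")
```

For two top-10 lists that share 9 tokens, the union holds 11 tokens, so the Jaccard similarity is 9/11 ≈ 0.818, which matches the sample output in the README.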