This repository provides a template for finetuning Mistral-7B models with custom datasets. It's designed to be easy to use while providing flexibility for different data sources and formats.
- Supports multiple data sources (Hugging Face datasets, CSV, JSON, text files)
- Uses LoRA (Low-Rank Adaptation) for efficient finetuning
- 4-bit quantization for reduced memory usage
- Configurable instruction templates
- Support for combining multiple datasets
- Command-line argument support for easy configuration
transformers>=4.34.0
peft>=0.5.0
datasets>=2.14.0
bitsandbytes>=0.41.0
accelerate>=0.23.0
torch>=2.0.0
huggingface_hub>=0.17.0
- Clone this repository:
git clone https://github.com/yourusername/mistral-finetune-template.git
cd mistral-finetune-template
- Install the required packages:
pip install -r requirements.txt
The simplest way to use the template is to run it with command-line arguments:
python3 mistral_finetune_template.py \
--model_id="mistralai/Mistral-7B-Instruct-v0.2" \
--output_dir="./fine_tuned_mistral" \
--dataset_type="csv" \
--dataset_path="path/to/your/data.csv" \
--dataset_text_field="text" \
--instruction_template="Continue speaking in the style of: " \
--num_train_epochs=3
The template supports many configuration options:
- --model_id: The model ID to finetune (default: "mistralai/Mistral-7B-Instruct-v0.2")
- --output_dir: Directory to save the finetuned model (default: "./fine_tuned_mistral")
- --use_hf_token: Whether to use a Hugging Face token (default: False)
- --hf_token: Hugging Face token for accessing gated models
- --dataset_type: Type of dataset: 'huggingface', 'csv', 'json', 'text', or 'custom' (default: "custom")
- --dataset_path: Path to dataset or Hugging Face dataset ID
- --dataset_text_field: Field containing the text in the dataset (default: "text")
- --second_dataset_path: Optional path to a second dataset to combine with the first
- --second_dataset_text_field: Field containing the text in the second dataset
- --instruction_template: Template for the instruction part (default: "Continue speaking in the style of: ")
- --num_train_epochs: Number of training epochs (default: 3)
- --per_device_train_batch_size: Batch size per device during training (default: 4)
- --gradient_accumulation_steps: Number of gradient accumulation steps (default: 4)
- --learning_rate: Learning rate (default: 2e-4)
- --max_seq_length: Maximum sequence length for tokenization (default: 1024)
- --lora_r: LoRA attention dimension (default: 16)
- --lora_alpha: LoRA alpha parameter (default: 64)
- --lora_dropout: LoRA dropout probability (default: 0.05)
- --seed: Random seed for reproducibility (default: 42)
- --push_to_hub: Whether to push the model to the Hugging Face Hub (default: False)
- --hub_model_id: Model ID for pushing to the Hugging Face Hub
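For reference, the LoRA options above correspond to a peft LoraConfig roughly like the sketch below. This is a minimal sketch, not the template's exact configuration: the target_modules shown are a common choice for Mistral's attention projections and are an assumption, not values read from the script.

```python
from peft import LoraConfig

# Sketch of a LoraConfig built from the template's default LoRA options
lora_config = LoraConfig(
    r=16,               # --lora_r
    lora_alpha=64,      # --lora_alpha
    lora_dropout=0.05,  # --lora_dropout
    bias="none",
    task_type="CAUSAL_LM",
    # Assumed target modules: Mistral's attention projection layers
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```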
python mistral_finetune_template.py \
--model_id="mistralai/Mistral-7B-Instruct-v0.2" \
--output_dir="./fine_tuned_shakespeare" \
--dataset_type="csv" \
--dataset_path="shakespeare_data.csv" \
--dataset_text_field="text" \
--instruction_template="Continue writing in the style of Shakespeare: " \
--num_train_epochs=3 \
--per_device_train_batch_size=4
python mistral_finetune_template.py \
--model_id="mistralai/Mistral-7B-Instruct-v0.2" \
--output_dir="./fine_tuned_code" \
--dataset_type="huggingface" \
--dataset_path="codeparrot/github-code" \
--dataset_text_field="content" \
--instruction_template="Write Python code for: " \
--num_train_epochs=2
python mistral_finetune_template.py \
--model_id="mistralai/Mistral-7B-Instruct-v0.2" \
--output_dir="./fine_tuned_combined" \
--dataset_type="csv" \
--dataset_path="dataset1.csv" \
--dataset_text_field="text" \
--second_dataset_path="dataset2.csv" \
--second_dataset_text_field="content" \
--instruction_template="Continue in this style: "For custom dataset loading, modify the load_and_process_dataset function in the script. Look for the section with:
elif config.dataset_type == "custom":
    # Implement your custom dataset loading logic here
    # This is a placeholder - replace with your own logic
    logger.warning("Using custom dataset type - you need to implement your loading logic")
    dataset = Dataset.from_dict({"text": ["Example text"]})
Replace this with your own dataset loading logic.
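As one possible implementation, here is a minimal sketch that reads a JSONL file whose records contain hypothetical prompt and response fields and exposes the combined text under the "text" column the rest of the script expects. The field names and file path are illustrative assumptions.

```python
import json

from datasets import Dataset


def load_custom_dataset(path: str) -> Dataset:
    """Read a JSONL file and build a Dataset with a single 'text' column."""
    texts = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # 'prompt' and 'response' are hypothetical field names for your data
            texts.append(f"{record['prompt']}\n{record['response']}")
    return Dataset.from_dict({"text": texts})


dataset = load_custom_dataset("path/to/your/data.jsonl")
```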
While this template is specifically designed for Mistral-7B models, it can be adapted for other transformer-based models with minimal changes. Compatible models include:
- Mistral Family:
  - Mistral-7B (optimized for this model)
  - Mixtral-8x7B (may require additional memory)
- Llama Family:
  - Llama 2 (7B, 13B)
  - Code Llama
  - Llama 3 (8B)
- Other Models:
  - Falcon models
  - MPT models
  - Pythia models
  - BLOOMZ models
The main adaptations needed for other models would be:
- Changing the model ID
- Adjusting the instruction format template to match the model's expected format
- Modifying the target modules for LoRA depending on the model architecture (see the sketch below)
When switching to a different model, remember to check its documentation for the correct prompt/instruction format.
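For the third adaptation, a rough sketch of how target modules could be switched per model family is shown below. The module names are typical for each architecture but are assumptions; verify them against the actual loaded model (for example by inspecting model.named_modules()).

```python
# Typical (assumed) LoRA target modules per model family
TARGET_MODULES = {
    "mistral": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "llama":   ["q_proj", "k_proj", "v_proj", "o_proj"],
    "falcon":  ["query_key_value", "dense"],
    "mpt":     ["Wqkv", "out_proj"],
    "bloom":   ["query_key_value", "dense"],
}


def target_modules_for(model_id: str) -> list[str]:
    """Pick LoRA target modules based on the model ID; raise if the family is unknown."""
    for family, modules in TARGET_MODULES.items():
        if family in model_id.lower():
            return modules
    raise ValueError(f"Unknown architecture for {model_id}; inspect the model's modules.")
```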
This finetuning approach is particularly well-suited for Mistral-7B for several reasons:
- Instruction Format Alignment: The template uses Mistral's specific instruction format (<s>[INST] ... [/INST] ...), which is crucial for maintaining the model's instruction-following capabilities.
- LoRA Target Modules: The LoRA configuration targets specific attention modules in Mistral's architecture, chosen based on the model's design.
- Quantization Strategy: 4-bit quantization with the nf4 data type is particularly effective for Mistral models, offering an excellent balance between memory efficiency and performance (see the sketch after this list).
- Training Hyperparameters: The default learning rate, batch size, and other parameters are tuned for Mistral finetuning based on empirical results.
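For the quantization strategy, here is a minimal sketch of a 4-bit nf4 setup using the standard transformers/bitsandbytes API. The compute dtype and double-quantization flag are assumptions; the template's exact defaults may differ.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit nf4 quantization config (compute dtype and double quantization are assumed values)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map="auto",
)
```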
When Mistral was originally trained, it used a specific instruction format and training approach that this template maintains. This consistency helps preserve the model's core capabilities while adapting it to your specific domain.
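To illustrate the instruction format, the sketch below wraps a single training example in Mistral's [INST] template. The pairing of instruction and completion is illustrative of the general pattern, not a transcript of the script's internals, and in practice the tokenizer usually adds the leading <s> token itself.

```python
# Illustrative example of Mistral's instruction format
instruction = "Continue writing in the style of Shakespeare:"
completion = "To be, or not to be, that is the question."

# <s> and </s> are the BOS/EOS tokens; the tokenizer typically inserts <s> automatically
formatted = f"<s>[INST] {instruction} [/INST] {completion}</s>"
print(formatted)
```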
If you're new to model finetuning, here are some resources to help you understand the concepts and techniques used in this template:
- Parameter-Efficient Fine-Tuning (PEFT) Guide - Comprehensive guide on PEFT methods like LoRA from Hugging Face
- LoRA: Low-Rank Adaptation of Large Language Models - The original paper describing the LoRA technique
- Hugging Face Fine-tuning Tutorial - General guide on fine-tuning transformer models
- QLoRA: Efficient Finetuning of Quantized LLMs - Paper on quantized LoRA, which this template implements
- Mistral 7B Technical Documentation - Technical details about the Mistral model architecture
- Instruction Tuning for LLMs - Understanding how instruction finetuning works
- Fine-tuning LLMs for Beginners - Practical walkthrough of the finetuning process
- Hugging Face PEFT Examples - Example scripts for various PEFT methods
- LoRA Fine-tuning Explained - Step-by-step guide from Lightning AI
If you're a researcher looking to get started with finetuning, here's a simplified workflow:
1. Prepare your dataset: Format your text data as CSV, JSON, or text files. Make sure each example is well-formatted and relevant to your task.
2. Define your instruction template: Think about what type of behavior you want to teach the model. For example, "Respond to this medical question with accurate information:" for a medical assistant.
3. Start with a small test run: Begin with a small subset of your data and 1 epoch to ensure everything works.
4. Scale up gradually: Once your setup works, increase the dataset size and number of epochs.
5. Evaluate carefully: After finetuning, extensively test your model to ensure it behaves as expected and hasn't developed unwanted behaviors.
For academic research, remember to document your hyperparameters, dataset statistics, and evaluation methods to ensure reproducibility.
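One simple way to record a run, sketched below, is to dump the parsed command-line arguments next to the finetuned model. Here, args stands for the script's parsed argument namespace and is an assumption about its structure.

```python
import json
from pathlib import Path


def save_run_config(args, output_dir: str) -> None:
    """Write the run's hyperparameters to output_dir/run_config.json for reproducibility."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    with open(Path(output_dir) / "run_config.json", "w") as f:
        json.dump(vars(args), f, indent=2, sort_keys=True)
```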
- Minimum: 24GB VRAM (RTX 3090, RTX 4090, A5000, etc.)
- Recommended: 40GB+ VRAM (A100, A6000, H100)
- Cloud options:
- Google Colab Pro+ with A100
- Lambda Labs instances
- Vast.ai GPU rentals
- RunPod services
For limited hardware, consider:
- Reducing batch size and increasing gradient accumulation steps
- Using 8-bit quantization for the base model
- Reducing context length
- Finetuning smaller models (e.g., 7B instead of 13B+)
- At least 24GB VRAM is recommended for finetuning Mistral-7B with 4-bit quantization
- For smaller GPUs, you may need to reduce batch size and use gradient accumulation
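As an example of the batch-size/gradient-accumulation trade-off, the sketch below keeps the effective batch size at 16 (1 x 16 instead of the default 4 x 4) while lowering per-step memory, and shortens the sequence length. The values are illustrative, not tuned defaults.

```python
from transformers import TrainingArguments

# Lower per-device batch size and compensate with gradient accumulation:
# 1 x 16 = 16 effective batch size, the same as the default 4 x 4.
training_args = TrainingArguments(
    output_dir="./fine_tuned_mistral",
    per_device_train_batch_size=1,   # down from the default 4
    gradient_accumulation_steps=16,  # up from the default 4
    learning_rate=2e-4,
    num_train_epochs=3,
)

max_seq_length = 512  # reduced context length, down from the default 1024
```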
This project is licensed under the MIT License - see the LICENSE file for details.
- This template is based on best practices from the Hugging Face ecosystem
- Special thanks to the creators of Mistral-7B, PEFT, and BitsAndBytes libraries