# 🧠 QuantLLM: Efficient GGUF Model Quantization and Deployment

[![PyPI Downloads](https://static.pepy.tech/badge/quantllm)](https://pepy.tech/projects/quantllm)
<img alt="PyPI - Version" src="https://img.shields.io/pypi/v/quantllm?logo=pypi&label=version&">

## 📌 Overview
**QuantLLM** is a Python library designed for efficient model quantization using the GGUF (GGML Universal Format) method. It provides a robust framework for converting and deploying large language models with a minimal memory footprint and strong performance. Key capabilities include:

- **Memory-efficient GGUF quantization** with multiple precision options (2-bit to 8-bit)
- **Chunk-based processing** for handling large models
- **Comprehensive benchmarking** tools
- **Detailed progress tracking** with memory statistics
- **Easy model export** and deployment

## 🎯 Key Features

| Feature | Description |
|---------|-------------|
| ✅ Multiple GGUF Types | Support for various GGUF quantization types (Q2_K to Q8_0) with different precision-size tradeoffs |
| ✅ Memory Optimization | Chunk-based processing and CPU offloading for efficient handling of large models |
| ✅ Progress Tracking | Detailed layer-wise progress with memory statistics and ETA |
| ✅ Benchmarking Tools | Comprehensive benchmarking suite for performance evaluation |
| ✅ Hardware Optimization | Automatic device selection and memory management |
| ✅ Easy Deployment | Simple conversion to GGUF format for deployment |
| ✅ Flexible Configuration | Customizable quantization parameters and processing options |

## 🚀 Getting Started

### Installation

Basic installation:
```bash
pip install quantllm
```

With GGUF support (recommended):
```bash
pip install quantllm[gguf]
```

### Quick Example

```python
from quantllm import QuantLLM
from transformers import AutoTokenizer

# Load tokenizer and prepare data
model_name = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
calibration_text = ["Example text for calibration."] * 10
calibration_data = tokenizer(calibration_text, return_tensors="pt", padding=True)["input_ids"]

# Quantize model
quantized_model, benchmark_results = QuantLLM.quantize_from_pretrained(
    model_name_or_path=model_name,
    bits=4,                        # Quantization bits (2-8)
    group_size=32,                 # Group size for quantization
    quant_type="Q4_K_M",           # GGUF quantization type
    calibration_data=calibration_data,
    benchmark=True,                # Run benchmarks
    benchmark_input_shape=(1, 32)
)

# Save and convert to GGUF
QuantLLM.save_quantized_model(model=quantized_model, output_path="quantized_model")
QuantLLM.convert_to_gguf(model=quantized_model, output_path="model.gguf")
```
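
The exported `model.gguf` can then be loaded by any GGUF-compatible runtime. As a minimal illustration only (not part of QuantLLM's API), the sketch below uses the separate `llama-cpp-python` package; it assumes that package is installed and that the exported architecture is one llama.cpp can load:

```python
# Hypothetical follow-up, not QuantLLM itself: run the exported GGUF file with
# llama-cpp-python (installed separately, e.g. `pip install llama-cpp-python`).
from llama_cpp import Llama

llm = Llama(model_path="model.gguf")              # file written by convert_to_gguf above
result = llm("Hello, my name is", max_tokens=32)  # returns an OpenAI-style completion dict
print(result["choices"][0]["text"])
```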

For detailed usage examples and API documentation, please refer to our:
- 📚 [Official Documentation](https://quantllm.readthedocs.io/)
- 🎓 [Tutorials](https://quantllm.readthedocs.io/tutorials/)

### Minimum Requirements
- **CPU**: 4+ cores
- **RAM**: 16GB+
- **Storage**: 10GB+ free space
- **Python**: 3.10+

### Recommended for Large Models
- **CPU**: 8+ cores
- **RAM**: 32GB+
- **GPU**: NVIDIA GPU with 8GB+ VRAM
- **CUDA**: 11.7+
- **Storage**: 20GB+ free space

### GGUF Quantization Types

| Type   | Bits | Description         | Use Case                  |
|--------|------|---------------------|---------------------------|
| Q2_K   | 2    | Extreme compression | Size-critical deployment  |
| Q3_K_S | 3    | Small size          | Limited storage           |
| Q4_K_M | 4    | Balanced quality    | General use               |
| Q5_K_M | 5    | Higher quality      | Quality-sensitive tasks   |
| Q8_0   | 8    | Best quality        | Accuracy-critical tasks   |
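
To target a specific type, pass it as `quant_type` (with a matching `bits` value) to the same call shown in the Quick Example. The sketch below is illustrative only: it reuses the parameters documented above, assumes the same `calibration_data` tensor, and exact defaults may vary between releases:

```python
# Sketch: a size-critical Q2_K build, reusing the Quick Example's call.
from quantllm import QuantLLM

quantized_model, benchmark_results = QuantLLM.quantize_from_pretrained(
    model_name_or_path="facebook/opt-125m",
    bits=2,                             # 2-bit weights to match Q2_K
    group_size=32,
    quant_type="Q2_K",                  # extreme compression (see table above)
    calibration_data=calibration_data,  # same calibration tensor as in the Quick Example
)
QuantLLM.convert_to_gguf(model=quantized_model, output_path="model-q2_k.gguf")
```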

## 🔄 Version Compatibility

| QuantLLM | Python | PyTorch | Transformers | CUDA  |
|----------|--------|---------|--------------|-------|
| 1.2.0    | ≥3.10  | ≥2.0.0  | ≥4.30.0      | ≥11.7 |

## 🗺 Roadmap

- [ ] Support for more GGUF model architectures
- [ ] Enhanced benchmarking capabilities
- [ ] Multi-GPU processing support
- [ ] Advanced memory optimization techniques
- [ ] Integration with more deployment platforms
- [ ] Custom quantization kernels

## 🤝 Contributing

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file.

## 🙏 Acknowledgments

- [llama.cpp](https://github.com/ggerganov/llama.cpp) for the GGUF format
- [HuggingFace](https://huggingface.co/) for the Transformers library
- [CTransformers](https://github.com/marella/ctransformers) for GGUF support

## 📫 Contact & Support

- GitHub Issues: [Create an issue](https://github.com/codewithdark-git/QuantLLM/issues)
- Documentation: [Read the docs](https://quantllm.readthedocs.io/)