diff --git a/examples/nlp_and_llms/nvidia-lora/README.md b/examples/nlp_and_llms/nvidia-lora/README.md new file mode 100644 index 00000000..38b24c71 --- /dev/null +++ b/examples/nlp_and_llms/nvidia-lora/README.md @@ -0,0 +1,12 @@ +# LoRA Fine-Tuning (PEFT + Transformers) + +![LoRA Fine-Tuning Header](https://cdn-icons-png.flaticon.com/512/8101/8101225.png) + +This template illustrates how **LoRA fine-tuning** can significantly reduce resource requirements while maintaining strong model performance. +By running it on **Saturn Cloud**, you benefit from a GPU-optimized, scalable environment that simplifies the entire fine-tuning workflow — from experimentation to production deployment. + +Learn more: + +* 🔗 [Saturn Cloud Documentation](https://saturncloud.io/docs/) +* 🔗 [Saturn Cloud Templates Gallery](https://saturncloud.io/resources/templates/) +* 🔗 [PEFT Library (Hugging Face)](https://huggingface.co/docs/peft/index) diff --git a/examples/nlp_and_llms/nvidia-lora/nvidia_lora.ipynb b/examples/nlp_and_llms/nvidia-lora/nvidia_lora.ipynb new file mode 100644 index 00000000..4ac2ecd0 --- /dev/null +++ b/examples/nlp_and_llms/nvidia-lora/nvidia_lora.ipynb @@ -0,0 +1 @@ +{"cells":[{"cell_type":"markdown","id":"0c21f79d","metadata":{"id":"0c21f79d"},"source":["# LoRA Fine-Tuning\n","\n","![](https://miro.medium.com/v2/resize:fit:700/1*bwbhjqxxC6IPKGxnmpVlwg.png)\n","\n","This example template demonstrates **parameter-efficient fine-tuning (PEFT)** using **LoRA (Low-Rank Adaptation)** with the FLAN-T5 model on a free public dataset (SAMSum) for summarization.\n","\n","This provides a lightweight, GPU-friendly workflow that runs fully offline — no API keys required. The notebook guides you through each step: loading data, applying LoRA adapters, fine-tuning, evaluating, and saving your model for reuse.\n","\n","On [Saturn Cloud](https://saturncloud.io), you can scale from a single NVIDIA GPU to multi-GPU clusters, enabling distributed inference for larger models or higher throughput workloads — all within a managed, GPU-ready environment."]},{"cell_type":"markdown","id":"572d0e23-b689-4be9-999b-a5da2f670d90","metadata":{"id":"572d0e23-b689-4be9-999b-a5da2f670d90"},"source":["## Install dependencies"]},{"cell_type":"code","execution_count":10,"id":"982862db-82e2-4c70-9221-3ed04c03aad3","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"982862db-82e2-4c70-9221-3ed04c03aad3","executionInfo":{"status":"ok","timestamp":1761300519635,"user_tz":-60,"elapsed":444023,"user":{"displayName":"Durojaye Olusegun","userId":"09188621512197003284"}},"outputId":"3779bc45-2105-4f23-ab71-65ad97e06f29"},"outputs":[{"output_type":"stream","name":"stdout","text":["\u001b[1m\u001b[33mwarning\u001b[39m\u001b[0m\u001b[1m:\u001b[0m \u001b[1mThe `--system` flag has no effect, a system Python interpreter is always used in `uv venv`\u001b[0m\n","Using CPython 3.12.12 interpreter at: \u001b[36m/usr/bin/python3\u001b[39m\n","Creating virtual environment at: \u001b[36mlora-env\u001b[39m\n","\u001b[33m?\u001b[0m \u001b[1mA virtual environment already exists at `lora-env`. Do you want to replace it?\u001b[0m \u001b[38;5;8m[y/n]\u001b[0m \u001b[38;5;8m›\u001b[0m \u001b[36myes\u001b[0m\n","\n","\u001b[0J\u001b[32m✔\u001b[0m \u001b[1mA virtual environment already exists at `lora-env`. 
Do you want to replace it?\u001b[0m \u001b[38;5;8m·\u001b[0m \u001b[36myes\u001b[0m\n","\u001b[?25hActivate with: \u001b[32msource lora-env/bin/activate\u001b[39m\n","0.00s - Debugger warning: It seems that frozen modules are being used, which may\n","0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off\n","0.00s - to python to disable frozen modules.\n","0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this validation.\n","Installed kernelspec lora-env in /root/.local/share/jupyter/kernels/lora-env\n"]}],"source":["# Step 1: Install UV (fast, modern package manager)\n","!pip install -q uv\n","# Step 2: Create a clean environment with Python 3.12\n","!uv venv lora-env -p 3.12\n","\n","# Step 3: Activate and install all required libraries inside it\n","!source lora-env/bin/activate && uv pip install -q torch transformers datasets peft accelerate evaluate bitsandbytes jedi\n","\n","# Step 4: Add the environment as a selectable Jupyter kernel\n","!source lora-env/bin/activate && pip install -q ipykernel\n","!python -m ipykernel install --user --name=lora-env --display-name \"LoRA Fine-Tune Env\"\n","\n","# (Optional fallback for environments without bitsandbytes)\n","try:\n"," import bitsandbytes\n","except Exception:\n"," print(\"⚠️ bitsandbytes not available — skipping GPU quantisation support.\")\n","\n","\n","!pip install -q --upgrade \\\n"," sentencepiece \\\n"," protobuf \\\n"," tqdm"]},{"cell_type":"markdown","id":"c12336a1-ae67-4f40-8bcc-df3b5ce9c404","metadata":{"id":"c12336a1-ae67-4f40-8bcc-df3b5ce9c404"},"source":["Download and prepares the GovReport Summarization dataset from `Hugging Face (ccdv/govreport-summarization)`. The dataset contains long government reports paired with their human-written summaries, making it suitable for text summarization tasks."]},{"cell_type":"code","execution_count":11,"id":"3b6f4321-71f6-4358-bce8-7665b0c3e560","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"3b6f4321-71f6-4358-bce8-7665b0c3e560","executionInfo":{"status":"ok","timestamp":1761300520639,"user_tz":-60,"elapsed":978,"user":{"displayName":"Durojaye Olusegun","userId":"09188621512197003284"}},"outputId":"c9ff928c-0bff-4a54-ec94-3b4301bf0b45"},"outputs":[{"output_type":"stream","name":"stdout","text":["⏳ Downloading dataset: ccdv/govreport-summarization\n","✅ Dataset ready (govreport-summarization)\n"]}],"source":["from datasets import load_dataset, Dataset\n","import pandas as pd\n","\n","print(\"⏳ Downloading dataset: ccdv/govreport-summarization\")\n","ds = load_dataset(\"ccdv/govreport-summarization\")\n","train_ds = ds[\"train\"].select(range(1000))\n","eval_ds = ds[\"validation\"].select(range(200))\n","TEXT_COL, TARGET_COL = \"report\", \"summary\"\n","print(\"✅ Dataset ready (govreport-summarization)\")"]},{"cell_type":"markdown","id":"0dd28e48-64a4-4133-8310-e9aed982e595","metadata":{"id":"0dd28e48-64a4-4133-8310-e9aed982e595"},"source":["Loads the **FLAN-T5-Small model** and its tokenizer from Hugging Face. 
The tokenizer converts text into numerical tokens the model can understand, while the model itself (a sequence-to-sequence language model) performs tasks such as summarization or text generation."]},{"cell_type":"code","execution_count":12,"id":"1b080fd4-c153-4657-8642-bdb858a3f5e9","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"1b080fd4-c153-4657-8642-bdb858a3f5e9","executionInfo":{"status":"ok","timestamp":1761300521822,"user_tz":-60,"elapsed":1174,"user":{"displayName":"Durojaye Olusegun","userId":"09188621512197003284"}},"outputId":"bef9cc71-31d4-41fb-eaca-6d63312e5379"},"outputs":[{"output_type":"stream","name":"stdout","text":["⏳ Loading model: google/flan-t5-small\n","✅ Model and tokenizer loaded successfully!\n","Tokenizer vocab size: 32100\n"]}],"source":["from transformers import AutoTokenizer, AutoModelForSeq2SeqLM\n","\n","model_name = \"google/flan-t5-small\"\n","print(f\"⏳ Loading model: {model_name}\")\n","\n","tokenizer = AutoTokenizer.from_pretrained(model_name)\n","model = AutoModelForSeq2SeqLM.from_pretrained(model_name)\n","\n","print(\"✅ Model and tokenizer loaded successfully!\")\n","print(\"Tokenizer vocab size:\", len(tokenizer))\n"]},{"cell_type":"markdown","source":["Adding LoRA (Low-Rank Adaptation) adapter to the base model using PEFT (Parameter-Efficient Fine-Tuning). Instead of updating all model parameters, LoRA inserts lightweight adapter layers that learn task-specific updates—making fine-tuning faster and more memory-efficient."],"metadata":{"id":"KhKaRIjZom1R"},"id":"KhKaRIjZom1R"},{"cell_type":"code","execution_count":13,"id":"d5f3740d-76c3-4f76-92ee-c61dcbed3144","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"d5f3740d-76c3-4f76-92ee-c61dcbed3144","executionInfo":{"status":"ok","timestamp":1761300521858,"user_tz":-60,"elapsed":19,"user":{"displayName":"Durojaye Olusegun","userId":"09188621512197003284"}},"outputId":"5ff20b33-214e-4a15-9338-3a7aac5fdd31"},"outputs":[{"output_type":"stream","name":"stdout","text":["✅ LoRA adapter added successfully!\n","trainable params: 688,128 || all params: 77,649,280 || trainable%: 0.8862\n"]}],"source":["from peft import LoraConfig, get_peft_model\n","\n","# LoRA configuration\n","lora_config = LoraConfig(\n"," r=16, # rank\n"," lora_alpha=32, # scaling factor\n"," lora_dropout=0.05, # dropout for regularisation\n"," bias=\"none\",\n"," task_type=\"SEQ_2_SEQ_LM\" # T5-style sequence-to-sequence\n",")\n","\n","# Apply adapter to model\n","model = get_peft_model(model, lora_config)\n","\n","# Print summary\n","print(\"✅ LoRA adapter added successfully!\")\n","model.print_trainable_parameters()\n"]},{"cell_type":"markdown","source":["Prepare the text data for training by converting it into numerical tokens that the model can 
process."],"metadata":{"id":"dLitVvFvo5vn"},"id":"dLitVvFvo5vn"},{"cell_type":"code","execution_count":14,"id":"93fa696d-a672-4ad3-8343-73f8ebc71c7a","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":121,"referenced_widgets":["9db3a5ac0dd84249a2b236b96c58aad8","2c821f95cbf94e6f972651544b51bacf","69e89bf8eace41aa850498fd3fd61f99","3aaca7366ecb47d8b4ac27b6301aa91b","48ba285de8364e65a380add6e08e4d69","29dfb08a2a1d43b3878cb8a98b285b09","4edaefbb46844f8ba1583f63c20f9ccf","168534f6a2f3457b8dfa29da5aa15d6a","3e54568d0ae94350a1a461a6b1cc3423","bd8316fe2cc24289bf8d39ab6f065e43","d802453c7a484c89897a30b8ddde157b"]},"id":"93fa696d-a672-4ad3-8343-73f8ebc71c7a","executionInfo":{"status":"ok","timestamp":1761300535524,"user_tz":-60,"elapsed":13642,"user":{"displayName":"Durojaye Olusegun","userId":"09188621512197003284"}},"outputId":"e9c09e29-4170-4e06-e229-5b7b5f002740"},"outputs":[{"output_type":"display_data","data":{"text/plain":["Map: 0%| | 0/200 [00:00"],"text/html":["\n","
\n"," \n"," \n"," [500/500 01:36, Epoch 1/1]\n","
\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
StepTraining Loss
250.000000
500.000000
750.000000
1000.000000
1250.000000
1500.000000
1750.000000
2000.000000
2250.000000
2500.000000
2750.000000
3000.000000
3250.000000
3500.000000
3750.000000
4000.000000
4250.000000
4500.000000
4750.000000
5000.000000

"]},"metadata":{}},{"output_type":"stream","name":"stdout","text":["✅ Training complete!\n"]}],"source":["import torch\n","from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer\n","\n","# Prepare data collator\n","data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)\n","\n","# Define training arguments\n","args = Seq2SeqTrainingArguments(\n"," output_dir=\"outputs-lora\",\n"," per_device_train_batch_size=2,\n"," per_device_eval_batch_size=2,\n"," learning_rate=2e-4,\n"," num_train_epochs=1,\n"," save_strategy=\"epoch\",\n"," logging_steps=25,\n"," predict_with_generate=True,\n"," fp16=torch.cuda.is_available(), # Use mixed precision if GPU supports it\n"," report_to=[], # disables online tracking (no API needed)\n",")\n","\n","# Initialise trainer\n","trainer = Seq2SeqTrainer(\n"," model=model,\n"," args=args,\n"," train_dataset=train_tok,\n"," eval_dataset=eval_tok,\n"," tokenizer=tokenizer,\n"," data_collator=data_collator,\n",")\n","\n","print(\"🚀 Starting fine-tuning…\")\n","trainer.train()\n","print(\"✅ Training complete!\")"]},{"cell_type":"markdown","id":"cb3261ba-fd89-42f8-8cbc-b9391b859ee6","metadata":{"id":"cb3261ba-fd89-42f8-8cbc-b9391b859ee6"},"source":["Let's test the fine-tuned model to verify that it can generate meaningful summaries. It performs a full inference pass using the model and tokenizer."]},{"cell_type":"code","execution_count":16,"id":"f86f32e1-49c1-426e-b013-3156cb6d6e4f","metadata":{"jp-MarkdownHeadingCollapsed":true,"colab":{"base_uri":"https://localhost:8080/"},"id":"f86f32e1-49c1-426e-b013-3156cb6d6e4f","executionInfo":{"status":"ok","timestamp":1761300634308,"user_tz":-60,"elapsed":233,"user":{"displayName":"Durojaye Olusegun","userId":"09188621512197003284"}},"outputId":"057ca513-731d-438a-a6d3-c41225bfa966"},"outputs":[{"output_type":"stream","name":"stdout","text":["\n","🧠 Fine-tuned Model Output:\n","\n","Bob and Alice discuss the museum's history.\n"]}],"source":["test_input = \"Write a brief summary: Alice and Bob discussed weekend plans. 
Bob suggested hiking, but Alice preferred visiting the museum.\"\n","\n","# Tokenise and move to model device\n","inputs = tokenizer(test_input, return_tensors=\"pt\", truncation=True, padding=True).to(model.device)\n","\n","# Generate output\n","outputs = model.generate(**inputs, max_new_tokens=80)\n","\n","# Decode and display\n","print(\"\\n🧠 Fine-tuned Model Output:\\n\")\n","print(tokenizer.decode(outputs[0], skip_special_tokens=True))\n"]},{"cell_type":"markdown","id":"3ee4b6cb-1684-49ca-9cc2-74609bf610bd","metadata":{"id":"3ee4b6cb-1684-49ca-9cc2-74609bf610bd"},"source":["This lets you interactively test the fine-tuned model with your own custom input."]},{"cell_type":"code","execution_count":17,"id":"3bad36a0-89b4-484d-953c-7371d83cfff6","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"3bad36a0-89b4-484d-953c-7371d83cfff6","executionInfo":{"status":"ok","timestamp":1761300740710,"user_tz":-60,"elapsed":106374,"user":{"displayName":"Durojaye Olusegun","userId":"09188621512197003284"}},"outputId":"cee233ae-58d2-42ac-90a9-3e49430bc355"},"outputs":[{"output_type":"stream","name":"stdout","text":["💬 Try your own prompt!\n","\n","Enter a text or paragraph you'd like the model to summarise: what is it doing \n","\n","🧩 Model Output:\n","\n","It is doing it doing it doing it\n"]}],"source":["print(\"💬 Try your own prompt!\")\n","\n","user_prompt = input(\"\\nEnter a text or paragraph you'd like the model to summarise: \")\n","\n","# Tokenise user prompt\n","inputs = tokenizer(user_prompt, return_tensors=\"pt\", truncation=True, padding=True).to(model.device)\n","\n","# Generate output\n","outputs = model.generate(**inputs, max_new_tokens=80)\n","\n","# Decode and print\n","print(\"\\n🧩 Model Output:\\n\")\n","print(tokenizer.decode(outputs[0], skip_special_tokens=True))\n"]},{"cell_type":"markdown","id":"a0a3c84e-2d27-46ad-9356-95e2ef9a598b","metadata":{"id":"a0a3c84e-2d27-46ad-9356-95e2ef9a598b"},"source":["In this template, you fine-tuned **Google’s FLAN-T5-Small** model using **LoRA (Low-Rank Adaptation)** with the **PEFT** library — a modern, lightweight approach to large language model adaptation.\n","\n","Running this workflow on **Saturn Cloud** makes it both **scalable and cost-effective**. 
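\n","\n","Because only the adapter weights were updated, the artefact worth keeping is tiny (a few megabytes rather than a full model checkpoint). Below is a minimal sketch of saving and reloading the adapter with PEFT; the `outputs-lora/adapter` path is illustrative:\n","\n","```python\n","# Save just the LoRA adapter weights and the tokenizer\n","model.save_pretrained(\"outputs-lora/adapter\")\n","tokenizer.save_pretrained(\"outputs-lora/adapter\")\n","\n","# Later: rebuild the base model and attach the saved adapter\n","from transformers import AutoModelForSeq2SeqLM\n","from peft import PeftModel\n","\n","base = AutoModelForSeq2SeqLM.from_pretrained(\"google/flan-t5-small\")\n","model_reloaded = PeftModel.from_pretrained(base, \"outputs-lora/adapter\")\n","```\n","\n","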
Saturn Cloud’s managed infrastructure allows you to:\n","\n","* Start with a **single NVIDIA GPU** for experimentation and scale up to multi-GPU clusters for larger models.\n","* Collaborate across teams easily through shared Jupyter environments.\n","* Integrate this fine-tuning workflow into production pipelines for enterprise-ready deployment.\n","\n","By using this template, you now have a complete, ready-to-run foundation for **adapter-based fine-tuning** in Saturn Cloud — ideal for tasks like summarisation, translation, or instruction-following with minimal resource use.\n","\n","To continue exploring, check out:\n","\n","* [Saturn Cloud Documentation](https://saturncloud.io/docs/) — for advanced configuration and GPU scaling.\n","* [Saturn Cloud Templates](https://saturncloud.io/resources/templates/) — for more examples of ML, LLM, and data science workflows."]}],"metadata":{"kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.13.7"},"colab":{"provenance":[],"gpuType":"T4"},"accelerator":"GPU","widgets":{"application/vnd.jupyter.widget-state+json":{"9db3a5ac0dd84249a2b236b96c58aad8":{"model_module":"@jupyter-widgets/controls","model_name":"HBoxModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_2c821f95cbf94e6f972651544b51bacf","IPY_MODEL_69e89bf8eace41aa850498fd3fd61f99","IPY_MODEL_3aaca7366ecb47d8b4ac27b6301aa91b"],"layout":"IPY_MODEL_48ba285de8364e65a380add6e08e4d69"}},"2c821f95cbf94e6f972651544b51bacf":{"model_module":"@jupyter-widgets/controls","model_name":"HTMLModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_29dfb08a2a1d43b3878cb8a98b285b09","placeholder":"​","style":"IPY_MODEL_4edaefbb46844f8ba1583f63c20f9ccf","value":"Map: 
100%"}},"69e89bf8eace41aa850498fd3fd61f99":{"model_module":"@jupyter-widgets/controls","model_name":"FloatProgressModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_168534f6a2f3457b8dfa29da5aa15d6a","max":200,"min":0,"orientation":"horizontal","style":"IPY_MODEL_3e54568d0ae94350a1a461a6b1cc3423","value":200}},"3aaca7366ecb47d8b4ac27b6301aa91b":{"model_module":"@jupyter-widgets/controls","model_name":"HTMLModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_bd8316fe2cc24289bf8d39ab6f065e43","placeholder":"​","style":"IPY_MODEL_d802453c7a484c89897a30b8ddde157b","value":" 200/200 [00:13<00:00, 15.11 examples/s]"}},"48ba285de8364e65a380add6e08e4d69":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"29dfb08a2a1d43b3878cb8a98b285b09":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"4edaefbb46844f8ba1583f63c20f9ccf":{"model_module":"@jupyter-widgets/controls","model_name":"DescriptionStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","
_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"168534f6a2f3457b8dfa29da5aa15d6a":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"3e54568d0ae94350a1a461a6b1cc3423":{"model_module":"@jupyter-widgets/controls","model_name":"ProgressStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"bd8316fe2cc24289bf8d39ab6f065e43":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"d802453c7a484c89897a30b8ddde157b":{"model_module":"@jupyter-widgets/controls","model_name":"DescriptionStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}}}}},"nbformat":4,"nbformat_minor":5} \ No newline at end of file diff --git a/examples/nlp_and_llms/nvidia-vllm-7b/README.md b/examples/nlp_and_llms/nvidia-vllm-7b/README.md new file mode 100644 index 00000000..503f7a73 --- /dev/null +++ b/examples/nlp_and_llms/nvidia-vllm-7b/README.md @@ -0,0 +1,69 @@ +# 🧠 LLM Inference with vLLM 7B + +**Saturn Cloud | GPU-Optimised Template** + +Run and serve large language models (LLMs) efficiently using **vLLM**, a high-performance 
inference and serving engine designed for speed and scalability. +This Saturn Cloud template demonstrates how to deploy **7B-class models** such as *Mistral*, *Llama*, or *Gemma* for text generation and interactive inference. + +--- + +## 🚀 Overview + +**vLLM** delivers lightning-fast text generation through techniques such as **PagedAttention**, **continuous batching**, and **quantisation**. +On **Saturn Cloud**, this notebook enables you to: + +* Deploy and test 7B-class LLMs for inference and serving. +* Scale seamlessly from a single GPU to **multi-GPU clusters**. +* Experiment interactively or integrate models into larger data-science pipelines. + +> ⚙️ Fully compatible with Saturn Cloud’s managed GPU environments and ready for immediate use. + +--- + +## 🧩 Features + +* **Pre-configured vLLM environment** for fast setup. +* **Support for NVIDIA GPUs** (A10G, A100) and multi-GPU scaling. +* **Quick-start workflow**: load, run, and test model prompts. +* **Local API-style inference** via vLLM’s serving engine. +* **Interactive prompt input** for experimentation. + +--- + +## 📋 Requirements + +* **Saturn Cloud account** with GPU instance access. +* Python ≥ 3.12 +* Compatible with **CUDA 12.0+** and **Transformers ≥ 4.40** + +All dependencies are pre-installed when running the notebook on Saturn Cloud. + +--- + +## 💡 Usage + +1. **Open the template** in Saturn Cloud. +2. **Select a GPU instance** (A10G or A100 recommended). +3. **Run the notebook cells sequentially** to: + + * Install dependencies + * Configure vLLM settings + * Load and test your model + * Input prompts interactively to generate text + +> For production, vLLM can also serve models as an **OpenAI-compatible API** using the `vllm serve` command. + +--- + +## 🧭 Learn More + +* [Saturn Cloud Documentation](https://saturncloud.io/docs/?utm_source=github&utm_medium=template) +* [Saturn Cloud Templates](https://saturncloud.io/templates/?utm_source=github&utm_medium=template) +* [vLLM Official Docs](https://docs.vllm.ai/en/latest/?utm_source=saturn&utm_medium=template) + +--- + +## 🏁 Conclusion + +This template provides a ready-to-run setup for **LLM inference with vLLM 7B on Saturn Cloud**, combining high performance, scalability, and ease of use. +Adapt it for experimentation, prototyping, or production-grade LLM deployments in your Saturn Cloud workspace. diff --git a/examples/nlp_and_llms/nvidia-vllm-7b/nvidia_vllm_7b.ipynb b/examples/nlp_and_llms/nvidia-vllm-7b/nvidia_vllm_7b.ipynb new file mode 100644 index 00000000..9c9dac42 --- /dev/null +++ b/examples/nlp_and_llms/nvidia-vllm-7b/nvidia_vllm_7b.ipynb @@ -0,0 +1,255 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "Es_w2TvemoO3" + }, + "source": [ + "# LLM Inference vLLM 7B\n", + "\n", + "![chat Bubbles](https://cdn-icons-png.flaticon.com/512/2076/2076246.png) ![GPU Illustration](https://cdn-icons-png.flaticon.com/512/4854/4854226.png)\n", + "\n", + "**vLLM** is a high-performance inference and serving engine for large language models, optimised for speed and scalability. It delivers efficient text generation through innovations such as **PagedAttention**,** continuous batching**, and support for **quantisation**.\n", + "\n", + "This is a template demonstrates on how to run **7B-class models** (e.g. 
Mistral, Llama, Gemma) on Saturn Cloud.\n", + "\n", + "On [Saturn Cloud](https://saturncloud.io), you can scale from a single NVIDIA GPU to multi-GPU clusters, enabling distributed inference for larger models or higher throughput workloads — all within a managed, GPU-ready environment." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1hhl8dEPmoO5" + }, + "source": [ + "## 1. Install dependencies\n", + "\n", + "\n", + "We install **vLLM** and **Transformers**. A recent NVIDIA CUDA runtime is recommended for best performance." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "xDTiLAdfmoO6" + }, + "outputs": [], + "source": [ + "!pip install -q jedi\n", + "!pip install -q vllm transformers\n", + "!pip install uv\n", + "!uv venv vllm-env -p 3.12\n", + "!source vllm-env/bin/activate && uv pip install vllm\n", + "!source vllm-env/bin/activate && pip install ipykernel\n", + "!python -m ipykernel install --user --name=vllm-env --display-name \"vLLM Env\"\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ehqOzc4hmoO8" + }, + "source": [ + "## 2. Environment check\n", + "\n", + "Verify the GPU is visible and print library versions. Confirm the environment is GPU-enabled." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "_A7AYnJmmoO9" + }, + "outputs": [], + "source": [ + "import torch, platform\n", + "import vllm, transformers\n", + "\n", + "cuda_ok = torch.cuda.is_available()\n", + "print(f\"✅ CUDA available: {cuda_ok}\")\n", + "if cuda_ok:\n", + " print(\"🧠 GPU:\", torch.cuda.get_device_name(0))\n", + "print(\"🧩 torch:\", torch.__version__)\n", + "print(\"🧩 vllm:\", vllm.__version__)\n", + "print(\"🧩 transformers:\", transformers.__version__)\n", + "print(\"🐍 python:\", platform.python_version())" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Qpk7TkAhmoO-" + }, + "source": [ + "## 3. Select model and vLLM settings\n", + "\n", + "Choose a **7B** model from Hugging Face. The defaults below work with common, openly available options. If a model is gated, select a different one." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Vujk0jtwmoO-" + }, + "outputs": [], + "source": [ + "# 🔧 Model & runtime config (edit these as needed)\n", + "MODEL_ID = \"mistralai/Mistral-7B-Instruct-v0.2\" # e.g., \"meta-llama/Llama-2-7b-chat-hf\", \"google/gemma-7b\"\n", + "DTYPE = \"auto\" # \"auto\", \"float16\", \"bfloat16\", \"float32\"\n", + "TENSOR_PARALLEL = 1 # single GPU = 1\n", + "GPU_MEMORY_UTIL = 0.90 # 0.6–0.95 depending on VRAM\n", + "MAX_MODEL_LEN = 8192 # context length (depends on model)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gMjiJkoTmoPA" + }, + "source": [ + "## 4. Basic model inference\n", + "\n", + "Load the model with **vLLM** and generate text for one or more prompts using **SamplingParams** (temperature, top_p, max_tokens, etc.)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "D7IXT5FWmoPB" + }, + "outputs": [], + "source": [ + "from vllm import LLM, SamplingParams\n", + "\n", + "print(\"⏳ Loading model (this may download weights on first run)...\")\n", + "llm = LLM(\n", + " model=MODEL_ID,\n", + " dtype=DTYPE,\n", + " tensor_parallel_size=TENSOR_PARALLEL,\n", + " gpu_memory_utilization=GPU_MEMORY_UTIL,\n", + " max_model_len=MAX_MODEL_LEN,\n", + ")\n", + "print(\"✅ Model loaded!\")\n" + ] + }, + { + "cell_type": "markdown", + "source": [ + "## 5. 
Sample prompts\n", + "\n", + "Use the customise Let's test the model using sample prompts." + ], + "metadata": { + "id": "yaaCIaOfDILx" + } + }, + { + "cell_type": "code", + "source": [ + "# Example prompts\n", + "prompts = [\n", + " \"You are a helpful assistant. Summarise why efficient attention helps LLM inference.\",\n", + " \"List three creative uses of a 7B model for education.\",\n", + "]\n", + "\n", + "# Sampling parameters\n", + "sampling = SamplingParams(\n", + " temperature=0.7,\n", + " top_p=0.9,\n", + " max_tokens=256,\n", + ")\n", + "\n", + "# Generate\n", + "outputs = llm.generate(prompts, sampling)\n", + "for out in outputs:\n", + " print(\"\\n---\")\n", + " print(\"Prompt:\", out.prompt)\n", + " print(\"Completion:\", out.outputs[0].text.strip())\n" + ], + "metadata": { + "id": "1s_ALheCCwfP" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## 6. User Custom Prompt Testing\n", + "\n", + "You can enter your prompt to test the model's chat capabilities here." + ], + "metadata": { + "id": "kaSLGm0_GL62" + } + }, + { + "cell_type": "code", + "source": [ + "# Helper function for quick generation\n", + "def generate_text(prompt, temperature=0.7, top_p=0.9, max_tokens=256):\n", + " params = SamplingParams(temperature=temperature, top_p=top_p, max_tokens=max_tokens)\n", + " result = llm.generate([prompt], params)[0].outputs[0].text\n", + " return result.strip()\n", + "\n", + "print(\"\\nQuick test:\")\n", + "new_Prompt = input(\"Enter a prompt: \")\n", + "print(generate_text(new_Prompt))\n", + "\n", + "\n", + "# print(generate_text(\"Explain what continuous batching means in vLLM.\"))" + ], + "metadata": { + "id": "AI9CELj5Ej5g" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yJSF_-4FmoPD" + }, + "source": [ + "## 7. Conclusion\n", + "\n", + "You have successfully deployed and run a 7B-class Large Language Model using vLLM on Saturn Cloud. This template demonstrates how to perform high-speed inference, interact with your model via prompts, and scale seamlessly across single or multiple GPUs.\n", + "\n", + "\n", + "By using [Saturn Cloud’s GPU infrastructure](https://saturncloud.io/docs/user-guide/how-to/resources/), you can easily extend this workflow for larger models, API serving, or integrated data science pipelines — all within a managed, scalable environment designed for production-grade AI workloads. Visit [saturn cloud](https://saturncloud.io/) to easily deploy this model." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.13.7", + "mimetype": "text/x-python", + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "pygments_lexer": "ipython3", + "nbconvert_exporter": "python", + "file_extension": ".py" + }, + "colab": { + "provenance": [], + "gpuType": "A100" + }, + "accelerator": "GPU" + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file