12 changes: 12 additions & 0 deletions examples/nlp_and_llms/nvidia-lora/README.md
@@ -0,0 +1,12 @@
# LoRA Fine-Tuning (PEFT + Transformers)

![LoRA Fine-Tuning Header](https://cdn-icons-png.flaticon.com/512/8101/8101225.png)

This template illustrates how **LoRA fine-tuning** can significantly reduce resource requirements while maintaining strong model performance.
By running it on **Saturn Cloud**, you benefit from a GPU-optimized, scalable environment that simplifies the entire fine-tuning workflow — from experimentation to production deployment.
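
For a feel of the approach, here is a minimal PEFT sketch (the base model and hyperparameters are illustrative, not the template's exact settings):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base model; the notebook may use a different checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Attach low-rank adapters to the attention projection; only these are trained
config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["c_attn"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```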

Learn more:

* 🔗 [Saturn Cloud Documentation](https://saturncloud.io/docs/)
* 🔗 [Saturn Cloud Templates Gallery](https://saturncloud.io/resources/templates/)
* 🔗 [PEFT Library (Hugging Face)](https://huggingface.co/docs/peft/index)
1 change: 1 addition & 0 deletions examples/nlp_and_llms/nvidia-lora/nvidia_lora.ipynb

Large diffs are not rendered by default.

69 changes: 69 additions & 0 deletions examples/nlp_and_llms/nvidia-vllm-7b/README.md
@@ -0,0 +1,69 @@
# 🧠 LLM Inference with vLLM 7B

**Saturn Cloud | GPU-Optimised Template**

Run and serve large language models (LLMs) efficiently using **vLLM**, a high-performance inference and serving engine designed for speed and scalability.
This Saturn Cloud template demonstrates how to deploy **7B-class models** such as *Mistral*, *Llama*, or *Gemma* for text generation and interactive inference.

---

## 🚀 Overview

**vLLM** delivers lightning-fast text generation through techniques such as **PagedAttention**, **continuous batching**, and **quantisation**.
On **Saturn Cloud**, this notebook enables you to:

* Deploy and test 7B-class LLMs for inference and serving.
* Scale seamlessly from a single GPU to **multi-GPU clusters**.
* Experiment interactively or integrate models into larger data-science pipelines.

> ⚙️ Fully compatible with Saturn Cloud’s managed GPU environments and ready for immediate use.
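
As a minimal sketch of the offline API (the model ID is illustrative; substitute any 7B checkpoint you can access):

```python
from vllm import LLM, SamplingParams

# Example 7B checkpoint; weights are downloaded on first run
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)
```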

---

## 🧩 Features

* **Pre-configured vLLM environment** for fast setup.
* **Support for NVIDIA GPUs** (A10G, A100) and multi-GPU scaling.
* **Quick-start workflow**: load, run, and test model prompts.
* **Local API-style inference** via vLLM’s serving engine.
* **Interactive prompt input** for experimentation.

---

## 📋 Requirements

* **Saturn Cloud account** with GPU instance access.
* Python ≥ 3.12
* Compatible with **CUDA 12.0+** and **Transformers ≥ 4.40**

All dependencies are pre-installed when running the notebook on Saturn Cloud.

---

## 💡 Usage

1. **Open the template** in Saturn Cloud.
2. **Select a GPU instance** (A10G or A100 recommended).
3. **Run the notebook cells sequentially** to:

* Install dependencies
* Configure vLLM settings
* Load and test your model
* Input prompts interactively to generate text

> For production, vLLM can also serve models as an **OpenAI-compatible API** using the `vllm serve` command.
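
A minimal sketch of that serving path (the port, model ID, and `openai` client usage are assumptions, not part of this template):

```python
# First, in a terminal: vllm serve mistralai/Mistral-7B-Instruct-v0.2 --port 8000
# Then query the OpenAI-compatible endpoint (requires `pip install openai`):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```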

---

## 🧭 Learn More

* [Saturn Cloud Documentation](https://saturncloud.io/docs/?utm_source=github&utm_medium=template)
* [Saturn Cloud Templates](https://saturncloud.io/templates/?utm_source=github&utm_medium=template)
* [vLLM Official Docs](https://docs.vllm.ai/en/latest/?utm_source=saturn&utm_medium=template)

---

## 🏁 Conclusion

This template provides a ready-to-run setup for **LLM inference with vLLM 7B on Saturn Cloud**, combining high performance, scalability, and ease of use.
Adapt it for experimentation, prototyping, or production-grade LLM deployments in your Saturn Cloud workspace.
255 changes: 255 additions & 0 deletions examples/nlp_and_llms/nvidia-vllm-7b/nvidia_vllm_7b.ipynb
@@ -0,0 +1,255 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "Es_w2TvemoO3"
},
"source": [
"# LLM Inference vLLM 7B\n",
"\n",
"![chat Bubbles](https://cdn-icons-png.flaticon.com/512/2076/2076246.png) ![GPU Illustration](https://cdn-icons-png.flaticon.com/512/4854/4854226.png)\n",
"\n",
"**vLLM** is a high-performance inference and serving engine for large language models, optimised for speed and scalability. It delivers efficient text generation through innovations such as **PagedAttention**,** continuous batching**, and support for **quantisation**.\n",
"\n",
"This is a template demonstrates on how to run **7B-class models** (e.g. Mistral, Llama, Gemma) on Saturn Cloud.\n",
"\n",
"On [Saturn Cloud](https://saturncloud.io), you can scale from a single NVIDIA GPU to multi-GPU clusters, enabling distributed inference for larger models or higher throughput workloads — all within a managed, GPU-ready environment."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1hhl8dEPmoO5"
},
"source": [
"## 1. Install dependencies\n",
"\n",
"\n",
"We install **vLLM** and **Transformers**. A recent NVIDIA CUDA runtime is recommended for best performance."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "xDTiLAdfmoO6"
},
"outputs": [],
"source": [
"!pip install -q jedi\n",
"!pip install -q vllm transformers\n",
"!pip install uv\n",
"!uv venv vllm-env -p 3.12\n",
"!source vllm-env/bin/activate && uv pip install vllm\n",
"!source vllm-env/bin/activate && pip install ipykernel\n",
"!python -m ipykernel install --user --name=vllm-env --display-name \"vLLM Env\"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ehqOzc4hmoO8"
},
"source": [
"## 2. Environment check\n",
"\n",
"Verify the GPU is visible and print library versions. Confirm the environment is GPU-enabled."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_A7AYnJmmoO9"
},
"outputs": [],
"source": [
"import torch, platform\n",
"import vllm, transformers\n",
"\n",
"cuda_ok = torch.cuda.is_available()\n",
"print(f\"✅ CUDA available: {cuda_ok}\")\n",
"if cuda_ok:\n",
" print(\"🧠 GPU:\", torch.cuda.get_device_name(0))\n",
"print(\"🧩 torch:\", torch.__version__)\n",
"print(\"🧩 vllm:\", vllm.__version__)\n",
"print(\"🧩 transformers:\", transformers.__version__)\n",
"print(\"🐍 python:\", platform.python_version())"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Qpk7TkAhmoO-"
},
"source": [
"## 3. Select model and vLLM settings\n",
"\n",
"Choose a **7B** model from Hugging Face. The defaults below work with common, openly available options. If a model is gated, select a different one."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Vujk0jtwmoO-"
},
"outputs": [],
"source": [
"# 🔧 Model & runtime config (edit these as needed)\n",
"MODEL_ID = \"mistralai/Mistral-7B-Instruct-v0.2\" # e.g., \"meta-llama/Llama-2-7b-chat-hf\", \"google/gemma-7b\"\n",
"DTYPE = \"auto\" # \"auto\", \"float16\", \"bfloat16\", \"float32\"\n",
"TENSOR_PARALLEL = 1 # single GPU = 1\n",
"GPU_MEMORY_UTIL = 0.90 # 0.6–0.95 depending on VRAM\n",
"MAX_MODEL_LEN = 8192 # context length (depends on model)"
]
},
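{
"cell_type": "markdown",
"metadata": {},
"source": [
"> Optional: gated checkpoints (e.g. some Llama variants) require Hugging Face authentication. A minimal sketch, assuming you have an access token:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: authenticate for gated models (the token value is a placeholder)\n",
"# import os\n",
"# os.environ[\"HF_TOKEN\"] = \"hf_...\"  # your Hugging Face access token"
]
},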
{
"cell_type": "markdown",
"metadata": {
"id": "gMjiJkoTmoPA"
},
"source": [
"## 4. Basic model inference\n",
"\n",
"Load the model with **vLLM** and generate text for one or more prompts using **SamplingParams** (temperature, top_p, max_tokens, etc.)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "D7IXT5FWmoPB"
},
"outputs": [],
"source": [
"from vllm import LLM, SamplingParams\n",
"\n",
"print(\"⏳ Loading model (this may download weights on first run)...\")\n",
"llm = LLM(\n",
" model=MODEL_ID,\n",
" dtype=DTYPE,\n",
" tensor_parallel_size=TENSOR_PARALLEL,\n",
" gpu_memory_utilization=GPU_MEMORY_UTIL,\n",
" max_model_len=MAX_MODEL_LEN,\n",
")\n",
"print(\"✅ Model loaded!\")\n"
]
},
{
"cell_type": "markdown",
"source": [
"## 5. Sample prompts\n",
"\n",
"Use the customise Let's test the model using sample prompts."
],
"metadata": {
"id": "yaaCIaOfDILx"
}
},
{
"cell_type": "code",
"source": [
"# Example prompts\n",
"prompts = [\n",
" \"You are a helpful assistant. Summarise why efficient attention helps LLM inference.\",\n",
" \"List three creative uses of a 7B model for education.\",\n",
"]\n",
"\n",
"# Sampling parameters\n",
"sampling = SamplingParams(\n",
" temperature=0.7,\n",
" top_p=0.9,\n",
" max_tokens=256,\n",
")\n",
"\n",
"# Generate\n",
"outputs = llm.generate(prompts, sampling)\n",
"for out in outputs:\n",
" print(\"\\n---\")\n",
" print(\"Prompt:\", out.prompt)\n",
" print(\"Completion:\", out.outputs[0].text.strip())\n"
],
"metadata": {
"id": "1s_ALheCCwfP"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## 6. User Custom Prompt Testing\n",
"\n",
"You can enter your prompt to test the model's chat capabilities here."
],
"metadata": {
"id": "kaSLGm0_GL62"
}
},
{
"cell_type": "code",
"source": [
"# Helper function for quick generation\n",
"def generate_text(prompt, temperature=0.7, top_p=0.9, max_tokens=256):\n",
" params = SamplingParams(temperature=temperature, top_p=top_p, max_tokens=max_tokens)\n",
" result = llm.generate([prompt], params)[0].outputs[0].text\n",
" return result.strip()\n",
"\n",
"print(\"\\nQuick test:\")\n",
"new_Prompt = input(\"Enter a prompt: \")\n",
"print(generate_text(new_Prompt))\n",
"\n",
"\n",
"# print(generate_text(\"Explain what continuous batching means in vLLM.\"))"
],
"metadata": {
"id": "AI9CELj5Ej5g"
},
"execution_count": null,
"outputs": []
},
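{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small follow-up sketch (settings are illustrative): `temperature=0.0` gives greedy, repeatable completions, which helps when comparing prompt variations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Greedy decoding for reproducible output (illustrative settings)\n",
"greedy = SamplingParams(temperature=0.0, max_tokens=128)\n",
"print(llm.generate([\"Define PagedAttention in one sentence.\"], greedy)[0].outputs[0].text.strip())"
]
},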
{
"cell_type": "markdown",
"metadata": {
"id": "yJSF_-4FmoPD"
},
"source": [
"## 7. Conclusion\n",
"\n",
"You have successfully deployed and run a 7B-class Large Language Model using vLLM on Saturn Cloud. This template demonstrates how to perform high-speed inference, interact with your model via prompts, and scale seamlessly across single or multiple GPUs.\n",
"\n",
"\n",
"By using [Saturn Cloud’s GPU infrastructure](https://saturncloud.io/docs/user-guide/how-to/resources/), you can easily extend this workflow for larger models, API serving, or integrated data science pipelines — all within a managed, scalable environment designed for production-grade AI workloads. Visit [saturn cloud](https://saturncloud.io/) to easily deploy this model."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.13.7",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"colab": {
"provenance": [],
"gpuType": "A100"
},
"accelerator": "GPU"
},
"nbformat": 4,
"nbformat_minor": 0
}