diff --git a/nemo/NeMo-Data-Designer/README.md b/nemo/NeMo-Data-Designer/README.md new file mode 100644 index 000000000..73c00d2fc --- /dev/null +++ b/nemo/NeMo-Data-Designer/README.md @@ -0,0 +1,25 @@ +# 🎨 NeMo Data Designer Tutorial Notebooks + +This directory contains the tutorial notebooks for getting started with NeMo Data Designer. + +## 🐳 Deploy the NeMo Data Designer microservice locally + +To run these notebooks, you must have the NeMo Data Designer microservice deployed locally via Docker Compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details. + +## 📦 Set up the environment + +We will use the `uv` package manager to set up our environment and install the necessary dependencies. If you don't have `uv` installed, you can follow the installation instructions from the [uv documentation](https://docs.astral.sh/uv/getting-started/installation/). + +Once you have `uv` installed, make sure you are in the `NeMo-Data-Designer` directory and run the following command: + +```bash +uv sync +``` + +This will create a virtual environment and install the necessary dependencies. Activate the virtual environment by running the following command: + +```bash +source .venv/bin/activate +``` + +Be sure to select this virtual environment as your kernel when running the notebooks. 
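+ +If you are unsure whether a notebook is actually running on this environment, a quick check of the kernel's interpreter path can help. This is an optional, illustrative snippet (not part of the tutorials); it assumes the default `.venv` directory that `uv sync` creates, and the helper name is ours:

```python
import sys


# Sanity check for a notebook cell: confirm the kernel's Python interpreter
# lives inside the project's virtual environment. The ".venv" directory name
# is the uv default; pass a different env_dir if yours differs.
def uses_expected_env(executable: str, env_dir: str = ".venv") -> bool:
    """Return True if the interpreter path contains the environment directory."""
    return env_dir in executable


print(f"Kernel interpreter: {sys.executable}")
print(f"Expected environment active: {uses_expected_env(sys.executable)}")
```

If this prints `False`, reselect the `.venv` kernel in your notebook interface and restart it.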
diff --git a/nemo/NeMo-Data-Designer/intro-tutorials/1-the-basics.ipynb b/nemo/NeMo-Data-Designer/intro-tutorials/1-the-basics.ipynb new file mode 100644 index 000000000..d5275403d --- /dev/null +++ b/nemo/NeMo-Data-Designer/intro-tutorials/1-the-basics.ipynb @@ -0,0 +1,479 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 🎨 NeMo Data Designer 101: The Basics\n", + "\n", + "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", + ">\n", + "> **Note**: To run this notebook, you must have the NeMo Data Designer microservice deployed locally via Docker Compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.\n", + "\n", + "
\n", + "\n", + "In this notebook, we will demonstrate the basics of Data Designer by generating a simple product review dataset.\n", + "\n", + "#### 💾 Install dependencies\n", + "\n", + "**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If the installation worked, you should be able to make the following imports:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from getpass import getpass\n", + "\n", + "from nemo_microservices import NeMoMicroservices\n", + "from nemo_microservices.beta.data_designer import (\n", + " DataDesignerConfigBuilder,\n", + " DataDesignerClient,\n", + ")\n", + "from nemo_microservices.beta.data_designer.config import columns as C\n", + "from nemo_microservices.beta.data_designer.config import params as P" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### ⚙️ Initialize the NeMo Data Designer (NDD) Client\n", + "\n", + "- The NDD client is responsible for submitting generation requests to the Data Designer microservice.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ndd = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8000\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 🏗️ Initialize the Data Designer Config Builder\n", + "\n", + "- The Data Designer config defines the dataset schema and generation process.\n", + "\n", + "- The config builder provides an intuitive interface for building this configuration.\n", + "\n", + "- You must provide a list of model configs to the builder at initialization.\n", + "\n", + "- This list contains the models you can choose from (via the `model_alias` 
argument) during the generation process.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# build.nvidia.com model endpoint\n", + "endpoint = \"https://integrate.api.nvidia.com/v1\"\n", + "model_id = \"mistralai/mistral-small-24b-instruct\"\n", + "\n", + "model_alias = \"mistral-small\"\n", + "\n", + "# You will need to enter your model provider API key to run this notebook.\n", + "api_key = getpass(\"Enter model provider API key: \")\n", + "\n", + "if len(api_key) > 0:\n", + " print(\"✅ API key received.\")\n", + "else:\n", + " print(\"❌ No API key provided. Please enter your model provider API key.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "model_configs = [\n", + " P.ModelConfig(\n", + " alias=model_alias,\n", + " inference_parameters=P.InferenceParameters(\n", + " max_tokens=1024,\n", + " temperature=0.5,\n", + " top_p=1.0,\n", + " ),\n", + " model=P.Model(\n", + " api_endpoint=P.ApiEndpoint(\n", + " api_key=api_key,\n", + " model_id=model_id,\n", + " url=endpoint,\n", + " ),\n", + " ),\n", + " )\n", + "]\n", + "\n", + "config_builder = DataDesignerConfigBuilder(model_configs=model_configs)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 🎲 Getting started with sampler columns\n", + "\n", + "- Sampler columns offer non-LLM based generation of synthetic data.\n", + "\n", + "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below.\n", + "\n", + "
\n", + "\n", + "Let's start designing our product review dataset by adding product category and subcategory columns.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "config_builder.add_column(\n", + " C.SamplerColumn(\n", + " name=\"product_category\",\n", + " type=P.SamplerType.CATEGORY,\n", + " params=P.CategorySamplerParams(\n", + " values=[\n", + " \"Electronics\",\n", + " \"Clothing\",\n", + " \"Home & Kitchen\",\n", + " \"Books\",\n", + " \"Home Office\",\n", + " ],\n", + " ),\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " C.SamplerColumn(\n", + " name=\"product_subcategory\",\n", + " type=P.SamplerType.SUBCATEGORY,\n", + " params=P.SubcategorySamplerParams(\n", + " category=\"product_category\",\n", + " values={\n", + " \"Electronics\": [\n", + " \"Smartphones\",\n", + " \"Laptops\",\n", + " \"Headphones\",\n", + " \"Cameras\",\n", + " \"Accessories\",\n", + " ],\n", + " \"Clothing\": [\n", + " \"Men's Clothing\",\n", + " \"Women's Clothing\",\n", + " \"Winter Coats\",\n", + " \"Activewear\",\n", + " \"Accessories\",\n", + " ],\n", + " \"Home & Kitchen\": [\n", + " \"Appliances\",\n", + " \"Cookware\",\n", + " \"Furniture\",\n", + " \"Decor\",\n", + " \"Organization\",\n", + " ],\n", + " \"Books\": [\n", + " \"Fiction\",\n", + " \"Non-Fiction\",\n", + " \"Self-Help\",\n", + " \"Textbooks\",\n", + " \"Classics\",\n", + " ],\n", + " \"Home Office\": [\n", + " \"Desks\",\n", + " \"Chairs\",\n", + " \"Storage\",\n", + " \"Office Supplies\",\n", + " \"Lighting\",\n", + " ],\n", + " },\n", + " ),\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " C.SamplerColumn(\n", + " name=\"target_age_range\",\n", + " type=P.SamplerType.CATEGORY,\n", + " params=P.CategorySamplerParams(\n", + " values=[\"18-25\", \"25-35\", \"35-50\", \"50-65\", \"65+\"]\n", + " ),\n", + " )\n", + ")\n", + "\n", + "# Optionally validate that the columns are configured correctly.\n", + 
"config_builder.validate()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, let's add samplers to generate data related to the customer and their review.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# This column will sample synthetic person data based on statistics from the US Census.\n", + "config_builder.add_column(\n", + " C.SamplerColumn(\n", + " name=\"customer\",\n", + " type=P.SamplerType.PERSON,\n", + " params=P.PersonSamplerParams(age_range=[18, 70]),\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " C.SamplerColumn(\n", + " name=\"number_of_stars\",\n", + " type=P.SamplerType.UNIFORM,\n", + " params=P.UniformSamplerParams(low=1, high=5),\n", + " convert_to=\"int\",\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " C.SamplerColumn(\n", + " name=\"review_style\",\n", + " type=P.SamplerType.CATEGORY,\n", + " params=P.CategorySamplerParams(\n", + " values=[\"rambling\", \"brief\", \"detailed\", \"structured with bullet points\"],\n", + " weights=[1, 2, 2, 1],\n", + " ),\n", + " )\n", + ")\n", + "\n", + "config_builder.validate()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 🦜 LLM-generated columns\n", + "\n", + "- The real power of Data Designer comes from leveraging LLMs to generate text, code, and structured data.\n", + "\n", + "- For our product review dataset, we will use LLM-generated text columns to generate product names and customer reviews.\n", + "\n", + "- When prompting the LLM, we can use Jinja templating to reference other columns in the dataset.\n", + "\n", + "- As we see below, nested json columns can be accessed using dot notation.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "config_builder.add_column(\n", + " C.LLMTextColumn(\n", + " name=\"product_name\",\n", + " prompt=(\n", + " \"Come up 
with a creative product name for a product in the '{{ product_category }}' category, focusing \"\n", + " \"on products related to '{{ product_subcategory }}'. The target age range of the ideal customer is \"\n", + " \"{{ target_age_range }} years old. Respond with only the product name, no other text.\"\n", + " ),\n", + " # This is optional, but it can be useful for controlling the behavior of the LLM. Do not include instructions\n", + " # related to output formatting in the system prompt, as Data Designer handles this based on the column type.\n", + " system_prompt=(\n", + " \"You are a helpful assistant that generates product names. You respond with only the product name, \"\n", + " \"no other text. You do NOT add quotes around the product name. \"\n", + " ),\n", + " model_alias=model_alias,\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " C.LLMTextColumn(\n", + " name=\"customer_review\",\n", + " prompt=(\n", + " \"You are a customer named {{ customer.first_name }} from {{ customer.city }}, {{ customer.state }}. \"\n", + " \"You are {{ customer.age }} years old and recently purchased a product called {{ product_name }}. \"\n", + " \"Write a review of this product, which you gave a rating of {{ number_of_stars }} stars. \"\n", + " \"The style of the review should be '{{ review_style }}'. 
\"\n", + " ),\n", + " model_alias=model_alias,\n", + " )\n", + ")\n", + "\n", + "config_builder.validate()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 👀 Preview the dataset\n", + "\n", + "- Iteration is key to generating high-quality synthetic data.\n", + "\n", + "- Use the `preview` method to generate 10 records for inspection.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "preview = ndd.preview(config_builder, verbose_logging=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# The preview dataset is available as a pandas DataFrame.\n", + "preview.dataset" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Run this cell multiple times to cycle through the 10 preview records.\n", + "preview.display_sample_record()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 🧐 Adding an Evaluation Report\n", + "\n", + "- Data Designer offers an evaluation report for a quick look at the quality of the generated data.\n", + "\n", + "- To add a report, which will be generated at the end of a generation job, simply run the `with_evaluation_report` method.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "config_builder.with_evaluation_report()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 🧬 Generate your dataset\n", + "\n", + "- Once you are happy with the preview, scale up to a larger dataset.\n", + "\n", + "- The `create` method will submit your generation job to the microservice and return a results object.\n", + "\n", + "- If you want to wait for the job to complete, set `wait_until_done=True`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + 
"results = ndd.create(config_builder, num_records=20, wait_until_done=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# load the dataset into a pandas DataFrame\n", + "dataset = results.load_dataset()\n", + "\n", + "dataset.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 🔎 View the evaluation report\n", + "\n", + "- The evaluation report is generated in HTML format and can be viewed in a browser.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import webbrowser\n", + "from pathlib import Path\n", + "\n", + "eval_report_path = Path(\"./1-the-basics-eval-report.html\").resolve()\n", + "\n", + "results.download_evaluation_report(eval_report_path)\n", + "\n", + "webbrowser.open_new_tab(f\"file:///{eval_report_path}\");" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## ⏭️ Next Steps\n", + "\n", + "Now that you've seen the basics of Data Designer, check out the following notebooks to learn more about:\n", + "\n", + "- [Structured outputs and jinja expressions](./2-structured-outputs-and-jinja-expressions.ipynb)\n", + "\n", + "- [Seeding synthetic data generation with an external dataset](./3-seeding-with-a-dataset.ipynb)\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/nemo/NeMo-Data-Designer/intro-tutorials/2-structured-outputs-and-jinja-expressions.ipynb b/nemo/NeMo-Data-Designer/intro-tutorials/2-structured-outputs-and-jinja-expressions.ipynb new file mode 100644 index 
000000000..82fc571c5 --- /dev/null +++ b/nemo/NeMo-Data-Designer/intro-tutorials/2-structured-outputs-and-jinja-expressions.ipynb @@ -0,0 +1,448 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 🎨 NeMo Data Designer 101: Structured Outputs and Jinja Expressions\n", + "\n", + "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", + ">\n", + "> **Note**: To run this notebook, you must have the NeMo Data Designer microservice deployed locally via Docker Compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.\n", + "\n", + "
\n", + "\n", + "In this notebook, we will continue our exploration of Data Designer, demonstrating more advanced data generation using structured outputs and Jinja expressions.\n", + "\n", + "#### 💾 Install dependencies\n", + "\n", + "**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If the installation worked, you should be able to make the following imports:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from getpass import getpass\n", + "\n", + "from nemo_microservices import NeMoMicroservices\n", + "from nemo_microservices.beta.data_designer import (\n", + " DataDesignerConfigBuilder,\n", + " DataDesignerClient,\n", + ")\n", + "from nemo_microservices.beta.data_designer.config import columns as C\n", + "from nemo_microservices.beta.data_designer.config import params as P" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 🧑‍🎨 Designing our data\n", + "\n", + "- We will again create a product review dataset, but this time we will use structured outputs and Jinja expressions.\n", + "\n", + "- Structured outputs let you specify the exact schema of the data you want to generate.\n", + "\n", + "- Data Designer supports schemas specified using either json schema or Pydantic data models (recommended).\n", + "\n", + "
\n", + "\n", + "We'll define our structured outputs using Pydantic data models:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from decimal import Decimal\n", + "from typing import Literal\n", + "from pydantic import BaseModel, Field\n", + "\n", + "\n", + "# We define a Product schema so that the name, description, and price are generated\n", + "# in one go, with the types and constraints specified.\n", + "class Product(BaseModel):\n", + " name: str = Field(description=\"The name of the product\")\n", + " description: str = Field(description=\"A description of the product\")\n", + " price: Decimal = Field(\n", + " description=\"The price of the product\", ge=10, le=1000, decimal_places=2\n", + " )\n", + "\n", + "\n", + "class ProductReview(BaseModel):\n", + " rating: int = Field(description=\"The rating of the product\", ge=1, le=5)\n", + " customer_mood: Literal[\"irritated\", \"mad\", \"happy\", \"neutral\", \"excited\"] = Field(\n", + " description=\"The mood of the customer\"\n", + " )\n", + " review: str = Field(description=\"A review of the product\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### ⚙️ Initialize the NeMo Data Designer (NDD) Client\n", + "\n", + "- The NDD client is responsible for submitting generation requests to the Data Designer microservice.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ndd = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8000\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 🏗️ Initialize the Data Designer Config Builder\n", + "\n", + "- The Data Designer config defines the dataset schema and generation process.\n", + "\n", + "- The config builder provides an intuitive interface for building this configuration.\n", + "\n", + "- You must provide a list of model configs to the builder at 
initialization.\n", + "\n", + "- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# build.nvidia.com model endpoint\n", + "endpoint = \"https://integrate.api.nvidia.com/v1\"\n", + "model_id = \"mistralai/mistral-small-24b-instruct\"\n", + "\n", + "model_alias = \"mistral-small\"\n", + "\n", + "# You will need to enter your model provider API key to run this notebook.\n", + "api_key = getpass(\"Enter model provider API key: \")\n", + "\n", + "if len(api_key) > 0:\n", + " print(\"✅ API key received.\")\n", + "else:\n", + " print(\"❌ No API key provided. Please enter your model provider API key.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "model_configs = [\n", + " P.ModelConfig(\n", + " alias=model_alias,\n", + " inference_parameters=P.InferenceParameters(\n", + " max_tokens=1024,\n", + " temperature=0.5,\n", + " top_p=1.0,\n", + " ),\n", + " model=P.Model(\n", + " api_endpoint=P.ApiEndpoint(\n", + " api_key=api_key,\n", + " model_id=model_id,\n", + " url=endpoint,\n", + " ),\n", + " ),\n", + " )\n", + "]\n", + "\n", + "config_builder = DataDesignerConfigBuilder(model_configs=model_configs)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, let's design our product review dataset using a few more tricks compared to the previous notebook:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Since we often just want a few attributes from Person objects, we can use\n", + "# Data Designer's `with_person_samplers` method to create multiple person samplers\n", + "# at once and drop the person object columns from the final dataset.\n", + "config_builder.with_person_samplers(\n", + " {\"customer\": 
P.PersonSamplerParams(age_range=[18, 65])}\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " C.SamplerColumn(\n", + " name=\"product_category\",\n", + " type=P.SamplerType.CATEGORY,\n", + " params=P.CategorySamplerParams(\n", + " values=[\n", + " \"Electronics\",\n", + " \"Clothing\",\n", + " \"Home & Kitchen\",\n", + " \"Books\",\n", + " \"Home Office\",\n", + " ],\n", + " ),\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " C.SamplerColumn(\n", + " name=\"product_subcategory\",\n", + " type=P.SamplerType.SUBCATEGORY,\n", + " params=P.SubcategorySamplerParams(\n", + " category=\"product_category\",\n", + " values={\n", + " \"Electronics\": [\n", + " \"Smartphones\",\n", + " \"Laptops\",\n", + " \"Headphones\",\n", + " \"Cameras\",\n", + " \"Accessories\",\n", + " ],\n", + " \"Clothing\": [\n", + " \"Men's Clothing\",\n", + " \"Women's Clothing\",\n", + " \"Winter Coats\",\n", + " \"Activewear\",\n", + " \"Accessories\",\n", + " ],\n", + " \"Home & Kitchen\": [\n", + " \"Appliances\",\n", + " \"Cookware\",\n", + " \"Furniture\",\n", + " \"Decor\",\n", + " \"Organization\",\n", + " ],\n", + " \"Books\": [\n", + " \"Fiction\",\n", + " \"Non-Fiction\",\n", + " \"Self-Help\",\n", + " \"Textbooks\",\n", + " \"Classics\",\n", + " ],\n", + " \"Home Office\": [\n", + " \"Desks\",\n", + " \"Chairs\",\n", + " \"Storage\",\n", + " \"Office Supplies\",\n", + " \"Lighting\",\n", + " ],\n", + " },\n", + " ),\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " C.SamplerColumn(\n", + " name=\"target_age_range\",\n", + " type=P.SamplerType.CATEGORY,\n", + " params=P.CategorySamplerParams(\n", + " values=[\"18-25\", \"25-35\", \"35-50\", \"50-65\", \"65+\"]\n", + " ),\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " C.SamplerColumn(\n", + " name=\"review_style\",\n", + " type=P.SamplerType.CATEGORY,\n", + " params=P.CategorySamplerParams(\n", + " values=[\"rambling\", \"brief\", \"detailed\", \"structured with bullet 
points\"],\n", + " weights=[1, 2, 2, 1],\n", + " ),\n", + " )\n", + ")\n", + "\n", + "# We can create new columns using Jinja expressions that reference\n", + "# existing columns, including attributes of nested objects.\n", + "config_builder.add_column(\n", + " C.ExpressionColumn(\n", + " name=\"customer_name\", expr=\"{{ customer.first_name }} {{ customer.last_name }}\"\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " C.ExpressionColumn(name=\"customer_age\", expr=\"{{ customer.age }}\")\n", + ")\n", + "\n", + "# Add an `LLMStructuredColumn` column to generate structured outputs.\n", + "config_builder.add_column(\n", + " C.LLMStructuredColumn(\n", + " name=\"product\",\n", + " prompt=(\n", + " \"Create a product in the '{{ product_category }}' category, focusing on products \"\n", + " \"related to '{{ product_subcategory }}'. The target age range of the ideal customer is \"\n", + " \"{{ target_age_range }} years old. The product should be priced between $10 and $1000.\"\n", + " ),\n", + " output_format=Product,\n", + " model_alias=model_alias,\n", + " )\n", + ")\n", + "\n", + "config_builder.add_column(\n", + " C.LLMStructuredColumn(\n", + " name=\"customer_review\",\n", + " prompt=(\n", + " \"Your task is to write a review for the following product:\\n\\n\"\n", + " \"Product Name: {{ product.name }}\\n\"\n", + " \"Product Description: {{ product.description }}\\n\"\n", + " \"Price: {{ product.price }}\\n\\n\"\n", + " \"Imagine your name is {{ customer_name }} and you are from {{ customer.city }}, {{ customer.state }}. 
\"\n", + "            \"Write the review in a style that is '{{ review_style }}'.\"\n", + "        ),\n", + "        output_format=ProductReview,\n", + "        model_alias=model_alias,\n", + "    )\n", + ")\n", + "\n", + "# Let's add an evaluation report to our dataset.\n", + "config_builder.with_evaluation_report().validate()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 👀 Preview the dataset\n", + "\n", + "- Iteration is key to generating high-quality synthetic data.\n", + "\n", + "- Use the `preview` method to generate 10 records for inspection.\n", + "\n", + "- Setting `verbose_logging=True` prints logs within each task of the generation process.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "preview = ndd.preview(config_builder, verbose_logging=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# The preview dataset is available as a pandas DataFrame.\n", + "preview.dataset" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Run this cell multiple times to cycle through the 10 preview records.\n", + "preview.display_sample_record()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 🧬 Generate your dataset\n", + "\n", + "- Once you are happy with the preview, scale up to a larger dataset.\n", + "\n", + "- The `create` method will submit your generation job to the microservice and return a results object.\n", + "\n", + "- If you want to pause and wait for the job to complete, set `wait_until_done=True`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "results = ndd.create(config_builder, num_records=20, wait_until_done=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# load the dataset into a 
pandas DataFrame\n", + "dataset = results.load_dataset()\n", + "\n", + "dataset.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 🔎 View the evaluation report\n", + "\n", + "- The evaluation report is generated in HTML format and can be viewed in a browser.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import webbrowser\n", + "from pathlib import Path\n", + "\n", + "eval_report_path = Path(\n", + "    \"./2-structured-outputs-and-jinja-expressions-eval-report.html\"\n", + ").resolve()\n", + "\n", + "results.download_evaluation_report(eval_report_path)\n", + "\n", + "webbrowser.open_new_tab(f\"file:///{eval_report_path}\");" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/nemo/NeMo-Data-Designer/intro-tutorials/3-seeding-with-a-dataset.ipynb b/nemo/NeMo-Data-Designer/intro-tutorials/3-seeding-with-a-dataset.ipynb new file mode 100644 index 000000000..184801ad8 --- /dev/null +++ b/nemo/NeMo-Data-Designer/intro-tutorials/3-seeding-with-a-dataset.ipynb @@ -0,0 +1,385 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 🎨 NeMo Data Designer 101: Seeding synthetic data generation with an external dataset\n", + "\n", + "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n", + ">\n", + "> **Note**: To run this notebook, you must have the NeMo Data Designer microservice deployed locally via Docker Compose. 
See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.\n", + "\n", + "
\n", + "\n", + "In this notebook, we will demonstrate how to seed synthetic data generation in Data Designer with an external dataset.\n", + "\n", + "If this is your first time using Data Designer, we recommend starting with the [first notebook](./1-the-basics.ipynb) in this 101 series.\n", + "\n", + "#### 💾 Install dependencies\n", + "\n", + "**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If the installation worked, you should be able to make the following imports:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from getpass import getpass\n", + "\n", + "from nemo_microservices import NeMoMicroservices\n", + "from nemo_microservices.beta.data_designer import (\n", + " DataDesignerConfigBuilder,\n", + " DataDesignerClient,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### ⚙️ Initialize the NeMo Data Designer (NDD) Client\n", + "\n", + "- The NDD client is responsible for submitting generation requests to the Data Designer microservice.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ndd = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8000\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 🏗️ Initialize the Data Designer Config Builder\n", + "\n", + "- The Data Designer config defines the dataset schema and generation process.\n", + "\n", + "- The config builder provides an intuitive interface for building this configuration.\n", + "\n", + "- You must provide a list of model configs to the builder at initialization.\n", + "\n", + "- This list contains the models you can choose from (via the 
`model_alias` argument) during the generation process.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# build.nvidia.com model endpoint\n", + "endpoint = \"https://integrate.api.nvidia.com/v1\"\n", + "model_id = \"mistralai/mistral-small-24b-instruct\"\n", + "\n", + "model_alias = \"mistral-small\"\n", + "\n", + "# You will need to enter your model provider API key to run this notebook.\n", + "api_key = getpass(\"Enter model provider API key: \")\n", + "\n", + "if len(api_key) > 0:\n", + " print(\"✅ API key received.\")\n", + "else:\n", + " print(\"❌ No API key provided. Please enter your model provider API key.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# You can also load the model configs from a YAML string or file.\n", + "\n", + "model_configs_yaml = f\"\"\"\\\n", + "model_configs:\n", + " - alias: \"{model_alias}\"\n", + " inference_parameters:\n", + " max_tokens: 1024\n", + " temperature: 0.5\n", + " top_p: 1.0\n", + " model:\n", + " api_endpoint:\n", + " api_key: \"{api_key}\"\n", + " model_id: \"{model_id}\"\n", + " url: \"{endpoint}\"\n", + "\"\"\"\n", + "\n", + "config_builder = DataDesignerConfigBuilder(model_configs=model_configs_yaml)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 🏥 Download a seed dataset\n", + "\n", + "- For this notebook, we'll change gears and create a synthetic dataset of patient notes.\n", + "\n", + "- To steer the generation process, we will use an open-source [symptom-to-diagnosis dataset](https://huggingface.co/datasets/gretelai/symptom_to_diagnosis).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from datasets import load_dataset\n", + "\n", + "df_seed = load_dataset(\"gretelai/symptom_to_diagnosis\")[\"train\"].to_pandas()\n", + "\n", + "# Rename the columns to something more 
descriptive.\n", + "df_seed = df_seed.rename(\n", + "    columns={\"output_text\": \"diagnosis\", \"input_text\": \"patient_summary\"}\n", + ")\n", + "\n", + "print(f\"Number of records: {len(df_seed)}\")\n", + "\n", + "# Save the file so we can upload it to the microservice.\n", + "df_seed.to_csv(\"symptom_to_diagnosis.csv\", index=False)\n", + "\n", + "df_seed.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 🎨 Designing our synthetic patient notes dataset\n", + "\n", + "- We set the seed dataset using the `with_seed_dataset` method.\n", + "\n", + "- We use the `shuffle` sampling strategy, which shuffles the seed dataset before sampling.\n", + "\n", + "- We set `with_replacement=False`, which caps the dataset at a maximum of 853 records, the number of records in the seed dataset.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# The repo_id and filename arguments follow the Hugging Face Hub API format.\n", + "# Passing the dataset_path argument signals that we need to upload the dataset\n", + "# to the datastore. 
Note we need to pass in the datastore's endpoint, which\n", + "# must match the endpoint in the docker-compose file.\n", + "config_builder.with_seed_dataset(\n", + "    repo_id=\"intro-tutorials/seeding-with-a-dataset\",\n", + "    filename=\"symptom_to_diagnosis.csv\",\n", + "    dataset_path=\"./symptom_to_diagnosis.csv\",\n", + "    sampling_strategy=\"shuffle\",\n", + "    with_replacement=False,\n", + "    datastore={\"endpoint\": \"http://localhost:3000/v1/hf\"},\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Since we often just want a few attributes from Person objects, we can use\n", + "# Data Designer's `with_person_samplers` method to create multiple person samplers\n", + "# at once and drop the person object columns from the final dataset.\n", + "\n", + "# Empty dictionaries mean the person samplers use their default settings.\n", + "config_builder.with_person_samplers({\"patient_sampler\": {}, \"doctor_sampler\": {}})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Here we demonstrate how you can add a column by calling `add_column` with the\n", + "# column name, column type, and any parameters for that column type. This is in\n", + "# contrast to using the column and parameter type objects, via `C` and `P`, as we\n", + "# did in the previous notebooks. 
Generally, we recommend using the concrete column\n", + "# and parameter type objects, but this is a convenient shorthand when you are\n", + "# familiar with the required arguments for each type.\n", + "\n", + "config_builder.add_column(\n", + "    name=\"patient_id\",\n", + "    type=\"uuid\",\n", + "    params={\"prefix\": \"PT-\", \"short_form\": True, \"uppercase\": True},\n", + ")\n", + "\n", + "config_builder.add_column(\n", + "    name=\"first_name\",\n", + "    type=\"expression\",\n", + "    expr=\"{{ patient_sampler.first_name }}\",\n", + ")\n", + "\n", + "config_builder.add_column(\n", + "    name=\"last_name\",\n", + "    type=\"expression\",\n", + "    expr=\"{{ patient_sampler.last_name }}\",\n", + ")\n", + "\n", + "config_builder.add_column(\n", + "    name=\"dob\", type=\"expression\", expr=\"{{ patient_sampler.birth_date }}\"\n", + ")\n", + "\n", + "config_builder.add_column(\n", + "    name=\"patient_email\",\n", + "    type=\"expression\",\n", + "    expr=\"{{ patient_sampler.email_address }}\",\n", + ")\n", + "\n", + "config_builder.add_column(\n", + "    name=\"symptom_onset_date\",\n", + "    type=\"datetime\",\n", + "    params={\"start\": \"2024-01-01\", \"end\": \"2024-12-31\"},\n", + ")\n", + "\n", + "config_builder.add_column(\n", + "    name=\"date_of_visit\",\n", + "    type=\"timedelta\",\n", + "    params={\"dt_min\": 1, \"dt_max\": 30, \"reference_column_name\": \"symptom_onset_date\"},\n", + ")\n", + "\n", + "config_builder.add_column(\n", + "    name=\"physician\",\n", + "    type=\"expression\",\n", + "    expr=\"Dr. 
{{ doctor_sampler.last_name }}\",\n", + ")\n", + "\n", + "# Note we have access to the seed data fields.\n", + "config_builder.add_column(\n", + " name=\"physician_notes\",\n", + " prompt=\"\"\"\\\n", + "You are a primary-care physician who just had an appointment with {{ first_name }} {{ last_name }},\n", + "who has been struggling with symptoms from {{ diagnosis }} since {{ symptom_onset_date }}.\n", + "The date of today's visit is {{ date_of_visit }}.\n", + "\n", + "{{ patient_summary }}\n", + "\n", + "Write careful notes about your visit with {{ first_name }},\n", + "as Dr. {{ doctor_sampler.first_name }} {{ doctor_sampler.last_name }}.\n", + "\n", + "Format the notes as a busy doctor might.\n", + "\"\"\",\n", + " model_alias=model_alias,\n", + ")\n", + "\n", + "config_builder.validate()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 👀 Preview the dataset\n", + "\n", + "- Iteration is key to generating high-quality synthetic data.\n", + "\n", + "- Use the `preview` method to generate 10 records for inspection.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "preview = ndd.preview(config_builder, verbose_logging=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# The preview dataset is available as a pandas DataFrame.\n", + "preview.dataset" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Run this cell multiple times to cycle through the 10 preview records.\n", + "preview.display_sample_record()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 🧬 Generate your dataset\n", + "\n", + "- Once you are happy with the preview, scale up to a larger dataset.\n", + "\n", + "- The `create` method will submit your generation job to the microservice and return a results object.\n", + "\n", + "- If you want 
to wait for the job to complete, set `wait_until_done=True`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "results = ndd.create(config_builder, num_records=20, wait_until_done=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# load the dataset into a pandas DataFrame\n", + "dataset = results.load_dataset()\n", + "\n", + "dataset.head()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/nemo/NeMo-Data-Designer/pyproject.toml b/nemo/NeMo-Data-Designer/pyproject.toml new file mode 100644 index 000000000..a57b0f762 --- /dev/null +++ b/nemo/NeMo-Data-Designer/pyproject.toml @@ -0,0 +1,13 @@ +[project] +name = "nemo-data-designer" +version = "0.0.1" +description = "NeMo Data Designer tutorial notebooks" +readme = "README.md" +requires-python = ">=3.9" + +dependencies = [ + "datasets", + "jupyter", + "pydantic", + "nemo-microservices[data-designer]", +]