diff --git a/nemo/NeMo-Data-Designer/README.md b/nemo/NeMo-Data-Designer/README.md
new file mode 100644
index 000000000..73c00d2fc
--- /dev/null
+++ b/nemo/NeMo-Data-Designer/README.md
@@ -0,0 +1,25 @@
+# 🎨 NeMo Data Designer Tutorial Notebooks
+
+This directory contains the tutorial notebooks for getting started with NeMo Data Designer.
+
+## 🐳 Deploy the NeMo Data Designer microservice locally
+
+To run these notebooks, you must have the NeMo Data Designer microservice deployed locally via Docker Compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.
+
+## 📦 Set up the environment
+
+We will use the `uv` package manager to set up our environment and install the necessary dependencies. If you don't have `uv` installed, you can follow the installation instructions from the [uv documentation](https://docs.astral.sh/uv/getting-started/installation/).
+
+Once you have `uv` installed, be sure you are in the `NeMo-Data-Designer` directory and run the following command:
+
+```bash
+uv sync
+```
+
+This will create a virtual environment and install the necessary dependencies. Activate the virtual environment by running the following command:
+
+```bash
+source .venv/bin/activate
+```
+
+Be sure to select this virtual environment as your kernel when running the notebooks.
diff --git a/nemo/NeMo-Data-Designer/intro-tutorials/1-the-basics.ipynb b/nemo/NeMo-Data-Designer/intro-tutorials/1-the-basics.ipynb
new file mode 100644
index 000000000..d5275403d
--- /dev/null
+++ b/nemo/NeMo-Data-Designer/intro-tutorials/1-the-basics.ipynb
@@ -0,0 +1,479 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# 🎨 NeMo Data Designer 101: The Basics\n",
+ "\n",
 ⚠️ **Warn">
+ "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n",
+ ">\n",
+ "> **Note**: To run this notebook, you must have the NeMo Data Designer microservice deployed locally via Docker Compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.\n",
+ "\n",
+ "
\n",
+ "\n",
+ "In this notebook, we will demonstrate the basics of Data Designer by generating a simple product review dataset.\n",
+ "\n",
+ "#### 💾 Install dependencies\n",
+ "\n",
+ "**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If the installation worked, you should be able to make the following imports:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from getpass import getpass\n",
+ "\n",
+ "from nemo_microservices import NeMoMicroservices\n",
+ "from nemo_microservices.beta.data_designer import (\n",
+ " DataDesignerConfigBuilder,\n",
+ " DataDesignerClient,\n",
+ ")\n",
+ "from nemo_microservices.beta.data_designer.config import columns as C\n",
+ "from nemo_microservices.beta.data_designer.config import params as P"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### ⚙️ Initialize the NeMo Data Designer (NDD) Client\n",
+ "\n",
+ "- The NDD client is responsible for submitting generation requests to the Data Designer microservice.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "ndd = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8000\"))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 🏗️ Initialize the Data Designer Config Builder\n",
+ "\n",
+ "- The Data Designer config defines the dataset schema and generation process.\n",
+ "\n",
+ "- The config builder provides an intuitive interface for building this configuration.\n",
+ "\n",
+ "- You must provide a list of model configs to the builder at initialization.\n",
+ "\n",
+ "- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# build.nvidia.com model endpoint\n",
+ "endpoint = \"https://integrate.api.nvidia.com/v1\"\n",
+ "model_id = \"mistralai/mistral-small-24b-instruct\"\n",
+ "\n",
+ "model_alias = \"mistral-small\"\n",
+ "\n",
+ "# You will need to enter your model provider API key to run this notebook.\n",
+ "api_key = getpass(\"Enter model provider API key: \")\n",
+ "\n",
+ "if len(api_key) > 0:\n",
+ " print(\"✅ API key received.\")\n",
+ "else:\n",
+ " print(\"❌ No API key provided. Please enter your model provider API key.\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "model_configs = [\n",
+ " P.ModelConfig(\n",
+ " alias=model_alias,\n",
+ " inference_parameters=P.InferenceParameters(\n",
+ " max_tokens=1024,\n",
+ " temperature=0.5,\n",
+ " top_p=1.0,\n",
+ " ),\n",
+ " model=P.Model(\n",
+ " api_endpoint=P.ApiEndpoint(\n",
+ " api_key=api_key,\n",
+ " model_id=model_id,\n",
+ " url=endpoint,\n",
+ " ),\n",
+ " ),\n",
+ " )\n",
+ "]\n",
+ "\n",
+ "config_builder = DataDesignerConfigBuilder(model_configs=model_configs)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 🎲 Getting started with sampler columns\n",
+ "\n",
+ "- Sampler columns offer non-LLM based generation of synthetic data.\n",
+ "\n",
+ "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below.\n",
+ "\n",
+ "
\n",
+ "\n",
+ "Let's start designing our product review dataset by adding product category and subcategory columns.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "config_builder.add_column(\n",
+ " C.SamplerColumn(\n",
+ " name=\"product_category\",\n",
+ " type=P.SamplerType.CATEGORY,\n",
+ " params=P.CategorySamplerParams(\n",
+ " values=[\n",
+ " \"Electronics\",\n",
+ " \"Clothing\",\n",
+ " \"Home & Kitchen\",\n",
+ " \"Books\",\n",
+ " \"Home Office\",\n",
+ " ],\n",
+ " ),\n",
+ " )\n",
+ ")\n",
+ "\n",
+ "config_builder.add_column(\n",
+ " C.SamplerColumn(\n",
+ " name=\"product_subcategory\",\n",
+ " type=P.SamplerType.SUBCATEGORY,\n",
+ " params=P.SubcategorySamplerParams(\n",
+ " category=\"product_category\",\n",
+ " values={\n",
+ " \"Electronics\": [\n",
+ " \"Smartphones\",\n",
+ " \"Laptops\",\n",
+ " \"Headphones\",\n",
+ " \"Cameras\",\n",
+ " \"Accessories\",\n",
+ " ],\n",
+ " \"Clothing\": [\n",
+ " \"Men's Clothing\",\n",
+ " \"Women's Clothing\",\n",
+ " \"Winter Coats\",\n",
+ " \"Activewear\",\n",
+ " \"Accessories\",\n",
+ " ],\n",
+ " \"Home & Kitchen\": [\n",
+ " \"Appliances\",\n",
+ " \"Cookware\",\n",
+ " \"Furniture\",\n",
+ " \"Decor\",\n",
+ " \"Organization\",\n",
+ " ],\n",
+ " \"Books\": [\n",
+ " \"Fiction\",\n",
+ " \"Non-Fiction\",\n",
+ " \"Self-Help\",\n",
+ " \"Textbooks\",\n",
+ " \"Classics\",\n",
+ " ],\n",
+ " \"Home Office\": [\n",
+ " \"Desks\",\n",
+ " \"Chairs\",\n",
+ " \"Storage\",\n",
+ " \"Office Supplies\",\n",
+ " \"Lighting\",\n",
+ " ],\n",
+ " },\n",
+ " ),\n",
+ " )\n",
+ ")\n",
+ "\n",
+ "config_builder.add_column(\n",
+ " C.SamplerColumn(\n",
+ " name=\"target_age_range\",\n",
+ " type=P.SamplerType.CATEGORY,\n",
+ " params=P.CategorySamplerParams(\n",
+ " values=[\"18-25\", \"25-35\", \"35-50\", \"50-65\", \"65+\"]\n",
+ " ),\n",
+ " )\n",
+ ")\n",
+ "\n",
+ "# Optionally validate that the columns are configured correctly.\n",
+ "config_builder.validate()\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Next, let's add samplers to generate data related to the customer and their review.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# This column will sample synthetic person data based on statistics from the US Census.\n",
+ "config_builder.add_column(\n",
+ " C.SamplerColumn(\n",
+ " name=\"customer\",\n",
+ " type=P.SamplerType.PERSON,\n",
+ " params=P.PersonSamplerParams(age_range=[18, 70]),\n",
+ " )\n",
+ ")\n",
+ "\n",
+ "config_builder.add_column(\n",
+ " C.SamplerColumn(\n",
+ " name=\"number_of_stars\",\n",
+ " type=P.SamplerType.UNIFORM,\n",
+ " params=P.UniformSamplerParams(low=1, high=5),\n",
+ " convert_to=\"int\",\n",
+ " )\n",
+ ")\n",
+ "\n",
+ "config_builder.add_column(\n",
+ " C.SamplerColumn(\n",
+ " name=\"review_style\",\n",
+ " type=P.SamplerType.CATEGORY,\n",
+ " params=P.CategorySamplerParams(\n",
+ " values=[\"rambling\", \"brief\", \"detailed\", \"structured with bullet points\"],\n",
+ " weights=[1, 2, 2, 1],\n",
+ " ),\n",
+ " )\n",
+ ")\n",
+ "\n",
+ "config_builder.validate()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 🦜 LLM-generated columns\n",
+ "\n",
+ "- The real power of Data Designer comes from leveraging LLMs to generate text, code, and structured data.\n",
+ "\n",
+ "- For our product review dataset, we will use LLM-generated text columns to generate product names and customer reviews.\n",
+ "\n",
+ "- When prompting the LLM, we can use Jinja templating to reference other columns in the dataset.\n",
+ "\n",
+ "- As we see below, nested json columns can be accessed using dot notation.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "config_builder.add_column(\n",
+ " C.LLMTextColumn(\n",
+ " name=\"product_name\",\n",
+ " prompt=(\n",
+ " \"Come up with a creative product name for a product in the '{{ product_category }}' category, focusing \"\n",
+ " \"on products related to '{{ product_subcategory }}'. The target age range of the ideal customer is \"\n",
+ " \"{{ target_age_range }} years old. Respond with only the product name, no other text.\"\n",
+ " ),\n",
+ " # This is optional, but it can be useful for controlling the behavior of the LLM. Do not include instructions\n",
+ " # related to output formatting in the system prompt, as Data Designer handles this based on the column type.\n",
+ " system_prompt=(\n",
+ " \"You are a helpful assistant that generates product names. You respond with only the product name, \"\n",
+ " \"no other text. You do NOT add quotes around the product name. \"\n",
+ " ),\n",
+ " model_alias=model_alias,\n",
+ " )\n",
+ ")\n",
+ "\n",
+ "config_builder.add_column(\n",
+ " C.LLMTextColumn(\n",
+ " name=\"customer_review\",\n",
+ " prompt=(\n",
+ " \"You are a customer named {{ customer.first_name }} from {{ customer.city }}, {{ customer.state }}. \"\n",
+ " \"You are {{ customer.age }} years old and recently purchased a product called {{ product_name }}. \"\n",
+ " \"Write a review of this product, which you gave a rating of {{ number_of_stars }} stars. \"\n",
+ " \"The style of the review should be '{{ review_style }}'. \"\n",
+ " ),\n",
+ " model_alias=model_alias,\n",
+ " )\n",
+ ")\n",
+ "\n",
+ "config_builder.validate()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 👀 Preview the dataset\n",
+ "\n",
+ "- Iteration is key to generating high-quality synthetic data.\n",
+ "\n",
+ "- Use the `preview` method to generate 10 records for inspection.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "preview = ndd.preview(config_builder, verbose_logging=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# The preview dataset is available as a pandas DataFrame.\n",
+ "preview.dataset"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Run this cell multiple times to cycle through the 10 preview records.\n",
+ "preview.display_sample_record()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 🧐 Adding an Evaluation Report\n",
+ "\n",
+ "- Data Designer offers an evaluation report for a quick look at the quality of the generated data.\n",
+ "\n",
+ "- To add a report, which will be generated at the end of a generation job, simply run the `with_evaluation_report` method.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "config_builder.with_evaluation_report()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 🧬 Generate your dataset\n",
+ "\n",
+ "- Once you are happy with the preview, scale up to a larger dataset.\n",
+ "\n",
+ "- The `create` method will submit your generation job to the microservice and return a results object.\n",
+ "\n",
+ "- If you want to wait for the job to complete, set `wait_until_done=True`.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "results = ndd.create(config_builder, num_records=20, wait_until_done=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# load the dataset into a pandas DataFrame\n",
+ "dataset = results.load_dataset()\n",
+ "\n",
+ "dataset.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 🔎 View the evaluation report\n",
+ "\n",
+ "- The evaluation report is generated in HTML format and can be viewed in a browser.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import webbrowser\n",
+ "from pathlib import Path\n",
+ "\n",
+ "eval_report_path = Path(\"./1-the-basics-eval-report.html\").resolve()\n",
+ "\n",
+ "results.download_evaluation_report(eval_report_path)\n",
+ "\n",
+ "webbrowser.open_new_tab(f\"file:///{eval_report_path}\");"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## ⏭️ Next Steps\n",
+ "\n",
+ "Now that you've seen the basics of Data Designer, check out the following notebooks to learn more about:\n",
+ "\n",
+ "- [Structured outputs and jinja expressions](./2-structured-outputs-and-jinja-expressions.ipynb)\n",
+ "\n",
+ "- [Seeding synthetic data generation with an external dataset](./3-seeding-with-a-dataset.ipynb)\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": ".venv",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.9"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
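An aside on the weighted sampler used in the notebook above: the `review_style` column draws categories in proportion to its `weights` list. The behavior can be sketched with the standard library alone — this is illustrative only, not the Data Designer implementation:

```python
import random
from collections import Counter

# Values and weights mirror the review_style sampler column above.
values = ["rambling", "brief", "detailed", "structured with bullet points"]
weights = [1, 2, 2, 1]

random.seed(0)  # fixed seed so the illustration is reproducible
samples = random.choices(values, weights=weights, k=1000)

# "brief" and "detailed" should each appear roughly twice as often
# as the weight-1 categories.
counts = Counter(samples)
print(counts.most_common())
```

Sampler columns serve the same purpose at scale: they inject controlled, non-LLM diversity that the LLM prompts then condition on.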
diff --git a/nemo/NeMo-Data-Designer/intro-tutorials/2-structured-outputs-and-jinja-expressions.ipynb b/nemo/NeMo-Data-Designer/intro-tutorials/2-structured-outputs-and-jinja-expressions.ipynb
new file mode 100644
index 000000000..82fc571c5
--- /dev/null
+++ b/nemo/NeMo-Data-Designer/intro-tutorials/2-structured-outputs-and-jinja-expressions.ipynb
@@ -0,0 +1,448 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# 🎨 NeMo Data Designer 101: Structured Outputs and Jinja Expressions\n",
+ "\n",
 ⚠️ **Warn">
+ "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n",
+ ">\n",
+ "> **Note**: To run this notebook, you must have the NeMo Data Designer microservice deployed locally via Docker Compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.\n",
+ "\n",
+ "
\n",
+ "\n",
+ "In this notebook, we will continue our exploration of Data Designer, demonstrating more advanced data generation using structured outputs and Jinja expressions.\n",
+ "\n",
+ "#### 💾 Install dependencies\n",
+ "\n",
+ "**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If the installation worked, you should be able to make the following imports:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from getpass import getpass\n",
+ "\n",
+ "from nemo_microservices import NeMoMicroservices\n",
+ "from nemo_microservices.beta.data_designer import (\n",
+ " DataDesignerConfigBuilder,\n",
+ " DataDesignerClient,\n",
+ ")\n",
+ "from nemo_microservices.beta.data_designer.config import columns as C\n",
+ "from nemo_microservices.beta.data_designer.config import params as P"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 🧑🎨 Designing our data\n",
+ "\n",
+ "- We will again create a product review dataset, but this time we will use structured outputs and Jinja expressions.\n",
+ "\n",
+ "- Structured outputs let you specify the exact schema of the data you want to generate.\n",
+ "\n",
+ "- Data Designer supports schemas specified using either json schema or Pydantic data models (recommended).\n",
+ "\n",
+ "
\n",
+ "\n",
+ "We'll define our structured outputs using Pydantic data models:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from decimal import Decimal\n",
+ "from typing import Literal\n",
+ "from pydantic import BaseModel, Field\n",
+ "\n",
+ "\n",
+ "# We define a Product schema so that the name, description, and price are generated\n",
+ "# in one go, with the types and constraints specified.\n",
+ "class Product(BaseModel):\n",
+ " name: str = Field(description=\"The name of the product\")\n",
+ " description: str = Field(description=\"A description of the product\")\n",
+ " price: Decimal = Field(\n",
+ " description=\"The price of the product\", ge=10, le=1000, decimal_places=2\n",
+ " )\n",
+ "\n",
+ "\n",
+ "class ProductReview(BaseModel):\n",
+ " rating: int = Field(description=\"The rating of the product\", ge=1, le=5)\n",
+ " customer_mood: Literal[\"irritated\", \"mad\", \"happy\", \"neutral\", \"excited\"] = Field(\n",
+ " description=\"The mood of the customer\"\n",
+ " )\n",
+ " review: str = Field(description=\"A review of the product\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### ⚙️ Initialize the NeMo Data Designer (NDD) Client\n",
+ "\n",
+ "- The NDD client is responsible for submitting generation requests to the Data Designer microservice.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "ndd = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8000\"))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 🏗️ Initialize the Data Designer Config Builder\n",
+ "\n",
+ "- The Data Designer config defines the dataset schema and generation process.\n",
+ "\n",
+ "- The config builder provides an intuitive interface for building this configuration.\n",
+ "\n",
+ "- You must provide a list of model configs to the builder at initialization.\n",
+ "\n",
+ "- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# build.nvidia.com model endpoint\n",
+ "endpoint = \"https://integrate.api.nvidia.com/v1\"\n",
+ "model_id = \"mistralai/mistral-small-24b-instruct\"\n",
+ "\n",
+ "model_alias = \"mistral-small\"\n",
+ "\n",
+ "# You will need to enter your model provider API key to run this notebook.\n",
+ "api_key = getpass(\"Enter model provider API key: \")\n",
+ "\n",
+ "if len(api_key) > 0:\n",
+ " print(\"✅ API key received.\")\n",
+ "else:\n",
+ " print(\"❌ No API key provided. Please enter your model provider API key.\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "model_configs = [\n",
+ " P.ModelConfig(\n",
+ " alias=model_alias,\n",
+ " inference_parameters=P.InferenceParameters(\n",
+ " max_tokens=1024,\n",
+ " temperature=0.5,\n",
+ " top_p=1.0,\n",
+ " ),\n",
+ " model=P.Model(\n",
+ " api_endpoint=P.ApiEndpoint(\n",
+ " api_key=api_key,\n",
+ " model_id=model_id,\n",
+ " url=endpoint,\n",
+ " ),\n",
+ " ),\n",
+ " )\n",
+ "]\n",
+ "\n",
+ "config_builder = DataDesignerConfigBuilder(model_configs=model_configs)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Next, let's design our product review dataset using a few more tricks compared to the previous notebook:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Since we often just want a few attributes from Person objects, we can use\n",
+ "# Data Designer's `with_person_samplers` method to create multiple person samplers\n",
+ "# at once and drop the person object columns from the final dataset.\n",
+ "config_builder.with_person_samplers(\n",
+ " {\"customer\": P.PersonSamplerParams(age_range=[18, 65])}\n",
+ ")\n",
+ "\n",
+ "config_builder.add_column(\n",
+ " C.SamplerColumn(\n",
+ " name=\"product_category\",\n",
+ " type=P.SamplerType.CATEGORY,\n",
+ " params=P.CategorySamplerParams(\n",
+ " values=[\n",
+ " \"Electronics\",\n",
+ " \"Clothing\",\n",
+ " \"Home & Kitchen\",\n",
+ " \"Books\",\n",
+ " \"Home Office\",\n",
+ " ],\n",
+ " ),\n",
+ " )\n",
+ ")\n",
+ "\n",
+ "config_builder.add_column(\n",
+ " C.SamplerColumn(\n",
+ " name=\"product_subcategory\",\n",
+ " type=P.SamplerType.SUBCATEGORY,\n",
+ " params=P.SubcategorySamplerParams(\n",
+ " category=\"product_category\",\n",
+ " values={\n",
+ " \"Electronics\": [\n",
+ " \"Smartphones\",\n",
+ " \"Laptops\",\n",
+ " \"Headphones\",\n",
+ " \"Cameras\",\n",
+ " \"Accessories\",\n",
+ " ],\n",
+ " \"Clothing\": [\n",
+ " \"Men's Clothing\",\n",
+ " \"Women's Clothing\",\n",
+ " \"Winter Coats\",\n",
+ " \"Activewear\",\n",
+ " \"Accessories\",\n",
+ " ],\n",
+ " \"Home & Kitchen\": [\n",
+ " \"Appliances\",\n",
+ " \"Cookware\",\n",
+ " \"Furniture\",\n",
+ " \"Decor\",\n",
+ " \"Organization\",\n",
+ " ],\n",
+ " \"Books\": [\n",
+ " \"Fiction\",\n",
+ " \"Non-Fiction\",\n",
+ " \"Self-Help\",\n",
+ " \"Textbooks\",\n",
+ " \"Classics\",\n",
+ " ],\n",
+ " \"Home Office\": [\n",
+ " \"Desks\",\n",
+ " \"Chairs\",\n",
+ " \"Storage\",\n",
+ " \"Office Supplies\",\n",
+ " \"Lighting\",\n",
+ " ],\n",
+ " },\n",
+ " ),\n",
+ " )\n",
+ ")\n",
+ "\n",
+ "config_builder.add_column(\n",
+ " C.SamplerColumn(\n",
+ " name=\"target_age_range\",\n",
+ " type=P.SamplerType.CATEGORY,\n",
+ " params=P.CategorySamplerParams(\n",
+ " values=[\"18-25\", \"25-35\", \"35-50\", \"50-65\", \"65+\"]\n",
+ " ),\n",
+ " )\n",
+ ")\n",
+ "\n",
+ "config_builder.add_column(\n",
+ " C.SamplerColumn(\n",
+ " name=\"review_style\",\n",
+ " type=P.SamplerType.CATEGORY,\n",
+ " params=P.CategorySamplerParams(\n",
+ " values=[\"rambling\", \"brief\", \"detailed\", \"structured with bullet points\"],\n",
+ " weights=[1, 2, 2, 1],\n",
+ " ),\n",
+ " )\n",
+ ")\n",
+ "\n",
+ "# We can create new columns using Jinja expressions that reference\n",
+ "# existing columns, including attributes of nested objects.\n",
+ "config_builder.add_column(\n",
+ " C.ExpressionColumn(\n",
+ " name=\"customer_name\", expr=\"{{ customer.first_name }} {{ customer.last_name }}\"\n",
+ " )\n",
+ ")\n",
+ "\n",
+ "config_builder.add_column(\n",
+ " C.ExpressionColumn(name=\"customer_age\", expr=\"{{ customer.age }}\")\n",
+ ")\n",
+ "\n",
+ "# Add an `LLMStructuredColumn` column to generate structured outputs.\n",
+ "config_builder.add_column(\n",
+ " C.LLMStructuredColumn(\n",
+ " name=\"product\",\n",
+ " prompt=(\n",
+ " \"Create a product in the '{{ product_category }}' category, focusing on products \"\n",
+ " \"related to '{{ product_subcategory }}'. The target age range of the ideal customer is \"\n",
+ " \"{{ target_age_range }} years old. The product should be priced between $10 and $1000.\"\n",
+ " ),\n",
+ " output_format=Product,\n",
+ " model_alias=model_alias,\n",
+ " )\n",
+ ")\n",
+ "\n",
+ "config_builder.add_column(\n",
+ " C.LLMStructuredColumn(\n",
+ " name=\"customer_review\",\n",
+ " prompt=(\n",
+ " \"Your task is to write a review for the following product:\\n\\n\"\n",
+ " \"Product Name: {{ product.name }}\\n\"\n",
+ " \"Product Description: {{ product.description }}\\n\"\n",
+ " \"Price: {{ product.price }}\\n\\n\"\n",
+ " \"Imagine your name is {{ customer_name }} and you are from {{ customer.city }}, {{ customer.state }}. \"\n",
+ " \"Write the review in a style that is '{{ review_style }}'.\"\n",
+ " ),\n",
+ " output_format=ProductReview,\n",
+ " model_alias=model_alias,\n",
+ " )\n",
+ ")\n",
+ "\n",
+ "# Let's add an evaluation report to our dataset.\n",
+ "config_builder.with_evaluation_report().validate()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 👀 Preview the dataset\n",
+ "\n",
+ "- Iteration is key to generating high-quality synthetic data.\n",
+ "\n",
+ "- Use the `preview` method to generate 10 records for inspection.\n",
+ "\n",
+ "- Setting `verbose_logging=True` prints logs within each task of the generation process.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "preview = ndd.preview(config_builder, verbose_logging=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# The preview dataset is available as a pandas DataFrame.\n",
+ "preview.dataset"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Run this cell multiple times to cycle through the 10 preview records.\n",
+ "preview.display_sample_record()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 🧬 Generate your dataset\n",
+ "\n",
+ "- Once you are happy with the preview, scale up to a larger dataset.\n",
+ "\n",
+ "- The `create` method will submit your generate job to the microservice and return a results object.\n",
+ "\n",
+ "- If you want to pause and wait for the job to complete, set `wait_until_done=True`.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "results = ndd.create(config_builder, num_records=20, wait_until_done=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# load the dataset into a pandas DataFrame\n",
+ "dataset = results.load_dataset()\n",
+ "\n",
+ "dataset.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 🔎 View the evaluation report\n",
+ "\n",
+ "- The evaluation report is generated in HTML format and can be viewed in a browser.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import webbrowser\n",
+ "from pathlib import Path\n",
+ "\n",
+ "eval_report_path = Path(\n",
+ " \"./2-structured-outputs-and-jinja-expressions-eval-report.html\"\n",
+ ").resolve()\n",
+ "\n",
+ "results.download_evaluation_report(eval_report_path)\n",
+ "\n",
+ "webbrowser.open_new_tab(f\"file:///{eval_report_path}\");"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": ".venv",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.9"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
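The `{{ product.name }}` and `{{ customer.city }}` references in the notebook above rely on Jinja's dot notation for nested fields. Data Designer uses full Jinja templating; as a rough stdlib-only illustration of how dotted paths resolve against nested records (a toy sketch, not the actual template engine):

```python
import re

def render(template: str, context: dict) -> str:
    """Resolve {{ dotted.path }} placeholders against nested dicts."""
    def lookup(match: re.Match) -> str:
        value = context
        for part in match.group(1).split("."):
            value = value[part]
        return str(value)
    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)

customer = {"first_name": "Ada", "last_name": "Lovelace", "city": "London"}
greeting = render(
    "{{ customer.first_name }} {{ customer.last_name }} from {{ customer.city }}",
    {"customer": customer},
)
print(greeting)  # Ada Lovelace from London
```

In the real service each generated column value is added to the template context, which is why later columns can reference earlier ones by name.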
diff --git a/nemo/NeMo-Data-Designer/intro-tutorials/3-seeding-with-a-dataset.ipynb b/nemo/NeMo-Data-Designer/intro-tutorials/3-seeding-with-a-dataset.ipynb
new file mode 100644
index 000000000..184801ad8
--- /dev/null
+++ b/nemo/NeMo-Data-Designer/intro-tutorials/3-seeding-with-a-dataset.ipynb
@@ -0,0 +1,385 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# 🎨 NeMo Data Designer 101: Seeding synthetic data generation with an external dataset\n",
+ "\n",
 ⚠️ **Warn">
+ "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n",
+ ">\n",
+ "> **Note**: To run this notebook, you must have the NeMo Data Designer microservice deployed locally via Docker Compose. See the [deployment guide](http://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.\n",
+ "\n",
+ "
\n",
+ "\n",
+ "In this notebook, we will demonstrate how to seed synthetic data generation in Data Designer with an external dataset.\n",
+ "\n",
+ "If this is your first time using Data Designer, we recommend starting with the [first notebook](./1-the-basics.ipynb) in this 101 series.\n",
+ "\n",
+ "#### 💾 Install dependencies\n",
+ "\n",
+ "**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If the installation worked, you should be able to make the following imports:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from getpass import getpass\n",
+ "\n",
+ "from nemo_microservices import NeMoMicroservices\n",
+ "from nemo_microservices.beta.data_designer import (\n",
+ " DataDesignerConfigBuilder,\n",
+ " DataDesignerClient,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### ⚙️ Initialize the NeMo Data Designer (NDD) Client\n",
+ "\n",
+ "- The NDD client is responsible for submitting generation requests to the Data Designer microservice.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "ndd = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8000\"))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 🏗️ Initialize the Data Designer Config Builder\n",
+ "\n",
+ "- The Data Designer config defines the dataset schema and generation process.\n",
+ "\n",
+ "- The config builder provides an intuitive interface for building this configuration.\n",
+ "\n",
+ "- You must provide a list of model configs to the builder at initialization.\n",
+ "\n",
+ "- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# build.nvidia.com model endpoint\n",
+ "endpoint = \"https://integrate.api.nvidia.com/v1\"\n",
+ "model_id = \"mistralai/mistral-small-24b-instruct\"\n",
+ "\n",
+ "model_alias = \"mistral-small\"\n",
+ "\n",
+ "# You will need to enter your model provider API key to run this notebook.\n",
+ "api_key = getpass(\"Enter model provider API key: \")\n",
+ "\n",
+ "if len(api_key) > 0:\n",
+ " print(\"✅ API key received.\")\n",
+ "else:\n",
+ " print(\"❌ No API key provided. Please enter your model provider API key.\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# You can also load the model configs from a YAML string or file.\n",
+ "\n",
+ "model_configs_yaml = f\"\"\"\\\n",
+ "model_configs:\n",
+ " - alias: \"{model_alias}\"\n",
+ " inference_parameters:\n",
+ " max_tokens: 1024\n",
+ " temperature: 0.5\n",
+ " top_p: 1.0\n",
+ " model:\n",
+ " api_endpoint:\n",
+ " api_key: \"{api_key}\"\n",
+ " model_id: \"{model_id}\"\n",
+ " url: \"{endpoint}\"\n",
+ "\"\"\"\n",
+ "\n",
+ "config_builder = DataDesignerConfigBuilder(model_configs=model_configs_yaml)"
+ ]
+ },
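+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# (Optional) As noted above, model configs can also be loaded from a YAML file.\n",
+ "# This is a minimal sketch that assumes the builder accepts a file path.\n",
+ "# Caution: the rendered YAML contains your API key, so don't commit this file.\n",
+ "from pathlib import Path\n",
+ "\n",
+ "Path(\"model_configs.yaml\").write_text(model_configs_yaml)\n",
+ "\n",
+ "# config_builder = DataDesignerConfigBuilder(model_configs=\"model_configs.yaml\")"
+ ]
+ },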
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 🏥 Download a seed dataset\n",
+ "\n",
+ "- For this notebook, we'll change gears and create a synthetic dataset of patient notes.\n",
+ "\n",
+ "- To steer the generation process, we will use an open-source [symptom-to-diagnosis dataset](https://huggingface.co/datasets/gretelai/symptom_to_diagnosis).\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from datasets import load_dataset\n",
+ "\n",
+ "df_seed = load_dataset(\"gretelai/symptom_to_diagnosis\")[\"train\"].to_pandas()\n",
+ "\n",
+ "# Rename the columns to something more descriptive.\n",
+ "df_seed = df_seed.rename(\n",
+ " columns={\"output_text\": \"diagnosis\", \"input_text\": \"patient_summary\"}\n",
+ ")\n",
+ "\n",
+ "print(f\"Number of records: {len(df_seed)}\")\n",
+ "\n",
+ "# Save the file so we can upload it to the microservice.\n",
+ "df_seed.to_csv(\"symptom_to_diagnosis.csv\", index=False)\n",
+ "\n",
+ "df_seed.head()"
+ ]
+ },
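+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# (Optional) Quick sanity check on the seed data before designing around it.\n",
+ "# This is plain pandas, nothing Data Designer specific.\n",
+ "print(f\"Unique diagnoses: {df_seed['diagnosis'].nunique()}\")\n",
+ "\n",
+ "df_seed[\"diagnosis\"].value_counts().head(10)"
+ ]
+ },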
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 🎨 Designing our synthetic patient notes dataset\n",
+ "\n",
+ "- We set the seed dataset using the `with_seed_dataset` method.\n",
+ "\n",
+ "- We use the `shuffle` sampling strategy, which shuffles the seed dataset before sampling.\n",
+ "\n",
+ "- We set `with_replacement=False`, which limits our max number of records to 853, which is the number of records in the seed dataset.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# The repo_id and filename arguments follow the Hugging Face Hub API format.\n",
+ "# Passing the dataset_path argument signals that we need to upload the dataset\n",
+ "# to the datastore. Note we need to pass in the datastore's endpoint, which\n",
+ "# must match the endpoint in the docker-compose file.\n",
+ "config_builder.with_seed_dataset(\n",
+ " repo_id=\"into-tutorials/seeding-with-a-dataset\",\n",
+ " filename=\"symptom_to_diagnosis.csv\",\n",
+ " dataset_path=\"./symptom_to_diagnosis.csv\",\n",
+ " sampling_strategy=\"shuffle\",\n",
+ " with_replacement=False,\n",
+ " datastore={\"endpoint\": \"http://localhost:3000/v1/hf\"},\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Since we often just want a few attributes from Person objects, we can use\n",
+ "# Data Designer's `with_person_samplers` method to create multiple person samplers\n",
+ "# at once and drop the person object columns from the final dataset.\n",
+ "\n",
+ "# Empty dictionaries mean use default settings for the person samplers.\n",
+ "config_builder.with_person_samplers({\"patient_sampler\": {}, \"doctor_sampler\": {}})"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Here we demonstrate how you can add a column by calling `add_column` with the\n",
+ "# column name, column type, and any parameters for that column type. This is in\n",
+ "# contrast to using the column and parameter type objects, via `C` and `P`, as we\n",
+ "# did in the previous notebooks. Generally, we recommend using the concrete column\n",
+ "# and parameter type objects, but this is a convenient shorthand when you are\n",
+ "# familiar with the required arguments for each type.\n",
+ "\n",
+ "config_builder.add_column(\n",
+ " name=\"patient_id\",\n",
+ " type=\"uuid\",\n",
+ " params={\"prefix\": \"PT-\", \"short_form\": True, \"uppercase\": True},\n",
+ ")\n",
+ "\n",
+ "config_builder.add_column(\n",
+ " name=\"first_name\",\n",
+ " type=\"expression\",\n",
+ " expr=\"{{ patient_sampler.first_name}} \",\n",
+ ")\n",
+ "\n",
+ "config_builder.add_column(\n",
+ " name=\"last_name\",\n",
+ " type=\"expression\",\n",
+ " expr=\"{{ patient_sampler.last_name }}\",\n",
+ ")\n",
+ "\n",
+ "\n",
+ "config_builder.add_column(\n",
+ " name=\"dob\", type=\"expression\", expr=\"{{ patient_sampler.birth_date }}\"\n",
+ ")\n",
+ "\n",
+ "\n",
+ "config_builder.add_column(\n",
+ " name=\"patient_email\",\n",
+ " type=\"expression\",\n",
+ " expr=\"{{ patient_sampler.email_address }}\",\n",
+ ")\n",
+ "\n",
+ "\n",
+ "config_builder.add_column(\n",
+ " name=\"symptom_onset_date\",\n",
+ " type=\"datetime\",\n",
+ " params={\"start\": \"2024-01-01\", \"end\": \"2024-12-31\"},\n",
+ ")\n",
+ "\n",
+ "config_builder.add_column(\n",
+ " name=\"date_of_visit\",\n",
+ " type=\"timedelta\",\n",
+ " params={\"dt_min\": 1, \"dt_max\": 30, \"reference_column_name\": \"symptom_onset_date\"},\n",
+ ")\n",
+ "\n",
+ "config_builder.add_column(\n",
+ " name=\"physician\",\n",
+ " type=\"expression\",\n",
+ " expr=\"Dr. {{ doctor_sampler.last_name }}\",\n",
+ ")\n",
+ "\n",
+ "# Note we have access to the seed data fields.\n",
+ "config_builder.add_column(\n",
+ " name=\"physician_notes\",\n",
+ " prompt=\"\"\"\\\n",
+ "You are a primary-care physician who just had an appointment with {{ first_name }} {{ last_name }},\n",
+ "who has been struggling with symptoms from {{ diagnosis }} since {{ symptom_onset_date }}.\n",
+ "The date of today's visit is {{ date_of_visit }}.\n",
+ "\n",
+ "{{ patient_summary }}\n",
+ "\n",
+ "Write careful notes about your visit with {{ first_name }},\n",
+ "as Dr. {{ doctor_sampler.first_name }} {{ doctor_sampler.last_name }}.\n",
+ "\n",
+ "Format the notes as a busy doctor might.\n",
+ "\"\"\",\n",
+ " model_alias=model_alias,\n",
+ ")\n",
+ "\n",
+ "config_builder.validate()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 👀 Preview the dataset\n",
+ "\n",
+ "- Iteration is key to generating high-quality synthetic data.\n",
+ "\n",
+ "- Use the `preview` method to generate 10 records for inspection.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "preview = ndd.preview(config_builder, verbose_logging=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# The preview dataset is available as a pandas DataFrame.\n",
+ "preview.dataset"
+ ]
+ },
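+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# (Optional) A few quick checks on the preview before scaling up. The column\n",
+ "# names match the config above; adjust them if you change the design.\n",
+ "df_preview = preview.dataset\n",
+ "\n",
+ "print(f\"Unique patient IDs: {df_preview['patient_id'].nunique()} of {len(df_preview)}\")\n",
+ "\n",
+ "# Length distribution of the generated notes.\n",
+ "df_preview[\"physician_notes\"].str.len().describe()"
+ ]
+ },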
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Run this cell multiple times to cycle through the 10 preview records.\n",
+ "preview.display_sample_record()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 🧬 Generate your dataset\n",
+ "\n",
+ "- Once you are happy with the preview, scale up to a larger dataset.\n",
+ "\n",
+ "- The `create` method will submit your generation job to the microservice and return a results object.\n",
+ "\n",
+ "- If you want to wait for the job to complete, set `wait_until_done=True`.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "results = ndd.create(config_builder, num_records=20, wait_until_done=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# load the dataset into a pandas DataFrame\n",
+ "dataset = results.load_dataset()\n",
+ "\n",
+ "dataset.head()"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": ".venv",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.9"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/nemo/NeMo-Data-Designer/pyproject.toml b/nemo/NeMo-Data-Designer/pyproject.toml
new file mode 100644
index 000000000..a57b0f762
--- /dev/null
+++ b/nemo/NeMo-Data-Designer/pyproject.toml
@@ -0,0 +1,13 @@
+[project]
+name = "nemo-data-designer"
+version = "0.0.1"
+description = "NeMo Data Designer tutorial notebooks"
+readme = "README.md"
+requires-python = ">=3.9"
+
+dependencies = [
+ "datasets",
+ "jupyter",
+ "pydantic",
+ "nemo-microservices[data-designer]",
+]