Releases: NVIDIA-NeMo/DataDesigner

v0.5.1 2026-02-20

20 Feb 21:06
8f7a720

Data Designer now supports image generation!

What's Changed

  • docs: Updated url by @kirit93 in #325
  • docs: deep research trajectories with NDD and MCP tool use by @eric-tramel in #326
  • refactor: callback-based processor design by @andreatgretel in #294
  • feat: add image generation support with multi-modal context by @nabinchha in #317
  • docs: add image generation documentation and image-to-image editing tutorial by @nabinchha in #319
  • chore: move ArtifactStorage to engine/storage/ module by @nabinchha in #321
  • chore: gitignore Cerebro knowledge base files by @johnnygreco in #328
  • feat(engine): env-var switch for async-first models experiment by @eric-tramel in #280
  • docs: Moved nav to left hand side by @kirit93 in #331
  • feat: add --save-results option to preview command by @johnnygreco in #333
  • chore: Improve CLI startup with lazy heavy import cleanup by @johnnygreco in #330
  • feat: add allow_resize for 1:N and N:1 generation patterns by @andreatgretel in #286
  • chore: address Andre's feedback on --save-results and CLI preview by @johnnygreco in #335
  • chore: remove example_allow_resize.py from repo root by @andreatgretel in #337
  • fix: make DropColumnsProcessorConfig idempotent and support reasoning columns by @andreatgretel in #334
  • feat: add push_to_hub_from_folder classmethod for uploading saved datasets by @nabinchha in #340
  • fix: handle bool, int, float in convert_to_row_element by @dhruvnathawani in #336
  • feat: auto-detect ImageContext format for image-to-image generation by @nabinchha in #342

Full Changelog: v0.5.0...v0.5.1

v0.5.0 2026-02-11

11 Feb 22:22
631f1f9

🎨 NeMo Data Designer – v0.5.0 Release Notes

⚡ Highlights

  • 🛠️ MCP Tool Calling: LLM columns can now call external tools during generation via MCP!

  • ⚛️ Functions as custom column generators: The @custom_column_generator decorator lets users write their own column generation logic and plug it directly into a pipeline.

  • 🤗 Hugging Face Hub integration: You can now publish generated datasets directly to the Hugging Face Hub with auto-generated dataset cards: results.push_to_hub().

  • 💻 CLI generation commands: You can generate data from the CLI using the new preview, create, and validate commands.

  • 🔍 LLM Observability: Use the new with_trace option on LLM configs to return TraceType.ALL_MESSAGES or TraceType.LAST_MESSAGE traces. You can also selectively extract reasoning content using extract_reasoning_content=True.

⚠️ Breaking Changes

  • with_trace is now a TraceType enum (NONE (default), LAST_MESSAGE, ALL_MESSAGES) instead of a boolean.

  • SingleColumnConfig is now isolated in its own base module, data_designer.config.base, to protect against circular imports during plugin discovery.
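
For callers moving off the boolean form of with_trace, the mapping can be sketched roughly as follows. This is an illustration only, not the library's code: the TraceType enum here is a local stand-in that mirrors the member names listed above (its values are placeholders), migrate_with_trace is a hypothetical helper, and mapping True to ALL_MESSAGES is an assumption based on the v0.4.0 behavior of capturing the full message history.

```python
from enum import Enum
from typing import Union


class TraceType(Enum):
    """Local stand-in mirroring the member names in the release notes.

    The string values are placeholders, not the library's actual values.
    """

    NONE = "none"
    LAST_MESSAGE = "last_message"
    ALL_MESSAGES = "all_messages"


def migrate_with_trace(value: Union[bool, TraceType]) -> TraceType:
    """Map a legacy boolean with_trace value onto the enum (hypothetical helper)."""
    if isinstance(value, TraceType):
        # Already migrated: pass through unchanged.
        return value
    if isinstance(value, bool):
        # Assumption: True previously meant "capture the full history".
        return TraceType.ALL_MESSAGES if value else TraceType.NONE
    raise TypeError(f"Unsupported with_trace value: {value!r}")
```

A helper like this could be applied once while updating existing configs, after which the enum members can be passed directly.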

What's Changed

  • feat: MCP (Model Context Protocol) tool calling integration for LLM columns by @eric-tramel in #248
  • fix: normalize license header year format in mcp module by @johnnygreco in #279
  • chore: configure independent pytest settings per subpackage by @johnnygreco in #278
  • fix: normalize trace content blocks to prevent parquet write crashes by @eric-tramel in #283
  • feat: Add TraceType enum for granular trace control by @eric-tramel in #284
  • docs: add deployment, performance tuning guides and streamline gettin… by @kirit93 in #277
  • chore: update tutorial notebooks to use dd. notation consistently by @andreatgretel in #288
  • feat: add extract_reasoning_content option to LLM columns by @eric-tramel in #285
  • chore: add greptile.json to reduce review verbosity by @andreatgretel in #289
  • feat: switch from hatch-vcs to uv-dynamic-versioning by @johnnygreco in #282
  • revert: Remove RunConfig debug_trace_override by @eric-tramel in #290
  • perf: implement lazy loading for config module exports by @johnnygreco in #291
  • refactor: move SingleColumnConfig to config.base module by @johnnygreco in #287
  • feat: Add CustomColumnGenerator for user-defined column generation by @andreatgretel in #254
  • chore: standardize recipe script metadata and docstrings by @johnnygreco in #292
  • chore: enable status check in greptile.json by @dakshgup in #295
  • feat: add HuggingFace Hub integration for dataset publishing by @nabinchha in #275
  • docs: Added images for deployment options by @kirit93 in #297
  • docs: Add RQA dataset blog post and improve blog navigation by @kirit93 in #296
  • chore: quiet tool call logs and add tool usage statistics by @johnnygreco in #293
  • docs: Added documentation for seed datasets by @kirit93 in #300
  • docs: updated usage chart by @kirit93 in #304
  • docs: Update README.md by @kirit93 in #305
  • chore: update HF card citation copy and add library version to builder config by @johnnygreco in #303
  • chore: add tokens generated badge to README by @johnnygreco in #306
  • test: add provider health checks script and CI workflow by @andreatgretel in #301
  • chore: bump pytest, nbconvert, and pyjwt for vulnerability fixes by @johnnygreco in #312
  • fix: allow BuilderConfig round-trip serialization by @johnnygreco in #311
  • chore: export ConstraintType and InequalityOperator from config init by @johnnygreco in #308
  • docs: restructure plugin docs with multi-file layout and seed reader type by @johnnygreco in #302
  • docs: Added cat emoji sequence by @kirit93 in #316
  • fix: use reasoning_effort for gpt-5 inference params by @andreatgretel in #315
  • docs: New post on SDG design principles by @kirit93 in #318
  • feat: add preview, create, and validate CLI commands by @johnnygreco in #313
  • feat: support loading config files from HTTP(S) URLs by @johnnygreco in #323
  • fix: include CUSTOM type in execution DAG and warn on generator errors by @andreatgretel in #324
  • fix: trim LLM response content before parsing by @johnnygreco in #322

Full Changelog: v0.4.0...v0.5.0

v0.4.0 2026-01-30

31 Jan 03:43
754ff71

🎨 NeMo Data Designer v0.4.0 Release Notes

✨ What's New

  • Message Traces: Capture the full conversation history during LLM generation, giving you access to system prompts, rendered user prompts, and model reasoning for downstream use cases. Enable per-column with with_trace=True or globally via RunConfig.

  • Multi-Image Support: Pass multiple images per column in multi-modal contexts for richer vision-based generation.

  • Expanded Code Languages: Added support for Bash, C, C++, C#, and COBOL in LLMCodeColumnConfig.

  • Progress Logging: Progress updates during LLM-column generation for better visibility into long-running jobs.


💥 Breaking Change: Import structure

The essentials module has been removed in favor of a cleaner import pattern. Configuration classes are now accessed via data_designer.config and the main interface via data_designer.interface.

Before (v0.3.x):

from data_designer.essentials import (
    CategorySamplerParams,
    DataDesigner,
    DataDesignerConfigBuilder,
    LLMTextColumnConfig,
    SamplerColumnConfig,
    SamplerType,
)

data_designer = DataDesigner()
config_builder = DataDesignerConfigBuilder()

After (v0.4.x):

import data_designer.config as dd
from data_designer.interface import DataDesigner

data_designer = DataDesigner()
config_builder = dd.DataDesignerConfigBuilder()

# Configuration classes are accessed via the `dd` namespace
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="category",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(values=["A", "B"]),
    )
)

💥 Breaking Change: Reasoning traces → Message traces

The automatic __reasoning_trace columns have been replaced with opt-in message traces that capture the full conversation history.

Key changes:

  • Column postfix renamed from __reasoning_trace to __trace
  • Traces are now opt-in rather than automatic
  • Traces capture the full message history (system/user/assistant), including retry conversations

Before (v0.3.x):

Reasoning traces were automatically generated as side-effect columns for extended thinking models:

# Traces were automatic - no configuration needed
# Column "answer" would automatically produce "answer__reasoning_trace"

After (v0.4.x):

Enable traces explicitly per-column or globally:

Per-column (recommended):

import data_designer.config as dd

config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="answer",
        prompt="Answer: {{ question }}",
        model_alias="nvidia-text",
        with_trace=True,  # Opt-in to trace capture
    )
)
# Produces "answer" and "answer__trace" columns

Global debug override:

import data_designer.config as dd
from data_designer.interface import DataDesigner

data_designer = DataDesigner()
data_designer.set_run_config(
    dd.RunConfig(debug_override_save_all_column_traces=True)
)

The trace data structure is now a list[dict] capturing the ordered message history:

[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4", "reasoning_content": None}
]
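
As a rough sketch of how downstream code might consume a trace in this shape, the snippet below pulls the final assistant message and its reasoning content out of the list. The helper name last_assistant_message is hypothetical and not part of the library; only the role, content, and reasoning_content keys come from the structure shown above.

```python
from typing import Any, Optional

Trace = list[dict[str, Any]]


def last_assistant_message(trace: Trace) -> Optional[dict[str, Any]]:
    """Return the final assistant message in a trace, or None if absent."""
    # Walk backwards so retries earlier in the history are skipped.
    for message in reversed(trace):
        if message.get("role") == "assistant":
            return message
    return None


example_trace: Trace = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4", "reasoning_content": None},
]

final = last_assistant_message(example_trace)
answer = final["content"] if final else None
# reasoning_content may be missing or None, so read it defensively.
reasoning = final.get("reasoning_content") if final else None
```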

What's Changed

  • feat: Add /create-pr skill for well-formatted GitHub PRs by @johnnygreco in #247
  • docs: Fix mkdocs syntax and update person sampling documentation by @johnnygreco in #249
  • refactor: slim package refactor into three subpackages by @johnnygreco in #240
  • chore: add publish script and update license headers by @johnnygreco in #253
  • chore: add CODEOWNERS for automatic PR review assignment by @andreatgretel in #251
  • feat: allow skipping health checks by @nabinchha in #244
  • chore: copy README to data-designer package during install by @johnnygreco in #256
  • feat: support multiple images per column in image context by @nabinchha in #257
  • fix: escape special characters in SchemaTransformProcessor JSON templates by @andreatgretel in #250
  • chore: update telemetry by @johntmyers in #261
  • feat: add /update-pr skill and improve /create-pr file linking by @johnnygreco in #258
  • feat: Add /commit skill for conventional commit messages by @johnnygreco in #252
  • fix: automate README sync for data-designer package builds by @andreatgretel in #266
  • chore: simplify publish script by removing redundant rebuild step by @johnnygreco in #268
  • feat: add job progress logging for cell-by-cell generation by @eric-tramel in #259
  • feat: add message trace support for LLM generation by @johnnygreco in #272
  • chore: add animated emoji progress indicators to progress tracker by @johnnygreco in #273
  • feat: Add Phase 1 languages (Bash, C, C++, C#, COBOL) to CodeLang by @kirit93 in #271
  • fix: ensure 100% progress is logged exactly once by @johnnygreco in #276

Full Changelog: v0.3.8...v0.4.0

v0.3.8 2026-01-26

27 Jan 02:28
5402b7d

👀 New Nemotron-Personas Datasets

PersonSampler supports two new locales:

What's Changed

Full Changelog: v0.3.7...v0.3.8

v0.3.7 2026-01-17

17 Jan 21:53
a3a9313

🎨 NeMo Data Designer v0.3.7 Release Notes

  • Restores lazy load changes introduced in PR-222 to litellm_overrides.py that led to intermittent import issues.

What's Changed

  • fix: restore lazy load litellm overrides changes by @nabinchha in #229

Full Changelog: v0.3.6...v0.3.7

v0.3.6 2026-01-16

17 Jan 00:55
eb5ef27

🎨 NeMo Data Designer v0.3.6 Release Notes

  • Fixes a regression introduced in PR-222 that wasn't caught by our tests.

What's Changed

  • fix: incorrect litellm lazy load for class extension by @nabinchha in #228

Full Changelog: v0.3.5...v0.3.6

v0.3.5 2026-01-16

16 Jan 22:47
d97d52c

🎨 NeMo Data Designer v0.3.5 Release Notes

💥 Breaking Change: Plugins

We have made some updates to the task and column generation abstractions, which come with some breaking changes for plugin developers.

  1. No more GeneratorMetadata

We have completely removed the GeneratorMetadata object (as well as its parent ConfigurableTaskMetadata object). This means you no longer need to define a metadata static method when creating a column generator implementation.

As part of this refactor, we have added two new subclasses to use for different generation strategies:

  • data_designer.engine.column_generators.generators.base.ColumnGeneratorFullColumn
  • data_designer.engine.column_generators.generators.base.ColumnGeneratorCellByCell

Before (v0.3.4)

from data_designer.engine.column_generators.generators.base import (
    ColumnGenerator,
    GenerationStrategy,
    GeneratorMetadata,
)

class IndexMultiplierColumnGenerator(ColumnGenerator[IndexMultiplierColumnConfig]):
    @staticmethod
    def metadata() -> GeneratorMetadata:
        """Define metadata about this generator."""
        return GeneratorMetadata(
            name="index-multiplier",
            description="Generates values by multiplying the row index by a user-specified multiplier",
            generation_strategy=GenerationStrategy.FULL_COLUMN,
        )
    
    # implementation below
    ...

After (v0.3.5)

from data_designer.engine.column_generators.generators.base import ColumnGeneratorFullColumn

class IndexMultiplierColumnGenerator(ColumnGeneratorFullColumn[IndexMultiplierColumnConfig]):

    # implementation below
    ...

  2. required_columns and side_effect_columns must now be explicitly defined on classes that inherit from SingleColumnConfig

Before (v0.3.4)

from typing import Literal

from data_designer.config.column_configs import SingleColumnConfig

class IndexMultiplierColumnConfig(SingleColumnConfig):
    """Configuration for the index multiplier column generator."""

    # Configurable parameter for this plugin
    multiplier: int = 2

    # Required: discriminator field with a unique Literal type
    # This value identifies your plugin and becomes its column_type
    column_type: Literal["index-multiplier"] = "index-multiplier"

After (v0.3.5)

from typing import Literal

from data_designer.config.column_configs import SingleColumnConfig

class IndexMultiplierColumnConfig(SingleColumnConfig):
    """Configuration for the index multiplier column generator."""

    # Configurable parameter for this plugin
    multiplier: int = 2

    # Required: discriminator field with a unique Literal type
    # This value identifies your plugin and becomes its column_type
    column_type: Literal["index-multiplier"] = "index-multiplier"

    @property
    def required_columns(self) -> list[str]:
        return []

    @property
    def side_effect_columns(self) -> list[str]:
        return []

While the updated version is more verbose, it will ensure column generator developers are aware of these properties, which are essential for building a working generator.

  3. Removed emoji from the Plugin object

Now that plugins support more than column generators, the emoji field is not always applicable.

Before (v0.3.4)

from data_designer.plugins import Plugin, PluginType

# Plugin instance - this is what gets loaded via entry point
plugin = Plugin(
    impl_qualified_name="data_designer_index_multiplier.plugin.IndexMultiplierColumnGenerator",
    config_qualified_name="data_designer_index_multiplier.plugin.IndexMultiplierColumnConfig",
    plugin_type=PluginType.COLUMN_GENERATOR,
    emoji="🔌",
)

After (v0.3.5)

from data_designer.plugins import Plugin, PluginType

# Plugin instance - this is what gets loaded via entry point
plugin = Plugin(
    impl_qualified_name="data_designer_index_multiplier.plugin.IndexMultiplierColumnGenerator",
    config_qualified_name="data_designer_index_multiplier.plugin.IndexMultiplierColumnConfig",
    plugin_type=PluginType.COLUMN_GENERATOR,
)
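
The requirement that required_columns and side_effect_columns be defined explicitly can be illustrated with abstract properties. The sketch below is only one way such a requirement could be enforced. ColumnConfigSketch is a hypothetical stand-in, not the real SingleColumnConfig, and the library may enforce the rule differently. The docstrings describing what each property means are likewise assumptions.

```python
from abc import ABC, abstractmethod


class ColumnConfigSketch(ABC):
    """Hypothetical stand-in for SingleColumnConfig; not the library class."""

    def __init__(self, name: str) -> None:
        self.name = name

    @property
    @abstractmethod
    def required_columns(self) -> list[str]:
        """Names of columns this column depends on (assumed meaning)."""

    @property
    @abstractmethod
    def side_effect_columns(self) -> list[str]:
        """Names of extra columns produced alongside this one (assumed meaning)."""


class IndexMultiplierSketch(ColumnConfigSketch):
    """Defines both properties, so it can be instantiated."""

    @property
    def required_columns(self) -> list[str]:
        return []

    @property
    def side_effect_columns(self) -> list[str]:
        return []


class IncompleteSketch(ColumnConfigSketch):
    """Omits side_effect_columns, so instantiation raises TypeError."""

    @property
    def required_columns(self) -> list[str]:
        return []
```

With this pattern, forgetting one of the properties fails at instantiation time rather than later during generation, which is the kind of early feedback the more verbose style is meant to provide.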

What's Changed

Full Changelog: v0.3.4...v0.3.5

v0.3.4 2026-01-14

14 Jan 22:37
3f18e75

What's Changed

Full Changelog: v0.3.3...v0.3.4

v0.3.3 2026-01-12

12 Jan 22:59
1f9de2c

What's Changed

Full Changelog: v0.3.2...v0.3.3

v0.3.2 2026-01-09

09 Jan 22:46
f8c201e

Breaking plugin changes

  • required_resources has been removed from task metadata objects.

  • There are two new column generator base classes to streamline model usage:

In data_designer.engine.column_generators.generators.base:

class ColumnGeneratorWithModelRegistry(ColumnGenerator[TaskConfigT], ABC):
    @property
    def model_registry(self) -> ModelRegistry:
        return self.resource_provider.model_registry

    def get_model(self, model_alias: str) -> ModelFacade:
        return self.model_registry.get_model(model_alias=model_alias)

    def get_model_config(self, model_alias: str) -> ModelConfig:
        return self.model_registry.get_model_config(model_alias=model_alias)

    def get_model_provider_name(self, model_alias: str) -> str:
        provider = self.model_registry.get_model_provider(model_alias=model_alias)
        return provider.name


class ColumnGeneratorWithModel(ColumnGeneratorWithModelRegistry[TaskConfigT], ABC):
    @functools.cached_property
    def model(self) -> ModelFacade:
        return self.get_model(model_alias=self.config.model_alias)

    @functools.cached_property
    def model_config(self) -> ModelConfig:
        return self.get_model_config(model_alias=self.config.model_alias)

    @functools.cached_property
    def inference_parameters(self) -> BaseInferenceParams:
        return self.model_config.inference_parameters

    def log_pre_generation(self) -> None:
        logger.info(f"{self.config.column_type} model configuration for generating column '{self.config.name}'")
        logger.info(f"  |-- model: {self.model_config.model!r}")
        logger.info(f"  |-- model alias: {self.config.model_alias!r}")
        logger.info(f"  |-- model provider: {self.get_model_provider_name(model_alias=self.config.model_alias)!r}")
        logger.info(f"  |-- inference parameters: {self.inference_parameters.format_for_display()}")

If you need to use models in your generator, subclass from one of these base classes.
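
The cached model property in ColumnGeneratorWithModel means the registry lookup happens once per generator instance. The sketch below demonstrates that behavior with stand-ins only: RegistrySketch and GeneratorSketch are hypothetical classes written for this example, and the string returned in place of a ModelFacade is a placeholder.

```python
import functools


class RegistrySketch:
    """Hypothetical stand-in for a model registry; counts lookups."""

    def __init__(self) -> None:
        self.lookups = 0

    def get_model(self, model_alias: str) -> str:
        self.lookups += 1
        # A real registry would return a model object; a string stands in here.
        return f"model-for-{model_alias}"


class GeneratorSketch:
    """Mimics the cached `model` property pattern from the excerpt above."""

    def __init__(self, registry: RegistrySketch, model_alias: str) -> None:
        self.registry = registry
        self.model_alias = model_alias

    @functools.cached_property
    def model(self) -> str:
        # Evaluated on first access, then stored on the instance.
        return self.registry.get_model(model_alias=self.model_alias)


registry = RegistrySketch()
generator = GeneratorSketch(registry, "nvidia-text")
first = generator.model
second = generator.model  # Served from the cache; no second lookup.
```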

What's Changed

Full Changelog: v0.3.1...v0.3.2