Releases: NVIDIA-NeMo/DataDesigner
v0.5.1 2026-02-20
Data Designer now supports image generation!
What's Changed
- docs: Updated url by @kirit93 in #325
- docs: deep research trajectories with NDD and MCP tool use by @eric-tramel in #326
- refactor: callback-based processor design by @andreatgretel in #294
- feat: add image generation support with multi-modal context by @nabinchha in #317
- docs: add image generation documentation and image-to-image editing tutorial by @nabinchha in #319
- chore: move ArtifactStorage to engine/storage/ module by @nabinchha in #321
- chore: gitignore Cerebro knowledge base files by @johnnygreco in #328
- feat(engine): env-var switch for async-first models experiment by @eric-tramel in #280
- docs: Moved nav to left hand side by @kirit93 in #331
- feat: add --save-results option to preview command by @johnnygreco in #333
- chore: Improve CLI startup with lazy heavy import cleanup by @johnnygreco in #330
- feat: add allow_resize for 1:N and N:1 generation patterns by @andreatgretel in #286
- chore: address Andre's feedback on --save-results and CLI preview by @johnnygreco in #335
- chore: remove example_allow_resize.py from repo root by @andreatgretel in #337
- fix: make DropColumnsProcessorConfig idempotent and support reasoning columns by @andreatgretel in #334
- feat: add push_to_hub_from_folder classmethod for uploading saved datasets by @nabinchha in #340
- fix: handle bool, int, float in convert_to_row_element by @dhruvnathawani in #336
- feat: auto-detect ImageContext format for image-to-image generation by @nabinchha in #342
New Contributors
- @dhruvnathawani made their first contribution in #336
Full Changelog: v0.5.0...v0.5.1
v0.5.0 2026-02-11
🎨 NeMo Data Designer – v0.5.0 Release Notes
⚡Highlights
- 🛠️ MCP Tool Calling: LLM columns can now call external tools during generation via MCP!
- ⚛️ Functions as custom column generators: The @custom_column_generator decorator lets users write their own column generation logic and plug it directly into a pipeline.
- 🤗 Hugging Face Hub integration: You can now publish generated datasets directly to the Hugging Face Hub with auto-generated dataset cards: results.push_to_hub().
  - Huge thank you to @davidberenstein1957 for starting the design and work on this feature, as well as @davanstrien and @Wauplin for their help pushing it over the finish line!
- 💻 CLI generation commands: You can generate data from the CLI using the new preview, create, and validate commands.
- 🔍 LLM Observability: Use the new with_trace option on LLM configs to return all messages (TraceType.ALL_MESSAGES) or only the last message (TraceType.LAST_MESSAGE). You can also selectively extract reasoning content using extract_reasoning_content=True.
⚠️ Breaking Changes
- with_trace is now a TraceType enum (NONE (default), LAST_MESSAGE, ALL_MESSAGES) instead of a boolean.
- SingleColumnConfig is now isolated in its own base module, data_designer.config.base, to protect against circular imports during plugin discovery.
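For code that still passes a boolean, the migration can be sketched as a small mapping. The TraceType class below is a stand-in that only mirrors the member names listed above; its string values and the helper migrate_with_trace are assumptions for illustration and are not part of the library. Mapping True to ALL_MESSAGES assumes the old boolean captured the full message history, as the v0.4.0 notes describe.

```python
from enum import Enum


class TraceType(str, Enum):
    """Stand-in for the library enum; the member values here are assumed."""

    NONE = "none"
    LAST_MESSAGE = "last_message"
    ALL_MESSAGES = "all_messages"


def migrate_with_trace(value: "bool | TraceType") -> TraceType:
    """Translate a pre-v0.5.0 boolean with_trace into a TraceType member."""
    if isinstance(value, TraceType):
        return value  # already migrated, pass through unchanged
    if isinstance(value, bool):
        return TraceType.ALL_MESSAGES if value else TraceType.NONE
    raise TypeError(f"unsupported with_trace value: {value!r}")
```

In a real config the result would be passed as with_trace=... on the LLM column config, using the library's own TraceType rather than this stand-in.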
What's Changed
- feat: MCP (Model Context Protocol) tool calling integration for LLM columns by @eric-tramel in #248
- fix: normalize license header year format in mcp module by @johnnygreco in #279
- chore: configure independent pytest settings per subpackage by @johnnygreco in #278
- fix: normalize trace content blocks to prevent parquet write crashes by @eric-tramel in #283
- feat: Add TraceType enum for granular trace control by @eric-tramel in #284
- docs: add deployment, performance tuning guides and streamline gettin… by @kirit93 in #277
- chore: update tutorial notebooks to use dd. notation consistently by @andreatgretel in #288
- feat: add extract_reasoning_content option to LLM columns by @eric-tramel in #285
- chore: add greptile.json to reduce review verbosity by @andreatgretel in #289
- feat: switch from hatch-vcs to uv-dynamic-versioning by @johnnygreco in #282
- revert: Remove RunConfig debug_trace_override by @eric-tramel in #290
- perf: implement lazy loading for config module exports by @johnnygreco in #291
- refactor: move SingleColumnConfig to config.base module by @johnnygreco in #287
- feat: Add CustomColumnGenerator for user-defined column generation by @andreatgretel in #254
- chore: standardize recipe script metadata and docstrings by @johnnygreco in #292
- chore: enable status check in greptile.json by @dakshgup in #295
- feat: add HuggingFace Hub integration for dataset publishing by @nabinchha in #275
- docs: Added images for deployment options by @kirit93 in #297
- docs: Add RQA dataset blog post and improve blog navigation by @kirit93 in #296
- chore: quiet tool call logs and add tool usage statistics by @johnnygreco in #293
- docs: Added documentation for seed datasets by @kirit93 in #300
- docs: updated usage chart by @kirit93 in #304
- docs: Update README.md by @kirit93 in #305
- chore: update HF card citation copy and add library version to builder config by @johnnygreco in #303
- chore: add tokens generated badge to README by @johnnygreco in #306
- test: add provider health checks script and CI workflow by @andreatgretel in #301
- chore: bump pytest, nbconvert, and pyjwt for vulnerability fixes by @johnnygreco in #312
- fix: allow BuilderConfig round-trip serialization by @johnnygreco in #311
- chore: export ConstraintType and InequalityOperator from config init by @johnnygreco in #308
- docs: restructure plugin docs with multi-file layout and seed reader type by @johnnygreco in #302
- docs: Added cat emoji sequence by @kirit93 in #316
- fix: use reasoning_effort for gpt-5 inference params by @andreatgretel in #315
- docs: New post on SDG design principles by @kirit93 in #318
- feat: add preview, create, and validate CLI commands by @johnnygreco in #313
- feat: support loading config files from HTTP(S) URLs by @johnnygreco in #323
- fix: include CUSTOM type in execution DAG and warn on generator errors by @andreatgretel in #324
- fix: trim LLM response content before parsing by @johnnygreco in #322
New Contributors
- @dakshgup made their first contribution in #295
- @davanstrien
- @davidberenstein1957
- @Wauplin
Full Changelog: v0.4.0...v0.5.0
v0.4.0 2026-01-30
🎨 NeMo Data Designer v0.4.0 Release Notes
✨ What's New
- Message Traces: Capture the full conversation history during LLM generation, giving you access to system prompts, rendered user prompts, and model reasoning for downstream use cases. Enable per-column with with_trace=True or globally via RunConfig.
- Multi-Image Support: Pass multiple images per column in multi-modal contexts for richer vision-based generation.
- Expanded Code Languages: Added support for Bash, C, C++, C#, and COBOL in LLMCodeColumnConfig.
- Progress Logging: Progress updates during LLM-column generation for better visibility into long-running jobs.
💥 Breaking Change: Import structure
The essentials module has been removed in favor of a cleaner import pattern. Configuration classes are now accessed via data_designer.config and the main interface via data_designer.interface.
Before (v0.3.x):
from data_designer.essentials import (
    CategorySamplerParams,
    DataDesigner,
    DataDesignerConfigBuilder,
    LLMTextColumnConfig,
    SamplerColumnConfig,
    SamplerType,
)

data_designer = DataDesigner()
config_builder = DataDesignerConfigBuilder()

After (v0.4.x):
import data_designer.config as dd
from data_designer.interface import DataDesigner

data_designer = DataDesigner()
config_builder = dd.DataDesignerConfigBuilder()

# Configuration classes are accessed via the `dd` namespace
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="category",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(values=["A", "B"]),
    )
)

💥 Breaking Change: Reasoning traces → Message traces
The automatic __reasoning_trace columns have been replaced with opt-in message traces that capture the full conversation history.
Key changes:
- Column postfix renamed from __reasoning_trace to __trace
- Traces are now opt-in rather than automatic
- Traces capture the full message history (system/user/assistant), including retry conversations
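Because a trace is an ordered list of role/content dictionaries, downstream code can pull out individual pieces with ordinary list handling. A minimal sketch, assuming the role, content, and reasoning_content keys of the trace format described in these notes; the helper name last_assistant_message is hypothetical.

```python
from typing import Any, Optional


def last_assistant_message(trace: list[dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Return the final assistant message in a trace, or None if there is none."""
    for message in reversed(trace):
        if message.get("role") == "assistant":
            return message
    return None


# Example trace in the documented list[dict] shape.
trace = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4", "reasoning_content": None},
]

final = last_assistant_message(trace)
answer = final["content"] if final else None
reasoning = final.get("reasoning_content") if final else None
```

Searching from the end matters when a trace includes retry conversations, since earlier assistant turns may be failed attempts.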
Before (v0.3.x):
Reasoning traces were automatically generated as side-effect columns for extended thinking models:
# Traces were automatic - no configuration needed
# Column "answer" would automatically produce "answer__reasoning_trace"

After (v0.4.x):
Enable traces explicitly per-column or globally:
Per-column (recommended):
import data_designer.config as dd

config_builder = dd.DataDesignerConfigBuilder()
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="answer",
        prompt="Answer: {{ question }}",
        model_alias="nvidia-text",
        with_trace=True,  # Opt-in to trace capture
    )
)
# Produces "answer" and "answer__trace" columns

Global debug override:
import data_designer.config as dd
from data_designer.interface import DataDesigner

data_designer = DataDesigner()
data_designer.set_run_config(
    dd.RunConfig(debug_override_save_all_column_traces=True)
)

The trace data structure is now a list[dict] capturing the ordered message history:
[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4", "reasoning_content": None}
]

What's Changed
- feat: Add /create-pr skill for well-formatted GitHub PRs by @johnnygreco in #247
- docs: Fix mkdocs syntax and update person sampling documentation by @johnnygreco in #249
- refactor: slim package refactor into three subpackages by @johnnygreco in #240
- chore: add publish script and update license headers by @johnnygreco in #253
- chore: add CODEOWNERS for automatic PR review assignment by @andreatgretel in #251
- feat: allow skipping health checks by @nabinchha in #244
- chore: copy README to data-designer package during install by @johnnygreco in #256
- feat: support multiple images per column in image context by @nabinchha in #257
- fix: escape special characters in SchemaTransformProcessor JSON templates by @andreatgretel in #250
- chore: update telemetry by @johntmyers in #261
- feat: add /update-pr skill and improve /create-pr file linking by @johnnygreco in #258
- feat: Add /commit skill for conventional commit messages by @johnnygreco in #252
- fix: automate README sync for data-designer package builds by @andreatgretel in #266
- chore: simplify publish script by removing redundant rebuild step by @johnnygreco in #268
- feat: add job progress logging for cell-by-cell generation by @eric-tramel in #259
- feat: add message trace support for LLM generation by @johnnygreco in #272
- chore: add animated emoji progress indicators to progress tracker by @johnnygreco in #273
- feat: Add Phase 1 languages (Bash, C, C++, C#, COBOL) to CodeLang by @kirit93 in #271
- fix: ensure 100% progress is logged exactly once by @johnnygreco in #276
Full Changelog: v0.3.8...v0.4.0
v0.3.8 2026-01-26
👀 New Nemotron-Personas Datasets
PersonSampler supports two new locales:
- Nemotron-Personas-Singapore (locale = en_SG)
- Nemotron-Personas-Brazil (locale = pt_BR)
What's Changed
- fix: unblock generation when no from-scratch-generator is configured by @nabinchha in #231
- fix: do not attempt to deserialize llm text response by @nabinchha in #233
- docs: Updated recipe card by @kirit93 in #153
- fix: no api key warning on default model providers by @nabinchha in #238
- feat: Support for Claude Skills (DevX and Generation) by @eric-tramel in #239
- feat: Elevate non-LLM concurrency limits to RunConfig by @eric-tramel in #242
- feat: wire up pt_GB and en_SG personas by @johnnygreco in #245
Full Changelog: v0.3.7...v0.3.8
v0.3.7 2026-01-17
🎨 NeMo Data Designer v0.3.7 Release Notes
- Restores lazy load changes introduced in PR-222 to litellm_overrides.py that led to intermittent import issues.
What's Changed
- fix: restore lazy load litellm overrides changes by @nabinchha in #229
Full Changelog: v0.3.6...v0.3.7
v0.3.6 2026-01-16
🎨 NeMo Data Designer v0.3.6 Release Notes
- Fixes a regression introduced in PR-222 that wasn't caught by our tests.
What's Changed
- fix: incorrect litellm lazy load for class extension by @nabinchha in #228
Full Changelog: v0.3.5...v0.3.6
v0.3.5 2026-01-16
🎨 NeMo Data Designer v0.3.5 Release Notes
💥 Breaking Change: Plugins
We have made some updates to the task and column generation abstractions, which come with some breaking changes for plugin developers.
- No more GeneratorMetadata
We have completely removed the GeneratorMetadata object (as well as its parent ConfigurableTaskMetadata object). This means you no longer need to define a metadata static method when creating a column generator implementation.
As part of this refactor, we have added two new subclasses to use for different generation strategies:
  - data_designer.engine.column_generators.generators.base.ColumnGeneratorFullColumn
  - data_designer.engine.column_generators.generators.base.ColumnGeneratorCellByCell
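The two strategy names suggest a difference in what a generator sees at once: every row of the dataset, or one record at a time. The sketch below illustrates that distinction with stand-in classes and plain lists of dicts. The class names, the generate method, and its signatures are all hypothetical and do not reproduce the library's base classes.

```python
from abc import ABC, abstractmethod


class FullColumnSketch(ABC):
    """Stand-in: receives every row at once and returns all rows."""

    @abstractmethod
    def generate(self, rows: list[dict]) -> list[dict]: ...


class CellByCellSketch(ABC):
    """Stand-in: receives a single row and returns that row."""

    @abstractmethod
    def generate(self, row: dict) -> dict: ...


class IndexMultiplier(FullColumnSketch):
    """Needs the row index, so it works on the whole column."""

    def __init__(self, multiplier: int = 2) -> None:
        self.multiplier = multiplier

    def generate(self, rows: list[dict]) -> list[dict]:
        return [{**row, "value": i * self.multiplier} for i, row in enumerate(rows)]


class Upper(CellByCellSketch):
    """Depends only on the current row, so it can run per cell."""

    def generate(self, row: dict) -> dict:
        return {**row, "name_upper": row["name"].upper()}


rows = [{"name": "a"}, {"name": "b"}, {"name": "c"}]
rows = IndexMultiplier(multiplier=3).generate(rows)
rows = [Upper().generate(row) for row in rows]
```

Picking the strategy through the base class, rather than through a metadata object, keeps the choice visible in the class declaration itself.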
Before (v0.3.4)
from data_designer.engine.column_generators.generators.base import (
    ColumnGenerator,
    GenerationStrategy,
    GeneratorMetadata,
)

class IndexMultiplierColumnGenerator(ColumnGenerator[IndexMultiplierColumnConfig]):
    @staticmethod
    def metadata() -> GeneratorMetadata:
        """Define metadata about this generator."""
        return GeneratorMetadata(
            name="index-multiplier",
            description="Generates values by multiplying the row index by a user-specified multiplier",
            generation_strategy=GenerationStrategy.FULL_COLUMN,
        )

    # implementation below
    ...

After (v0.3.5)
from data_designer.engine.column_generators.generators.base import ColumnGeneratorFullColumn

class IndexMultiplierColumnGenerator(ColumnGeneratorFullColumn[IndexMultiplierColumnConfig]):
    # implementation below
    ...

- required_columns and side_effect_columns now must be explicitly defined on classes that inherit from SingleColumnConfig
Before (v0.3.4)
from typing import Literal

from data_designer.config.column_configs import SingleColumnConfig

class IndexMultiplierColumnConfig(SingleColumnConfig):
    """Configuration for the index multiplier column generator."""

    # Configurable parameter for this plugin
    multiplier: int = 2

    # Required: discriminator field with a unique Literal type
    # This value identifies your plugin and becomes its column_type
    column_type: Literal["index-multiplier"] = "index-multiplier"

After (v0.3.5)
from typing import Literal

from data_designer.config.column_configs import SingleColumnConfig

class IndexMultiplierColumnConfig(SingleColumnConfig):
    """Configuration for the index multiplier column generator."""

    # Configurable parameter for this plugin
    multiplier: int = 2

    # Required: discriminator field with a unique Literal type
    # This value identifies your plugin and becomes its column_type
    column_type: Literal["index-multiplier"] = "index-multiplier"

    @property
    def required_columns(self) -> list[str]:
        return []

    @property
    def side_effect_columns(self) -> list[str]:
        return []

While the updated version is more verbose, it ensures column generator developers are aware of these properties, which are essential for building a working generator.
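One way to see why explicit definitions help: if a base class declares these properties as abstract, a config that forgets one fails at instantiation instead of silently inheriting an empty default. The following is a sketch with a stand-in base class (ConfigSketch). Whether the library enforces the requirement through abc in this way is an assumption.

```python
from abc import ABC, abstractmethod


class ConfigSketch(ABC):
    """Stand-in base class; not the library's SingleColumnConfig."""

    @property
    @abstractmethod
    def required_columns(self) -> list[str]: ...

    @property
    @abstractmethod
    def side_effect_columns(self) -> list[str]: ...


class CompleteConfig(ConfigSketch):
    @property
    def required_columns(self) -> list[str]:
        return ["question"]  # columns this one depends on

    @property
    def side_effect_columns(self) -> list[str]:
        return []  # no additional columns are produced


class IncompleteConfig(ConfigSketch):
    @property
    def required_columns(self) -> list[str]:
        return []
    # side_effect_columns is deliberately left undefined


try:
    IncompleteConfig()
    error = None
except TypeError as exc:
    # Python refuses to instantiate a class with unimplemented abstract members.
    error = str(exc)
```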
- Removed emoji from the Plugin object
Now that plugins support more than column generators, the emoji field is not always applicable.
Before (v0.3.4)
from data_designer.plugins import Plugin, PluginType

# Plugin instance - this is what gets loaded via entry point
plugin = Plugin(
    impl_qualified_name="data_designer_index_multiplier.plugin.IndexMultiplierColumnGenerator",
    config_qualified_name="data_designer_index_multiplier.plugin.IndexMultiplierColumnConfig",
    plugin_type=PluginType.COLUMN_GENERATOR,
    emoji="🔌",
)

After (v0.3.5)
from data_designer.plugins import Plugin, PluginType

# Plugin instance - this is what gets loaded via entry point
plugin = Plugin(
    impl_qualified_name="data_designer_index_multiplier.plugin.IndexMultiplierColumnGenerator",
    config_qualified_name="data_designer_index_multiplier.plugin.IndexMultiplierColumnConfig",
    plugin_type=PluginType.COLUMN_GENERATOR,
)

What's Changed
- fix: dataset metadata should be optional in PreviewResults by @andreatgretel in #223
- refactor: remove task metadata property by @johnnygreco in #216
- refactor: update single column base class by @johnnygreco in #206
- chore: lazy 3rd party imports by @nabinchha in #222
- fix: post merge issues by @nabinchha in #224
- chore: minor readme updates by @johnnygreco in #225
- chore: streamline generation metadata + consolidate sdg json by @nabinchha in #226
Full Changelog: v0.3.4...v0.3.5
v0.3.4 2026-01-14
What's Changed
- fix: hard-disable early shutdown when RunConfig.disable_early_shutdown=true by @eric-tramel in #203
- chore: upgrade numpy in uv.lock by @johnnygreco in #202
- fix: update example runner command with notebooks dep group by @johnnygreco in #204
- chore: Add SkipJsonSchema annotation to DF seed source by @mikeknep in #205
- feat: Plumb LLM retry controls through RunConfig by @eric-tramel in #208
- feat: move buffer size control to RunConfig by @eric-tramel in #209
- fix: seed columns do not show up in display_sample_record by @andreatgretel in #213
- docs: Added top models pie chart by @kirit93 in #217
- chore: rename e2e_tests to tests_e2e by @johnnygreco in #214
- chore: Set json_schema_mode_override validation on ConfigBase by @mikeknep in #220
- fix: gracefully handle empty buffers in the dataset builder by @johnnygreco in #221
Full Changelog: v0.3.3...v0.3.4
v0.3.3 2026-01-12
What's Changed
- chore: Bump rich to 14.x series by @mikeknep in #196
- chore: add isssue templates by @johnnygreco in #197
- chore: minor issue template tweaks by @johnnygreco in #198
- chore: add make commands to run examples as e2e tests by @johnnygreco in #199
- fix: early shutdown race condition by @nabinchha in #201
Full Changelog: v0.3.2...v0.3.3
v0.3.2 2026-01-09
Breaking plugin changes
- required_resources has been removed from task metadata objects.
- There are two new column generator base classes to streamline model usage:
In data_designer.engine.column_generators.generators.base:
class ColumnGeneratorWithModelRegistry(ColumnGenerator[TaskConfigT], ABC):
    @property
    def model_registry(self) -> ModelRegistry:
        return self.resource_provider.model_registry

    def get_model(self, model_alias: str) -> ModelFacade:
        return self.model_registry.get_model(model_alias=model_alias)

    def get_model_config(self, model_alias: str) -> ModelConfig:
        return self.model_registry.get_model_config(model_alias=model_alias)

    def get_model_provider_name(self, model_alias: str) -> str:
        provider = self.model_registry.get_model_provider(model_alias=model_alias)
        return provider.name


class ColumnGeneratorWithModel(ColumnGeneratorWithModelRegistry[TaskConfigT], ABC):
    @functools.cached_property
    def model(self) -> ModelFacade:
        return self.get_model(model_alias=self.config.model_alias)

    @functools.cached_property
    def model_config(self) -> ModelConfig:
        return self.get_model_config(model_alias=self.config.model_alias)

    @functools.cached_property
    def inference_parameters(self) -> BaseInferenceParams:
        return self.model_config.inference_parameters

    def log_pre_generation(self) -> None:
        logger.info(f"{self.config.column_type} model configuration for generating column '{self.config.name}'")
        logger.info(f" |-- model: {self.model_config.model!r}")
        logger.info(f" |-- model alias: {self.config.model_alias!r}")
        logger.info(f" |-- model provider: {self.get_model_provider_name(model_alias=self.config.model_alias)!r}")
        logger.info(f" |-- inference parameters: {self.inference_parameters.format_for_display()}")

If you need to use models in your generator, subclass one of these base classes.
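The functools.cached_property decorators in the listing above mean each lookup runs once per generator instance and is reused afterwards. A self-contained sketch of that behaviour, using a hypothetical FakeRegistry and GeneratorSketch in place of the real registry and base classes:

```python
import functools


class FakeRegistry:
    """Hypothetical stand-in for a model registry that counts lookups."""

    def __init__(self) -> None:
        self.lookups = 0

    def get_model(self, model_alias: str) -> str:
        self.lookups += 1
        return f"model-for-{model_alias}"


class GeneratorSketch:
    """Hypothetical generator that resolves its model lazily."""

    def __init__(self, registry: FakeRegistry, model_alias: str) -> None:
        self.registry = registry
        self.model_alias = model_alias

    @functools.cached_property
    def model(self) -> str:
        # Evaluated on first access, then stored on the instance.
        return self.registry.get_model(model_alias=self.model_alias)


registry = FakeRegistry()
generator = GeneratorSketch(registry, "nvidia-text")
first = generator.model
second = generator.model  # served from the cache, no second lookup
```

Because the value is cached per instance, two generators sharing a registry would still each perform their own first lookup.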
What's Changed
- refactor: update required resources treatment and use subclasses over mixins by @johnnygreco in #184
- feat: Seed dataset plugins by @mikeknep in #191
- chore: update header script to check for diffs by @johnnygreco in #195
Full Changelog: v0.3.1...v0.3.2