20 Feb 21:06

johnnygreco

8f7a720

v0.5.1 2026-02-20 Latest

Data Designer now supports image generation!

What's Changed

docs: Updated url by @kirit93 in #325
docs: deep research trajectories with NDD and MCP tool use by @eric-tramel in #326
refactor: callback-based processor design by @andreatgretel in #294
feat: add image generation support with multi-modal context by @nabinchha in #317
docs: add image generation documentation and image-to-image editing tutorial by @nabinchha in #319
chore: move ArtifactStorage to engine/storage/ module by @nabinchha in #321
chore: gitignore Cerebro knowledge base files by @johnnygreco in #328
feat(engine): env-var switch for async-first models experiment by @eric-tramel in #280
docs: Moved nav to left hand side by @kirit93 in #331
feat: add --save-results option to preview command by @johnnygreco in #333
chore: Improve CLI startup with lazy heavy import cleanup by @johnnygreco in #330
feat: add allow_resize for 1:N and N:1 generation patterns by @andreatgretel in #286
chore: address Andre's feedback on --save-results and CLI preview by @johnnygreco in #335
chore: remove example_allow_resize.py from repo root by @andreatgretel in #337
fix: make DropColumnsProcessorConfig idempotent and support reasoning columns by @andreatgretel in #334
feat: add push_to_hub_from_folder classmethod for uploading saved datasets by @nabinchha in #340
fix: handle bool, int, float in convert_to_row_element by @dhruvnathawani in #336
feat: auto-detect ImageContext format for image-to-image generation by @nabinchha in #342

New Contributors

@dhruvnathawani made their first contribution in #336

Full Changelog: v0.5.0...v0.5.1

Contributors

eric-tramel, nabinchha, and 4 other contributors

11 Feb 22:22

johnnygreco

v0.5.0

631f1f9

v0.5.0 2026-02-11

🎨 NeMo Data Designer – v0.5.0 Release Notes

⚡Highlights

🛠️ MCP Tool Calling: LLM columns can now call external tools during generation via MCP!!
⚛️ Functions as custom column generators: The @custom_column_generator decorator lets users write their own column generation logic and plug it directly into a pipeline.
🤗 Hugging Face Hub integration: You can now publish generated datasets directly to the Hugging Face Hub with auto-generated dataset cards: results.push_to_hub().
- Huge thank you to @davidberenstein1957 for starting the design and work on this feature, as well as @davanstrien and @Wauplin for their help pushing it over the finish line!
💻 CLI generation commands: You can generate data from the CLI using the new preview, create, and validate commands.
🔍 LLM Observability: Set the new with_trace option on LLM configs to TraceType.ALL_MESSAGES or TraceType.LAST_MESSAGE to return message traces. You can also selectively extract reasoning content using extract_reasoning_content=True.

⚠️ Breaking Changes

with_trace is now a TraceType enum (NONE (default), LAST_MESSAGE, ALL_MESSAGES) instead of a boolean.
SingleColumnConfig is now isolated in its own base module, data_designer.config.base, to protect against circular imports during plugin discovery.
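
For code that still passes a boolean to with_trace, the migration can be sketched roughly as follows. The TraceType enum here is a local stand-in that only mirrors the member names listed above (its string values are assumed), and migrate_with_trace is a hypothetical helper, not part of Data Designer. Mapping True to ALL_MESSAGES assumes you want to keep the full-history behavior of the v0.4.x traces.

```python
from enum import Enum


class TraceType(str, Enum):
    """Local stand-in for the library enum; member values are assumed."""

    NONE = "none"
    LAST_MESSAGE = "last_message"
    ALL_MESSAGES = "all_messages"


def migrate_with_trace(value):
    """Hypothetical helper: convert a legacy boolean with_trace to a TraceType."""
    if isinstance(value, TraceType):
        return value  # already migrated, nothing to do
    if isinstance(value, bool):
        # Assumption: the old True meant "capture the full message history".
        return TraceType.ALL_MESSAGES if value else TraceType.NONE
    raise TypeError(f"Unsupported with_trace value: {value!r}")


legacy_settings = {"answer": True, "summary": False}
migrated = {name: migrate_with_trace(flag) for name, flag in legacy_settings.items()}
```

In real code you would import TraceType from the library rather than redefining it, and pass the resulting member as with_trace on the LLM column config.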

What's Changed

feat: MCP (Model Context Protocol) tool calling integration for LLM columns by @eric-tramel in #248
fix: normalize license header year format in mcp module by @johnnygreco in #279
chore: configure independent pytest settings per subpackage by @johnnygreco in #278
fix: normalize trace content blocks to prevent parquet write crashes by @eric-tramel in #283
feat: Add TraceType enum for granular trace control by @eric-tramel in #284
docs: add deployment, performance tuning guides and streamline gettin… by @kirit93 in #277
chore: update tutorial notebooks to use dd. notation consistently by @andreatgretel in #288
feat: add extract_reasoning_content option to LLM columns by @eric-tramel in #285
chore: add greptile.json to reduce review verbosity by @andreatgretel in #289
feat: switch from hatch-vcs to uv-dynamic-versioning by @johnnygreco in #282
revert: Remove RunConfig debug_trace_override by @eric-tramel in #290
perf: implement lazy loading for config module exports by @johnnygreco in #291
refactor: move SingleColumnConfig to config.base module by @johnnygreco in #287
feat: Add CustomColumnGenerator for user-defined column generation by @andreatgretel in #254
chore: standardize recipe script metadata and docstrings by @johnnygreco in #292
chore: enable status check in greptile.json by @dakshgup in #295
feat: add HuggingFace Hub integration for dataset publishing by @nabinchha in #275
docs: Added images for deployment options by @kirit93 in #297
docs: Add RQA dataset blog post and improve blog navigation by @kirit93 in #296
chore: quiet tool call logs and add tool usage statistics by @johnnygreco in #293
docs: Added documentation for seed datasets by @kirit93 in #300
docs: updated usage chart by @kirit93 in #304
docs: Update README.md by @kirit93 in #305
chore: update HF card citation copy and add library version to builder config by @johnnygreco in #303
chore: add tokens generated badge to README by @johnnygreco in #306
test: add provider health checks script and CI workflow by @andreatgretel in #301
chore: bump pytest, nbconvert, and pyjwt for vulnerability fixes by @johnnygreco in #312
fix: allow BuilderConfig round-trip serialization by @johnnygreco in #311
chore: export ConstraintType and InequalityOperator from config init by @johnnygreco in #308
docs: restructure plugin docs with multi-file layout and seed reader type by @johnnygreco in #302
docs: Added cat emoji sequence by @kirit93 in #316
fix: use reasoning_effort for gpt-5 inference params by @andreatgretel in #315
docs: New post on SDG design principles by @kirit93 in #318
feat: add preview, create, and validate CLI commands by @johnnygreco in #313
feat: support loading config files from HTTP(S) URLs by @johnnygreco in #323
fix: include CUSTOM type in execution DAG and warn on generator errors by @andreatgretel in #324
fix: trim LLM response content before parsing by @johnnygreco in #322

New Contributors

@dakshgup made their first contribution in #295
@davanstrien
@davidberenstein1957
@Wauplin

Full Changelog: v0.4.0...v0.5.0

Contributors

eric-tramel, nabinchha, and 7 other contributors

31 Jan 03:43

johnnygreco

v0.4.0

754ff71

v0.4.0 2026-01-30

🎨 NeMo Data Designer v0.4.0 Release Notes

✨ What's New

Message Traces: Capture the full conversation history during LLM generation, giving you access to system prompts, rendered user prompts, and model reasoning for downstream use cases. Enable per-column with with_trace=True or globally via RunConfig.
Multi-Image Support: Pass multiple images per column in multi-modal contexts for richer vision-based generation.
Expanded Code Languages: Added support for Bash, C, C++, C#, and COBOL in LLMCodeColumnConfig.
Progress Logging: Progress updates during LLM-column generation for better visibility into long-running jobs.

💥 Breaking Change: Import structure

The essentials module has been removed in favor of a cleaner import pattern. Configuration classes are now accessed via data_designer.config and the main interface via data_designer.interface.

Before (v0.3.x):

from data_designer.essentials import (
    CategorySamplerParams,
    DataDesigner,
    DataDesignerConfigBuilder,
    LLMTextColumnConfig,
    SamplerColumnConfig,
    SamplerType,
)

data_designer = DataDesigner()
config_builder = DataDesignerConfigBuilder()

After (v0.4.x):

import data_designer.config as dd
from data_designer.interface import DataDesigner

data_designer = DataDesigner()
config_builder = dd.DataDesignerConfigBuilder()

# Configuration classes are accessed via the `dd` namespace
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="category",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(values=["A", "B"]),
    )
)

💥 Breaking Change: Reasoning traces → Message traces

The automatic __reasoning_trace columns have been replaced with opt-in message traces that capture the full conversation history.

Key changes:

Column postfix renamed from __reasoning_trace to __trace
Traces are now opt-in rather than automatic
Traces capture the full message history (system/user/assistant), including retry conversations
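
Downstream code that looks up trace columns by name needs to follow the postfix rename. A minimal sketch, assuming you hold the column names as a plain list; rename_trace_columns is a hypothetical helper, not a Data Designer function. It only updates names: the new columns hold full message histories rather than reasoning text, so consumers of the values still need adjusting.

```python
LEGACY_POSTFIX = "__reasoning_trace"
NEW_POSTFIX = "__trace"


def rename_trace_columns(columns):
    """Map legacy trace column names to the new postfix; leave others untouched."""
    renamed = []
    for name in columns:
        if name.endswith(LEGACY_POSTFIX):
            # Swap only the trailing postfix so the base column name is preserved.
            renamed.append(name[: -len(LEGACY_POSTFIX)] + NEW_POSTFIX)
        else:
            renamed.append(name)
    return renamed


expected_columns = rename_trace_columns(["question", "answer", "answer__reasoning_trace"])
```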

Before (v0.3.x):

Reasoning traces were automatically generated as side-effect columns for extended thinking models:

# Traces were automatic - no configuration needed
# Column "answer" would automatically produce "answer__reasoning_trace"

After (v0.4.x):

Enable traces explicitly per-column or globally:

Per-column (recommended):

import data_designer.config as dd

config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="answer",
        prompt="Answer: {{ question }}",
        model_alias="nvidia-text",
        with_trace=True,  # Opt-in to trace capture
    )
)
# Produces "answer" and "answer__trace" columns

Global debug override:

import data_designer.config as dd
from data_designer.interface import DataDesigner

data_designer = DataDesigner()
data_designer.set_run_config(
    dd.RunConfig(debug_override_save_all_column_traces=True)
)

The trace data structure is now a list[dict] capturing the ordered message history:

[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4", "reasoning_content": None}
]
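
Given that structure, pulling the final answer and any reasoning out of a trace is a short loop. The sketch below works on plain Python data shaped like the example above; last_assistant_message is a hypothetical helper name, and the assumption is that each message is a dict with at least a "role" key.

```python
def last_assistant_message(trace):
    """Return the most recent assistant message in a trace, or None if absent."""
    for message in reversed(trace):
        if message.get("role") == "assistant":
            return message
    return None


trace = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4", "reasoning_content": None},
]

final = last_assistant_message(trace)
answer = final["content"] if final else None
# reasoning_content may be missing or None, so read it defensively.
reasoning = final.get("reasoning_content") if final else None
```

Searching from the end matters because traces can include retry conversations, so earlier assistant messages may be rejected attempts.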

What's Changed

feat: Add /create-pr skill for well-formatted GitHub PRs by @johnnygreco in #247
docs: Fix mkdocs syntax and update person sampling documentation by @johnnygreco in #249
refactor: slim package refactor into three subpackages by @johnnygreco in #240
chore: add publish script and update license headers by @johnnygreco in #253
chore: add CODEOWNERS for automatic PR review assignment by @andreatgretel in #251
feat: allow skipping health checks by @nabinchha in #244
chore: copy README to data-designer package during install by @johnnygreco in #256
feat: support multiple images per column in image context by @nabinchha in #257
fix: escape special characters in SchemaTransformProcessor JSON templates by @andreatgretel in #250
chore: update telemetry by @johntmyers in #261
feat: add /update-pr skill and improve /create-pr file linking by @johnnygreco in #258
feat: Add /commit skill for conventional commit messages by @johnnygreco in #252
fix: automate README sync for data-designer package builds by @andreatgretel in #266
chore: simplify publish script by removing redundant rebuild step by @johnnygreco in #268
feat: add job progress logging for cell-by-cell generation by @eric-tramel in #259
feat: add message trace support for LLM generation by @johnnygreco in #272
chore: add animated emoji progress indicators to progress tracker by @johnnygreco in #273
feat: Add Phase 1 languages (Bash, C, C++, C#, COBOL) to CodeLang by @kirit93 in #271
fix: ensure 100% progress is logged exactly once by @johnnygreco in #276

Full Changelog: v0.3.8...v0.4.0

Contributors

eric-tramel, nabinchha, and 4 other contributors

27 Jan 02:28

johnnygreco

v0.3.8

5402b7d

v0.3.8 2026-01-26

👀 New Nemotron-Personas Datasets

PersonSampler supports two new locales:

Nemotron-Personas-Singapore (locale = en_SG)
Nemotron-Personas-Brazil (locale = pt_BR)

What's Changed

fix: unblock generation when no from-scratch-generator is configured by @nabinchha in #231
fix: do not attempt to deserialize llm text response by @nabinchha in #233
docs: Updated recipe card by @kirit93 in #153
fix: no api key warning on default model providers by @nabinchha in #238
feat: Support for Claude Skills (DevX and Generation) by @eric-tramel in #239
feat: Elevate non-LLM concurrency limits to RunConfig by @eric-tramel in #242
feat: wire up pt_GB and en_SG personas by @johnnygreco in #245

Full Changelog: v0.3.7...v0.3.8

Contributors

eric-tramel, nabinchha, and 2 other contributors

17 Jan 21:53

nabinchha

v0.3.7

a3a9313

v0.3.7 2026-01-17

🎨 NeMo Data Designer v0.3.7 Release Notes

Restores lazy load changes introduced in PR-222 to litellm_overrides.py that led to intermittent import issues.

What's Changed

fix: restore lazy load litellm overrides changes by @nabinchha in #229

Full Changelog: v0.3.6...v0.3.7

Contributors

nabinchha

17 Jan 00:55

nabinchha

v0.3.6

eb5ef27

v0.3.6 2026-01-16

🎨 NeMo Data Designer v0.3.6 Release Notes

Fixes a regression introduced in PR-222 that wasn't caught by our tests.

What's Changed

fix: incorrect litellm lazy load for class extension by @nabinchha in #228

Full Changelog: v0.3.5...v0.3.6

Contributors

nabinchha

16 Jan 22:47

johnnygreco

v0.3.5

d97d52c

v0.3.5 2026-01-16

🎨 NeMo Data Designer v0.3.5 Release Notes

💥 Breaking Change: Plugins

We have made some updates to the task and column generation abstractions, which come with some breaking changes for plugin developers.

No more GeneratorMetadata

We have completely removed the GeneratorMetadata object (as well as its parent ConfigurableTaskMetadata object). This means you no longer need to define a metadata static method when creating a column generator implementation.

As part of this refactor, we have added two new subclasses to use for different generation strategies:

data_designer.engine.column_generators.generators.base.ColumnGeneratorFullColumn
data_designer.engine.column_generators.generators.base.ColumnGeneratorCellByCell

Before (v0.3.4)

from data_designer.engine.column_generators.generators.base import (
    ColumnGenerator,
    GenerationStrategy,
    GeneratorMetadata,
)

class IndexMultiplierColumnGenerator(ColumnGenerator[IndexMultiplierColumnConfig]):
    @staticmethod
    def metadata() -> GeneratorMetadata:
        """Define metadata about this generator."""
        return GeneratorMetadata(
            name="index-multiplier",
            description="Generates values by multiplying the row index by a user-specified multiplier",
            generation_strategy=GenerationStrategy.FULL_COLUMN,
        )
    
    # implementation below
    ...

After (v0.3.5)

from data_designer.engine.column_generators.generators.base import ColumnGeneratorFullColumn

class IndexMultiplierColumnGenerator(ColumnGeneratorFullColumn[IndexMultiplierColumnConfig]):

    # implementation below
    ...

required_columns and side_effect_columns now must be explicitly defined on classes that inherit from SingleColumnConfig

Before (v0.3.4)

from typing import Literal

from data_designer.config.column_configs import SingleColumnConfig

class IndexMultiplierColumnConfig(SingleColumnConfig):
    """Configuration for the index multiplier column generator."""

    # Configurable parameter for this plugin
    multiplier: int = 2

    # Required: discriminator field with a unique Literal type
    # This value identifies your plugin and becomes its column_type
    column_type: Literal["index-multiplier"] = "index-multiplier"

After (v0.3.5)

from typing import Literal

from data_designer.config.column_configs import SingleColumnConfig

class IndexMultiplierColumnConfig(SingleColumnConfig):
    """Configuration for the index multiplier column generator."""

    # Configurable parameter for this plugin
    multiplier: int = 2

    # Required: discriminator field with a unique Literal type
    # This value identifies your plugin and becomes its column_type
    column_type: Literal["index-multiplier"] = "index-multiplier"

    @property
    def required_columns(self) -> list[str]:
        return []

    @property
    def side_effect_columns(self) -> list[str]:
        return []

While the updated version is more verbose, it will ensure column generator developers are aware of these properties, which are essential for building a working generator.
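
To illustrate why an explicit definition helps, here is a small self-contained sketch using abstract properties from Python's abc module. SingleColumnConfigSketch is a stand-in written for this example and is not the real SingleColumnConfig; whether Data Designer enforces the requirement with this exact mechanism is an assumption. The source_column field and the "__source" side-effect name are also hypothetical.

```python
from abc import ABC, abstractmethod


class SingleColumnConfigSketch(ABC):
    """Stand-in base class for illustration only (not the library class)."""

    def __init__(self, name: str) -> None:
        self.name = name

    @property
    @abstractmethod
    def required_columns(self) -> list[str]:
        """Columns that must exist before this column can be generated."""

    @property
    @abstractmethod
    def side_effect_columns(self) -> list[str]:
        """Extra columns this generator writes alongside its main column."""


class IncompleteConfig(SingleColumnConfigSketch):
    # Forgets side_effect_columns, so instantiation fails with TypeError.
    @property
    def required_columns(self) -> list[str]:
        return []


class ScaledColumnConfig(SingleColumnConfigSketch):
    def __init__(self, name: str, source_column: str) -> None:
        super().__init__(name)
        self.source_column = source_column  # hypothetical field

    @property
    def required_columns(self) -> list[str]:
        return [self.source_column]

    @property
    def side_effect_columns(self) -> list[str]:
        return [f"{self.name}__source"]  # hypothetical side-effect column


config = ScaledColumnConfig(name="scaled", source_column="index")
```

With abstract properties, a config that omits either property cannot be instantiated at all, so the omission surfaces immediately rather than as a missing-column error during generation.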

Removed emoji from the Plugin object

Now that plugins support more than column generators, the emoji field is not always applicable.

Before (v0.3.4)

from data_designer.plugins import Plugin, PluginType

# Plugin instance - this is what gets loaded via entry point
plugin = Plugin(
    impl_qualified_name="data_designer_index_multiplier.plugin.IndexMultiplierColumnGenerator",
    config_qualified_name="data_designer_index_multiplier.plugin.IndexMultiplierColumnConfig",
    plugin_type=PluginType.COLUMN_GENERATOR,
    emoji="🔌",
)

After (v0.3.5)

from data_designer.plugins import Plugin, PluginType

# Plugin instance - this is what gets loaded via entry point
plugin = Plugin(
    impl_qualified_name="data_designer_index_multiplier.plugin.IndexMultiplierColumnGenerator",
    config_qualified_name="data_designer_index_multiplier.plugin.IndexMultiplierColumnConfig",
    plugin_type=PluginType.COLUMN_GENERATOR,
)

What's Changed

fix: dataset metadata should be optional in PreviewResults by @andreatgretel in #223
refactor: remove task metadata property by @johnnygreco in #216
refactor: update single column base class by @johnnygreco in #206
chore: lazy 3rd party imports by @nabinchha in #222
fix: post merge issues by @nabinchha in #224
chore: minor readme updates by @johnnygreco in #225
chore: streamline generation metadata + consolidate sdg json by @nabinchha in #226

Full Changelog: v0.3.4...v0.3.5

Contributors

nabinchha, johnnygreco, and andreatgretel

14 Jan 22:37

johnnygreco

v0.3.4

3f18e75

v0.3.4 2026-01-14

What's Changed

fix: hard-disable early shutdown when RunConfig.disable_early_shutdown=true by @eric-tramel in #203
chore: upgrade numpy in uv.lock by @johnnygreco in #202
fix: update example runner command with notebooks dep group by @johnnygreco in #204
chore: Add SkipJsonSchema annotation to DF seed source by @mikeknep in #205
feat: Plumb LLM retry controls through RunConfig by @eric-tramel in #208
feat: move buffer size control to RunConfig by @eric-tramel in #209
fix: seed columns do not show up in display_sample_record by @andreatgretel in #213
docs: Added top models pie chart by @kirit93 in #217
chore: rename e2e_tests to tests_e2e by @johnnygreco in #214
chore: Set json_schema_mode_override validation on ConfigBase by @mikeknep in #220
fix: gracefully handle empty buffers in the dataset builder by @johnnygreco in #221

Full Changelog: v0.3.3...v0.3.4

Contributors

eric-tramel, mikeknep, and 3 other contributors

12 Jan 22:59

johnnygreco

v0.3.3

1f9de2c

v0.3.3 2026-01-12

What's Changed

chore: Bump rich to 14.x series by @mikeknep in #196
chore: add issue templates by @johnnygreco in #197
chore: minor issue template tweaks by @johnnygreco in #198
chore: add make commands to run examples as e2e tests by @johnnygreco in #199
fix: early shutdown race condition by @nabinchha in #201

Full Changelog: v0.3.2...v0.3.3

Contributors

mikeknep, nabinchha, and johnnygreco

09 Jan 22:46

johnnygreco

v0.3.2

f8c201e

v0.3.2 2026-01-09

Breaking plugin changes

required_resources has been removed from task metadata objects.
There are two new column generator base classes to streamline model usage:

In data_designer.engine.column_generators.generators.base:

class ColumnGeneratorWithModelRegistry(ColumnGenerator[TaskConfigT], ABC):
    @property
    def model_registry(self) -> ModelRegistry:
        return self.resource_provider.model_registry

    def get_model(self, model_alias: str) -> ModelFacade:
        return self.model_registry.get_model(model_alias=model_alias)

    def get_model_config(self, model_alias: str) -> ModelConfig:
        return self.model_registry.get_model_config(model_alias=model_alias)

    def get_model_provider_name(self, model_alias: str) -> str:
        provider = self.model_registry.get_model_provider(model_alias=model_alias)
        return provider.name


class ColumnGeneratorWithModel(ColumnGeneratorWithModelRegistry[TaskConfigT], ABC):
    @functools.cached_property
    def model(self) -> ModelFacade:
        return self.get_model(model_alias=self.config.model_alias)

    @functools.cached_property
    def model_config(self) -> ModelConfig:
        return self.get_model_config(model_alias=self.config.model_alias)

    @functools.cached_property
    def inference_parameters(self) -> BaseInferenceParams:
        return self.model_config.inference_parameters

    def log_pre_generation(self) -> None:
        logger.info(f"{self.config.column_type} model configuration for generating column '{self.config.name}'")
        logger.info(f"  |-- model: {self.model_config.model!r}")
        logger.info(f"  |-- model alias: {self.config.model_alias!r}")
        logger.info(f"  |-- model provider: {self.get_model_provider_name(model_alias=self.config.model_alias)!r}")
        logger.info(f"  |-- inference parameters: {self.inference_parameters.format_for_display()}")

If you need to use models in your generator, subclass from one of these base classes.

What's Changed

refactor: update required resources treatment and use subclasses over mixins by @johnnygreco in #184
feat: Seed dataset plugins by @mikeknep in #191
chore: update header script to check for diffs by @johnnygreco in #195

Full Changelog: v0.3.1...v0.3.2

Contributors

mikeknep and johnnygreco

Releases: NVIDIA-NeMo/DataDesigner
