add FIM dataset support #2066
Conversation
Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
/ok to test c129e61
📝 Walkthrough

Introduces support for Fill-In-the-Middle (FIM) dataset configuration by adding a GPTFIMDatasetConfig class, integrating it into the dataset provider registry, and updating dataset selection logic to recognize FIM-specific attributes during provisioning.
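For orientation, here is a minimal sketch of what attribute-based selection could look like on the provider side; the helper name, the `GPTFIMDataset` class and its import path, and the wiring are assumptions for illustration, not the PR's actual code. The only detail taken from the diff is the `fim_data` flag set in `GPTFIMDatasetConfig.__init__`.

```python
# Illustrative sketch only: choose the dataset class from FIM-specific config attributes.
# GPTFIMDataset and its import path are assumed; the real provisioning code may differ.
from megatron.core.datasets.gpt_dataset import GPTDataset


def pick_dataset_class(dataset_config):
    """Return the dataset class to build for a given dataset config (hypothetical helper)."""
    if getattr(dataset_config, "fim_data", False):
        from megatron.training.datasets.fim_dataset import GPTFIMDataset  # assumed location
        return GPTFIMDataset
    return GPTDataset
```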
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ❌ 2 failed (2 warnings) | ✅ 2 passed
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/megatron/bridge/training/config.py`:
- Around lines 453-486: The dataset config GPTFIMDatasetConfig can bypass
deferred finalization and seq-length validation because ConfigContainer.validate
only finalizes GPTDatasetConfig; update ConfigContainer.validate to explicitly
handle GPTFIMDatasetConfig by calling dataset.finalize() for instances of
(GPTDatasetConfig, GPTFIMDatasetConfig) and include GPTFIMDatasetConfig in the
seq-length check alongside GPTDatasetConfig and FinetuningDatasetConfig (i.e.,
ensure data_seq_length reads from self.dataset.seq_length for
GPTFIMDatasetConfig as well), or alternatively make GPTFIMDatasetConfig inherit
from GPTDatasetConfig so existing checks apply.
In `@tests/unit_tests/training/test_config.py`:
- Around lines 313-352: The file defines two classes with the same name
TestMockGPTDatasetConfig causing the latter to overwrite the former and hide the
GPTFIM tests; rename the second TestMockGPTDatasetConfig class to a unique name
(e.g., TestMockGPTDatasetConfigFallback or TestMockGPTDatasetConfigAlternate) so
both classes are distinct, update any internal references if present, and run
the tests to ensure the GPTFIM tests (the class that instantiates
GPTFIMDatasetConfig and asserts its mixins) are collected.
```python
@dataclass
class GPTFIMDatasetConfig(MCoreGPTFIMDatasetConfig, DataloaderConfig):
    """Megatron Core GPTFIMDatasetConfig with deferred post-init.

    This class inherits from MCore's GPTFIMDatasetConfig and DataloaderConfig but defers the
    execution of post_init() until finalize() is explicitly called. This allows
    for field modifications after construction but before computed fields are calculated.
    """

    def __init__(
        self,
        seq_length: int | None = None,
        skip_getting_attention_mask_from_dataset: bool = True,
        *args,
        **kwargs,
    ):
        """
        Args:
            seq_length (int | None): the sequence length. If not provided, `sequence_length` must be in kwargs.
            skip_getting_attention_mask_from_dataset (bool): if set, the dataset will pass a None attention mask
                and the attention mask is autogenerated from the attn backend.
        """
        self.fim_data = True
        self.skip_getting_attention_mask_from_dataset = skip_getting_attention_mask_from_dataset

        if seq_length is not None:
            kwargs["sequence_length"] = seq_length
        elif "sequence_length" not in kwargs:
            raise ValueError("Either `seq_length` or `sequence_length` must be provided.")

        dataloader_kwargs = {k: kwargs.pop(k) for k in list(kwargs) if k in DataloaderConfig.__dataclass_fields__}
        MCoreGPTFIMDatasetConfig.__init__(self, *args, **kwargs)
        DataloaderConfig.__init__(self, **dataloader_kwargs)
```
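For reference, a minimal construction sketch of the class above; the field values are placeholders borrowed from the unit test further down, and the surrounding training configuration is omitted.

```python
# Minimal usage sketch: build the config, adjust fields if needed, then finalize to run
# the deferred post-init. Values here are placeholders, not recommended settings.
fim_dataset_cfg = GPTFIMDatasetConfig(
    random_seed=1234,
    seq_length=512,
    fim_rate=0.1,
    fim_no_prefix="test",
    fim_extra_tokens={"middle": "<middle>"},
    fim_split_sample="test sample",
    reset_position_ids=False,
    reset_attention_mask=False,
    eod_mask_loss=False,
)
fim_dataset_cfg.finalize()  # computed fields are only derived once finalize() is called
```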
GPTFIMDatasetConfig may skip deferred finalization/validation.
ConfigContainer.validate() only finalizes GPTDatasetConfig and only runs the seq-length check for (GPTDatasetConfig, FinetuningDatasetConfig). Because GPTFIMDatasetConfig doesn’t inherit GPTDatasetConfig, it can bypass deferred __post_init__ and seq-length validation. Consider explicitly including GPTFIMDatasetConfig in those checks (or make it extend GPTDatasetConfig if feasible).
```python
# in ConfigContainer.validate()
if isinstance(self.dataset, (GPTDatasetConfig, GPTFIMDatasetConfig)):
    self.dataset.finalize()
# ...
if isinstance(self.dataset, (GPTDatasetConfig, GPTFIMDatasetConfig, FinetuningDatasetConfig)):
    data_seq_length = self.dataset.seq_length
```
🧰 Tools
🪛 Ruff (0.14.14)
481-481: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In `@src/megatron/bridge/training/config.py` around lines 453-486: The dataset
config GPTFIMDatasetConfig can bypass deferred finalization and seq-length
validation because ConfigContainer.validate only finalizes GPTDatasetConfig;
update ConfigContainer.validate to explicitly handle GPTFIMDatasetConfig by
calling dataset.finalize() for instances of (GPTDatasetConfig,
GPTFIMDatasetConfig) and include GPTFIMDatasetConfig in the seq-length check
alongside GPTDatasetConfig and FinetuningDatasetConfig (i.e., ensure
data_seq_length reads from self.dataset.seq_length for GPTFIMDatasetConfig as
well), or alternatively make GPTFIMDatasetConfig inherit from GPTDatasetConfig
so existing checks apply.
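As a complementary illustration, a more duck-typed variant of the seq-length guard is sketched below; `self.model.seq_length` and the error wording are assumptions about the surrounding ConfigContainer, not code from the repository.

```python
# Sketch of an attribute-based variant of the seq-length check, so any dataset config that
# exposes seq_length (including GPTFIMDatasetConfig) is covered without growing the isinstance tuple.
# The model-side field name (self.model.seq_length) is assumed for illustration.
data_seq_length = getattr(self.dataset, "seq_length", None)
if data_seq_length is not None and data_seq_length != self.model.seq_length:
    raise ValueError(
        f"dataset.seq_length ({data_seq_length}) does not match model.seq_length ({self.model.seq_length})"
    )
```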
```python
class TestMockGPTDatasetConfig:
    """Tests desired behavior for GPTFIMDatasetConfig."""

    def test_initialization(self):
        config = GPTFIMDatasetConfig(
            random_seed=1234,
            seq_length=512,
            fim_rate=0.1,
            fim_no_prefix="test",
            fim_extra_tokens={"middle": "<middle>"},
            fim_split_sample="test sample",
            reset_position_ids=False,
            reset_attention_mask=False,
            eod_mask_loss=False,
        )
        config.finalize()

        # Should be an instance of both GPTDatasetConfig and GPTFIMDatasetConfig
        from megatron.core.datasets.blended_megatron_dataset_config import BlendedMegatronDatasetConfig
        from megatron.training.datasets.fim_dataset import GPTFIMDatasetConfig as MCoreGPTFIMDatasetConfig

        assert isinstance(config, GPTFIMDatasetConfig)
        assert isinstance(config, GPTDatasetConfig)
        assert isinstance(config, MCoreGPTFIMDatasetConfig)
        assert isinstance(config, BlendedMegatronDatasetConfig)

        # Should have all the expected fields from parent class
        assert hasattr(config, "random_seed")
        assert hasattr(config, "seq_length")
        assert hasattr(config, "path_to_cache")

        # Verify all the expected fields were set properly
        assert config.fim_data
        assert config.fim_rate == 0.1
        assert config.fim_no_prefix == "test"
        assert config.fim_split_sample == "test sample"
        assert config.fim_extra_tokens["middle"] == "<middle>"


class TestMockGPTDatasetConfig:
```
Duplicate test class name hides GPTFIM tests.
There are two TestMockGPTDatasetConfig classes; the latter overwrites the former at import time, so the GPTFIM tests are not collected. Rename the new class to avoid masking.
✅ Suggested rename

```diff
-class TestMockGPTDatasetConfig:
+class TestGPTFIMDatasetConfig:
```
🧰 Tools
🪛 Ruff (0.14.14)
352-352: Redefinition of unused TestMockGPTDatasetConfig from line 313: TestMockGPTDatasetConfig redefined here
(F811)
🤖 Prompt for AI Agents
In `@tests/unit_tests/training/test_config.py` around lines 313-352: The file
defines two classes with the same name TestMockGPTDatasetConfig causing the
latter to overwrite the former and hide the GPTFIM tests; rename the second
TestMockGPTDatasetConfig class to a unique name (e.g.,
TestMockGPTDatasetConfigFallback or TestMockGPTDatasetConfigAlternate) so both
classes are distinct, update any internal references if present, and run the
tests to ensure the GPTFIM tests (the class that instantiates
GPTFIMDatasetConfig and asserts its mixins) are collected.
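To see why the duplicate name silently drops tests, here is a tiny standalone illustration (a hypothetical module, unrelated to the repository's code):

```python
# Python rebinds a class name to its latest definition, so the first class
# (and every test method inside it) becomes unreachable and is never collected by pytest.
class TestExample:
    def test_first(self):
        assert True


class TestExample:  # noqa: F811 - rebinds the name; the class above is now shadowed
    def test_second(self):
        assert True


# Only the second definition survives at module scope.
assert [m for m in dir(TestExample) if m.startswith("test_")] == ["test_second"]
```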
What does this PR do?
Add a one line overview of what this PR aims to accomplish.
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information
Summary by CodeRabbit
New Features
Tests