
[data, docs] feat: Add fast dataloading configs & documentation #3351

Open
asolergi-nv wants to merge 1 commit into main from fast_dataloading

Conversation

@asolergi-nv

@asolergi-nv asolergi-nv commented Apr 16, 2026

What does this PR do ?

Expose the three dataloader initialization acceleration features added in Megatron-LM PR #2445 to Megatron-Bridge users. These features can reduce dataset initialization time from minutes to seconds on multi-node clusters by eliminating filesystem checks, synchronization barriers, and redundant index file reads.

Changelog

  • src/megatron/bridge/training/config.py
    • Added per_dataset_sequences_path field to GPTDatasetConfig — a convenience path to a JSON file containing precomputed sequence and document counts per dataset. The JSON is loaded automatically during finalize() and stored as sequences_per_dataset on the MCore config, following the same pattern as the existing data_path → blend conversion.
  • src/megatron/bridge/training/setup.py
    • Added token_dtype_code computation after the tokenizer is set on the dataset config. Bridge intentionally skips MCoreGPTDatasetConfig.__post_init__() (the tokenizer is unavailable at finalize() time), which means token_dtype_code — required by MCore's _IndexReader when sequences_per_dataset is provided — was never computed. The new code derives it from tokenizer.vocab_size right after the tokenizer is assigned.
  • docs/performance-guide.md
    • Added a new Dataloader Initialization Performance section documenting all three features with their constraints
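The finalize() conversion described in the changelog might look roughly like the following sketch. The field names per_dataset_sequences_path and sequences_per_dataset come from the changelog; the standalone helper and the JSON shape shown in the docstring are illustrative assumptions, not confirmed by the PR.

```python
import json


def load_per_dataset_sequences(cfg) -> None:
    """Sketch of the per_dataset_sequences_path handling added in finalize().

    Assumes the JSON file maps dataset prefixes to precomputed counts, e.g.
    {"my-dataset": {"sequences": 123456, "documents": 7890}}. The exact
    on-disk shape is an assumption, not confirmed by the PR.
    """
    if cfg.per_dataset_sequences_path is not None and cfg.sequences_per_dataset is None:
        with open(cfg.per_dataset_sequences_path) as f:
            # Stored on the config so the dataset builder can skip re-reading
            # every index file just to count sequences and documents.
            cfg.sequences_per_dataset = json.load(f)
```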

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • Documentation

    • Added dataloader initialization performance optimization guide covering caching, memory-mapping, and sequence count configuration.
  • New Features

    • Support for loading per-dataset sequence counts from JSON files.
    • Automatic token data type handling based on vocabulary size.

@copy-pr-bot

copy-pr-bot bot commented Apr 16, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Apr 16, 2026

📝 Walkthrough

Walkthrough

These changes implement dataloader initialization performance optimizations for Megatron-Bridge. A new configuration field enables loading pre-computed sequence counts from JSON, initialization logic derives token dtype from tokenizer vocabulary size, and comprehensive documentation explains the optimization techniques and their constraints.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Documentation**<br>`docs/performance-guide.md` | Added "Dataloader Initialization Performance" section describing three optimization techniques: fast cache loading, deferred memory-mapping of index files, and precomputed sequence/document counts. Includes configuration examples and extends the tuning knobs reference index. |
| **Dataset Configuration**<br>`src/megatron/bridge/training/config.py` | Added `json` import and new `per_dataset_sequences_path` field to `GPTDatasetConfig`. Updated `finalize()` to load a JSON file containing sequence counts when the field is set and `sequences_per_dataset` is uninitialized. |
| **Training Setup**<br>`src/megatron/bridge/training/setup.py` | Added initialization logic to derive `token_dtype_code` from the tokenizer's `vocab_size` using the NumPy `uint16` capacity threshold, assigning 4 for larger vocabularies and 8 otherwise. |
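The vocab-size threshold summarized above can be sketched as follows. This is a minimal illustration: the helper name is hypothetical, and the codes 4 and 8 are taken from the change summary, where they are assumed to correspond to int32 and uint16 token storage respectively.

```python
# Capacity of uint16 token storage (equal to numpy.iinfo(numpy.uint16).max).
UINT16_MAX = 2**16 - 1


def infer_token_dtype_code(vocab_size: int) -> int:
    """Derive the indexed-dataset token dtype code from the vocabulary size.

    Code 8 is assumed to mean uint16 (token ids fit in 16 bits) and
    code 4 to mean int32, per the change summary above.
    """
    if vocab_size is None:
        raise ValueError("vocab_size is required to derive token_dtype_code")
    return 4 if vocab_size > UINT16_MAX else 8
```

For example, a 32k-token vocabulary fits in uint16 (code 8), while a 128k-token vocabulary needs int32 (code 4).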

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 inconclusive

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Test Results For Major Changes | ❓ Inconclusive | The PR appears to be a performance-related feature addition to config.py, but the provided context does not contain evidence of test results, test files, or testing documentation being included in the PR. | Request the actual PR content, test files, or test results documentation to verify whether testing information was included with this performance feature addition. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding fast dataloading configurations and documentation to Megatron-Bridge. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/megatron/bridge/training/setup.py`:
- Around line 198-204: The current branch that skips setting
cfg.dataset.token_dtype_code when tokenizer.vocab_size is None should fail fast:
if cfg.dataset.token_dtype_code is None and tokenizer.vocab_size is None, raise
a clear exception (e.g., ValueError) explaining that vocab_size is required to
compute token_dtype_code for sequence-count/precomputed-sequence-metadata
optimization; update the logic around cfg.dataset.token_dtype_code and
tokenizer.vocab_size to perform this check and raise the error rather than
silently continuing so callers of sequence-count optimization get an immediate,
actionable message.
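Applied to the setup code, the fail-fast behavior suggested in the review comment above might look like this sketch. The attribute names mirror the comment; the config and tokenizer objects are stand-ins, and the dtype codes 4 and 8 are assumptions carried over from the change summary.

```python
UINT16_MAX = 2**16 - 1  # uint16 capacity threshold from the change summary


def ensure_token_dtype_code(dataset_cfg, tokenizer) -> None:
    """Derive token_dtype_code, failing fast instead of silently skipping.

    Raises ValueError when vocab_size is unavailable, so callers of the
    sequence-count optimization get an immediate, actionable message.
    """
    if dataset_cfg.token_dtype_code is not None:
        return  # already set explicitly; nothing to derive
    if tokenizer.vocab_size is None:
        raise ValueError(
            "tokenizer.vocab_size is required to compute token_dtype_code "
            "for the sequences_per_dataset optimization"
        )
    dataset_cfg.token_dtype_code = 4 if tokenizer.vocab_size > UINT16_MAX else 8
```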

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: d284a0ea-3981-4d8b-ab32-d804c41e75e0

📥 Commits

Reviewing files that changed from the base of the PR and between b8b13d3 and eae3555.

📒 Files selected for processing (3)
  • docs/performance-guide.md
  • src/megatron/bridge/training/config.py
  • src/megatron/bridge/training/setup.py

Comment thread src/megatron/bridge/training/setup.py
@yaoyu-33 added the `feature` (New capabilities, enhancements, or enablement work), `area:data` (Dataset builders, preprocessing, and samplers), and `needs-review` (PR is ready for code review and waiting on a reviewer) labels on Apr 16, 2026
Comment thread docs/performance-guide.md
> ```bash
> python3 tools/build_sequences_per_dataset.py \
>     --per-split-data-args-path my-dataset-blend.json \
>     --per-dataset-sequences-path my-sequences-per-dataset.json
> ```
Contributor

how do I get my-sequences-per-dataset.json, given a blend.json?

@yaoyu-33 added the `needs-follow-up` (Issue needs follow-up) label and removed the `needs-review` (PR is ready for code review and waiting on a reviewer) label on Apr 16, 2026
