
[data, docs] feat: Add fast dataloading configs & documentation #3351

Open
asolergi-nv wants to merge 1 commit into main from fast_dataloading

Conversation

@asolergi-nv

@asolergi-nv asolergi-nv commented Apr 16, 2026

What does this PR do ?

Expose the three dataloader initialization acceleration features added in Megatron-LM PR #2445 to Megatron-Bridge users. These features can reduce dataset initialization time from minutes to seconds on multi-node clusters by eliminating filesystem checks, synchronization barriers, and redundant index file reads.

Changelog

  • src/megatron/bridge/training/config.py
    • Added per_dataset_sequences_path field to GPTDatasetConfig — a convenience path to a JSON file containing precomputed sequence and document counts per dataset. The JSON is loaded automatically during finalize() and stored as sequences_per_dataset on the MCore config, following the same pattern as the existing data_path → blend conversion.
  • src/megatron/bridge/training/setup.py
    • Added token_dtype_code computation after the tokenizer is set on the dataset config. Bridge intentionally skips MCoreGPTDatasetConfig.__post_init__() (the tokenizer is unavailable at finalize() time), which means token_dtype_code — required by MCore's _IndexReader when sequences_per_dataset is provided — was never computed. The new code derives it from tokenizer.vocab_size right after the tokenizer is assigned.
  • docs/performance-guide.md
    • Added a new Dataloader Initialization Performance section documenting all three features with their constraints
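The finalize() conversion described in the changelog might look roughly like the following sketch. The field names per_dataset_sequences_path and sequences_per_dataset come from the changelog; the standalone helper and the JSON shape shown in the docstring are illustrative assumptions, not confirmed by the PR.

```python
import json


def load_per_dataset_sequences(cfg) -> None:
    """Sketch of the per_dataset_sequences_path handling added in finalize().

    Assumes the JSON file maps dataset prefixes to precomputed counts, e.g.
    {"my-dataset": {"sequences": 123456, "documents": 7890}}. The exact
    on-disk shape is an assumption, not confirmed by the PR.
    """
    if cfg.per_dataset_sequences_path is not None and cfg.sequences_per_dataset is None:
        with open(cfg.per_dataset_sequences_path) as f:
            # Stored on the config so the dataset builder can skip re-reading
            # every index file just to count sequences and documents.
            cfg.sequences_per_dataset = json.load(f)
```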

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • Documentation

    • Added dataloader initialization performance optimization guide covering caching, memory-mapping, and sequence count configuration.
  • New Features

    • Support for loading per-dataset sequence counts from JSON files.
    • Automatic token data type handling based on vocabulary size.

@copy-pr-bot

copy-pr-bot bot commented Apr 16, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Apr 16, 2026

📝 Walkthrough

Walkthrough

These changes implement dataloader initialization performance optimizations for Megatron-Bridge. A new configuration field enables loading pre-computed sequence counts from JSON, initialization logic derives token dtype from tokenizer vocabulary size, and comprehensive documentation explains the optimization techniques and their constraints.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Documentation**<br>`docs/performance-guide.md` | Added "Dataloader Initialization Performance" section describing three optimization techniques: fast cache loading, deferred memory-mapping of index files, and precomputed sequence/document counts. Includes configuration examples and extends the tuning knobs reference index. |
| **Dataset Configuration**<br>`src/megatron/bridge/training/config.py` | Added `json` import and new `per_dataset_sequences_path` field to `GPTDatasetConfig`. Updated `finalize()` to load a JSON file containing sequence counts when the field is set and `sequences_per_dataset` is uninitialized. |
| **Training Setup**<br>`src/megatron/bridge/training/setup.py` | Added initialization logic to derive `token_dtype_code` from the tokenizer's `vocab_size` using the NumPy `uint16` capacity threshold, assigning 4 for larger vocabularies and 8 otherwise. |
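The vocab-size threshold summarized above can be sketched as follows. This is a minimal illustration: the helper name is hypothetical, and the codes 4 and 8 are taken from the change summary, where they are assumed to correspond to int32 and uint16 token storage respectively.

```python
# Capacity of uint16 token storage (equal to numpy.iinfo(numpy.uint16).max).
UINT16_MAX = 2**16 - 1


def infer_token_dtype_code(vocab_size: int) -> int:
    """Derive the indexed-dataset token dtype code from the vocabulary size.

    Code 8 is assumed to mean uint16 (token ids fit in 16 bits) and
    code 4 to mean int32, per the change summary above.
    """
    if vocab_size is None:
        raise ValueError("vocab_size is required to derive token_dtype_code")
    return 4 if vocab_size > UINT16_MAX else 8
```

For example, a 32k-token vocabulary fits in uint16 (code 8), while a 128k-token vocabulary needs int32 (code 4).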

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 inconclusive

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Test Results For Major Changes | ❓ Inconclusive | The PR appears to be a performance-related feature addition to config.py, but the provided context does not contain evidence of test results, test files, or testing documentation being included in the PR. | Request the actual PR content, test files, or test results documentation to verify whether testing information was included with this performance feature addition. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding fast dataloading configurations and documentation to Megatron-Bridge. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/megatron/bridge/training/setup.py`:
- Around line 198-204: The current branch that skips setting
cfg.dataset.token_dtype_code when tokenizer.vocab_size is None should fail fast:
if cfg.dataset.token_dtype_code is None and tokenizer.vocab_size is None, raise
a clear exception (e.g., ValueError) explaining that vocab_size is required to
compute token_dtype_code for sequence-count/precomputed-sequence-metadata
optimization; update the logic around cfg.dataset.token_dtype_code and
tokenizer.vocab_size to perform this check and raise the error rather than
silently continuing so callers of sequence-count optimization get an immediate,
actionable message.
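Applied to the setup code, the fail-fast behavior suggested in the review comment above might look like this sketch. The attribute names mirror the comment; the config and tokenizer objects are stand-ins, and the dtype codes 4 and 8 are assumptions carried over from the change summary.

```python
UINT16_MAX = 2**16 - 1  # uint16 capacity threshold from the change summary


def ensure_token_dtype_code(dataset_cfg, tokenizer) -> None:
    """Derive token_dtype_code, failing fast instead of silently skipping.

    Raises ValueError when vocab_size is unavailable, so callers of the
    sequence-count optimization get an immediate, actionable message.
    """
    if dataset_cfg.token_dtype_code is not None:
        return  # already set explicitly; nothing to derive
    if tokenizer.vocab_size is None:
        raise ValueError(
            "tokenizer.vocab_size is required to compute token_dtype_code "
            "for the sequences_per_dataset optimization"
        )
    dataset_cfg.token_dtype_code = 4 if tokenizer.vocab_size > UINT16_MAX else 8
```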

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: d284a0ea-3981-4d8b-ab32-d804c41e75e0

📥 Commits

Reviewing files that changed from the base of the PR and between b8b13d3 and eae3555.

📒 Files selected for processing (3)
  • docs/performance-guide.md
  • src/megatron/bridge/training/config.py
  • src/megatron/bridge/training/setup.py

Comment thread src/megatron/bridge/training/setup.py
@yaoyu-33 added the `feature` (New capabilities, enhancements, or enablement work), `area:data` (Dataset builders, preprocessing, and samplers), and `needs-review` (PR is ready for code review and waiting on a reviewer) labels on Apr 16, 2026
Comment thread docs/performance-guide.md
> ```bash
> python3 tools/build_sequences_per_dataset.py \
>     --per-split-data-args-path my-dataset-blend.json \
>     --per-dataset-sequences-path my-sequences-per-dataset.json
> ```
Contributor

how do I get my-sequences-per-dataset.json, given a blend.json?

@yaoyu-33 added the `needs-follow-up` (Issue needs follow-up) label and removed the `needs-review` (PR is ready for code review and waiting on a reviewer) label on Apr 16, 2026
