[data, docs] feat: Add fast dataloading configs & documentation #3351
asolergi-nv wants to merge 1 commit into `main` from
Conversation
Signed-off-by: Antoni-Joan Solergibert <[email protected]>
📝 Walkthrough

These changes implement dataloader initialization performance optimizations for Megatron-Bridge. A new configuration field enables loading precomputed sequence counts from JSON, initialization logic derives the token dtype from the tokenizer vocabulary size, and comprehensive documentation explains the optimization techniques and their constraints.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks: ✅ 3 passed | ❌ 1 failed (inconclusive)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/megatron/bridge/training/setup.py`:
- Around line 198-204: The current branch that skips setting
cfg.dataset.token_dtype_code when tokenizer.vocab_size is None should fail fast:
if cfg.dataset.token_dtype_code is None and tokenizer.vocab_size is None, raise
a clear exception (e.g., ValueError) explaining that vocab_size is required to
compute token_dtype_code for sequence-count/precomputed-sequence-metadata
optimization; update the logic around cfg.dataset.token_dtype_code and
tokenizer.vocab_size to perform this check and raise the error rather than
silently continuing so callers of sequence-count optimization get an immediate,
actionable message.
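The suggested fail-fast behavior can be sketched as a standalone guard; the helper name and message below are illustrative, not Bridge's actual code:

```python
def check_token_dtype_inputs(token_dtype_code, vocab_size):
    # Hypothetical guard mirroring the suggested fix: if neither an explicit
    # token_dtype_code nor a tokenizer vocab_size is available, raise
    # immediately with an actionable message instead of silently continuing,
    # which would otherwise surface much later inside MCore's _IndexReader.
    if token_dtype_code is None and vocab_size is None:
        raise ValueError(
            "tokenizer.vocab_size is required to compute token_dtype_code "
            "for the precomputed sequence metadata optimization"
        )
```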
📒 Files selected for processing (3)

- docs/performance-guide.md
- src/megatron/bridge/training/config.py
- src/megatron/bridge/training/setup.py
```bash
python3 tools/build_sequences_per_dataset.py \
    --per-split-data-args-path my-dataset-blend.json \
    --per-dataset-sequences-path my-sequences-per-dataset.json
```
how do I get my-sequences-per-dataset.json, given a blend.json?
What does this PR do?
Expose the three dataloader initialization acceleration features added in Megatron-LM PR #2445 to Megatron-Bridge users. These features can reduce dataset initialization time from minutes to seconds on multi-node clusters by eliminating filesystem checks, synchronization barriers, and redundant index file reads.
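At its core, the precomputed-counts idea replaces per-rank filesystem scans with a single JSON read at startup. A minimal sketch of that pattern (the helper name and JSON schema are hypothetical, not the actual Bridge API):

```python
import json

def load_per_dataset_sequences(path):
    # Read precomputed per-dataset sequence/document counts from JSON once,
    # so initialization can skip per-rank filesystem checks and redundant
    # index file reads.
    if path is None:
        return None
    with open(path) as f:
        return json.load(f)
```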
Changelog

- `src/megatron/bridge/training/config.py`: Added a `per_dataset_sequences_path` field to `GPTDatasetConfig`, a convenience path to a JSON file containing precomputed sequence and document counts per dataset. The JSON is loaded automatically during `finalize()` and stored as `sequences_per_dataset` on the MCore config, following the same pattern as the existing `data_path` → `blend` conversion.
- `src/megatron/bridge/training/setup.py`: Added `token_dtype_code` computation after the tokenizer is set on the dataset config. Bridge intentionally skips `MCoreGPTDatasetConfig.__post_init__()` (the tokenizer is unavailable at `finalize()` time), which means `token_dtype_code`, required by MCore's `_IndexReader` when `sequences_per_dataset` is provided, was never computed. The new code derives it from `tokenizer.vocab_size` right after the tokenizer is assigned.
- `docs/performance-guide.md`: New documentation explaining the optimization techniques and their constraints.

GitHub Actions CI
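The `token_dtype_code` derivation amounts to picking the narrowest integer type that can hold every token id. The cutoff below follows the common convention that ids under 2**16 fit in uint16; the exact rule and code values MCore uses are assumptions here:

```python
def token_dtype_from_vocab_size(vocab_size):
    # Hypothetical helper: choose the smallest integer dtype wide enough
    # for every token id in the vocabulary. Small vocabularies let token
    # files be stored at half the width of int32.
    if vocab_size is None:
        raise ValueError("vocab_size is required to derive the token dtype")
    return "uint16" if vocab_size < 2**16 else "int32"
```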
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open a "Draft" PR.
Additional Information
Summary by CodeRabbit
Documentation
New Features