Summary
In mlpstorage 3.0.26 the BaseBenchmark.metadata writer and the
submission_checker readers disagree about the top-level keys under which
the merged DLIO parameters and user-override parameters are stored in
*_metadata.json. The writer was renamed as part of the fix for #365
(combined_params → parameters, params_dict → override_parameters)
but several reader sites in submission_checker/checks/ still look up the
old names. As a result, every check that depends on these dicts silently
sees {} and either:
- emits a spurious VIOLATION ("dataset parameters not found in
metadata", "record length is 0", "datagen size 0.00GiB < required …"), or
- silently no-ops a CLOSED/OPEN allow-list check that should have fired,
depending on whether the check fails-open or fails-closed when the dict is
empty.
This blocks mlpstorage validate from passing on any locally-produced
training submission, even when the run itself is correct.
Environment
|
|
mlpstorage version |
3.0.26 |
| Install |
pip install mlpstorage into a venv |
| Affected commands |
mlpstorage validate, mlpstorage reports reportgen (partial — reportgen also uses parameters, so it works) |
Reproduction
- Run any training benchmark to completion (e.g.
mlpstorage training run --model unet3d ...) so that
closed/<orgname>/results/<system>/training/unet3d/run/<ts>/training_unet3d_metadata.json is
written.
- Confirm the JSON contains the new key names:
python -c 'import json,sys; m=json.load(open(sys.argv[1])); print(sorted(k for k in m if "param" in k))' \
closed/<orgname>/results/<system>/training/unet3d/run/<ts>/training_unet3d_metadata.json
# → ['override_parameters', 'parameters']
- Run
mlpstorage validate <results-dir>.
- Observe spurious violations:
[3.1.1] trainingVerifyDatasizeUsage: dataset parameters not found in metadata
[3.1.2] trainingRecalculateDatasetSize: record length is 0, cannot calculate dataset size
[3.2.1] trainingDatagenMinimumSize: datagen size 0.00GiB is less than required 0.00GiB
Root cause
The writer at
mlpstorage_py/benchmarks/base.py lines 408–420
deliberately stores the merged config under parameters and the user-overrides under
override_parameters (this rename is referenced in the inline comment that
calls out #365):
# Parameters - YAML defaults with CLI overrides folded in (fixes #365).
# combined_params alone omits overrides added after __init__ (e.g.
# checkpoint.num_checkpoints_*), causing split-phase runs to double-count.
if hasattr(self, 'combined_params'):
metadata['parameters'] = self._apply_dotted_overrides(
self.combined_params, getattr(self, 'params_dict', {}))
else:
metadata['parameters'] = {}
# Override parameters - user-specified overrides only
if hasattr(self, 'params_dict'):
metadata['override_parameters'] = self.params_dict
else:
metadata['override_parameters'] = {}
But the reader-side checks under mlpstorage_py/submission_checker/checks/
still call metadata.get("combined_params", {}) / metadata.get("params_dict", {}).
Because the lookups silently default to {}, the downstream code paths
either crash with a record_length == 0 early-exit or pass the wrong
allow-list check.
Affected reader sites
| File |
Line |
Current lookup |
Should be |
submission_checker/checks/training_checks.py |
227 |
metadata.get("combined_params", {}) |
metadata.get("parameters", {}) |
submission_checker/checks/training_checks.py |
238 |
metadata.get("combined_params", {}) |
metadata.get("parameters", {}) |
submission_checker/checks/training_checks.py |
579 |
metadata.get("params_dict", {}) |
metadata.get("override_parameters", {}) |
submission_checker/checks/training_checks.py |
642 |
metadata.get("params_dict", {}) |
metadata.get("override_parameters", {}) |
submission_checker/checks/checkpointing_checks.py |
160 |
metadata.get("combined_params", {}) |
metadata.get("parameters", {}) |
submission_checker/checks/checkpointing_checks.py |
223 |
metadata.get("params_dict", {}) |
metadata.get("override_parameters", {}) |
submission_checker/checks/checkpointing_checks.py |
504 |
metadata.get("params_dict", {}) |
metadata.get("override_parameters", {}) |
submission_checker/checks/vdb_checks.py |
880 |
metadata.get("params_dict", {}) |
metadata.get("override_parameters", {}) |
submission_checker/checks/vdb_checks.py |
941 |
metadata.get("params_dict", {}) |
metadata.get("override_parameters", {}) |
The tests under tests/test_training_check_tool_injected_params.py and
tests/test_vdb_checks.py construct fixture metadata using the old
key names (params_dict, combined_params), so the existing test suite
does not catch this — the tests test the checker against itself, not
against the actual writer output.
Expected behaviour
mlpstorage validate against a clean, locally-produced submission tree
should pass all "parameter shape" checks (3.1.1, 3.1.2, 3.2.1, 3.6.2,
3.6.3 in the training suite; the corresponding checkpointing/VDB rules)
without manual JSON post-processing.
Summary
In
mlpstorage3.0.26 theBaseBenchmark.metadatawriter and thesubmission_checkerreaders disagree about the top-level keys under whichthe merged DLIO parameters and user-override parameters are stored in
*_metadata.json. The writer was renamed as part of the fix for #365(
combined_params→parameters,params_dict→override_parameters)but several reader sites in
submission_checker/checks/still look up theold names. As a result, every check that depends on these dicts silently
sees
{}and either:metadata", "record length is 0", "datagen size 0.00GiB < required …"), or
depending on whether the check fails-open or fails-closed when the dict is
empty.
This blocks
mlpstorage validatefrom passing on any locally-producedtraining submission, even when the run itself is correct.
Environment
mlpstorageversionpip install mlpstorageinto a venvmlpstorage validate,mlpstorage reports reportgen(partial — reportgen also usesparameters, so it works)Reproduction
mlpstorage training run --model unet3d ...) so thatclosed/<orgname>/results/<system>/training/unet3d/run/<ts>/training_unet3d_metadata.jsoniswritten.
mlpstorage validate <results-dir>.[3.1.1] trainingVerifyDatasizeUsage: dataset parameters not found in metadata[3.1.2] trainingRecalculateDatasetSize: record length is 0, cannot calculate dataset size[3.2.1] trainingDatagenMinimumSize: datagen size 0.00GiB is less than required 0.00GiBRoot cause
The writer at
mlpstorage_py/benchmarks/base.pylines 408–420deliberately stores the merged config under
parametersand the user-overrides underoverride_parameters(this rename is referenced in the inline comment thatcalls out #365):
But the reader-side checks under
mlpstorage_py/submission_checker/checks/still call
metadata.get("combined_params", {})/metadata.get("params_dict", {}).Because the lookups silently default to
{}, the downstream code pathseither crash with a
record_length == 0early-exit or pass the wrongallow-list check.
Affected reader sites
submission_checker/checks/training_checks.pymetadata.get("combined_params", {})metadata.get("parameters", {})submission_checker/checks/training_checks.pymetadata.get("combined_params", {})metadata.get("parameters", {})submission_checker/checks/training_checks.pymetadata.get("params_dict", {})metadata.get("override_parameters", {})submission_checker/checks/training_checks.pymetadata.get("params_dict", {})metadata.get("override_parameters", {})submission_checker/checks/checkpointing_checks.pymetadata.get("combined_params", {})metadata.get("parameters", {})submission_checker/checks/checkpointing_checks.pymetadata.get("params_dict", {})metadata.get("override_parameters", {})submission_checker/checks/checkpointing_checks.pymetadata.get("params_dict", {})metadata.get("override_parameters", {})submission_checker/checks/vdb_checks.pymetadata.get("params_dict", {})metadata.get("override_parameters", {})submission_checker/checks/vdb_checks.pymetadata.get("params_dict", {})metadata.get("override_parameters", {})The tests under
tests/test_training_check_tool_injected_params.pyandtests/test_vdb_checks.pyconstruct fixture metadata using the oldkey names (
params_dict,combined_params), so the existing test suitedoes not catch this — the tests test the checker against itself, not
against the actual writer output.
Expected behaviour
mlpstorage validateagainst a clean, locally-produced submission treeshould pass all "parameter shape" checks (3.1.1, 3.1.2, 3.2.1, 3.6.2,
3.6.3 in the training suite; the corresponding checkpointing/VDB rules)
without manual JSON post-processing.