Summary
Rule 3.1.2 recomputes the minimum required dataset size from total cluster host memory. It derives "total host memory" by reading only the first element of the per-host host_memory_GB array and multiplying it by num_hosts:
    # training_checks.py
    host_memory_gb = summary.get("host_memory_GB", [0])[0]  # L164 -> index [0] only
    ...
    total_host_memory = num_hosts * host_memory_gb  # L180 -> assumes ALL hosts == host_memory_GB[0]
    min_samples_memory = (total_host_memory * HOST_MEMORY_MULTIPLIER * 1024**3 / record_length)  # L181
This assumes the host_memory_GB array is homogeneous (every host equals index [0]). In practice the DLIO-generated summary.json array is not a clean one-value-per-host list — it contains duplicated (2×) and zero entries whose sum equals the true total. When host_memory_GB[0] happens to be an inflated (doubled) entry, num_hosts * host_memory_GB[0] yields exactly double the real cluster memory, and the minimum-dataset-size requirement doubles.
The correct total is sum(host_memory_GB), not num_hosts * host_memory_GB[0].
Environment / evidence (real submission)
- Cluster: 15 client hosts, 30 accelerators (2 ranks/host), reader.batch_size=7.
- Per-host RAM (verified independently via /proc/meminfo on every host, and in the tool's own collector-staging/cluster_info.json): MemTotal = 197,223,052 kB = 188.09 GiB on all 15 hosts. True total = 15 × 188.09 ≈ 2,821 GiB.
- record_length_bytes = 146,600,628, num_samples_per_file = 1, HOST_MEMORY_MULTIPLIER = 5.
- Dataset generated: num_files_train = 105,000.
- summary.json host_memory_GB (per run):
[376.18, 376.18, 376.18, 376.18, 376.18, 376.18, 376.18, 187.90, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Note: 376.18 ≈ 2 × 188.09 and sum(array) = 2,821 GiB = the correct cluster total, but array[0] = 376.18 is a doubled value.
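The note can be checked with a few lines of Python. This is a minimal sketch that uses only the rounded figures quoted above; the variable names are illustrative and do not come from the checker or from DLIO.

```python
# Values copied from the evidence above (rounded as reported in this issue).
MEMTOTAL_KB = 197_223_052                           # /proc/meminfo MemTotal, per host
NUM_HOSTS = 15
HOST_MEMORY_GB = [376.18] * 7 + [187.90] + [0.0] * 7  # summary.json array

per_host_gib = MEMTOTAL_KB / 1024**2                # kB -> GiB
true_total_gib = NUM_HOSTS * per_host_gib           # aggregate from /proc/meminfo
reported_sum_gib = sum(HOST_MEMORY_GB)              # aggregate from summary.json
index0_total_gib = NUM_HOSTS * HOST_MEMORY_GB[0]    # what the checker computes

print(f"per host (meminfo):      {per_host_gib:.2f} GiB")
print(f"true total (meminfo):    {true_total_gib:.1f} GiB")
print(f"sum(host_memory_GB):     {reported_sum_gib:.1f} GiB")
print(f"num_hosts * array[0]:    {index0_total_gib:.1f} GiB")
print(f"inflation vs sum:        {index0_total_gib / reported_sum_gib:.3f}x")
```

The sum of the array and the /proc/meminfo aggregate agree to within rounding, while the index-[0] product is about twice as large.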
The miscalculation
| Quantity | Checker (buggy) | Correct |
| --- | --- | --- |
| Total host memory | num_hosts × array[0] = 15 × 376.18 = 5,642.7 GiB | sum(array) = 2,821 GiB |
| min_samples_memory = mem × 5 × 1024³ / record_length | 206,641 files | 103,313 files |
| Verdict vs 105,000 generated | FAIL ("actual files 105000 < minimum required 206641") | PASS |
The cluster memory is inflated by almost exactly 2×, so the required dataset size is inflated by the same factor. The submission is genuinely compliant (105,000 ≥ 103,313) but is falsely reported as invalid.
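The arithmetic can be reproduced with the sketch below. It applies the formula from the issue (mem × 5 × 1024³ / record_length) to both totals. Because it starts from the rounded array values printed in this report, the results land within a few files of the figures quoted here rather than matching them digit for digit. The function name and the absence of rounding are assumptions; the checker's own rounding is not shown in this issue.

```python
GIB = 1024**3
HOST_MEMORY_MULTIPLIER = 5
RECORD_LENGTH_BYTES = 146_600_628
NUM_HOSTS = 15
NUM_FILES_TRAIN = 105_000
HOST_MEMORY_GB = [376.18] * 7 + [187.90] + [0.0] * 7  # rounded values from summary.json


def min_samples_memory(total_host_memory_gib: float) -> float:
    """Memory-based minimum sample count, left unrounded (rounding is an open assumption)."""
    return total_host_memory_gib * HOST_MEMORY_MULTIPLIER * GIB / RECORD_LENGTH_BYTES


buggy_min = min_samples_memory(NUM_HOSTS * HOST_MEMORY_GB[0])  # checker's current logic
correct_min = min_samples_memory(sum(HOST_MEMORY_GB))          # sum-based total

for label, minimum in (("checker (buggy)", buggy_min), ("sum-based", correct_min)):
    verdict = "PASS" if NUM_FILES_TRAIN >= minimum else "FAIL"
    print(f"{label:16s} minimum ~ {minimum:,.0f} files -> {verdict}")
```

With num_samples_per_file = 1, the sample count and the file count are the same number, which is why the comparison is made directly against num_files_train.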
Expected vs. actual behavior
- Expected: total host memory used by rule 3.1.2 equals the real aggregate cluster RAM (~2,821 GiB), giving a minimum of 103,313 files → PASS.
- Actual: total host memory is num_hosts × host_memory_GB[0] (~5,643 GiB), giving 206,641 files → FAIL.
Root cause
training_checks.py (rule 3.1.2) treats host_memory_GB as a homogeneous per-host scalar and scales host_memory_GB[0] by num_hosts. The array it receives from DLIO's summary.json is a per-rank/reduced array where entries are not one-per-host: some hosts appear at 2× (rank-doubled), some at 0. Its sum is correct; index [0] × num_hosts is not.
There are effectively two defects; the checker-side one is decisive:
- (Primary, checker) training_checks.py L164/L180: should aggregate the whole array (sum(host_memory_GB)), not num_hosts * host_memory_GB[0].
- (Secondary, DLIO summary) host_memory_GB in summary.json is emitted as a non-uniform per-rank array (duplicated/zero entries) rather than one clean value per physical host. Even if the checker is fixed, this array is misleading to any consumer that indexes it positionally.
Proposed fix
In mlpstorage_py/submission_checker/checks/training_checks.py, rule 3.1.2, replace the index-[0]-times-num_hosts logic with a sum over the reported per-host values:
    # Before
    host_memory_gb = summary.get("host_memory_GB", [0])[0]
    ...
    total_host_memory = num_hosts * host_memory_gb

    # After
    host_memory_list = summary.get("host_memory_GB", []) or []
    total_host_memory = sum(host_memory_list)  # aggregate real cluster RAM
    # (optionally) guard: if not total_host_memory: fall back / warn
sum(host_memory_GB) yields the correct 2,821 GiB regardless of how DLIO distributes the values across array positions.
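The optional guard mentioned above could look roughly like the following. This is a sketch, not the checker's code: the helper name `total_host_memory_gib`, the scalar handling, and the choice to log a warning and return 0.0 are all assumptions about how a maintainer might want a missing or empty field to behave.

```python
import logging
from typing import Any, Mapping

log = logging.getLogger(__name__)


def total_host_memory_gib(summary: Mapping[str, Any]) -> float:
    """Aggregate host_memory_GB by summing it (hypothetical helper)."""
    values = summary.get("host_memory_GB") or []
    if isinstance(values, (int, float)):
        # Assumption: tolerate a bare scalar in case some summaries emit one.
        values = [values]
    total = float(sum(values))
    if total <= 0:
        # Assumption: warn instead of raising, so the caller decides how to fail.
        log.warning("host_memory_GB is missing or sums to zero; cannot derive total host memory")
        return 0.0
    return total


if __name__ == "__main__":
    sample = {"host_memory_GB": [376.18] * 7 + [187.90] + [0.0] * 7}
    print(f"{total_host_memory_gib(sample):.2f} GiB")
```

Summing is insensitive to where DLIO places the doubled and zero entries, which is the property the fix relies on.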
Additionally (secondary), DLIO's summary.json should emit host_memory_GB as one value per unique host (length == num_hosts, no zero padding, no rank doubling) so positional consumers are robust.
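One way to produce a one-value-per-host array is to collapse per-rank records by hostname. The sketch below only illustrates that idea. It assumes each rank can report a `(hostname, memory_gib)` pair, and it is not based on how DLIO actually gathers or reduces this field; the function name and hostnames are hypothetical.

```python
from typing import Iterable, List, Tuple


def per_host_memory(rank_records: Iterable[Tuple[str, float]]) -> List[float]:
    """Keep one memory value per unique hostname, in first-seen order."""
    by_host = {}
    for hostname, memory_gib in rank_records:
        # Every rank on a host reports the same RAM, so the first value is enough.
        by_host.setdefault(hostname, memory_gib)
    return list(by_host.values())


if __name__ == "__main__":
    # Hypothetical 3-host cluster with 2 ranks per host.
    records = [
        ("node01", 188.09), ("node01", 188.09),
        ("node02", 188.09), ("node02", 188.09),
        ("node03", 188.09), ("node03", 188.09),
    ]
    print(per_host_memory(records))
```

An array built this way has length num_hosts, contains no zero padding or rank doubling, and still sums to the true cluster total.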
Workaround (submitter side)
None that is clean without altering the CLOSED code hash. Options: (a) generate a dataset sized to the inflated requirement (~206,641 files, roughly 2× storage — wasteful and only masks the bug), or (b) obtain a waiver/exception given the verified true memory and the corrected calculation above.
Reproduction
- Run MLPerf Storage v3 training / UNet3D (CLOSED, file backend) on a multi-rank-per-host cluster (e.g., 2 accelerators/host) so DLIO's host_memory_GB array contains doubled/zero entries.
- Size the dataset to the correct memory-based minimum (sum(host_memory_GB) × 5 × 1024³ / record_length).
- Run mlpstorage validate <results-dir>.
- Rule 3.1.2 fails with dataset size mismatch: actual files N < minimum required 2N, where 2N reflects num_hosts × host_memory_GB[0] rather than sum(host_memory_GB).
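Before running the validator, a results directory can be screened for the symptom with a small diagnostic such as the one below. It is a sketch under stated assumptions: `diagnose` and `diagnose_file` are hypothetical helpers, num_hosts is passed in by the caller instead of being read from the summary, and the only summary.json key used is host_memory_GB.

```python
import json
from pathlib import Path
from typing import Any, Dict, Sequence


def diagnose(host_memory_gb: Sequence[float], num_hosts: int) -> Dict[str, Any]:
    """Report whether the array is non-uniform and how far index [0] inflates the total."""
    values = [float(v) for v in host_memory_gb]
    summed = sum(values)
    index0_total = num_hosts * values[0] if values else 0.0
    return {
        "length_matches_num_hosts": len(values) == num_hosts,
        "has_zero_entries": any(v == 0 for v in values),
        "is_uniform": len(set(values)) <= 1,
        "sum_gib": summed,
        "index0_total_gib": index0_total,
        # Ratio of the checker's total to the summed total; None if the sum is zero.
        "inflation": index0_total / summed if summed else None,
    }


def diagnose_file(path: str, num_hosts: int) -> Dict[str, Any]:
    """Load a summary.json file and diagnose its host_memory_GB array."""
    summary = json.loads(Path(path).read_text())
    return diagnose(summary.get("host_memory_GB", []), num_hosts)
```

For the array in this report, the diagnostic shows zero entries, a non-uniform array, and an inflation factor of about 2, which matches the 2N in the failure message.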
Suggested references for the tracker
- File: mlpstorage_py/submission_checker/checks/training_checks.py, rule trainingRecalculateDatasetSize (decorator L128), lines L137 (HOST_MEMORY_MULTIPLIER = 5), L164 (host_memory_GB[0]), L180 (num_hosts * host_memory_gb), L181 (min_samples_memory).
- DLIO summary.json field host_memory_GB population (per-rank vs per-host).