mlpstorage multi-run score aggregation #645

@FileSystemGuy

@wolfgang-desalvador asked how the scores are aggregated across and within invocations of DLIO for training. Here is what Claude tells me:

❯ Excellent research, thank you. So, the per-DLIO-invocation aggregation is the mean() of the scores from the 5
epochs within that invocation. The mlpstorage tool doesn't currently do any aggregation of the scores returned
by each of the 5 (I'm excluding the warmup run) invocations of DLIO, but it needs to. Is that an accurate
summary?

● Yes, that's an accurate summary — with one small precision worth calling out on the per-invocation side.

Per-DLIO-invocation aggregation (single summary.json) — accurate:

  • The headline train_au_mean_percentage is np.mean(train_au) across the 5 per-epoch AUs.
  • The headline B/W (train_throughput_mean_samples_per_second, train_io_mean_MB_per_second) is
    np.mean(train_throughput) across the 5 per-epoch throughputs (converted to MiB/s via record_size).
  • Small precision: those per-epoch AU/throughput values that DLIO takes the mean of are themselves step-weighted
    means of the block-level values within the epoch (statscounter.py:end_train, au = np.sum(au*steps)/np.sum(steps)).
    So the invocation-level score is a mean of step-weighted-mean-per-epoch values — you can accurately call it "the
    mean of the scores from the 5 epochs" for user-facing purposes; just noting the epoch-level number isn't a naive
    average of blocks.
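
As an illustration of the two-level reduction described above, here is a minimal sketch. It is not DLIO's code: the function names and the input layout (one list of block AUs and one list of block step counts per epoch) are assumptions made for the example, and only the formula au = sum(au*steps)/sum(steps) followed by a plain mean over epochs comes from the description.

```python
from statistics import fmean


def epoch_au(block_aus, block_steps):
    """Step-weighted mean of the block-level AU values within one epoch.

    Hypothetical helper mirroring au = sum(au * steps) / sum(steps).
    """
    total_steps = sum(block_steps)
    if total_steps == 0:
        raise ValueError("epoch has no steps")
    return sum(a * s for a, s in zip(block_aus, block_steps)) / total_steps


def invocation_au(epochs):
    """Plain (unweighted) mean of the per-epoch AUs for one invocation.

    `epochs` is a list of (block_aus, block_steps) pairs, one per epoch.
    """
    return fmean([epoch_au(aus, steps) for aus, steps in epochs])


# One epoch with two blocks: 1 step at 90% AU and 3 steps at 100% AU.
weighted = epoch_au([90.0, 100.0], [1, 3])   # step-weighted: 97.5
naive = fmean([90.0, 100.0])                 # naive block average: 95.0
```

The gap between `weighted` and `naive` is the point of the caveat: the epoch-level number favors blocks that ran more steps, while the invocation-level number treats every epoch equally.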

Cross-invocation aggregation (across the 5 measured invocations) — accurate:

  • Neither DLIO nor mlpstorage currently reduces the five per-invocation values into a single per-submission AU% or
    B/W.
  • DLIO never sees more than one invocation, so this is properly mlpstorage's responsibility.
  • mlpstorage has two stub sites where it belongs:
    • mlpstorage_py/report_generator.py:423 — workload-group Result built with metrics={} # TODO: Add function to
      aggregate metrics
    • mlpstorage_py/submission_checker/results.py:56–69 — ResultExporter.add_result() emits empty strings for the
      3D-Unet - Read B/W (GiB/s) column and its peers
  • And the aggregation method itself is undefined in Rules.md — §2.1.17 only says the warmup "will not be included
    in the reported performance," but doesn't specify whether the remaining 5 are combined by mean, min, geometric
    mean, or something else.

So "mlpstorage needs to aggregate the 5 measured invocations" is correct, and there are actually two open
questions: (1) implement the code, and (2) decide/document in Rules.md which reduction to use — the current
codebase, the current spec, and the current advisor folklore ("it's the min") all disagree, so a written rule is a
prerequisite to the implementation.
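
To make the open question concrete, here is a hedged sketch of what a cross-invocation reducer could look like once a rule is chosen. Nothing here exists in mlpstorage today: `aggregate_invocations`, the `REDUCTIONS` table, and the method names are hypothetical, and the sample values are invented purely to show that the candidate reductions give different answers.

```python
from statistics import fmean, geometric_mean

# Candidate reductions named in the discussion; which one applies is
# exactly what Rules.md would need to specify.
REDUCTIONS = {
    "mean": fmean,
    "min": min,
    "geomean": geometric_mean,
}


def aggregate_invocations(values, method):
    """Reduce per-invocation scores (warmup already excluded) to one number.

    Hypothetical helper; `method` must be a key of REDUCTIONS.
    """
    if method not in REDUCTIONS:
        raise ValueError(f"unknown reduction: {method!r}")
    if not values:
        raise ValueError("no measured invocations to aggregate")
    return REDUCTIONS[method](values)


# Made-up AU% values for five measured invocations.
per_invocation_au = [90.0, 92.0, 94.0, 96.0, 98.0]
mean_au = aggregate_invocations(per_invocation_au, "mean")   # 94.0
min_au = aggregate_invocations(per_invocation_au, "min")     # 90.0
geo_au = aggregate_invocations(per_invocation_au, "geomean")
```

Requiring `method` as an explicit argument, with no default, is deliberate in this sketch: until the rule is written down, the code should not silently pick one.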
