Add FOM by michaelmckinsey1 · Pull Request #71 · LBANN/ScaFFold

michaelmckinsey1 · 2026-05-27T00:46:01Z

Define the benchmark figure of merit (FOM) as inverse total training time:

  FOM = 1 / total_train_time

where total_train_time is the sum of epoch durations from train_stats.csv, as computed in ScaFFold/worker.py.
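As a minimal sketch of the definition above (not the actual code in ScaFFold/worker.py), the FOM can be computed by summing a per-epoch duration column and inverting the total. The column name `epoch_duration`, the helper name `compute_fom`, and the sample values are assumptions made for illustration; the real layout of train_stats.csv may differ.

```python
import csv
import io


def compute_fom(csv_file, duration_column="epoch_duration"):
    """Return (fom, total_train_time) from a CSV of per-epoch stats.

    `duration_column` is a hypothetical column name holding seconds per epoch.
    """
    reader = csv.DictReader(csv_file)
    total_train_time = sum(float(row[duration_column]) for row in reader)
    if total_train_time <= 0:
        raise ValueError("total_train_time must be positive")
    return 1.0 / total_train_time, total_train_time


# Made-up sample data standing in for train_stats.csv.
sample = io.StringIO("epoch,epoch_duration\n0,50.0\n1,45.0\n2,30.0\n")
fom, total = compute_fom(sample)
print(f"FOM = {fom:.6f} (total_train_time={total:.6f} seconds)")
```

Because the FOM is an inverse time, a larger value means a faster run to the target.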

Output looks like:

FOM = 0.007299 (1 / total_train_time=137.002749 seconds). This FOM is specific to problem_scale=6, target_dice=0.95, seed=42.

This makes the FOM specific to the benchmark run configuration, including:

  - problem_scale
  - target_dice
  - seed

The FOM log explicitly prints those values alongside the score.
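A log line in the shape shown above could be produced as follows. This is a sketch, assuming six decimal places for both numbers as in the sample output; `format_fom` is a hypothetical helper, not a function from the repository.

```python
def format_fom(total_train_time, problem_scale, target_dice, seed):
    """Format the FOM together with the run configuration it is specific to."""
    fom = 1.0 / total_train_time
    return (
        f"FOM = {fom:.6f} (1 / total_train_time={total_train_time:.6f} seconds). "
        f"This FOM is specific to problem_scale={problem_scale}, "
        f"target_dice={target_dice}, seed={seed}."
    )


# Values taken from the sample output in the PR description.
line = format_fom(137.002749, problem_scale=6, target_dice=0.95, seed=42)
print(line)
```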

This PR also keeps the minibatch timing work from #66/#70. Minibatch timing is measured with CUDA/ROCm events and aggregated as:

  median over epochs(
    median over full batches in epoch(
      max over ranks(rank_minibatch_time_s)
    )
  )

where rank_minibatch_time_s is the GPU elapsed time, measured with CUDA/ROCm events, on one rank for one full training minibatch, from the batch’s device-transfer/training work through the optimizer update and gradient reset.
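For illustration, event-based timing of a single minibatch on one rank might look like the sketch below. It is not the PR's implementation: `time_minibatch_s` and `step_fn` are hypothetical names, and the wall-clock fallback exists only so the sketch runs on machines without PyTorch or a GPU. ROCm builds of PyTorch expose the same event API under `torch.cuda`.

```python
import time

try:
    import torch
except ImportError:  # PyTorch is optional for this sketch
    torch = None


def time_minibatch_s(step_fn):
    """Return elapsed seconds for one call to step_fn().

    step_fn is assumed to do the device transfer, forward/backward pass,
    optimizer update, and gradient reset for one minibatch.
    """
    if torch is not None and torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        step_fn()
        end.record()
        end.synchronize()  # wait until the GPU has reached the end event
        return start.elapsed_time(end) / 1000.0  # elapsed_time is in ms
    # Fallback (assumption for portability): host wall-clock time.
    t0 = time.perf_counter()
    step_fn()
    return time.perf_counter() - t0


elapsed = time_minibatch_s(lambda: sum(range(1000)))
```

Events measure time on the GPU stream, so asynchronous kernel launches are included without inserting a full device synchronization before the measured region.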

Each epoch also prints its minibatch timing:

  median over full batches in epoch(
    max over ranks(rank_minibatch_time_s)
  )

This minibatch timing is reported separately as minibatch_time_s; it is no longer used to define the FOM.
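The nested aggregation described above can be sketched in plain Python. The data layout (`epochs[e][b][r]` holding rank_minibatch_time_s for epoch `e`, full batch `b`, rank `r`), the function names, and the sample numbers are assumptions for illustration; in a distributed run the max over ranks would typically come from a collective reduction rather than a local list.

```python
from statistics import median


def epoch_minibatch_time_s(batch_times_by_rank):
    """Median over full batches of the slowest rank's time for each batch."""
    return median(max(rank_times) for rank_times in batch_times_by_rank)


def run_minibatch_time_s(epochs):
    """Median over epochs of the per-epoch minibatch time."""
    return median(epoch_minibatch_time_s(epoch) for epoch in epochs)


# Made-up timings: 3 epochs, 3 full batches each, 2 ranks.
epochs = [
    [[0.10, 0.12], [0.11, 0.09], [0.20, 0.15]],  # per-batch max: 0.12, 0.11, 0.20
    [[0.08, 0.10], [0.10, 0.11], [0.09, 0.13]],  # per-batch max: 0.10, 0.11, 0.13
    [[0.14, 0.13], [0.12, 0.16], [0.15, 0.15]],  # per-batch max: 0.14, 0.16, 0.15
]
per_epoch = [epoch_minibatch_time_s(e) for e in epochs]
minibatch_time_s = run_minibatch_time_s(epochs)
```

Taking the max over ranks reflects that a synchronous step is only as fast as its slowest rank, while the medians damp outliers such as warm-up batches.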

Depends on Enable timing minibatch #66
Depends on Add metadata #70

michaelmckinsey1 · 2026-05-28T22:08:25Z

@ndryden @PatrickRMiles Per our discussion today, I changed the FOM definition but kept the minibatch timer (via CUDA events).

michaelmckinsey1 and others added 9 commits May 6, 2026 12:51

fix dtypes for torch

e583d85

Add per minibatch timer

3dfbd13

Merge remote-tracking branch 'origin/fix-dtypes' into per-minibatch

47a4812

cleanup

c9ef075

Merge remote-tracking branch 'origin/main' into per-minibatch

ed246f5

Add adiak metadata

c0ba9c9

Update worker.py

1f7af32

Merge remote-tracking branch 'origin/per-minibatch' into adiak

1fee7d3

Define FOM

fa113f1

michaelmckinsey1 self-assigned this May 27, 2026

rm gbs, divide by epochs

1e9df19

michaelmckinsey1 changed the title from "[WIP] FOM" to "Add FOM" May 27, 2026

michaelmckinsey1 marked this pull request as ready for review May 27, 2026 23:59

michaelmckinsey1 mentioned this pull request May 28, 2026

Enable timing minibatch #66

Merged

michaelmckinsey1 added 4 commits May 28, 2026 11:55

Merge remote-tracking branch 'origin/main' into FOM

ef23c72

lint

cc7961c

Redefine FOM

d0a61f4

Cleanup

e1a7def

michaelmckinsey1 requested review from PatrickRMiles and ndryden May 28, 2026 22:05

michaelmckinsey1 wants to merge 14 commits into LBANN:main from michaelmckinsey1:FOM
