Add FOM #71

Open
michaelmckinsey1 wants to merge 14 commits into LBANN:main from michaelmckinsey1:FOM

Collaborator

michaelmckinsey1 commented May 27, 2026

Define the benchmark FOM as inverse total training time:

  FOM = 1 / total_train_time

where total_train_time is the sum of epoch durations from train_stats.csv, as computed in ScaFFold/worker.py.
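As a rough illustration of the definition above, the following minimal sketch sums per-epoch durations from a train_stats.csv-style file and inverts the total. It is not the code in ScaFFold/worker.py: the function name `compute_fom`, the column name `epoch_duration`, and the sample durations are all assumptions made for the example.

```python
import csv
import io


def compute_fom(csv_file, duration_column="epoch_duration"):
    """Return (fom, total_train_time) from an open train_stats.csv-style file.

    duration_column is a hypothetical column name; the real header may differ.
    """
    reader = csv.DictReader(csv_file)
    # total_train_time is the sum of all epoch durations, in seconds.
    total_train_time = sum(float(row[duration_column]) for row in reader)
    if total_train_time <= 0:
        raise ValueError("total_train_time must be positive to define the FOM")
    return 1.0 / total_train_time, total_train_time


# Made-up epoch durations in seconds, standing in for train_stats.csv.
example_csv = io.StringIO("epoch,epoch_duration\n0,40.0\n1,35.0\n2,25.0\n")
fom, total = compute_fom(example_csv)
print(f"FOM = {fom:.6f} (total_train_time={total:.6f} seconds)")
```

Because the FOM is an inverse time, a larger value means a faster run.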

Output looks like:

FOM = 0.007299 (1 / total_train_time=137.002749 seconds). This FOM is specific to problem_scale=6, target_dice=0.95, seed=42.

This makes the FOM specific to the benchmark run configuration, including:

  - problem_scale
  - target_dice
  - seed

The FOM log explicitly prints those values alongside the score.
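The log line shown earlier could be assembled along these lines. This is only a sketch of the format: `format_fom_line` is a hypothetical helper, not a function taken from the PR, and the argument values are the ones from the sample output.

```python
def format_fom_line(total_train_time, problem_scale, target_dice, seed):
    """Build a FOM log line that also names the run configuration.

    Hypothetical helper for illustration; the PR's actual logging code may differ.
    """
    fom = 1.0 / total_train_time
    return (
        f"FOM = {fom:.6f} (1 / total_train_time={total_train_time:.6f} seconds). "
        f"This FOM is specific to problem_scale={problem_scale}, "
        f"target_dice={target_dice}, seed={seed}."
    )


line = format_fom_line(137.002749, problem_scale=6, target_dice=0.95, seed=42)
print(line)
```

Printing the configuration on the same line as the score makes it harder to compare FOM values from runs with different problem_scale, target_dice, or seed by mistake.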

This PR also keeps the minibatch timing work from #66/#70. Minibatch timing is measured with CUDA/ROCm events and aggregated as:

  median over epochs(
    median over full batches in epoch(
      max over ranks(rank_minibatch_time_s)
    )
  )

where rank_minibatch_time_s is the GPU elapsed time, measured with CUDA/ROCm events, on one rank for one full training minibatch. The measured interval runs from the batch's device transfer and training work through the optimizer update and gradient reset.
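The nested aggregation can be sketched in plain Python as follows. This is an illustration under assumptions, not the PR's implementation: the function names are hypothetical, the timings are made up, and nested lists stand in for per-rank values that a distributed run would presumably combine with a collective max reduction.

```python
from statistics import median


def epoch_minibatch_time(batch_times_by_rank):
    """Median over full batches of the slowest rank's time for each batch.

    batch_times_by_rank: one inner list per full batch, holding
    rank_minibatch_time_s for every rank (hypothetical layout).
    """
    return median(max(rank_times) for rank_times in batch_times_by_rank)


def run_minibatch_time(epochs):
    """Median over epochs of the per-epoch minibatch time."""
    return median(epoch_minibatch_time(epoch) for epoch in epochs)


# Made-up timings in seconds: 3 epochs x 3 full batches x 2 ranks.
epochs = [
    [[0.10, 0.12], [0.11, 0.13], [0.30, 0.09]],
    [[0.10, 0.11], [0.12, 0.10], [0.14, 0.13]],
    [[0.20, 0.15], [0.16, 0.18], [0.17, 0.17]],
]
per_epoch = [epoch_minibatch_time(epoch) for epoch in epochs]
minibatch_time_s = run_minibatch_time(epochs)
print(per_epoch, minibatch_time_s)
```

Taking the max over ranks reflects that a synchronous step is only as fast as its slowest rank, while the medians damp outliers such as the 0.30 s batch in the first epoch.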

Each epoch also prints its minibatch timing:

  median over full batches in epoch(
    max over ranks(rank_minibatch_time_s)
  )

This minibatch timing is reported separately as minibatch_time_s; it is no longer used to define the FOM.

michaelmckinsey1 self-assigned this May 27, 2026
michaelmckinsey1 changed the title from "[WIP] FOM" to "Add FOM" May 27, 2026
michaelmckinsey1 marked this pull request as ready for review May 27, 2026 23:59
@michaelmckinsey1
Collaborator Author

@ndryden @PatrickRMiles Per our discussion today, I changed the FOM definition, and still kept the minibatch timer (via CUDA events).
