Add FOM #71
Open
michaelmckinsey1 wants to merge 14 commits into
@ndryden @PatrickRMiles Per our discussion today, I changed the FOM definition but kept the minibatch timer (via CUDA events).
Define the benchmark FOM as inverse total training time, FOM = 1 / total_train_time, where total_train_time is the sum of epoch durations from train_stats.csv, as computed in ScaFFold/worker.py.
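As a rough illustration of the definition above, here is a minimal sketch that sums per-epoch durations from a CSV and inverts the total. It is not the code in ScaFFold/worker.py: the column name `epoch_duration`, the helper names, and the sample rows are all hypothetical, and the real layout of train_stats.csv may differ.

```python
import csv
import io


def total_train_time(rows, column="epoch_duration"):
    """Sum per-epoch durations (seconds) from parsed CSV rows.

    The column name is an assumption made for this sketch.
    """
    return sum(float(row[column]) for row in rows)


def compute_fom(csv_file, column="epoch_duration"):
    """Return (fom, total_train_time_s) for an open CSV file object."""
    total = total_train_time(csv.DictReader(csv_file), column)
    if total <= 0:
        raise ValueError("total_train_time must be positive")
    # FOM is the inverse of the total training time.
    return 1.0 / total, total


# Made-up data for illustration only; not output from a real run.
example = io.StringIO("epoch,epoch_duration\n1,2.0\n2,3.0\n3,5.0\n")
fom, total = compute_fom(example)
print(f"total_train_time={total:.3f}s FOM={fom:.6f}")
```

Because the score is a plain inverse of wall time, a shorter run always scores higher, so scores are only comparable between runs with the same configuration.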
Output looks like:
This makes the FOM specific to the benchmark run configuration, including:
The FOM log explicitly prints those values alongside the score.
This PR also keeps the minibatch timing work from #66/#70. Minibatch timing is measured with CUDA/ROCm events and aggregated as:
where rank_minibatch_time_s is the GPU elapsed time, measured with CUDA/ROCm events on one rank, for one full training minibatch: from the batch's device transfer and training work through the optimizer update and gradient reset.
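For readers unfamiliar with event-based timing, the per-rank measurement might look roughly like the sketch below. It assumes PyTorch and uses `torch.cuda.Event`, which ROCm builds of PyTorch expose through the same `torch.cuda` namespace. The `time_minibatch` helper and the `step_fn` callable are hypothetical and not taken from this PR. The sketch covers only the measurement on one rank, not the aggregation across ranks, and it falls back to a wall-clock timer when no GPU is available.

```python
import time

try:
    import torch  # optional; only needed for the GPU path
    _HAS_GPU = torch.cuda.is_available()
except ImportError:
    torch = None
    _HAS_GPU = False


def time_minibatch(step_fn):
    """Run step_fn() once and return its elapsed time in seconds.

    step_fn is a hypothetical callable covering one full minibatch:
    device transfer, forward/backward, optimizer update, gradient reset.
    """
    if _HAS_GPU:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        step_fn()
        end.record()
        end.synchronize()  # wait until the end event has completed
        return start.elapsed_time(end) / 1000.0  # elapsed_time is in ms
    # CPU fallback so the sketch runs without a GPU (wall clock, not event time).
    t0 = time.perf_counter()
    step_fn()
    return time.perf_counter() - t0


# Stand-in workload; a real step would run one training minibatch.
rank_minibatch_time_s = time_minibatch(lambda: sum(range(10_000)))
print(f"rank_minibatch_time_s={rank_minibatch_time_s:.6f}")
```

Events are recorded on the GPU stream, so the measured interval reflects device-side elapsed time rather than how long the host took to enqueue the work.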
Each epoch also prints its minibatch timing:
This minibatch timing is reported separately as minibatch_time_s; it is no longer used to define the FOM.