BenchMark

Pipeline-aware resource monitoring for ancient DNA and metagenomic classification tools.

BenchMark attaches to a running screen terminal session and tracks CPU time, memory, disk I/O, and wall-clock time across every step of a pipeline — pausing the timer automatically when the session is idle between commands. Output is a per-step CSV that can be merged across multiple tool runs and rendered into publication-quality comparison plots.

Developed at the Hakai Institute by Tyler Murchie.

Quick start

# Install
cd ~/software/BenchMark && bash install.sh && source ~/.bashrc

# 1. Start your screen session
screen -S fillet_run

# 2. Start monitoring (runs in background immediately)
BenchMark go \
  --screen fillet_run \
  --tool   Fillet \
  --dataset calculus_sim_100k \
  --output ~/results/benchmark/

# 3. Label each step before running it in screen (optional but recommended)
BenchMark mark "adapter_trim"
# → run trimming in screen ...
BenchMark mark "alignment"
# → run alignment in screen ...
BenchMark mark "classification"
# → run classification in screen ...

# 4. Stop and write CSV
BenchMark stop

# 5. Combine runs from multiple tools
BenchMark merge fillet.csv kraken2.csv megan7.csv holi.csv \
  -o comparison.csv --pipeline-version v1.0

# 6. Generate publication plots
BenchMark analyse comparison.csv \
  --output-dir ./plots \
  --title "Ancient DNA Classifier Benchmark"

Installation

git clone https://github.com/tjmurchie/BenchMark ~/software/BenchMark
cd ~/software/BenchMark && bash install.sh && source ~/.bashrc

install.sh will:

Check Python ≥ 3.8
Install psutil via pip if not present
Make BenchMark executable and add it to PATH
Check R and install required packages (ggplot2, dplyr, tidyr, scales, gridExtra, RColorBrewer)
Create ~/.benchmark/ state directory

Requirements:

Python ≥ 3.8
psutil ≥ 5.9 (installed automatically)
screen (GNU Screen)
R ≥ 4.0 with ggplot2 (for BenchMark analyse)
/usr/bin/time (GNU time; for accuracy validation only)

Commands

`BenchMark go` — start monitoring

BenchMark go --tool NAME --dataset NAME [options]

Required:
  --tool, -t NAME       Tool or pipeline name (label only, e.g. "Kraken2")
  --dataset, -d NAME    Dataset name (e.g. "calculus_sim_100k")

Optional:
  --screen, -s NAME     Screen session to monitor
                        (auto-detected if only one session is active)
  --output, -o DIR      Directory for CSV output (default: current directory)
  --notes, -n TEXT      Free-text notes — database version, parameters, etc.

Starts a background daemon that monitors the named screen session and returns immediately. The daemon waits up to 5 minutes for the screen session to appear if it hasn't started yet.

Step detection is automatic. When a command starts running in screen, a new step begins. When it finishes and the shell returns to the prompt (idle for 3 s), the step ends and its metrics are recorded. No workflow changes are required.

Example:

BenchMark go --screen kr2 --tool Kraken2 --dataset dental_calculus \
             --notes "k=35, db=2024-01" --output ~/bench_results/

`BenchMark mark` — label the next step

BenchMark mark [LABEL] [--screen NAME]

Queues a label for the next auto-detected step. Call this before running the step in screen. Also forces a step boundary if a step is currently active.

BenchMark mark "database_build"
# then run the database-building command in screen
BenchMark mark "read_classification"
# then run classification in screen

If your pipeline is a single invocation (e.g. python3 fillet.py), you can also have Fillet call BenchMark mark internally:

import subprocess
subprocess.run(["BenchMark", "mark", "adapter_trim"], check=False)

The check=False means it silently does nothing if BenchMark is not running.

`BenchMark rename` — rename a completed step

BenchMark rename STEP_NUM NEW_NAME [--screen NAME]

Retroactively rename a step by number (visible in BenchMark status).

BenchMark rename 1 "adapter_trim"
BenchMark rename 2 "bowtie2_alignment"

`BenchMark stop` — stop monitoring and write CSV

BenchMark stop [--screen NAME] [--output DIR]

Sends stop signal to the daemon, waits for it to finalise, prints a summary table, and writes the CSV to the output directory.

`BenchMark status` — show active sessions

BenchMark status

Shows all currently monitored sessions with tool name, dataset, step count, and daemon health.

`BenchMark run` — wrap a single command

For simple single-step tools that don't need screen session monitoring:

BenchMark run --tool NAME --dataset NAME [--output DIR] [--notes TEXT] -- COMMAND

BenchMark run --tool Kraken2 --dataset test_100 -- \
  kraken2 --db /databases/k2_standard --output out.tsv reads.fastq.gz

`BenchMark merge` — combine multiple run CSVs

BenchMark merge FILE [FILE ...] -o OUTPUT [options]

Options:
  -o, --output FILE           Output path (default: benchmark_merged.csv)
  --pipeline-version VERSION  Version tag added to all rows
  --run-label LABEL           Run label tag added to all rows

Merges per-session CSVs from multiple tool runs into one file for comparison. Automatically adds source_file, cpu_efficiency, and total_io_mb columns.

BenchMark merge \
  results/fillet_v1_calculus.csv \
  results/kraken2_calculus.csv \
  results/megan7_calculus.csv \
  results/holi_calculus.csv \
  -o results/comparison_v1.csv \
  --pipeline-version "v1.0" \
  --run-label "dental_calculus_dataset"

`BenchMark analyse` — generate R comparison plots

BenchMark analyse INPUT_CSV [--output-dir DIR] [--title TITLE]

Runs the R analysis script and generates:

Output	Description
`benchmark_comparison.pdf`	All plots in one PDF (title page + 8 figures)
`01_wall_time.png`	Active wall-clock time per tool
`02_cpu_time.png`	Total CPU time per tool
`03_peak_memory.png`	Peak RSS memory per tool
`04_cpu_efficiency.png`	CPU efficiency (parallelism ratio)
`05_resource_heatmap.png`	Normalised resource comparison heatmap
`06_speed_vs_memory.png`	Speed vs. memory scatter
`07_step_breakdown.png`	Wall time stacked by step (if steps are labelled)
`08_io_footprint.png`	Disk read/write per tool
`benchmark_summary_table.csv`	Clean summary table for supplementary materials

All plots use ggplot2 at 300 DPI with clean, publication-appropriate styling.

BenchMark analyse comparison.csv \
  --output-dir ./plots \
  --title "Ancient DNA Classifier Benchmark — Dental Calculus Dataset"

What is measured

Each CSV row corresponds to one pipeline step (or a TOTAL summary row where is_summary = True).

Column	Description
`wall_time_s`	Active wall-clock seconds (idle time excluded)
`cpu_user_s`	User-mode CPU seconds, summed across all threads
`cpu_system_s`	Kernel-mode CPU seconds, summed across all threads
`cpu_total_s`	Total CPU time (user + system)
`cpu_efficiency`	`cpu_total / wall_time` — values >1 indicate multi-threading
`peak_mem_mb`	Maximum RSS memory observed during the step
`avg_mem_mb`	Mean RSS memory during the active period
`max_threads`	Maximum concurrent thread count
`peak_processes`	Maximum concurrent process count
`disk_read_mb`	Total data read from disk
`disk_write_mb`	Total data written to disk
`total_io_mb`	`disk_read + disk_write` (added on merge)

System metadata recorded per session: hostname, OS, CPU model, logical/physical core count, total RAM. This ensures measurements from different machines are clearly identified in merged datasets.

How idle detection works

BenchMark pauses the step timer when the screen session is idle:

Only shell processes (bash, zsh, etc.) remain in the session's process tree
AND aggregate CPU usage is below 0.5% for ≥3 seconds

The next command run in screen starts a new step automatically. This means time spent reading output, reviewing intermediate results, or preparing the next command is excluded from all measurements.

The 3-second debounce prevents brief pauses within a tool (e.g., between index-building and classification phases) from being split into separate steps unless the tool truly returns to the shell prompt.

Per-step breakdown vs. total comparison

Automatic step splitting works when each pipeline stage is run as a separate command in the screen session (e.g., a shell script where each step is run individually, or a workflow where the user manually starts each stage). Each command = one step row in the CSV.

Manual step marking via BenchMark mark is required for pipelines that run as a single invocation (e.g., python3 fillet.py) to get per-stage breakdowns. Without marks, a single-invocation pipeline will appear as one step.

Either way, the TOTAL summary row always gives correct end-to-end metrics for cross-tool comparison, regardless of how many internal steps were detected.

Session state

BenchMark stores session state in ~/.benchmark/sessions/<session>_<timestamp>/:

state.json       — live session state (flushed every 15 s)
daemon.pid       — daemon process ID
daemon.log       — daemon log
pending_mark.json — queued step label (transient)

Accuracy validation

BenchMark was validated against GNU time (/usr/bin/time -v) across five controlled workloads representing bioinformatics resource profiles (N=3 replicates each). For workloads with durations representative of production pipeline steps (>2 s), mean absolute deviations were:

Wall-clock time: 2.7% (absolute overhead ~0.06 s)
CPU time: 2.4%
Peak RSS memory: 1.0%

For a 30-minute pipeline step, the absolute wall-time overhead of ~0.06 s represents a 0.003% measurement error. Full validation methodology and results are in docs/supplementary_validation.md.

To reproduce the validation:

python3 tests/validation_study.py --reps 3 --output-csv validation_results.csv

Adding TP/FP/FN accuracy data

Once your classifier accuracy scripts produce precision/recall/F1 scores, add them as additional columns to the merged CSV. The R analysis script automatically includes any extra numeric columns it finds in the heatmap and summary table. No changes to BenchMark are required.

Running tests

# Unit tests (26 tests, <1 s)
python3 -m pytest tests/test_core.py -v

# Integration test (real subprocesses, merge, CSV round-trip)
python3 tests/simulate_workflow.py

# Accuracy validation (requires GNU time; ~15 min for N=3)
python3 tests/validation_study.py --reps 3

Planned additions

TP/FP/TN/FN integration (classification accuracy alongside resource metrics)
Multi-replicate averaging with error bars in R plots
Optional GPU monitoring via nvidia-smi
nf-core/Snakemake workflow integration

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.github/workflows		.github/workflows
R		R
benchmark		benchmark
docs		docs
tests		tests
.gitignore		.gitignore
BenchMark		BenchMark
CHANGELOG.md		CHANGELOG.md
CITATION.cff		CITATION.cff
LICENSE		LICENSE
README.md		README.md
install.sh		install.sh
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BenchMark

Quick start

Installation

Commands

`BenchMark go` — start monitoring

`BenchMark mark` — label the next step

`BenchMark rename` — rename a completed step

`BenchMark stop` — stop monitoring and write CSV

`BenchMark status` — show active sessions

`BenchMark run` — wrap a single command

`BenchMark merge` — combine multiple run CSVs

`BenchMark analyse` — generate R comparison plots

What is measured

How idle detection works

Per-step breakdown vs. total comparison

Session state

Accuracy validation

Adding TP/FP/FN accuracy data

Running tests

Planned additions

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

BenchMark

Quick start

Installation

Commands

BenchMark go — start monitoring

BenchMark mark — label the next step

BenchMark rename — rename a completed step

BenchMark stop — stop monitoring and write CSV

BenchMark status — show active sessions

BenchMark run — wrap a single command

BenchMark merge — combine multiple run CSVs

BenchMark analyse — generate R comparison plots

What is measured

How idle detection works

Per-step breakdown vs. total comparison

Session state

Accuracy validation

Adding TP/FP/FN accuracy data

Running tests

Planned additions

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`BenchMark go` — start monitoring

`BenchMark mark` — label the next step

`BenchMark rename` — rename a completed step

`BenchMark stop` — stop monitoring and write CSV

`BenchMark status` — show active sessions

`BenchMark run` — wrap a single command

`BenchMark merge` — combine multiple run CSVs

`BenchMark analyse` — generate R comparison plots

Packages