AI-Ocean/gpu-usage-audit

gpu-usage-audit

Single-host NVIDIA GPU usage audit for finding idle-held GPUs: cards that look idle by utilization, but are still held by a process through GPU memory.

PyPI · Python 3.12+ · License · GitHub Release

About

gpu-usage-audit records local NVIDIA/NVML telemetry into SQLite and renders a retrospective report that separates GPU card-ticks into:

  • active: utilization is high enough that the card is doing real work
  • idle-held: utilization is low, but a process still holds GPU memory
  • truly-idle: utilization is low and no meaningful GPU process memory is present

The second category is the point. A notebook can sit at 1% SM utilization while keeping an 8 GB tensor allocated. Conventional dashboards usually flatten that into “idle”; this tool shows that the card is effectively unavailable.

Features

  • Single-host, bare-metal NVIDIA GPU audit
  • gua doctor readiness check for /dev/nvidia*, nvidia-smi, NVML, and DB path
  • Background collector with gua daemon, gua status, and gua stop
  • SQLite history database at ~/.gua/gua.db by default
  • Report sections for headline split, idle capacity, per-GPU state, top identities, and time-of-day heatmap
  • Daemon interval metadata stored per run, so reports compute GPU-hours correctly across mixed 30s / 10s runs
  • GPU-less gua demo command with deterministic fake telemetry
  • No cluster runtime dependency; no Kubernetes, Slurm, Docker, or remote-node scan in the 1.0 scope
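
The per-run interval bullet above can be illustrated with a minimal sketch. It assumes each daemon run contributes a count of card-ticks and its own recorded interval; the RunTicks shape and the gpu_hours helper are hypothetical names for illustration, not gua's internal API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunTicks:
    """Card-ticks in one state from one daemon run (hypothetical shape)."""

    interval_seconds: float  # sampling interval recorded for that run
    card_ticks: int  # number of (GPU card, tick) samples in that state


def gpu_hours(runs: list[RunTicks]) -> float:
    """Weight each run's card-ticks by that run's own interval."""
    return sum(r.interval_seconds * r.card_ticks for r in runs) / 3600.0


# 120 idle-held card-ticks sampled at 30s plus 360 sampled at 10s.
mixed = [RunTicks(30.0, 120), RunTicks(10.0, 360)]
print(gpu_hours(mixed))  # 2.0
```

Applying a single 30s interval to all 480 card-ticks would report 4.0 GPU-hours instead, which is the kind of error per-run interval metadata avoids.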

Installation

The recommended install path is PyPI via uv:

uv tool install gpu-usage-audit

Update or remove it with:

uv tool upgrade gpu-usage-audit
uv tool uninstall gpu-usage-audit

Manual wheel downloads are available from GitHub Releases:

BASE="https://github.com/AI-Ocean/gpu-usage-audit/releases/download/v1.0.3"
WHEEL="gpu_usage_audit-1.0.3-py3-none-any.whl"

curl -fsSLO "$BASE/$WHEEL"
curl -fsSLO "$BASE/SHA256SUMS"
sha256sum -c SHA256SUMS --ignore-missing

uvx --from "./$WHEEL" gua doctor

Quick Start

On an NVIDIA GPU host:

gua doctor
gua daemon --interval 30s
gua status
gua report --since 1h
gua stop

gua doctor is read-only. It does not need sudo; run it as the same user that will run the daemon.
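
For readers curious what a read-only readiness check of this kind might involve, here is a rough sketch. It is not gua's implementation: the readiness function, its result keys, the writable-directory test, and the use of the third-party pynvml binding are all assumptions made for illustration.

```python
import glob
import os
import shutil


def readiness(db_path: str = "~/.gua/gua.db") -> dict[str, bool]:
    """Read-only checks loosely mirroring the areas gua doctor is said to cover."""
    db_dir = os.path.dirname(os.path.expanduser(db_path))
    try:
        # Assumption: pynvml is used here only as one possible NVML binding.
        import pynvml

        pynvml.nvmlInit()
        pynvml.nvmlShutdown()
        nvml_ok = True
    except Exception:
        # Missing binding, missing driver library, or no usable GPU.
        nvml_ok = False
    return {
        "device_nodes": bool(glob.glob("/dev/nvidia*")),
        "nvidia_smi": shutil.which("nvidia-smi") is not None,
        "nvml": nvml_ok,
        "db_dir_writable": os.path.isdir(db_dir) and os.access(db_dir, os.W_OK),
    }


print(readiness())
```

Nothing in the sketch writes to disk or needs elevated privileges, which matches the read-only, no-sudo behavior described above.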

Default local state lives under ~/.gua/:

Path              Purpose
~/.gua/gua.db     SQLite history database
~/.gua/gua.pid    Background daemon PID file
~/.gua/gua.log    Daemon stdout/stderr log

The default DB is an appendable local history database: later daemon runs append to it. With a custom --db PATH, however, gua daemon refuses to start if that file already exists, to avoid mixing ad hoc runs by accident.

Report Preview

$ gua report --since 1h
gua — lab-a100 (bare, driver 560.35.05)  Window: 1:00:00

§1 Headline
  basis: one sample = one GPU card at one daemon tick
  rules: active >=10% util; idle-held <10% util with >100 MB process memory
  active       █   15.7%
  idle-held    ▒   45.1%
  truly-idle   ░   39.2%
  (51 samples)

§2 Idle capacity
  converted from card-ticks to GPU-hours using recorded daemon interval
  idle-held: ~0.31 GPU-hours, ~1.53 GPUs equivalently unavailable
  truly-idle: ~0.12 GPU-hours, ~1.00 GPUs equivalently free

§3 Per-GPU
§4 Top identities
§5 Time-of-day heatmap (UTC)

Reports can run while the daemon is writing; SQLite WAL mode handles concurrent reads. Reports also work after the daemon has stopped, as long as the DB file exists.
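
The concurrent-read behavior can be reproduced with Python's standard sqlite3 module. The sketch below uses a throwaway database and a hypothetical samples table, not gua's real schema: one connection holds an uncommitted write, as a daemon might mid-tick, while a second connection reads.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "example.db")

# Writer side: enable WAL, commit one row, then leave a write transaction open.
writer = sqlite3.connect(db_path, isolation_level=None)  # autocommit mode
mode = writer.execute("PRAGMA journal_mode=WAL").fetchone()[0]
writer.execute("CREATE TABLE samples (gpu INTEGER, util INTEGER)")  # hypothetical schema
writer.execute("INSERT INTO samples VALUES (0, 5)")
writer.execute("BEGIN IMMEDIATE")
writer.execute("INSERT INTO samples VALUES (1, 80)")  # not yet committed

# Reader side: a separate connection is not blocked and sees committed rows only.
reader = sqlite3.connect(db_path)
committed_rows = reader.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
print(mode, committed_rows)  # wal 1

writer.execute("COMMIT")
rows_after_commit = reader.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
print(rows_after_commit)  # 2

reader.close()
writer.close()
```

In WAL mode the reader works from the last committed snapshot, so a report generated mid-write simply omits the tick still in flight.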

Commands

Command        Description
gua doctor     Check local NVIDIA/NVML readiness and DB path status
gua daemon     Start background collection on the local NVIDIA host
gua start      Alias for gua daemon
gua status     Show whether the managed background collector is running
gua stop       Stop the managed background collector
gua report     Render the retrospective report from SQLite
gua demo       Generate a fake local report without a GPU
gua version    Print version

Important Options

gua daemon [--db PATH] [--interval D] [--pid-file PATH] [--log-file PATH]
gua daemon --foreground [--db PATH] [--interval D]
gua report [--db PATH] [--since D] [--interval D] [--width N]
gua demo [--db PATH] [--ticks N] [--interval D]
  • --interval on daemon controls sampling cadence. Default: 30s.
  • --interval on report is optional. New DB rows use the interval recorded by each daemon run. Use report --interval D only as an override or for legacy rows without interval metadata.
  • --since accepts ms, s, m, h, and d, with no upper bound.
  • --foreground is intended for systemd and debugging.
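
As an illustration of the duration syntax used by --since and --interval, the following sketch parses a single integer followed by one of the listed units. Whether gua also accepts decimals or compound forms such as 1h30m is not stated, so this parser rejects them; parse_duration is a hypothetical helper, not part of gua.

```python
import re

# Unit sizes in milliseconds, covering the suffixes listed above.
_UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}
_DURATION = re.compile(r"(\d+)(ms|s|m|h|d)")


def parse_duration(text: str) -> float:
    """Convert a string such as '30s' or '1h' into seconds."""
    match = _DURATION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_MS[unit] / 1000


print(parse_duration("30s"))  # 30.0
print(parse_duration("1h"))  # 3600.0
print(parse_duration("500ms"))  # 0.5
```

Because the sketch places no cap on the integer part, very long windows such as 365d parse the same way as short ones.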

Demo Without a GPU

gua demo

The demo records deterministic fake telemetry and immediately prints the report shape.

Systemd Example

[Unit]
Description=gua daemon
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/gua daemon --foreground --db /var/lib/gua/gua.db --interval 30s
Restart=on-failure
User=gua

[Install]
WantedBy=multi-user.target

Save the unit as gpu-usage-audit.service so that its name matches the command below, then run:

systemctl enable --now gpu-usage-audit

Classification Rules

Each daemon tick records per-card utilization and per-process GPU memory. The report classifies each GPU card at each tick with these rules:

util >= 10                  -> active
util <  10 AND mem >  100   -> idle-held
util <  10 AND mem <= 100   -> truly-idle

The 100 MB threshold absorbs runtime baselines such as importing PyTorch or TensorFlow.
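
The three rules translate directly into a small function. This is a sketch rather than gua's actual code: the classify name and constants are illustrative, and it assumes util is a percentage and mem is the process-held GPU memory on that card in MB.

```python
ACTIVE_UTIL_PERCENT = 10  # util >= 10 counts as active
HELD_MEMORY_MB = 100  # mem > 100 counts as held


def classify(util_percent: float, process_memory_mb: float) -> str:
    """Classify one GPU card at one daemon tick."""
    if util_percent >= ACTIVE_UTIL_PERCENT:
        return "active"
    if process_memory_mb > HELD_MEMORY_MB:
        return "idle-held"
    return "truly-idle"


# A notebook at 1% utilization that keeps an 8 GB tensor allocated.
print(classify(1, 8192))  # idle-held
print(classify(55, 8192))  # active
print(classify(0, 100))  # truly-idle
```

Utilization is checked first, so a busy card is always active regardless of memory; the memory threshold only separates the two low-utilization states.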

Development

git clone https://github.com/AI-Ocean/gpu-usage-audit
cd gpu-usage-audit
uv sync
uv run python -m pytest
uv run ruff check
uv run ruff format --check
uv run python -m mypy
uv run gua demo

CI runs ruff, format check, mypy, pytest, build, and wheel smoke tests. Tag pushes (v*) build release assets and publish to PyPI through Trusted Publishing.

Non-goals

This is a single-host retrospective tool. Live dashboards, multi-host aggregation, quotas, Kubernetes cluster scans, Slurm joins, Docker/Podman runtime fallback, and pod-name resolution are outside the bare-metal 1.0 scope.

The Go v0.1.0 implementation remains available at tag v0.1.0 and branch go-archive.

License

Apache License 2.0. See LICENSE.
