AI model automation and benchmarking platform for local and distributed execution
madengine is a modern CLI tool for running Large Language Models (LLMs) and Deep Learning models across local and distributed environments. Built for the MAD (Model Automation and Dashboarding) ecosystem, it provides seamless execution from single GPUs to multi-node clusters.
- Key Features
- Quick Start
- Commands
- Documentation
- Architecture
- Feature Matrix
- Usage Examples
- Model Discovery
- Performance Profiling
- Reporting and Database
- Installation
- Tips & Best Practices
- Contributing
- License
- Links & Resources
- 🚀 Modern CLI - Rich terminal output with Typer and Rich
- 🎯 Simple Deployment - Run locally or deploy to Kubernetes/SLURM via configuration
- 🔧 Distributed Launchers - Full support for torchrun, DeepSpeed, Megatron-LM, TorchTitan, vLLM, SGLang
- 🐳 Container-Native - Docker-based execution with GPU support (ROCm, CUDA)
- 📂 ROCm Path - Support for non-default ROCm installs via `--rocm-path` or `ROCM_PATH` (e.g. TheRock, pip)
- 📊 Performance Tools - Integrated profiling with rocprof/rocprofv3, rocBLAS, MIOpen, RCCL tracing
- 🎯 ROCprofv3 Profiles - 8 pre-configured profiles for compute/memory/communication bottleneck analysis
- 🔍 Environment Validation - TheRock ROCm detection and validation tools
- ⚙️ Intelligent Defaults - Minimal K8s configs with automatic preset application
- 📋 Configurable Log Scan - Optional `--additional-context` keys to disable or tune post-run log substring checks (see Log error pattern scan)
```shell
# Install madengine
pip install git+https://github.com/ROCm/madengine.git

# Clone MAD package (required for models)
git clone https://github.com/ROCm/MAD.git && cd MAD

# Discover available models
madengine discover --tags dummy

# Run locally (full workflow: discover/build/run as configured by the model)
madengine run --tags dummy

# Or with explicit configuration
madengine run --tags dummy \
  --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
```

Note: For build operations, `gpu_vendor` defaults to `AMD` and `guest_os` defaults to `UBUNTU` if not specified. For production deployments or non-AMD/Ubuntu environments, specify these values explicitly.

If ROCm is not installed under `/opt/rocm` (e.g. TheRock or pip install), use `--rocm-path` or set `ROCM_PATH`:

```shell
madengine run --tags dummy --rocm-path /path/to/rocm
# or: export ROCM_PATH=/path/to/rocm && madengine run --tags dummy
```

Results: Performance data is written to `perf.csv` (and optionally `perf_entry.csv`). The file is created automatically if missing. Failed runs (including pre-run setup failures) are recorded with status `FAILURE` so every attempted model appears in the table. See Exit Codes for CI/script usage.
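Because every attempted model lands in `perf.csv` with a status, CI can post-process the file with a few lines of Python. A minimal sketch — the `model`/`status` column names are assumptions here, so check your actual `perf.csv` header before relying on them:

```python
import csv
import io

# Hypothetical perf.csv contents -- column names are an assumption,
# inspect your real file before depending on them.
SAMPLE = """model,status,performance
dummy,SUCCESS,123.4
dummy2,FAILURE,
"""

def summarize(csv_text: str) -> dict:
    """Count runs per status so a pipeline can flag FAILURE rows."""
    counts: dict = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["status"]] = counts.get(row["status"], 0) + 1
    return counts

print(summarize(SAMPLE))  # {'SUCCESS': 1, 'FAILURE': 1}
```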
madengine provides five main commands for model automation and benchmarking:
| Command | Description | Use Case |
|---|---|---|
| discover | Find available models | Model exploration and validation |
| build | Build Docker images | Create containerized models |
| run | Execute models | Local and distributed execution |
| report | Generate HTML reports | Convert CSV to viewable reports |
| database | Upload to MongoDB | Store results in database |
Quick Start:

```shell
# Discover models
madengine discover --tags dummy

# Build image (uses AMD/UBUNTU defaults)
madengine build --tags dummy

# Run model
madengine run --tags dummy

# For non-AMD/Ubuntu environments, specify explicitly:
# madengine build --tags dummy --additional-context '{"gpu_vendor": "NVIDIA", "guest_os": "CENTOS"}'

# Generate report
madengine report to-html --csv-file perf_entry.csv

# Upload results
madengine database --csv-file perf_entry.csv --db mydb --collection results
```

For detailed command options, see the CLI Command Reference.
| Guide | Description |
|---|---|
| Installation | Complete installation instructions |
| Usage Guide | Commands, workflows, and examples (`--skip-model-run`) |
| CLI Reference | Detailed command options and examples |
| Deployment | Kubernetes and SLURM deployment |
| Configuration | Advanced options; run log error pattern scan |
| Batch Build | Selective builds for CI/CD |
| Launchers | Distributed training frameworks |
| Profiling | Performance analysis tools |
| Contributing | How to contribute |
```
┌────────────────────────────────────────────────────────────────────────┐
│                   madengine CLI v2.0 (Typer + Rich)                    │
│               discover │ build │ run │ report │ database               │
└────────────────────────────────────────────────────────────────────────┘
      │              │              │
      │              │              ▼
      │              │   ┌───────────────── Orchestration Layer ─────────────────┐
      │              │   │ Model Discovery (models.json / scripts/ get_models)   │
      │              └──▶│ BuildOrchestrator · RunOrchestrator                   │
      └─────────────────▶└──────┬────────────────────────────────────────────────┘
                                │
┌───────────────────────────────┼─── Infrastructure Layer ─────────────────────┐
│             ▼                 ▼                 ▼                            │
│      ┌──────────────┐  ┌──────────────┐  ┌──────────────┐                    │
│      │    Local     │  │  Kubernetes  │  │    SLURM     │                    │
│      │    Docker    │  │     Jobs     │  │     Jobs     │                    │
│      └──────┬───────┘  └──────┬───────┘  └──────┬───────┘                    │
│             └─────────────────┼─────────────────┘                            │
└───────────────────────────────┼──────────────────────────────────────────────┘
                                ▼
┌────────────────── Launcher Layer (Distribution) ─────────────────────────────┐
│   Train: torchrun · DeepSpeed · TorchTitan · Megatron-LM                     │
│   Infer: vLLM · SGLang · SGLang Disagg                                       │
└───────────────────────────────┬──────────────────────────────────────────────┘
                                ▼
                 ┌─────────────────────────────┐
                 │   Performance (CSV/JSON)    │
                 └─────────────┬───────────────┘
                               │
            ┌──────────────────┴──────────────────┐
            ▼                                     ▼
  ┌───────────────────┐               ┌──────────────────┐
  │      report       │               │     database     │
  │ to-html, to-email │               │  MongoDB upload  │
  └───────────────────┘               └──────────────────┘
```
Component Flow:
- CLI Layer - User interface with 5 commands (discover, build, run, report, database)
- Model Discovery - Find and validate models from MAD package
- Orchestration - BuildOrchestrator & RunOrchestrator manage workflows
- Execution Targets - Local Docker, Kubernetes Jobs, or SLURM Jobs
- Distributed Launchers - Training (torchrun, DeepSpeed, TorchTitan, Megatron-LM) and Inference (vLLM, SGLang)
- Performance Output - CSV/JSON results with metrics
- Post-Processing - Report generation (HTML/Email) and database upload (MongoDB)
| Launcher | Local | Kubernetes | SLURM | Type | Key Features |
|---|---|---|---|---|---|
| torchrun | ✅ | ✅ | ✅ | Training | PyTorch DDP/FSDP, elastic training |
| DeepSpeed | ✅ | ✅ | ✅ | Training | ZeRO optimization, pipeline parallelism |
| Megatron-LM | ✅ | ✅ | ✅ | Training | Tensor+Pipeline parallel, large transformers |
| TorchTitan | ✅ | ✅ | ✅ | Training | FSDP2+TP+PP+CP, Llama 3.1 (8B-405B) |
| vLLM | ✅ | ✅ | ✅ | Inference | v1 engine, PagedAttention, Ray cluster |
| SGLang | ✅ | ✅ | ✅ | Inference | RadixAttention, structured generation |
| SGLang Disagg | ❌ | ✅ | ✅ | Inference | Disaggregated prefill/decode, Mooncake, 3+ nodes |
Note: All launchers support single-GPU, multi-GPU (single node), and multi-node (where infrastructure allows). See Launchers Guide for details.
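When sizing a multi-node job, the total process count implied by a `distributed` block follows the torchrun convention: nnodes × nproc_per_node. A small illustrative helper — not part of the madengine API, and the defaults of 1 are an assumption:

```python
def world_size(distributed: dict) -> int:
    """Total worker processes implied by a distributed config block.

    Follows the torchrun convention: nnodes * nproc_per_node.
    The defaults of 1 are illustrative assumptions, not madengine's own.
    """
    return int(distributed.get("nnodes", 1)) * int(distributed.get("nproc_per_node", 1))

# 2 nodes x 4 processes per node -> 8 ranks
print(world_size({"launcher": "vllm", "nnodes": 2, "nproc_per_node": 4}))  # 8
```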
| Launcher | Tensor Parallel (TP) | Pipeline Parallel (PP) | Data Parallel (DP) | Context Parallel (CP) | FSDP/ZeRO | Expert Parallel (EP) | Primary Use Case |
|---|---|---|---|---|---|---|---|
| torchrun | ❗Manual | ❌No | ❗Manual (DDP) | ❌No | ❗Manual (FSDP) | ❌No | General distributed training |
| TorchTitan | ✅Auto | ✅Auto | ✅Auto (FSDP2) | ❗Manual | ✅Auto (FSDP2) | ❌No | Large-scale LLM pre-training |
| DeepSpeed | ❗Manual | ❗Manual | ✅Auto (ZeRO) | ❌No | ✅Auto (ZeRO) | ❌No | Memory-efficient training |
| Megatron-LM | ✅Auto | ✅Auto | ✅Implicit | ✅Auto | ❌No | ❌No | Large transformer training |
| vLLM | ✅Auto | SLURM: ✅Auto (Multi) / K8s: ❗Disabled | ✅Auto (Replicas) | ❌No | ❌No | ❗Manual | High-throughput inference |
| SGLang | ✅Auto | SLURM: ✅Auto (Multi) / K8s: ❗Disabled | ❗Limited | ❌No | ❌No | ❌No | Inference + structured gen |
| SGLang PD Disagg | ✅Auto | ❌No | ✅Role-based | ❌No | ❌No | ❌No | Optimized prefill/decode |
Legend: ✅Auto = supported and configured by madengine; ❗Manual = supported by launcher but requires user configuration; ❗Limited / ❗Disabled = launcher or platform limitation. See Launchers Guide and Configuration for details.
| Feature | Local | Kubernetes | SLURM |
|---|---|---|---|
| Execution | Docker containers | K8s Jobs | SLURM jobs |
| Multi-Node | ❌ | ✅ Indexed Jobs | ✅ Job arrays |
| Resource Mgmt | Manual | Declarative (YAML) | Batch scheduler |
| Monitoring | Docker logs | kubectl/dashboard | squeue/scontrol |
| Auto-scaling | ❌ | ✅ | ❌ |
| Network | Host | CNI plugin | InfiniBand/Ethernet |
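Conceptually, the execution target follows from which keys appear in `--additional-context`: a `k8s` block selects Kubernetes, a `slurm` block selects SLURM, and otherwise execution stays on local Docker. A toy illustration of that dispatch — this is a sketch of the documented behavior, not madengine's actual code:

```python
def select_target(context: dict) -> str:
    """Illustrative dispatch: pick an execution target from context keys."""
    if "k8s" in context:
        return "kubernetes"
    if "slurm" in context:
        return "slurm"
    return "local"

print(select_target({"k8s": {"gpu_count": 2}}))  # kubernetes
print(select_target({"gpu_vendor": "AMD"}))      # local
```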
```shell
# Single GPU
madengine run --tags dummy \
  --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

# Multi-GPU with torchrun (DDP/FSDP)
madengine run --tags model \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "docker_gpus": "0,1,2,3",
    "distributed": {
      "launcher": "torchrun",
      "nproc_per_node": 4
    }
  }'

# With DeepSpeed (ZeRO optimization)
madengine run --tags model \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "docker_gpus": "all",
    "distributed": {
      "launcher": "deepspeed",
      "nproc_per_node": 8
    }
  }'
```

```shell
# Minimal config (auto-defaults applied)
madengine run --tags model \
  --additional-context '{"k8s": {"gpu_count": 2}}'

# Multi-node inference with vLLM
madengine run --tags model \
  --additional-context '{
    "k8s": {
      "namespace": "ml-team",
      "gpu_count": 8
    },
    "distributed": {
      "launcher": "vllm",
      "nnodes": 2,
      "nproc_per_node": 4
    }
  }'

# SGLang with structured generation
madengine run --tags model \
  --additional-context '{
    "k8s": {"gpu_count": 4},
    "distributed": {
      "launcher": "sglang",
      "nproc_per_node": 4
    }
  }'
```

```shell
# Build phase (local or CI)
madengine build --tags model \
  --registry gcr.io/myproject \
  --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

# Deploy phase (on SLURM login node)
madengine run --manifest-file build_manifest.json \
  --additional-context '{
    "slurm": {
      "partition": "gpu",
      "nodes": 4,
      "gpus_per_node": 8,
      "time": "24:00:00"
    },
    "distributed": {
      "launcher": "torchtitan",
      "nnodes": 4,
      "nproc_per_node": 8
    }
  }'
```

To run on specific nodes, set `nodelist` (comma-separated node names). When set, the job is restricted to those nodes and the automatic node health preflight is skipped. Example: `"slurm": { "nodelist": "node01,node02", "nodes": 2, ... }`. See Configuration and examples/slurm-configs/basic/03-multi-node-basic-nodelist.json.
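Rather than hand-writing long JSON on the command line, a SLURM context like the one above can be generated in Python and fed to the CLI. A sketch using only keys shown in this section; the node names and the `$(cat context.json)` shell wiring in the comment are illustrative assumptions:

```python
import json

# Keys mirror the SLURM example above; values are placeholders.
context = {
    "slurm": {
        "partition": "gpu",
        "nodelist": "node01,node02",  # restricts the job; node health preflight is skipped
        "nodes": 2,
        "gpus_per_node": 8,
    },
    "distributed": {"launcher": "torchrun", "nnodes": 2, "nproc_per_node": 8},
}

# Write to a file, then e.g.:
#   madengine run --tags model --additional-context "$(cat context.json)"
print(json.dumps(context, indent=2))
```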
Development → Testing → Production:

```shell
# 1. Develop locally with single GPU
madengine run --tags model \
  --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

# 2. Test multi-GPU locally
madengine run --tags model \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "docker_gpus": "0,1",
    "distributed": {"launcher": "torchrun", "nproc_per_node": 2}
  }'

# 3. Build and push to registry
madengine build --tags model \
  --registry docker.io/myorg \
  --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

# 4. Deploy to Kubernetes
madengine run --manifest-file build_manifest.json
```

CI/CD Pipeline:

```shell
# Batch build (selective rebuilds)
madengine build --batch-manifest batch.json \
  --registry docker.io/myorg

# Run tests
madengine run --manifest-file build_manifest.json \
  --additional-context '{"k8s": {"namespace": "ci-test"}}'

# Generate and email reports
madengine report to-email --directory ./results --output ci_report.html

# Upload to database
madengine database --csv-file perf_entry.csv \
  --database-name ci_db --collection-name test_results
```

See Usage Guide, Configuration Guide, and CLI Reference for more examples.
```shell
# Build single model
madengine build --tags dummy \
  --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

# Build with registry (for distributed deployment)
madengine build --tags model1 model2 \
  --registry localhost:5000 \
  --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

# Build for multiple GPU architectures
madengine build --tags model \
  --target-archs gfx908 gfx90a gfx942 \
  --registry gcr.io/myproject

# Batch build mode (selective builds for CI/CD)
madengine build --batch-manifest examples/build-manifest/batch.json \
  --registry docker.io/myorg

# Clean rebuild (no Docker cache)
madengine build --tags model --clean-docker-cache \
  --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
```

Output: Creates `build_manifest.json` with built image names and configurations.
See Batch Build Guide and examples in examples/build-manifest/.
madengine discovers models from the MAD package using three methods:

```shell
# Root models (models.json)
madengine discover --tags pyt_huggingface_bert

# Directory-specific (scripts/{dir}/models.json)
madengine discover --tags dummy2:dummy_2

# Dynamic with parameters (scripts/{dir}/get_models_json.py)
madengine discover --tags dummy3:dummy_3:batch_size=512
```

madengine includes integrated profiling tools for AMD ROCm:
```shell
# GPU profiling with rocprof
madengine run --tags model \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "tools": [{"name": "rocprof"}]
  }'

# ROCprofv3 (ROCm 7.0+) - Advanced profiling with pre-configured profiles
madengine run --tags model \
  --additional-context '{"tools": [{"name": "rocprofv3_compute"}]}'

# Use configuration files for complex setups
madengine run --tags model \
  --additional-context-file examples/profiling-configs/rocprofv3_multi_gpu.json

# Library tracing (rocBLAS, MIOpen, Tensile, RCCL)
madengine run --tags model \
  --additional-context '{"tools": [{"name": "rocblas_trace"}]}'

# Power and VRAM monitoring
madengine run --tags model \
  --additional-context '{"tools": [
    {"name": "gpu_info_power_profiler"},
    {"name": "gpu_info_vram_profiler"}
  ]}'

# Multiple tools (stackable)
madengine run --tags model \
  --additional-context '{"tools": [
    {"name": "rocprofv3_memory"},
    {"name": "rocblas_trace"},
    {"name": "gpu_info_power_profiler"}
  ]}'
```

Available Tools:
| Tool | Purpose | Output |
|---|---|---|
| `rocprof` | GPU kernel profiling | Kernel timings, occupancy |
| `rocprofv3_compute` | Compute-bound analysis (ROCm 7.0+) | ALU metrics, wave execution |
| `rocprofv3_memory` | Memory-bound analysis (ROCm 7.0+) | Cache hits, bandwidth |
| `rocprofv3_communication` | Multi-GPU communication (ROCm 7.0+) | RCCL traces, inter-GPU transfers |
| `rocprofv3_lightweight` | Minimal overhead profiling (ROCm 7.0+) | HIP and kernel traces |
| `rocblas_trace` | rocBLAS library calls | Function calls, arguments |
| `miopen_trace` | MIOpen library calls | Conv/pooling operations |
| `tensile_trace` | Tensile GEMM library | Matrix multiply details |
| `rccl_trace` | RCCL collective ops | Communication patterns |
| `gpu_info_power_profiler` | GPU power consumption | Power usage over time |
| `gpu_info_vram_profiler` | GPU memory usage | VRAM utilization |
| `therock_check` | TheRock ROCm validation | Installation detection |
ROCprofv3 Profiles (ROCm 7.0+):
madengine provides 8 pre-configured ROCprofv3 profiles for different bottleneck scenarios:
- `rocprofv3_compute` - Compute-bound workloads (transformers, dense ops)
- `rocprofv3_memory` - Memory-bound workloads (large batches, high-res)
- `rocprofv3_communication` - Multi-GPU distributed training
- `rocprofv3_full` - Comprehensive profiling (all metrics, high overhead)
- `rocprofv3_lightweight` - Minimal overhead (production-friendly)
- `rocprofv3_perfetto` - Perfetto UI compatible traces
- `rocprofv3_api_overhead` - API call timing analysis
- `rocprofv3_pc_sampling` - Kernel hotspot identification
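Profile choice usually starts from a suspected bottleneck. The profile names below are the real ones listed above; the mapping helper itself is only an illustrative sketch, not part of madengine:

```python
# Map a suspected bottleneck to one of the pre-configured profiles above.
PROFILE_BY_BOTTLENECK = {
    "compute": "rocprofv3_compute",
    "memory": "rocprofv3_memory",
    "communication": "rocprofv3_communication",
}

def tools_context(bottleneck: str) -> dict:
    """Return an --additional-context fragment enabling the matching profile.

    Falls back to the low-overhead profile while you investigate.
    """
    name = PROFILE_BY_BOTTLENECK.get(bottleneck, "rocprofv3_lightweight")
    return {"tools": [{"name": name}]}

print(tools_context("memory"))  # {'tools': [{'name': 'rocprofv3_memory'}]}
```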
See examples/profiling-configs/ for ready-to-use configuration files.
TheRock Validation:
```shell
# Validate TheRock installation (AMD's pip-based ROCm)
madengine run --tags dummy_therock \
  --additional-context '{"tools": [{"name": "therock_check"}]}'
```

See Profiling Guide for detailed usage and analysis.
Convert performance CSV files to HTML reports:
```shell
# Single CSV to HTML
madengine report to-html --csv-file perf_entry.csv

# Consolidated email report (all CSVs in directory)
madengine report to-email --directory ./results --output summary.html
```

Store performance results in MongoDB:
```shell
# Set MongoDB connection
export MONGO_HOST=mongodb.example.com
export MONGO_PORT=27017
export MONGO_USER=myuser
export MONGO_PASSWORD=mypassword

# Upload CSV to MongoDB
madengine database --csv-file perf_entry.csv \
  --database-name performance_db \
  --collection-name model_runs
```

Use Cases:
- Track performance over time
- Compare results across different configurations
- Build performance dashboards
- Automated CI/CD reporting
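Before rows reach MongoDB they are effectively documents keyed by the CSV header. A stdlib-only sketch of that shaping step — the actual upload goes through `madengine database`; the column names and the extra `run_id` field here are assumptions for illustration:

```python
import csv
import io

# Hypothetical CSV contents; column names are an assumption.
SAMPLE = """model,status,performance
dummy,SUCCESS,123.4
"""

def rows_to_documents(csv_text: str, run_id: str) -> list:
    """Turn CSV rows into dicts ready for a document store, tagging each run.

    A real pipeline would then insert these with a MongoDB client.
    """
    docs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        doc = dict(row)
        doc["run_id"] = run_id  # extra field for grouping dashboard queries
        docs.append(doc)
    return docs

docs = rows_to_documents(SAMPLE, run_id="ci-1234")
print(docs[0]["model"], docs[0]["run_id"])  # dummy ci-1234
```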
See CLI Reference for complete options.
```shell
# Basic installation
pip install git+https://github.com/ROCm/madengine.git

# With Kubernetes support
pip install "madengine[kubernetes] @ git+https://github.com/ROCm/madengine.git"

# Development installation
git clone https://github.com/ROCm/madengine.git
cd madengine && pip install -e ".[dev]"
```

See Installation Guide for detailed instructions.
- Use configuration files for complex setups instead of long command lines
- Test locally first with a single GPU before scaling to multi-node
- Enable verbose logging (`--verbose`) when debugging issues
- Use `--live-output` for real-time monitoring of long-running operations
After a local Docker run, madengine can scan the captured run log for common failure substrings (for example `RuntimeError:`, `CUDA out of memory`, `Traceback`). This helps catch hard failures when exit codes are ambiguous, but some workloads log benign `RuntimeError:` text while tests still pass.

- Disable the scan when another signal is authoritative (e.g. pytest/JUnit inside the image): set `"log_error_pattern_scan": false` in `--additional-context` or in the model entry in `models.json`. See Configuration — Run phase: log error pattern scan.
- Extend exclusions with `log_error_benign_patterns` (list of strings), or replace the default pattern list with `log_error_patterns` (non-empty list of strings) for advanced cases.
- Exit codes: The CLI uses fixed exit codes (`ExitCode` in `madengine.cli.constants`, e.g. `SUCCESS=0`, `RUN_FAILURE=3`, `INVALID_ARGS=4`). Pipelines should treat non-zero as failure; no log scraping is required for pass/fail.
- Streaming: In Jenkins, avoid redirecting stdout only to a file (`> file`) without `tee` if you want the console to update during the run. Prefer `... 2>&1 | tee madengine.run.log` with `bash -o pipefail` so the step exit code still comes from `madengine`.
- Unbuffered Python: If output still appears in chunks, set `PYTHONUNBUFFERED=1` (or use `python -u`) for the `madengine` process.
- Separate build and run phases for distributed deployments
- Build without executing: `madengine run --tags … --skip-model-run` skips container execution after a build in that same invocation (ignored when using an existing `--manifest-file`). See Usage — Skip model run after build.
- Use registries for multi-node execution (K8s/SLURM)
- Use batch build mode for CI/CD to optimize build times
- Specify `--target-archs` when building for multiple GPU architectures
- Start with small timeouts and increase as needed
- Use profiling tools to identify bottlenecks
- Monitor GPU utilization with `gpu_info_power_profiler`
- Profile library calls with rocBLAS/MIOpen tracing
madengine uses consistent exit codes for scripts and CI (e.g. Jenkins): 0 = success, 1 = general failure, 2 = build failure, 3 = one or more run failures, 4 = invalid arguments. Failed runs are still written to perf.csv with status FAILURE. See CLI Reference — Exit Codes for the full table and examples.
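In CI, the exit code alone determines pass/fail. A sketch of a wrapper that maps the documented codes to messages — the code-to-meaning table comes from the text above, while the helper and the commented subprocess wiring are illustrative:

```python
# Exit codes documented for madengine (see CLI Reference -- Exit Codes).
EXIT_MESSAGES = {
    0: "success",
    1: "general failure",
    2: "build failure",
    3: "one or more run failures",
    4: "invalid arguments",
}

def classify(returncode: int) -> str:
    """Human-readable meaning of a madengine exit code."""
    return EXIT_MESSAGES.get(returncode, f"unknown exit code {returncode}")

# Example wiring (requires madengine on PATH; shown for illustration only):
#   import subprocess
#   result = subprocess.run(["madengine", "run", "--tags", "dummy"])
#   print(classify(result.returncode))
print(classify(3))  # one or more run failures
```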
```shell
# Check model is available
madengine discover --tags your_model

# Verbose output for debugging
madengine run --tags model --verbose --live-output

# Keep container alive for inspection
madengine run --tags model --keep-alive

# Clean rebuild if build fails
madengine build --tags model --clean-docker-cache --verbose
```

ROCm not in /opt/rocm: If you use a custom ROCm location (e.g. TheRock or pip), set `ROCM_PATH` or pass `--rocm-path` to `madengine run` so GPU detection and the container environment use the correct paths.
Common Issues:

- False failures with profiling: If models show FAILURE but have performance metrics, see Profiling Troubleshooting
- False failures from `RuntimeError:` in logs: If the workload logs expected exception text but tests pass, disable or tune the scan with `log_error_pattern_scan` / `log_error_benign_patterns` — see Configuration
- ROCProf log errors: Messages like `E20251230` are informational logs, not errors (fixed in v2.0+)
- Configuration errors: Validate JSON with `python -m json.tool your-config.json`
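Beyond `python -m json.tool`, a few structural checks catch the most common context mistakes early. An illustrative validator — the rules it enforces (top-level object, `launcher` inside `distributed`) are assumptions based on the examples in this README, not madengine's own schema:

```python
import json

def validate_context(text: str) -> list:
    """Return a list of problems found in an --additional-context string."""
    try:
        ctx = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(ctx, dict):
        return ["top level must be a JSON object"]
    problems = []
    dist = ctx.get("distributed")
    # Every distributed example in this README names a launcher.
    if dist is not None and "launcher" not in dist:
        problems.append("distributed block is missing 'launcher'")
    return problems

print(validate_context('{"distributed": {"nnodes": 2}}'))
# ["distributed block is missing 'launcher'"]
```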
We welcome contributions! See Contributing Guide for details.
```shell
git clone https://github.com/ROCm/madengine.git
cd madengine
python3 -m venv venv && source venv/bin/activate
pip install -e ".[dev]"

# Run all tests
pytest

# Run specific test module
pytest tests/unit/test_error_handling.py -v

# Run error pattern tests
pytest tests/unit/test_error_handling.py::TestErrorPatternMatching -v
```

MIT License - see LICENSE file for details.
- CLI Reference - Complete command options
- Usage Guide - Workflows and examples
- Deployment Guide - Kubernetes/SLURM deployment
- Configuration Guide - Advanced configuration
- All Docs - Complete documentation index
- MAD Package: https://github.com/ROCm/MAD
- Issues & Support: https://github.com/ROCm/madengine/issues
- ROCm Documentation: https://rocm.docs.amd.com/
Command Help:

```shell
madengine --help                  # Main help
madengine <command> --help        # Command-specific help
madengine report --help           # Sub-app help
madengine report to-html --help   # Sub-command help
```

Quick Checks:
```shell
# Verify installation
madengine --version

# Discover available models
madengine discover

# Check specific model
madengine discover --tags your_model --verbose
```

Troubleshooting:
- Check CLI Reference for all command options
- Enable the `--verbose` flag for detailed error messages
- See the Usage Guide troubleshooting section
- Report issues: https://github.com/ROCm/madengine/issues
The CLI has been unified! Starting from v2.0.0:

- ✅ Use `madengine` (unified modern CLI with K8s, SLURM, and distributed support)
- ❌ The legacy v1.x CLI has been removed
Code Quality: Clean codebase with no dead code, comprehensive test coverage, and following Python best practices.