advpropsys/rustwood: a histogram-based, oblivious-tree gradient-boosting library for NVIDIA GPUs, with experimental CPU support. Inference is up to 1000x faster than LightGBM and XGBoost.

rustwood — GPU gradient boosting with every kernel in pure Rust

Highlights · Results · How it works · Build & run · Features

rustwood is a histogram-based, oblivious-tree gradient-boosting library for NVIDIA GPUs (XGBoost-style histograms, CatBoost-style symmetric trees). Every GPU kernel is written in pure Rust and compiled to PTX by NVlabs cuda-oxide — no CUDA C, no FFI shims. Built and benchmarked on a Blackwell B300 (sm_103).

Oblivious trees use one (feature, threshold) split per depth level, so a depth-D tree has D splits and 2^D leaves. Split-finding is a single argmax per level, and inference is branchless, which is the source of rustwood's order-of-magnitude faster predictions.
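To make the "D splits, 2^D leaves" structure concrete, here is a minimal Python sketch of oblivious-tree scoring. It is an illustration only, not rustwood's kernel: the class name, the `>` comparison direction, and the bit order of the leaf index are assumptions.

```python
from typing import Sequence


class ObliviousTree:
    """Depth-D symmetric tree: one (feature, threshold) pair per level (hypothetical layout)."""

    def __init__(self, features: Sequence[int], thresholds: Sequence[float],
                 leaves: Sequence[float]) -> None:
        assert len(features) == len(thresholds)
        assert len(leaves) == 1 << len(features)  # D splits give 2^D leaves
        self.features = list(features)
        self.thresholds = list(thresholds)
        self.leaves = list(leaves)

    def leaf_index(self, row: Sequence[float]) -> int:
        idx = 0
        for f, t in zip(self.features, self.thresholds):
            # Each level appends one bit; the comparison result is used as data,
            # not as control flow, so no per-node branching is needed.
            idx = (idx << 1) | int(row[f] > t)
        return idx

    def predict(self, row: Sequence[float]) -> float:
        return self.leaves[self.leaf_index(row)]


def predict_ensemble(trees, row, base=0.0):
    """Sum the leaf values of all trees for one row."""
    return base + sum(tree.predict(row) for tree in trees)


# A depth-2 tree: level 0 tests feature 0 > 0.5, level 1 tests feature 1 > 2.0.
tree = ObliviousTree([0, 1], [0.5, 2.0], [10.0, 20.0, 30.0, 40.0])
print(tree.leaf_index([0.7, 1.0]))  # bits (1, 0) -> leaf 2
print(tree.predict([0.1, 3.0]))     # bits (0, 1) -> leaf 1 -> 20.0
```

Because every row runs the same D comparisons regardless of its values, the loop maps naturally onto SIMD lanes or GPU threads without divergence.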

Try it → notebooks/rustwood_vs_lightgbm.ipynb, a Colab demo on an sklearn dataset: build, train, compare with LightGBM, and .rwood save/load.

Highlights

Fastest training of the GPU boosters tested — 1.5–6× faster than XGBoost-GPU and 5–30× faster than LightGBM-GPU at matched hyperparameters.
40–800× faster inference — branchless oblivious scoring, ~0.6 ns/row at peak (≈1.7 billion rows/s) and 289 ns single-row latency on the CPU path.
Competitive accuracy under fair benchmarking (warm timing, per-library learning-rate tuning, native categoricals): wins Titanic and Credit-G, ties Bank-Marketing and Adult, trails on large all-numeric Covertype.
Fully async on one CUDA stream — gradients, histograms, split selection, leaf values, and the model all live on the GPU; the host syncs once at the end.
Tiny: ~3.2k lines of Rust against XGBoost's 86.7k and LightGBM's 63k of C++/CUDA (~20–27× smaller). GPU and CPU in one codebase; pure-Rust kernels, no separate CUDA C++.
Instant model I/O — the .rwood binary saves in 0.17 ms and loads in 0.044 ms (500 trees), 88–1600× faster and 16–24× smaller than XGBoost/LightGBM.
GPU-free --device cpu — a rayon CPU trainer, bit-identical to the GPU path, and the fastest CPU oblivious-tree trainer measured.
Target encoding, monotonic constraints, GOSS, feature importance, PGBM prediction intervals, 4-bit bin packing, and f16/int atomic histograms.

Results

Tabular datasets — test AUC, depth 6 / 200 trees

Five OpenML/sklearn datasets under strict fairness controls (scripts/dataset_bench.py): warm timing (cold-start paid once up front), per-library learning-rate tuning on a held-out validation split, and each library's native categorical handling (XGBoost enable_categorical, LightGBM category dtype, rustwood out-of-fold target encoding). Missing values imputed identically.

dataset	features (categorical)	rustwood	XGBoost-GPU	LightGBM-GPU	train time (rustwood / XGBoost / LightGBM)
Titanic	7 (2)	0.9010	0.8828	0.8915	0.16 / 0.30 / 0.75 s
Credit-G	20 (13)	0.7930	0.7431	0.7435	0.18 / 0.57 / 0.72 s
Bank-Marketing	16 (9)	0.9322	0.9343	0.9346	0.17 / 0.50 / 1.34 s
Adult	14 (8)	0.9289	0.9299	0.9294	0.17 / 0.57 / 1.21 s
Covertype 581k	54 (0)	0.9473	0.9731	0.9719	0.39 / 0.42 / 1.82 s

With every library tuned and using native categoricals, rustwood wins the categorical and small tabular sets (Titanic, Credit-G), ties Bank-Marketing and Adult within ~0.003 AUC, and trains fastest in every case. It trails on the large all-numeric Covertype: a depth-6 oblivious tree has only 6 distinct splits, against up to 63 (2^6 − 1) in a full depth-6 XGBoost tree. That is a structural limit, not a tuning gap.

From the tuning sweep: oblivious trees are weak learners, so the conventional lr=0.1 under-converges them. rustwood prefers higher learning rates (0.2–0.3) on large clean datasets; dataset_bench.py tunes the learning rate per library so the comparison is fair.

Synthetic scaling

Training is fastest at every size, and inference is one to two orders of magnitude faster end to end. The iso-accuracy frontier shows rustwood reaching XGBoost-class accuracy in comparable wall-clock time.

Where the GPU time goes

Per-kernel flamegraphs at nanosecond resolution (--profile-out). After the async rewrite, histogram privatization by replication, and histogram subtraction, the histogram build is the sole hotspot.

Compactness

The whole library — GPU kernels, the CPU trainer, and the host driver — is one small Rust codebase. The pure-Rust kernels compile to PTX, so there is no separate CUDA C++ / CPU C++ split to maintain.

library	core training source	lines
XGBoost	`src/` + `include/` (C++/CUDA)	86,660
LightGBM	`src/` + `include/` (C++/CUDA)	63,002
rustwood	`src/` (Rust: GPU kernels + CPU + host)	3,170

rustwood is ~20–27× smaller. It is more focused (oblivious trees, L2 + logistic), so this is not feature-for-feature — but the core histogram-GBDT, on GPU and CPU, fits in 3k lines. (GPU kernels 1,176 · CPU trainer 307 · host/driver 1,687.)

Model I/O — the `.rwood` format

Models serialize to a compact little-endian .rwood binary (header + raw f32/u32 array blits + target-encoder maps). I/O is a direct slice blit, with no parsing.

format (500 trees, depth 6)	file size	save time	load time
XGBoost (json)	3597 KB	36.0 ms	72.0 ms
XGBoost (ubj, binary)	2411 KB	20.0 ms	6.26 ms
LightGBM (txt)	2849 KB	15.2 ms	5.84 ms
rustwood (`.rwood`)	148 KB	0.17 ms	0.044 ms

88–1600× faster to save and load, 16–24× smaller (compact oblivious trees + raw binary). Loaded models predict bit-identically, categorical encoders included.
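The "header + raw array blits" idea can be sketched in a few lines of Python. The layout below is hypothetical: the `DEMO` magic, the header fields, and the array order are assumptions made for illustration, and the real .rwood format (including its target-encoder maps) is not documented here. The sketch assumes each tree stores D u32 feature indices, D f32 thresholds, and 2^D f32 leaf values.

```python
import struct

MAGIC = b"DEMO"                  # hypothetical tag, not the real .rwood magic
HEADER = struct.Struct("<4sII")  # magic, n_trees, depth (little-endian)


def dumps(features, thresholds, leaves, depth):
    """Serialize flat per-tree arrays as header + raw little-endian words."""
    n_trees = len(features) // depth
    assert len(features) == len(thresholds) == n_trees * depth
    assert len(leaves) == n_trees * (1 << depth)
    return b"".join([
        HEADER.pack(MAGIC, n_trees, depth),
        struct.pack(f"<{len(features)}I", *features),      # u32 feature indices
        struct.pack(f"<{len(thresholds)}f", *thresholds),  # f32 thresholds
        struct.pack(f"<{len(leaves)}f", *leaves),          # f32 leaf values
    ])


def loads(blob):
    """Read the arrays back by offset; no text parsing is involved."""
    magic, n_trees, depth = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("not a demo model file")
    n_split = n_trees * depth
    n_leaf = n_trees * (1 << depth)
    off = HEADER.size
    features = list(struct.unpack_from(f"<{n_split}I", blob, off))
    off += 4 * n_split
    thresholds = list(struct.unpack_from(f"<{n_split}f", blob, off))
    off += 4 * n_split
    leaves = list(struct.unpack_from(f"<{n_leaf}f", blob, off))
    return features, thresholds, leaves, depth


# 500 trees at depth 6, filled with placeholder values.
depth, n_trees = 6, 500
features = [i % 20 for i in range(n_trees * depth)]
thresholds = [0.5] * (n_trees * depth)
leaves = [0.25] * (n_trees * (1 << depth))
blob = dumps(features, thresholds, leaves, depth)
print(len(blob) - HEADER.size)  # 500 * (6 + 6 + 64) * 4 = 152000 bytes
```

Under these assumptions the payload is 152,000 bytes, roughly 148 KB, which is in line with the file size reported in the table. That agreement suggests, but does not confirm, a similar per-tree layout in the real format.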

./target/release/rustwood --data /tmp/d --objective l2 --trees 500 --save-model model.rwood
./target/release/rustwood --data /tmp/d --load-model model.rwood     # no training, no GPU

CPU device

--device cpu runs a rayon-parallel host trainer (histogram subtraction, column-major apply, branchless gain-eval, subsample/GOSS) with no CUDA context, so the binary runs on a GPU-free box. It produces a bit-identical model to the GPU path (verified: same R²/AUC for both objectives) and is the fastest CPU oblivious-tree trainer measured. The GPU path is the fast path; the CPU device is for portability and no-GPU environments.

./target/release/rustwood --data /tmp/d --objective l2 --trees 100 --device cpu

How it works

features ──quantize(GPU)──▶ u8 bins ─┐
                                     ▼
  ┌──────────────── per boosting round · one stream, no host sync ───────────────┐
  │ grad/hess ─▶ build histograms ─▶ reduce ─▶ split_gain ─▶ argmax_split ─▶ apply │
  │ (oblivious: same split for every node at a depth)   leaf_hist ─▶ leaf values   │
  └──────────────────────────────────────────────────────────────────────────────┘
                                     ▼
                       device-resident model ──▶ branchless inference
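One level of the loop in the diagram can be sketched on the host in plain Python. This is a sketch under assumptions, not the GPU implementation: the function names are hypothetical, the gain is the standard second-order formula G²/(H + λ) used by XGBoost-style boosters, and "bin <= threshold goes left" is an assumed convention. The key oblivious-tree property is that the gain of each (feature, bin) candidate is summed over every node at the current depth before the argmax.

```python
from itertools import accumulate


def build_histograms(bins, grad, hess, node_of_row, n_nodes, n_bins):
    """hist[node][feature][bin] = [sum_grad, sum_hess] over the rows in that node."""
    n_features = len(bins[0])
    hist = [[[[0.0, 0.0] for _ in range(n_bins)] for _ in range(n_features)]
            for _ in range(n_nodes)]
    for row, g, h, node in zip(bins, grad, hess, node_of_row):
        for f, b in enumerate(row):
            cell = hist[node][f][b]
            cell[0] += g
            cell[1] += h
    return hist


def score(g, h, lam):
    return g * g / (h + lam)


def best_oblivious_split(hist, lam=1.0):
    """Return (gain, feature, bin) maximizing the gain summed across all nodes."""
    n_nodes, n_features, n_bins = len(hist), len(hist[0]), len(hist[0][0])
    best = (float("-inf"), -1, -1)
    for f in range(n_features):
        gains = [0.0] * (n_bins - 1)
        for node in range(n_nodes):
            # Prefix sums over bins give the left-side totals for every threshold.
            g_prefix = list(accumulate(c[0] for c in hist[node][f]))
            h_prefix = list(accumulate(c[1] for c in hist[node][f]))
            g_tot, h_tot = g_prefix[-1], h_prefix[-1]
            for b in range(n_bins - 1):
                gl, hl = g_prefix[b], h_prefix[b]
                gains[b] += (score(gl, hl, lam)
                             + score(g_tot - gl, h_tot - hl, lam)
                             - score(g_tot, h_tot, lam))
        for b, gain in enumerate(gains):
            if gain > best[0]:
                best = (gain, f, b)
    return best


def apply_split(bins, node_of_row, feature, threshold):
    """Every node at this depth uses the same split; the child id appends one bit."""
    return [2 * node + int(row[feature] > threshold)
            for row, node in zip(bins, node_of_row)]


# Four rows, two binned features; only feature 0 separates the gradients.
bins = [[0, 1], [1, 0], [2, 1], [3, 0]]
grad = [-1.0, -1.0, 1.0, 1.0]
hess = [1.0, 1.0, 1.0, 1.0]
nodes = [0, 0, 0, 0]
hist = build_histograms(bins, grad, hess, nodes, n_nodes=1, n_bins=4)
gain, feature, threshold = best_oblivious_split(hist)
nodes = apply_split(bins, nodes, feature, threshold)
print(feature, threshold, nodes)  # 0 1 [0, 0, 1, 1]
```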

Performance engineering, in order of impact:

Fully async loop. Split selection (argmax_split), leaf values, and the chosen (feature, threshold) are written to device buffers; the host only enqueues kernels.
Histogram privatization by replication. Many global accumulators indexed by block, then a parallel reduce, to cut atomic contention (near-optimal on B300).
Histogram subtraction. Build only odd children, derive even = parent − odd (~1.3× faster).
Prefix-scan split evaluation, GPU argmax, preallocated buffers, and pinned + spin-sync inference.
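Two of the techniques above, privatization by replication and histogram subtraction, are easy to show on the host. The sketch below is a serial Python illustration with hypothetical function names; the round-robin replica choice stands in for a GPU block index, and the row sets chosen for the "odd" and "even" children are arbitrary.

```python
def build_hist(bin_col, grad, rows, n_bins):
    """Plain histogram: sum of gradients per bin over the given rows."""
    hist = [0.0] * n_bins
    for r in rows:
        hist[bin_col[r]] += grad[r]
    return hist


def build_hist_replicated(bin_col, grad, rows, n_bins, n_replicas=4):
    """Privatization by replication: each replica takes a share of the updates,
    then the replicas are reduced into one histogram."""
    replicas = [[0.0] * n_bins for _ in range(n_replicas)]
    for i, r in enumerate(rows):
        # i % n_replicas stands in for the block id that would pick an accumulator.
        replicas[i % n_replicas][bin_col[r]] += grad[r]
    return [sum(per_bin) for per_bin in zip(*replicas)]


def subtract(parent, child):
    """Sibling histogram derived without touching the sibling's rows."""
    return [p - c for p, c in zip(parent, child)]


bin_col = [0, 1, 1, 2, 0, 2]
grad = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
parent_rows = range(6)
odd_rows = [0, 2, 3]   # rows routed to the child that is actually built
even_rows = [1, 4, 5]  # rows of the sibling, used here only to check the result

parent = build_hist_replicated(bin_col, grad, parent_rows, n_bins=3)
odd = build_hist(bin_col, grad, odd_rows, n_bins=3)
even = subtract(parent, odd)
print(parent, odd, even)  # [6.0, 5.0, 10.0] [1.0, 3.0, 4.0] [5.0, 2.0, 6.0]
```

Subtraction only needs one pass over the bins, so the derived child costs O(bins) instead of O(rows). With real f32 data the derived histogram can differ from a direct build by rounding error.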

Build & run

cuda-oxide is vendored as a git submodule at external/cuda-oxide, so the repo is self-contained. Requires CUDA 13 and the pinned Rust nightly (auto-selected via rust-toolchain.toml).

git clone --recursive <repo>      # or: git submodule update --init --recursive
./build.sh                        # -> target/release/rustwood   (sm_103 / B300)
ARCH=sm_90 ./build.sh             # target a different GPU
./build.sh --features f64-hist    # opt-in f64 histogram accumulation

# CPU-only build on machines without CUDA
cargo build --release --no-default-features

# generate data and train
python scripts/gen_data.py --out /tmp/d --n 1000000
./target/release/rustwood --data /tmp/d --objective l2 --trees 300 --depth 6 --lr 0.1
./target/release/rustwood --device cpu --data /tmp/d --objective l2 --trees 100  # CPU-only binary

Benchmark harnesses (need xgboost, lightgbm):

python scripts/bench.py          # synthetic scaling vs XGBoost / LightGBM
python scripts/dataset_bench.py  # named datasets (Titanic, Adult, Bank, Credit-G, Covertype)
python scripts/latency_bench.py --rustwood-bin target/release/rustwood   # latency sweep
python scripts/viz.py ...        # plots + flamegraphs

Python API

A thin sklearn-style wrapper (python/) shells out to the binary — GPU and CPU, no FFI. pip install ./python (the binary is found via RUSTWOOD_BIN or the repo build).

from rustwood import RustwoodRegressor, RustwoodClassifier, load

# X, y, and X_test are assumed to be feature matrices / target vectors prepared by the caller.

m = RustwoodRegressor(n_trees=500, depth=6, device="gpu").fit(X, y)   # or device="cpu"
preds = m.predict(X_test)
m.save("model.rwood")
m2 = load("model.rwood")            # predicts on the host (CPU), no GPU needed

clf = RustwoodClassifier(n_trees=300, device="cpu").fit(X, y)
proba = clf.predict_proba(X_test)[:, 1]

Features

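Optional features are off by default unless the table says otherwise (feature importance is always on and histogram subtraction defaults to on); the default path is the fast f32 booster.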

feature	flag	note
Categorical target encoding (out-of-fold)	`--categorical 2,5,…`	biggest accuracy lever on categorical data
Unique-quantile bins	`--unique-bins 1`	each bin a distinct value range
Hashing trick	`--cat-hash-buckets 64`	bound cardinality for huge categoricals
Monotonic constraints	`--monotone 1,0,-1,…`	enforce direction per feature
Stochastic boosting	`--subsample` / `--colsample`	regularize and speed
GOSS	`--goss-top 0.3 --goss-other 0.2`	gradient-based one-side sampling
Hessian clamp (exp.)	`--clamp-hessian 0.01`	winsorize hessian tails
Feature importance	always on	gain% + split counts printed
PGBM intervals	`--pgbm 1`	per-row predictive σ + coverage
Histogram subtraction	`--subtract 1` (default)	~1.3× faster
4-bit bin packing	`--pack4 1` (`--bins ≤16`)	trims bin-read bandwidth
Shared-mem / f16 / int atomic histograms	`--smem-hist` / `--f16-hist` / `--int-hist`	experimental atomic-path variants
f64 accumulation	build `--features f64-hist`	precision at extreme N
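Out-of-fold target encoding, the first row of the table, replaces each category with a target statistic computed only from rows in other folds, so a row never sees its own label. The sketch below is a generic illustration, not rustwood's encoder: the round-robin fold assignment, the `smoothing` prior weight, and the function name are assumptions.

```python
def oof_target_encode(categories, targets, n_folds=5, smoothing=1.0):
    """Encode each row with the smoothed target mean of its category,
    computed from the other folds only."""
    n = len(categories)
    prior = sum(targets) / n
    fold_of = [i % n_folds for i in range(n)]  # deterministic folds for the sketch
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums, counts = {}, {}
        for i in range(n):
            if fold_of[i] != fold:  # statistics come from the training folds
                c = categories[i]
                sums[c] = sums.get(c, 0.0) + targets[i]
                counts[c] = counts.get(c, 0) + 1
        for i in range(n):
            if fold_of[i] == fold:  # held-out rows receive the encoding
                c = categories[i]
                s, k = sums.get(c, 0.0), counts.get(c, 0)
                # Unseen categories (k == 0) fall back to the global prior.
                encoded[i] = (s + smoothing * prior) / (k + smoothing)
    return encoded


categories = ["a", "a", "b", "b"]
targets = [1.0, 0.0, 1.0, 1.0]
print(oof_target_encode(categories, targets, n_folds=2))
# [0.375, 0.875, 0.875, 0.875]
```

Row 0 has target 1.0 but is encoded as 0.375, because its encoding comes from the other "a" row (target 0.0) blended with the global mean of 0.75.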

Objectives: --objective l2 (RMSE/MAE/R²) and --objective logistic (AUC/logloss/accuracy).
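For reference, the gradient and hessian that a second-order booster typically feeds into its histograms for these two objectives are sketched below. These are the textbook formulas (squared error and log-loss on the raw score), not code taken from rustwood; the `lam` regularizer and the Newton-step leaf value are the usual XGBoost-style formulation and are assumptions here.

```python
import math


def l2_grad_hess(pred, y):
    """Squared error 0.5 * (pred - y)^2: gradient is the residual, hessian is 1."""
    return pred - y, 1.0


def sigmoid(x):
    # Numerically stable form that avoids overflow for large negative x.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def logistic_grad_hess(raw_score, y):
    """Log-loss on the raw (pre-sigmoid) score: gradient p - y, hessian p * (1 - p)."""
    p = sigmoid(raw_score)
    return p - y, p * (1.0 - p)


def leaf_value(grads, hesses, lam=1.0, lr=0.1):
    """Newton step for one leaf, shrunk by the learning rate."""
    return -lr * sum(grads) / (sum(hesses) + lam)


print(l2_grad_hess(3.0, 1.0))        # (2.0, 1.0)
print(logistic_grad_hess(0.0, 1.0))  # (-0.5, 0.25)
```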

Project layout

rustwood/
├── src/
│   ├── gpu_kernels.rs   # all #[kernel] functions (pure Rust → PTX)
│   ├── booster.rs       # async GPU loop, device-resident model, inference, --serve worker, .rwood I/O
│   ├── cpu.rs           # rayon CPU trainer (--device cpu), bit-identical to the GPU path
│   ├── data.rs encoding.rs config.rs metrics.rs main.rs
├── python/              # sklearn-style wrapper (GPU + CPU; persistent worker, no FFI)
├── notebooks/           # Colab demo: rustwood vs LightGBM
├── scripts/             # data gen, benchmark harnesses, plotting
├── results/             # generated plots + flamegraphs
├── assets/              # brand identity (logomark + social-preview banner)
├── external/cuda-oxide  # vendored backend (git submodule)
└── build.sh             # build against the vendored cuda-oxide submodule

Acknowledgements & related work

Built on NVlabs cuda-oxide (Rust → PTX). The half-precision atomic that rustwood needed for an experiment (DeviceAtomicF16 → atom.add.noftz.f16) was contributed upstream. Benchmarked against XGBoost and LightGBM (Apache-2.0 / MIT); rustwood's feature set draws on published techniques from those libraries and the CatBoost and PGBM papers.

Limitations. Oblivious trees are less expressive per tree than asymmetric leaf-wise trees, so on large all-numeric problems rustwood trails XGBoost and LightGBM on accuracy (it is faster, so more rounds partly close the gap). PGBM interval calibration is conservative. The f16 and shared-memory atomic histograms are correct but not faster than f32 + replication on B300; they remain as flags for other hardware.

License

Apache-2.0. See Cargo.toml for crate metadata.
