1 change: 1 addition & 0 deletions docs/content/docs/getting-started/meta.json
@@ -2,6 +2,7 @@
"title": "Getting Started",
"pages": [
"installation",
"supported_models",
"quickstart",
"overview",
"development",
34 changes: 34 additions & 0 deletions docs/content/docs/getting-started/supported_models.mdx
@@ -0,0 +1,34 @@
---
title: "Supported Models"
---

SkyRL provides end-to-end training with the FSDP and Megatron training backends (both using vLLM for inference),
as well as a JAX backend (for both training and inference) via a Tinker API server.
The following models are supported for each training backend.

## FSDP
Any model supported by HuggingFace transformers + vLLM is supported by the FSDP backend. The FSDP backend also supports Ulysses-style sequence parallelism,
making it a viable option for training dense models with up to 32B parameters at 32K/64K context lengths.
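As a toy illustration of the Ulysses idea (not SkyRL's implementation): each rank starts with a contiguous slice of the sequence for all attention heads, then an all-to-all exchange leaves each rank with the full sequence for a subset of heads, so attention can be computed locally. The sketch below simulates the exchange with NumPy; `world`, shapes, and names are illustrative only.

```python
import numpy as np

# Toy simulation of Ulysses-style sequence parallelism (illustrative only).
world = 4                 # hypothetical number of ranks
seq, heads, dim = 8, 4, 2 # toy sizes; heads must be divisible by world
x = np.arange(seq * heads * dim, dtype=np.float32).reshape(seq, heads, dim)

# Before attention: each rank holds a contiguous sequence shard, all heads.
seq_shards = np.split(x, world, axis=0)  # world arrays of (seq/world, heads, dim)

# Simulated all-to-all: afterwards each rank holds the full sequence
# for heads/world attention heads.
h = heads // world
head_shards = [
    np.concatenate([s[:, r * h:(r + 1) * h] for s in seq_shards], axis=0)
    for r in range(world)
]

assert head_shards[0].shape == (seq, h, dim)  # full sequence, 1 head per rank
```

Attention then runs per-rank on its head shard; a second all-to-all restores the sequence sharding for the feed-forward layers.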

## Megatron
The [Megatron backend](/docs/examples/megatron) supports 5D parallelism (DP/TP/PP/CP/EP) and allows efficient scaling to large MoE models (100B+).
Mapping from HuggingFace model definitions to Megatron GPTModels is handled by NVIDIA's [Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) library.

The following models are supported in SkyRL's Megatron backend via Megatron-Bridge + vLLM:
- Qwen-3.5 (Dense: 0.8B/2B/4B/9B/27B, MoE: 35B-A3B/122B-A10B/397B-A17B)
- Nemotron-3 (Dense: Nano-4B-BF16, MoE: Nano-30B-A3B-BF16)
- GLM-4.7/GLM-4.7-Flash
- Qwen3 (Dense: 0.6B/1.7B/4B/8B/32B, MoE: 30B-A3B/235B-A22B)
- Moonlight-16B-A3B

Support is pending for Kimi-K2.5, GLM 5/5.1, Gemma 4 models, and Nemotron-3-Super-120B-A10B-BF16.
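When combining several parallelism dimensions, the chosen degrees must be compatible with the GPU count. The helper below is a hypothetical sanity check (not a SkyRL API), under a simplified common Megatron-style layout: the world size factors as DP × TP × PP × CP, and expert parallelism divides the data-parallel dimension.

```python
# Hypothetical sketch (not SkyRL's API): check that parallelism degrees
# are mutually compatible, assuming world_size == dp * tp * pp * cp and
# that expert parallelism (ep) divides the data-parallel degree.
def check_layout(world_size: int, dp: int, tp: int, pp: int, cp: int, ep: int) -> bool:
    if dp * tp * pp * cp != world_size:
        return False
    return dp % ep == 0

# 64 GPUs: dp=8, tp=4, pp=2, cp=1 with ep=8 is valid under these rules.
print(check_layout(64, dp=8, tp=4, pp=2, cp=1, ep=8))  # True
```

Real frameworks impose additional constraints (e.g. TP must divide the head count, EP must divide the expert count), so treat this as a first-pass check only.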

## JAX
The JAX backend supports the following models:
- Qwen-3.5 Dense
- Deepseek-V3
- Llama-3
- Qwen3 Dense/MoE



17 changes: 1 addition & 16 deletions examples/train/nemotron_3/README.md
@@ -4,22 +4,7 @@

This example trains [NVIDIA-Nemotron-3-Nano-4B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16) on GSM8K using GRPO with the Megatron backend.

Nemotron-3-Nano is a hybrid Mamba+Attention+MoE architecture (52 layers, 128 experts, SSM state). It requires specific dependency versions that differ from the default SkyRL configuration.

### Required dependency changes

A patch is provided that makes the necessary `pyproject.toml` changes. It was tested against SkyRL commit [`c69f2488`](https://github.com/NovaSky-AI/SkyRL/commit/c69f24881509f78dfc16d88e2f77392c4795c3ce).

```bash
git apply examples/train/nemotron_3/nemotron_support.patch
uv lock
```

The patch makes two changes:

1. **Enable Mamba dependencies** — Nemotron-3-Nano uses Mamba (SSM) layers. The default config disables `mamba-ssm` and `causal-conv1d` via `override-dependencies`; the patch enables them.

2. **Switch Megatron-Bridge to `nano-v3` branch** — This branch includes `Nemotron3NanoProvider` which handles the HF-to-Megatron model conversion for this hybrid architecture.
Nemotron-3-Nano is a hybrid Mamba+Attention+MoE architecture (52 layers, 128 experts, SSM state).

### Running

23 changes: 0 additions & 23 deletions examples/train/nemotron_3/nemotron_support.patch

This file was deleted.

1 change: 1 addition & 0 deletions pyproject.toml
@@ -125,6 +125,7 @@ megatron = [
"flash-attn==2.8.3; sys_platform == 'linux' and (python_version != '3.12' or platform_machine != 'x86_64')",
"flash-linear-attention; sys_platform == 'linux'",
"causal-conv1d; sys_platform == 'linux'",
"mamba-ssm>=2.3.0; sys_platform == 'linux'",
"vllm==0.19.0; sys_platform == 'linux'",
"vllm-router; sys_platform == 'linux'",
"nixl; sys_platform == 'linux'",
24 changes: 17 additions & 7 deletions tests/backends/skyrl_train/gpu/gpu_ci/conftest.py
@@ -1,3 +1,4 @@
import contextlib
import os
from functools import lru_cache

@@ -45,26 +46,35 @@ def _build_ray_env_vars():
return env_vars


def _ray_init():
def _ray_init(extra_env_vars: dict[str, str] | None = None):
if ray.is_initialized():
ray.shutdown()

# TODO (team): maybe we should use the default config and use prepare_runtime_environment in some way
env_vars = _build_ray_env_vars()
if extra_env_vars:
env_vars.update(extra_env_vars)

logger.info(f"Initializing Ray with environment variables: {env_vars}")
ray.init(runtime_env={"env_vars": env_vars})


@contextlib.contextmanager
def ray_init(extra_env_vars: dict[str, str] | None = None):
_ray_init(extra_env_vars)
try:
yield
finally:
ray.shutdown()


@pytest.fixture
def ray_init_fixture():
_ray_init()
yield
ray.shutdown()
with ray_init():
yield


@pytest.fixture(scope="class")
def class_scoped_ray_init_fixture():
_ray_init()
yield
ray.shutdown()
with ray_init():
yield
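The conftest change above deduplicates the two fixtures by moving setup/teardown into a single `ray_init` context manager, whose `try`/`finally` guarantees shutdown even if a test raises. A toy sketch of the same pattern, with a list recording events in place of `ray.init`/`ray.shutdown` (all names here are illustrative):

```python
import contextlib

events = []  # stands in for real side effects like ray.init/ray.shutdown

@contextlib.contextmanager
def managed(extra_env_vars=None):
    # Setup runs on entry; teardown is guaranteed by the finally block.
    events.append(("init", extra_env_vars or {}))
    try:
        yield
    finally:
        events.append(("shutdown",))

with managed({"LOG_LEVEL": "debug"}):
    events.append(("test",))

# events == [("init", {"LOG_LEVEL": "debug"}), ("test",), ("shutdown",)]
```

Because the context manager is a plain function, both the function-scoped and class-scoped pytest fixtures can wrap it with `with managed(): yield`, as the diff does.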