1 change: 1 addition & 0 deletions docs/content/docs/getting-started/meta.json
@@ -2,6 +2,7 @@
"title": "Getting Started",
"pages": [
"installation",
"supported_models",
"quickstart",
"overview",
"development",
34 changes: 34 additions & 0 deletions docs/content/docs/getting-started/supported_models.mdx
@@ -0,0 +1,34 @@
---
title: "Supported Models"
---

SkyRL provides end-to-end training with the FSDP and Megatron training backends (both using vLLM for inference),
as well as a JAX backend (for both training and inference) via a Tinker API server.
The following models are supported for each training backend.

## FSDP
Any model supported by HuggingFace transformers + vLLM is supported by the FSDP backend. The FSDP backend also supports Ulysses-style sequence parallelism,
making it a viable option for training dense models with up to 32B parameters at 32K/64K context lengths.
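As a toy illustration of the Ulysses idea (not SkyRL's implementation): each rank starts with a contiguous slice of the sequence for all attention heads, then an all-to-all exchange leaves each rank with the full sequence for a subset of heads, so attention can be computed locally. The sketch below simulates the exchange with NumPy; `world`, shapes, and names are illustrative only.

```python
import numpy as np

# Toy simulation of Ulysses-style sequence parallelism (illustrative only).
world = 4                 # hypothetical number of ranks
seq, heads, dim = 8, 4, 2 # toy sizes; heads must be divisible by world
x = np.arange(seq * heads * dim, dtype=np.float32).reshape(seq, heads, dim)

# Before attention: each rank holds a contiguous sequence shard, all heads.
seq_shards = np.split(x, world, axis=0)  # world arrays of (seq/world, heads, dim)

# Simulated all-to-all: afterwards each rank holds the full sequence
# for heads/world attention heads.
h = heads // world
head_shards = [
    np.concatenate([s[:, r * h:(r + 1) * h] for s in seq_shards], axis=0)
    for r in range(world)
]

assert head_shards[0].shape == (seq, h, dim)  # full sequence, 1 head per rank
```

Attention then runs per-rank on its head shard; a second all-to-all restores the sequence sharding for the feed-forward layers.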

## Megatron
The [Megatron backend](/docs/examples/megatron) supports 5D parallelism (DP/TP/PP/CP/EP) and allows efficient scaling to large MoE models (100B+).
Mapping from HuggingFace model definitions to Megatron GPTModels is handled by NVIDIA's [Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) library.

The following models are supported in SkyRL's Megatron backend via Megatron-Bridge + vLLM:
- Qwen-3.5 (Dense: 0.8B/2B/4B/9B/27B, MoE: 35B-A3B/122B-A10B/397B-A17B)
- Nemotron-3 (Dense: Nano-4B-BF16, MoE: Nano-30B-A3B-BF16)
- GLM-4.7/GLM-4.7-Flash
- Qwen3 (Dense: 0.6B/1.7B/4B/8B/32B, MoE: 30B-A3B/235B-A22B)
- Moonlight-16B-A3B

Support is pending for Kimi-K2.5, GLM 5/5.1, Gemma 4 models, and Nemotron-3-Super-120B-A10B-BF16.
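When combining several parallelism dimensions, the chosen degrees must be compatible with the GPU count. The helper below is a hypothetical sanity check (not a SkyRL API), under a simplified common Megatron-style layout: the world size factors as DP × TP × PP × CP, and expert parallelism divides the data-parallel dimension.

```python
# Hypothetical sketch (not SkyRL's API): check that parallelism degrees
# are mutually compatible, assuming world_size == dp * tp * pp * cp and
# that expert parallelism (ep) divides the data-parallel degree.
def check_layout(world_size: int, dp: int, tp: int, pp: int, cp: int, ep: int) -> bool:
    if dp * tp * pp * cp != world_size:
        return False
    return dp % ep == 0

# 64 GPUs: dp=8, tp=4, pp=2, cp=1 with ep=8 is valid under these rules.
print(check_layout(64, dp=8, tp=4, pp=2, cp=1, ep=8))  # True
```

Real frameworks impose additional constraints (e.g. TP must divide the head count, EP must divide the expert count), so treat this as a first-pass check only.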

## JAX
The JAX backend supports the following models:
- Qwen-3.5 Dense
- Deepseek-V3
- Llama-3
- Qwen3 Dense/MoE



17 changes: 1 addition & 16 deletions examples/train/nemotron_3/README.md
@@ -4,22 +4,7 @@

This example trains [NVIDIA-Nemotron-3-Nano-4B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16) on GSM8K using GRPO with the Megatron backend.

Nemotron-3-Nano is a hybrid Mamba+Attention+MoE architecture (52 layers, 128 experts, SSM state). It requires specific dependency versions that differ from the default SkyRL configuration.

### Required dependency changes

A patch is provided that makes the necessary `pyproject.toml` changes. It was tested against SkyRL commit [`c69f2488`](https://github.com/NovaSky-AI/SkyRL/commit/c69f24881509f78dfc16d88e2f77392c4795c3ce).

```bash
git apply examples/train/nemotron_3/nemotron_support.patch
uv lock
```

The patch makes two changes:

1. **Enable Mamba dependencies** — Nemotron-3-Nano uses Mamba (SSM) layers. The default config disables `mamba-ssm` and `causal-conv1d` via `override-dependencies`; the patch enables them.

2. **Switch Megatron-Bridge to `nano-v3` branch** — This branch includes `Nemotron3NanoProvider` which handles the HF-to-Megatron model conversion for this hybrid architecture.
Nemotron-3-Nano is a hybrid Mamba+Attention+MoE architecture (52 layers, 128 experts, SSM state).

### Running

23 changes: 0 additions & 23 deletions examples/train/nemotron_3/nemotron_support.patch

This file was deleted.

1 change: 1 addition & 0 deletions pyproject.toml
@@ -125,6 +125,7 @@ megatron = [
"flash-attn==2.8.3; sys_platform == 'linux' and (python_version != '3.12' or platform_machine != 'x86_64')",
"flash-linear-attention; sys_platform == 'linux'",
"causal-conv1d; sys_platform == 'linux'",
"mamba-ssm>=2.3.0; sys_platform == 'linux'",
"vllm==0.19.0; sys_platform == 'linux'",
"vllm-router; sys_platform == 'linux'",
"nixl; sys_platform == 'linux'",
24 changes: 17 additions & 7 deletions tests/backends/skyrl_train/gpu/gpu_ci/conftest.py
@@ -1,3 +1,4 @@
import contextlib
import os
from functools import lru_cache

@@ -45,26 +46,35 @@ def _build_ray_env_vars():
return env_vars


def _ray_init():
def _ray_init(extra_env_vars: dict[str, str] | None = None):
if ray.is_initialized():
ray.shutdown()

# TODO (team): maybe we should use the default config and use prepare_runtime_environment in some way
env_vars = _build_ray_env_vars()
if extra_env_vars:
env_vars.update(extra_env_vars)

logger.info(f"Initializing Ray with environment variables: {env_vars}")
ray.init(runtime_env={"env_vars": env_vars})


@contextlib.contextmanager
def ray_init(extra_env_vars: dict[str, str] | None = None):
_ray_init(extra_env_vars)
try:
yield
finally:
ray.shutdown()


@pytest.fixture
def ray_init_fixture():
_ray_init()
yield
ray.shutdown()
with ray_init():
yield


@pytest.fixture(scope="class")
def class_scoped_ray_init_fixture():
_ray_init()
yield
ray.shutdown()
with ray_init():
yield
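The conftest change above deduplicates the two fixtures by moving setup/teardown into a single `ray_init` context manager, whose `try`/`finally` guarantees shutdown even if a test raises. A toy sketch of the same pattern, with a list recording events in place of `ray.init`/`ray.shutdown` (all names here are illustrative):

```python
import contextlib

events = []  # stands in for real side effects like ray.init/ray.shutdown

@contextlib.contextmanager
def managed(extra_env_vars=None):
    # Setup runs on entry; teardown is guaranteed by the finally block.
    events.append(("init", extra_env_vars or {}))
    try:
        yield
    finally:
        events.append(("shutdown",))

with managed({"LOG_LEVEL": "debug"}):
    events.append(("test",))

# events == [("init", {"LOG_LEVEL": "debug"}), ("test",), ("shutdown",)]
```

Because the context manager is a plain function, both the function-scoped and class-scoped pytest fixtures can wrap it with `with managed(): yield`, as the diff does.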