Fleet/all#1410

Open
dzorlu wants to merge 80 commits into NovaSky-AI:main from fleet-ai:fleet/all

Conversation


@dzorlu commented Mar 30, 2026

All


Deniz and others added 30 commits March 28, 2026 14:38
Add fleet_task environment that integrates Fleet-hosted tasks with SkyRL
via OpenEnv's FleetTaskEnv abstraction layer. Supports multi-turn
tool-use and computer-use (multimodal) modalities.

- FleetTaskEnv(BaseTextEnv): provisions Fleet env, multi-turn episodes,
  reward via verifier, partial reward support, hint augmentation
- Tool call parser: handles <tool_call>/<function_call> tag formats with
  JSON repair for missing closing braces
- Multimodal observations: returns image_url content blocks for CUA,
  compatible with upstream's extract_images_from_conversation()
- Per-env metrics aggregation with environment breakdown
- Context management integration for long trajectories
- Trace upload support for eval telemetry
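The multimodal observation shape described above can be sketched as follows. This is a hedged illustration: the content-block keys follow the common OpenAI-style convention, and the helper is a minimal stand-in for upstream's extract_images_from_conversation(), not its actual implementation.

```python
# Illustrative shape of a CUA observation message with image_url content
# blocks; the exact keys FleetTaskEnv emits are assumed, not verified.
obs_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Current screen:"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KGgo..."}},
    ],
}

def extract_images_from_conversation(messages):
    # Minimal stand-in for the upstream helper of the same name: collect
    # every image_url block from list-format message content.
    return [
        block["image_url"]["url"]
        for msg in messages
        if isinstance(msg.get("content"), list)
        for block in msg["content"]
        if block.get("type") == "image_url"
    ]
```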

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…onfigs

Port Fleet-specific training infrastructure from fork to fresh SkyRL-v2:

Entrypoints:
- main_fleet.py: GRPO training on Fleet-hosted envs with S3 checkpoints
- main_task_gen.py: Task generation training entrypoint
- main_fleet_tinker.py: Tinker-based training with Fleet envs (LoRA, async)

Dataset & Checkpoints:
- prepare_dataset.py: Convert Fleet task JSON to SkyRL parquet format
  (stratified split, dedup, env capping, difficulty filtering)
- s3_checkpoints.py: Async S3 upload, cross-VM resume, local cleanup
- export_tasks.py: CLI to export tasks from Fleet API

Training Scripts:
- fleet-common-setup.sh: Shared setup (deps, OpenEnv, dataset download)
- fleet-common-run.sh: Multi-node Ray cluster + training launch
- fleet-35b-run.sh: Qwen3.5-35B config (TP=2, multi-node)
- fleet-qwen35-extra-setup.sh: Qwen3.5 deps (transformers 5.3, flash-attn)
- fleet-task-gen-run.sh: Task generation config

SkyPilot YAML Configs:
- openenv-fleet-grpo-qwen3_5-35b.yaml: 2-node H200 training
- task-gen-grpo-qwen3_5-9b.yaml: Single-node task gen

Also adds fleet_task and task_gen config to skyrl_gym_config/default.yaml.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Port the task generation environment from fleet-ai/SkyRL that enables
RL-based training of task-generating models. The environment supports
multi-turn task generation where the model generates (prompt, verifier)
pairs that are evaluated via Fleet harness rollouts.

Key components:
- TaskGenEnv(BaseTextEnv): Multi-turn env with tool-based DB exploration,
  task generation, and reward computation via variance + hint gap
- VerifierSandbox: AST-based static analysis for generated verifier code
  safety (blocked imports/builtins, complexity bounds, signature checks)
- Tool call parser: Handles <tool_call>/<function_call> tag formats

Reward formula: R = gate * (base_quality + alpha * var(raw_scores) + hint_gap)
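The reward formula above can be sketched in a few lines. Function and argument names are illustrative (not the actual integrations.fleet.task_gen_reward API), and using population variance for var(raw_scores) is an assumption.

```python
import statistics

def task_gen_reward(gate, base_quality, raw_scores, hint_gap, alpha=1.0):
    # R = gate * (base_quality + alpha * var(raw_scores) + hint_gap)
    # Population variance assumed; the real reward code may differ.
    return gate * (base_quality + alpha * statistics.pvariance(raw_scores) + hint_gap)
```

A half-pass/half-fail rollout group maximizes the variance term, which is what rewards tasks of intermediate difficulty.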

Depends on PR #2 (fleet/training) for integrations.fleet.task_gen_reward.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When all raw rollout samples for a prompt score 0, hint augmentation
generates additional rollouts with verifier feedback injected into the
prompt. This rescues GRPO signal for otherwise dead prompts.

Key components:
- _run_hint_augmentation() in SkyRLGymGenerator: groups outputs by
  instance_id, identifies failing prompts, builds hint text from
  verifier ERROR/SUCCESS_ACCUMULATOR, launches hinted rollouts
- RLTF-SD: replaces hinted prompt_ids with original unhinted prompt_ids
  so the model learns to produce hint-quality outputs from the original
  prompt alone (grad log pi(y_hint | x_0) not grad log pi(y_hint | x_0 + hint))
- First-turn baseline in compute_grpo_outcome_advantage: when is_hinted
  is present, computes group mean/std from raw samples only, preventing
  hinted samples from contaminating the GRPO baseline
- Metrics: hint/total_hinted_rollouts, hint/hint_success_rate,
  hint/prompts_hinted, hint/signal_rescued
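The first-turn baseline described above can be sketched in numpy. This is illustrative only; the real logic lives in compute_grpo_outcome_advantage and handles token-level masking.

```python
import numpy as np

def grpo_advantages(rewards, is_hinted, eps=1e-6):
    # Group mean/std come from raw (unhinted) samples only, so hinted
    # rollouts cannot contaminate the GRPO baseline. All rewards (raw and
    # hinted) are then normalized against that raw baseline.
    rewards = np.asarray(rewards, dtype=np.float64)
    raw = rewards[~np.asarray(is_hinted, dtype=bool)]
    return (rewards - raw.mean()) / (raw.std() + eps)
```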

Config: enable_hints, hint_reward_threshold, n_hint_samples in
fleet_task section of skyrl_gym_config.

Only runs during training (not eval), only for non-step-wise
trajectories, and only when fleet_task.enable_hints=true.

Depends on PR #1 (fleet/task-env) for FleetTaskEnv.build_hint_text().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Port vision-language model support from SkyRL v1 (feat/vl-support-clean)
to SkyRL-v2's architecture:

- Generator: VL-aware chat template, image accumulation across turns,
  multi_modal_data construction for vLLM
- Engine pipeline: thread multi_modal_data through preprocess/generate
  in both sync and async vLLM engines
- Fleet env: Qwen coordinate adaptation ([0,1000] <-> pixel), initial
  screenshot capture, computer_use browser hints, done signal detection
- Utilities: image extraction, base64 decode, processor loading,
  VL chat template with proper vision token expansion
- New VL run script and SkyPilot YAML for CUA training
- Update existing YAMLs to use fleet/all branch

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RunPod/Lambda/Nebius/Vast were all out of H200 capacity.
Add GCP spot with proper NVIDIA 570 driver image.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SkyRL-v2 pyproject.toml defines 'fsdp' extra (includes vllm, flash-attn,
torch, flashinfer) but not a standalone 'vllm' extra. The old SkyRL had
'vllm' as a separate extra.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
uv pip install silently fails to build causal-conv1d CUDA extension
(reports "Checked 1 package" but module is not importable). Use pip
with --no-build-isolation to ensure it finds torch from the venv.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
In SkyRL-v2, scripts/ is directly under repo root (not nested under
skyrl-train/). Changed cd from "../.." to ".." so the run scripts
correctly resolve the repo root directory.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TrainerConfig: loss_chunk_size, use_hybrid_env_sampling, min_samples_per_env
GeneratorConfig: inject_context_status, context_warning_threshold, trajectory_timeout_seconds

SkyRL-v2's strict Hydra config rejects unknown keys (no + prefix),
so these must be defined in the dataclass and YAML defaults.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The fleet entrypoints use @hydra.main which loads the legacy YAML
directly, but validate_cfg expects generator.inference_engine.*
(the new structured format). Apply translate_legacy_config to
convert flat generator.* keys before validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The legacy YAML has flat generator.* keys (e.g. generator.backend)
but validate_cfg expects generator.inference_engine.* with all fields
including distributed_executor_backend. Add the full inference_engine
section with defaults so all fields are present after Hydra loads the
config and translate_legacy_config moves CLI overrides into it.
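The translation step described above can be sketched as a key-moving pass. This is an illustrative sketch, not SkyRL's actual translate_legacy_config; the key names shown are examples of flat generator.* fields.

```python
def translate_flat_generator_keys(cfg):
    # Move flat generator.* keys (e.g. generator.backend) under
    # generator.inference_engine.* so validate_cfg sees the new
    # structured format. Key list is illustrative.
    gen = dict(cfg.get("generator", {}))
    engine = dict(gen.pop("inference_engine", {}))
    for key in ("backend", "tensor_parallel_size", "gpu_memory_utilization"):
        if key in gen:
            engine.setdefault(key, gen.pop(key))
    gen["inference_engine"] = engine
    return {**cfg, "generator": gen}
```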

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove explicit fleet_task register() call from main_fleet.py since
  skyrl_gym.envs.__init__ already auto-registers it
- Remove --data-dir-name task_gen from task-gen run script so it uses
  the default MODALITY-based path (matching setup's download path)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace OmegaConf.create() approach (loses dataclass type info) with
  in-place sync of flat generator.* CLI overrides into the structured
  generator.inference_engine section. This preserves the Hydra DictConfig
  and avoids TypeError on dataclasses.asdict().
- Remove --skip-prepare from task-gen YAML so parquet files are generated
- Remove duplicate fleet_task registration (auto-registered by __init__)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hydra entrypoints pass DictConfig (not dataclass instances), so
dataclasses.asdict() fails. Fall back to OmegaConf.to_yaml() for
DictConfig objects.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hydra's @hydra.main produces DictConfig objects, but the codebase expects
typed dataclass instances (asdict(), attribute access, etc.). Switch Fleet
entrypoints to use SkyRLTrainConfig.from_cli_overrides() which produces
proper typed dataclasses via the legacy config translation path.

- Add fleet_task/task_gen as Optional[Dict] fields on SkyRLGymConfig
- Strip ++/+ Hydra prefixes from CLI args before from_cli_overrides
- Remove _sync_legacy_generator_to_inference_engine (legacy path handles it)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
accelerate 1.12.0 passes param.__dict__ (which includes transformers 5.3.0's
_is_hf_initialized flag) to Parameter.__new__() during init_empty_weights.
PyTorch 2.10.0 rejects this unknown kwarg. Newer accelerate filters it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
uv pip install -U accelerate pulls newer torch with CUDA 13.0, breaking
torchvision (CUDA 12.8). Use pip install --no-deps instead to upgrade
only accelerate without re-resolving transitive dependencies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
accelerate's init_empty_weights passes param.__dict__ to Parameter()
which includes _is_hf_initialized (set by transformers 5.x). torch 2.10
rejects this unknown kwarg. Patch Parameter.__new__ in fsdp_utils.py
to filter it out. Revert accelerate upgrade attempt (latest is 1.13.0,
still has the same issue).
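The patch pattern can be sketched as follows. To stay self-contained this uses a stand-in class rather than torch.nn.Parameter; the real patch in fsdp_utils.py wraps torch.nn.Parameter.__new__ the same way.

```python
class FakeParameter:
    # Stand-in for torch.nn.Parameter, whose __new__ rejects unknown kwargs
    # such as transformers 5.x's _is_hf_initialized flag.
    def __new__(cls, data=None, requires_grad=True):
        obj = super().__new__(cls)
        obj.data, obj.requires_grad = data, requires_grad
        return obj

_orig_new = FakeParameter.__new__

def _patched_new(cls, data=None, requires_grad=True, **kwargs):
    kwargs.pop("_is_hf_initialized", None)  # drop the flag torch 2.10 rejects
    if kwargs:  # preserve the original strictness for anything else
        raise TypeError(f"unexpected kwargs: {sorted(kwargs)}")
    return _orig_new(cls, data, requires_grad)

FakeParameter.__new__ = _patched_new
```

Without the patch, accelerate's init_empty_weights passing the extra flag through param.__dict__ would raise TypeError at construction time.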

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- config.py: Always use legacy config path in from_cli_overrides to ensure
  flat keys (generator.backend etc.) are properly translated via
  translate_legacy_config. Fixes VL/35B ValueError on GeneratorConfig.

- prepare_dataset.py: Add --env-class CLI arg (fleet_task|task_gen) to set
  per-record env_class in parquet data. Previously hardcoded to fleet_task,
  causing task_gen training to create FleetTaskEnv (requires tasks_file).

- fleet-common-setup.sh: Accept --env-class and pass to prepare_dataset.

- task-gen YAML: Pass --env-class task_gen in setup block.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- task_gen_env.py: Default ROLLOUT_DIR to ~/rollouts instead of /workspace/rollouts.
  /workspace doesn't exist on GCP (only RunPod), causing PermissionError.

- config.py: Disable OmegaConf struct flag on base config before merging
  CLI overrides. Empty dicts in YAML (like chat_template_kwargs: {}) are
  loaded as closed structs, rejecting new keys during merge.

- config.py: Add try/except around asdict() in get_config_as_yaml_str
  to handle edge cases where asdict fails on Ray-serialized dataclasses.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… logging

FLEET_API_KEY was not being propagated to Ray workers via runtime_env,
causing task_gen's import_single_task to fail with empty API key.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The dataset prepare step stores the environment name as 'data_source'
column, but TaskGenEnv.__init__ only looked for 'env_key'. This caused
all import_single_task calls to use env_id='unknown', which fails with
"Environment 'unknown' not found" from Fleet API.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
expandable_segments:True in PYTORCH_CUDA_ALLOC_CONF is incompatible
with vLLM's CuMemAllocator, causing AssertionError during model load.
The 9B script already had this flag; the 35B was missing it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fleet env returns list-format content (from OpenEnv multimodal observations)
that text-only templates like Qwen3.5-35B-A3B can't handle. This converts
list content (strings or image_url dicts) to plain text before applying the
chat template, preventing jinja2 TemplateError on non-VL models.
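The conversion described above can be sketched as follows. This is a hedged illustration: the "[image]" placeholder and exact block keys are assumptions, not the actual implementation.

```python
def flatten_list_content(messages):
    # Collapse list-format content (text and image_url blocks) into a plain
    # string so text-only chat templates don't raise a jinja2 TemplateError.
    out = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            parts = []
            for block in content:
                if block.get("type") == "text":
                    parts.append(block.get("text", ""))
                elif block.get("type") == "image_url":
                    parts.append("[image]")  # placeholder text is an assumption
            content = "\n".join(parts)
        out.append({**msg, "content": content})
    return out
```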

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hint augmentation extends trajectory_ids in generator_input in-place,
but the separate uids variable in the trainer was never updated. This
caused IndexError in postprocess_generator_output when uids had fewer
entries than rewards (128 raw + N hinted rewards vs 128 uids).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a Fleet environment fails to provision (e.g., list_tools timeout),
return a zero-reward trajectory instead of propagating the exception
through tqdm.gather and crashing the entire training step. This makes
training resilient to transient Fleet API / MCP failures.
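The resilience pattern described above can be sketched as a wrapper around the episode coroutine; names and the zero-trajectory shape are illustrative.

```python
import asyncio

async def safe_rollout(run_episode, zero_trajectory):
    # A transient provisioning failure yields a zero-reward trajectory
    # instead of an exception that would propagate through the gather
    # and crash the whole training step.
    try:
        return await run_episode()
    except Exception as exc:
        print(f"Fleet env failed, substituting zero-reward trajectory: {exc}")
        return zero_trajectory
```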

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Deniz and others added 21 commits March 29, 2026 18:03
The generator was calling env.step()/close()/init() via
_run_in_executor_if_available() which runs sync methods in a thread pool.
The sync methods use asyncio.run() creating NEW event loops, breaking MCP
transports bound to the main event loop (→ TCPTransport closed, 100%
verifier failure rate).

Add _env_init/_env_step/_env_close that prefer async variants when
available (init_async/step_async/close_async), keeping everything in the
main event loop. Falls back to executor for envs without async methods.
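The dispatch described above can be sketched as a single helper; the method-naming convention (step/step_async) follows the commit message, the rest is illustrative.

```python
import asyncio

async def call_env(env, name, *args):
    # Prefer an async variant (e.g. step_async) so MCP transports stay
    # bound to the main event loop; fall back to running the sync method
    # in the default executor for envs without async methods.
    async_fn = getattr(env, f"{name}_async", None)
    if async_fn is not None:
        return await async_fn(*args)
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, getattr(env, name), *args)
```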

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code changes (same as fleet/training):
- Port loss_chunk_size to HFModelWrapper: identity lm_head trick returns
  hidden states (B,S,8192) instead of logits (B,S,131072), then computes
  logits in 4096-token chunks with gradient checkpointing
- Pass loss_chunk_size from fsdp_worker.py for policy and ref model
- Revert flash_attn back to false (GatedDeltaNet Xid 31 in FSDP2)
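The chunked lm_head trick can be sketched in numpy. The real implementation uses torch with gradient checkpointing; the shapes and 4096-token chunking follow the description above, everything else is illustrative.

```python
import numpy as np

def chunked_token_logprobs(hidden, lm_head_w, targets, chunk=4096):
    # hidden: (S, H) hidden states returned by the identity-lm_head model;
    # lm_head_w: (V, H) lm_head weight. Per-token target logprobs are
    # computed chunk-by-chunk so the full (S, V) logits tensor never
    # materializes at once.
    S = hidden.shape[0]
    out = np.empty(S)
    for start in range(0, S, chunk):
        h = hidden[start:start + chunk]               # (c, H)
        logits = h @ lm_head_w.T                      # (c, V): one chunk alive
        logits -= logits.max(axis=-1, keepdims=True)  # stabilize softmax
        log_z = np.log(np.exp(logits).sum(axis=-1))   # (c,)
        idx = targets[start:start + chunk]
        out[start:start + chunk] = logits[np.arange(len(idx)), idx] - log_z
    return out
```

With H=8192 and V=131072 this cuts peak activation memory for the head by roughly V/H per chunk, which is the point of the trick at 97K sequence length.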

Docs:
- Rewrite CHANGELOG.md as clean coherent document with all 6 fixes
- Update CLAUDE.md to match

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SkyRL-v2's apply_overlong_filtering takes (loss_masks, stop_reasons)
not (loss_masks, response_ids, eos_token_id). Updated the call in
prepare_training_data() to match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
eval_n_samples_per_prompt 3→1: 57 trajectories instead of 171, ~3x faster eval.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
k=3 sampling needed for pass@3 metric. Use MAX_TASKS env var to
reduce eval set size instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
flash_attn=false OOMs at 97K seq len even with chunked lm_head. The old
fork uses flash_attn=true + chunked lm_head — matching that config.

Updated CHANGELOG fix #5 and CLAUDE.md: earlier Xid 31 was from memory
exhaustion (missing chunked lm_head), not GatedDeltaNet incompatibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Guard against Tinker returning logprobs with different length than
   tokens in rollout collection (truncate/pad to match).
2. Safety check in prepare_training_data ensuring all arrays match
   target_tokens length before creating Datum.
3. These prevent the "target_tokens and logprobs must have the same
   shape" error from Tinker's forward_backward API.
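The truncate/pad guard in item 1 can be sketched as follows; the pad value and function name are illustrative.

```python
def align_logprobs(tokens, logprobs, pad=0.0):
    # Truncate or pad logprobs so their length matches tokens before
    # constructing a Datum for Tinker's forward_backward API.
    if len(logprobs) > len(tokens):
        return list(logprobs)[:len(tokens)]
    return list(logprobs) + [pad] * (len(tokens) - len(logprobs))
```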

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rrors

Retries describe_db/query_db up to 3 times on connection-related errors
(TCPTransport closed, connection reset). Defense-in-depth alongside the
Fleet SDK keepalive_expiry fix (fleet-sdk PR NovaSky-AI#85).
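The retry wrapper described above can be sketched as follows. Matching connection errors by message substring and the backoff schedule are illustrative assumptions, not the actual implementation.

```python
import time

def with_retries(fn, attempts=3, retryable=("TCPTransport closed", "connection reset")):
    # Retry fn up to `attempts` times on connection-related errors only;
    # any other exception propagates immediately.
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == attempts - 1 or not any(s in str(exc) for s in retryable):
                raise
            time.sleep(0.1 * 2 ** attempt)  # brief exponential backoff
```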

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
vLLM 0.18.0 CuMemAllocator conflicts with expandable_segments. Without it,
memory fragmentation causes OOM on 35B. Pin 0.17.0 (cudaMalloc, no conflict).

Consolidated CHANGELOG: 5 fixes (merged old #4/#5 into single vLLM pin fix).
Updated CLAUDE.md to match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The upstream vLLM 0.18 bump (d00b17e) removed backward-compat shims
and added 0.18-only APIs (OpenAIModelRegistry, OpenAIServingRender).
Since fleet-35b-run.sh pins vLLM 0.17.0 (for expandable_segments
compatibility), revert vllm_engine.py to the version with 0.17.0 support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reverts vLLM 0.17.0 pin and vllm_engine.py pre-0.18 revert. Instead:
- MAX_INPUT_LENGTH 96000→72000 to reduce memory pressure
- --no-pytorch-alloc-conf (disables expandable_segments for 0.18.0 compat)
- flash_attn=true + chunked lm_head + empty_cache at 72K

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. parse_tool_calls() returns ALL <tool_call> tags instead of just the
   first one. The model often batches multiple tool calls in one generation
   (73% of trajectories). Previously only the first was executed, the rest
   silently dropped.

2. Remove the "must explore before generating task" gate. With TCP errors
   on describe_db, this gate rejected 62-78% of generated tasks. The model
   should be free to generate tasks at any point.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…0.18.0

flash_attn=true + vLLM 0.18.0 triggers Xid 31 FAULT_PDE in GatedDeltaNet
during ref forward at both 97K and 72K — not a memory issue but a CUDA
memory mapping corruption from vLLM's CuMemAllocator. Trying SDPA at 72K.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backward pass verified working on sky-4da1-deniz. Re-enable step-0
eval for production training.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ep timing

Corrected fix #4 to reflect the final working config:
- flash_attn=false (SDPA), not flash_attn=true
- flash_attn=true causes Xid 31 at both 97K AND 72K (CuMemAllocator issue)
- Added "Verified working" note: ref forward 8.4 min, backward 45.6 min

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
12 hours on GCP spot 2×H200:8 with zero GPU errors.
Avg step time ~70 min, checkpoint saved to S3 at step 10.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@gemini-code-assist bot left a comment


Code Review

This pull request implements Fleet-specific optimizations for multi-node FSDP2 training, featuring memory-efficient chunked forward passes, synchronous model offloading with barriers, and support for multimodal Vision-Language tasks. It introduces a new task generation environment with database exploration tools and provides comprehensive S3 checkpoint management. The review feedback highlights a documentation inconsistency regarding flash attention usage, identifies deprecated Pydantic configuration patterns, warns against blocking calls within asynchronous functions, and suggests refactoring duplicated tool parsing logic into a shared utility.

Comment thread CLAUDE.md

3. **CUDA memory management for 35B**: `torch.cuda.empty_cache()` before backward pass in `worker.py` (policy + critic). Prevents OOM from fragmentation.

4. **Reduced sequence length (72K) for 35B**: `fleet-35b-run.sh` uses `MAX_INPUT_LENGTH=72000` (down from 96000) with `--no-pytorch-alloc-conf` (disables `expandable_segments` which conflicts with vLLM 0.18.0's `CuMemAllocator`). At 97K, SDPA OOM'd and flash_attn hit Xid 31 in GatedDeltaNet. At 72K, flash_attn=true + chunked lm_head + empty_cache fits without expandable_segments.

medium

There appears to be a documentation inconsistency. This line states that flash_attn=true is part of the working configuration at 72K sequence length. However, integrations/fleet/CHANGELOG.md (line 60) and scripts/fleet-35b-run.sh (line 43) specify that flash_attn=false is used. Please update this documentation for consistency.

Comment on lines +123 to +124
class Config:
    frozen = True

medium

The inner Config class for Pydantic model configuration is deprecated in Pydantic v2 and will be removed in v3. Configuration should be passed as keyword arguments directly to the BaseModel class definition. You should remove this Config class and change line 102 to class RolloutOutput(BaseModel, frozen=True):.

Comment on lines +770 to +771
sampling_path = training_client.save_weights_for_sampler(name=f"step_{step:06d}").result().path
sampling_client = service_client.create_sampling_client(model_path=sampling_path)

medium

The use of .result() here is a blocking call inside an async function, which can hinder performance by blocking the event loop. If the tinker SDK provides an asynchronous way to get the result (e.g., an await-able method), it should be used instead to maintain the benefits of an async architecture. This also applies to the .result() calls on lines 877-878.

Comment on lines +1 to +83
"""
Tool call parser for task generation environment.

Parses <tool_call> and <function_call> tagged JSON from LLM responses.
Copied from skyrl-train/integrations/fleet/env.py to avoid cross-package imports.
"""

import json
import re
from typing import Any, Dict, List, Optional


def _try_parse_json(raw: str) -> Optional[Dict[str, Any]]:
    """Try to parse JSON, repairing missing trailing braces if needed."""
    raw = raw.strip()
    try:
        parsed = json.loads(raw)
        if isinstance(parsed, dict):
            return parsed
    except (json.JSONDecodeError, ValueError):
        pass

    # Repair: models often drop trailing closing braces on nested JSON.
    # Try appending up to 3 closing braces.
    for extra in range(1, 4):
        try:
            parsed = json.loads(raw + "}" * extra)
            if isinstance(parsed, dict):
                return parsed
        except (json.JSONDecodeError, ValueError):
            continue

    return None


def _parse_one(match_text: str) -> Optional[Dict[str, Any]]:
    """Parse a single tool call from matched text."""
    parsed = _try_parse_json(match_text)
    if parsed is None:
        return None
    name = parsed.get("name") or parsed.get("tool")
    args = parsed.get("arguments") or parsed.get("params", {})
    if name:
        return {"name": name, "arguments": args}
    return None


def parse_tool_call(action: str) -> Optional[Dict[str, Any]]:
    """Parse the first tool call from LLM response. Returns None if not found."""
    calls = parse_tool_calls(action)
    return calls[0] if calls else None


def parse_tool_calls(action: str) -> List[Dict[str, Any]]:
    """Parse all tool calls from LLM response.

    Supports tag-based formats:
    - <tool_call>{"name": "...", "arguments": {...}}</tool_call>
    - <function_call>{"name": "...", "arguments": {...}}</function_call>

    Also handles cases where the closing tag is missing (e.g., when </tool_call>
    is used as the stop string and not included in the output).

    Returns list of dicts with "name" and "arguments" keys.
    """
    results: List[Dict[str, Any]] = []

    for tag in ["tool_call", "function_call"]:
        # Find all with closing tag
        for match in re.finditer(rf"<{tag}>(.*?)</{tag}>", action, re.DOTALL):
            parsed = _parse_one(match.group(1))
            if parsed:
                results.append(parsed)

        # If none found with closing tags, try without (stop string case)
        if not results:
            match = re.search(rf"<{tag}>(.*?)(?:<\||\Z)", action, re.DOTALL)
            if match:
                parsed = _parse_one(match.group(1))
                if parsed:
                    results.append(parsed)

    return results

medium

This file is nearly identical to skyrl-gym/skyrl_gym/envs/fleet_task/tool_call_parser.py. Duplicating this utility makes the code harder to maintain, as bug fixes or improvements would need to be applied in multiple places. Consider moving this parsing logic to a shared utility module within skyrl_gym to avoid duplication.


@devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.


Deniz and others added 2 commits March 30, 2026 15:45
Add enable_hints flag (default False) gated by env_config.
Previously hints were always ON during training (controlled by is_eval).
Now hints only run when enable_hints=True AND not in eval mode.

Hints were net negative in iter#11 — verifier code dump confused evaluator.
Reward now uses raw variance only: R = base_quality + judge_gate * alpha * var(raw).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add tool_call_reward_per_call config (default 0.0, set to 0.02 in run script)
  Rewards each successful meta-tool call (describe_db, query_db) to incentivize
  multi-turn DB exploration instead of single-turn guessing from system prompt.

- MAX_INPUT_LENGTH 30720 → 65536: baseline runs showed 30K forced single-turn
  convergence by step 5 (describe_db schemas overflow context budget).

- MAX_GENERATE_LENGTH 2048 → 4096: more room for task+verifier output.

- eval_interval 20 → 10: get eval signal earlier.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@devin-ai-integration bot left a comment


Devin Review found 2 new potential issues.

View 7 additional findings in Devin Review.


    output = self.model(sequences_fwd, attention_mask=None, position_ids=position_ids_fwd)
else:
    output = self.model(sequences_fwd, attention_mask=attention_mask_fwd, position_ids=position_ids_fwd)
use_chunked = self.loss_chunk_size > 0

🔴 loss_chunk_size=None from config causes TypeError in model forward pass

The config default for loss_chunk_size is Optional[int] = None (config.py:623) and the YAML default is null (ppo_base_config.yaml:260). fsdp_worker.py:191 passes self.cfg.loss_chunk_size (which is None) directly to HFModelWrapper(loss_chunk_size=None). Since None is passed explicitly, it overrides the wrapper's default of 0, setting self.loss_chunk_size = None. Then at line 388, use_chunked = self.loss_chunk_size > 0 raises TypeError: '>' not supported between instances of 'NoneType' and 'int', crashing the first training forward pass for any run that doesn't explicitly set loss_chunk_size to an integer (i.e., all non-Fleet-35B runs using default config).

Suggested change
use_chunked = self.loss_chunk_size > 0
use_chunked = self.loss_chunk_size is not None and self.loss_chunk_size > 0

Comment on lines +202 to +203
if mini_batch_size <= 0:
    mini_batch_size = dp_size

🟡 stage_chunks fallback to dp_size silently drops trailing samples

When the while loop at line 200 exhausts without finding a valid mini_batch_size (because no multiple of dp_size evenly divides len(data)), it falls back to mini_batch_size = dp_size at line 203. If len(data) % dp_size != 0, the subsequent num_mini_batches = len(data) // mini_batch_size at line 212 silently drops the remainder samples. For example, with len(data)=145 and dp_size=16, 145 // 16 = 9 mini-batches process only 144 samples, silently dropping 1. This can occur when hint augmentation produces batch sizes that aren't multiples of dp_size.

Suggested change
if mini_batch_size <= 0:
    mini_batch_size = dp_size
if mini_batch_size <= 0:
    mini_batch_size = dp_size
    if len(data) % mini_batch_size != 0:
        logger.warning(
            f"Fallback mini_batch_size={mini_batch_size} does not evenly divide "
            f"batch of {len(data)} samples; {len(data) % mini_batch_size} samples will be dropped."
        )

Rewrites _judge_task as a pre-filter optimized for very low false positive
rate. Only rejects tasks with clear structural defects:
  1. Phantom tables (not in env schema)
  2. Undefined function/constant references
  3. Vacuous checks (only user-exists or len>0)
  4. Read-write mismatch (prompt asks reads, verifier checks writes)

Passes env_schema to the classifier so it can verify table references.
Uses Haiku via OpenRouter for low cost/latency (~$0.001 per call).
Defaults to ACCEPT on any error (conservative).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@devin-ai-integration bot left a comment


Devin Review found 1 new potential issue.

View 11 additional findings in Devin Review.


prompts_hinted += 1

# Create hinted agent_loop tasks (new env instances)
base_rep_id = max(item[0] for item in items) + 1

🟡 Hint augmentation base_rep_id uses batch index instead of repetition_id, creating collisions across groups

base_rep_id = max(item[0] for item in items) + 1 at line 866 uses item[0], which is the index i into all_outputs (the batch position, e.g. 0–127), not the repetition_id from TrajectoryID. Since prepare_generator_input (skyrl/train/generators/utils.py:404) creates repetition_id values 0 through n_samples_per_prompt-1, hinted TrajectoryID objects should use repetition_id >= n_samples_per_prompt. Instead, base_rep_id is a large batch index (e.g., 120 for a prompt at batch position 120), which is semantically wrong and could collide with batch indices from other instance groups.

Fix

Replace max(item[0] for item in items) with max(trajectory_ids[item[0]].repetition_id for item in items) to correctly derive from the actual repetition_id.

Suggested change
base_rep_id = max(item[0] for item in items) + 1
base_rep_id = max(trajectory_ids[item[0]].repetition_id for item in items) + 1

Read-write mismatch is too subjective and risks false positives.
Classifier now checks only: phantom tables, undefined refs, vacuous checks.
Switch judge model from Haiku to Sonnet 4.5.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@devin-ai-integration bot left a comment


Devin Review found 2 new potential issues.

View 14 additional findings in Devin Review.


Comment on lines +393 to +394
lm_head = self.model.lm_head
self.model.lm_head = _IdentityLMHead()

🔴 Chunked lm_head identity replacement silently fails with LoRA-wrapped models

When loss_chunk_size > 0, _IdentityLMHead is assigned via self.model.lm_head = _IdentityLMHead() (model_wrapper.py:394). For non-LoRA models, self.model is the CausalLM directly, so the identity replacement works. But when lora_rank > 0, self.model is a PeftModel. Python's __setattr__ sets lm_head on the PeftModel instance dict, but PeftModel's forward calls self.base_model(...) → inner CausalLM, which still accesses its original lm_head (not the identity). The chunked path silently falls back to materializing full (B, S, vocab_size) logits, defeating the memory optimization. This currently doesn't trigger since fleet configs use lora_rank=0, but it's a correctness trap for anyone enabling LoRA with loss_chunk_size.

How the bug manifests

With LoRA enabled:

  1. self.model.lm_head (getter) → PeftModel __getattr__ proxies to inner model's lm_head ✓
  2. self.model.lm_head = _IdentityLMHead() (setter) → sets on PeftModel instance dict, NOT inner model
  3. Forward: PeftModel → base_model → inner CausalLM uses original lm_head → full logits materialized
  4. output["logits"] has shape (B, S, vocab_size) instead of (B, S, hidden_dim)
  5. _chunked_lm_head_forward receives logits instead of hidden states → garbage logprobs
Prompt for agents
In skyrl/backends/skyrl_train/workers/model_wrapper.py, the _IdentityLMHead replacement at lines 393-394 does not work when self.model is a PeftModel (LoRA enabled). The fix should unwrap through PeftModel to set the identity on the actual inner model. For example:

1. Unwrap PeftModel wrappers to reach the actual CausalLM. (A naive
   while-hasattr loop over base_model does not terminate, because
   PreTrainedModel exposes base_model as a property on every transformers
   model; PEFT's get_base_model() is the safe unwrap.)

       inner_model = self.model
       if hasattr(inner_model, "get_base_model"):  # PeftModel
           inner_model = inner_model.get_base_model()

2. Replace lm_head on the inner_model:

       inner_model.lm_head = _IdentityLMHead()

3. Restore on the same inner_model in the finally block.

Alternatively, add an assertion at lines 388-390 that raises an error if both loss_chunk_size > 0 and the model is a PeftModel, to prevent silent incorrect behavior.

)

# Ensure output directory exists
os.makedirs(os.path.dirname(os.path.expanduser(output_file)), exist_ok=True)

🟡 export_tasks.py crashes with FileNotFoundError when output path has no directory component

os.makedirs(os.path.dirname(os.path.expanduser(output_file)), exist_ok=True) at export_tasks.py:51 will raise FileNotFoundError if output_file is a bare filename like "tasks.json". os.path.dirname("tasks.json") returns "", and os.makedirs("", exist_ok=True) fails. The fix on line 53 (output_path = os.path.expanduser(output_file)) handles ~ correctly but doesn't prevent the earlier makedirs crash.

Suggested change
os.makedirs(os.path.dirname(os.path.expanduser(output_file)), exist_ok=True)
output_path = os.path.expanduser(output_file)
output_dir = os.path.dirname(output_path)
if output_dir:
    os.makedirs(output_dir, exist_ok=True)
