Conversation
Add fleet_task environment that integrates Fleet-hosted tasks with SkyRL via OpenEnv's FleetTaskEnv abstraction layer. Supports multi-turn tool-use and computer-use (multimodal) modalities.

- FleetTaskEnv(BaseTextEnv): provisions Fleet env, multi-turn episodes, reward via verifier, partial reward support, hint augmentation
- Tool call parser: handles `<tool_call>`/`<function_call>` tag formats with JSON repair for missing closing braces
- Multimodal observations: returns image_url content blocks for CUA, compatible with upstream's extract_images_from_conversation()
- Per-env metrics aggregation with environment breakdown
- Context management integration for long trajectories
- Trace upload support for eval telemetry

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…onfigs

Port Fleet-specific training infrastructure from fork to fresh SkyRL-v2:

Entrypoints:
- main_fleet.py: GRPO training on Fleet-hosted envs with S3 checkpoints
- main_task_gen.py: Task generation training entrypoint
- main_fleet_tinker.py: Tinker-based training with Fleet envs (LoRA, async)

Dataset & Checkpoints:
- prepare_dataset.py: Convert Fleet task JSON to SkyRL parquet format (stratified split, dedup, env capping, difficulty filtering)
- s3_checkpoints.py: Async S3 upload, cross-VM resume, local cleanup
- export_tasks.py: CLI to export tasks from Fleet API

Training Scripts:
- fleet-common-setup.sh: Shared setup (deps, OpenEnv, dataset download)
- fleet-common-run.sh: Multi-node Ray cluster + training launch
- fleet-35b-run.sh: Qwen3.5-35B config (TP=2, multi-node)
- fleet-qwen35-extra-setup.sh: Qwen3.5 deps (transformers 5.3, flash-attn)
- fleet-task-gen-run.sh: Task generation config

SkyPilot YAML Configs:
- openenv-fleet-grpo-qwen3_5-35b.yaml: 2-node H200 training
- task-gen-grpo-qwen3_5-9b.yaml: Single-node task gen

Also adds fleet_task and task_gen config to skyrl_gym_config/default.yaml.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Port the task generation environment from fleet-ai/SkyRL that enables RL-based training of task-generating models. The environment supports multi-turn task generation where the model generates (prompt, verifier) pairs that are evaluated via Fleet harness rollouts.

Key components:
- TaskGenEnv(BaseTextEnv): Multi-turn env with tool-based DB exploration, task generation, and reward computation via variance + hint gap
- VerifierSandbox: AST-based static analysis for generated verifier code safety (blocked imports/builtins, complexity bounds, signature checks)
- Tool call parser: Handles `<tool_call>`/`<function_call>` tag formats

Reward formula: R = gate * (base_quality + alpha * var(raw_scores) + hint_gap)

Depends on PR #2 (fleet/training) for integrations.fleet.task_gen_reward.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
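As a rough sketch, the reward formula above can be expressed as follows. The function name, the use of population variance, and the default `alpha` are illustrative assumptions, not the actual `integrations.fleet.task_gen_reward` implementation:

```python
import statistics

def task_gen_reward(gate, base_quality, raw_scores, hint_gap, alpha=1.0):
    """Sketch of R = gate * (base_quality + alpha * var(raw_scores) + hint_gap).

    `gate` (0 or 1) comes from the verifier sandbox / judge; the variance of
    the raw rollout scores rewards tasks that discriminate between rollouts
    rather than being uniformly trivial or uniformly impossible.
    """
    return gate * (base_quality + alpha * statistics.pvariance(raw_scores) + hint_gap)

# A task whose rollouts all score the same earns no variance bonus:
task_gen_reward(1, 0.5, [1.0, 1.0, 1.0, 1.0], 0.0)   # 0.5
# Mixed rollout outcomes add variance signal:
task_gen_reward(1, 0.5, [0.0, 1.0, 0.0, 1.0], 0.0)   # 0.5 + 0.25 = 0.75
# Gate 0 (verifier rejected) zeroes everything:
task_gen_reward(0, 0.5, [0.0, 1.0], 0.2)             # 0.0
```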
When all raw rollout samples for a prompt score 0, hint augmentation generates additional rollouts with verifier feedback injected into the prompt. This rescues GRPO signal for otherwise dead prompts.

Key components:
- _run_hint_augmentation() in SkyRLGymGenerator: groups outputs by instance_id, identifies failing prompts, builds hint text from verifier ERROR/SUCCESS_ACCUMULATOR, launches hinted rollouts
- RLTF-SD: replaces hinted prompt_ids with the original unhinted prompt_ids so the model learns to produce hint-quality outputs from the original prompt alone (grad log pi(y_hint | x_0), not grad log pi(y_hint | x_0 + hint))
- First-turn baseline in compute_grpo_outcome_advantage: when is_hinted is present, computes group mean/std from raw samples only, preventing hinted samples from contaminating the GRPO baseline
- Metrics: hint/total_hinted_rollouts, hint/hint_success_rate, hint/prompts_hinted, hint/signal_rescued

Config: enable_hints, hint_reward_threshold, n_hint_samples in the fleet_task section of skyrl_gym_config. Only runs during training (not eval), only for non-step-wise trajectories, and only when fleet_task.enable_hints=true.

Depends on PR #1 (fleet/task-env) for FleetTaskEnv.build_hint_text().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
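The first-turn baseline can be sketched as follows. Function and argument names are illustrative, not SkyRL-v2's actual `compute_grpo_outcome_advantage` signature:

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, is_hinted, eps=1e-6):
    """Sketch of the first-turn baseline: the GRPO group mean/std is
    computed from raw (unhinted) samples only, then applied to every
    sample in the group, so hinted rollouts cannot shift the baseline."""
    raw = [r for r, hinted in zip(rewards, is_hinted) if not hinted]
    mu, sigma = mean(raw), pstdev(raw)
    return [(r - mu) / (sigma + eps) for r in rewards]

# 4 raw rollouts plus 1 hinted success; raw group mean=0.5, std=0.5,
# so raw advantages are roughly [-1, -1, 1, 1] and the hinted sample is
# scored against the same raw baseline instead of its own group.
adv = grpo_advantages([0.0, 0.0, 1.0, 1.0, 1.0], [False, False, False, False, True])
```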
Port vision-language model support from SkyRL v1 (feat/vl-support-clean) to SkyRL-v2's architecture:

- Generator: VL-aware chat template, image accumulation across turns, multi_modal_data construction for vLLM
- Engine pipeline: thread multi_modal_data through preprocess/generate in both sync and async vLLM engines
- Fleet env: Qwen coordinate adaptation ([0,1000] <-> pixel), initial screenshot capture, computer_use browser hints, done signal detection
- Utilities: image extraction, base64 decode, processor loading, VL chat template with proper vision token expansion
- New VL run script and SkyPilot YAML for CUA training
- Update existing YAMLs to use the fleet/all branch

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RunPod/Lambda/Nebius/Vast were all out of H200 capacity. Add GCP spot with proper NVIDIA 570 driver image. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SkyRL-v2 pyproject.toml defines 'fsdp' extra (includes vllm, flash-attn, torch, flashinfer) but not a standalone 'vllm' extra. The old SkyRL had 'vllm' as a separate extra. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
uv pip install silently fails to build causal-conv1d CUDA extension (reports "Checked 1 package" but module is not importable). Use pip with --no-build-isolation to ensure it finds torch from the venv. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
In SkyRL-v2, scripts/ is directly under repo root (not nested under skyrl-train/). Changed cd from "../.." to ".." so the run scripts correctly resolve the repo root directory. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New config fields:
- TrainerConfig: loss_chunk_size, use_hybrid_env_sampling, min_samples_per_env
- GeneratorConfig: inject_context_status, context_warning_threshold, trajectory_timeout_seconds

SkyRL-v2's strict Hydra config rejects unknown keys (no + prefix), so these must be defined in the dataclass and YAML defaults.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
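For illustration, the additions amount to declaring the keys on the config dataclasses so strict Hydra validation accepts them. This is a minimal sketch: the real classes live in SkyRL-v2's config module, and all defaults below except `loss_chunk_size=None` are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrainerConfig:
    # None/null default confirmed elsewhere in this thread; off by default.
    loss_chunk_size: Optional[int] = None
    use_hybrid_env_sampling: bool = False      # assumed default
    min_samples_per_env: int = 0               # assumed default

@dataclass
class GeneratorConfig:
    inject_context_status: bool = False        # assumed default
    context_warning_threshold: float = 0.8     # assumed default
    trajectory_timeout_seconds: Optional[float] = None  # assumed default
```

Because the fields exist on the dataclass, overrides like `trainer.loss_chunk_size=4096` pass validation without the `+` prefix.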
The fleet entrypoints use @hydra.main which loads the legacy YAML directly, but validate_cfg expects generator.inference_engine.* (the new structured format). Apply translate_legacy_config to convert flat generator.* keys before validation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The legacy YAML has flat generator.* keys (e.g. generator.backend) but validate_cfg expects generator.inference_engine.* with all fields including distributed_executor_backend. Add the full inference_engine section with defaults so all fields are present after Hydra loads the config and translate_legacy_config moves CLI overrides into it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove the explicit fleet_task register() call from main_fleet.py, since skyrl_gym.envs.__init__ already auto-registers it
- Remove --data-dir-name task_gen from the task-gen run script so it uses the default MODALITY-based path (matching setup's download path)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace the OmegaConf.create() approach (loses dataclass type info) with an in-place sync of flat generator.* CLI overrides into the structured generator.inference_engine section. This preserves the Hydra DictConfig and avoids TypeError on dataclasses.asdict().
- Remove --skip-prepare from the task-gen YAML so parquet files are generated
- Remove duplicate fleet_task registration (auto-registered by __init__)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hydra entrypoints pass DictConfig (not dataclass instances), so dataclasses.asdict() fails. Fall back to OmegaConf.to_yaml() for DictConfig objects. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hydra's @hydra.main produces DictConfig objects, but the codebase expects typed dataclass instances (asdict(), attribute access, etc.). Switch Fleet entrypoints to use SkyRLTrainConfig.from_cli_overrides(), which produces proper typed dataclasses via the legacy config translation path.

- Add fleet_task/task_gen as Optional[Dict] fields on SkyRLGymConfig
- Strip ++/+ Hydra prefixes from CLI args before from_cli_overrides
- Remove _sync_legacy_generator_to_inference_engine (the legacy path handles it)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
accelerate 1.12.0 passes param.__dict__ (which includes transformers 5.3.0's _is_hf_initialized flag) to Parameter.__new__() during init_empty_weights. PyTorch 2.10.0 rejects this unknown kwarg. Newer accelerate filters it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
uv pip install -U accelerate pulls newer torch with CUDA 13.0, breaking torchvision (CUDA 12.8). Use pip install --no-deps instead to upgrade only accelerate without re-resolving transitive dependencies. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
accelerate's init_empty_weights passes param.__dict__ to Parameter() which includes _is_hf_initialized (set by transformers 5.x). torch 2.10 rejects this unknown kwarg. Patch Parameter.__new__ in fsdp_utils.py to filter it out. Revert accelerate upgrade attempt (latest is 1.13.0, still has the same issue). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
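The filtering patch pattern looks roughly like this. A stand-in `Parameter` class is used so the sketch is self-contained and runnable without torch; the actual fsdp_utils.py patch wraps `torch.nn.Parameter.__new__` the same way:

```python
import functools

class Parameter:
    """Stand-in for torch.nn.Parameter, which in torch 2.10 likewise
    rejects unknown keyword arguments in __new__."""
    def __new__(cls, data=None, requires_grad=True, **kwargs):
        if kwargs:
            raise TypeError(f"unexpected kwargs: {sorted(kwargs)}")
        obj = super().__new__(cls)
        obj.data, obj.requires_grad = data, requires_grad
        return obj

_orig_new = Parameter.__new__

@functools.wraps(_orig_new)
def _filtered_new(cls, data=None, requires_grad=True, **kwargs):
    # transformers 5.x tags params with _is_hf_initialized; accelerate's
    # init_empty_weights forwards param.__dict__ wholesale, so drop the flag
    # before delegating to the original __new__.
    kwargs.pop("_is_hf_initialized", None)
    return _orig_new(cls, data, requires_grad, **kwargs)

Parameter.__new__ = _filtered_new

# Would raise TypeError without the patch:
p = Parameter(data=[1.0], requires_grad=False, _is_hf_initialized=True)
```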
- config.py: Always use the legacy config path in from_cli_overrides so flat keys (generator.backend etc.) are properly translated via translate_legacy_config. Fixes VL/35B ValueError on GeneratorConfig.
- prepare_dataset.py: Add --env-class CLI arg (fleet_task|task_gen) to set per-record env_class in parquet data. Previously hardcoded to fleet_task, causing task_gen training to create FleetTaskEnv (which requires tasks_file).
- fleet-common-setup.sh: Accept --env-class and pass it to prepare_dataset.
- task-gen YAML: Pass --env-class task_gen in the setup block.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- task_gen_env.py: Default ROLLOUT_DIR to ~/rollouts instead of /workspace/rollouts. /workspace doesn't exist on GCP (only RunPod), causing PermissionError.
- config.py: Disable the OmegaConf struct flag on the base config before merging CLI overrides. Empty dicts in YAML (like chat_template_kwargs: {}) are loaded as closed structs, rejecting new keys during merge.
- config.py: Add try/except around asdict() in get_config_as_yaml_str to handle edge cases where asdict fails on Ray-serialized dataclasses.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… logging

FLEET_API_KEY was not being propagated to Ray workers via runtime_env, causing task_gen's import_single_task to fail with an empty API key.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The dataset prepare step stores the environment name as 'data_source' column, but TaskGenEnv.__init__ only looked for 'env_key'. This caused all import_single_task calls to use env_id='unknown', which fails with "Environment 'unknown' not found" from Fleet API. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
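The fix amounts to a fallback lookup; a minimal sketch (the real code lives in TaskGenEnv.__init__, and the function name here is illustrative):

```python
def resolve_env_key(extras):
    """Sketch of the fix: the dataset prepare step stores the env name in
    the 'data_source' column, so fall back to it when 'env_key' is absent
    instead of defaulting to 'unknown' (which the Fleet API rejects)."""
    env_key = extras.get("env_key") or extras.get("data_source")
    if not env_key:
        raise ValueError("record carries neither 'env_key' nor 'data_source'")
    return env_key
```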
expandable_segments:True in PYTORCH_CUDA_ALLOC_CONF is incompatible with vLLM's CuMemAllocator, causing AssertionError during model load. The 9B script already had this flag; the 35B was missing it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fleet env returns list-format content (from OpenEnv multimodal observations) that text-only templates like Qwen3.5-35B-A3B can't handle. This converts list content (strings or image_url dicts) to plain text before applying the chat template, preventing jinja2 TemplateError on non-VL models. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
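A sketch of the conversion, assuming the OpenAI-style `{"type": "text"|"image_url", ...}` block shapes the commit implies (the `[image]` placeholder is an assumption, not necessarily what the real code emits):

```python
def flatten_content(content):
    """Collapse list-format OpenEnv content blocks to plain text so
    text-only chat templates (e.g. Qwen3.5-35B-A3B) don't raise a
    jinja2 TemplateError on list-valued message content."""
    if isinstance(content, str):
        return content
    parts = []
    for block in content:
        if isinstance(block, str):
            parts.append(block)
        elif block.get("type") == "text":
            parts.append(block.get("text", ""))
        elif block.get("type") == "image_url":
            parts.append("[image]")  # drop pixels for non-VL models
    return "\n".join(parts)
```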
Hint augmentation extends trajectory_ids in generator_input in-place, but the separate uids variable in the trainer was never updated. This caused IndexError in postprocess_generator_output when uids had fewer entries than rewards (128 raw + N hinted rewards vs 128 uids). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a Fleet environment fails to provision (e.g., list_tools timeout), return a zero-reward trajectory instead of propagating the exception through tqdm.gather and crashing the entire training step. This makes training resilient to transient Fleet API / MCP failures. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The generator was calling env.step()/close()/init() via _run_in_executor_if_available() which runs sync methods in a thread pool. The sync methods use asyncio.run() creating NEW event loops, breaking MCP transports bound to the main event loop (→ TCPTransport closed, 100% verifier failure rate). Add _env_init/_env_step/_env_close that prefer async variants when available (init_async/step_async/close_async), keeping everything in the main event loop. Falls back to executor for envs without async methods. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
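The dispatch described above can be sketched like this (method names are from the commit message; the surrounding generator code is elided and the toy envs are illustrative):

```python
import asyncio

async def _env_step(env, action):
    """Prefer the async variant so the call stays on the main event loop
    where the MCP transports were created; fall back to the default
    thread-pool executor for envs that only expose a sync step()."""
    if hasattr(env, "step_async"):
        return await env.step_async(action)
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, env.step, action)

class SyncEnv:
    def step(self, action):
        return f"sync:{action}"

class AsyncEnv(SyncEnv):
    async def step_async(self, action):
        return f"async:{action}"

async def demo():
    return await _env_step(AsyncEnv(), "x"), await _env_step(SyncEnv(), "x")
```

`_env_init` and `_env_close` follow the same pattern for `init_async`/`close_async`.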
Code changes (same as fleet/training):
- Port loss_chunk_size to HFModelWrapper: identity lm_head trick returns hidden states (B,S,8192) instead of logits (B,S,131072), then computes logits in 4096-token chunks with gradient checkpointing
- Pass loss_chunk_size from fsdp_worker.py for policy and ref model
- Revert flash_attn back to false (GatedDeltaNet Xid 31 in FSDP2)

Docs:
- Rewrite CHANGELOG.md as a clean, coherent document covering all 6 fixes
- Update CLAUDE.md to match

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SkyRL-v2's apply_overlong_filtering takes (loss_masks, stop_reasons) not (loss_masks, response_ids, eos_token_id). Updated the call in prepare_training_data() to match. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
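Under the new signature, the call site's behavior can be sketched as follows; the `"length"` stop-reason string is an assumption based on vLLM's convention, not confirmed by the commit:

```python
def apply_overlong_filtering(loss_masks, stop_reasons):
    """Sketch of SkyRL-v2's signature: zero out the loss mask for
    trajectories that stopped due to length (overlong), keyed off
    stop_reasons rather than scanning response_ids for an EOS token."""
    return [
        [0] * len(mask) if reason == "length" else mask
        for mask, reason in zip(loss_masks, stop_reasons)
    ]
```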
eval_n_samples_per_prompt 3→1: 57 trajectories instead of 171, ~3x faster eval. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
k=3 sampling needed for pass@3 metric. Use MAX_TASKS env var to reduce eval set size instead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
flash_attn=false OOMs at 97K seq len even with chunked lm_head. The old fork uses flash_attn=true + chunked lm_head — matching that config. Updated CHANGELOG fix #5 and CLAUDE.md: earlier Xid 31 was from memory exhaustion (missing chunked lm_head), not GatedDeltaNet incompatibility. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Guard against Tinker returning logprobs with a different length than tokens in rollout collection (truncate/pad to match).
2. Safety check in prepare_training_data ensuring all arrays match target_tokens length before creating a Datum.

These prevent the "target_tokens and logprobs must have the same shape" error from Tinker's forward_backward API.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
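Guard (1) can be sketched as a simple align step. The `pad_value` of 0.0 is an assumption; the commit doesn't say what the real code pads with:

```python
def align_logprobs(tokens, logprobs, pad_value=0.0):
    """Truncate or pad logprobs so they match the token sequence length
    before handing both to forward_backward."""
    if len(logprobs) > len(tokens):
        return logprobs[: len(tokens)]
    return logprobs + [pad_value] * (len(tokens) - len(logprobs))
```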
…rrors

Retries describe_db/query_db up to 3 times on connection-related errors (TCPTransport closed, connection reset). Defense-in-depth alongside the Fleet SDK keepalive_expiry fix (fleet-sdk PR NovaSky-AI#85).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
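A sketch of the retry shape, assuming substring matching on the error message and a fixed delay (the actual backoff and matching in the env code may differ):

```python
import time

RETRYABLE = ("tcptransport closed", "connection reset")

def with_retries(fn, attempts=3, delay=1.0):
    """Re-run a DB tool call up to `attempts` times when the error
    message looks connection-related; re-raise anything else (or the
    final failure) immediately."""
    for i in range(attempts):
        try:
            return fn()
        except Exception as e:
            retryable = any(s in str(e).lower() for s in RETRYABLE)
            if not retryable or i == attempts - 1:
                raise
            time.sleep(delay)
```

With `delay=0` and a callable that raises `ConnectionError("TCPTransport closed")` twice before succeeding, `with_retries` returns the success on the third attempt.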
vLLM 0.18.0's CuMemAllocator conflicts with expandable_segments, but without expandable_segments memory fragmentation causes OOM on 35B. Pin vLLM 0.17.0, which uses cudaMalloc and has no conflict. Consolidated the CHANGELOG to 5 fixes (merged old #4/#5 into a single vLLM pin fix) and updated CLAUDE.md to match. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The upstream vLLM 0.18 bump (d00b17e) removed backward-compat shims and added 0.18-only APIs (OpenAIModelRegistry, OpenAIServingRender). Since fleet-35b-run.sh pins vLLM 0.17.0 (for expandable_segments compatibility), revert vllm_engine.py to the version with 0.17.0 support. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reverts the vLLM 0.17.0 pin and the vllm_engine.py pre-0.18 revert. Instead:
- MAX_INPUT_LENGTH 96000→72000 to reduce memory pressure
- --no-pytorch-alloc-conf (disables expandable_segments for 0.18.0 compat)
- flash_attn=true + chunked lm_head + empty_cache at 72K

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. parse_tool_calls() returns ALL <tool_call> tags instead of just the first one. The model often batches multiple tool calls in one generation (73% of trajectories). Previously only the first was executed; the rest were silently dropped.
2. Remove the "must explore before generating task" gate. With TCP errors on describe_db, this gate rejected 62-78% of generated tasks. The model should be free to generate tasks at any point.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…0.18.0

flash_attn=true + vLLM 0.18.0 triggers Xid 31 FAULT_PDE in GatedDeltaNet during the ref forward at both 97K and 72K — not a memory issue but CUDA memory-mapping corruption from vLLM's CuMemAllocator. Trying SDPA at 72K.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backward pass verified working on sky-4da1-deniz. Re-enable step-0 eval for production training. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ep timing

Corrected fix #4 to reflect the final working config:
- flash_attn=false (SDPA), not flash_attn=true
- flash_attn=true causes Xid 31 at both 97K AND 72K (CuMemAllocator issue)
- Added "Verified working" note: ref forward 8.4 min, backward 45.6 min

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
12 hours on GCP spot 2×H200:8 with zero GPU errors. Avg step time ~70 min, checkpoint saved to S3 at step 10. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code Review
This pull request implements Fleet-specific optimizations for multi-node FSDP2 training, featuring memory-efficient chunked forward passes, synchronous model offloading with barriers, and support for multimodal Vision-Language tasks. It introduces a new task generation environment with database exploration tools and provides comprehensive S3 checkpoint management. The review feedback highlights a documentation inconsistency regarding flash attention usage, identifies deprecated Pydantic configuration patterns, warns against blocking calls within asynchronous functions, and suggests refactoring duplicated tool parsing logic into a shared utility.
> 3. **CUDA memory management for 35B**: `torch.cuda.empty_cache()` before backward pass in `worker.py` (policy + critic). Prevents OOM from fragmentation.
> 4. **Reduced sequence length (72K) for 35B**: `fleet-35b-run.sh` uses `MAX_INPUT_LENGTH=72000` (down from 96000) with `--no-pytorch-alloc-conf` (disables `expandable_segments` which conflicts with vLLM 0.18.0's `CuMemAllocator`). At 97K, SDPA OOM'd and flash_attn hit Xid 31 in GatedDeltaNet. At 72K, flash_attn=true + chunked lm_head + empty_cache fits without expandable_segments.
There appears to be a documentation inconsistency. This line states that flash_attn=true is part of the working configuration at 72K sequence length. However, integrations/fleet/CHANGELOG.md (line 60) and scripts/fleet-35b-run.sh (line 43) specify that flash_attn=false is used. Please update this documentation for consistency.
```python
class Config:
    frozen = True
```
The inner `Config` class for Pydantic model configuration is deprecated in Pydantic v2 and will be removed in v3. Configuration should be passed as keyword arguments directly in the `BaseModel` subclass definition. You should remove this `Config` class and change line 102 to `class RolloutOutput(BaseModel, frozen=True):`.
```python
sampling_path = training_client.save_weights_for_sampler(name=f"step_{step:06d}").result().path
sampling_client = service_client.create_sampling_client(model_path=sampling_path)
```
The use of .result() here is a blocking call inside an async function, which can hinder performance by blocking the event loop. If the tinker SDK provides an asynchronous way to get the result (e.g., an await-able method), it should be used instead to maintain the benefits of an async architecture. This also applies to the .result() calls on lines 877-878.
| """ | ||
| Tool call parser for task generation environment. | ||
|
|
||
| Parses <tool_call> and <function_call> tagged JSON from LLM responses. | ||
| Copied from skyrl-train/integrations/fleet/env.py to avoid cross-package imports. | ||
| """ | ||
|
|
||
| import json | ||
| import re | ||
| from typing import Any, Dict, List, Optional | ||
|
|
||
|
|
||
| def _try_parse_json(raw: str) -> Optional[Dict[str, Any]]: | ||
| """Try to parse JSON, repairing missing trailing braces if needed.""" | ||
| raw = raw.strip() | ||
| try: | ||
| parsed = json.loads(raw) | ||
| if isinstance(parsed, dict): | ||
| return parsed | ||
| except (json.JSONDecodeError, ValueError): | ||
| pass | ||
|
|
||
| # Repair: models often drop trailing closing braces on nested JSON. | ||
| # Try appending up to 3 closing braces. | ||
| for extra in range(1, 4): | ||
| try: | ||
| parsed = json.loads(raw + "}" * extra) | ||
| if isinstance(parsed, dict): | ||
| return parsed | ||
| except (json.JSONDecodeError, ValueError): | ||
| continue | ||
|
|
||
| return None | ||
|
|
||
|
|
||
| def _parse_one(match_text: str) -> Optional[Dict[str, Any]]: | ||
| """Parse a single tool call from matched text.""" | ||
| parsed = _try_parse_json(match_text) | ||
| if parsed is None: | ||
| return None | ||
| name = parsed.get("name") or parsed.get("tool") | ||
| args = parsed.get("arguments") or parsed.get("params", {}) | ||
| if name: | ||
| return {"name": name, "arguments": args} | ||
| return None | ||
|
|
||
|
|
||
| def parse_tool_call(action: str) -> Optional[Dict[str, Any]]: | ||
| """Parse the first tool call from LLM response. Returns None if not found.""" | ||
| calls = parse_tool_calls(action) | ||
| return calls[0] if calls else None | ||
|
|
||
|
|
||
| def parse_tool_calls(action: str) -> List[Dict[str, Any]]: | ||
| """Parse all tool calls from LLM response. | ||
|
|
||
| Supports tag-based formats: | ||
| - <tool_call>{"name": "...", "arguments": {...}}</tool_call> | ||
| - <function_call>{"name": "...", "arguments": {...}}</function_call> | ||
|
|
||
| Also handles cases where the closing tag is missing (e.g., when </tool_call> | ||
| is used as the stop string and not included in the output). | ||
|
|
||
| Returns list of dicts with "name" and "arguments" keys. | ||
| """ | ||
| results: List[Dict[str, Any]] = [] | ||
|
|
||
| for tag in ["tool_call", "function_call"]: | ||
| # Find all with closing tag | ||
| for match in re.finditer(rf"<{tag}>(.*?)</{tag}>", action, re.DOTALL): | ||
| parsed = _parse_one(match.group(1)) | ||
| if parsed: | ||
| results.append(parsed) | ||
|
|
||
| # If none found with closing tags, try without (stop string case) | ||
| if not results: | ||
| match = re.search(rf"<{tag}>(.*?)(?:<\||\Z)", action, re.DOTALL) | ||
| if match: | ||
| parsed = _parse_one(match.group(1)) | ||
| if parsed: | ||
| results.append(parsed) | ||
|
|
||
| return results |
This file is nearly identical to skyrl-gym/skyrl_gym/envs/fleet_task/tool_call_parser.py. Duplicating this utility makes the code harder to maintain, as bug fixes or improvements would need to be applied in multiple places. Consider moving this parsing logic to a shared utility module within skyrl_gym to avoid duplication.
Add enable_hints flag (default False) gated by env_config. Previously hints were always ON during training (controlled by is_eval); now hints only run when enable_hints=True AND not in eval mode. Hints were net negative in iter#11 — the verifier code dump confused the evaluator. Reward now uses raw variance only: R = base_quality + judge_gate * alpha * var(raw). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add tool_call_reward_per_call config (default 0.0, set to 0.02 in the run script). Rewards each successful meta-tool call (describe_db, query_db) to incentivize multi-turn DB exploration instead of single-turn guessing from the system prompt.
- MAX_INPUT_LENGTH 30720 → 65536: baseline runs showed 30K forced single-turn convergence by step 5 (describe_db schemas overflow the context budget).
- MAX_GENERATE_LENGTH 2048 → 4096: more room for task+verifier output.
- eval_interval 20 → 10: get eval signal earlier.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```python
    output = self.model(sequences_fwd, attention_mask=None, position_ids=position_ids_fwd)
else:
    output = self.model(sequences_fwd, attention_mask=attention_mask_fwd, position_ids=position_ids_fwd)
use_chunked = self.loss_chunk_size > 0
```
🔴 loss_chunk_size=None from config causes TypeError in model forward pass
The config default for loss_chunk_size is Optional[int] = None (config.py:623) and the YAML default is null (ppo_base_config.yaml:260). fsdp_worker.py:191 passes self.cfg.loss_chunk_size (which is None) directly to HFModelWrapper(loss_chunk_size=None). Since None is passed explicitly, it overrides the wrapper's default of 0, setting self.loss_chunk_size = None. Then at line 388, use_chunked = self.loss_chunk_size > 0 raises TypeError: '>' not supported between instances of 'NoneType' and 'int', crashing the first training forward pass for any run that doesn't explicitly set loss_chunk_size to an integer (i.e., all non-Fleet-35B runs using default config).
```diff
-use_chunked = self.loss_chunk_size > 0
+use_chunked = self.loss_chunk_size is not None and self.loss_chunk_size > 0
```
```python
if mini_batch_size <= 0:
    mini_batch_size = dp_size
```
🟡 stage_chunks fallback to dp_size silently drops trailing samples
When the while loop at line 200 exhausts without finding a valid mini_batch_size (because no multiple of dp_size evenly divides len(data)), it falls back to mini_batch_size = dp_size at line 203. If len(data) % dp_size != 0, the subsequent num_mini_batches = len(data) // mini_batch_size at line 212 silently drops the remainder samples. For example, with len(data)=145 and dp_size=16, 145 // 16 = 9 mini-batches process only 144 samples, silently dropping 1. This can occur when hint augmentation produces batch sizes that aren't multiples of dp_size.
```diff
 if mini_batch_size <= 0:
     mini_batch_size = dp_size
+if len(data) % mini_batch_size != 0:
+    logger.warning(
+        f"Fallback mini_batch_size={mini_batch_size} does not evenly divide "
+        f"batch of {len(data)} samples; {len(data) % mini_batch_size} samples will be dropped."
+    )
```
Rewrites _judge_task as a pre-filter optimized for a very low false positive rate. Only rejects tasks with clear structural defects:
1. Phantom tables (not in env schema)
2. Undefined function/constant references
3. Vacuous checks (only user-exists or len>0)
4. Read-write mismatch (prompt asks reads, verifier checks writes)

Passes env_schema to the classifier so it can verify table references. Uses Haiku via OpenRouter for low cost/latency (~$0.001 per call). Defaults to ACCEPT on any error (conservative).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```python
prompts_hinted += 1

# Create hinted agent_loop tasks (new env instances)
base_rep_id = max(item[0] for item in items) + 1
```
🟡 Hint augmentation base_rep_id uses batch index instead of repetition_id, creating collisions across groups
base_rep_id = max(item[0] for item in items) + 1 at line 866 uses item[0], which is the index i into all_outputs (the batch position, e.g. 0–127), not the repetition_id from TrajectoryID. Since prepare_generator_input (skyrl/train/generators/utils.py:404) creates repetition_id values 0 through n_samples_per_prompt-1, hinted TrajectoryID objects should use repetition_id >= n_samples_per_prompt. Instead, base_rep_id is a large batch index (e.g., 120 for a prompt at batch position 120), which is semantically wrong and could collide with batch indices from other instance groups.
Fix
Replace max(item[0] for item in items) with max(trajectory_ids[item[0]].repetition_id for item in items) to correctly derive from the actual repetition_id.
```diff
-base_rep_id = max(item[0] for item in items) + 1
+base_rep_id = max(trajectory_ids[item[0]].repetition_id for item in items) + 1
```
Read-write mismatch is too subjective and risks false positives. Classifier now checks only: phantom tables, undefined refs, vacuous checks. Switch judge model from Haiku to Sonnet 4.5. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```python
lm_head = self.model.lm_head
self.model.lm_head = _IdentityLMHead()
```
🔴 Chunked lm_head identity replacement silently fails with LoRA-wrapped models
When loss_chunk_size > 0, _IdentityLMHead is assigned via self.model.lm_head = _IdentityLMHead() (model_wrapper.py:394). For non-LoRA models, self.model is the CausalLM directly, so the identity replacement works. But when lora_rank > 0, self.model is a PeftModel. Python's __setattr__ sets lm_head on the PeftModel instance dict, but PeftModel's forward calls self.base_model(...) → inner CausalLM, which still accesses its original lm_head (not the identity). The chunked path silently falls back to materializing full (B, S, vocab_size) logits, defeating the memory optimization. This currently doesn't trigger since fleet configs use lora_rank=0, but it's a correctness trap for anyone enabling LoRA with loss_chunk_size.
How the bug manifests
With LoRA enabled:
1. `self.model.lm_head` (getter) → `PeftModel.__getattr__` proxies to the inner model's `lm_head` ✓
2. `self.model.lm_head = _IdentityLMHead()` (setter) → sets on the PeftModel instance dict, NOT the inner model
3. Forward: PeftModel → base_model → inner CausalLM uses the original lm_head → full logits materialized
4. `output["logits"]` has shape (B, S, vocab_size) instead of (B, S, hidden_dim)
5. `_chunked_lm_head_forward` receives logits instead of hidden states → garbage logprobs
Prompt for agents
In `skyrl/backends/skyrl_train/workers/model_wrapper.py`, the `_IdentityLMHead` replacement at lines 393-394 does not work when `self.model` is a PeftModel (LoRA enabled). The fix should unwrap through PeftModel to set the identity on the actual inner model. For example:

1. Add a helper to find the actual CausalLM model inside PeftModel wrappers:

```python
inner_model = self.model
while hasattr(inner_model, 'base_model') or hasattr(inner_model, 'model'):
    inner_model = getattr(inner_model, 'model', inner_model)
```

2. Replace `lm_head` on the `inner_model`:

```python
inner_model.lm_head = _IdentityLMHead()
```

3. Restore on the same `inner_model` in the `finally` block.

Alternatively, add an assertion at lines 388-390 that raises an error if both `loss_chunk_size > 0` and the model is a PeftModel, to prevent silent incorrect behavior.
```python
)

# Ensure output directory exists
os.makedirs(os.path.dirname(os.path.expanduser(output_file)), exist_ok=True)
```
🟡 export_tasks.py crashes with FileNotFoundError when output path has no directory component
os.makedirs(os.path.dirname(os.path.expanduser(output_file)), exist_ok=True) at export_tasks.py:51 will raise FileNotFoundError if output_file is a bare filename like "tasks.json". os.path.dirname("tasks.json") returns "", and os.makedirs("", exist_ok=True) fails. The fix on line 53 (output_path = os.path.expanduser(output_file)) handles ~ correctly but doesn't prevent the earlier makedirs crash.
```diff
-os.makedirs(os.path.dirname(os.path.expanduser(output_file)), exist_ok=True)
+output_path = os.path.expanduser(output_file)
+output_dir = os.path.dirname(output_path)
+if output_dir:
+    os.makedirs(output_dir, exist_ok=True)
```