Skip to content

[skyrl][tinker] Multi-modal Tinker Sampling#1484

Open
nithinvc wants to merge 6 commits intoNovaSky-AI:mainfrom
nithinvc:nithinc/mm-sample
Open

[skyrl][tinker] Multi-modal Tinker Sampling#1484
nithinvc wants to merge 6 commits intoNovaSky-AI:mainfrom
nithinvc:nithinc/mm-sample

Conversation

@nithinvc
Copy link
Copy Markdown
Contributor

@nithinvc nithinvc commented Apr 9, 2026

Summary

Adds VLM sampling support to RemoteInferenceClient sample endpoint. Finalizes inference side changes for #1200.

  • Extract _render_for_sample to handle both text-only and image-containing prompts. For text-only prompts, it flattens chunk tokens directly. When images are present, it calls /v1/chat/completions/render to process images, then splices placeholder tokens into the pre-tokenized text stream with adjusted offsets.
  • Update sample() to pass multi-modal features through to the generate payload when present.

Test plan

  • Verify text-only sampling still works (no render call made, features is None)
  • Verify image sampling works end-to-end (render call is made, placeholder tokens are spliced correctly, features are included in the generate payload)
  • New multi-modal sampling tests in test_vlm_inference_generation.py

Open with Devin

@nithinvc nithinvc marked this pull request as ready for review April 9, 2026 16:46
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.

Open in Devin Review

gemini-code-assist[bot]

This comment was marked as resolved.

@SumanthRH SumanthRH self-assigned this Apr 13, 2026
pcmoritz pushed a commit that referenced this pull request Apr 17, 2026
## Summary

Integrates the VLLMRenderer (landed in #1464) into the SkyRL train
backend so that VLM training batches include image placeholder tokens
and decoded vision tensors (`pixel_values`, `image_grid_thw`).

- When using new inference (`_SKYRL_USE_NEW_INFERENCE`),
`_to_training_batch` lazily creates a `VLLMRenderer` and renders all
`ModelInput`s through it.
- Extracts `pixel_values` and `image_grid_thw` from rendered outputs and
adds them to the `TrainingInputBatch` as `TensorList` entries (one
tensor per batch element, since patch counts vary per image).
- Extends `_pad_batch` to handle `TensorList` fields by cycling and
cloning entries, matching the existing padding strategy for regular
tensors.
- Reorders `forward_backward` and `forward` to call `_to_training_batch`
before `_sleep_inference_engines`, since the renderer needs the
inference servers need to be initialized. Note that this does not wake
the KV cache or model GPU memory since that is explicitly done in
`save_weights_for_sampler` via the dispatcher.

## E2E Tinker VLM Classifier Curves
With #1484 , we can now run tinker vision language recipes against the
SkyRL. Merging both closes #1200

Example:
```bash
 _SKYRL_USE_NEW_INFERENCE=1 uv run --extra tinker --extra fsdp -m skyrl.tinker.api \
    --base-model "Qwen/Qwen3-VL-8B-Instruct" \
    --backend fsdp \
    --backend-config '{"trainer.placement.policy_num_gpus_per_node": 8, "generator.inference_engine.num_engines": 8, "trainer.placement.colocate_all": true, "trainer.use_sample_packing": false}'
```

Cookbook
```bash
 TINKER_API_KEY=tml-dummy uv run --with tinker --with datasets --with torch python -m \
    tinker_cookbook.recipes.vlm_classifier.train  \
    base_url=http://localhost:8000 \
    model_name="Qwen/Qwen3-VL-4B-Instruct" \
    dataset=caltech101 
```

Train nll:
<img width="1200" height="675" alt="train_nll"
src="https://github.com/user-attachments/assets/82e36767-edee-43b7-ab4a-7fbf496c8cbb"
/>

Val nll:
<img width="1200" height="675" alt="val_nll"
src="https://github.com/user-attachments/assets/1dc6e96b-7e1b-4ead-bf0e-71e42eab0491"
/>

Val accuracy:
<img width="1200" height="675" alt="accuracy"
src="https://github.com/user-attachments/assets/ec6f92b8-a544-42d9-9a00-4c06292e7ae3"
/>

<!-- devin-review-badge-begin -->

---

<a href="https://app.devin.ai/review/novasky-ai/skyrl/pull/1496"
target="_blank">
  <picture>
<source media="(prefers-color-scheme: dark)"
srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img
src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1"
alt="Open with Devin">
  </picture>
</a>
<!-- devin-review-badge-end -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants