Diffusion Timestep during Joint T2AV Training

Hi, thanks for releasing the LTX-2 training code.

While reading the training code, I noticed that `FlexibleStrategy.prepare_training_inputs()` processes video and audio separately:

```python
video_result = self._process_modality(self.config.video, batch, "video", timestep_sampler)
audio_result = self._process_modality(self.config.audio, batch, "audio", timestep_sampler)
```

Inside `_process_modality()`, each generated modality calls `_initialize_noisy_target()`, which samples sigma independently.

As a result, the video and audio of the same training sample may be noised at different diffusion timesteps. For joint text-to-audio-video (T2AV) generation, should the generated audio and video modalities instead share the same sampled diffusion timestep/sigma?
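To make the question concrete, here is a minimal sketch of the two options. It is not the LTX-2 code: `sample_sigma`, `sample_independent`, and `sample_shared` are hypothetical names, and the uniform sampler is only a stand-in for whatever `timestep_sampler` actually does.

```python
import random


def sample_sigma(rng: random.Random) -> float:
    """Hypothetical stand-in for the timestep sampler: uniform sigma in [0, 1)."""
    return rng.random()


def sample_independent(rng: random.Random) -> dict[str, float]:
    # Assumed to mirror the behaviour described above: one draw per modality.
    return {"video": sample_sigma(rng), "audio": sample_sigma(rng)}


def sample_shared(rng: random.Random) -> dict[str, float]:
    # Alternative: draw once and reuse the value for both generated modalities.
    sigma = sample_sigma(rng)
    return {"video": sigma, "audio": sigma}


rng = random.Random(0)
independent = sample_independent(rng)
shared = sample_shared(rng)

print("independent:", independent)
print("shared:     ", shared)
```

With independent draws, the two sigmas generally differ for the same sample. With a shared draw, both modalities see the same noise level, which is what I would have expected for joint generation.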
