Hi, thanks for releasing the LTX-2 training code.
While reading the training code, I noticed that FlexibleStrategy.prepare_training_inputs() processes video and audio separately:
video_result = self._process_modality(self.config.video, batch, "video", timestep_sampler)
audio_result = self._process_modality(self.config.audio, batch, "audio", timestep_sampler)
Inside _process_modality(), each generated modality calls _initialize_noisy_target(), which samples sigma independently.
This means that for the same training sample, video and audio may be trained at different diffusion timesteps. For joint text-to-audio-video generation, should the generated audio and video modalities share the same sampled diffusion timestep/sigma?
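To make the question concrete, here is a minimal sketch of the two options as I understand them. It is not the repository's code: `sample_sigma`, `independent_sigmas`, and `shared_sigmas` are hypothetical names, and the uniform draw is only a stand-in for whatever distribution `timestep_sampler` actually uses.

```python
import random


def sample_sigma(rng: random.Random) -> float:
    """Hypothetical stand-in for the timestep sampler: uniform sigma in [0, 1)."""
    return rng.random()


def independent_sigmas(rng: random.Random) -> dict:
    # One draw per modality, which is how I read the current behaviour.
    return {"video": sample_sigma(rng), "audio": sample_sigma(rng)}


def shared_sigmas(rng: random.Random) -> dict:
    # A single draw reused for both modalities of the same sample.
    sigma = sample_sigma(rng)
    return {"video": sigma, "audio": sigma}


rng = random.Random(0)
indep = independent_sigmas(rng)
shared = shared_sigmas(rng)
print("independent:", indep)
print("shared:     ", shared)
```

With independent draws the two modalities of one sample generally end up at different noise levels, while the shared variant keeps them aligned.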