Diffusion Timestep during Joint T2AV Training #245

@suimuc

Hi, thanks for releasing the LTX-2 training code.

While reading the training code, I noticed that FlexibleStrategy.prepare_training_inputs() processes video and audio separately:

video_result = self._process_modality(self.config.video, batch, "video", timestep_sampler)
audio_result = self._process_modality(self.config.audio, batch, "audio", timestep_sampler)

Inside _process_modality(), each generated modality calls _initialize_noisy_target(), which samples sigma independently.

This means that for the same training sample, video and audio may be trained at different diffusion timesteps. For joint text-to-audio-video generation, should the generated audio and video modalities share the same sampled diffusion timestep/sigma?
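
To make the question concrete, here is a minimal sketch of the two options as I understand them. This is not the LTX-2 code: `sample_sigma` and `sample_sigmas` are hypothetical names, and the uniform sampler is only a placeholder for whatever distribution `timestep_sampler` actually uses.

```python
import random


def sample_sigma(rng):
    """Hypothetical stand-in for the timestep sampler: uniform sigma in [0, 1)."""
    return rng.random()


def sample_sigmas(batch_size, modalities, shared, seed=0):
    """Return one sigma per sample for each modality.

    shared=False: every modality draws its own sigma for a given sample
                  (my reading of the current per-modality behavior).
    shared=True:  one sigma is drawn per sample and reused by all modalities.
    """
    rng = random.Random(seed)
    sigmas = {name: [] for name in modalities}
    for _ in range(batch_size):
        if shared:
            sigma = sample_sigma(rng)  # single draw per sample
            for name in modalities:
                sigmas[name].append(sigma)
        else:
            for name in modalities:
                sigmas[name].append(sample_sigma(rng))  # one draw per modality
    return sigmas


independent = sample_sigmas(4, ["video", "audio"], shared=False)
joint = sample_sigmas(4, ["video", "audio"], shared=True)
print(joint["video"] == joint["audio"])  # shared draws match per sample
```

With `shared=False`, the video and audio sigmas for the same sample are unrelated draws; with `shared=True`, they are identical by construction.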
