feat: --dit-split for row-level and layer-split tensor parallelism across GPUs #1640

Open
Daniel-Uzcategui wants to merge 2 commits into leejet:master from Daniel-Uzcategui:pr-dit-split

Feature: --dit-split — GPU Tensor Parallelism for DiT Weights

Two complementary distribution modes for splitting DiT weights across multiple GPUs:

1. Row-Split (--dit-split cudaX,cudaY)

Splits each weight tensor's rows across GPUs using GGML's cuda_split_buffer_type. This mode is balanced for homogeneous GPUs: each device holds a share of the weights proportional to the tensor rows assigned to it.
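
To make the row partitioning concrete, here is a minimal sketch of how one tensor's rows could be divided into contiguous per-device ranges. It is not GGML's code: `split_rows` and its `fractions` parameter are hypothetical names introduced for illustration, and the sketch ignores any row alignment the real backend may apply.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not part of GGML or this PR): divide the n_rows rows of
// one weight tensor into contiguous [begin, end) ranges, one per device.
// fractions[i] is device i's relative share, e.g. {1, 1} for two equal GPUs.
std::vector<std::pair<int64_t, int64_t>> split_rows(int64_t n_rows,
                                                    const std::vector<float>& fractions) {
    assert(n_rows >= 0 && !fractions.empty());
    double total = 0.0;
    for (float f : fractions) {
        assert(f >= 0.0f);
        total += f;
    }
    assert(total > 0.0);

    std::vector<std::pair<int64_t, int64_t>> ranges;
    double cumulative = 0.0;
    int64_t begin = 0;
    for (std::size_t i = 0; i < fractions.size(); ++i) {
        cumulative += fractions[i];
        // The last device absorbs any rounding remainder so every row is covered.
        int64_t end = (i + 1 == fractions.size())
                          ? n_rows
                          : static_cast<int64_t>((n_rows * cumulative) / total);
        ranges.emplace_back(begin, end);
        begin = end;
    }
    return ranges;
}
```

With equal fractions, each GPU receives the same number of rows, which is why this mode suits homogeneous hardware.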

2. Layer-Split (--dit-layer-split with --dit-split)

Assigns entire transformer blocks to different GPUs using a multi-backend scheduler (ggml_backend_sched). Better suited to heterogeneous GPU setups where one GPU is slower.
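
As an illustration of the layer-split idea, the sketch below maps each transformer block to a device, giving every device one contiguous run of blocks. The function `assign_blocks` and the per-device `weights` are assumptions made for this example; the PR description does not say how blocks are actually apportioned.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not taken from this PR): return, for each transformer
// block index, the index of the device that should hold it. Device d gets a
// contiguous run of blocks sized in proportion to weights[d], so a slower GPU
// can be given a smaller weight and therefore fewer blocks.
std::vector<int> assign_blocks(int n_blocks, const std::vector<float>& weights) {
    assert(n_blocks >= 0 && !weights.empty());
    double total = 0.0;
    for (float w : weights) {
        assert(w >= 0.0f);
        total += w;
    }
    assert(total > 0.0);

    std::vector<int> device_of_block(static_cast<std::size_t>(n_blocks), 0);
    double cumulative = 0.0;
    int begin = 0;
    for (std::size_t d = 0; d < weights.size(); ++d) {
        cumulative += weights[d];
        // The last device takes whatever blocks remain after rounding.
        int end = (d + 1 == weights.size())
                      ? n_blocks
                      : static_cast<int>((n_blocks * cumulative) / total);
        for (int b = begin; b < end; ++b) {
            device_of_block[static_cast<std::size_t>(b)] = static_cast<int>(d);
        }
        begin = end;
    }
    return device_of_block;
}
```

Keeping each device's blocks contiguous means activations cross a device boundary only once per boundary between runs, rather than after every block.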

Dependencies

This PR relies on GGML's cuda_split_buffer_type, which is already compiled in. It also includes CUDA split buffer pool allocator improvements in the ggml submodule, which prevent memory fragmentation when multiple split buffers are loaded sequentially.

Results (4× GTX 1080 Ti 11GB, LTX distilled Q4_K_M + LoRA at 768p)

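| Frames | Without row-split | With row-split | Improvement                          |
|--------|-------------------|----------------|--------------------------------------|
| 161f   | 168s              | 150s           | 11% faster denoising                 |
| 193f   | OOM               | 250s           | New ceiling (was impossible)         |
| 249f   | OOM               | OOM            | Activation limit (hardware ceiling)  |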

Implementation

  • sd_ctx_params_t gets dit_split_devices and dit_layer_split fields
  • Row-split: ggml_backend_cuda_split_buffer_type per weight tensor
  • Layer-split: MultiBackendSpec + ggml_backend_sched routes transformer blocks to different GPUs
  • LTXAVRunner supports both modes via constructor and set_dit_split_buft()
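
For readers unfamiliar with the flag format, a value such as `cuda0,cuda1` has to be turned into a list of device names before it can populate dit_split_devices. The sketch below shows one way to do that; `parse_device_list` is a hypothetical helper, not the parser used in this PR.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical parser (not taken from this PR): split a comma-separated
// --dit-split value such as "cuda0,cuda1" into individual device names.
// Empty entries (e.g. from a trailing comma) are skipped.
std::vector<std::string> parse_device_list(const std::string& arg) {
    std::vector<std::string> devices;
    std::stringstream stream(arg);
    std::string name;
    while (std::getline(stream, name, ',')) {
        if (!name.empty()) {
            devices.push_back(name);
        }
    }
    return devices;
}
```

A caller would likely also want to reject unknown device names and require at least two devices before enabling either split mode.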

🤖 Generated with Claude Code

Daniel-Uzcategui and others added 2 commits June 11, 2026 21:57
Uses GGML's cuda_split_buffer_type (already compiled in) to distribute
DiT weights across multiple GPUs. Adds --dit-split CLI flag.

Results at 768p (LTX distilled Q4_K_M + LoRA):
- 161f: 168s -> 150s (11% faster denoising)
- 193f: was OOM, now 250s (new ceiling)
- 249f: OOM on activations (hardware limit)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ross GPUs

Adds two GPU distribution modes for DiT weights:

1. Row-split (--dit-split cudaX,cudaY): Splits each weight tensor's rows
   across GPUs using GGML's cuda_split_buffer_type. Balanced for homogeneous
   GPUs. Weight per-device proportional to tensor rows.

2. Layer-split (--dit-layer-split with --dit-split): Assigns entire
   transformer blocks to different GPUs using multi-backend sched.
   Better for heterogeneous GPU setups where one GPU is slower.

Results at 768p (LTX distilled Q4_K_M + LoRA, 1080 Ti):
- 161f: 168s -> 150s (11% faster denoising with row-split)
- 193f: was OOM, now 250s (new ceiling with row-split)
- 249f: OOM on activations (hardware limit)

Also includes CUDA split buffer pool allocator to prevent memory
fragmentation when loading multiple split buffers sequentially.

Co-Authored-By: Claude <noreply@anthropic.com>