feat: --dit-split for row-level and layer-split tensor parallelism across GPUs #1640
Open
Daniel-Uzcategui wants to merge 2 commits into
Uses GGML's cuda_split_buffer_type (already compiled in) to distribute DiT weights across multiple GPUs. Adds --dit-split CLI flag.

Results at 768p (LTX distilled Q4_K_M + LoRA):
- 161f: 168s -> 150s (11% faster denoising)
- 193f: was OOM, now 250s (new ceiling)
- 249f: OOM on activations (hardware limit)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ross GPUs

Adds two GPU distribution modes for DiT weights:

1. Row-split (--dit-split cudaX,cudaY): Splits each weight tensor's rows across GPUs using GGML's cuda_split_buffer_type. Balanced for homogeneous GPUs. Weight per-device proportional to tensor rows.
2. Layer-split (--dit-layer-split with --dit-split): Assigns entire transformer blocks to different GPUs using multi-backend sched. Better for heterogeneous GPU setups where one GPU is slower.

Results at 768p (LTX distilled Q4_K_M + LoRA, 1080 Ti):
- 161f: 168s -> 150s (11% faster denoising with row-split)
- 193f: was OOM, now 250s (new ceiling with row-split)
- 249f: OOM on activations (hardware limit)

Also includes CUDA split buffer pool allocator to prevent memory fragmentation when loading multiple split buffers sequentially.

Co-Authored-By: Claude <noreply@anthropic.com>
Feature: --dit-split (GPU Tensor Parallelism for DiT Weights)

Two complementary distribution modes for splitting DiT weights across multiple GPUs:

1. Row-Split (--dit-split cudaX,cudaY)

Splits each weight tensor's rows across GPUs using GGML's cuda_split_buffer_type. Balanced for homogeneous GPUs. Weight per-device proportional to tensor rows.

2. Layer-Split (--dit-layer-split with --dit-split)

Assigns entire transformer blocks to different GPUs using multi-backend sched. Better for heterogeneous GPU setups where one GPU is slower.
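To make the difference between the two modes concrete, here is a minimal, self-contained C++ sketch of the partitioning arithmetic only. It is not the code from this PR: the helper names split_rows and assign_blocks and the per-device weights vector are hypothetical, and the real split buffer logic in ggml may apply extra alignment or rounding that is omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Row-split idea: give each device one contiguous [begin, end) range of a
// weight tensor's rows, sized in proportion to that device's weight.
// Hypothetical helper for illustration; not a ggml or stable-diffusion.cpp API.
std::vector<std::pair<std::size_t, std::size_t>>
split_rows(std::size_t n_rows, const std::vector<double>& weights) {
    double total = 0.0;
    for (double w : weights) total += w;
    assert(!weights.empty() && total > 0.0);

    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    double acc = 0.0;
    std::size_t begin = 0;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        acc += weights[i];
        // The last device always ends at n_rows so that no row is dropped.
        std::size_t end = (i + 1 == weights.size())
            ? n_rows
            : static_cast<std::size_t>(n_rows * (acc / total));
        ranges.emplace_back(begin, end);
        begin = end;
    }
    return ranges;
}

// Layer-split idea: map every transformer block index to exactly one device,
// keeping consecutive blocks on the same device. Reuses the same proportional
// partitioning, applied to block indices instead of tensor rows.
std::vector<int> assign_blocks(std::size_t n_blocks,
                               const std::vector<double>& weights) {
    const auto ranges = split_rows(n_blocks, weights);
    std::vector<int> device_of_block(n_blocks, 0);
    for (std::size_t d = 0; d < ranges.size(); ++d) {
        for (std::size_t b = ranges[d].first; b < ranges[d].second; ++b) {
            device_of_block[b] = static_cast<int>(d);
        }
    }
    return device_of_block;
}
```

With equal weights, split_rows divides every tensor evenly, which matches the homogeneous case described for row-split. With unequal weights, assign_blocks gives a slower GPU fewer whole blocks, which is the kind of imbalance layer-split is meant to accommodate.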
Dependencies

This PR uses GGML's cuda_split_buffer_type (already compiled in). Also includes CUDA split buffer pool allocator improvements in the ggml submodule to prevent memory fragmentation when loading multiple split buffers sequentially.

Results (4× GTX 1080 Ti 11GB, LTX distilled Q4_K_M + LoRA at 768p)
Implementation

- sd_ctx_params_t gets dit_split_devices and dit_layer_split fields
- ggml_backend_cuda_split_buffer_type per weight tensor
- MultiBackendSpec + ggml_backend_sched routes transformer blocks to different GPUs
- LTXAVRunner supports both modes via constructor and set_dit_split_buft()

🤖 Generated with Claude Code