perf: clone extracted video frames to prevent DataLoader shared memory bloat #4747
Open
deep9539 wants to merge 1 commit into
When using multiple PyTorch DataLoader workers, tensors are passed to the main process via shared memory (/dev/shm). Extracting frames using `video.get_frames_at(_indices(n)).data` creates a view of the original decoded video tensor. As a result, the underlying storage of the *entire* video is placed into shared memory, rather than just the extracted frames. This rapidly exhausts /dev/shm for large videos, which strictly limits the number of data workers and batch size, ultimately causing GPU data starvation. Adding `.clone()` allocates a new, compact tensor containing only the necessary frames. While this introduces a minor CPU copy overhead, it drastically reduces the shared memory footprint, enabling higher concurrency and better GPU utilization.
Contributor
@deep9539 can you provide some validation of this in practice? What trade-off are we looking at here: runtime on Slurm, runtime off Slurm, memory, etc.?
Samoed
reviewed
May 27, 2026
  while n > 0:
      try:
-         frames: torch.Tensor = video.get_frames_at(_indices(n)).data
+         frames: torch.Tensor = video.get_frames_at(_indices(n)).data.clone()
Member
Can you add a short comment explaining why we clone here?
Contributor
@deep9539 just a friendly ping on this PR.
The Problem: When using PyTorch's DataLoader with multiple workers (num_workers > 0), tensors yielded by the dataset are passed to the main process via shared memory (/dev/shm).
Currently, extracting frames via video.get_frames_at(_indices(n)).data creates a memory view of the original decoded video, restricted to the indices given by _indices(n). PyTorch's shared memory serialization mechanism does not slice the underlying storage; it shares the entire underlying memory block. Because decoded videos can easily be hundreds of megabytes, passing views of them causes the entire video to be placed in /dev/shm.
This leads to a few critical issues: /dev/shm fills up quickly for large videos, which limits the number of DataLoader workers and the batch size, and can ultimately starve the GPU of data.
The Solution: This PR appends .clone() to the extracted frame tensor.
By cloning the tensor, PyTorch allocates a new, tightly packed memory block containing only the specific frames requested by the indices.
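The difference between a view and a clone can be illustrated with a small sketch. It assumes PyTorch 2.x (for `untyped_storage()`) and uses a zero-filled tensor with basic slicing as a stand-in for the decoded video; whether `get_frames_at(...).data` behaves the same way is the premise of this PR, not something the sketch verifies. The shapes and the helper name `storage_bytes` are illustrative.

```python
import torch

# Stand-in for a decoded video: 300 frames of 3x64x64 uint8 (illustrative sizes).
video = torch.zeros((300, 3, 64, 64), dtype=torch.uint8)

# Basic slicing returns a view that shares the full storage of `video`.
view = video[::30]        # 10 frames, still backed by all 300
compact = view.clone()    # new storage holding only those 10 frames


def storage_bytes(t: torch.Tensor) -> int:
    """Bytes of the storage backing `t`, not just its visible elements."""
    return t.untyped_storage().nbytes()


print("view backing storage: ", storage_bytes(view))
print("clone backing storage:", storage_bytes(compact))
```

The storage is what gets shared between processes, so the first number is the one that matters for /dev/shm usage when the tensor leaves a worker.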
Impact / Trade-offs: While .clone() adds a small CPU overhead for the memory copy, the trade-off is favorable, particularly when max_frames is much smaller than total_frames.
Drastic Memory Reduction: We only store the actual batch data in shared memory instead of the entire source videos.
Higher Throughput: This memory optimization allows us to safely increase num_workers and batch_size, eliminating the CPU data-loading bottleneck and keeping the GPU fully fed.
Testing:
Possible further optimization: We can consider doing this only when max_frames << total_frames for the video or something similar.
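One way to sketch such a conditional clone is to compare the bytes the tensor actually uses against the size of its backing storage, rather than comparing max_frames to total_frames directly. The helper name `compact_if_sparse` and the 0.5 threshold are hypothetical and not part of this PR; the sketch assumes PyTorch 2.x.

```python
import torch


def compact_if_sparse(frames: torch.Tensor, max_ratio: float = 0.5) -> torch.Tensor:
    """Clone `frames` only if it uses less than `max_ratio` of its backing storage.

    `max_ratio` is a hypothetical tuning knob, not a value taken from the PR.
    """
    used = frames.numel() * frames.element_size()
    backing = frames.untyped_storage().nbytes()
    if backing > 0 and used / backing < max_ratio:
        return frames.clone()  # compact copy, drops the reference to the big storage
    return frames              # already (mostly) compact, skip the copy


video = torch.zeros((100, 4))            # float32, 1600 bytes of storage
few = compact_if_sparse(video[:5])       # uses 5% of the storage, gets cloned
most = compact_if_sparse(video[:90])     # uses 90% of the storage, returned as-is
```

Checking storage utilization keeps the helper independent of how the frames were selected, at the cost of one extra size query per call.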
Verification was done in a Docker environment alongside a few other unrelated changes:
/dev/shm memory. Actual verification pending.
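Until real measurements are available, the copy cost and the storage saving can be estimated in isolation with a sketch like the following. It uses a synthetic zero-filled tensor and strided slicing in place of a decoded video, so it says nothing about decode time, DataLoader behavior, or Slurm; the function name `measure_clone` and all sizes are hypothetical. It assumes PyTorch 2.x.

```python
import time

import torch


def measure_clone(total_frames, kept_frames, frame_shape=(3, 224, 224), repeats=5):
    """Time `.clone()` on a strided view and report backing-storage sizes."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    video = torch.zeros((total_frames, *frame_shape), dtype=torch.uint8)
    step = max(total_frames // kept_frames, 1)
    view = video[::step][:kept_frames]  # view sharing the full video storage

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        compact = view.clone()
        timings.append(time.perf_counter() - start)

    return {
        "view_storage_bytes": view.untyped_storage().nbytes(),
        "clone_storage_bytes": compact.untyped_storage().nbytes(),
        "median_clone_seconds": sorted(timings)[len(timings) // 2],
    }


stats = measure_clone(total_frames=64, kept_frames=8, frame_shape=(3, 32, 32))
print(stats)
```

Running this with frame counts and resolutions that match the actual dataset would give a first-order answer to the trade-off question raised in review, to be confirmed with end-to-end runs.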