perf: clone extracted video frames to prevent DataLoader shared memory bloat #4747

Open
deep9539 wants to merge 1 commit into embeddings-benchmark:main from deep9539:fix_shared_memory
@deep9539 deep9539 commented May 27, 2026

Contributor

The Problem: When using PyTorch's DataLoader with multiple workers (num_workers > 0), tensors yielded by the dataset are passed to the main process via shared memory (/dev/shm).

Currently, frames are extracted with video.get_frames_at(_indices(n)).data, which returns a view into the original decoded video restricted to the indices given by _indices(n). When PyTorch moves a tensor into shared memory, it does not slice the underlying storage; it copies the entire storage that backs the tensor. Because decoded videos can easily be hundreds of megabytes, passing views of them causes the entire video to be duplicated in /dev/shm.
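The difference between a view and a clone can be illustrated on a small stand-in tensor. This is a minimal sketch, assuming PyTorch 2.x (for `untyped_storage()`); it uses basic slicing to obtain a view and does not call the actual decoder, so whether `get_frames_at(...).data` behaves the same way is taken from the description above. The helper name `storage_nbytes` and the tensor sizes are illustrative only.

```python
import torch


def storage_nbytes(t: torch.Tensor) -> int:
    """Bytes held by the storage backing `t`, not just the elements `t` exposes."""
    return t.untyped_storage().nbytes()


# Stand-in for a decoded video: 64 frames of 3x32x32 uint8 (illustrative sizes).
video = torch.zeros((64, 3, 32, 32), dtype=torch.uint8)

# Basic slicing returns a view: 4 frames that still reference the 64-frame storage.
view = video[::16]

# clone() allocates fresh storage sized for the 4 selected frames only.
compact = view.clone()

print(storage_nbytes(view))     # 196608 (64 * 3 * 32 * 32)
print(storage_nbytes(compact))  # 12288  (4 * 3 * 32 * 32)
```

Both tensors hold the same 4 frames; only the size of the storage that would have to travel with them differs.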

This leads to a few critical issues:

  • Shared Memory Exhaustion: /dev/shm fills up almost immediately, often leading to RuntimeError: shared memory is full.
  • Bottlenecked Concurrency: To avoid crashes, we are forced to artificially lower batch_size and num_workers.
  • GPU Underutilization: Because data loading is bottlenecked by the restricted worker count, the GPU is starved for data and severely underutilized.

The Solution: This PR appends .clone() to the extracted frame tensor.

By cloning the tensor, PyTorch allocates a new, tightly packed memory block containing only the specific frames requested by the indices.

Impact / Trade-offs: While .clone() adds a small CPU overhead for the copy, the trade-off is favorable, particularly when max_frames is much smaller than total_frames.

  • Drastic Memory Reduction: We only store the actual batch data in shared memory instead of the entire source videos.
  • Higher Throughput: This memory optimization allows us to safely increase num_workers and batch_size, removing the CPU data-loading bottleneck and keeping the GPU fed.
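The CPU cost of the copy can be timed in isolation. A minimal sketch, assuming PyTorch is installed; the helper `time_clone` and the frame sizes are hypothetical, and the result says nothing about DataLoader throughput or /dev/shm usage, which need an end-to-end run.

```python
import time

import torch


def time_clone(view: torch.Tensor, repeats: int = 20) -> float:
    """Average seconds per view.clone() over `repeats` runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        view.clone()
    return (time.perf_counter() - start) / repeats


# Hypothetical shapes: keep 8 of 256 frames at 3x224x224, uint8.
video = torch.zeros((256, 3, 224, 224), dtype=torch.uint8)
view = video[::32]

seconds = time_clone(view)
copied_mib = view.numel() * view.element_size() / 2**20
print(f"clone copies {copied_mib:.2f} MiB in {seconds * 1e3:.3f} ms on average")
```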
Testing:

Possible further optimization: We can consider doing this only when max_frames << total_frames for the video or something similar.
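One way such a conditional clone could look is sketched below. It is not part of this PR: the helper name `compact_if_sparse` and the `min_ratio` threshold are hypothetical, and it assumes PyTorch 2.x for `untyped_storage()`. Instead of comparing max_frames with total_frames directly, it compares the bytes the tensor exposes with the bytes its storage holds.

```python
import torch


def compact_if_sparse(frames: torch.Tensor, min_ratio: float = 2.0) -> torch.Tensor:
    """Clone `frames` only when its backing storage is at least `min_ratio`
    times larger than the data it exposes; otherwise return it unchanged.

    `min_ratio` is an arbitrary placeholder, not a tuned value.
    """
    used = frames.numel() * frames.element_size()
    backing = frames.untyped_storage().nbytes()
    if used > 0 and backing >= min_ratio * used:
        return frames.clone()
    return frames


# Illustrative stand-in for a decoded video: 64 frames of 3x8x8 uint8.
video = torch.zeros((64, 3, 8, 8), dtype=torch.uint8)

sparse = compact_if_sparse(video[::16])  # view over 4 of 64 frames: cloned
dense = compact_if_sparse(video)         # already compact: returned as-is
```

This avoids the copy when the tensor already owns a storage of roughly its own size, at the cost of one extra size check per call.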

Verification was done in a Docker environment, with a few other unrelated changes applied:

  • Verified that RuntimeError regarding shared memory is no longer thrown at higher worker counts with restricted /dev/shm memory.
  • Monitored /dev/shm usage during training to confirm the footprint remains stable.
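The /dev/shm monitoring mentioned above could be scripted roughly as follows. This is a sketch using only the standard library; the helper name `shm_usage_mib` is hypothetical, and the path defaults to /dev/shm, which exists on typical Linux systems but not everywhere.

```python
import os
import shutil


def shm_usage_mib(path: str = "/dev/shm") -> float:
    """Return how many MiB are currently used on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / (1024 * 1024)


# Only query /dev/shm where it exists (for example, not on macOS or Windows).
if os.path.isdir("/dev/shm"):
    print(f"/dev/shm used: {shm_usage_mib():.1f} MiB")
```

Calling the helper before the run and periodically while batches are being consumed gives a rough picture of whether the footprint stays stable.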

Actual verification pending

When using multiple PyTorch DataLoader workers, tensors are passed to the main process via shared memory (/dev/shm). Extracting frames using `video.get_frames_at(_indices(n)).data` creates a view of the original decoded video tensor. As a result, the underlying storage of the *entire* video is placed into shared memory, rather than just the extracted frames.

This rapidly exhausts /dev/shm for large videos, which strictly limits the number of data workers and batch size, ultimately causing GPU data starvation. Adding `.clone()` allocates a new, compact tensor containing only the necessary frames. While this introduces a minor CPU copy overhead, it drastically reduces the shared memory footprint, enabling higher concurrency and better GPU utilization.

KennethEnevoldsen commented May 27, 2026

Contributor

@deep9539 can you provide some sort of validation on this in practice - what is the tradeoff that we are looking at here?

For example: runtime on Slurm, runtime off Slurm, memory usage, etc.

@Samoed Samoed marked this pull request as ready for review May 27, 2026 10:56
  while n > 0:
      try:
-         frames: torch.Tensor = video.get_frames_at(_indices(n)).data
+         frames: torch.Tensor = video.get_frames_at(_indices(n)).data.clone()

@Samoed Samoed May 27, 2026

Member

Can you add a short comment explaining why we clone here?

@KennethEnevoldsen

Contributor

@deep9539 just a friendly ping on this PR

3 participants