perf: clone extracted video frames to prevent DataLoader shared memory bloat by deep9539 · Pull Request #4747 · embeddings-benchmark/mteb

deep9539 · 2026-05-27T07:47:12Z

The Problem: When using PyTorch's DataLoader with multiple workers (num_workers > 0), tensors yielded by the dataset are passed to the main process via shared memory (/dev/shm).

Currently, extracting frames via video.get_frames_at(_indices(n)).data returns a tensor that is a view into the storage of the original decoded video, selecting only the frames given by _indices(n). PyTorch's shared memory serialization does not slice the underlying storage; it moves the entire backing memory block into shared memory. Because decoded videos can easily be hundreds of megabytes, passing views of them causes the entire video to be placed in /dev/shm.

This leads to a few critical issues:

Shared Memory Exhaustion: /dev/shm fills up almost immediately, often leading to RuntimeError: shared memory is full.
Bottlenecked Concurrency: To avoid crashes, we are forced to artificially lower batch_size and num_workers.
GPU Underutilization: Because data loading is bottlenecked by the restricted worker count, the GPU is starved for data and severely underutilized.

The Solution: This PR appends .clone() to the extracted frame tensor.

By cloning the tensor, PyTorch allocates a new, tightly packed memory block containing only the specific frames requested by the indices.
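To illustrate the view-versus-clone difference described above, here is a minimal sketch using plain PyTorch. The `video` tensor and its shape are hypothetical stand-ins for a decoded video, and strided slicing is used only as a simple way to obtain a view; it is not the actual decoder call from this PR.

```python
import torch

# Hypothetical stand-in for a decoded video: 300 frames of 3x64x64 uint8.
video = torch.zeros((300, 3, 64, 64), dtype=torch.uint8)

# Strided slicing returns a view: 10 frames that still point at the full buffer.
view = video[::30]

# clone() allocates fresh storage that holds only the selected frames.
copy = view.clone()

frame_bytes = 3 * 64 * 64
view_storage_bytes = view.untyped_storage().nbytes()
copy_storage_bytes = copy.untyped_storage().nbytes()

print("frames in view:", view.shape[0])
print("storage behind view (bytes):", view_storage_bytes)
print("storage behind clone (bytes):", copy_storage_bytes)
print("payload actually needed (bytes):", view.shape[0] * frame_bytes)
```

The storage size behind the view is what matters for worker-to-main-process transfer, since the whole storage is what gets shared, not just the elements the view can reach.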

Impact / Trade-offs: While .clone() adds a small CPU overhead for the copy, the trade-off is favorable, particularly when max_frames is much smaller than total_frames.
Drastic Memory Reduction: We only store the actual batch data in shared memory instead of the entire source videos.
Higher Throughput: This memory optimization allows us to safely increase num_workers and batch_size, eliminating the CPU data-loading bottleneck and keeping the GPU fully fed.
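As a rough back-of-the-envelope estimate of the memory reduction, the sketch below compares the per-sample shared memory footprint with and without the clone. All numbers (clip length, resolution, frame count) are hypothetical, and the model assumes that without the clone the full decoded video sits behind the view, as described above.

```python
def frame_bytes(height: int, width: int, channels: int = 3, bytes_per_value: int = 1) -> int:
    """Size of one decoded frame, assuming uint8 values by default."""
    return height * width * channels * bytes_per_value


def shm_bytes_per_sample(total_frames: int, max_frames: int,
                         height: int, width: int, cloned: bool) -> int:
    """Estimated bytes placed in shared memory for one sample."""
    per_frame = frame_bytes(height, width)
    # With a clone only the sampled frames are shipped; without it,
    # the whole decoded video is assumed to travel with the view.
    shipped = min(max_frames, total_frames) if cloned else total_frames
    return shipped * per_frame


# Hypothetical case: 900 decoded frames at 224x224 RGB, 8 frames sampled.
without_clone = shm_bytes_per_sample(900, 8, 224, 224, cloned=False)
with_clone = shm_bytes_per_sample(900, 8, 224, 224, cloned=True)

mib = 1024 * 1024
print(f"without clone: {without_clone / mib:.1f} MiB per sample")
print(f"with clone:    {with_clone / mib:.2f} MiB per sample")
print(f"reduction:     {without_clone / with_clone:.1f}x")
```

In this made-up case the reduction equals total_frames / max_frames, which is why the benefit grows as fewer frames are sampled from longer videos.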
Testing:

Possible further optimization: We can consider doing this only when max_frames << total_frames for the video or something similar.
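One way the conditional variant could look is sketched below. The helper name `should_clone` and the `min_waste_ratio` threshold are hypothetical and not part of this PR; the function only encodes the decision, working on byte counts so it stays independent of the decoder.

```python
def should_clone(view_bytes: int, storage_bytes: int, min_waste_ratio: float = 2.0) -> bool:
    """Decide whether copying the selected frames is worth it.

    view_bytes:      bytes actually needed by the extracted frames
    storage_bytes:   bytes of the storage the frames currently live in
    min_waste_ratio: hypothetical threshold; clone only if the storage is
                     at least this many times larger than the payload
    """
    if view_bytes <= 0:
        return False
    return storage_bytes >= min_waste_ratio * view_bytes


# Few frames sampled from a long video: the copy pays off.
print(should_clone(view_bytes=1_204_224, storage_bytes=135_475_200))

# Frames already fill their storage: skip the extra copy.
print(should_clone(view_bytes=1_204_224, storage_bytes=1_204_224))
```

For a tensor `frames`, the two inputs could be obtained as `frames.numel() * frames.element_size()` and `frames.untyped_storage().nbytes()`. The default threshold here is arbitrary and would need to be tuned against real measurements.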

Verification done in a Docker environment, together with a few other unrelated changes:

Verified that RuntimeError regarding shared memory is no longer thrown at higher worker counts with restricted /dev/shm memory.
Monitored /dev/shm usage during training to confirm the footprint remains stable.

Actual verification pending

When using multiple PyTorch DataLoader workers, tensors are passed to the main process via shared memory (/dev/shm). Extracting frames using `video.get_frames_at(_indices(n)).data` creates a view of the original decoded video tensor. As a result, the underlying storage of the *entire* video is placed into shared memory, rather than just the extracted frames. This rapidly exhausts /dev/shm for large videos, which strictly limits the number of data workers and batch size, ultimately causing GPU data starvation. Adding `.clone()` allocates a new, compact tensor containing only the necessary frames. While this introduces a minor CPU copy overhead, it drastically reduces the shared memory footprint, enabling higher concurrency and better GPU utilization.

KennethEnevoldsen · 2026-05-27T08:02:37Z

@deep9539 can you provide some sort of validation on this in practice - what is the tradeoff that we are looking at here?

runtime on slurm, runtime not on slurm, memory etc.

Samoed · 2026-05-27T11:12:25Z

        while n > 0:
            try:
-                frames: torch.Tensor = video.get_frames_at(_indices(n)).data
+                frames: torch.Tensor = video.get_frames_at(_indices(n)).data.clone()


Can you add a small comment why we do clone?

KennethEnevoldsen · 2026-06-08T12:51:38Z

@deep9539 just a friendly ping on this PR

Samoed marked this pull request as ready for review May 27, 2026 10:56

Samoed reviewed May 27, 2026

deep9539 wants to merge 1 commit into embeddings-benchmark:main from deep9539:fix_shared_memory


3 participants
