[Performance] Fused RoPE THD kernel becomes dominant bottleneck in long-context training with many packed sequences #2866

@EEEEEKKO

Description

Description

Problem

In long-context training (e.g., seq_length=65536) with variable-length sequence packing (THD format), the fused RoPE kernel becomes the dominant bottleneck when a micro-batch contains many packed sequences.

Most training samples are shorter than max_seq_length, so a single micro-batch routinely packs hundreds to thousands of spans. In this regime, RoPE time grows linearly with n_seqs and can exceed the combined cost of attention + MLP.

Cause

The kernel grid is dim3(max_seqlen, n_seqs), launching max_seqlen × n_seqs CUDA blocks. Only total_tokens blocks do useful work; the rest read cu_seqlens and early-exit. When n_seqs is large, the vast majority of blocks are wasted.

```cpp
// fused_rope.cu — forward & backward launchers
dim3 blocks(s, b);   // s = max_seqlen, b = n_seqs

// Inside the kernel — THD path
int t_id = s_id + start;
if (t_id >= end) return;   // most blocks exit here
```
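For contrast, a token-parallel launch (one block per token, grid size equal to `total_tokens`) would need to map each flat token index back to its sequence and offset via `cu_seqlens`. The sketch below illustrates that lookup in Python; it is an assumption about how such a kernel could index, not Transformer Engine's actual code.

```python
import bisect

def token_to_seq_pos(cu_seqlens, t_id):
    """Map a flat token index to (seq_id, position-within-sequence).

    This is the lookup a hypothetical token-parallel RoPE launch would
    perform instead of launching max_seqlen * n_seqs blocks and
    early-exiting; sketch only, not the library's implementation.
    """
    # cu_seqlens is the usual exclusive prefix sum: [0, len0, len0+len1, ...]
    seq_id = bisect.bisect_right(cu_seqlens, t_id) - 1
    return seq_id, t_id - cu_seqlens[seq_id]

# Three packed spans of lengths 3, 5, 2:
cu = [0, 3, 8, 10]
print(token_to_seq_pos(cu, 4))  # token 4 is position 1 of sequence 1
```

On the GPU the same mapping is a short binary search per block over `cu_seqlens`, so the grid shrinks from `max_seqlen * n_seqs` to `total_tokens` blocks.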

Example: total_tokens=65536, n_seqs=2400 → 157M blocks launched, only 65K useful (99.96% wasted).
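The arithmetic behind that example, using the numbers reported above:

```python
# Launch-grid waste for the dim3(max_seqlen, n_seqs) grid described above.
max_seqlen = 65536     # s dimension of the grid
n_seqs = 2400          # b dimension of the grid (packed spans)
total_tokens = 65536   # tokens that actually need RoPE applied

launched = max_seqlen * n_seqs   # blocks launched
useful = total_tokens            # blocks that survive the early-exit check
wasted = 1 - useful / launched

print(launched)           # 157,286,400 ≈ 157M
print(f"{wasted:.2%}")    # 99.96%
```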

Profiling Data

0.9B model, 40 layers, H100, TP=2, seq_length=65536:

| n_seqs | RoPE / layer (×24) | % of layer time |
| --- | --- | --- |
| <50 | 22 ms | ~10% |
| 50–200 | 201 ms | ~59% |
| 200–500 | 488 ms | ~79% |
| ~2400 | 4,620 ms | ~97% |

Environment

  • Transformer Engine 2.6.0.post1
  • H100 80GB, CUDA 12.8
  • Megatron-LM, THD format, span-based attention
