Skip to content

queue tuning#326

Merged
Smattr merged 3 commits into
mainfrom
smattr/a1c1734b-2ac0-4b47-bedc-4502c5b03222
May 16, 2026
Merged

queue tuning#326
Smattr merged 3 commits into
mainfrom
smattr/a1c1734b-2ac0-4b47-bedc-4502c5b03222

Conversation

@Smattr
Copy link
Copy Markdown
Owner

@Smattr Smattr commented May 16, 2026

No description provided.

Smattr added 3 commits May 16, 2026 14:40
The queue of pending states is stored in per-thread linked-lists of
“queue nodes” that are each a 4K page worth of state pointers:

     ┌A─────────┬──────┐  ┌B─────────┬──────┐
  0: │ s[…]     │ next ├─►│ s[…]     │ next │
     └──────────┴──────┘  └──────────┴──────┘
     ┌C─────────┬──────┐
  1: │ s[…]     │ next │
     └──────────┴──────┘
     ┌D─────────┬──────┐  ┌E─────────┬──────┐  ┌F─────────┬──────┐
  2: │ s[…]     │ next ├─►│ s[…]     │ next ├─►│ s[…]     │ next │
     └──────────┴──────┘  └──────────┴──────┘  └──────────┴──────┘
     …

When the head of a thread’s queue reaches the `next` pointer in the head
queue node, the next dequeue (1) pops the entire queue node and then (2)
retries dequeue. E.g. pop of thread 2’s head queue node:

     ┌A─────────┬──────┐  ┌B─────────┬──────┐
  0: │ s[…]     │ next ├─►│ s[…]     │ next │
     └──────────┴──────┘  └──────────┴──────┘
     ┌C─────────┬──────┐
  1: │ s[…]     │ next │
     └──────────┴──────┘
     ┌E─────────┬──────┐  ┌F─────────┬──────┐
  2: │ s[…]     │ next ├─►│ s[…]     │ next │
     └──────────┴──────┘  └──────────┴──────┘
     …

This pop is done through a CAS into the queue end pointers. Following
this, `ends` was unconditionally set to the old value. However in the
success case the most up to date value we know of is `new`.

The effect of this was that successfully popping a queue node always did
an unnecessary extra loop:
  1. CAS into the queue end pointers
  2. `ends = old`
  3. retry
  4. find `ends != ends_check`   ┐
  5. re-read queue end pointers  ├─ unnecessary extra work
  6. retry                       ┘

Surprisingly this suboptimal logic seems to have been present unnoticed
since the first implementation of this feature in
2153f1f. This was only discovered
during debugging an orthogonal optimisation to queue management in
development.

This change appears to have no significant effect on some representative
models. So apparently this extra work was not a significant overhead.
In some speculative work, I had an initial misunderstanding that the
pointer stored in the queue tail was the next slot to push to. In
reality it is the slot _before_ the next slot to push to. For this
reason, a pointer to `queue_node.next` can never be stored to the tail.
Some debugging assertions around this would have saved me a lot of time,
so lets add them.
In some speculative work, I had a misunderstanding that led to
constructing a malformed queue. The queue involved a single queue node
but with the head in advance of the tail:

  ┌──────┬──────┐
  │ head │ tail │
  └───┬──┴───┬──┘
      └──────┼───┐
             └─┐ │
               ▼ ▼
     ┌──────────┬──────┐
     │ s[…]     │ next ├─►null
     └──────────┴──────┘

This situation cannot occur in the code as it is at the present point.
And the problem with this situation is that the logic for detecting when
the queue is exhausted no longer works. `head` only ever advances and
will never become equal to `tail`.

This commit introduces some stronger assertions that hopefully would
catch mistakes like this.
@Smattr Smattr merged commit 1253711 into main May 16, 2026
22 checks passed
@Smattr Smattr deleted the smattr/a1c1734b-2ac0-4b47-bedc-4502c5b03222 branch May 16, 2026 23:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant