Feature or enhancement
Proposal:
Problem: we are running low on opcodes for specialization forms. On top of that, we have hit the size limit of the computed-goto/switch-case interpreter: we are seeing compiler bugs in all three of MSVC, Clang, and GCC. (I just fixed another compiler bug in Clang 22 caused by interpreter size last week.) The tail-calling interpreter solves these problems, but the JIT cannot rely on it exclusively for at least another 5 years, and we need more specialization forms now.
#143732 solves this partially, but not completely. To illustrate, consider Python attribute lookup. From chats with @kumaraditya303, it seems complex libraries like Pydantic and ORMs use advanced features, so we face the cross product of __getattr__, __getattribute__, __get__, and __set__. Simple math (2^4) tells us handling this completely requires at least 16 specialization forms, which is unfeasible. Furthermore, the table-driven approach is not feasible for LOAD_ATTR either: the JIT needs to see _PUSH_FRAME to inline __get__ and __set__ written in Python, so these operations must be expressed in micro-ops.
I have an idea: recording-only instructions via triple dispatch. The key insight is that at trace-recording time we have full control over the next opcode, which effectively gives us infinite opcodes. A recording-only instruction behaves like a normal instruction, except that a specialization step runs before every instruction. It will look like this (pseudocode):
// TRACE_RECORD:
// Next opcode is unspecialized. Try to apply a recording-only instruction.
if (next_opcode == _PyOpcode_Deopt[next_opcode]) {
    retcode, next_opcode, frame, stack_pointer, tstate,
        next_instr, instruction_funcptr_table, oparg =
            specialize(frame, stack_pointer, tstate,
                       next_instr, instruction_funcptr_table, oparg);
    if (retcode < 0) {
        JUMP_TO_LABEL(error);
    }
}
goto dispatch_table_var[next_opcode];
Here specialize behaves similarly to Python/specialize.c: it decides which trace-recording instruction to dispatch to. It will be outlined, call the correct instruction to handle the operation, and then return all interpreter state after execution. Effectively, this changes our interpreter from direct-threaded to call-threaded.
Recording-only instructions will use opcodes >= 256 so they do not conflict with existing bytecode operations. This allows the tracing frontend to remain completely unchanged; we only need to change the TRACE_RECORD pseudo-instruction. It will slightly slow down trace recording, but nothing we won't recover.
Most importantly: because recording-only instructions are JIT-only and live in separate outlined functions, we will not slow down the base computed-goto interpreter, and non-JIT builds are left completely unaffected.
Rejected ideas
EXTENDED_OPCODE. This doesn't solve the main interpreter-size issue, which means more compiler bugs.
Has this already been discussed elsewhere?
No response
Links to previous discussion of this feature:
No response