Feature or enhancement
Proposal:
Problem: we are running low on opcodes for specialization forms. On top of that, we have hit the size limit of the computed-goto/switch-case interpreter: we are seeing compiler bugs in all three of MSVC, Clang, and GCC. (I just fixed another compiler bug in Clang 22 caused by interpreter size last week.) The tail-calling interpreter solves these problems, but the JIT cannot rely on it exclusively for at least another 5 years, and we need more specialization forms now.
#143732 solves this partially, but not completely. To illustrate, consider Python attribute lookup. From chats with @kumaraditya303, it seems complex libraries like Pydantic and ORMs use advanced features, so we face the cross product of __getattr__, __getattribute__, __get__, and __set__. Simple math (2^4) tells us handling this completely requires at least 16 specialization forms, which is unfeasible. Furthermore, the table-driven approach is not feasible for LOAD_ATTR either: the JIT needs to see _PUSH_FRAME to inline __get__ and __set__ written in Python, so these operations must be expressed in micro-ops.
I have an idea: recording-only instructions via triple dispatch. The key insight is that at trace-recording time we have full control over the next opcode, which effectively gives us infinite opcodes. A recording-only instruction behaves like a normal instruction, except that a specialization step runs before every instruction. It will look like this (pseudocode):
// TRACE_RECORD:
// Next opcode is unspecialized. Try to apply a recording-only instruction.
if (next_opcode == _PyOpcode_Deopt[next_opcode]) {
    retcode, next_opcode, frame, stack_pointer, tstate,
        next_instr, instruction_funcptr_table, oparg =
            specialize(frame, stack_pointer, tstate,
                       next_instr, instruction_funcptr_table, oparg);
    if (retcode < 0) {
        JUMP_TO_LABEL(error);
    }
}
goto dispatch_table_var[next_opcode];
Here specialize behaves similarly to Python/specialize.c: it decides which trace-recording instruction to dispatch to. It will be outlined, call the correct instruction to handle the operation, and then return all interpreter state after execution. Effectively, this changes our interpreter from direct-threaded to call-threaded.
Recording-only instructions will use opcodes >= 256 so they do not conflict with existing bytecode operations. This allows the tracing frontend to remain completely unchanged; we only need to change the TRACE_RECORD pseudo-instruction. It will slightly slow down trace recording, but nothing we won't recover.
Most importantly: because recording-only instructions are JIT-only and live in separate outlined functions, we will not slow down the base computed-goto interpreter, and non-JIT builds are left completely unaffected.
Rejected ideas
EXTENDED_OPCODE. This doesn't solve the main interpreter-size issue, which means more compiler bugs.
Has this already been discussed elsewhere?
No response
Links to previous discussion of this feature:
No response