Skip to content

Eliminate redundant refcounting in the JIT #134584

Open
@Fidget-Spinner

Description

@Fidget-Spinner

Feature or enhancement

Proposal:

Thanks to Matt's work on borrowed LOAD_FAST, we can now eliminate reference counting trivially in the JIT.

Reference counting is expensive, Matt found that eliminating 90% of refcounts in LOAD_FAST meant a 2-3% speedup in the interpreter. So the speedup for the JIT should be quite a bit too.

The other problem is that reference counts block register allocation/TOS caching. As they force spills to the stack more often.

This issue has the potential to speedup up JIT benchmarks by several percent.

How to contribute.

Now that reference tracking in the JIT optimizer is in place, the first thing we need to do is to convert ops to make the decref an explicit uop.

For escaping decref ops (ie, ops that decref could run the GC), we would need to refactor them so their decrefs are eliminated via specialization of pops. For example: the following op (which is not an escaping op, but just purely for demonstration!):

macro(BINARY_OP_ADD_INT) =
_GUARD_TOS_INT + _GUARD_NOS_INT + unused/5 + _BINARY_OP_ADD_INT;

becomes

macro(BINARY_OP_ADD_INT) =
_GUARD_TOS_INT + _GUARD_NOS_INT + unused/5 + _BINARY_OP_ADD_INT + _POP_TOP_INT + _POP_TOP_INT;

Previously _BINARY_OP_ADD_INT's stack effect looked like this: (left, right -- res). The new version should look like (left, right -- res, left, right).
So for all the decref specializations, we would just need a _POP_X of their variants! This means no explosion of uop and their decref variants. We just specialize _POP_X to _POP_X_NO_DECREF in the JIT. Keeping things clean

Please hold off on working on anything that might cause a stack overflow for now, we're figuring out how to deal with that. This means all BINARY_OP and such.

These are open for contributors to take:

  • _BINARY_OP_X_UNICODE @corona10
    _BINARY_OP_SUBSCR_X (except for BINARY_OP_SUBSCR_GETITEM)
    _COMPARE_OP_X @corona10
    _CALL_TYPE_1 @tomasr8
    _CALL_STR_1 @Zheaoli
    _CALL_TUPLE_1 @noamcohen97
    _CALL_BUILTIN_O
    _CALL_LEN @corona10
    _CALL_ISINSTANCE @noamcohen97
    _CALL_LIST_APPEND
    _CALL_METHOD_DESCRIPTOR_O
    _STORE_SUBSCR_DICT
    _STORE_SUBSCR_LIST_INT @corona10
    _LOAD_ATTR
    _LOAD_ATTR_INSTANCE_VALUE
    _LOAD_ATTR_MODULE
    _LOAD_ATTR_WITH_HINT
    _LOAD_ATTR_SLOT
    _LOAD_ATTR_CLASS
    _LOAD_ATTR_METHOD_WITH_VALUES
    _LOAD_ATTR_METHOD_NO_DICT
    _LOAD_ATTR_NONDESCRIPTOR_WITH_VALUES
    _LOAD_ATTR_NONDESCRIPTOR_NO_DICT
    _LOAD_ATTR_METHOD_LAZY_DICT
    _STORE_ATTR_INSTANCE_VALUE
    _STORE_ATTR_SLOT
    _STORE_ATTR_WITH_HINT
    _STORE_ATTR

Once that's done, we can think about further ops.

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

No response

Linked PRs

Activity

Fidget-Spinner

Fidget-Spinner commented on May 23, 2025

@Fidget-Spinner
MemberAuthor

@tomasr8 @brandtbucher this is gonna be a pretty sweet optimization :)
Before:

    / 
    / _BINARY_OP_ADD_INT.o:  file format elf64-x86-64
    / 
    / Disassembly of section .text:
    / 
    / 0000000000000000 <_JIT_ENTRY>:
    / 0: 55                            pushq   %rbp
    / 1: 4d 8b 7d f0                   movq    -0x10(%r13), %r15
    / 5: 49 8b 6d f8                   movq    -0x8(%r13), %rbp
    / 9: 4c 89 ff                      movq    %r15, %rdi
    / c: 48 83 e7 fe                   andq    $-0x2, %rdi
    / 10: 48 89 ee                      movq    %rbp, %rsi
    / 13: 48 83 e6 fe                   andq    $-0x2, %rsi
    / 17: 4d 89 6c 24 40                movq    %r13, 0x40(%r12)
    / 1c: ff 15 00 00 00 00             callq   *(%rip)                 # 0x22 <_JIT_ENTRY+0x22>
    / 000000000000001e:  R_X86_64_GOTPCRELX   _PyLong_Add-0x4
    / 22: 48 89 c3                      movq    %rax, %rbx
    / 25: 4d 8b 6c 24 40                movq    0x40(%r12), %r13
    / 2a: 40 f6 c5 01                   testb   $0x1, %bpl
    / 2e: 75 05                         jne     0x35 <_JIT_ENTRY+0x35>
    / 30: ff 4d 00                      decl    (%rbp)
    / 33: 74 3f                         je      0x74 <_JIT_ENTRY+0x74>
    / 35: 41 f6 c7 01                   testb   $0x1, %r15b
    / 39: 75 71                         jne     0xac <_JIT_ENTRY+0xac>
    / 3b: 41 ff 0f                      decl    (%r15)
    / 3e: 75 6c                         jne     0xac <_JIT_ENTRY+0xac>
    / 40: 48 b8 00 00 00 00 00 00 00 00 movabsq $0x0, %rax
    / 0000000000000042:  R_X86_64_64  _PyRuntime+0x28e0
    / 4a: 48 8b 00                      movq    (%rax), %rax
    / 4d: 48 85 c0                      testq   %rax, %rax
    / 50: 74 17                         je      0x69 <_JIT_ENTRY+0x69>
    / 52: 48 b9 00 00 00 00 00 00 00 00 movabsq $0x0, %rcx
    / 0000000000000054:  R_X86_64_64  _PyRuntime+0x28e8
    / 5c: 48 8b 11                      movq    (%rcx), %rdx
    / 5f: 4c 89 ff                      movq    %r15, %rdi
    / 62: be 01 00 00 00                movl    $0x1, %esi
    / 67: ff d0                         callq   *%rax
    / 69: 4c 89 ff                      movq    %r15, %rdi
    / 6c: ff 15 00 00 00 00             callq   *(%rip)                 # 0x72 <_JIT_ENTRY+0x72>
    / 000000000000006e:  R_X86_64_GOTPCRELX   _PyLong_ExactDealloc-0x4
    / 72: eb 38                         jmp     0xac <_JIT_ENTRY+0xac>
    / 74: 48 b8 00 00 00 00 00 00 00 00 movabsq $0x0, %rax
    / 0000000000000076:  R_X86_64_64  _PyRuntime+0x28e0
    / 7e: 48 8b 00                      movq    (%rax), %rax
    / 81: 48 85 c0                      testq   %rax, %rax
    / 84: 74 17                         je      0x9d <_JIT_ENTRY+0x9d>
    / 86: 48 b9 00 00 00 00 00 00 00 00 movabsq $0x0, %rcx
    / 0000000000000088:  R_X86_64_64  _PyRuntime+0x28e8
    / 90: 48 8b 11                      movq    (%rcx), %rdx
    / 93: 48 89 ef                      movq    %rbp, %rdi
    / 96: be 01 00 00 00                movl    $0x1, %esi
    / 9b: ff d0                         callq   *%rax
    / 9d: 48 89 ef                      movq    %rbp, %rdi
    / a0: ff 15 00 00 00 00             callq   *(%rip)                 # 0xa6 <_JIT_ENTRY+0xa6>
    / 00000000000000a2:  R_X86_64_GOTPCRELX   _PyLong_ExactDealloc-0x4
    / a6: 41 f6 c7 01                   testb   $0x1, %r15b
    / aa: 74 8f                         je      0x3b <_JIT_ENTRY+0x3b>
    / ac: 48 85 db                      testq   %rbx, %rbx
    / af: 74 18                         je      0xc9 <_JIT_ENTRY+0xc9>
    / b1: 0f b7 43 06                   movzwl  0x6(%rbx), %eax
    / b5: 83 e0 01                      andl    $0x1, %eax
    / b8: 48 09 d8                      orq     %rbx, %rax
    / bb: 49 89 45 f0                   movq    %rax, -0x10(%r13)
    / bf: 49 83 c5 f8                   addq    $-0x8, %r13
    / c3: 5d                            popq    %rbp
    / c4: e9 00 00 00 00                jmp     0xc9 <_JIT_ENTRY+0xc9>
    / 00000000000000c5:  R_X86_64_PLT32       _JIT_CONTINUE-0x4
    / c9: 49 83 c5 f0                   addq    $-0x10, %r13
    / cd: 5d                            popq    %rbp
    / ce: e9 00 00 00 00                jmp     0xd3 <_JIT_ENTRY+0xd3>
    / 00000000000000cf:  R_X86_64_PLT32       _JIT_ERROR_TARGET-0x4

After:

    / 
    / _BINARY_OP_ADD_INT__NO_INPUT_DECREF.o: file format elf64-x86-64
    / 
    / Disassembly of section .text:
    / 
    / 0000000000000000 <_JIT_ENTRY>:
    / 0: 50                            pushq   %rax
    / 1: 49 8b 7d f0                   movq    -0x10(%r13), %rdi
    / 5: 49 8b 75 f8                   movq    -0x8(%r13), %rsi
    / 9: 48 83 e7 fe                   andq    $-0x2, %rdi
    / d: 48 83 e6 fe                   andq    $-0x2, %rsi
    / 11: 4d 89 6c 24 40                movq    %r13, 0x40(%r12)
    / 16: ff 15 00 00 00 00             callq   *(%rip)                 # 0x1c <_JIT_ENTRY+0x1c>
    / 0000000000000018:  R_X86_64_GOTPCRELX   _PyLong_Add-0x4
    / 1c: 4d 8b 6c 24 40                movq    0x40(%r12), %r13
    / 21: 48 85 c0                      testq   %rax, %rax
    / 24: 74 18                         je      0x3e <_JIT_ENTRY+0x3e>
    / 26: 0f b7 48 06                   movzwl  0x6(%rax), %ecx
    / 2a: 83 e1 01                      andl    $0x1, %ecx
    / 2d: 48 09 c1                      orq     %rax, %rcx
    / 30: 49 89 4d f0                   movq    %rcx, -0x10(%r13)
    / 34: 49 83 c5 f8                   addq    $-0x8, %r13
    / 38: 58                            popq    %rax
    / 39: e9 00 00 00 00                jmp     0x3e <_JIT_ENTRY+0x3e>
    / 000000000000003a:  R_X86_64_PLT32       _JIT_CONTINUE-0x4
    / 3e: 49 83 c5 f0                   addq    $-0x10, %r13
    / 42: 58                            popq    %rax
    / 43: e9 00 00 00 00                jmp     0x48 <_JIT_ENTRY+0x48>
    / 0000000000000044:  R_X86_64_PLT32       _JIT_ERROR_TARGET-0x4
added a commit that references this issue on Jun 17, 2025

gh-134584: Decref elimination for float ops in the JIT (GH-134588)

fba5dde
added a commit that references this issue on Jun 19, 2025

pythongh-134584: Decref elimination for float ops in the JIT (pythonG…

added a commit that references this issue on Jun 22, 2025

pythongh-134584: Elimiate redundant refcounting from _BINARY_OP_X_UNI…

19 remaining items

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Labels

interpreter-core(Objects, Python, Grammar, and Parser dirs)performancePerformance or resource usagetopic-JITtype-featureA feature request or enhancement

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions

    Eliminate redundant refcounting in the JIT · Issue #134584 · python/cpython

    Follow Lee on X/Twitter - Father, Husband, Serial builder creating AI, crypto, games & web tools. We are friends :) AI Will Come To Life!

    Check out: eBank.nz (Art Generator) | Netwrck.com (AI Tools) | Text-Generator.io (AI API) | BitBank.nz (Crypto AI) | ReadingTime (Kids Reading) | RewordGame | BigMultiplayerChess | WebFiddle | How.nz | Helix AI Assistant