
Deadlock in test_opcache with gh-131174 applied #133130

Closed

Description

@nascheme

Bug report

This is to document the deadlock that occurs with gh-131174 applied. It seems a bit hard to reproduce; I had to run the tests repeatedly. When it finally deadlocked, the stack traces for the three threads were as follows. Note that update_slots_callback() runs with the world stopped.

Thread 1:

[...]
#6  0x000055e8fa1ab214 in _PySemaphore_PlatformWait (sema=sema@entry=0x7ffefa17dda0, timeout=timeout@entry=-1)
    at Python/parking_lot.c:162
#7  0x000055e8fa1ab55c in _PySemaphore_Wait (sema=sema@entry=0x7ffefa17dda0, timeout=timeout@entry=-1,
    detach=detach@entry=0) at Python/parking_lot.c:232
#8  0x000055e8fa1ab6e1 in _PyParkingLot_Park (addr=addr@entry=0x55e8fa5a186c <_PyRuntime+361772>,
    expected=expected@entry=0x7ffefa17de24, size=size@entry=4, timeout_ns=timeout_ns@entry=-1,
    park_arg=park_arg@entry=0x0, detach=detach@entry=0) at Python/parking_lot.c:335
#9  0x000055e8fa1c1c43 in tstate_wait_attach (tstate=tstate@entry=0x55e8fa5a1840 <_PyRuntime+361728>)
    at Python/pystate.c:2145
#10 0x000055e8fa1c1df0 in _PyThreadState_Attach (tstate=tstate@entry=0x55e8fa5a1840 <_PyRuntime+361728>)
    at Python/pystate.c:2180
#11 0x000055e8fa17048f in PyEval_AcquireThread (tstate=tstate@entry=0x55e8fa5a1840 <_PyRuntime+361728>)
    at Python/ceval_gil.c:605
#12 0x000055e8fa1ab56b in _PySemaphore_Wait (sema=sema@entry=0x7ffefa17df00, timeout=timeout@entry=-1,
    detach=detach@entry=1) at Python/parking_lot.c:234
#13 0x000055e8fa1ab6e1 in _PyParkingLot_Park (addr=addr@entry=0x55e8fa599160 <_PyRuntime+327200>,
    expected=expected@entry=0x7ffefa17df97, size=size@entry=1, timeout_ns=timeout_ns@entry=-1,
    park_arg=park_arg@entry=0x7ffefa17dfa0, detach=detach@entry=1) at Python/parking_lot.c:335
#14 0x000055e8fa19fb6a in _PyMutex_LockTimed (m=m@entry=0x55e8fa599160 <_PyRuntime+327200>, timeout=timeout@entry=-1,
    flags=flags@entry=_PY_LOCK_DETACH) at Python/lock.c:108
#15 0x000055e8fa1a0970 in PyMutex_Lock (m=m@entry=0x55e8fa599160 <_PyRuntime+327200>) at Python/lock.c:611
#16 0x000055e8fa1435c6 in _PyMutex_Lock (m=m@entry=0x55e8fa599160 <_PyRuntime+327200>) at ./Include/cpython/lock.h:49
#17 0x000055e8fa14378f in _PyCriticalSection_BeginSlow (c=c@entry=0x7ffefa17e110,
    m=m@entry=0x55e8fa599160 <_PyRuntime+327200>) at Python/critical_section.c:37
#18 0x000055e8f9fe8918 in _PyCriticalSection_BeginMutex (c=c@entry=0x7ffefa17e110,
    m=0x55e8fa599160 <_PyRuntime+327200>) at ./Include/internal/pycore_critical_section.h:119
#19 0x000055e8fa000264 in _PyType_LookupStackRefAndVersion (type=0x7f06d8afc810,
    name=name@entry=0x55e8fa558f28 <_PyRuntime+64488>, out=out@entry=0x7f071b60d190) at Objects/typeobject.c:5891
[...]

Thread 2:

[...]
#6  0x000055e8fa1ab214 in _PySemaphore_PlatformWait (sema=sema@entry=0x7f06cc4fd500, timeout=timeout@entry=-1)
    at Python/parking_lot.c:162
#7  0x000055e8fa1ab55c in _PySemaphore_Wait (sema=sema@entry=0x7f06cc4fd500, timeout=timeout@entry=-1,
    detach=detach@entry=0) at Python/parking_lot.c:232
#8  0x000055e8fa1ab6e1 in _PyParkingLot_Park (addr=addr@entry=0x72a40000f03c, expected=expected@entry=0x7f06cc4fd584,
    size=size@entry=4, timeout_ns=timeout_ns@entry=-1, park_arg=park_arg@entry=0x0, detach=detach@entry=0)
    at Python/parking_lot.c:335
#9  0x000055e8fa1c1c43 in tstate_wait_attach (tstate=tstate@entry=0x72a40000f010) at Python/pystate.c:2145
#10 0x000055e8fa1c1df0 in _PyThreadState_Attach (tstate=tstate@entry=0x72a40000f010) at Python/pystate.c:2180
#11 0x000055e8fa17048f in PyEval_AcquireThread (tstate=tstate@entry=0x72a40000f010) at Python/ceval_gil.c:605
#12 0x000055e8fa290c0f in thread_run (boot_raw=boot_raw@entry=0x7214000012d0) at ./Modules/_threadmodule.c:350
#13 0x000055e8fa1e682d in pythread_wrapper (arg=0x720c00000b80) at Python/thread_pthread.h:242
#14 0x00007f071c601def in __tsan_thread_start_func (arg=0x7ffefa17dd60)
    at ../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1012
#15 0x00007f071c354aa4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#16 0x00007f071c3e1c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
[...]

Thread 3:

[...]
#6  0x000055e8fa1ab214 in _PySemaphore_PlatformWait (sema=sema@entry=0x7f06cbcfb4a0, timeout=timeout@entry=-1)
    at Python/parking_lot.c:162
#7  0x000055e8fa1ab55c in _PySemaphore_Wait (sema=sema@entry=0x7f06cbcfb4a0, timeout=timeout@entry=-1,
    detach=detach@entry=1) at Python/parking_lot.c:232
#8  0x000055e8fa1ab6e1 in _PyParkingLot_Park (addr=addr@entry=0x55e8fa599160 <_PyRuntime+327200>,
    expected=expected@entry=0x7f06cbcfb537, size=size@entry=1, timeout_ns=timeout_ns@entry=-1,
    park_arg=park_arg@entry=0x7f06cbcfb540, detach=detach@entry=1) at Python/parking_lot.c:335
#9  0x000055e8fa19fb6a in _PyMutex_LockTimed (m=m@entry=0x55e8fa599160 <_PyRuntime+327200>, timeout=timeout@entry=-1,
    flags=flags@entry=_PY_LOCK_DETACH) at Python/lock.c:108
#10 0x000055e8fa1a0970 in PyMutex_Lock (m=m@entry=0x55e8fa599160 <_PyRuntime+327200>) at Python/lock.c:611
#11 0x000055e8fa1435c6 in _PyMutex_Lock (m=m@entry=0x55e8fa599160 <_PyRuntime+327200>) at ./Include/cpython/lock.h:49
#12 0x000055e8fa143bf5 in _PyCriticalSection_Resume (tstate=tstate@entry=0x72a400019010)
    at Python/critical_section.c:109
#13 0x000055e8f9f805e2 in _PyCriticalSection_Pop (c=c@entry=0x7f06cbcfb670)
    at ./Include/internal/pycore_critical_section.h:141
#14 0x000055e8f9f80618 in _PyCriticalSection_End (c=c@entry=0x7f06cbcfb670)
    at ./Include/internal/pycore_critical_section.h:155
#15 0x000055e8f9f806c7 in ensure_shared_on_read (mp=mp@entry=0x7f06d94e1030) at Objects/dictobject.c:1326
#16 0x000055e8f9f860a4 in _Py_dict_lookup_threadsafe (mp=mp@entry=0x7f06d94e1030,
    key=key@entry=0x55e8fa5592b8 <_PyRuntime+65400>, hash=hash@entry=5605474849373301295,
    value_addr=value_addr@entry=0x7f06cbcfb740) at Objects/dictobject.c:1497
#17 0x000055e8f9f86e3f in _PyDict_GetItemRef_KnownHash (op=op@entry=0x7f06d94e1030,
    key=key@entry=0x55e8fa5592b8 <_PyRuntime+65400>, hash=hash@entry=5605474849373301295,
    result=result@entry=0x7f06cbcfb7a0) at Objects/dictobject.c:2370
#18 0x000055e8f9ffff81 in find_name_in_mro (type=type@entry=0x7f06d9691410, name=0x55e8fa5592b8 <_PyRuntime+65400>,
    error=error@entry=0x7f06cbcfb824) at Objects/typeobject.c:5721
#19 0x000055e8fa007d9b in update_one_slot (type=type@entry=0x7f06d9691410, p=0x55e8fa5259a0 <slotdefs>)
    at Objects/typeobject.c:11292
#20 0x000055e8fa007f9b in update_slots_callback (type=type@entry=0x7f06d9691410, data=data@entry=0x7f06cbcfb8c0)
    at Objects/typeobject.c:11390
[...]

Inside _PyCriticalSection_Resume in the third thread, it is waiting on m1:

#12 0x000055e8fa143bf5 in _PyCriticalSection_Resume (tstate=tstate@entry=0x72a400019010)
    at Python/critical_section.c:109
109             PyMutex_Lock(m1);
(gdb) l
104             m2 = c2->_cs_mutex2;
105             c2->_cs_mutex2 = NULL;
106         }
107
108         if (m1) {
109             PyMutex_Lock(m1);
110         }
111         if (m2) {
112             PyMutex_Lock(m2);
113         }
(gdb) p *m1
$9 = {_bits = 3 '\003'}
(gdb) p *m2
$10 = {_bits = 0 '\000'}
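(For reference: in the free-threaded lock implementation, _bits == 3 corresponds to _Py_LOCKED | _Py_HAS_PARKED, so m1 is held with a parked waiter, while m2 with _bits == 0 is unlocked.)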

Activity

added a commit that references this issue on Apr 29, 2025:
Revert gh-127266: avoid data races when updating type slots (gh-131174)… (eecafc3)
colesbury (Contributor) commented on Apr 29, 2025

It looks like m is probably the TYPE_LOCK? It's at _PyRuntime+327200, so it's statically allocated, not some other random object lock.

There's a problem that I forgot when reviewing the PR: even though all the critical sections for other threads are suspended during a stop-the-world, some of the locks may still be locked!

This is because locks can be handed off directly: when unlocking a lock, if a waiter has been waiting for longer than TIME_TO_BE_FAIR_NS (1 ms), then the lock is "handed off" to it directly rather than unlocked. This is to ensure fairness and avoid starvation. However, the new owner may not yet have processed that it now "owns" the lock by the time of the stop-the-world pause.

For example, this pseudocode can deadlock:

T1: PyMutex_Lock(&m1); PyMutex_Unlock(&m1);
T2: PyMutex_Lock(&m1); PyMutex_Unlock(&m1);

T3: stop_the_world();
    PyMutex_Lock(&m1);
    PyMutex_Unlock(&m1);
    start_the_world();

Even though T1 and T2 won't pause for a stop-the-world between the lock and the unlock, they may pause during the PyMutex_Lock while owning the lock due to a handoff from the other thread (without yet having processed it).

In general, we really want to minimize locking during a stop-the-world pause. For example, the free-threaded GC only acquires HEAD_LOCK(). Importantly, HEAD_LOCK() is only acquired via PyMutex_LockFlags(..., _Py_LOCK_DONT_DETACH), where _Py_LOCK_DONT_DETACH means that the thread waiting to acquire the lock won't detach, which avoids this handoff + stop-the-world deadlock.
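For reference, the HEAD_LOCK() pattern looks roughly like this (a sketch from memory of the macros in Python/pystate.c; exact field names may differ):

/* The waiter stays attached while parked, so a stop-the-world request
   cannot complete while any thread is still waiting on this lock --
   the lock can never be handed off to an already-stopped thread. */
#define HEAD_LOCK(runtime) \
    PyMutex_LockFlags(&(runtime)->interpreters.mutex, _Py_LOCK_DONT_DETACH)
#define HEAD_UNLOCK(runtime) \
    PyMutex_Unlock(&(runtime)->interpreters.mutex)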

nascheme (Member, Author) commented on Apr 29, 2025

Thanks for those details. Being careful enough during the stop-the-world pause seems difficult. update_slots_callback() calls the following:

  • _PyObject_HashFast
  • PyDict_Contains
  • _PyDict_GetItemRef_KnownHash

Would it be possible for the thread stopping the world to process the locks to ensure it owns them?

colesbury (Contributor) commented on Apr 29, 2025

Maybe... you might be able to do something like:

  1. Push a blank critical section (i.e., a NULL mutex pointer)
  2. PyMutex_Lock() the dictionary
  3. Stop the world
    ...

The blank critical section avoids deadlock due to reentrancy and avoids resuming any previously active critical sections when attaching.
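A rough sketch of that sequence (assumptions: that _PyCriticalSection_BeginMutex tolerates a NULL mutex as a "blank" entry, and that mp_mutex stands in for the dict's per-object mutex; neither is verified here):

PyCriticalSection blank;
_PyCriticalSection_BeginMutex(&blank, NULL);   /* 1. blank critical section */
PyMutex_Lock(mp_mutex);                        /* 2. lock the dictionary up front */
_PyEval_StopTheWorld(interp);                  /* 3. stop the world */
/* ... do the world-stopped slot updates, taking no further locks ... */
_PyEval_StartTheWorld(interp);
PyMutex_Unlock(mp_mutex);
_PyCriticalSection_End(&blank);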

I'm not enthusiastic about this approach. We haven't done something like that before, and I'd be worried both about missing some subtle detail and about the extra complexity from additional unusual patterns.

I'd much rather we use non-locking versions of the dictionary functions (such as _PyDict_GetItemRef_KnownHash_LockHeld) and adjust assertions as necessary.
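For illustration, a minimal sketch of that substitution (the wrapper name is hypothetical; the point is that the world-stopped callback skips the per-dict critical section that deadlocked above):

/* During a world-stopped callback no other thread can mutate the dict,
   so the _LockHeld lookup variant is safe without taking the dict's
   critical section.  Returns -1 on error, 0 if missing, 1 if found
   (storing a new reference in *value). */
static int
lookup_while_stopped(PyDictObject *mp, PyObject *key, Py_hash_t hash,
                     PyObject **value)
{
    return _PyDict_GetItemRef_KnownHash_LockHeld(mp, key, hash, value);
}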

kumaraditya303 (Contributor) commented on May 28, 2025

Fixed by #133177


      Check out: eBank.nz (Art Generator) | Netwrck.com (AI Tools) | Text-Generator.io (AI API) | BitBank.nz (Crypto AI) | ReadingTime (Kids Reading) | RewordGame | BigMultiplayerChess | WebFiddle | How.nz | Helix AI Assistant