Description
Bug report
This documents the deadlock that occurs with gh-131174 applied. It seems somewhat hard to reproduce and I had to run the tests repeatedly. When it finally deadlocked, the stack traces for the three threads were as follows. Note that update_slots_callback() runs with the world stopped.
Thread 1:
[...]
#6 0x000055e8fa1ab214 in _PySemaphore_PlatformWait (sema=sema@entry=0x7ffefa17dda0, timeout=timeout@entry=-1)
at Python/parking_lot.c:162
#7 0x000055e8fa1ab55c in _PySemaphore_Wait (sema=sema@entry=0x7ffefa17dda0, timeout=timeout@entry=-1,
detach=detach@entry=0) at Python/parking_lot.c:232
#8 0x000055e8fa1ab6e1 in _PyParkingLot_Park (addr=addr@entry=0x55e8fa5a186c <_PyRuntime+361772>,
expected=expected@entry=0x7ffefa17de24, size=size@entry=4, timeout_ns=timeout_ns@entry=-1,
park_arg=park_arg@entry=0x0, detach=detach@entry=0) at Python/parking_lot.c:335
#9 0x000055e8fa1c1c43 in tstate_wait_attach (tstate=tstate@entry=0x55e8fa5a1840 <_PyRuntime+361728>)
at Python/pystate.c:2145
#10 0x000055e8fa1c1df0 in _PyThreadState_Attach (tstate=tstate@entry=0x55e8fa5a1840 <_PyRuntime+361728>)
at Python/pystate.c:2180
#11 0x000055e8fa17048f in PyEval_AcquireThread (tstate=tstate@entry=0x55e8fa5a1840 <_PyRuntime+361728>)
at Python/ceval_gil.c:605
#12 0x000055e8fa1ab56b in _PySemaphore_Wait (sema=sema@entry=0x7ffefa17df00, timeout=timeout@entry=-1,
detach=detach@entry=1) at Python/parking_lot.c:234
#13 0x000055e8fa1ab6e1 in _PyParkingLot_Park (addr=addr@entry=0x55e8fa599160 <_PyRuntime+327200>,
expected=expected@entry=0x7ffefa17df97, size=size@entry=1, timeout_ns=timeout_ns@entry=-1,
park_arg=park_arg@entry=0x7ffefa17dfa0, detach=detach@entry=1) at Python/parking_lot.c:335
#14 0x000055e8fa19fb6a in _PyMutex_LockTimed (m=m@entry=0x55e8fa599160 <_PyRuntime+327200>, timeout=timeout@entry=-1,
flags=flags@entry=_PY_LOCK_DETACH) at Python/lock.c:108
#15 0x000055e8fa1a0970 in PyMutex_Lock (m=m@entry=0x55e8fa599160 <_PyRuntime+327200>) at Python/lock.c:611
#16 0x000055e8fa1435c6 in _PyMutex_Lock (m=m@entry=0x55e8fa599160 <_PyRuntime+327200>) at ./Include/cpython/lock.h:49
#17 0x000055e8fa14378f in _PyCriticalSection_BeginSlow (c=c@entry=0x7ffefa17e110,
m=m@entry=0x55e8fa599160 <_PyRuntime+327200>) at Python/critical_section.c:37
#18 0x000055e8f9fe8918 in _PyCriticalSection_BeginMutex (c=c@entry=0x7ffefa17e110,
m=0x55e8fa599160 <_PyRuntime+327200>) at ./Include/internal/pycore_critical_section.h:119
#19 0x000055e8fa000264 in _PyType_LookupStackRefAndVersion (type=0x7f06d8afc810,
name=name@entry=0x55e8fa558f28 <_PyRuntime+64488>, out=out@entry=0x7f071b60d190) at Objects/typeobject.c:5891
[...]
Thread 2:
[...]
#6 0x000055e8fa1ab214 in _PySemaphore_PlatformWait (sema=sema@entry=0x7f06cc4fd500, timeout=timeout@entry=-1)
at Python/parking_lot.c:162
#7 0x000055e8fa1ab55c in _PySemaphore_Wait (sema=sema@entry=0x7f06cc4fd500, timeout=timeout@entry=-1,
detach=detach@entry=0) at Python/parking_lot.c:232
#8 0x000055e8fa1ab6e1 in _PyParkingLot_Park (addr=addr@entry=0x72a40000f03c, expected=expected@entry=0x7f06cc4fd584,
size=size@entry=4, timeout_ns=timeout_ns@entry=-1, park_arg=park_arg@entry=0x0, detach=detach@entry=0)
at Python/parking_lot.c:335
#9 0x000055e8fa1c1c43 in tstate_wait_attach (tstate=tstate@entry=0x72a40000f010) at Python/pystate.c:2145
#10 0x000055e8fa1c1df0 in _PyThreadState_Attach (tstate=tstate@entry=0x72a40000f010) at Python/pystate.c:2180
#11 0x000055e8fa17048f in PyEval_AcquireThread (tstate=tstate@entry=0x72a40000f010) at Python/ceval_gil.c:605
#12 0x000055e8fa290c0f in thread_run (boot_raw=boot_raw@entry=0x7214000012d0) at ./Modules/_threadmodule.c:350
#13 0x000055e8fa1e682d in pythread_wrapper (arg=0x720c00000b80) at Python/thread_pthread.h:242
#14 0x00007f071c601def in __tsan_thread_start_func (arg=0x7ffefa17dd60)
at ../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1012
#15 0x00007f071c354aa4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#16 0x00007f071c3e1c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
[...]
Thread 3:
[...]
#6 0x000055e8fa1ab214 in _PySemaphore_PlatformWait (sema=sema@entry=0x7f06cbcfb4a0, timeout=timeout@entry=-1)
at Python/parking_lot.c:162
#7 0x000055e8fa1ab55c in _PySemaphore_Wait (sema=sema@entry=0x7f06cbcfb4a0, timeout=timeout@entry=-1,
detach=detach@entry=1) at Python/parking_lot.c:232
#8 0x000055e8fa1ab6e1 in _PyParkingLot_Park (addr=addr@entry=0x55e8fa599160 <_PyRuntime+327200>,
expected=expected@entry=0x7f06cbcfb537, size=size@entry=1, timeout_ns=timeout_ns@entry=-1,
park_arg=park_arg@entry=0x7f06cbcfb540, detach=detach@entry=1) at Python/parking_lot.c:335
#9 0x000055e8fa19fb6a in _PyMutex_LockTimed (m=m@entry=0x55e8fa599160 <_PyRuntime+327200>, timeout=timeout@entry=-1,
flags=flags@entry=_PY_LOCK_DETACH) at Python/lock.c:108
#10 0x000055e8fa1a0970 in PyMutex_Lock (m=m@entry=0x55e8fa599160 <_PyRuntime+327200>) at Python/lock.c:611
#11 0x000055e8fa1435c6 in _PyMutex_Lock (m=m@entry=0x55e8fa599160 <_PyRuntime+327200>) at ./Include/cpython/lock.h:49
#12 0x000055e8fa143bf5 in _PyCriticalSection_Resume (tstate=tstate@entry=0x72a400019010)
at Python/critical_section.c:109
#13 0x000055e8f9f805e2 in _PyCriticalSection_Pop (c=c@entry=0x7f06cbcfb670)
at ./Include/internal/pycore_critical_section.h:141
#14 0x000055e8f9f80618 in _PyCriticalSection_End (c=c@entry=0x7f06cbcfb670)
at ./Include/internal/pycore_critical_section.h:155
#15 0x000055e8f9f806c7 in ensure_shared_on_read (mp=mp@entry=0x7f06d94e1030) at Objects/dictobject.c:1326
#16 0x000055e8f9f860a4 in _Py_dict_lookup_threadsafe (mp=mp@entry=0x7f06d94e1030,
key=key@entry=0x55e8fa5592b8 <_PyRuntime+65400>, hash=hash@entry=5605474849373301295,
value_addr=value_addr@entry=0x7f06cbcfb740) at Objects/dictobject.c:1497
#17 0x000055e8f9f86e3f in _PyDict_GetItemRef_KnownHash (op=op@entry=0x7f06d94e1030,
key=key@entry=0x55e8fa5592b8 <_PyRuntime+65400>, hash=hash@entry=5605474849373301295,
result=result@entry=0x7f06cbcfb7a0) at Objects/dictobject.c:2370
#18 0x000055e8f9ffff81 in find_name_in_mro (type=type@entry=0x7f06d9691410, name=0x55e8fa5592b8 <_PyRuntime+65400>,
error=error@entry=0x7f06cbcfb824) at Objects/typeobject.c:5721
#19 0x000055e8fa007d9b in update_one_slot (type=type@entry=0x7f06d9691410, p=0x55e8fa5259a0 <slotdefs>)
at Objects/typeobject.c:11292
#20 0x000055e8fa007f9b in update_slots_callback (type=type@entry=0x7f06d9691410, data=data@entry=0x7f06cbcfb8c0)
at Objects/typeobject.c:11390
[...]
Inside _PyCriticalSection_Resume in the third thread, it is waiting on m1:
#12 0x000055e8fa143bf5 in _PyCriticalSection_Resume (tstate=tstate@entry=0x72a400019010)
at Python/critical_section.c:109
109 PyMutex_Lock(m1);
(gdb) l
104 m2 = c2->_cs_mutex2;
105 c2->_cs_mutex2 = NULL;
106 }
107
108 if (m1) {
109 PyMutex_Lock(m1);
110 }
111 if (m2) {
112 PyMutex_Lock(m2);
113 }
(gdb) p *m1
$9 = {_bits = 3 '\003'}
(gdb) p *m2
$10 = {_bits = 0 '\000'}
Activity
Revert gh-127266: avoid data races when updating type slots (gh-131174)…
colesbury commented on Apr 29, 2025
It looks like m is probably the TYPE_LOCK? It's at _PyRuntime+327200, so it's statically allocated, not some other random object lock.

There's a problem that I forgot when reviewing the PR: even though all the critical sections for other threads are suspended during a stop-the-world, some of the locks may still be locked!

This is because locks can be handed off directly: when unlocking a lock, if the waiter has been waiting for longer than TIME_TO_BE_FAIR_NS (1 ms), then the lock is "handed off" directly and not unlocked. This is done to ensure fairness and avoid starvation. However, the new owner may not yet have processed that it now "owns" the lock by the time of the stop-the-world pause. For example, this pseudocode can deadlock:
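The pseudocode block from the original comment did not survive extraction; the interleaving it describes can be sketched as follows. T1, T2, and the mutex m are the names used in the surrounding discussion (m being a shared, statically allocated PyMutex such as TYPE_LOCK); the third thread's label is mine:

```text
T1:                        T2:
  PyMutex_Lock(&m)           PyMutex_Lock(&m)
  ...                        ...
  PyMutex_Unlock(&m)         PyMutex_Unlock(&m)

stop-the-world thread:
  stop_the_world()
  PyMutex_Lock(&m)     # can deadlock: T1's unlock may have handed m off to
                       # T2, which was then paused before it ever observed
                       # that it now owns the lock
```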
Even though T1 and T2 won't pause for a stop-the-world between the lock and unlock, they may pause during the PyMutex_Lock AND own the lock due to a handoff from the other thread (but not yet have processed it).

In general, we really want to minimize locking during a stop-the-world pause. For example, the free-threaded GC only acquires HEAD_LOCK(). Importantly, HEAD_LOCK() is only acquired with PyMutex_LockFlags(..., _Py_LOCK_DONT_DETACH), where _Py_LOCK_DONT_DETACH means that the thread waiting to acquire it won't detach, which avoids this handoff + stop-the-world deadlock.

nascheme commented on Apr 29, 2025
Thanks for those details. Being careful enough during the stop-the-world seems difficult. update_slots_callback calls the following:

- _PyObject_HashFast
- PyDict_Contains
- _PyDict_GetItemRef_KnownHash

Would it be possible for the thread stopping the world to process the locks to ensure it owns them?
colesbury commented on Apr 29, 2025

Maybe... you might be able to do something like: push a "blank" critical section (one with a NULL mutex pointer)...

The blank critical section avoids deadlock due to reentrancy and avoids resuming any previously active critical sections when attaching.
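Assembled from the fragments above (the full list of steps was lost in extraction), the proposed scheme might look roughly like this. This is my reconstruction in pseudocode, not code from the thread, and the approach was ultimately not adopted:

```text
# In the thread performing the stop-the-world pause:
stop_the_world()
push_blank_critical_section()          # one with a NULL mutex pointer
for m in locks_needed_by_callback:     # e.g. TYPE_LOCK
    acquire_or_complete_handoff(m)     # ensure this thread really owns m
run update_slots_callback()
release_locks(); pop_blank_critical_section()
start_the_world()
```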
I'm not enthusiastic about this approach. We haven't done something like that before and I'd be worried about both missing some subtle detail and also the extra complexity from additional weird patterns.
I'd much rather we use non-locking versions of the dictionary functions (such as _PyDict_GetItemRef_KnownHash_LockHeld) and adjust assertions as necessary.

kumaraditya303 commented on May 28, 2025
Fixed by #133177