3.14t vs 3.13t cuts IOCP performance in half #134637

Closed
@SolsticeProjekt

Bug report

Bug description:

This is about 3.14.0b1 vs 3.13.1, free-threaded in both cases.
Microsoft Windows [Version 10.0.19045.4529]

I run and maintain an IOCP server in Python 3.13.1t.
There are no 3rd party libraries being used.

The Problem: Using the exact same code, running 3.14t vs 3.13t cuts the throughput in half.

I've made a badly written, working benchmark, extracted/simplified from my IOCP server.

Server:

from ctypes import windll,create_string_buffer,c_void_p,c_ulong,c_ulonglong,Structure,byref,cast,addressof,POINTER,c_char
from ctypes.wintypes import DWORD,HANDLE

kernel32 = windll.kernel32

CreateNamedPipeW = kernel32.CreateNamedPipeW
CreateIOCompletionPort = kernel32.CreateIoCompletionPort
ConnectNamedPipe = kernel32.ConnectNamedPipe
GetQueuedCompletionStatusEx = kernel32.GetQueuedCompletionStatusEx
ReadFile = kernel32.ReadFile

GLE = kernel32.GetLastError

class OVERLAPPED(Structure):
	_fields_ = (("0", c_void_p),("1", c_void_p),("2", DWORD),("3", DWORD),("4", c_void_p),
				("5", c_void_p),("6",c_void_p),("7",c_void_p),("8",c_void_p))


Overlapped = (OVERLAPPED*10)()
__Overlapped = byref(Overlapped)

IOCP = CreateIOCompletionPort(HANDLE(-1),None,0,4)

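# 1 = PIPE_ACCESS_INBOUND, 1073741824 = FILE_FLAG_OVERLAPPED
flag1 = 1 | 1073741824
# 4 = PIPE_TYPE_MESSAGE, 2 = PIPE_READMODE_MESSAGE, 0 = PIPE_WAIT, 8 = PIPE_REJECT_REMOTE_CLIENTS
flag2 = 4 | 2 | 0 | 8
# CreateNamedPipeW takes eight parameters: name, open mode, pipe mode, max instances,
# out buffer size, in buffer size, default timeout, security attributes.
Pipe = CreateNamedPipeW("\\\\.\\pipe\\IOCPBenchMark",flag1,flag2,255,32,0,0,None)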

if not CreateIOCompletionPort(Pipe,IOCP,1,0): print("ERROR!")

ReadBuffer = create_string_buffer(1024)
__ReadBuffer = byref(ReadBuffer)

# Room for the 255 OVERLAPPED_ENTRY structures requested below (32 bytes each on 64-bit Windows).
OverlapEntries = create_string_buffer(32*255)
ove = byref(OverlapEntries)

Completed = c_ulong(0)
__Completed = byref(Completed)

def __IOCPThread():
	while True:
		while not GetQueuedCompletionStatusEx(IOCP, ove, 255, __Completed, 0, False): continue
		ReadFile(Pipe, __ReadBuffer,32,None,__Overlapped)


from threading import Thread

Threads = []
for t in range(4): Threads.append(Thread(target=__IOCPThread))


success = ConnectNamedPipe(Pipe, __Overlapped)
if not success:
	if GLE() != 997:  # 997 = ERROR_IO_PENDING, expected for an overlapped connect
		print("ERROR 2")


while not GetQueuedCompletionStatusEx(IOCP, ove, 255, __Completed, 1, False): continue
print("Connected.")

ReadFile(Pipe, __ReadBuffer,32,None,__Overlapped)
for t in Threads: t.start()


from time import sleep
while True:
	sleep(1)

Client:

from ctypes import windll,c_char_p,byref
from ctypes.wintypes import DWORD

from time import perf_counter as pfc

kernel32 = windll.kernel32

CreateFileW = kernel32.CreateFileW
WriteFile = kernel32.WriteFile
GLE = kernel32.GetLastError

written = DWORD()
__written = byref(written)


print(GLE())

GENERIC_WRITE = 1073741824
Pipe = CreateFileW("\\\\.\\pipe\\IOCPBenchMark",GENERIC_WRITE,0,None,3,0,None)  # 3 = OPEN_EXISTING

if GLE() == 0: print("Connected.")

test = b"test"

t = pfc()+1
while True:

	for Count in range(1000000):

		if not WriteFile(Pipe, test, 4,__written, None):
			print("ERROR ",GLE())

		if not WriteFile(Pipe, test, 4,__written, None):
			print("ERROR ",GLE())

		if not WriteFile(Pipe, test, 4,__written, None):
			print("ERROR ",GLE())

		if not WriteFile(Pipe, test, 4,__written, None):
			print("ERROR ",GLE())

		if pfc() >= t:
			t = pfc()+1
			print(Count*4)
			break

The server uses 4 threads. If you don't see any output, try reducing the number of threads.
When I use 8 threads (on my 8-core machine), I don't get any output. D'uh.
No, SMT threads don't count for anything here.

Each script runs in its own cmd.exe window.
Please be aware that you'll have to kill the server-process manually.

I wanted to add a call to "taskkill /F /IM pytho*",
but then realized I might cause someone big trouble with that.

>python3.13t server.py:
Client output:
205536
207128
206764
206504
204768

>python3.14t server.py:
Client output:
107468
105516
106032
107492
108472
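
For reference, the size of the drop can be worked out from the figures above. The sketch below only averages the reported client output (the client prints Count*4, roughly the number of writes completed per one-second interval) and compares the two means; the variable names are illustrative.

```python
# Client output copied from the report above, one figure per ~1 s interval.
py313t = [205536, 207128, 206764, 206504, 204768]
py314t = [107468, 105516, 106032, 107492, 108472]

mean_313 = sum(py313t) / len(py313t)
mean_314 = sum(py314t) / len(py314t)
ratio = mean_314 / mean_313  # fraction of 3.13t throughput retained on 3.14t

print(f"3.13t mean: {mean_313:.0f}")
print(f"3.14t mean: {mean_314:.0f}")
print(f"3.14t runs at {ratio:.1%} of 3.13t throughput")
```

The means come out at 206140 and 106996, so 3.14t reaches about 52% of the 3.13t throughput in this run.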

Perplexity suggested I should post this here,
because this is a use-case you people might be interested in.

Thank you.

CPython versions tested on:

3.14

Operating systems tested on:

Windows

Activity

ZeroIntensity (Member) commented on May 24, 2025

In 3.14, we made ctypes thread-safe, so this is probably the result of lock contention. I'm honestly surprised it's not crashing in 3.13. Would you mind benchmarking to find which lock is causing the problem?

cc @kumaraditya303
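
One rough way to check for contention without the Windows-specific pipe setup is to call a cheap C function from several threads through a single shared ctypes function object and compare the total call rate against one thread. The following is a minimal sketch under stated assumptions: `abs` from the C runtime stands in for the kernel32 calls, and `count_calls` and `c_abs` are illustrative names, not part of the report. It does not identify which lock is involved, only whether calls through a shared function object scale with the thread count.

```python
import ctypes
import sys
import threading
import time

# Assumption: any cheap C function works as a stand-in; abs() is used here.
libc = ctypes.CDLL("msvcrt") if sys.platform == "win32" else ctypes.CDLL(None)
c_abs = libc.abs  # one function object shared by every thread


def count_calls(func, n_threads, duration=0.2):
    """Return the total number of calls completed by n_threads in ~duration seconds."""
    counts = [0] * n_threads
    stop = threading.Event()

    def worker(index):
        n = 0
        while not stop.is_set():
            func(-1)
            n += 1
        counts[index] = n

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    time.sleep(duration)
    stop.set()
    for t in threads:
        t.join()
    return sum(counts)


if __name__ == "__main__":
    single = count_calls(c_abs, 1)
    multi = count_calls(c_abs, 4)
    print(f"1 thread:  {single} calls")
    print(f"4 threads: {multi} calls")
```

On a free-threaded build, a 4-thread total that is no higher (or lower) than the 1-thread total would be consistent with the calls being serialized; on a GIL build the totals are expected to be similar regardless.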

kumaraditya303 (Contributor) commented on May 24, 2025

I think it is because of the critical section around PyCFuncPtr_call: restype is mutable, so PyCFuncPtr_call needs locking. I'll look into making it lock-free.
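
To illustrate why restype counts as mutable per-object state, the sketch below shows that attribute access on a loaded library returns one cached function object, while index access returns a fresh object each time, each carrying its own restype. `abs` from the C runtime is again only a stand-in, and the variable names are illustrative. Whether giving each thread its own function object avoids the contention seen in the 3.14 beta is an assumption based on the lock being per object; it was not measured here.

```python
import ctypes
import sys

libc = ctypes.CDLL("msvcrt") if sys.platform == "win32" else ctypes.CDLL(None)

# Attribute access is cached: both names refer to the same function object.
shared_a = libc.abs
shared_b = libc.abs
print(shared_a is shared_b)    # same object, hence the same restype and the same lock

# Index access creates a new function object on every lookup.
private_a = libc["abs"]
private_b = libc["abs"]
print(private_a is private_b)  # distinct objects

# restype can be reassigned at any time, and only affects the object it is set on.
private_a.restype = ctypes.c_uint
print(private_a.restype, private_b.restype)
print(private_a(-5), private_b(-5))
```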

SolsticeProjekt (Author) commented on May 24, 2025

> In 3.14, we made ctypes thread-safe, so this is probably the result of lock contention. I'm honestly surprised it's not crashing in 3.13. Would you mind benchmarking to find which lock is causing the problem?
>
> cc @kumaraditya303

Are you saying that my IOCP server, using 3.13t, shouldn't actually be working? Because it does. Flawlessly, even under load heavy enough to throttle the rest of the system.

Is there anything I can do to help?

ZeroIntensity (Member) commented on May 25, 2025

> I think it is because of using critical section around PyCFuncPtr_call because restype is mutable so it needs locking in PyCFuncPtr_call, I'll look into making it lock free.

Oh, yeah, that sounds problematic. I'd be ok with removing that critical section entirely (or only holding it in places where we actually access per-object state), because it should be up to the user to serialize their own C calls.
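
As an illustration of user-side serialization, a caller can guard a C function that is not safe to call concurrently with an ordinary lock. This is a minimal sketch, assuming the C `rand` function as the stand-in (the C standard does not require it to be free of data races); `_rand_lock` and `serialized_rand` are hypothetical names.

```python
import ctypes
import sys
import threading

libc = ctypes.CDLL("msvcrt") if sys.platform == "win32" else ctypes.CDLL(None)

# One lock per C resource that must not be entered concurrently.
_rand_lock = threading.Lock()


def serialized_rand():
    """Call C rand() while holding a lock, so only one thread is inside it at a time."""
    with _rand_lock:
        return libc.rand()


if __name__ == "__main__":
    threads = [
        threading.Thread(target=lambda: [serialized_rand() for _ in range(1000)])
        for _ in range(4)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("done")
```

Calls that are already thread-safe on the C side, such as the IOCP functions in the benchmark, would not need this kind of wrapper.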

kumaraditya303 (Contributor) commented on May 26, 2025

I have a fix at #134702 which makes it lock-free in the general case and fixes the performance regression.
