Description
Bug report
Bug description:
Hello, I'm writing a thesis on free-threaded Python, so I'm testing 3.13.0b1 built with --disable-gil.
I installed it with pyenv using this command:
env PYTHON_CONFIGURE_OPTS='--disable-gil' pyenv install 3.13.0b1
I didn't specify --enable-optimizations and --with-lto because with those the build would fail.
I'm now writing a benchmark to compare free-threaded Python with earlier versions of regular Python, and also with the nogil fork of 3.9.10.
Here's the problem. The benchmark is a simple matrix-matrix multiplication script that splits the matrix into rows and distributes the rows to a specified number of threads. This is the complete code:
import threading
import time
import random

def multiply_row(A, B, row_index, result):
    # Compute the row result
    num_columns_B = len(B[0])
    num_columns_A = len(A[0])
    for j in range(num_columns_B):
        sum = 0
        for k in range(num_columns_A):
            sum += A[row_index][k] * B[k][j]
        result[row_index][j] = sum

def parallel_matrix_multiplication(a, b, result, row_indices):
    for row_index in row_indices:
        multiply_row(a, b, row_index, result)

def multi_threaded_matrix_multiplication(a, b, num_threads):
    num_rows = len(a)
    result = [[0] * len(b[0]) for _ in range(num_rows)]
    # Integer division: range() below needs integer bounds
    row_chunk = num_rows // num_threads
    threads = []
    for i in range(num_threads):
        start_row = i * row_chunk
        end_row = (i + 1) * row_chunk if i != num_threads - 1 else num_rows
        thread = threading.Thread(target=parallel_matrix_multiplication, args=(a, b, result, range(start_row, end_row)))
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()
    return result

# Helper function to create a random matrix
def create_random_matrix(rows, cols):
    return [[random.random() for _ in range(cols)] for _ in range(rows)]

def main():
    size = 500  # Define matrix size
    a = create_random_matrix(size, size)
    b = create_random_matrix(size, size)
    num_threads = 8  # Define number of threads
    start = time.perf_counter()
    result = multi_threaded_matrix_multiplication(a, b, num_threads)
    print("Matrix multiplication completed.", time.perf_counter() - start, "seconds.")

if __name__ == "__main__":
    main()
When I run this code with these versions of Python (3.9.10, nogil-3.9.10, 3.10.13, 3.11.8, 3.12.2), the longest running time is ~13 seconds with regular 3.9.10 and the shortest is ~5 seconds with nogil 3.9.10.
When I run it with 3.13.0b1, the time skyrockets to ~48 seconds.
I tried profiling the code with cProfile, but with 3.13 it freezes and never outputs anything (it works with the other versions). Instead, CPU usage goes to 100% and stays there until I kill the process, which makes me think it isn't using multiple cores, since nogil 3.9 goes to >600% usage.
The basic Fibonacci test works like a charm, so I know the --disable-gil build succeeded.
All of this was done on a MacBook Air M1 with 16 GB of RAM and 8 CPU cores.
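A quick way to double-check that an interpreter is a free-threaded build and that the GIL is actually off at runtime is sketched below. It relies on `sysconfig.get_config_var("Py_GIL_DISABLED")` and `sys._is_gil_enabled()`, which exist in 3.13; the helper name `describe_gil_state` is made up for this example, and the fallback for older interpreters assumes the GIL is on, which is not meaningful for the nogil 3.9 fork.

```python
import sys
import sysconfig


def describe_gil_state():
    """Report whether this is a free-threaded build and whether the GIL is active."""
    # Py_GIL_DISABLED is 1 in the build configuration of free-threaded builds (3.13+).
    # On older versions the variable is absent and get_config_var returns None.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

    # sys._is_gil_enabled() was added in 3.13. Where it is missing, assume the GIL
    # is on (an assumption that does not hold for the nogil 3.9 fork).
    is_gil_enabled = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = is_gil_enabled() if is_gil_enabled is not None else True

    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}


if __name__ == "__main__":
    print(sys.version)
    print(describe_gil_state())
```

On a free-threaded build the GIL can still be re-enabled at runtime, for example when an extension module that does not declare free-threading support is imported, so checking both values can be worthwhile.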
CPython versions tested on:
3.9, 3.10, 3.11, 3.12, 3.13
Operating systems tested on:
macOS
Activity
Eclips4 commented on Jun 4, 2024
Duplicate of #118749
nogil multi-threading is slower than multi-threading with gil for CPU bound #118749
xbit18 commented on Jun 4, 2024
Doesn't seem like a duplicate to me. The version is different, he was using 3.13.0a6, mine's beta 1, and he had problems with the fibonacci script, which works ok for me. @Eclips4
colesbury commented on Jun 4, 2024
Yeah, you are going to encounter contention on the shared lists: both on the per-list locks and the reference count fields.
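As an illustration of what avoiding the shared lists could look like, here is a minimal sketch in which each thread copies its block of rows and a transposed copy of the second matrix before multiplying, so the inner loops only index lists private to that thread. This is not something proposed in this thread, the function names are hypothetical, and the float objects inside the lists are still shared, so reference count contention on them may remain. Whether it helps on a given build would need to be measured.

```python
import threading


def multiply_rows_private(a_rows, b, out, out_slot):
    """Multiply a block of rows by b using only thread-private list objects."""
    # Copy the inputs so the inner loops index lists that no other thread touches.
    # Assumption: this only targets contention on the list objects themselves;
    # the numbers stored in the lists are still shared between threads.
    local_a = [row[:] for row in a_rows]
    local_b_cols = [list(col) for col in zip(*b)]  # transpose of b, one list per column

    block = []
    for row in local_a:
        block.append([sum(x * y for x, y in zip(row, col)) for col in local_b_cols])
    out[out_slot] = block  # a single write per thread into the shared container


def matmul_private_inputs(a, b, num_threads):
    """Split the rows of a across threads; each thread works on private copies."""
    num_rows = len(a)
    chunk = num_rows // num_threads
    blocks = [None] * num_threads
    threads = []
    for i in range(num_threads):
        start = i * chunk
        end = (i + 1) * chunk if i != num_threads - 1 else num_rows
        t = threading.Thread(
            target=multiply_rows_private, args=(a[start:end], b, blocks, i)
        )
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    # Blocks are stored by thread index, so concatenating keeps the row order.
    return [row for block in blocks for row in block]
```

The copies cost extra memory and time per thread, which is the trade-off for keeping the hot loops away from objects that other threads are reading.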
xbit18 commented on Jun 4, 2024
Ok, so just to be clear: is this expected behavior because the free-threading implementation is still incomplete, or would it behave the same if it were fully implemented?
colesbury commented on Jun 4, 2024
This is the expected behavior -- it is not changing.
xbit18 commented on Jun 6, 2024
Ok thank you.
Knowing this I changed the code so that it doesn't use a shared list "result" but thread-local results which are then combined. It doesn't really seem to be having any effect. Am I missing something?
Screenshot: timings for the same code run with different Python versions (3.13 is the free-threaded build)

Also, I don't know whether this helps, but I've noticed that increasing the number of threads seems to make things worse. For example, using 2 threads I got ~20 seconds, using 8 I got ~40 seconds and using 16 I got ~50 seconds.

Screenshot: timings for different numbers of threads (all with 3.13.0b1)
This is the updated code, as you can see it doesn't use shared lists anymore but every thread creates a local list which it returns and then all the lists are combined:
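A minimal sketch of the approach described here, in which each thread builds its own list of result rows and the lists are concatenated after all threads have joined, might look like the following. It is not the poster's script, and the names `multiply_rows_local` and `matmul_thread_local` are hypothetical. Note that the input matrices `a` and `b` are still read by every thread in this version.

```python
import threading


def multiply_rows_local(a, b, row_indices):
    """Return the product rows for row_indices as a new list owned by the caller."""
    num_cols_b = len(b[0])
    num_cols_a = len(a[0])
    local_rows = []
    for i in row_indices:
        row = [0] * num_cols_b
        for j in range(num_cols_b):
            total = 0
            for k in range(num_cols_a):
                total += a[i][k] * b[k][j]  # a and b are still shared inputs
            row[j] = total
        local_rows.append(row)
    return local_rows


def matmul_thread_local(a, b, num_threads):
    """Each thread computes its rows into a local list; results are combined at the end."""
    num_rows = len(a)
    chunk = num_rows // num_threads
    partials = [None] * num_threads

    def worker(slot, rows):
        # The only write to shared state: one assignment per thread.
        partials[slot] = multiply_rows_local(a, b, rows)

    threads = []
    for i in range(num_threads):
        start = i * chunk
        end = (i + 1) * chunk if i != num_threads - 1 else num_rows
        t = threading.Thread(target=worker, args=(i, range(start, end)))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()

    result = []
    for part in partials:
        result.extend(part)
    return result
```

Because only the output is made thread-local, every `a[i][k]` and `b[k][j]` lookup in the inner loop still goes through lists shared by all threads.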
iperov commented on Jun 6, 2024
Sorry guys,
where can I download a JIT+noGIL build for Windows for testing? I don't want to mess with compiling it myself.
xbit18 commented on Jun 6, 2024