3.13.0b1 Performance Issue with Free Threading build #120040

Not planned
@xbit18

Bug report

Bug description:

Hello, I'm writing a thesis on free-threaded Python, so I'm testing 3.13.0b1 built with --disable-gil.
I installed it with pyenv using this command:

env PYTHON_CONFIGURE_OPTS='--disable-gil' pyenv install 3.13.0b1

I didn't specify --enable-optimizations and --with-lto because the build fails with those options.
I'm now writing a benchmark to compare free-threaded Python with earlier versions of regular Python and with the nogil fork of 3.9.10.
Here's the problem. The benchmark is a simple matrix-matrix multiplication script that splits the first matrix into rows and distributes the rows across a specified number of threads. This is the complete code:

import threading
import time
import random

def multiply_row(A, B, row_index, result):
    # Compute the row result
    num_columns_B = len(B[0])
    num_columns_A = len(A[0])
    for j in range(num_columns_B):
        total = 0  # avoid shadowing the built-in sum()
        for k in range(num_columns_A):
            total += A[row_index][k] * B[k][j]
        result[row_index][j] = total

def parallel_matrix_multiplication(a, b, result, row_indices):
    for row_index in row_indices:
        multiply_row(a, b, row_index, result)

def multi_threaded_matrix_multiplication(a, b, num_threads):
    num_rows = len(a)
    result = [[0] * len(b[0]) for _ in range(num_rows)]
    row_chunk = num_rows // num_threads  # integer division: range() needs int bounds

    threads = []
    for i in range(num_threads):
        start_row = i * row_chunk
        end_row = (i + 1) * row_chunk if i != num_threads - 1 else num_rows
        thread = threading.Thread(target=parallel_matrix_multiplication, args=(a, b, result, range(start_row, end_row)))
        threads.append(thread)
        thread.start()

    for thread in threads:
        thread.join()

    return result

# Helper function to create a random matrix
def create_random_matrix(rows, cols):
    return [[random.random() for _ in range(cols)] for _ in range(rows)]

def main():
    size = 500  # Define matrix size
    a = create_random_matrix(size, size)
    b = create_random_matrix(size, size)
    num_threads = 8  # Define number of threads

    start = time.perf_counter()
    
    result = multi_threaded_matrix_multiplication(a, b, num_threads)
    print("Matrix multiplication completed.", time.perf_counter() - start, "seconds.")

if __name__ == "__main__":
    main()

When I run this code with 3.9.10, nogil-3.9.10, 3.10.13, 3.11.8 and 3.12.2, the longest running time is ~13 seconds (regular 3.9.10) and the shortest is ~5 seconds (nogil 3.9.10).
When I run it with 3.13.0b1, the time skyrockets to ~48 seconds.
I tried profiling the code with cProfile. It works with the other versions, but with 3.13 it freezes, never outputs anything, and never stops unless I kill the process. Meanwhile CPU usage sits at 100%, which makes me think it isn't using multiple cores, since nogil 3.9 goes above 600%.

The basic fibonacci test works like a charm, so I know the --disable-gil build succeeded.
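A quicker sanity check than a benchmark is to ask the interpreter directly. The sketch below is an illustration added here, not something from the thread: the helper name `describe_gil_state` is hypothetical, it assumes `sysconfig` exposes the `Py_GIL_DISABLED` build variable, and it assumes `sys._is_gil_enabled()` is available (it exists in 3.13 releases; older interpreters and the 3.9 nogil fork are treated as "unknown, assume GIL on").

```python
import sys
import sysconfig


def describe_gil_state():
    """Return (built_free_threaded, gil_enabled_now) for this interpreter."""
    # Py_GIL_DISABLED is set to 1 when CPython was configured with --disable-gil;
    # it is None or 0 otherwise.
    built_free_threaded = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

    # sys._is_gil_enabled() reports the runtime state. Where it does not exist,
    # fall back to assuming the GIL is active (an assumption, not a measurement).
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled_now = bool(check()) if check is not None else True
    return built_free_threaded, gil_enabled_now


if __name__ == "__main__":
    built, enabled = describe_gil_state()
    print("free-threaded build:", built)
    print("GIL enabled at runtime:", enabled)
```

Both values matter: a free-threaded build can still run with the GIL switched back on at runtime, so the build flag alone does not prove that threads run in parallel.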

All of this is done on a MacBook Air M1 with 16 GB of RAM and 8 CPU cores.

CPython versions tested on:

3.9, 3.10, 3.11, 3.12, 3.13

Operating systems tested on:

macOS

Activity

added the type-bug label on Jun 4, 2024
Eclips4

Eclips4 commented on Jun 4, 2024

@Eclips4
Member

Duplicate of #118749

xbit18

xbit18 commented on Jun 4, 2024

@xbit18
Author

Doesn't seem like a duplicate to me. The version is different: he was using 3.13.0a6 and I'm on beta 1. He also had problems with the fibonacci script, which works fine for me. @Eclips4

reopened this on Jun 4, 2024
colesbury

colesbury commented on Jun 4, 2024

@colesbury
Contributor

Yeah, you are going to encounter contention on the shared lists: both on the per-list locks and the reference count fields.
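One way to probe this explanation is to give every thread its own list containers, so the hot loop indexes only objects that the thread itself created. The sketch below is an editorial illustration, not code from the thread: the names `multiply_rows_private` and `matmul_private_copies` are hypothetical, and it copies only the list containers. The float elements are still shared between threads, so whether this removes enough contention to matter is an assumption to verify by measurement.

```python
import threading


def multiply_rows_private(a, b, rows, blocks, slot):
    """Compute the given rows of a @ b using thread-private list copies."""
    # Copy the containers once, up front. After this point the inner loops
    # index only lists created by this thread, not the shared a and b.
    local_rows = [list(a[i]) for i in rows]
    local_cols = [list(col) for col in zip(*b)]  # columns of b

    block = [
        [sum(x * y for x, y in zip(row, col)) for col in local_cols]
        for row in local_rows
    ]
    # One write per thread into a pre-sized list of output slots.
    blocks[slot] = block


def matmul_private_copies(a, b, num_threads):
    n = len(a)
    chunk = -(-n // num_threads)  # ceiling division
    blocks = [None] * num_threads
    threads = []
    for t in range(num_threads):
        rows = range(t * chunk, min((t + 1) * chunk, n))
        th = threading.Thread(
            target=multiply_rows_private, args=(a, b, rows, blocks, t)
        )
        threads.append(th)
        th.start()
    for th in threads:
        th.join()

    # Stitch the per-thread row blocks back together in order.
    result = []
    for block in blocks:
        result.extend(block)
    return result


if __name__ == "__main__":
    print(matmul_private_copies([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2))
```

The trade-off is extra memory and copy time per thread, which only pays off if the shared-container contention is in fact the dominant cost.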

xbit18

xbit18 commented on Jun 4, 2024

@xbit18
Author

> Yeah, you are going to encounter contention on the shared lists: both on the per-list locks and the reference count fields.

OK, so just to be clear: is this expected behavior because the free-threading implementation is still incomplete, or would it behave the same once fully implemented?

colesbury

colesbury commented on Jun 4, 2024

@colesbury
Contributor

This is the expected behavior -- it is not changing.

xbit18

xbit18 commented on Jun 6, 2024

@xbit18
Author

Ok thank you.
Knowing this, I changed the code so that it no longer uses a shared "result" list; each thread builds a thread-local result and the results are then combined. It doesn't seem to have any effect. Am I missing something?

[Image: screenshot of timings for the same code run with different Python versions (3.13 is free-threaded)]

Also, I don't know if it helps, but I've noticed that increasing the number of threads seems to make things worse. For example, with 2 threads I got ~20 seconds, with 8 I got 40 seconds and with 16 I got ~50 seconds.
[Image: screenshot of timings for different numbers of threads (all with 3.13.0b1)]
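To tell "threads do not scale" apart from "this workload contends on shared objects", it can help to time a control workload that touches no shared containers at the same thread counts. The harness below is a minimal sketch added for illustration: `time_with_threads` and `cpu_work` are hypothetical names, and the workload sizes are arbitrary.

```python
import threading
import time


def time_with_threads(work, total_items, thread_counts):
    """Run work(range_of_items) split over n threads; return {n: seconds}."""
    timings = {}
    for n in thread_counts:
        chunk = -(-total_items // n)  # ceiling division
        threads = [
            threading.Thread(
                target=work,
                args=(range(i * chunk, min((i + 1) * chunk, total_items)),),
            )
            for i in range(n)
        ]
        start = time.perf_counter()
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        timings[n] = time.perf_counter() - start
    return timings


def cpu_work(indices):
    # Control workload: arithmetic on local variables only, so no list or
    # other container is shared between threads.
    total = 0.0
    for i in indices:
        for k in range(200):
            total += i * k
    return total


if __name__ == "__main__":
    for n, seconds in time_with_threads(cpu_work, 20000, [1, 2, 4, 8]).items():
        print(f"{n} threads: {seconds:.3f} s")
```

If the control workload speeds up with more threads while the matrix benchmark slows down, that points at contention in the benchmark rather than at the build.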

This is the updated code. As you can see, it no longer writes to a shared result list: every thread creates and returns a local list, and then all the lists are combined:

import time
import random
from concurrent.futures import ThreadPoolExecutor

def multiply_row(A, B, row_index, local_result):
    num_columns_B = len(B[0])
    num_columns_A = len(A[0])
    for j in range(num_columns_B):
        total = 0  # avoid shadowing the built-in sum()
        for k in range(num_columns_A):
            total += A[row_index][k] * B[k][j]
        local_result[row_index][j] = total

def parallel_matrix_multiplication(a, b, start_row, end_row):
    local_result = [[0] * len(b[0]) for _ in range(len(a))]
    
    for row_index in range(start_row, end_row):
        multiply_row(a, b, row_index, local_result)
    
    return local_result

def multi_threaded_matrix_multiplication(a, b, num_threads):
    num_rows = len(a)
    result = []
    for _ in range(num_rows):
        result.append([0] * len(b[0]))
    row_chunk = num_rows // num_threads  # integer division: range() needs int bounds

    futures = []
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        for i in range(num_threads):
            start_row = i * row_chunk
            end_row = (i + 1) * row_chunk if i != num_threads - 1 else num_rows
            future = executor.submit(parallel_matrix_multiplication, a, b, start_row, end_row)
            futures.append(future)
    
    results = [future.result() for future in futures]

    # Combine local results into the final result matrix
    for local_result in results:
        for i in range(num_rows):
            for j in range(len(b[0])):
                result[i][j] += local_result[i][j]

    return result

# Helper function to create a random matrix
def create_random_matrix(rows, cols):
    return [[random.random() for _ in range(cols)] for _ in range(rows)]

def main():
    size = 500  # Define matrix size
    
    a = create_random_matrix(size, size)
    b = create_random_matrix(size, size)

    num_threads = 8 # Define number of threads
    
    start = time.perf_counter()
    
    result = multi_threaded_matrix_multiplication(a, b, num_threads)
    print("Matrix multiplication completed.", time.perf_counter() - start, "seconds.")

if __name__ == "__main__":
    main()
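In the version above every worker allocates a full-size matrix and the combine step adds all of them element by element, which adds work that grows with the thread count. A leaner variant, sketched here as an illustration rather than as the author's code, has each worker return only the rows it computed so the results can simply be concatenated. The names `multiply_block` and `threaded_matmul` are hypothetical, and the inputs `a` and `b` are still shared between threads, so this addresses only the output side.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_block(a, b, start_row, end_row):
    """Return rows start_row..end_row-1 of a @ b as a new list of lists."""
    cols = list(zip(*b))  # columns of b, built once per worker
    return [
        [sum(x * y for x, y in zip(a[i], col)) for col in cols]
        for i in range(start_row, end_row)
    ]


def threaded_matmul(a, b, num_threads):
    n = len(a)
    chunk = n // num_threads
    # The last worker picks up any leftover rows.
    bounds = [
        (i * chunk, (i + 1) * chunk if i != num_threads - 1 else n)
        for i in range(num_threads)
    ]
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = [
            executor.submit(multiply_block, a, b, start, end)
            for start, end in bounds
        ]
        result = []
        for future in futures:  # submission order equals row order
            result.extend(future.result())
    return result


if __name__ == "__main__":
    print(threaded_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2))
```

Memory use drops from one full matrix per thread to one matrix in total, and the combine step becomes a concatenation instead of an element-wise sum.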
iperov

iperov commented on Jun 6, 2024

@iperov

Sorry guys,

where can I download a JIT+noGIL build for Windows for testing? I don't want to mess with the compilation.

