Skip to content

difflib.SequenceMatcher.find_longest_match return wrong result #134116

Closed
@unisgn

Description

@unisgn

Bug report

Bug description:

Python 3.13.3 (main, Apr 9 2025, 07:44:25) [GCC 14.2.1 20250207]

import difflib
s1='TZiSLutKO5xRiAkkw1ZGkpZsq4'
s2='hTkmKeyY0WYoEqn7xD6jDRwRU4quqozyh8WFwkYY82h9wVv93iUzijw4Q8JYh4l496RD20dsTmy1T0Tl5D1sLRetaW2PP75f9fLeSCllRmISdDFLb3QazkubtOAjZ95a5Ril7NdVIX8hJWlJgwmhd7FGlO5aQQQbLeQcSEFqmiDZOnBWoAisj9YeKHiihm2QzAsdZAN78CO8tXEHfKjCOoZWQo513tEJ4b26BRItK'
m = difflib.SequenceMatcher(None, s1, s2).find_longest_match(ahi=len(s1),bhi=len(s2))
assert s1.find('tK') == 6
assert len(s2) == 233
assert s2.find('tK') == 231
assert m == difflib.Match(3,100,1)

CPython versions tested on:

3.13

Operating systems tested on:

Linux

Activity

added
type-bugAn unexpected behavior, bug, or error
on May 16, 2025
picnixz

picnixz commented on May 16, 2025

@picnixz
Member

What's your expected result and what's the result that is found?

added
pendingThe issue will be closed if no feedback is provided
on May 16, 2025
unisgn

unisgn commented on May 16, 2025

@unisgn
Author

What's your expected result and what's the result that is found?

since 'tK' in both s1 and s2, i think at least we expect:
assert m.size >= len('tK')

ZeroIntensity

ZeroIntensity commented on May 16, 2025

@ZeroIntensity
Member

Oh how I love fuzzer-generated repros

tim-one

tim-one commented on May 16, 2025

@tim-one
Member

Pass autojunk=False to the SequenceMatcher constructor. Then you'll get Match(a=6, b=231, size=2) as the result. As is,

>>> difflib.SequenceMatcher(None, s1, s2).bpopular
{'Y', 'T', 'F', 'm', 'w', 'e', '9', 'b', 'W', '5', 't', '4', 'j', 'd', 'J', 's', 'Q', 'D', 'i', 'L', '7', 'Z', 'A', 'R', 'O', 'q', 'h', 'o', '8', 'K', 'l', 'z', '2', 'a', 'E'}

Those characters show up so often that, by default (autojunk=True), the matcher considers them to be noise characters, and saves enormous amounts of time by not looking for matches containing them. 't' is in that set.

Turn off autojunk to make it look at every possibility. Less surprising that way, but may take a lot longer.

unisgn

unisgn commented on May 16, 2025

@unisgn
Author

Pass autojunk=False to the SequenceMatcher constructor. Then you'll get Match(a=6, b=231, size=2) as the result. As is,

difflib.SequenceMatcher(None, s1, s2).bpopular
{'Y', 'T', 'F', 'm', 'w', 'e', '9', 'b', 'W', '5', 't', '4', 'j', 'd', 'J', 's', 'Q', 'D', 'i', 'L', '7', 'Z', 'A', 'R', 'O', 'q', 'h', 'o', '8', 'K', 'l', 'z', '2', 'a', 'E'}

Those characters show up so often that, by default (autojunk=True), the matcher considers them to be noise characters, and saves enormous amounts of time by not looking for matches containing them. 't' is in that set.

Turn off autojunk to make it look at every possibility. Less surprising that way, but may take a lot longer.

Ok. Now I get it. Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    pendingThe issue will be closed if no feedback is providedstdlibPython modules in the Lib dirtype-bugAn unexpected behavior, bug, or error

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      difflib.SequenceMatcher.find_longest_match return wrong result · Issue #134116 · python/cpython

      Follow Lee on X/Twitter - Father, Husband, Serial builder creating AI, crypto, games & web tools. We are friends :) AI Will Come To Life!

      Check out: eBank.nz (Art Generator) | Netwrck.com (AI Tools) | Text-Generator.io (AI API) | BitBank.nz (Crypto AI) | ReadingTime (Kids Reading) | RewordGame | BigMultiplayerChess | WebFiddle | How.nz | Helix AI Assistant