Skip to content

is_zipfile false positives #72680

Closed
Closed
@thomaswaldmann

Description

@thomaswaldmann
BPO 28494
Nosy @Yhg1s, @gpshead, @vstinner, @serhiy-storchaka, @ThomasWaldmann, @jjolly, @miss-islington
PRs
  • gh-72680: Fix false positives when using zipfile.is_zipfile() #5053
  • bpo-28494: Test existing zipfile working behavior. #15853
  • [3.8] bpo-28494: Test existing zipfile working behavior. (GH-15853) #15891
  • bpo-28494: install ziptestdata to fix install bot. #15902
  • [3.8] bpo-28494: install ziptestdata to fix install bot (GH-15902) #15912
  • Files
  • isz_fail.py
  • isz_fail_fix.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/gpshead'
    closed_at = None
    created_at = <Date 2016-10-20.21:08:48.959>
    labels = ['3.7', '3.8', 'type-bug', 'library']
    title = 'is_zipfile false positives'
    updated_at = <Date 2019-09-11.11:09:56.949>
    user = 'https://github.com/ThomasWaldmann'

    bugs.python.org fields:

    activity = <Date 2019-09-11.11:09:56.949>
    actor = 'miss-islington'
    assignee = 'gregory.p.smith'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2016-10-20.21:08:48.959>
    creator = 'Thomas.Waldmann'
    dependencies = []
    files = ['45162', '45164']
    hgrepos = []
    issue_num = 28494
    keywords = ['patch']
    message_count = 16.0
    messages = ['279084', '279088', '279089', '280341', '281805', '281817', '309245', '311281', '334595', '351723', '351749', '351750', '351751', '351755', '351802', '351823']
    nosy_count = 9.0
    nosy_names = ['twouters', 'gregory.p.smith', 'alanmcintyre', 'vstinner', 'serhiy.storchaka', 'Thomas.Waldmann', 'mryan1539', 'jjolly', 'miss-islington']
    pr_nums = ['5053', '15853', '15891', '15902', '15912']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue28494'
    versions = ['Python 2.7', 'Python 3.7', 'Python 3.8']

    Linked PRs

    Activity

    thomaswaldmann

    thomaswaldmann commented on Oct 20, 2016

    @thomaswaldmann
    MannequinAuthor

    zipfile.is_zipfile has false positives way too easily.

    I just have seen it in practive when a MoinMoin wiki site with a lot of pdf attachments crashed with 500. This was caused by a valid PDF that just happened to contain PK\005\006 somewhere in the middle - this was enough to satisfy is_zipfile() and triggered further processing as a zipfile, which then crashed with IOError (which was not catched in our code, yet).

    I have looked into zipfile code: if the usual EOCD structure (with empty comment) is not at EOF, it is suspected that there might be a non-empty comment and ~64K before EOF are searched for the PK\005\006 magic. If it is somewhere there, it is assumed that the file is a zip, without any further validity check.

    Attached is a failure demo that works with at least 2.7 and 3.5.

    https://en.wikipedia.org/wiki/Zip_(file_format)

    added
    stdlibPython modules in the Lib dir
    type-bugAn unexpected behavior, bug, or error
    on Oct 20, 2016
    thomaswaldmann

    thomaswaldmann commented on Oct 20, 2016

    @thomaswaldmann
    MannequinAuthor

    patch for py2.7

    The EOCD structure is at EOF.

    It either does not contain a comment (this is what the existing code checks first) or it contains a comment of the length that is specified in the structure.

    The patch checks consistency specified length vs. real length (end of fixed part of structure up to EOF). If this does not match, it is likely not a zip file, but just a file that happens to have the magic 4 bytes somewhere in its last 64kB.

    thomaswaldmann

    thomaswaldmann commented on Oct 20, 2016

    @thomaswaldmann
    MannequinAuthor

    Note: checking the first bytes of the file (PK..) might be another option.

    But this has the "problem" that a self-extracting zip starts with an executable that has different first bytes.

    So whether this is an option or not depends on whether is_zipfile() should return truish for self-extracting ZIP files.

    serhiy-storchaka

    serhiy-storchaka commented on Nov 8, 2016

    @serhiy-storchaka
    Member

    The problem is that the zipfile module supports even not well-formed archives, with a data appended past a comment, and with truncated comment. There are special tests for this, and the proposed patch breaks these tests: test_comments, test_ignores_newline_at_end, test_ignores_stuff_appended_past_comments. See bpo-10694 and bpo-1622.

    thomaswaldmann

    thomaswaldmann commented on Nov 27, 2016

    @thomaswaldmann
    MannequinAuthor

    Well, if you have a better idea how to fix is_zipfile, go on.

    I even suggested an alternative, how about that?

    It is a miserable state when the is_zipfile function in the stdlib detects random crap as a zip file.

    serhiy-storchaka

    serhiy-storchaka commented on Nov 27, 2016

    @serhiy-storchaka
    Member

    No, checking the first bytes of the file is not appropriate option. zipfile should support the Python zip application format [1].

    I see two options:

    1. Make is_zipfile() more strict that the ZipFile constructor. The later supports ZIP files with a data past the comment or with truncated comments, but the former should reject them.

    2. Make both is_zipfile() and the ZipFile constructor more robust. They should check not just the EOCD signature, but check the Zip64 end of central directory record (if exists) and the first central file header signature (if the ZIP file is not empty).

    It may be that PDF files contain PK\005\006 not accidentally, but because they contain embedded ZIP files (I don't know if this is a case). In that circumstances is_zipfile() returning True is correct.

    [1] https://docs.python.org/3/library/zipapp.html

    jjolly

    jjolly commented on Dec 30, 2017

    @jjolly
    Mannequin

    Fix submitted that evaluates the ECD structure and validates the first CD entry. The fix also handles empty zipfiles.

    IMO the purpose of this API is to *quickly* verify that the file is a valid zipfile. With this fix, the API only reads another 46 bytes of data (after a seek, of course). This should still qualify as "quick", especially after having potentially read 64k of data.

    Perhaps a full zip validator would be appropriate in addition to is_zipfile. That would be more appropriate as a full feature rather than in this bugfix.

    jjolly

    jjolly commented on Jan 30, 2018

    @jjolly
    Mannequin

    Is there any chance that this will make it into 3.7 or is my reminder too late?

    gpshead

    gpshead commented on Jan 30, 2019

    @gpshead
    Member

    it's a bugfix, it seems reasonable for 3.7 to me. I agree that the previous is_zipfile check is too lenient. I'll follow up on jjolly's PR for any specific concerns I have with the implementation.

    33 remaining items

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Metadata

    Metadata

    Assignees

    No one assigned

      Labels

      stdlibPython modules in the Lib dirtype-bugAn unexpected behavior, bug, or error

      Projects

      Status

      Done

      Milestone

      No milestone

      Relationships

      None yet

      Development

      No branches or pull requests

      Issue actions

        is_zipfile false positives · Issue #72680 · python/cpython

        Follow Lee on X/Twitter - Father, Husband, Serial builder creating AI, crypto, games & web tools. We are friends :) AI Will Come To Life!

        Check out: eBank.nz (Art Generator) | Netwrck.com (AI Tools) | Text-Generator.io (AI API) | BitBank.nz (Crypto AI) | ReadingTime (Kids Reading) | RewordGame | BigMultiplayerChess | WebFiddle | How.nz | Helix AI Assistant