python-magic Incorrectly Classifies HTML as text/plain #207

Open
@Luis-manzur


Problem
Our current content-type detection, which uses the python-magic library, misidentifies HTML content as text/plain when the <html> tag is missing, even when <div> or other HTML tags are present. As a result, HTML fragments are handled incorrectly.

Solution
We'll enhance detection by manually checking the content for <html> or <div> tags. If either is found, we'll explicitly set the MIME type to text/html, overriding python-magic's result.

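import magic

# content is expected to be bytes
mime = magic.from_buffer(content, mime=True)

# If the content contains an <html or <div tag, override the detected MIME type with text/html
lowered = content.lower()  # lowercase once so the check is case-insensitive
if b"<html" in lowered or b"<div" in lowered:
    mime = "text/html"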
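
A slightly more defensive variant of the override above could look like the sketch below. It is not part of the proposed change. The names `looks_like_html`, `detect_mime`, and `scan_limit` are hypothetical, and the detector is passed in as a callable so the logic can be exercised without python-magic installed. Unlike the snippet above, this version only overrides a text/plain result and matches whole tag names, so a string such as `<divider` would not trigger it.

```python
import re
from typing import Callable

# Matches an opening <html or <div tag, case-insensitively.
# The tag list mirrors the two tags named in this issue; extending it is an assumption.
_HTML_TAG_RE = re.compile(rb"<\s*(html|div)\b", re.IGNORECASE)


def looks_like_html(content: bytes, scan_limit: int = 4096) -> bool:
    """Return True if an <html or <div tag appears in the first scan_limit bytes."""
    return _HTML_TAG_RE.search(content[:scan_limit]) is not None


def detect_mime(content: bytes, detect: Callable[[bytes], str]) -> str:
    """Run the given detector, then upgrade text/plain to text/html if tags are found."""
    mime = detect(content)
    if mime == "text/plain" and looks_like_html(content):
        return "text/html"
    return mime
```

With python-magic, the detector could be supplied as `lambda data: magic.from_buffer(data, mime=True)`. Restricting the override to text/plain results is a design choice of this sketch: it avoids relabeling content that python-magic already identified as something else.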
