MsWordDocumentBackend crashes on .docx files with external image references (TargetMode="External") #3113

@rongo-ms

Bug

MsWordDocumentBackend._handle_pictures() crashes when processing .docx files that contain image relationships with TargetMode="External" (e.g., images that reference URLs or local file:/// paths instead of being embedded in word/media/).

The crash occurs because python-docx's _Relationship.target_part property raises ValueError when accessed on an external relationship, and Docling's code doesn't guard against this.

Error:

ValueError: target_part property on _Relationship is undefined when target mode is External

Docling async API returns:

{
  "status": "failure",
  "errors": [{
    "component_type": "pipeline",
    "module_name": "SimplePipeline",
    "error_message": "target_part property on _Relationship is undefined when target mode is External"
  }],
  "document": {"json_content": null, "md_content": null}
}

Root cause — in docling/backend/msword_backend.py, _handle_pictures() → get_docx_image():

if rId in self.docx_obj.part.rels:
    image_part = self.docx_obj.part.rels[rId].target_part  # CRASHES on external rels

The relationship exists but is external. python-docx's _Relationship.target_part raises ValueError when _is_external is True.

Suggested fix — check rel.is_external before accessing target_part:

def get_docx_image(image):
    image_data = None
    rId = image.get("{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed")
    if rId in self.docx_obj.part.rels:
        rel = self.docx_obj.part.rels[rId]
        if rel.is_external:  # Skip external image references
            return None
        image_part = rel.target_part
        image_data = image_part.blob
    return image_data

This would cause external images to fall through to the existing if image_data is None: _log.warning("Warning: image cannot be found") path, which already handles missing images gracefully.
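The guard can be exercised in isolation, without python-docx or Docling installed. The following is a minimal sketch, not Docling's actual code: image_blob_or_none, _FakeRel and _FakePart are hypothetical names introduced here, and the two fake classes only mimic the is_external / target_part behaviour of python-docx's _Relationship described above.

```python
from typing import Any, Mapping, Optional


class _FakePart:
    """Hypothetical stand-in for an embedded image part (exposes only .blob)."""

    def __init__(self, blob: bytes) -> None:
        self.blob = blob


class _FakeRel:
    """Hypothetical stand-in mimicking python-docx's _Relationship behaviour."""

    def __init__(self, target: Any, is_external: bool) -> None:
        self._target = target
        self.is_external = is_external

    @property
    def target_part(self) -> Any:
        # Same failure mode as reported in this issue.
        if self.is_external:
            raise ValueError(
                "target_part property on _Relationship is undefined "
                "when target mode is External"
            )
        return self._target


def image_blob_or_none(rels: Mapping[str, Any], r_id: Optional[str]) -> Optional[bytes]:
    """Return the embedded image bytes, or None if the rel is missing or external."""
    if r_id is None or r_id not in rels:
        return None
    rel = rels[r_id]
    if rel.is_external:  # external target: there is no part to read a blob from
        return None
    return rel.target_part.blob


# Usage: one embedded image, one external reference.
rels = {
    "rId1": _FakeRel(_FakePart(b"png-bytes"), is_external=False),
    "rId139": _FakeRel("file:///C:/path/to/image.png", is_external=True),
}
embedded = image_blob_or_none(rels, "rId1")
external = image_blob_or_none(rels, "rId139")
```

Returning None for the external case keeps the caller's existing "image cannot be found" handling as the single place where missing images are reported.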

Steps to reproduce

  1. Obtain a .docx file with external image references (common when saving Wikipedia pages as .docx via a browser). The word/_rels/document.xml.rels will contain entries like:
<Relationship Id="rId139"
  Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
  Target="file:///C:\path\to\image.png"
  TargetMode="External"/>
  2. Convert with Docling:
from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat

converter = DocumentConverter(allowed_formats=[InputFormat.DOCX])
result = converter.convert("file_with_external_images.docx")
# Raises: ConversionError: Input document ... is not valid.

Or via the async API:

# Submit to /v1/convert/source/async, poll until "success", then GET /v1/result/{task_id}
# Result: {"status": "failure", "errors": [{"error_message": "target_part property..."}]}
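To check whether a given file is affected before converting it, the relationships part can be inspected directly with the standard library. This is a rough sketch under stated assumptions: find_external_image_rels is a hypothetical helper name, and it only looks at word/_rels/document.xml.rels, so external images referenced from headers, footers or other parts would not be reported.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace of the OPC relationships part and the image relationship type
# (the latter matches the Type attribute shown in the steps above).
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
IMAGE_REL_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
)


def find_external_image_rels(docx_path, rels_name="word/_rels/document.xml.rels"):
    """Return (Id, Target) pairs for image relationships with TargetMode="External".

    docx_path may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_path) as zf:
        if rels_name not in zf.namelist():
            return []
        root = ET.fromstring(zf.read(rels_name))

    found = []
    for rel in root.iter(f"{{{PKG_REL_NS}}}Relationship"):
        if rel.get("Type") == IMAGE_REL_TYPE and rel.get("TargetMode") == "External":
            found.append((rel.get("Id"), rel.get("Target")))
    return found
```

Calling find_external_image_rels("file_with_external_images.docx") on a file like the one in step 1 should list entries such as rId139; an empty list suggests the main document part has no external image references.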

Docling version

docling 2.76.0

Python version

Python 3.12

python-docx version: 1.2.0

Labels: bug (Something isn't working), docx (issue related to docx backend)