Description
Bug
MsWordDocumentBackend._handle_pictures() crashes when processing .docx files that contain image relationships with TargetMode="External" (e.g., images referencing URLs or local file:/// paths instead of embedded in word/media/).
The crash occurs because python-docx's _Relationship.target_part property raises ValueError when accessed on an external relationship, and Docling's code doesn't guard against this.
Error:
ValueError: target_part property on _Relationship is undefined when target mode is External
Docling async API returns:
{
  "status": "failure",
  "errors": [{
    "component_type": "pipeline",
    "module_name": "SimplePipeline",
    "error_message": "target_part property on _Relationship is undefined when target mode is External"
  }],
  "document": {"json_content": null, "md_content": null}
}

Root cause: in docling/backend/msword_backend.py, _handle_pictures() → get_docx_image():
if rId in self.docx_obj.part.rels:
    image_part = self.docx_obj.part.rels[rId].target_part  # CRASHES on external rels

The relationship exists but is external. python-docx's _Relationship.target_part raises ValueError when _is_external is True.
Suggested fix — check rel.is_external before accessing target_part:
def get_docx_image(image):
    image_data = None
    rId = image.get("{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed")
    if rId in self.docx_obj.part.rels:
        rel = self.docx_obj.part.rels[rId]
        if rel.is_external:  # Skip external image references
            return None
        image_part = rel.target_part
        image_data = image_part.blob
    return image_data

This would cause external images to fall through to the existing if image_data is None: _log.warning("Warning: image cannot be found") path, which already handles missing images gracefully.
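The guard can be exercised in isolation, without Docling or python-docx installed. The sketch below is only an illustration: FakePart, FakeRelationship, and get_image_blob are hypothetical stand-ins written for this example, not Docling or python-docx APIs. FakeRelationship mimics the behavior described above, raising ValueError from target_part when the relationship is external.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakePart:
    """Hypothetical stand-in for an embedded image part."""
    blob: bytes


class FakeRelationship:
    """Hypothetical stand-in mimicking the _Relationship behavior described above."""

    def __init__(self, target, is_external: bool):
        self._target = target
        self.is_external = is_external

    @property
    def target_part(self):
        # Same failure mode as reported in this issue.
        if self.is_external:
            raise ValueError(
                "target_part property on _Relationship is undefined "
                "when target mode is External"
            )
        return self._target


def get_image_blob(rels: dict, r_id: Optional[str]) -> Optional[bytes]:
    """Return embedded image bytes, or None for missing or external images."""
    rel = rels.get(r_id)
    if rel is None or rel.is_external:
        # External targets have no part in the package, so there is no blob.
        return None
    return rel.target_part.blob


rels = {
    "rId1": FakeRelationship(FakePart(b"\x89PNG"), is_external=False),
    "rId139": FakeRelationship("file:///C:/path/to/image.png", is_external=True),
}

print(get_image_blob(rels, "rId1"))    # embedded image bytes
print(get_image_blob(rels, "rId139"))  # None instead of a ValueError
```

Returning None for external relationships keeps the caller's existing "image cannot be found" handling as the single place that deals with unavailable images.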
Steps to reproduce
- Obtain a .docx file with external image references (common when saving Wikipedia pages as .docx via a browser). The word/_rels/document.xml.rels will contain entries like:
  <Relationship Id="rId139"
      Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
      Target="file:///C:\path\to\image.png"
      TargetMode="External"/>
- Convert with Docling:
from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat
converter = DocumentConverter(allowed_formats=[InputFormat.DOCX])
result = converter.convert("file_with_external_images.docx")
# Raises: ConversionError: Input document ... is not valid.

Or via the async API:
# Submit to /v1/convert/source/async, poll until "success", then GET /v1/result/{task_id}
# Result: {"status": "failure", "errors": [{"error_message": "target_part property..."}]}

Docling version
docling 2.76.0
Python version
Python 3.12
python-docx version: 1.2.0
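
To confirm that a given .docx is affected, its relationships file can be inspected with the standard library alone. This is a rough sketch, not part of Docling: find_external_images is a hypothetical helper name, and it assumes the file uses the transitional OOXML image relationship type shown in the reproduction steps.

```python
import zipfile
import xml.etree.ElementTree as ET

RELS_PATH = "word/_rels/document.xml.rels"
# Namespace of the <Relationships> container in OPC packages.
PKG_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
# Assumption: transitional OOXML image relationship type, as in the snippet above.
IMAGE_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
)


def find_external_images(docx_path):
    """Return (Id, Target) pairs for image relationships marked External."""
    with zipfile.ZipFile(docx_path) as zf:
        if RELS_PATH not in zf.namelist():
            return []
        root = ET.fromstring(zf.read(RELS_PATH))
    hits = []
    for rel in root.iter(f"{{{PKG_NS}}}Relationship"):
        if rel.get("Type") == IMAGE_TYPE and rel.get("TargetMode") == "External":
            hits.append((rel.get("Id"), rel.get("Target")))
    return hits
```

Calling find_external_images("file_with_external_images.docx") on the file from the reproduction steps should list the rId and target of each external image. An empty list suggests the crash has a different cause.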