feat(file_processors): add remote::unstructured-api provider#6076

Draft
sahana-sreeram wants to merge 5 commits into
ogx-ai:mainfrom
sahana-sreeram:feature/add-unstructured-api-provider
@sahana-sreeram

Add Unstructured.io as a remote file processor provider supporting 65+ file formats. The implementation follows the same pattern as the remote::docling-serve provider, with unit tests providing coverage.

What does this PR do?

Adds Unstructured.io as a remote file processor provider (remote::unstructured-api) supporting 65+ file formats including PDF, DOCX, PPTX, XLSX, EML, MSG, HTML, and more.

Key capabilities:

  • 65+ file format support (vs ~10 for other providers)
  • Email format support (EML/MSG) - unique to this provider
  • Cloud-based processing via Unstructured.io SaaS API
  • Element type preservation (Title, NarrativeText, ListItem, Table, Image, etc.)
  • SOC2/HIPAA/GDPR certified processing

Implementation:
Follows the same architectural pattern as remote::docling-serve (PR #5412):

  • Remote API integration using unstructured-client library
  • Element-to-chunk mapping preserving semantic structure
  • API key authentication via the UNSTRUCTURED_API_KEY environment variable
  • Unit tests with mocked API responses (13 tests, same coverage pattern as docling-serve)
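
To illustrate the element-to-chunk mapping listed above, here is a minimal sketch, not the code in this PR. It assumes each element is a dict with "type", "text", and an optional "metadata.page_number", and that the chunk fields match the sample output in the test plan. The function name `elements_to_chunks`, the uuid5-based chunk ID, and the whitespace token count are assumptions made for illustration only.

```python
import uuid
from typing import Any, Dict, List


def elements_to_chunks(
    elements: List[Dict[str, Any]], document_id: str, filename: str
) -> List[Dict[str, Any]]:
    """Map Unstructured-style element dicts to chunk dicts, skipping empty text."""
    chunks: List[Dict[str, Any]] = []
    for index, element in enumerate(elements):
        text = (element.get("text") or "").strip()
        if not text:
            continue  # empty elements carry nothing worth chunking

        # Assumption: a deterministic ID derived from document ID and position.
        chunk_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{document_id}:{index}"))

        metadata: Dict[str, Any] = {
            "document_id": document_id,
            "element_type": element.get("type", "UncategorizedText"),
            # Assumption: the index refers to the position in the raw element list.
            "element_index": index,
            "filename": filename,
        }
        page_number = (element.get("metadata") or {}).get("page_number")
        if page_number is not None:
            metadata["page_number"] = page_number

        chunks.append(
            {
                "content": text,
                "chunk_id": chunk_id,
                "metadata": metadata,
                "chunk_metadata": {
                    "chunk_id": chunk_id,
                    "document_id": document_id,
                    "source": filename,
                    # Assumption: a whitespace word count stands in for a tokenizer.
                    "content_token_count": len(text.split()),
                },
            }
        )
    return chunks
```

Keeping the element type and page number in per-chunk metadata is what lets downstream consumers filter by semantic role (for example, only Title or Table elements) without re-parsing the file.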

Test Plan

1. Unit Tests

Tests code logic in isolation by mocking API responses. Verifies the provider correctly validates inputs, calls the Unstructured API with proper authentication, maps elements to OGX chunks with all required metadata, and handles both direct file upload and file_id retrieval paths.

Run:

uv run pytest tests/unit/providers/file_processor/test_unstructured_api.py -v

Output:
tests/unit/providers/file_processor/test_unstructured_api.py::TestUnstructuredApiFileProcessor::test_rejects_no_file_and_no_file_id PASSED [  7%]
tests/unit/providers/file_processor/test_unstructured_api.py::TestUnstructuredApiFileProcessor::test_rejects_both_file_and_file_id PASSED [ 15%]
tests/unit/providers/file_processor/test_unstructured_api.py::TestUnstructuredApiFileProcessor::test_process_file_success PASSED [ 23%]
tests/unit/providers/file_processor/test_unstructured_api.py::TestUnstructuredApiFileProcessor::test_element_types_preserved PASSED [ 30%]
tests/unit/providers/file_processor/test_unstructured_api.py::TestUnstructuredApiFileProcessor::test_empty_elements_skipped PASSED [ 38%]
tests/unit/providers/file_processor/test_unstructured_api.py::TestUnstructuredApiFileProcessor::test_chunk_metadata_fields PASSED [ 46%]
tests/unit/providers/file_processor/test_unstructured_api.py::TestUnstructuredApiFileProcessor::test_chunk_id_uniqueness PASSED [ 53%]
tests/unit/providers/file_processor/test_unstructured_api.py::TestUnstructuredApiFileProcessor::test_page_numbers_preserved PASSED [ 61%]
tests/unit/providers/file_processor/test_unstructured_api.py::TestUnstructuredApiFileProcessor::test_token_count_calculated PASSED [ 69%]
tests/unit/providers/file_processor/test_unstructured_api.py::TestUnstructuredApiFileProcessor::test_process_file_via_file_id PASSED [ 76%]
tests/unit/providers/file_processor/test_unstructured_api.py::TestUnstructuredApiFileProcessor::test_api_key_used PASSED [ 84%]
tests/unit/providers/file_processor/test_unstructured_api.py::TestUnstructuredApiFileProcessorConfig::test_default_values PASSED [ 92%]
tests/unit/providers/file_processor/test_unstructured_api.py::TestUnstructuredApiFileProcessorConfig::test_sample_run_config PASSED [100%]

============================== 13 passed in 0.23s ==============================

Coverage:
- Input validation (2 tests)
- Core processing with element type preservation (3 tests)
- Chunk metadata mapping (4 tests)
- File ID retrieval path (1 test)
- API authentication (1 test)
- Configuration validation (2 tests)
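
For reference, the input validation covered by the first two tests can be sketched roughly as follows. This is an assumption about the shape of the check (exactly one of `file` or `file_id`), not the provider's actual code; the function name and error messages are hypothetical.

```python
from typing import Optional


def validate_process_request(file: Optional[bytes], file_id: Optional[str]) -> str:
    """Return which input path to use, or raise if the request is ambiguous."""
    if file is None and file_id is None:
        raise ValueError("Either file or file_id must be provided")
    if file is not None and file_id is not None:
        raise ValueError("Provide only one of file or file_id, not both")
    # Exactly one input is set: direct upload or lookup via the Files API.
    return "upload" if file is not None else "file_id"
```

Checking both failure cases up front means the provider never calls the remote API with an ambiguous request.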

2. Integration Test

Using a sample PDF, verifies the full end-to-end user flow from the OGX server through the FileProcessor API to the real Unstructured.io API, authenticated with the exported UNSTRUCTURED_API_KEY. Confirms successful processing of text and PDF files with the correct chunk/embed structure, element type preservation, page number tracking, and error handling for invalid inputs.

Setup:
export UNSTRUCTURED_API_KEY="your-key"
uv run ogx stack run \
  --providers "file_processors=remote::unstructured-api,files=inline::localfs" \
  --port 8321

Test with curl:
curl -X POST http://localhost:8321/v1alpha/file-processors/process \
  -F "file=@syllabus_soc24_spring2018.pdf" | jq '.metadata, .chunks[0]'

Output:
{
  "processor": "unstructured-api",
  "processing_time_ms": 9872,
  "extraction_method": "unstructured-api",
  "file_size_bytes": 165750,
  "total_elements": 186
}
{
  "content": "Department of Sociology Harvard University Spring 2018",
  "chunk_id": "23d87af4-e79c-cfbf-1912-67c333904304",
  "metadata": {
    "document_id": "9410f8ab-459f-4459-9294-a4fc2f88ec8b",
    "element_type": "Title",
    "element_index": 0,
    "filename": "syllabus_soc24_spring2018.pdf",
    "page_number": 1
  },
  "chunk_metadata": {
    "chunk_id": "23d87af4-e79c-cfbf-1912-67c333904304",
    "document_id": "9410f8ab-459f-4459-9294-a4fc2f88ec8b",
    "source": "syllabus_soc24_spring2018.pdf",
    "created_timestamp": null,
    "updated_timestamp": null,
    "chunk_window": null,
    "chunk_tokenizer": null,
    "content_token_count": 7,
    "metadata_token_count": null
  }
}

Results:
- CHECK: Processed 11-page PDF → 186 chunks in 8.6 seconds
- CHECK: Element types detected: NarrativeText (104), Title (37), UncategorizedText (18), ListItem (17), Footer (10)
- CHECK: Page numbers preserved (1-11)
- CHECK: All required OGX Chunk fields populated correctly
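
The element-type and page-range figures above can be reproduced from a saved response with a small script along these lines. It is a sketch that assumes the response shape shown in the sample output (a "chunks" list whose items carry "metadata.element_type" and "metadata.page_number"); the helper name `summarize_chunks` is hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_chunks(chunks: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Tally element types and the page range across processed chunks."""
    type_counts = Counter(chunk["metadata"]["element_type"] for chunk in chunks)
    pages = [
        chunk["metadata"]["page_number"]
        for chunk in chunks
        if chunk["metadata"].get("page_number") is not None
    ]
    return {
        "total_chunks": len(chunks),
        "element_types": dict(type_counts),
        # None when no chunk reports a page number (e.g. plain-text input).
        "page_range": (min(pages), max(pages)) if pages else None,
    }
```

Passing `json.load(...)["chunks"]` from a saved curl response to this helper gives the per-type counts and the first and last page in one call.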

Verification: existing pytest tests still pass

Signed-off-by: Sahana Sreeram <sahanasreeram01@gmail.com>
@sahana-sreeram force-pushed the feature/add-unstructured-api-provider branch from 0c688e7 to 4f8f39f on June 10, 2026 15:18

@leseb left a comment

please try to wire it with the "auto" provider

sahana-sreeram and others added 3 commits June 10, 2026 13:38
…ed-api into the auto file processor as an optional fallback. When the user provides an Unstructured API key in the config, the auto provider routes supported file formats to Unstructured before falling back to a 422 error. This adds support for 65+ additional formats (including EML, DOC, and MSG) when users opt in with an API key, while keeping priority routing to pypdf for PDFs and markitdown for office/image/audio files. Tested with an .eml file, which routes to Unstructured successfully; pypdf and markitdown routing is unchanged.
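
The routing order described in this commit message could look roughly like the sketch below. It is only an illustration of the priority (pypdf, then markitdown, then Unstructured when a key is configured, otherwise 422). The extension sets, `select_backend`, and `UnsupportedFormatError` are hypothetical names, not the identifiers used in the auto provider.

```python
from pathlib import Path
from typing import Optional

# Hypothetical, deliberately incomplete extension sets for illustration.
PYPDF_EXTENSIONS = {".pdf"}
MARKITDOWN_EXTENSIONS = {".docx", ".pptx", ".xlsx", ".png", ".jpg", ".mp3", ".wav"}
UNSTRUCTURED_EXTENSIONS = {".eml", ".msg", ".doc"}


class UnsupportedFormatError(Exception):
    """Raised when no backend handles the file; maps to HTTP 422."""

    status_code = 422


def select_backend(filename: str, unstructured_api_key: Optional[str] = None) -> str:
    """Pick a backend by file extension, using Unstructured only as an opt-in fallback."""
    extension = Path(filename).suffix.lower()
    if extension in PYPDF_EXTENSIONS:
        return "pypdf"  # PDFs keep priority routing
    if extension in MARKITDOWN_EXTENSIONS:
        return "markitdown"  # office/image/audio keep priority routing
    if unstructured_api_key and extension in UNSTRUCTURED_EXTENSIONS:
        return "unstructured-api"  # only reachable when the user opted in with a key
    raise UnsupportedFormatError(f"Unsupported file format: {extension or filename}")
```

Because the Unstructured branch comes last and requires a key, users without a key see the same behavior as before, including the 422 for formats such as .eml.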
@alinaryan

@leseb Is the intent to have auto choose between all available file processor backends? If so, I'm wondering why inline::docling and remote::docling-serve were not also wired in.

If the goal is to make auto a general dispatcher across providers, I'd prefer to do that in a followup PR that includes both docling and unstructured, so we can design the fallback mechanism properly rather than hard-coding one provider's config into auto's config class. For this PR, I think landing just the standalone remote::unstructured-api provider (without the auto changes) would be cleaner.

@leseb

leseb commented Jun 11, 2026

🤔 they are wired, the auto provider just reads the format and forwards it to the appropriate provider. unless something changed recently :)

i'm fine doing the wiring in a followup PR though!

@alinaryan

alinaryan commented Jun 11, 2026

@leseb I just checked, and docling isn't wired into auto today. Auto only imports and dispatches to pypdf and markitdown. Docling exists as standalone inline::docling and remote::docling-serve providers. Starter distribution only configures inline::auto for file_processors and auto has no docling references.

Sounds like we can do the wiring in a follow-up though, I can open an issue for this!


3 participants