feat(file_processors): add remote::unstructured-api provider #6076
sahana-sreeram wants to merge 5 commits into
Add Unstructured.io as a remote file processor provider supporting 65+ file formats. Implementation follows the same pattern as the remote::docling-serve provider, with unit tests providing coverage.
Signed-off-by: Sahana Sreeram <sahanasreeram01@gmail.com>
0c688e7 to 4f8f39f
…ed-api into the auto file processor as an optional fallback. When the user provides an Unstructured API key in the config, the auto provider routes supported file formats to Unstructured before returning a 422 error. This adds support for 65+ additional formats (including EML, DOC, MSG) for users who opt in with an API key, while keeping priority routing to pypdf for PDFs and to markitdown for office/image/audio files. Tested with an .eml file, which routes to Unstructured successfully; pypdf and markitdown routing is unchanged.
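The routing order described in this commit message could be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `AutoConfig`, `choose_backend`, `UnsupportedFormatError`, and the extension sets are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from pathlib import PurePath

# Hypothetical extension sets; the real provider's lists may differ.
PYPDF_EXTENSIONS = {".pdf"}
MARKITDOWN_EXTENSIONS = {".docx", ".pptx", ".xlsx", ".png", ".jpg", ".mp3", ".wav"}
UNSTRUCTURED_EXTENSIONS = {".eml", ".msg", ".doc"}


class UnsupportedFormatError(Exception):
    """Stands in for the 422 response the server would return."""

    status_code = 422


@dataclass
class AutoConfig:
    # Hypothetical config: an empty key means the user has not opted in.
    unstructured_api_key: str = ""


def choose_backend(filename: str, config: AutoConfig) -> str:
    """Pick a backend name, keeping pypdf and markitdown ahead of the fallback."""
    ext = PurePath(filename).suffix.lower()
    if ext in PYPDF_EXTENSIONS:
        return "pypdf"
    if ext in MARKITDOWN_EXTENSIONS:
        return "markitdown"
    # The fallback is only considered when an API key was configured.
    if config.unstructured_api_key and ext in UNSTRUCTURED_EXTENSIONS:
        return "unstructured"
    raise UnsupportedFormatError(f"unsupported file format: {ext or filename}")
```

With a key configured, `choose_backend("mail.eml", AutoConfig("key"))` selects the fallback; without one, the same call raises the 422-style error, so existing behavior is unchanged for users who do not opt in.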
…not set by allowing empty default key value
@leseb Is the intent to have auto choose between all available file processor backends? If so, I'm wondering why inline::docling and remote::docling-serve were not also wired in. If the goal is to make auto a general dispatcher across providers, I'd prefer to do that in a follow-up PR that includes both docling and unstructured, so we can design the fallback mechanism properly rather than hard-coding one provider's config into auto's config class. For this PR, I think landing just the standalone remote::unstructured-api provider (without the auto changes) would be cleaner.
🤔 they are wired, the auto provider just reads the format and forwards it to the appropriate provider. unless something changed recently :) i'm fine doing the wiring in a follow-up PR though!
@leseb I just checked, and docling isn't wired into auto today. Auto only imports and dispatches to pypdf and markitdown. Docling exists as the standalone inline::docling and remote::docling-serve providers. The starter distribution only configures inline::auto for file_processors, and auto has no docling references. Sounds like we can do the wiring in a follow-up though, I can open an issue for this!
What does this PR do?
Adds Unstructured.io as a remote file processor provider (remote::unstructured-api) supporting 65+ file formats including PDF, DOCX, PPTX, XLSX, EML, MSG, HTML, and more.
Key capabilities:
Implementation:
Follows the same architectural pattern as remote::docling-serve (PR #5412):
unstructured-client library
UNSTRUCTURED_API_KEY
Test Plan
1. Unit Tests
Tests code logic in isolation by mocking API responses. Verifies the provider correctly validates inputs, calls the Unstructured API with proper authentication, maps elements to OGX chunks with all required metadata, and handles both direct file upload and file_id retrieval paths.
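As a rough illustration of the kind of check described above, the sketch below maps Unstructured-style elements (dicts with `type`, `text`, and `metadata.page_number`) to chunks and exercises the mapping against a mocked client. The `Chunk` shape, `elements_to_chunks`, and the mocked `partition` method are assumptions made for this example, not the PR's actual types or the real client API.

```python
from dataclasses import dataclass, field
from typing import Any
from unittest.mock import MagicMock


@dataclass
class Chunk:
    # Hypothetical chunk shape; the real OGX chunk type may differ.
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def elements_to_chunks(elements: list[dict[str, Any]], filename: str) -> list[Chunk]:
    """Convert partitioned elements to chunks, preserving type and page number."""
    chunks = []
    for index, element in enumerate(elements):
        text = (element.get("text") or "").strip()
        if not text:
            continue  # skip elements that carry no text
        meta = element.get("metadata") or {}
        chunks.append(
            Chunk(
                content=text,
                metadata={
                    "filename": filename,
                    "element_type": element.get("type", "Unknown"),
                    "page_number": meta.get("page_number"),
                    "element_index": index,
                },
            )
        )
    return chunks


# Mocked client standing in for the remote API (hypothetical interface).
fake_client = MagicMock()
fake_client.partition.return_value = [
    {"type": "Title", "text": "Quarterly Report", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "  ", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "Revenue grew.", "metadata": {"page_number": 2}},
]

chunks = elements_to_chunks(fake_client.partition("report.pdf"), "report.pdf")
```

Because the client is mocked, a test like this needs no network access or API key; it only checks that empty elements are dropped and that element type and page number survive the mapping.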
Run:
2. Integration Test
Using a sample PDF, verifies the full end-to-end user flow from the OGX server through the FileProcessor API to the real Unstructured.io API, with the API key exported in the environment. Confirms successful processing of text and PDF files with the correct chunk/embed structure, element type preservation, page number tracking, and error handling for invalid inputs.
Verification: existing pytest tests still pass.