This sample demonstrates how to build an intelligent document processing solution using Azure Content Understanding to extract structured data from documents and provide conversational querying capabilities.
Note: This sample is for demonstration purposes and should be adapted for production use with appropriate security, monitoring, and error handling considerations.
- Document Ingestion: Automatically process documents using Azure Content Understanding to extract structured data
- Configurable Extraction: Define custom field schemas and extraction rules via JSON configuration
- Conversational Interface: Query processed documents using natural language powered by Azure OpenAI
- Scalable Architecture: Built on Azure Functions for serverless, event-driven processing
- Document Classification: Intelligent document type classification and routing
- Data Storage: Persistent storage with Azure Cosmos DB for extracted data
- Infrastructure as Code: Complete Terraform deployment for reproducible infrastructure
- Azure subscription with access to:
- Azure Content Understanding
- Azure OpenAI Service
- Azure Functions
- Azure Cosmos DB
- Azure Key Vault
- Azure Storage Account
- Python 3.12 or later
- Terraform (for infrastructure deployment)
- Azure CLI
The solution implements three main workflows:
- Document Enquiry: Natural language querying of processed documents using Azure OpenAI
- Configuration Upload: Management of document extraction schemas and rules
- Document Ingestion: Automated processing of documents with Azure Content Understanding
For detailed architecture information, see Architecture Documentation.
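As a rough sketch of how these pieces fit together (the authoritative diagram is in the architecture documentation; the flow below is a simplification and omits the configuration upload workflow):

```mermaid
flowchart LR
    A["Documents uploaded to Azure Storage"] --> B["Azure Function (ingestion)"]
    B --> C["Azure Content Understanding (classification and extraction)"]
    C --> D[("Azure Cosmos DB (extracted data)")]
    E["Natural language query"] --> F["Azure Function (enquiry)"]
    F --> G["Azure OpenAI"]
    F --> D
```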
One-Click Deployment:
- Click the "Deploy to Azure" button above to open Azure Cloud Shell
- Run the following command for automated deployment:
```bash
curl -s https://raw.githubusercontent.com/Azure-Samples/data-extraction-using-azure-content-understanding/main/deploy.sh | bash
```

Manual Deployment:
- Click the "Deploy to Azure" button above to open Azure Cloud Shell
- Run the following commands in the Cloud Shell:
```bash
# Clone the repository
git clone https://github.com/Azure-Samples/data-extraction-using-azure-content-understanding.git
cd data-extraction-using-azure-content-understanding/iac

# Copy and configure the terraform variables
cp terraform.tfvars.sample terraform.tfvars
# Edit the terraform.tfvars file with your desired values
# You can use the Cloud Shell editor: code terraform.tfvars

# Initialize and deploy the infrastructure
terraform init
terraform plan
terraform apply -auto-approve
```

- Click the "Open in GitHub Codespaces" button above
- Wait for the codespace to be created and configured
- Follow the setup instructions in the terminal
- Clone this repository
- Open in VS Code
- When prompted, reopen in Dev Container
- The container will automatically install all dependencies
- Clone the repository:
```bash
git clone https://github.com/Azure-Samples/data-extraction-using-azure-content-understanding.git
cd data-extraction-using-azure-content-understanding
```

- Create a virtual environment:

```bash
python -m venv .venv
source ./.venv/Scripts/activate  # or ./.venv/bin/activate if on Mac/Linux
```

- Configure VS Code settings:

Add the following property in .vscode/settings.json:

```json
"azureFunctions.pythonVenv": ".venv"
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Configure environment variables:

```bash
cp src/local.settings.sample.json src/local.settings.json
# Edit local.settings.json with your Azure service configurations
```

First, authenticate with Azure and ensure you're using the correct subscription:
```bash
az login

# List available subscriptions
az account list --output table

# Set the correct subscription
az account set --subscription "your-subscription-id"

# Verify the selected subscription
az account show --output table
```

Navigate to the iac folder and deploy the required Azure resources:
```bash
cd iac
cp terraform.tfvars.sample terraform.tfvars
# Edit terraform.tfvars with your values
terraform init
terraform plan
terraform apply
```

Note that the provided Terraform scripts by default create the appropriate access roles for the user principal running them; you will need to grant permissions to any other users who need access to the Key Vault and Cosmos DB instances:
Get their user principal ID; they will need to run:

```bash
az login
az ad signed-in-user show --query id -o tsv
```

Then, you can run:
```bash
az role assignment create \
  --assignee "<their-user-principal-id>" \
  --role "DocumentDB Account Contributor" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group-name>/providers/Microsoft.DocumentDB/databaseAccounts/<cosmosdb-name>"

az cosmosdb sql role assignment create \
  --account-name "devdatext8wucosmoskb0" \
  --resource-group "devdatext8WuRg0" \
  --role-definition-id "00000000-0000-0000-0000-000000000001" \
  --principal-id "<their-user-principal-id>" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group-name>/providers/Microsoft.DocumentDB/databaseAccounts/<cosmosdb-name>"
```

Update the src/local.settings.json file:
```json
{
  "IsEncrypted": false,
  "Values": {
    "AzureWebJobsStorage": "<your-storage-connection-string>",
    "FUNCTIONS_WORKER_RUNTIME": "python",
    "ENVIRONMENT": "local"
  }
}
```

If you want to test the out-of-the-box monitoring integration with Application Insights and enable tracing of the Semantic Kernel workflow in the query endpoint, add the following:
```json
{
  ...
  "Values": {
    ...
    "SEMANTICKERNEL_EXPERIMENTAL_GENAI_ENABLE_OTEL_DIAGNOSTICS": "true",
    "APPLICATIONINSIGHTS_CONNECTION_STRING": "<CONNECTION_STRING>"
  }
}
```

You can add "SEMANTICKERNEL_EXPERIMENTAL_GENAI_ENABLE_OTEL_DIAGNOSTICS" to the Function App application settings to capture Semantic Kernel telemetry in the deployed environment. If you would like to capture prompts and completions as part of that telemetry, set "SEMANTICKERNEL_EXPERIMENTAL_GENAI_ENABLE_OTEL_DIAGNOSTICS_SENSITIVE" to "true" instead of the other SK environment variable.
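For a deployed Function App, these settings can be applied with the Azure CLI; the sketch below assumes placeholder app and resource group names that you should replace with the ones created by your deployment:

```bash
# Placeholders: substitute the Function App and resource group from your deployment
az functionapp config appsettings set \
  --name "<function-app-name>" \
  --resource-group "<resource-group-name>" \
  --settings "SEMANTICKERNEL_EXPERIMENTAL_GENAI_ENABLE_OTEL_DIAGNOSTICS=true" \
             "APPLICATIONINSIGHTS_CONNECTION_STRING=<CONNECTION_STRING>"
```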
Update the src/resources/app_config.yaml file with your Azure service endpoints and the names of the keys/secrets stored in Key Vault for the respective environment (specified in your ENVIRONMENT environment variable):
```yaml
# Example configuration - update with your actual values
azure_content_understanding:
  endpoint: "<your-content-understanding-endpoint>"
  subscription_key: "<your-content-understanding-key>"
azure_openai:
  endpoint: "<your-openai-endpoint>"
  api_key: "<your-openai-api-key>"
cosmos_db:
  connection_string: "<your-cosmosdb-connection-string>"
key_vault:
  url: "<your-key-vault-url>"
```

Note that the app_config.yaml file should NOT directly contain any secrets, only the names of the secrets as stored in Key Vault, as this file is tracked under version control.
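As a rough illustration of how a secret name from app_config.yaml can be resolved at runtime, the sketch below uses the azure-identity and azure-keyvault-secrets packages with placeholder names; the sample's own lookup logic lives in its services and may differ:

```python
# Sketch only: resolve a secret *name* from app_config.yaml against Key Vault.
# The vault URL and secret name are placeholders, not values from this repository.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

vault_url = "https://<your-key-vault-name>.vault.azure.net"
secret_name = "<your-content-understanding-key>"  # the name stored in config, not the secret itself

client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())
secret_value = client.get_secret(secret_name).value  # actual secret material stays out of version control
```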
Create and upload document extraction configurations:
```bash
# Example configuration upload
curl -X POST http://localhost:7071/api/v1/ingest/config \
  -H "Content-Type: application/json" \
  -d @configs/document-extraction-v1.0.json
```

- Start the Function App:

```bash
func start --script-root ./src/
```

- Test Health Check:

```bash
curl http://localhost:7071/api/v1/health
```
Expected Healthy Response:
When all resources and their connectivity are healthy, the health check will return:
{ "status": "healthy", "checks": { "mongo_db": { "status": "healthy", "details": "mongo_db is running as expected." }, "cosmos_db": { "status": "healthy", "details": "cosmos_db is running as expected." }, "key_vault": { "status": "healthy", "details": "key_vault is running as expected." }, "content_understanding": { "status": "healthy", "details": "content_understanding is running as expected." }, "azure_openai": { "status": "healthy", "details": "azure_openai is running as expected." } } }
- Upload Documents: Place documents in the configured Azure Storage container
- Monitor Processing: Check Azure Function logs for processing status
- Query Results: Use the inference API to query processed documents (see the example request below)
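A minimal sketch of such a query; the actual route and payload are defined by the app (see src/samples/query_api_sample.http), so the path and body below are illustrative assumptions only:

```bash
# Hypothetical endpoint and payload - see src/samples/query_api_sample.http for the real request
curl -X POST http://localhost:7071/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the monthly rent in the most recently ingested lease agreement?"}'
```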
The src/samples/ directory contains sample HTTP requests for:
- Health checks (health_check_sample.http)
- Configuration management (config_update_sample.http)
- Document querying (query_api_sample.http)
- Classifier management (classifier_management_sample.http)
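These files can be run with an HTTP client such as the VS Code REST Client extension. As an example, a health-check request against the locally running Function App (the sample file's exact contents may differ) looks like:

```http
### Health check against the local Function App
GET http://localhost:7071/api/v1/health
```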
Run the test suite:
```bash
# Install test dependencies
pip install -r requirements_dev.txt

# Run tests
pytest
```

The repository is organized as follows:

```
├── configs/              # Sample configuration files
├── docs/                 # Documentation and architecture diagrams
├── iac/                  # Terraform infrastructure as code
│   └── modules/          # Reusable Terraform modules
├── src/                  # Source code
│   ├── configs/          # Application configuration management
│   ├── controllers/      # API controllers
│   ├── decorators/       # Custom decorators
│   ├── models/           # Data models
│   ├── routes/           # API routes
│   ├── samples/          # Sample HTTP requests
│   ├── services/         # Business logic services
│   └── utils/            # Utility functions
└── tests/                # Unit and integration tests
```
Define extraction schemas in JSON format:
```json
{
  "id": "document-extraction-v1.0",
  "name": "document-extraction",
  "version": "v1.0",
  "collection_rows": [
    {
      "data_type": "LeaseAgreement",
      "field_schema": [
        {
          "name": "monthly_rent",
          "type": "integer",
          "description": "Monthly rent amount",
          "method": "extract"
        }
      ],
      "analyzer_id": "lease-analyzer"
    }
  ]
}
```
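If you are authoring new configurations, a quick sanity check before uploading can catch missing fields; the helper below is a hypothetical sketch based only on the structure shown above and is not part of the sample's codebase:

```python
# Hypothetical sanity check for an extraction config, based on the structure shown above.
import json

REQUIRED_TOP_LEVEL = {"id", "name", "version", "collection_rows"}
REQUIRED_ROW_KEYS = {"data_type", "field_schema", "analyzer_id"}
REQUIRED_FIELD_KEYS = {"name", "type", "description", "method"}

def validate_config(path: str) -> None:
    with open(path) as f:
        config = json.load(f)
    missing = REQUIRED_TOP_LEVEL - config.keys()
    if missing:
        raise ValueError(f"Missing top-level keys: {missing}")
    for row in config["collection_rows"]:
        if not REQUIRED_ROW_KEYS <= row.keys():
            raise ValueError(f"Row {row.get('data_type')} is missing keys: {REQUIRED_ROW_KEYS - row.keys()}")
        for field in row["field_schema"]:
            if not REQUIRED_FIELD_KEYS <= field.keys():
                raise ValueError(f"Field {field.get('name')} is missing keys: {REQUIRED_FIELD_KEYS - field.keys()}")

validate_config("configs/document-extraction-v1.0.json")
```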
Key environment variables for configuration:

- AZURE_CONTENT_UNDERSTANDING_ENDPOINT: Azure Content Understanding endpoint
- AZURE_CONTENT_UNDERSTANDING_KEY: Azure Content Understanding key
- AZURE_OPENAI_ENDPOINT: Azure OpenAI service endpoint
- AZURE_OPENAI_API_KEY: Azure OpenAI API key
- COSMOS_DB_CONNECTION_STRING: Cosmos DB connection string
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot.
This project is licensed under the MIT License - see the LICENSE file for details.
For support and questions:
- Create an issue in this repository
