A serverless AWS application for batch processing files using Claude AI models. This is a quick and dirty solution designed for rapid analysis of individual files, perfect for generating datasets, extracting structured content from unstructured data, or performing bulk document analysis.
The AI File Processor is a serverless system that processes files uploaded to S3 using AWS Bedrock's Claude models and a prompt that you provide. It's designed for use cases like:
- Dataset Generation: Convert unstructured documents/images into structured JSON data
- Content Analysis: Extract key information from business documents, forms, or receipts
- Object Recognition: Identify and catalog objects in images (YMMV)
- Document Transcription: Convert handwritten or printed text to digital format
- Translation: Translate documents between languages
- Quick Analysis: Rapid processing of document batches for research or analysis
[S3 Input Bucket] → [Lambda Trigger] → [Step Functions] → [Worker Lambdas] → [S3 Output Bucket]
        ↓                  ↓                  ↓                  ↓                  ↓
  Upload files +      Validates          Distributes        Processes each     Results &
  _prompt.json        structure          work in parallel   file with Claude   status files
- Input S3 Bucket: Upload your files and prompt configuration
- Trigger Lambda: Validates structure, prevents duplicates, starts processing (a simplified sketch follows this list)
- Step Functions: Orchestrates parallel processing of files
- Worker Lambdas: Process individual files using Claude via Bedrock
- Output S3 Bucket: Contains results and status tracking
- Status Updates: Real-time status tracking via JSON files
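For orientation, here is a simplified, illustrative sketch of what the trigger step does. The environment variable name and payload shape are assumptions for this example, not the project's actual code:

```python
import json
import os
import urllib.parse

import boto3

sfn = boto3.client("stepfunctions")

# Hypothetical environment variable and payload shape, for illustration only.
STATE_MACHINE_ARN = os.environ["STATE_MACHINE_ARN"]


def lambda_handler(event, context):
    """Start a Step Functions execution when a _prompt.json object is created."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    # Only react to the prompt file; it is the trigger for the whole directory.
    if not key.endswith("_prompt.json"):
        return {"skipped": key}

    directory = key.rsplit("/", 1)[0] + "/"
    execution = sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        input=json.dumps({"bucket": bucket, "directory": directory}),
    )
    return {"executionArn": execution["executionArn"]}
```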
- AWS CLI configured with appropriate permissions
- AWS SAM CLI installed
- Python 3.11+
- Access to AWS Bedrock Claude models in your region
Your deployment user needs:
- CloudFormation stack creation/update
- Lambda function creation/management
- S3 bucket creation/management
- Step Functions state machine creation
- IAM role/policy creation
- Bedrock model access
git clone <repository-url>
cd ai-file-processor

Copy and customize the SAM configuration:

cp samconfig.toml.example samconfig.your-env.toml

Edit samconfig.your-env.toml:
[default.deploy.parameters]
stack_name = "your-stack-name"
capabilities = "CAPABILITY_IAM CAPABILITY_NAMED_IAM"
confirm_changeset = true
resolve_s3 = true
parameter_overrides = "StackPrefix=your-prefix ModelId=arn:aws:bedrock:us-east-1:1234567890:inference-profile/us.anthropic.claude-3-5-sonnet-20241022-v2:0 MaxConcurrency=15"
tags = "Environment=production Team=ai-team Project=document-processing CostCenter=engineering"Deployment Parameters (set in samconfig.your-env.toml):
Required:
- `stack_name`: Name of the CloudFormation stack in AWS (e.g., "my-ai-file-processor")
- `parameter_overrides`:
  - `StackPrefix` (required): Prefix for resource names in AWS (e.g., "my-dev")
  - `ModelId` (required): Bedrock model ARN (see "Available Models" below)
  - `MaxConcurrency` (optional, default: 10): Number of files to process simultaneously (1-1000)

Optional:
- `tags`: Key-value pairs for AWS resource tagging
  - Applied to all resources (Lambda functions, S3 buckets, Step Functions, IAM roles)
  - Useful for cost allocation, governance, and resource management
  - Common tags: Environment, Team, Project, CostCenter, Owner
Available Models (check Bedrock console for your region):
- You may need to request that AWS enable access to the models for your account
- Use the "Inference profile ARN" for the appropriate Claude model, found in the AWS Console under Amazon Bedrock -> Infer -> Cross-region inference -> Inference profiles (or list them programmatically, as sketched below)
# Install dependencies
pip install -r requirements-dev.txt
# Set your AWS region, for example:
export AWS_REGION=us-east-1
# Validate template
sam validate
# Build application
sam build
# Deploy with your configuration
sam deploy --config-file samconfig.your-env.toml
### 4. Note the S3 Bucket Names
After deployment, note the created bucket names:
- Input: `{StackPrefix}-ai-file-processor-input`
- Output: `{StackPrefix}-ai-file-processor-output`

Files must be organized in directories exactly one level deep:
✅ VALID:
your-bucket/
├── project1/
│ ├── image1.jpg
│ ├── image2.png
│ └── _prompt.json
└── analysis-batch/
├── file1.jpg
├── file2.jpeg
└── _prompt.json
❌ INVALID:
your-bucket/
├── _prompt.json # Too shallow (root level)
└── project/
└── subfolder/
├── image.jpg
└── _prompt.json # Too deep (nested)
- Images: `.png`, `.jpg`, `.jpeg` (future support planned for other file types)
- Image Size: Image files must be under ~3MB each
Create a _prompt.json file:
{
"prompt": "Analyze this image and extract key information in JSON format with fields: title, description, objects_detected, and confidence_score."
}

{
"prompt": "Your analysis prompt here",
"max_tokens": 4096,
"temperature": 0.2
}

Configuration Parameters:
- `prompt` (required): The instruction for Claude to analyze each file
- `max_tokens` (optional, default: 8192): Maximum tokens Claude can generate per response (check the limits for your specific model)
- `temperature` (optional, default: 0.1): Controls randomness (0.0 = deterministic, 1.0 = very random)
Example Prompts:
{
"prompt": "Extract all text from this document and format as structured JSON with sections for headers, body text, and any numerical data."
}

{
"prompt": "Identify all objects in this image and return a JSON array with object_name, location, and confidence for each detected item."
}

{
"prompt": "Translate this document to English and return both the original text and translation in JSON format."
}

More Advanced Prompts:
{
"prompt": "This is part of a handwritten correspondence. Please identify whether it is first page, last page, middle page, or single page. First page usually has a salutation and possibly a date, last pages would have a closing, middle pages would have neither, and single page letters would have both a salutation and a closing. Please output in normalized json format. Please include the trasciption of the full document and full english translation if not in english. Also include a short list of topical keywords\n\n```json\njson_data = {\n \"page_type\": \"first_page\",\n \"confidence\": \"[Z%]\",\n \"reasoning\": \"This page contains a date and a line starting with Dear...\", \"transcription\": \"[transcribed text]\", \"english_translation\": \"[translation]\", \"topic_keywords\": [array of keywords] \n}```\n\nPlease include your confidence level and a brief explanation of why you identified the page type. Do not include any text outside of the json itself."
}

Prompt with full custom configuration options:
{
"prompt": "Analyze the uploaded document",
"max_tokens": 4096,
"temperature": 0.2
}

- Upload Files: Upload your files to the input bucket in a folder
- Add Prompt: Upload `_prompt.json` to trigger processing. (Note: Upload `_prompt.json` after the image files are uploaded, as the `_prompt.json` file triggers the execution using whatever files are currently in the S3 directory.)
- Monitor Status: Check the status file in the output bucket
- Retrieve Results: Download processed results from output bucket
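If you prefer to drive the workflow from Python, here is a minimal boto3 sketch of the same steps (bucket names, keys, and file names are placeholders); the equivalent CLI commands follow below:

```python
import json
import time

import boto3

# Placeholder bucket names; substitute the ones created by your deployment.
INPUT_BUCKET = "your-prefix-ai-file-processor-input"
OUTPUT_BUCKET = "your-prefix-ai-file-processor-output"
BATCH = "batch-001/"

s3 = boto3.client("s3")

# 1. Upload the files to process.
s3.upload_file("image1.jpg", INPUT_BUCKET, BATCH + "image1.jpg")
s3.upload_file("image2.png", INPUT_BUCKET, BATCH + "image2.png")

# 2. Upload _prompt.json last -- this is what triggers processing.
prompt = {"prompt": "Extract all text and key information from this document"}
s3.put_object(Bucket=INPUT_BUCKET, Key=BATCH + "_prompt.json", Body=json.dumps(prompt))

# 3. Poll the status file until the batch finishes.
while True:
    try:
        obj = s3.get_object(Bucket=OUTPUT_BUCKET, Key=BATCH + "_status.json")
        status = json.loads(obj["Body"].read())
        print(status["status"], status.get("message", ""))
        if status["status"] in ("completed", "error"):
            break
    except s3.exceptions.NoSuchKey:
        pass  # Status file not written yet.
    time.sleep(10)
```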
Bash commands shown, but this can alternatively be done via the AWS console.
# Upload files to input bucket
aws s3 cp image1.jpg s3://your-prefix-ai-file-processor-input/batch-001/
aws s3 cp image2.png s3://your-prefix-ai-file-processor-input/batch-001/
aws s3 cp document.pdf s3://your-prefix-ai-file-processor-input/batch-001/
# Create and upload prompt file (this triggers processing)
echo '{"prompt": "Extract all text and key information from this document"}' > _prompt.json
aws s3 cp _prompt.json s3://your-prefix-ai-file-processor-input/batch-001/
# Check processing status
aws s3 cp s3://your-prefix-ai-file-processor-output/batch-001/_status.json ./status.json
cat status.json
# Download results when complete
aws s3 sync s3://your-prefix-ai-file-processor-output/batch-001/ ./results/

The system creates status files in the output bucket to track progress:
In Progress:
{
"status": "in_progress",
"message": "Processing 5 files",
"total_files": 5,
"completed_files": 0,
"timestamp": "2025-01-15T10:30:00.123Z",
"directory_path": "batch-001/",
"model_id": "arn:aws:bedrock:us-east-1:123456789:inference-profile/us.anthropic.claude-3-5-sonnet-20241022-v2:0",
"execution_arn": "arn:aws:states:execution:..."
}

Completed with Token Usage:
{
"status": "completed",
"message": "All files processed successfully",
"total_files": 5,
"completed_files": 5,
"successful_files": 4,
"failed_files": 1,
"timestamp": "2025-01-15T10:35:00.123Z",
"directory_path": "batch-001/",
"model_id": "arn:aws:bedrock:us-east-1:123456789:inference-profile/us.anthropic.claude-3-5-sonnet-20241022-v2:0",
"execution_arn": "arn:aws:states:execution:...",
"token_usage": {
"input_tokens": 1250,
"output_tokens": 3840,
"total_tokens": 5090,
"avg_input_tokens_per_file": 312.5,
"avg_output_tokens_per_file": 960.0,
"avg_total_tokens_per_file": 1272.5
}
}

Status values:
- `in_progress`: Files are being processed
- `completed`: All files processed successfully
- `error`: Processing failed (see message for details)
Additional status fields:
- `model_id`: Bedrock model ARN used for processing (always present)
- `successful_files`: Number of files that processed without errors (when completed)
- `failed_files`: Number of files that failed during processing (when completed)
- `token_usage`: Aggregated token consumption across all successful files (when completed)
  - `input_tokens`: Total tokens sent to Claude (prompts + image data)
  - `output_tokens`: Total tokens generated by Claude
  - `total_tokens`: Sum of input and output tokens (used for cost calculation)
  - `avg_input_tokens_per_file`: Average input tokens per successful file
  - `avg_output_tokens_per_file`: Average output tokens per successful file
  - `avg_total_tokens_per_file`: Average total tokens per successful file
Note: Token usage is aggregated from S3 object metadata, supporting unlimited batch sizes without Step Functions payload limits.
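For illustration, a minimal boto3 sketch of that aggregation, using the per-file metadata keys documented later in this README (bucket and prefix are placeholders):

```python
import boto3

# Re-derive batch token totals from S3 object metadata, the same source the
# status file aggregates from.
s3 = boto3.client("s3")
bucket = "your-prefix-ai-file-processor-output"
prefix = "batch-001/"

totals = {"input": 0, "output": 0, "files": 0}
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        # Skip the status file; look only at per-file result objects.
        if not obj["Key"].endswith(".json") or obj["Key"].endswith("_status.json"):
            continue
        meta = s3.head_object(Bucket=bucket, Key=obj["Key"])["Metadata"]
        if meta.get("processing-status") != "success":
            continue
        totals["input"] += int(meta.get("input-tokens", 0))
        totals["output"] += int(meta.get("output-tokens", 0))
        totals["files"] += 1

print(totals, "total tokens:", totals["input"] + totals["output"])
```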
"Invalid directory structure": Files not in exactly one-level-deep directory"Job output already exists": Duplicate job prevention (delete output directory to retry)"No processable files found": No supported file types in directory"Processing failed": Step Functions execution error
Each processed file generates a .json result file:
Input: batch-001/image1.jpg
Output: batch-001/image1.jpg.json
Example successful result:
{
"title": "Product Catalog Page",
"description": "Image showing various electronic devices with pricing",
"objects_detected": [
{"name": "smartphone", "confidence": 0.95},
{"name": "laptop", "confidence": 0.87}
],
"extracted_text": "$299.99, $1,499.99",
"analysis_timestamp": "2025-01-15T10:35:22Z"
}

When individual files fail (e.g., file too large, unsupported format), an error file is created:
{
"status": "error",
"error_code": "ValidationException",
"error_message": "messages.0.content.1.image.source.base64: image exceeds 5 MB maximum: 7021200 bytes > 5242880 bytes",
"file_key": "project1/too-big.png",
"record_id": "project1-too-big-png",
"timestamp": "2025-08-15T16:27:00.248755"
}

Important: Individual file failures don't stop the entire batch. Other files continue processing normally, and the overall status will show "completed" even if some files failed.
The system enforces several validation rules:
Accepted:
- Directory exactly one level deep: `folder/_prompt.json`
- Supported file types in the directory
- Valid JSON in `_prompt.json` with a `prompt` field
- No existing output for the directory

Rejected:
- Root-level prompt: `_prompt.json`
- Nested directories: `folder/sub/_prompt.json`
- Missing `prompt` field in the JSON
- Output directory already exists (prevents duplicates)
- No processable files in the directory
- Bedrock Charges: Based on input/output tokens per file
- Lambda Charges: Minimal for processing orchestration
- S3 Charges: Storage and request costs
- Step Functions: Per state transition
Estimated costs (us-east-1, Claude 3.5 Sonnet):
- ~$0.01-0.05 per image, depending on complexity and response length (verify this against your actual token usage and current AWS pricing; a worked example follows)
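As a rough worked example using the `token_usage` figures from the completed status file above, with assumed on-demand list prices of about $3 per million input tokens and $15 per million output tokens for Claude 3.5 Sonnet (always verify current Bedrock pricing for your region):

```python
# Rough cost estimate from the token_usage block in a completed status file.
# Prices are assumptions (approximate Claude 3.5 Sonnet on-demand list prices).
INPUT_PRICE_PER_1K = 0.003    # USD per 1,000 input tokens
OUTPUT_PRICE_PER_1K = 0.015   # USD per 1,000 output tokens

input_tokens = 1250
output_tokens = 3840
files = 5

cost = (input_tokens / 1000) * INPUT_PRICE_PER_1K + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K
print(f"Batch cost: ${cost:.3f}  (~${cost / files:.3f} per file)")
# Batch cost: $0.061  (~$0.012 per file)
```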
- File Size:
  - Limited by Lambda memory (1GB) and timeout (5 minutes per file)
  - Claude/Bedrock enforces a 5MB limit on images sent via the API, and base64 encoding (done in the worker Lambda) adds about 33% to the file size, so keep images as small as possible (see the sketch after this list)
- Concurrency: Default of 10 files processed simultaneously (configurable via the `MaxConcurrency` deployment parameter)
- File Types: Currently jpg and png images only
- Region: Must be deployed in a region with Bedrock model access
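A small sketch for pre-checking whether an image will exceed that limit once base64-encoded (file names are placeholders):

```python
import math
from pathlib import Path

# Check whether an image will fit under the ~5 MB Bedrock request limit once
# base64-encoded (encoding inflates the payload by roughly one third).
BEDROCK_IMAGE_LIMIT = 5 * 1024 * 1024  # 5,242,880 bytes


def encoded_size(path: str) -> int:
    raw = Path(path).stat().st_size
    return math.ceil(raw / 3) * 4  # base64 output size for the raw bytes


for name in ["image1.jpg", "image2.png"]:
    size = encoded_size(name)
    ok = "OK" if size < BEDROCK_IMAGE_LIMIT else "TOO LARGE - resize before upload"
    print(f"{name}: {size:,} bytes after base64 ({ok})")
```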
- "Access Denied" errors: Check Bedrock model permissions in your region
- Files not processing: Verify directory structure and file types
- Status stuck on "in_progress": Check Step Functions execution in AWS Console
- "Job already exists": Delete output directory and retry
- Some files failed: Individual files can fail while batch succeeds - check error files
Bash commands shown, but this can alternatively be done via the AWS console.
User-defined metadata keys are created on each S3 output file:
- `x-amz-meta-processing-status`: `success` or `error`
- `x-amz-meta-input-tokens`: Number of input tokens consumed
- `x-amz-meta-output-tokens`: Number of output tokens generated
- `x-amz-meta-total-tokens`: Total tokens for cost calculation
# Find all error files in a batch - using jq
# (you might need to install jq if you don't have it)
export BUCKET="your-prefix-ai-file-processor-output"
export PREFIX="batch001"
aws s3api list-objects-v2 \
--bucket "$BUCKET" \
--prefix "$PREFIX" \
--output json | \
jq -r '.Contents[]? | select(.Key | endswith(".json")) | .Key' | \
while read -r key; do
processing_status=$(aws s3api head-object \
--bucket "$BUCKET" \
--key "$key" \
--query 'Metadata."processing-status"' \
--output text 2>/dev/null)
if [ "$processing_status" = "error" ]; then
echo "$key"
fi
done
# Download and examine error details
aws s3 cp s3://your-prefix-ai-file-processor-output/batch-001/failed-image.jpg.json ./
cat failed-image.jpg.json

Bash commands shown, but this can alternatively be done via the AWS console.
# Check CloudWatch logs
aws logs describe-log-groups --log-group-name-prefix "/aws/lambda/your-prefix"
# View Step Functions execution
aws stepfunctions list-executions --state-machine-arn <your-state-machine-arn>
# Check S3 bucket contents
aws s3 ls s3://your-prefix-ai-file-processor-output/ --recursive

# Run unit tests
python -m pytest tests/ -v
# Test individual Lambda functions locally
sam build
# Test trigger function with mock S3 event
# Note: Successful testing requires deployed AWS resources
sam local invoke TriggerFunction -e tests/fixtures/s3_event.json
# Test worker function with mock input
# Note: Successful testing requires deployed AWS resources
sam local invoke WorkerFunction -e tests/fixtures/worker_event.json
# Note: Full integration testing requires deployed AWS resources
# (S3 buckets, Step Functions, Bedrock) which cannot run locally

Limitations of Local Testing:
- S3 buckets, Step Functions, and Bedrock services must be mocked or deployed
- Full workflow testing requires actual AWS deployment
- Local testing is mainly useful for unit tests and individual function validation (an example follows)
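As an example of a unit test that runs with no AWS resources at all, here is a self-contained pytest sketch of the one-level-deep rule described earlier; the helper function is illustrative, not the project's actual implementation:

```python
# tests/test_key_validation.py -- illustrative unit test of the directory-depth rule.
import pytest


def is_valid_prompt_key(key: str) -> bool:
    """A _prompt.json key is valid only at exactly one directory level deep."""
    parts = key.split("/")
    return len(parts) == 2 and parts[1] == "_prompt.json"


@pytest.mark.parametrize(
    "key,expected",
    [
        ("project1/_prompt.json", True),             # exactly one level deep
        ("_prompt.json", False),                     # too shallow (root level)
        ("project/subfolder/_prompt.json", False),   # too deep (nested)
        ("project1/image1.jpg", False),              # not the prompt file
    ],
)
def test_prompt_key_depth(key, expected):
    assert is_valid_prompt_key(key) is expected
```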
- `template.yaml`: CloudFormation/SAM template
- `samconfig.toml`: Generic SAM configuration
- `samconfig.example.toml`: Example environment-specific config
- All S3 buckets have public access blocked
- IAM roles follow least-privilege principles
- Lambda functions run with minimal required permissions
- No secrets or credentials stored in code
Contributions are welcome! Please feel free to submit a Pull Request.
This is a quick and dirty solution designed for rapid prototyping. For production use, consider:
- Enhanced error handling and retries
- Support for additional file types
- Batch cost optimization
- Advanced monitoring and alerting
- Input validation and sanitization
This project is licensed under the MIT License - see the LICENSE file for details.