This project is a cloud-native, distributed system that automates the ingestion and analysis of high-volume image and document uploads. The primary goal is a pipeline that is both resilient to failure and highly performant.
Core Architecture: Each upload to Amazon S3 acts as the ingest trigger, placing an asynchronous job on an Amazon SQS queue. A Dockerized Java application acts as the consumer service and processes messages concurrently. For each message, the service calls AWS AI services (Amazon Rekognition and Amazon Textract) to extract and classify data, then stores the final, enriched result.
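As a rough illustration of the step where the consumer hands a job to one of the two AI services, the sketch below picks a service from the uploaded object's key. The write-up does not say how the real service decides, so the extension-based rule, the extension lists, and the names UploadRouter, Analyzer, and route are all hypothetical. No AWS SDK calls are made.

```java
import java.util.Locale;
import java.util.Set;

/** Sketch only: the routing rule and every name here are illustrative assumptions. */
final class UploadRouter {

    /** Which downstream analysis service a job would be sent to. */
    enum Analyzer { REKOGNITION, TEXTRACT, UNSUPPORTED }

    // Assumed extension lists; the real pipeline may classify uploads differently.
    private static final Set<String> IMAGE_EXTENSIONS = Set.of("jpg", "jpeg", "png");
    private static final Set<String> DOCUMENT_EXTENSIONS = Set.of("pdf", "tif", "tiff");

    /** Picks an analyzer from the file extension of the object key. */
    static Analyzer route(String objectKey) {
        int dot = objectKey.lastIndexOf('.');
        if (dot < 0 || dot == objectKey.length() - 1) {
            return Analyzer.UNSUPPORTED; // no extension to inspect
        }
        String ext = objectKey.substring(dot + 1).toLowerCase(Locale.ROOT);
        if (IMAGE_EXTENSIONS.contains(ext)) {
            return Analyzer.REKOGNITION;
        }
        if (DOCUMENT_EXTENSIONS.contains(ext)) {
            return Analyzer.TEXTRACT;
        }
        return Analyzer.UNSUPPORTED;
    }

    public static void main(String[] args) {
        String[] keys = {"uploads/cat.PNG", "uploads/invoice.pdf", "uploads/notes.txt"};
        for (String key : keys) {
            System.out.println(key + " -> " + route(key));
        }
    }
}
```

Keeping the decision in a pure function like this makes it easy to unit test without touching any AWS resources. A real consumer would follow it with the corresponding SDK call.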
Key Technical Highlights:
Concurrency Model: A message-driven consumer architecture in Java manages parallelism and ensures that a job hit by a transient error is retried rather than lost.
Performance: The asynchronous queuing model decouples ingestion from processing, which reduced end-to-end latency for document analysis by 20%.
DevOps: The entire processor service is containerized using Docker for consistent deployment across different AWS environments.
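The concurrency model in the highlights above can be sketched with plain JDK classes. The write-up does not describe how the service avoids losing jobs. With SQS, a received message stays on the queue until the consumer deletes it, and it becomes visible again if it is not deleted before its visibility timeout expires. The sketch below imitates that behavior under stated assumptions: an in-memory BlockingQueue stands in for SQS, a job counts as acknowledged only after its handler returns, and a failed job is put back for another attempt. The names RetryingConsumerSketch and processAll are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Sketch only: an in-memory queue stands in for SQS; all names are hypothetical. */
final class RetryingConsumerSketch {

    /** Processes every job with a pool of workers and returns the jobs that completed. */
    static Set<String> processAll(List<String> jobs, int workers, Consumer<String> handler) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(jobs);
        Set<String> completed = ConcurrentHashMap.newKeySet();
        CountDownLatch remaining = new CountDownLatch(jobs.size());
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                while (remaining.getCount() > 0) {
                    String job;
                    try {
                        job = queue.poll(50, TimeUnit.MILLISECONDS);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return; // pool is shutting down
                    }
                    if (job == null) {
                        continue; // queue momentarily empty, check again
                    }
                    try {
                        handler.accept(job);
                        completed.add(job);    // acknowledge only after success
                        remaining.countDown();
                    } catch (RuntimeException transientError) {
                        // Put the job back so another attempt is made. This sketch retries
                        // without limit; a real consumer would cap the number of attempts.
                        queue.add(job);
                    }
                }
            });
        }

        try {
            remaining.await(); // block until every job has succeeded once
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // stop waiting, return what finished
        } finally {
            pool.shutdownNow();
        }
        return completed;
    }

    public static void main(String[] args) {
        Map<String, AtomicInteger> attempts = new ConcurrentHashMap<>();
        Set<String> done = processAll(List.of("a.png", "b.pdf", "c.jpg"), 2, job -> {
            // Simulated transient error: every job fails on its first attempt.
            int attempt = attempts.computeIfAbsent(job, k -> new AtomicInteger()).incrementAndGet();
            if (attempt == 1) {
                throw new IllegalStateException("transient failure for " + job);
            }
        });
        System.out.println("completed " + done.size() + " jobs");
    }
}
```

Because a job can be handled more than once under this scheme, the handler should be idempotent. With SQS, a dead-letter queue is the usual way to stop retrying a message that keeps failing.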