This implementation delivers only the ingestion system and Kafka foundation, following this exact workflow:
External Sources -> API Collectors / Scrapers -> Raw Data Acquisition -> JSON Normalization -> Message Serialization -> Kafka Streaming Topic
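The middle stages of this workflow can be sketched as three composable functions. This is a minimal illustration, not the project's actual code: the function names (`acquire`, `normalise`, `serialize`), the `pipeline_version` value, and the null-dropping normalization rule are all assumptions; only the envelope field names come from this README.

```python
import json
import uuid
from datetime import datetime, timezone

def acquire(payload: dict, module: str) -> dict:
    # Raw Data Acquisition: wrap the collector output in an ingestion envelope
    # carrying the metadata fields listed below (field names from this README).
    return {
        "ingest_id": str(uuid.uuid4()),
        "pipeline_version": "1.0.0",  # illustrative value
        "collector_module": module,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }

def normalise(envelope: dict) -> dict:
    # JSON Normalization: an illustrative canonicalization rule
    # (drop null payload fields); the real rules live in normalization/normalise.py.
    envelope["payload"] = {k: v for k, v in envelope["payload"].items() if v is not None}
    return envelope

def serialize(envelope: dict) -> bytes:
    # Message Serialization: deterministic, compact UTF-8 JSON bytes
    # ready to hand to the Kafka producer.
    return json.dumps(envelope, sort_keys=True, separators=(",", ":")).encode("utf-8")

message = serialize(normalise(acquire({"indicator": "GDP", "note": None}, "economy")))
```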
- External source connectors across domains:
  - `collector-modules/economy.py` (World Bank)
  - `collector-modules/geopolitics.py` (GDELT)
  - `collector-modules/climate.py` (Open-Meteo)
  - `collector-modules/technology.py` (HN Algolia)
  - `collector-modules/society_scraper.py` (UN RSS)
- Raw data acquisition envelope with ingestion metadata (`ingest_id`, `pipeline_version`, `collector_module`, `ingested_at`)
- Canonical JSON normalization in `normalization/normalise.py`
- Serialization stage in `serialization.py`
- Kafka streaming client in `kafka_stream.py`
- Orchestrated end-to-end pipeline in `ingestion_pipeline.py`
- Kafka container stack in `docker-compose.yml`
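A collector module only needs to expose a `collect()` entry point returning raw records. The sketch below shows that shape for a climate-style collector; the `fetch_json` helper, the record layout, and the stubbed response are assumptions for illustration (a real module would perform the HTTP call, e.g. with `requests`).

```python
# Hypothetical shape of a collector module (e.g. collector-modules/climate.py).

def fetch_json(url: str) -> dict:
    # Stub standing in for an HTTP GET against the external API;
    # the canned response mimics an Open-Meteo-style payload.
    return {"hourly": {"temperature_2m": [14.2, 15.1]}}

def collect() -> list[dict]:
    """Entry point the pipeline looks for: return a list of raw records."""
    data = fetch_json("https://api.open-meteo.com/v1/forecast")
    return [
        {
            "source": "open-meteo",
            "metric": "temperature_2m",
            "values": data["hourly"]["temperature_2m"],
        }
    ]
```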
- Raw replay/debug topic: `ontology.intelligence.raw.v1`
- Normalized ingestion topic: `ontology.intelligence.ingestion.v1`
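Each cycle publishes to both topics: a raw copy for replay/debugging and the normalized message for downstream consumers. A minimal sketch of that dual publish, with the send function injected so it runs without a broker (the real client in `kafka_stream.py` would wrap an actual Kafka producer; `publish` is a hypothetical name):

```python
# Topic names are the defaults documented in this README.
RAW_TOPIC = "ontology.intelligence.raw.v1"
INGESTION_TOPIC = "ontology.intelligence.ingestion.v1"

def publish(send, raw_bytes: bytes, normalized_bytes: bytes) -> None:
    # Raw envelope first (replay/debug), then the canonical downstream message.
    send(RAW_TOPIC, raw_bytes)
    send(INGESTION_TOPIC, normalized_bytes)

sent = []
publish(lambda topic, value: sent.append((topic, value)), b'{"raw": 1}', b'{"norm": 1}')
```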
- Start Kafka: `docker compose up -d kafka kafdrop`
- Install Python dependencies: `pip install -r requirements.txt`
- Run one ingestion cycle: `python ingestion_pipeline.py`

Environment variables:
- `KAFKA_BOOTSTRAP_SERVERS` (default: `localhost:9092`)
- `KAFKA_RAW_TOPIC` (default: `ontology.intelligence.raw.v1`)
- `KAFKA_TOPIC` (default: `ontology.intelligence.ingestion.v1`)
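Resolving these settings is a one-liner per variable; a sketch of how the pipeline's configuration could read them with the defaults above (the variable names and defaults are from this README, the surrounding code is illustrative):

```python
import os

# Fall back to the documented defaults when the variables are unset.
BOOTSTRAP_SERVERS = os.environ.get("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092")
RAW_TOPIC = os.environ.get("KAFKA_RAW_TOPIC", "ontology.intelligence.raw.v1")
TOPIC = os.environ.get("KAFKA_TOPIC", "ontology.intelligence.ingestion.v1")
```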
- Collectors are modular: any file in `collector-modules/` exposing `collect()` is auto-discovered.
- The pipeline currently runs in batch mode (`run_once`); it can be scheduled for near-real-time operation (cron, Airflow, or a streaming service) in future iterations.
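The auto-discovery described above can be implemented by loading each file and checking for a callable `collect`. A sketch under that assumption (`discover_collectors` is a hypothetical name, not necessarily the project's); note the hyphen in `collector-modules` rules out a plain package import, so modules are loaded from their file paths:

```python
import importlib.util
from pathlib import Path

def discover_collectors(directory: str = "collector-modules") -> dict:
    """Load every *.py file in the directory and keep those exposing collect()."""
    collectors = {}
    for path in sorted(Path(directory).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # executes the collector module
        if callable(getattr(module, "collect", None)):
            collectors[path.stem] = module.collect
    return collectors
```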