This guide provides step-by-step instructions for implementing and using the complete observability stack for the ML Pipeline. The stack includes Prometheus (metrics), Grafana (dashboards), Elasticsearch (logs), and Kibana (log analysis).
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Flask App │───▶│ Prometheus │───▶│ Grafana │
│ production_app │ │ (Metrics) │ │ (Dashboards) │
│ │ │ localhost:9090 │ │ localhost:3000 │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│
▼
┌─────────────────┐ ┌─────────────────┐
│ Elasticsearch │───▶│ Kibana │
│ (Logs) │ │ (Log Analysis) │
│ localhost:9200 │ │ localhost:5601 │
└─────────────────┘ └─────────────────┘
cd "/home/abhes/MlOps PipeLine" && source venv/bin/activate && PYTHONPATH="/home/abhes/MlOps PipeLine/src" python production_app.pycd observability
docker compose up -d# Generate predictions to create metrics and logs
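Before generating traffic, it can help to wait until every service actually answers its health endpoint. A minimal sketch, using the health-check URLs and ports shown throughout this guide:

```bash
# Poll each health endpoint until it responds (ports taken from the architecture diagram above)
for url in http://localhost:5000/health \
           http://localhost:9090/-/healthy \
           http://localhost:3000/api/health \
           http://localhost:9200/_cluster/health \
           http://localhost:5601/api/status; do
  until curl -sf "$url" > /dev/null; do
    echo "waiting for $url ..."
    sleep 5
  done
done
echo "All services are up."
```

# Generate predictions to create metrics and logs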
for i in {1..20}; do
curl -X POST http://localhost:5000/predict -H "Content-Type: application/json" -d '{
"age": '$((25 + RANDOM % 40))',
"job": "management",
"marital": "single",
"education": "secondary",
"housing": "yes",
"loan": "no",
"duration": '$((100 + RANDOM % 300))',
"campaign": '$((1 + RANDOM % 5))'
}'
sleep 2
done

- URL: http://localhost:9090
- Purpose: Metrics collection and querying
# Model Performance Metrics
ml_model_accuracy # Current model accuracy (0.9215)
ml_model_precision # Model precision
ml_model_recall # Model recall
ml_model_f1_score # F1 score
# Prediction Analytics
ml_predictions_total # Total predictions made
ml_predictions_by_class_total # Predictions by class (0/1)
ml_prediction_confidence # Prediction confidence distribution
ml_prediction_error_rate # Current error rate
# System Performance
http_request_duration_seconds # Request latency
ml_feature_processing_seconds # Feature processing time
ml_model_load_seconds # Model loading time
ml_app_memory_bytes # Memory usage
# Business Metrics
ml_input_validation_failures_total # Input validation errors
ml_active_users # Active users count
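To confirm these metrics are actually being exported, you can grep the raw scrape output (assuming the app exposes /metrics on port 5000, as shown above):

```bash
# List every custom ML metric currently exposed by the app
curl -s http://localhost:5000/metrics | grep -E '^ml_'
```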
Model Performance:
# Model accuracy
ml_model_accuracy
# Prediction rate per minute
rate(ml_predictions_total[1m]) * 60
# Error rate percentage
ml_prediction_error_rate * 100
System Health:
# Average request duration
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Memory usage in MB
ml_app_memory_bytes / 1024 / 1024
Business Analytics:
# Prediction distribution
ml_predictions_by_class_total
# Average confidence
rate(ml_prediction_confidence_sum[5m]) / rate(ml_prediction_confidence_count[5m])
# Requests per second
rate(ml_predictions_total[1m])
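These queries can also be run from the command line. A small helper sketch, assuming jq is installed; it uses curl's -G/--data-urlencode so brackets and parentheses in PromQL are encoded correctly:

```bash
# pq: run a PromQL query against the local Prometheus and print the result set
pq() {
  curl -sG http://localhost:9090/api/v1/query \
       --data-urlencode "query=$1" | jq '.data.result'
}

pq 'rate(ml_predictions_total[1m]) * 60'   # predictions per minute
pq 'ml_prediction_error_rate * 100'        # error rate as a percentage
```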
# Check Prometheus health
curl http://localhost:9090/-/healthy
# Query specific metric
curl "http://localhost:9090/api/v1/query?query=ml_model_accuracy"
# Check targets status
curl http://localhost:9090/api/v1/targets
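# Summarize target health in one line (assumes jq is installed)
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'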
# View configuration
curl http://localhost:9090/api/v1/status/config

- URL: http://localhost:3000
- Login: admin / admin (change on first login)
- Connections → Data Sources → Add data source
- Select Prometheus
- URL: http://prometheus:9090
- Save & Test (should show a green checkmark)
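The same data source can also be provisioned from a file instead of the UI. A minimal sketch, assuming the compose file mounts ./grafana/provisioning/ into the Grafana container at /etc/grafana/provisioning/:

```bash
mkdir -p grafana/provisioning/datasources
cat > grafana/provisioning/datasources/prometheus.yml <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
EOF
```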
Panel Type: Stat
Query: ml_model_accuracy
Title: Model Accuracy
Unit: Percent (0-1)
Thresholds:
- Red: < 0.85
- Yellow: 0.85-0.90
- Green: > 0.90
Panel Type: Time series
Query: rate(ml_predictions_total[1m]) * 60
Title: Predictions per Minute
Unit: requests/min (the query is already scaled to per-minute)
Y-axis: Min 0
Panel Type: Stat
Query: ml_prediction_error_rate * 100
Title: Error Rate
Unit: Percent
Thresholds:
- Green: < 5%
- Yellow: 5-10%
- Red: > 10%
Panel Type: Time series
Query: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
Title: 95th Percentile Latency
Unit: Seconds
Panel Type: Pie chart
Query A: ml_predictions_by_class_total{prediction_class="0"}
Query B: ml_predictions_by_class_total{prediction_class="1"}
Title: Prediction Distribution
Legend: No Subscribe / Subscribe
Panel Type: Time series
Query: ml_app_memory_bytes / 1024 / 1024
Title: Memory Usage
Unit: MB
- Time Range: Last 1 hour
- Refresh: 30s
- Auto-refresh: Enabled
- Save Dashboard: "ML Pipeline Production"
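Once saved, the dashboard can also be exported for version control through Grafana's HTTP API. A sketch, assuming a service account token is stored in GRAFANA_TOKEN and jq is installed; the uid placeholder comes from the first call:

```bash
# Find the dashboard UID, then download its JSON definition
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "http://localhost:3000/api/search?query=ML%20Pipeline%20Production" | jq '.[0].uid'

curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "http://localhost:3000/api/dashboards/uid/<uid>" > ml-pipeline-production.json
```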
- Panel → Alert → Create Alert Rule
- Model Accuracy Alert:
  - Condition: ml_model_accuracy < 0.90
  - Evaluation: Every 1m for 5m
- High Error Rate Alert:
  - Condition: ml_prediction_error_rate > 0.10
  - Evaluation: Every 1m for 2m
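If you prefer managing these alerts as code rather than through the Grafana UI, roughly equivalent Prometheus alerting rules might look like the sketch below (an assumption: your prometheus.yml would need to load alert_rules.yml via its rule_files section):

```bash
cat > observability/alert_rules.yml <<'EOF'
groups:
  - name: ml-pipeline
    rules:
      - alert: ModelAccuracyDegraded
        expr: ml_model_accuracy < 0.90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Model accuracy below 90%"
      - alert: HighPredictionErrorRate
        expr: ml_prediction_error_rate > 0.10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Prediction error rate above 10%"
EOF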
- URL: http://localhost:5601
- Purpose: Log analysis and visualization
- Management → Kibana → Data Views
- Create data view
- Name: ML Pipeline Logs
- Index pattern: ml-pipeline-*
- Timestamp field: @timestamp
- Save data view
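The data view can also be created through Kibana's data views API (available in Kibana 8.x; the kbn-xsrf header is required). A sketch:

```bash
curl -X POST "http://localhost:5601/api/data_views/data_view" \
  -H 'kbn-xsrf: true' -H 'Content-Type: application/json' -d '{
  "data_view": {
    "title": "ml-pipeline-*",
    "name": "ML Pipeline Logs",
    "timeFieldName": "@timestamp"
  }
}'
```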
- Analytics → Discover
- Select ML Pipeline Logs data view
- Time range: Last 1 hour
@timestamp # Log timestamp
level # Log level (INFO, ERROR, WARNING)
service # Service name (ml-pipeline)
message # Log message
request_id # Unique request identifier
prediction # Prediction result (0/1)
confidence # Prediction confidence
duration # Request duration
error # Error details (if any)
model_type # Model type used
endpoint # API endpoint called
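To see how these fields fit together (or to test the data view before real traffic arrives), you can index a hypothetical sample document. The index name and every field value below are illustrative only; the name matches the ml-pipeline-* pattern deliberately so the document shows up in Discover:

```bash
curl -X POST "http://localhost:9200/ml-pipeline-test/_doc" \
  -H 'Content-Type: application/json' -d '{
  "@timestamp": "2024-01-15T10:30:00Z",
  "level": "INFO",
  "service": "ml-pipeline",
  "message": "prediction served",
  "request_id": "test-0001",
  "prediction": 1,
  "confidence": 0.87,
  "duration": 0.042,
  "model_type": "example-model",
  "endpoint": "/predict"
}'
```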
Type: Vertical bar chart
X-axis: Date Histogram (@timestamp)
Y-axis: Count
Split series: Terms (level)
Type: Data table
Columns: @timestamp, level, message, error, request_id
Filter: level:ERROR
Sort: @timestamp desc
Type: Line chart
X-axis: Date Histogram (@timestamp)
Y-axis: Average (duration)
Type: Metric
Aggregation: Count
Filter: NOT error:*
- Analytics → Dashboard → Create dashboard
- Add all visualizations
- Save as "ML Pipeline Logs Dashboard"
- Stack Management → Rules and Connectors → Rules
- Create rule → Elasticsearch query
- Error Rate Alert:
  - Index: ml-pipeline-*
  - Query: level:ERROR
  - Threshold: > 5 errors in 5 minutes
# Check cluster health
curl http://localhost:9200/_cluster/health
# List all indices
curl http://localhost:9200/_cat/indices?v
# Search recent logs
curl -X GET "localhost:9200/ml-pipeline-*/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"range": {
"@timestamp": {
"gte": "now-1h"
}
}
},
"sort": [{"@timestamp": {"order": "desc"}}],
"size": 10
}'
# Search for errors
curl -X GET "localhost:9200/ml-pipeline-*/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"match": {
"level": "ERROR"
}
}
}'
# Get prediction statistics
curl -X GET "localhost:9200/ml-pipeline-*/_search?pretty" -H 'Content-Type: application/json' -d'
{
"aggs": {
"prediction_distribution": {
"terms": {
"field": "prediction"
}
},
"avg_confidence": {
"avg": {
"field": "confidence"
}
}
},
"size": 0
}'

# Application health
curl http://localhost:5000/health
# Detailed status
curl http://localhost:5000/status
# Prometheus metrics
curl http://localhost:5000/metrics

# Prometheus
curl http://localhost:9090/-/healthy
# Grafana
curl http://localhost:3000/api/health
# Elasticsearch
curl http://localhost:9200/_cluster/health
# Kibana
curl http://localhost:5601/api/status

The system includes pre-configured alerts for:
- Model accuracy degradation (< 90%)
- High error rates (> 10%)
- High request latency (> 1s)
- Service unavailability
- Memory usage spikes
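A quick way to confirm all of the above in one pass is a small status loop; a sketch, assuming the default ports used throughout this guide:

```bash
# One-shot status check across the stack: prints the HTTP code from each health endpoint
for svc in flask:5000/health prometheus:9090/-/healthy grafana:3000/api/health \
           elasticsearch:9200/_cluster/health kibana:5601/api/status; do
  name=${svc%%:*}; path=${svc#*:}
  code=$(curl -s -o /dev/null -w '%{http_code}' "http://localhost:$path")
  echo "$name -> HTTP $code"
done
```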
# Check if Flask app is running
curl http://localhost:5000/health
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets
# Restart Prometheus
docker compose restart prometheus

# Check Elasticsearch indices
curl http://localhost:9200/_cat/indices?v
# Generate test logs
curl -X POST http://localhost:5000/predict -H "Content-Type: application/json" -d '{"age": 30, "job": "admin", "marital": "single", "education": "secondary", "housing": "yes", "loan": "no", "duration": 200, "campaign": 2}'
# Check if logs are being created
curl "http://localhost:9200/ml-pipeline-*/_search?pretty"

- Check the data source URL: http://prometheus:9090
- Verify Prometheus is running: docker compose ps
- Test the connection: Save & Test in the data source settings
# Check memory metrics
curl -s http://localhost:5000/metrics | grep ml_app_memory_bytes
# Monitor system resources
docker compose stats

- Model Accuracy: > 90%
- Request Latency: < 500ms (95th percentile)
- Error Rate: < 5%
- Throughput: > 10 requests/minute
- Availability: > 99.9%
- All services running (Flask, Prometheus, Grafana, Elasticsearch, Kibana)
- Metrics being collected (check the /metrics endpoint)
- Logs being generated (check Kibana)
- Dashboards updating (check Grafana)
- Alerts configured and working
- Health checks passing
# Check system health
curl http://localhost:5000/health
# View recent errors
curl -X GET "localhost:9200/ml-pipeline-*/_search?q=level:ERROR&sort=@timestamp:desc&size=5"
# Monitor prediction volume
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=rate(ml_predictions_total[1h])' | jq '.data.result[0].value[1]'

- Review error logs and patterns
- Check model performance trends
- Update alert thresholds if needed
- Clean up old log indices
- Review dashboard effectiveness
# Backup Grafana dashboards
curl -H "Authorization: Bearer <api-key>" http://localhost:3000/api/search?type=dash-db
# Backup Elasticsearch indices (assumes a snapshot repository named "backup" is already registered)
curl -X PUT "localhost:9200/_snapshot/backup/snapshot_$(date +%Y%m%d)"
# Export Prometheus data (requires Prometheus to run with --web.enable-admin-api)
curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot

- Set appropriate alert thresholds based on historical data
- Use structured logging for better searchability
- Monitor business metrics alongside technical metrics
- Implement gradual alerting (warning → critical)
- Regular dashboard reviews and updates
- Optimize query performance in Prometheus and Elasticsearch
- Set appropriate retention policies for logs and metrics
- Use sampling for high-volume tracing
- Monitor resource usage of observability stack itself
- Secure access to monitoring dashboards
- Sanitize sensitive data in logs
- Use HTTPS for production deployments
- Regular security updates for all components
This comprehensive observability stack provides complete visibility into your ML pipeline's performance, health, and business metrics. Use this guide to implement, maintain, and optimize your monitoring infrastructure.
- Abeshith - Project Creator & Lead Developer