An end-to-end MLOps project for predicting colorectal cancer patient survival using machine learning, with comprehensive experiment tracking, pipeline orchestration, and deployment automation.
This project implements a complete machine learning pipeline for colorectal cancer survival prediction, showcasing modern MLOps practices including experiment tracking with MLflow, containerization with Docker, and orchestration using Kubeflow on a local Kubernetes cluster.
- MLflow Integration with DagsHub: Complete experiment tracking and model versioning
- Local Kubernetes Cluster: Production-grade infrastructure on your local machine
- Kubeflow Pipelines: Automated ML workflow orchestration
- Docker Containerization: Consistent deployment across environments
- Data Versioning: Track datasets and code changes throughout the project lifecycle
- End-to-End Pipeline: From data processing to model deployment
```
Project Setup → Jupyter Notebook Testing → Data Processing
                                                 ↓
User App ← Dockerization ← Pipeline Making ← Model Training
    ↓                            ↓                ↓
Deployment ← Kubeflow Setup ← Data/Code Versioning ← Experiment Tracking
```
- ML Framework: Scikit-learn, Pandas, NumPy
- Experiment Tracking: MLflow + DagsHub
- Orchestration: Kubeflow Pipelines
- Containerization: Docker, DockerHub
- Infrastructure: Kubernetes (local cluster)
- Version Control: Git, DVC (Data Version Control)
This project uses colorectal cancer patient data to predict survival outcomes. The dataset includes various clinical and pathological features that influence patient prognosis.
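As a hedged illustration of preparing clinical features for modeling, the sketch below one-hot encodes a categorical feature and separates a binary survival target. The records and column names (`Age`, `Tumor_Stage`, `Survival_5_Years`) are placeholders, not the dataset's actual schema:

```python
import pandas as pd

# Illustrative records only -- the real dataset is loaded from data/
df = pd.DataFrame({
    "Age": [64, 58, 71, 49],
    "Tumor_Stage": ["II", "III", "IV", "I"],          # categorical clinical feature
    "Survival_5_Years": ["Yes", "No", "No", "Yes"],   # prediction target
})

# One-hot encode categorical features; map the target to 0/1
X = pd.get_dummies(df.drop(columns="Survival_5_Years"))
y = df["Survival_5_Years"].map({"No": 0, "Yes": 1})
print(X.shape, y.tolist())  # -> (4, 5) [1, 0, 0, 1]
```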
- Python 3.8+
- Docker Desktop
- Kubernetes (Minikube or Docker Desktop with K8s enabled)
- kubectl CLI
- Git
- Clone the repository

```bash
git clone https://github.com/imchandanmohan/Colorectal-Cancer-Survival-Prediction.git
cd Colorectal-Cancer-Survival-Prediction
```

- Install dependencies

```bash
pip install -r requirements.txt
```

- Configure MLflow with DagsHub

```bash
# Set environment variables
export MLFLOW_TRACKING_URI=your_dagshub_uri
export MLFLOW_TRACKING_USERNAME=your_username
export MLFLOW_TRACKING_PASSWORD=your_password
```

- Data Processing: Process and prepare the dataset
- Model Training: Train models with experiment tracking
- Kubeflow Pipeline: Set up and run automated pipelines
- Deployment: Deploy the model using Docker and Kubernetes
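The model training step above can be sketched as follows; the classifier choice and the synthetic data are placeholders for illustration, not the project's actual configuration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed clinical features
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy survival label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"hold-out accuracy: {acc:.2f}")
```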
```
├── data/                 # Dataset files
├── notebooks/            # Jupyter notebooks for experimentation
├── src/                  # Source code
│   ├── data_processing/  # Data preparation scripts
│   ├── model_training/   # Training scripts
│   └── pipeline/         # Kubeflow pipeline definitions
├── docker/               # Dockerfiles
├── kubernetes/           # K8s manifests
├── mlruns/               # MLflow tracking data
└── README.md
```
All experiments are tracked using MLflow and stored on DagsHub, providing:
- Model parameters and metrics
- Artifact storage
- Model versioning
- Experiment comparison
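A minimal logging sketch, assuming the `MLFLOW_TRACKING_*` environment variables from the installation step are set; the parameter names and metric keys here are illustrative, not project results:

```python
def log_experiment(params, metrics, run_name="rf-baseline"):
    """Log one run to the MLflow server configured via MLFLOW_TRACKING_* env vars."""
    import mlflow  # deferred import so the sketch loads without MLflow installed

    with mlflow.start_run(run_name=run_name):
        mlflow.log_params(params)    # hyperparameters for this run
        mlflow.log_metrics(metrics)  # evaluation scores, comparable in DagsHub


# Example payload -- values are placeholders, not measured results
params = {"n_estimators": 100, "max_depth": 8}
metrics = {"test_accuracy": 0.0}  # replace with the real evaluation score
# log_experiment(params, metrics)  # requires a reachable tracking server
```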
The project includes:
- Dockerfiles for containerizing the application
- Kubernetes manifests for deployment
- Kubeflow pipeline definitions for automated workflows
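As a hedged sketch of what such a pipeline definition might look like with the Kubeflow Pipelines v2 SDK (the component names, base images, and pipeline name are illustrative, not the project's actual definitions):

```python
def build_pipeline():
    """Assemble a two-step pipeline; requires `pip install kfp` (v2 SDK)."""
    from kfp import dsl  # deferred import so the sketch loads without kfp installed

    @dsl.component(base_image="python:3.10")
    def preprocess() -> str:
        # Placeholder: would read data/, clean it, and return a dataset path
        return "processed.csv"

    @dsl.component(base_image="python:3.10")
    def train(dataset: str):
        # Placeholder: would load the dataset and fit the survival model
        print(f"training on {dataset}")

    @dsl.pipeline(name="crc-survival-pipeline")
    def pipeline():
        prep = preprocess()
        train(dataset=prep.output)

    return pipeline


# To compile for submission to a Kubeflow cluster:
# from kfp import compiler
# compiler.Compiler().compile(build_pipeline(), "pipeline.yaml")
```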
- Clinical Decision Support: Assist healthcare providers in treatment planning
- Risk Stratification: Identify high-risk patients for targeted interventions
- Research Applications: Enable survival analysis for clinical research
- Healthcare Analytics: Support population health management initiatives
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
Chandan Mohan
- GitHub: @imchandanmohan
- DagsHub for MLflow hosting
- Kubeflow community for orchestration tools
- Healthcare datasets contributors
⭐ If you find this project helpful, please consider giving it a star!