This project implements a machine learning pipeline for the early detection of heart disease using Apache Spark and its MLlib library, with several machine learning algorithms including RandomForest, Logistic Regression, and XGBoost. The pipeline reads a CSV dataset, applies transformations, trains models, evaluates their performance, and presents the results through data visualization.
Machine Learning Algorithms:
- RandomForest
- Logistic Regression
- XGBoost
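As a rough sketch, the three algorithms could be instantiated in PySpark as shown below. The column names `features` and `label`, the hyperparameter values, and the helper name `build_classifiers` are illustrative assumptions, not taken from the project's notebook. XGBoost is not part of MLlib, so this sketch assumes the separate `xgboost` package and its `xgboost.spark.SparkXGBClassifier` wrapper.

```python
# Hypothetical hyperparameters, chosen only for illustration.
CLASSIFIER_PARAMS = {
    "random_forest": {"numTrees": 100, "seed": 42},
    "logistic_regression": {"maxIter": 100},
    "xgboost": {},  # library defaults
}


def build_classifiers(features_col="features", label_col="label"):
    """Return untrained classifiers keyed by a short name.

    Call this after a SparkSession has been created: the MLlib
    estimators are backed by JVM objects and need an active Spark context.
    """
    # Imported lazily so this module loads even where Spark is not installed.
    from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
    from xgboost.spark import SparkXGBClassifier

    return {
        "random_forest": RandomForestClassifier(
            featuresCol=features_col,
            labelCol=label_col,
            **CLASSIFIER_PARAMS["random_forest"],
        ),
        "logistic_regression": LogisticRegression(
            featuresCol=features_col,
            labelCol=label_col,
            **CLASSIFIER_PARAMS["logistic_regression"],
        ),
        # SparkXGBClassifier uses snake_case parameter names.
        "xgboost": SparkXGBClassifier(
            features_col=features_col,
            label_col=label_col,
            **CLASSIFIER_PARAMS["xgboost"],
        ),
    }
```

Keeping the classifiers in a dictionary makes it straightforward to loop over them and train each one on the same prepared features.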
Processing Chain:
- Transformers: used two transformers in the processing chain.
- Estimators: trained models with each of the algorithms listed above.
- Pipeline: chained the transformers and an estimator into a single pipeline.
- Evaluation: computed key metrics to assess each model's performance.
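The processing chain above might look roughly like the following in PySpark. This is a minimal sketch under stated assumptions, not the notebook's actual code: the project does not name its two transformers, so `VectorAssembler` and `StandardScaler` are stand-ins (strictly, `StandardScaler` is an estimator that yields a transformer once the pipeline is fitted). The label column name `target`, the 80/20 split, the chosen metrics, and the helper names `feature_columns` and `run_pipeline` are also assumptions.

```python
LABEL_COL = "target"  # assumed name of the label column in the CSV


def feature_columns(all_columns, label_col=LABEL_COL):
    """Treat every column except the label as a numeric feature."""
    return [c for c in all_columns if c != label_col]


def run_pipeline(csv_path):
    """Fit a transformer -> estimator pipeline and return evaluation metrics."""
    # Imported lazily so this module loads even where Spark is not installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import (
        BinaryClassificationEvaluator,
        MulticlassClassificationEvaluator,
    )
    from pyspark.ml.feature import StandardScaler, VectorAssembler
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("heart-disease-sketch").getOrCreate()
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    train, test = df.randomSplit([0.8, 0.2], seed=42)

    # Feature stages: assemble the columns into one vector, then scale it.
    assembler = VectorAssembler(
        inputCols=feature_columns(df.columns), outputCol="raw_features"
    )
    scaler = StandardScaler(inputCol="raw_features", outputCol="features")

    # Estimator: any of the project's three algorithms could be used here.
    classifier = RandomForestClassifier(
        featuresCol="features", labelCol=LABEL_COL, seed=42
    )

    # Pipeline: fit all stages on the training split, then score the test split.
    model = Pipeline(stages=[assembler, scaler, classifier]).fit(train)
    predictions = model.transform(test)

    # Evaluation: area under the ROC curve and plain accuracy.
    auc = BinaryClassificationEvaluator(
        labelCol=LABEL_COL, metricName="areaUnderROC"
    ).evaluate(predictions)
    accuracy = MulticlassClassificationEvaluator(
        labelCol=LABEL_COL, predictionCol="prediction", metricName="accuracy"
    ).evaluate(predictions)
    return {"areaUnderROC": auc, "accuracy": accuracy}
```

Calling `run_pipeline("heart.csv")` (a hypothetical local path to the downloaded dataset) would return a dictionary with the two metrics for the test split.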
Activity:
- Selected a suitable CSV dataset for the project: https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset
- Illustrated the processing chain with a focus on transformers, estimators, pipelines, and evaluation.
- Implemented the pipeline in a notebook, tested it, and verified that it works.
- Presented the results through data visualization.

Results:
- The pipeline showed promising results for the early detection of heart disease.
- The algorithms were chosen with both accuracy and efficiency in mind.

Requirements:
- Python (version 3.11.5)
- Apache Spark




