The objective of this project is to build a machine learning model to predict the presence of heart disease in a patient based on a set of medical attributes. This notebook covers the entire process, from initial data exploration to training and evaluating a Support Vector Classifier (SVC).
Source:
This project uses a consolidated version of the popular Heart Disease dataset.
Content:
- The dataset contains 1025 patient records and 14 key medical attributes.
- An initial analysis found no missing values, and all features are already in numerical format.
Attribute Information:
| Feature | Description |
|---|---|
| age | Age of the patient in years |
| sex | Gender (1 = male, 0 = female) |
| cp | Chest pain type (0–3) |
| trestbps | Resting blood pressure (mm Hg) |
| chol | Serum cholesterol (mg/dl) |
| fbs | Fasting blood sugar > 120 mg/dl (1 = true; 0 = false) |
| restecg | Resting electrocardiographic results (0,1,2) |
| thalach | Maximum heart rate achieved |
| exang | Exercise induced angina (1 = yes; 0 = no) |
| oldpeak | ST depression induced by exercise relative to rest |
| slope | Slope of peak exercise ST segment |
| ca | Number of major vessels (0–3) colored by fluoroscopy |
| thal | 0 = normal; 1 = fixed defect; 2 = reversible defect |
| target | Heart disease presence (0 = no, 1 = yes) |
- Loaded the dataset and performed a quick exploratory analysis using `.info()` and `.describe()` to:
  - Confirm data types
  - Check for missing values (there were none)
  - Understand the statistical distribution of each feature
- Feature Scaling:
  Since SVC is a distance-based algorithm and features are on different scales (e.g., `age` vs `chol`), applied `StandardScaler` to standardize features (mean = 0, std = 1). This keeps features with large numeric ranges from dominating the model.
- Data Splitting:
  Split the dataset into:
  - Training set: 80%
  - Testing set: 20%

  This allows the model to be trained on one portion of the data and evaluated on unseen data.
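
The exploration, splitting, and scaling steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names `summarize` and `split_and_scale`, the `random_state` value, and the use of `stratify` are all assumptions. The sketch also fits the scaler on the training split only, which may differ from the order used in the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def summarize(df: pd.DataFrame) -> dict:
    """Collect the quick checks described above (hypothetical helper)."""
    return {
        "shape": df.shape,                          # (rows, columns)
        "dtypes": df.dtypes.astype(str).to_dict(),  # data type per column
        "missing": int(df.isnull().sum().sum()),    # total missing values
        "stats": df.describe(),                     # per-feature statistics
    }


def split_and_scale(df: pd.DataFrame, target: str = "target",
                    test_size: float = 0.2, seed: int = 42):
    """80/20 split followed by standardization (hypothetical helper)."""
    X = df.drop(columns=[target])
    y = df[target]
    # stratify=y keeps the class balance similar in both splits (assumption)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    # Fit the scaler on the training data only, then reuse it on the test data
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    return X_train_scaled, X_test_scaled, y_train, y_test, scaler
```

With the data in a CSV file (the name `heart.csv` is an assumption), `df = pd.read_csv("heart.csv")` followed by `summarize(df)` and `split_and_scale(df)` would reproduce the steps listed above.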
- Algorithm: Support Vector Classifier (SVC) for binary classification.
- Training: Model trained on the scaled training data.
- Evaluation:
  - Accuracy on test data: ~85%
  - Generated a confusion matrix to analyze true positives, true negatives, false positives, and false negatives.
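
A rough sketch of the training and evaluation steps might look like the following. The source does not state which kernel or hyperparameters were used, so scikit-learn's `SVC` defaults are assumed here, and the helper name `train_and_evaluate` is hypothetical. Synthetic data from `make_classification` stands in for the real dataset so the block runs on its own; the accuracy it prints is unrelated to the ~85% reported above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def train_and_evaluate(X_train, X_test, y_train, y_test):
    """Fit an SVC and report accuracy plus confusion-matrix counts."""
    model = SVC()  # scikit-learn defaults; the notebook's settings are unknown
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    # With labels=[0, 1], rows are true classes and columns are predictions,
    # so ravel() yields tn, fp, fn, tp in that order.
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    counts = {"tn": int(tn), "fp": int(fp), "fn": int(fn), "tp": int(tp)}
    return model, acc, counts


# Synthetic stand-in with the same shape as the dataset (1025 rows, 13 features)
X, y = make_classification(n_samples=1025, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
scaler = StandardScaler().fit(X_train)
model, acc, counts = train_and_evaluate(
    scaler.transform(X_train), scaler.transform(X_test), y_train, y_test
)
print(f"accuracy: {acc:.3f}")
print(counts)
```

Returning the four counts separately makes it easy to see whether errors are mostly false negatives (missed disease cases) or false positives.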
- This project demonstrates how SVC can be used for binary classification tasks on real-world medical datasets.
- Proper feature scaling is crucial for good SVC performance, and a held-out test split is needed to measure that performance reliably.
- The dataset is clean and numerical, making it ideal for practicing SVC and hyperparameter tuning.
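
As one possible way to practice the hyperparameter tuning mentioned above, scaling and the classifier can be combined in a `Pipeline` and searched with `GridSearchCV`. This is only a sketch under stated assumptions: the parameter grid, the 5-fold setting, and the synthetic data are illustrative choices, not values taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); replace with the real features/target
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Putting the scaler inside the pipeline refits it on each training fold,
# so the validation folds never influence the scaling statistics.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Illustrative grid; gamma is ignored when the kernel is linear
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
    "svc__gamma": ["scale", 0.01],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)
print(f"test accuracy: {search.score(X_test, y_test):.3f}")
```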