🎧 Customer Churn Prediction - Spotify Music Streaming Platform
This project applies machine learning to predict customer churn from user interaction data collected by a music streaming service similar to Spotify. It identifies high-risk users and surfaces insights to inform retention strategies, comparing Random Forest, Logistic Regression, and Naive Bayes models.
📊 Objective
To develop a machine learning pipeline that:
Accurately predicts user churn
Analyzes key churn factors such as user engagement, subscription level, and location
Supports strategic customer retention efforts
🧠 Key Features
Comprehensive Data Cleaning: Removes noisy data, handles missing values, and processes user session and demographic data.
Feature Engineering: Captures behavior like:
Time since registration
Song listening patterns
Subscription level (Free/Paid)
Interaction with platform features (e.g., Thumbs Up/Down, Add to Playlist)
Geographic region
Model Training and Evaluation:
Trained and tested using an 80/20 split
Uses F1 Score as the primary metric due to class imbalance
Compares a Naive Bayes baseline, Logistic Regression, and Random Forest
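To illustrate why F1 is preferred over accuracy when churners are a minority, here is a minimal sketch in plain Python. The labels are made up for illustration, and it computes binary F1 for the churn class; the README does not state whether the reported F1 is binary or averaged across classes.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 is the harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No correctly predicted churners: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical sample: 2 churned users (1) out of 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_stay = [0] * 10                       # never predicts churn
one_hit = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]     # finds one churner, one false alarm

print(accuracy(y_true, always_stay), f1_score(y_true, always_stay))
print(accuracy(y_true, one_hit), f1_score(y_true, one_hit))
```

Both predictors reach the same accuracy, but only the second one finds any churners, and only F1 reflects that difference.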
🗂 Dataset
Source: Internal JSON file representing Spotify user logs
Size: 286,500 records, 18 features
Fields:
userId, sessionId, page, gender, location, level, userAgent, etc.
Events like NextSong, Thumbs Up, Add to Playlist, Logout, etc.
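As a rough sketch of how such logs might be read, the snippet below parses newline-delimited JSON into dictionaries. The one-record-per-line layout is an assumption (the README only says "JSON file"), and the two sample records are invented, using only field names listed above.

```python
import io
import json


def read_events(lines):
    """Yield one dict per non-empty line of newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Made-up records; in practice `sample` would be an open file handle.
sample = io.StringIO(
    '{"userId": "30", "sessionId": 29, "page": "NextSong", "level": "paid"}\n'
    '{"userId": "9", "sessionId": 8, "page": "Thumbs Up", "level": "free"}\n'
)
events = list(read_events(sample))
print(len(events), events[0]["page"])
```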
⚙️ Methodology
Data Preprocessing
Removed incomplete entries (missing user/session IDs)
Parsed and standardized timestamps and session logs
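The two preprocessing steps could look roughly like the sketch below. The `ts` field name and its unit (milliseconds since the Unix epoch) are assumptions, since the README does not list the timestamp column; `clean_events` is a hypothetical helper, not the project's actual code.

```python
from datetime import datetime, timezone


def clean_events(events):
    """Drop rows without a user or session ID and add a parsed UTC timestamp."""
    cleaned = []
    for event in events:
        # Treat None and empty strings as missing IDs.
        if event.get("userId") in (None, "") or event.get("sessionId") in (None, ""):
            continue
        row = dict(event)
        # Assumption: "ts" holds milliseconds since the Unix epoch.
        row["timestamp"] = datetime.fromtimestamp(event["ts"] / 1000, tz=timezone.utc)
        cleaned.append(row)
    return cleaned


# Invented rows: only the first has both IDs present.
raw = [
    {"userId": "30", "sessionId": 29, "ts": 1538352117000, "page": "NextSong"},
    {"userId": "", "sessionId": 8, "ts": 1538352180000, "page": "Home"},
    {"userId": "9", "sessionId": None, "ts": 1538352394000, "page": "Home"},
]
cleaned = clean_events(raw)
print(len(cleaned), cleaned[0]["timestamp"].isoformat())
```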
Feature Engineering
Aggregated features per user
Binary churn indicator from “Cancellation Confirmation” event
Ratios and frequency distributions of user behavior
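A minimal sketch of the per-user aggregation, assuming event rows shaped like the log fields above. The feature names and the choice of total events as the ratio denominator are illustrative assumptions; `build_user_features` is a hypothetical helper.

```python
from collections import Counter, defaultdict

CHURN_PAGE = "Cancellation Confirmation"


def build_user_features(events):
    """Aggregate event rows into one feature dict per user."""
    page_counts = defaultdict(Counter)
    for event in events:
        page_counts[event["userId"]][event["page"]] += 1

    features = {}
    for user_id, counts in page_counts.items():
        total = sum(counts.values())
        features[user_id] = {
            # Churn label: 1 if the user ever reached the cancellation page.
            "churn": int(counts[CHURN_PAGE] > 0),
            "n_events": total,
            # Ratios relative to all of the user's events (an assumption).
            "thumbs_down_ratio": counts["Thumbs Down"] / total,
            "songs_ratio": counts["NextSong"] / total,
        }
    return features


# Invented events for two users.
events = [
    {"userId": "a", "page": "NextSong"},
    {"userId": "a", "page": "NextSong"},
    {"userId": "a", "page": "Thumbs Down"},
    {"userId": "a", "page": "Cancellation Confirmation"},
    {"userId": "b", "page": "NextSong"},
    {"userId": "b", "page": "Thumbs Up"},
]
features = build_user_features(events)
print(features["a"])
print(features["b"])
```

Using ratios rather than raw counts keeps heavy and light users comparable, which matters when account ages differ widely.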
Modeling
Selected RandomForestClassifier as the best-performing model
Tuned hyperparameters with GridSearchCV
Achieved:
F1 Score (Test): 0.717
Accuracy (Test): 0.756
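The tuning step might be sketched as follows with scikit-learn. Everything specific here is an assumption: the data is synthetic, the parameter grid is illustrative (the README does not state the search space), and stratifying the 80/20 split is a choice made for this example. It will not reproduce the scores reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the per-user feature matrix.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=42
)

# 80/20 split; stratify keeps the churn rate similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Illustrative grid only; the project's actual search space is not stated.
param_grid = {"n_estimators": [50, 100], "max_depth": [4, 8]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",  # optimize the same metric the project reports
    cv=3,
)
search.fit(X_train, y_train)

# Evaluate the refit best model once on the held-out 20%.
test_f1 = f1_score(y_test, search.predict(X_test))
print(search.best_params_, round(test_f1, 3))
```

Scoring the grid search on F1 rather than the default accuracy keeps model selection aligned with the primary metric.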
Important Features Identified
Time since registration
Thumbs Down ratio
Ad interaction rate
User geography
Homepage interaction
🔍 Results
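| Model | F1 Score (Test) | Accuracy (Test) |
| --- | --- | --- |
| Naive Bayes | 0.668 | 0.769 |
| Logistic Regression | 0.621 | 0.733 |
| Random Forest | 0.717 | 0.756 |

Random Forest achieved the highest test F1 score (0.717), the primary metric, and was judged the most robust and generalizable model, although Naive Bayes scored slightly higher on test accuracy (0.769 vs. 0.756).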
🚀 Future Work
Real-time churn prediction deployment
Integrating social media or review data for richer insights
A/B testing for retention strategy validation
Exploring ensemble and deep learning methods