This project aims to predict user churn for Spotify, a music streaming service, using machine learning techniques. The code includes data preprocessing, exploratory data analysis, feature engineering, model training, evaluation, and hyperparameter tuning.
Dataset link: https://drive.google.com/file/d/1rOu5MD-EIO5b1KyHV6iRF-o6Ai5th6Mg/view?usp=drive_link
The dataset used in this project is a JSON file containing records of user interactions with Spotify. It includes information about user actions, demographics, and historical activity.
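As a rough illustration of how such a file might be loaded and given a first cleaning pass, here is a minimal sketch. It assumes the file is newline-delimited JSON (one event per line), which may not match the actual dataset, and the field names `userId` and `page` are hypothetical placeholders rather than the project's confirmed schema.

```python
import io

import pandas as pd


def load_events(source, lines=True):
    """Load event records and apply basic cleaning.

    `lines=True` assumes newline-delimited JSON; pass lines=False
    for a file that holds a single JSON array instead.
    """
    df = pd.read_json(source, lines=lines)
    df = df.drop_duplicates()          # remove exact duplicate events
    df = df.dropna(subset=["userId"])  # drop events with no user (hypothetical column)
    return df.reset_index(drop=True)


# Tiny in-memory sample standing in for the real dataset file.
sample = io.StringIO(
    '{"userId": 1, "page": "NextSong"}\n'
    '{"userId": 1, "page": "NextSong"}\n'
    '{"userId": null, "page": "Home"}\n'
    '{"userId": 2, "page": "Cancel"}\n'
)
events = load_events(sample)
print(events)
```

For the real file, `load_events` would be called with the path to the downloaded dataset instead of the in-memory sample.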
The following Python libraries are required to run the code:
- pandas
- numpy
- matplotlib
- seaborn
- scikit-learn
- joblib
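One way to confirm that these libraries are available before running the project is a small check like the sketch below. The helper name `missing_packages` is made up for illustration; note that scikit-learn is imported under the module name `sklearn`.

```python
from importlib.util import find_spec

# Package name (as installed) mapped to the module name used in imports.
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "scikit-learn": "sklearn",
    "joblib": "joblib",
}


def missing_packages(required=REQUIRED):
    """Return the package names whose modules cannot be found."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

Any package reported as missing can be installed with `pip install <package>`.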
The code is structured as follows:
- Data Loading and Cleaning: The code starts by loading the dataset from a JSON file and performing initial data cleaning operations, such as handling missing values and removing duplicates.
- Exploratory Data Analysis: Various analyses are conducted to gain insights into user behavior, including daily listening time, number of listening sessions per week, average session duration, unique artists listened to, and playlist diversity.
- Feature Engineering: The code creates several features that could be predictive of user churn, such as user engagement metrics, session statistics, user agent information, page event counts, location data, and more.
- Correlation Analysis: Correlation analysis is performed to identify highly correlated features, which are then removed to avoid multicollinearity.
- Feature Transformation: Certain features are transformed (e.g., log transformation, square root transformation) to improve model performance.
- Data Preprocessing: The dataset is split into training and testing sets, and feature scaling is applied.
- Model Training and Evaluation: The code implements several machine learning models, including Logistic Regression and Random Forest, for churn prediction. Grid search is used for hyperparameter tuning, and model performance is evaluated using metrics such as F1-score and accuracy.
- Feature Importance: The code analyzes and visualizes the importance of features for each trained model, helping to identify the most influential factors for churn prediction.
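The modeling steps above (correlation filtering, transformation, splitting, scaling, grid search, evaluation, and feature importance) can be sketched roughly as follows. This is not the project's actual code: the data is synthetic, the feature names (`total_songs`, `avg_session_minutes`, `thumbs_down`), the 0.9 correlation threshold, and the parameter grids are all assumptions chosen only to make the example run.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def drop_correlated(features, threshold=0.9):
    """Drop one feature from each pair whose absolute correlation exceeds threshold."""
    corr = features.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)


# Synthetic stand-in for the engineered per-user feature table (hypothetical names).
rng = np.random.default_rng(0)
n_users = 300
X = pd.DataFrame({
    "total_songs": rng.poisson(200, n_users).astype(float),
    "avg_session_minutes": rng.gamma(2.0, 30.0, n_users),
    "thumbs_down": rng.poisson(5, n_users).astype(float),
})
X["total_songs_copy"] = X["total_songs"] * 1.01  # deliberately redundant feature
y = (X["thumbs_down"] + rng.normal(0, 2, n_users) > 6).astype(int)  # synthetic churn label

X = drop_correlated(X)                         # correlation analysis
X["total_songs"] = np.log1p(X["total_songs"])  # log transform of a skewed count

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Each candidate pairs an estimator with a small, illustrative parameter grid.
candidates = {
    "logistic_regression": (
        LogisticRegression(max_iter=1000),
        {"model__C": [0.1, 1.0, 10.0]},
    ),
    "random_forest": (
        RandomForestClassifier(random_state=42),
        {"model__n_estimators": [50, 100], "model__max_depth": [3, None]},
    ),
}

results, searches = {}, {}
for name, (estimator, grid) in candidates.items():
    # Scaling lives inside the pipeline so it is fit on training folds only.
    pipe = Pipeline([("scale", StandardScaler()), ("model", estimator)])
    search = GridSearchCV(pipe, grid, scoring="f1", cv=3)
    search.fit(X_train, y_train)
    pred = search.predict(X_test)
    results[name] = {
        "f1": f1_score(y_test, pred),
        "accuracy": accuracy_score(y_test, pred),
        "best_params": search.best_params_,
    }
    searches[name] = search

# Feature importance from the tuned random forest.
forest = searches["random_forest"].best_estimator_.named_steps["model"]
importances = pd.Series(forest.feature_importances_, index=X.columns)

print(pd.DataFrame(results).T[["f1", "accuracy"]])
print(importances.sort_values(ascending=False))
```

Putting the scaler inside the pipeline is a design choice made for this sketch: it keeps test data and validation folds from leaking into the scaling statistics during grid search.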
To run the project:
- Clone the repository or download the code files.
- Make sure you have the required Python libraries installed.
- Place the dataset file (spotify_dataset.json) in the appropriate location (C:\Users\Mdmuz\OneDrive\Documents\ in this case).
- Run the code in your Python environment or IDE.
- The code will preprocess the data, train and evaluate the models, and display the results, including evaluation metrics and feature importance plots.
Note: The code includes file paths and URLs specific to the dataset and resources used in this project. You may need to modify these paths according to your local environment.
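One possible way to avoid editing hard-coded paths is to resolve the dataset location at runtime, as in the sketch below. The environment variable name `SPOTIFY_DATASET_PATH` and the default location under the user's `Documents` folder are assumptions made for this example and are not part of the project.

```python
import os
from pathlib import Path

# Hypothetical default: a Documents folder under the current user's home directory.
DEFAULT_DATASET = Path.home() / "Documents" / "spotify_dataset.json"


def resolve_dataset_path(env_var="SPOTIFY_DATASET_PATH"):
    """Return the dataset path from the environment, or the default if unset."""
    override = os.environ.get(env_var)
    return Path(override).expanduser() if override else DEFAULT_DATASET


if __name__ == "__main__":
    path = resolve_dataset_path()
    print(f"Dataset path: {path} (exists: {path.exists()})")
```

With this approach, setting the environment variable before launching Python is enough to point the code at a different copy of the dataset.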