When I first grabbed this dataset, it was a mess straight away when i opened it in excel. Between the encoding errors, the dates in the wrong format, and the durations that mixed "minutes" and "seasons" in the same column, I couldn't run a single clean query .
This project shows how I took that raw CSV and turned it into a clean, usable SQL database.
- The "Hidden" Delimiter: The file looked like a standard CSV, but it actually used semicolons (
;). I had to open first the file on notebook to see that semicolons where used to separate comlumns. The standard MySQL Import Wizard couldn't handle it properly, so I had to write a customLOAD DATA LOCAL INFILEscript to get the data in correctly. - Fixing Corrupted Text: I noticed titles and names were full of weird characters like
éor’. I spent time identifying these encoding artifacts and replacing them with the correct accents and apostrophes so the data actually looks professional. - The Duration Mess: You can't calculate the average length of a movie if the column also contains "3 Seasons". I split this into two separate numerical columns: one for minutes and one for seasons. Much better for analysis.
- Cleaning up Duplicates: I used
ROW_NUMBER()andPARTITION BYto find and remove duplicate entries based on title, type, and year.
- Cleanup: Stripped out useless spaces and turned empty strings into proper
NULLvalues. - Date Fixing: Converted those messy text dates into standard SQL
YYYY-MM-DDformats. - Data Integrity: Used
REGEXPto make sure IDs and years actually followed a logic. - Performance: I didn't leave everything as
TEXT. I converted columns toINT,VARCHAR, andDATEto make the database faster and lighter.
- Grab the
disney_plus_titles.csvand the.sqlscript. - Quick Note: You'll need to change the file path in the
LOAD DATAcommand to match where you saved the CSV on your computer. - Make sure
local_infileis turned on in your MySQL settings. - Also, to have a general view of the data, u can open it on excel first and clic on one colonne and go to data tab then clic on convert text to column, choose delimited by comas and clic on finish.
For the full strategic analysis (Audience insights, Top actors,...), please refer to the Disney_plus_EDA_analysis.sql file in this repository.
MySQL ; Data Cleaning ; ETL ; Problem Solving