Unlocking insights from real-world used car listings with end-to-end data science.
π Project Highlights
- Full pipeline: Data cleaning β EDA β Feature engineering β Statistical testing β Predictive modeling β Business insights
- Goal: Predict car prices and uncover what drives value in the used car market
- Tech: Python, Pandas, scikit-learn, XGBoost, imblearn, SHAP, Matplotlib, Seaborn
In this project, you'll see:
- Clean, structured code and interactive outputs
- Clear explanations with a business focus
- Ready-to-use analytics for dashboards or product teams
- ποΈ Step 1: Dataset & Table Structure
- π§Ή Step 2: Data Cleaning Pipeline
- π Step 3: Exploratory Data Analysis (EDA)
- ποΈ Step 4: Feature Engineering
- π Step 5: Statistical Analysis & Hypothesis Testing
- π Step 6: Correlation & Multicollinearity Check
- π€ Step 7: Predictive Modeling: Car Price Estimation
- π§ͺ Step 8: Model Diagnostics & Validation
- π Step 9: Actual vs. Predicted Price (Test Results)
- π Step 10: Business Insights & Recommendations
We'll start by loading the raw dataset and inspecting the first few rows to understand its basic structure.
import pandas as pd
df = pd.read_excel('used_cars.xlsx')
df.head()| brand | model | model_year | milage | fuel_type | engine | transmission | Color | accident | price |
|---|---|---|---|---|---|---|---|---|---|
| Ford | Utility Police Interceptor Base | 2013 | 51,000 mi. | E85 Flex Fuel | 300.0HP 3.7L V6 Cylinder Engine Flex Fuel Capab | 6-Speed A/T | Black | At least 1 accident or damage reported | 10300 |
| Hyundai | Palisade SEL | 2021 | 34,742 mi. | Gasoline | 3.8L V6 24V GDI DOHC | 8-Speed Automatic | Moonlight Cloud | At least 1 accident or damage reported | 38005 |
| Lexus | RX 350 RX 350 | 2022 | 22,372 mi. | Gasoline | 3.5 Liter DOHC | Automatic | Blue | None reported | 54598 |
| INFINITI | Q50 Hybrid Sport | 2015 | 88,900 mi. | Hybrid | 354.0HP 3.5L V6 Cylinder Engine Gas/Electric H. | 7-Speed A/T | Black | None reported | 15500 |
| Audi | Q3 45 S line Premium Plus | 2021 | 9,835 mi. | Gasoline | 2.0L I4 16V GDI DOHC Turbo | 8-Speed Automatic | Glacier White Metallic | None reported | 34999 |
Get a quick overview of the dataset's dimensions and column names.
print(f"Dataset shape: {df.shape}")
print("Columns:", df.columns.tolist())Check data types, non-null counts, and missing values for each column.
Key findings:
- The dataset has 4,009 rows and 10 columns.
- Most columns are text/object, except for
model_yearandprice(both integers). - Missing values:
fuel_type(170 missing),accident(113 missing). - No missing values in critical fields like
price,brand,model, ormilage.
Transform raw data into a structured, analysis-ready format.
Key tasks: Standardize all critical columns, enforce allowed categories, extract numerical fields, and drop incomplete or irrelevant rows.
Rename columns for consistency and clarity (lowercase, underscores, etc.).
# Strip, lowercase, and replace spaces with underscores in column names
df.columns = (
df.columns
.str.strip()
.str.lower()
.str.replace(" ", "_")
)
df.columns.tolist()- Convert milage to a numeric value (remove 'mi.', commas).
- Extract engine size as a float (in liters), store as
engine_size.
def extract_engine_size(s):
s = str(s).lower()
# Try to match numbers followed by L or liter (with or without a space)
match = re.search(r'(\d+\.\d+|\d+)\s*(l|liter)', s)
if match:
return float(match.group(1))
# Sometimes only numbers (rare): e.g. '2 V6'
match = re.search(r'(\d+\.\d+|\d+)', s)
return float(match.group(1)) if match else None
df['engine_size'] = df['engine'].apply(extract_engine_size)| milage | engine | engine_size |
|---|---|---|
| 51000.0 | 300.0HP 3.7L V6 Cylinder Engine Flex Fuel Capa... | 3.7 |
| 34742.0 | 3.8L V6 24V GDI DOHC | 3.8 |
| 22372.0 | 3.5 Liter DOHC | 3.5 |
| 88900.0 | 354.0HP 3.5L V6 Cylinder Engine Gas/Electric H... | 3.5 |
| 9835.0 | 2.0L I4 16V GDI DOHC Turbo | 2.0 |
Set transmission as only "Automatic" or "Manual".
Map "A/T" and "auto" to "Automatic"; "M/T" and "man" to "Manual". Remove all other values.
def clean_transmission(val):
val = str(val).lower()
if "a/t" in val or "auto" in val:
return "Automatic"
if "m/t" in val or "man" in val:
return "Manual"
return None
df['transmission'] = df['transmission'].apply(clean_transmission)
df = df[df['transmission'].isin(['Automatic', 'Manual'])]Only allow popular car colors:
["black", "white", "grey", "gray", "silver", "red", "blue", "brown", "green", "beige", "yellow", "gold", "orange"]
Remove rows with any other color.
popular_colors = [
"black", "white", "grey", "gray", "silver", "red", "blue",
"brown", "green", "beige", "yellow", "gold", "orange"
]
df['color'] = df['color'].str.lower().str.strip()
df = df[df['color'].isin(popular_colors)]Only allow "Diesel", "Gasoline", or "Hybrid" (case insensitive).
Remove rows with any other or missing fuel type.
allowed_fuel = ['diesel', 'gasoline', 'hybrid']
df['fuel_type'] = df['fuel_type'].str.lower().str.strip()
df = df[df['fuel_type'].isin(allowed_fuel)]
df['fuel_type'] = df['fuel_type'].str.capitalize() # e.g., Diesel, Gasoline, HybridSet accident to 1 if "At least 1 accident" or contains "accident", 0 if "No accident or damage", else drop.
def accident_binary(val):
val = str(val).lower()
if "no accident" in val:
return 0
if "accident" in val:
return 1
return None
df['accident'] = df['accident'].apply(accident_binary)
df = df[df['accident'].isin([0,1])]Drop any row with missing values in any column used for modeling.
df = df.dropna(subset=[
'brand', 'model', 'model_year', 'milage', 'fuel_type',
'engine_size', 'transmission', 'color', 'accident', 'price'
])
df = df.reset_index(drop=True)
print("Dataset shape after cleaning:", df.shape)Dataset shape after cleaning: (719, 11)
df.to_excel('used_cars_cleaned.xlsx', index=False)Objective:
Understand the main patterns and distributions in the cleaned used car dataset.
Visualize key variables and spot potential outliers or trends to guide feature engineering and modeling.
# Summary stats for numeric columns
df.describe().T| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| model_year | 719.0 | 2012.83 | 5.71 | 1996.0 | 2008.0 | 2014.0 | 2017.0 | 2023.0 |
| milage | 719.0 | 97728.93 | 56421.47 | 500.0 | 54950.0 | 88977.0 | 134000.0 | 399000.0 |
| accident | 719.0 | 1.00 | 0.00 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| price | 719.0 | 24832.70 | 19183.27 | 2300.0 | 11000.0 | 19000.0 | 32225.0 | 127899.0 |
| engine_size | 719.0 | 3.74 | 1.43 | 1.3 | 2.5 | 3.5 | 4.7 | 8.3 |
Letβs visualize the distribution of used car prices to spot skew, outliers, and median value.
import matplotlib.pyplot as plt
plt.figure(figsize=(8,4))
plt.hist(df['price'], bins=50, edgecolor='k', alpha=0.7)
plt.title("Distribution of Used Car Prices")
plt.xlabel("Price")
plt.ylabel("Frequency")
plt.grid(axis='y')
plt.show()See which brands are most common in the dataset.
brand_counts = df['brand'].value_counts().head(15)
plt.figure(figsize=(10,4))
brand_counts.plot(kind='bar', color='skyblue', edgecolor='k')
plt.title("Top 15 Brands in Used Car Listings")
plt.xlabel("Brand")
plt.ylabel("Number of Listings")
plt.xticks(rotation=45)
plt.show()Boxplots to show price variation across transmission types and fuel types.
import seaborn as sns
plt.figure(figsize=(8,5))
sns.boxplot(data=df, x='transmission', y='price')
plt.title("Price Distribution by Transmission Type")
plt.xlabel("Transmission")
plt.ylabel("Price")
plt.show()
plt.figure(figsize=(8,5))
sns.boxplot(data=df, x='fuel_type', y='price')
plt.title("Price Distribution by Fuel Type")
plt.xlabel("Fuel Type")
plt.ylabel("Price")
plt.show()Visualize the relationship between price and milage to check for negative correlation or outliers.
plt.figure(figsize=(8,5))
plt.scatter(df['milage'], df['price'], alpha=0.4)
plt.title("Price vs. Milage")
plt.xlabel("Milage")
plt.ylabel("Price")
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()- Car prices are right-skewed, mostly under $25,000.
- Mercedes, Ford, and BMW are the most listed brands.
- Automatics are more common and show higher price outliers.
- Higher milage usually means lower price, but some exceptions exist.
These patterns set the stage for effective feature engineering and predictive modeling.
Objective:
Create new variables and transform features to improve the performance of predictive models.
Helps reduce skew and makes linear models more effective.
import numpy as np
df['log_price'] = np.log1p(df['price'])
df['log_milage'] = np.log1p(df['milage'])Convert brand, fuel type, transmission, and color to one-hot encoded columns for modeling.
categorical_cols = ['brand', 'fuel_type', 'transmission', 'color']
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)Add squared milage or interaction terms if needed for model flexibility.
df_encoded['milage_sq'] = df['milage'] ** 2Now the dataset includes log-transformed targets and encoded features, ready for correlation analysis and machine learning.
Objective:
Check key statistical relationships, spot potential predictors, and validate assumptions for modeling.
See how numeric features (price, milage, engine size, year) relate to each other.
import seaborn as sns
import matplotlib.pyplot as plt
numeric_cols = ['price', 'milage', 'engine_size', 'model_year', 'accident']
corr = df[numeric_cols].corr()
plt.figure(figsize=(6, 5))
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix")
plt.show()Test if average price differs significantly between Automatic and Manual cars.
from scipy.stats import ttest_ind
auto_prices = df[df['transmission'] == 'Automatic']['price']
man_prices = df[df['transmission'] == 'Manual']['price']
t_stat, p_val = ttest_ind(auto_prices, man_prices, equal_var=False)
print(f"T-statistic: {t_stat:.2f}, p-value: {p_val:.4f}")T-statistic: 2.37, p-value: 0.0196
-
Correlation:
Price is strongly negatively correlated with milage (β0.57), and positively correlated with model year (0.57) and engine size (0.32). This means newer cars with lower milage and larger engines tend to have higher prices. -
Transmission Effect:
The t-test (T=2.37, p=0.0196) shows a statistically significant difference in average price between automatic and manual carsβautomatics tend to sell for more.
These results confirm that milage, model year, engine size, and transmission type are all important features for car price prediction.
Objective:
Ensure predictor variables are not too highly correlated, which can hurt model performance and interpretation.
Focus on input variables only (exclude price).
feature_cols = ['milage', 'engine_size', 'model_year', 'accident']
corr_features = df[feature_cols].corr()
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(5,4))
sns.heatmap(corr_features, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Feature Correlation Matrix")
plt.show()High VIF (>5 or 10) means multicollinearityβconsider removing or combining those features.
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Prepare input data for VIF (drop NA)
X_vif = df[feature_cols].dropna()
vif_data = pd.DataFrame()
vif_data["feature"] = X_vif.columns
vif_data["VIF"] = [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]
print(vif_data) feature VIF
0 milage 1.566861
1 engine_size 1.008925
2 model_year 1.569887
3 accident 196610.468949
- Most features show low correlation with each other (all |corr| < 0.6), which is ideal for modeling.
- All features except βaccidentβ have low VIF values (<2), confirming minimal multicollinearity.
- The βaccidentβ variable has an extremely high VIF, suggesting it may be redundant or constant in most rows and should not be used in modeling
Overall, milage, engine size, and model year are safe to use for predictive modeling.
model_features = [col for col in df_encoded.columns if col not in ['price', 'log_price', 'accident']]Objective:
Build, tune, and evaluate regression models to predict used car prices from the cleaned dataset.
Compare several algorithms and report performance.
Only numeric columns (including one-hot encoded dummies) are used as predictors. All object/string columns (like 'model', 'engine') are excluded. Split the data into training and test sets (e.g., 80/20 split).
# Select numeric columns and drop target/irrelevant columns
X = df_encoded.select_dtypes(include=[np.number]).drop(columns=['price', 'log_price', 'accident'])
y = df_encoded['log_price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print("Train size:", X_train.shape, "Test size:", X_test.shape)Train size: (575, 5) Test size: (144, 5)
Fit a simple linear regression and evaluate as a baseline.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
mse_lr = mean_squared_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)
print(f"Linear Regression β RMSE: {mse_lr**0.5:.2f}, R2: {r2_lr:.3f}")Linear Regression β RMSE: 0.43, R2: 0.643
Fit a Random Forest model for potentially better non-linear performance.
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)
print(f"Random Forest β RMSE: {mse_rf**0.5:.2f}, R2: {r2_rf:.3f}")Random Forest β RMSE: 0.44, R2: 0.629
Try XGBoost for high-performance gradient boosting
from xgboost import XGBRegressor
xgb = XGBRegressor(n_estimators=100, random_state=42, verbosity=0)
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost β RMSE: {mse_xgb**0.5:.2f}, R2: {r2_xgb:.3f}")XGBoost β RMSE: 0.47, R2: 0.587
Summarize and compare all models side by side.
results = pd.DataFrame({
'Model': ['Linear Regression', 'Random Forest', 'XGBoost'],
'RMSE': [mse_lr**0.5, mse_rf**0.5, mse_xgb**0.5],
'R2': [r2_lr, r2_rf, r2_xgb]
})
print(results) Model RMSE R2
0 Linear Regression 0.433046 0.643493
1 Random Forest 0.441853 0.628845
2 XGBoost 0.466035 0.587108
- Linear Regression delivered the best overall performance, with an RMSE of 0.43 and RΒ² of 0.64.
- Random Forest and XGBoost performed slightly worse, with lower RΒ² and higher RMSE.
- All models explain a substantial portion of price variance, but additional feature engineering or tuning may further improve results.
Linear Regression is selected as the top model for interpretation and business recommendations in this analysis.
Objective:
Ensure the selected model (Linear Regression) is robust, unbiased, and provides reliable predictions.
Plot residuals to check if errors are randomly distributed (no pattern = good).
import matplotlib.pyplot as plt
residuals = y_test - y_pred_lr
plt.figure(figsize=(7,4))
plt.scatter(y_pred_lr, residuals, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel("Predicted Log Price")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted (Linear Regression)")
plt.show()Check if residuals are approximately normal (as expected for good linear models).
import scipy.stats as stats
import numpy as np
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q-Q Plot of Residuals")
plt.show()Show which variables have the biggest impact in the linear model.
| Feature | Coefficient |
|---|---|
| log_milage | 0.306365 |
| engine_size | 0.020033 |
| model_year | 0.056362 |
| milage_sq | 0.000000034 |
| milage | -0.0000177 |
| -------------- | ---------------: |
| milage | -0.0000177 |
| milage_sq | 0.000000034 |
| model_year | 0.056362 |
| engine_size | 0.020033 |
| log_milage | 0.306365 |
-
Residual Analysis:
The residuals vs. predicted plot shows a random scatter around zero, suggesting the modelβs errors are evenly distributed and not biased. -
Normality Check:
The Q-Q plot indicates the residuals are approximately normally distributed, supporting the assumptions of linear regression. -
Feature Impact:
The most influential features arelog_milage,model_year, andengine_size. Higher log milage and newer model years tend to increase predicted price, while raw milage has a very small negative_
Objective: Quantify each featureβs contribution to the prediction and visualise global/individual effects for our best model (Linear Regression).
π‘ Note: The target is log_price. SHAP values here are on the log-price scale. Use np.expm1(...) if you need to convert predictions back to the original price scale for narrative.
# --- 8.4 SHAP Explainability β Linear Regression ---
%pip install shap --user
import shap
# Build explainer on training data and compute SHAP values for test data
explainer = shap.Explainer(lr, X_train)
shap_values = explainer(X_test)
# 1) Global importance: which features matter most overall
shap.plots.bar(shap_values, max_display=15)
# 2) Global summary: direction and spread of effects across samples
shap.plots.beeswarm(shap_values, max_display=15)
# 3) Local explanation: one specific car (change idx to inspect others)
idx = 0
shap.plots.waterfall(shap_values[idx])SHAP analysis highlights that mileage is the dominant driver of used car prices, with higher mileage sharply reducing value. Model year and engine size also contribute, with newer cars and larger engines generally increasing price. The summary plot confirms that low mileage and recent models push prices up, while the force plot illustrates how for a 2018 vehicle, the positive effect of its recency is offset by the negative impact of high mileage.
Objective:
This section visualizes and summarizes how closely the model's predictions match real test data.
import matplotlib.pyplot as plt
plt.figure(figsize=(6,6))
plt.scatter(np.expm1(y_test), np.expm1(y_pred_lr), alpha=0.5)
plt.plot([np.expm1(y_test).min(), np.expm1(y_test).max()],
[np.expm1(y_test).min(), np.expm1(y_test).max()], 'r--')
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title("Actual vs. Predicted Price (Linear Regression)")
plt.show()This plot shows the accuracy of the model on new data: points close to the red line represent accurate predictions, while larger deviations highlight residual error.
Key Insights:
- Milage, model year, and engine size are the main drivers of used car prices.
- Lower milage and newer cars command higher prices, as expected.
- Linear Regression was the best-performing model, offering both accuracy and interpretability.
Recommendations:
-
Pricing strategy:
- Focus acquisition and marketing on low-milage, newer vehicles to maximize resale value.
- Use the modelβs predicted prices as a benchmark for negotiating purchases and setting sales prices.
-
Further improvements:
- Collect more data on condition, trim, or optional features for even better price prediction.
- Consider more advanced modeling (e.g., feature selection, hyperparameter tuning) for incremental gains.
-
Model deployment:
- This workflow can be easily integrated into pricing tools, sales dashboards, or automated business processes.
These actionable insights help drive smarter buying, selling, and inventory management decisions in the used car market.












