This project focuses on predicting electricity consumption (Usage_kWh) in the steel industry using historical operational data. We explore both linear regression and Random Forest regression to handle skewed data and capture nonlinear relationships in the dataset.
- Source: `steel_industry_data.csv`
- Key columns:
  - `Usage_kWh`: Electricity usage (target)
  - `Lagging_Current_Reactive.Power_kVarh`
  - `Leading_Current_Reactive_Power_kVarh`
  - `Lagging_Current_Power_Factor`
  - `Leading_Current_Power_Factor`
  - `NSM`: Number of seconds from midnight
  - `CO2(tCO2)`: CO2 emissions
- Exploratory Data Analysis (EDA):
  - Checked distributions and skewness.
  - Visualized histograms and boxplots for key variables.
- Data Preprocessing:
  - Log transformation applied to reduce right-skew in `Usage_kWh`.
- Modeling:
  - Linear Regression with and without log-transformed target.
  - Random Forest Regression with hyperparameter tuning (`n_estimators`, `n_jobs=-1` for parallelization).
- Evaluation:
  - Metrics: RMSE, MSE, R².
  - Residual analysis and distribution comparison.
- Visualization:
  - Predicted vs actual plots, residual plots, log-transform effects.
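
The workflow above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data is a synthetic stand-in for `steel_industry_data.csv` (three of its columns, with made-up values), and the use of `log1p`, `GridSearchCV`, the grid values, and the helper `evaluate` are all assumptions, since the README does not say how the transform or tuning was implemented.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# ASSUMPTION: synthetic stand-in for steel_industry_data.csv (three of the
# real column names, made-up values) so the sketch runs on its own.
rng = np.random.default_rng(0)
n = 1500
X = pd.DataFrame({
    "Lagging_Current_Reactive.Power_kVarh": rng.gamma(2.0, 5.0, n),
    "Lagging_Current_Power_Factor": rng.uniform(40.0, 100.0, n),
    "NSM": rng.integers(0, 96, n) * 900,
})
# Right-skewed target with a nonlinear dependence on the features.
y = pd.Series(
    np.exp(0.08 * X["Lagging_Current_Reactive.Power_kVarh"] + rng.normal(0.0, 0.2, n))
    * (1.0 + X["Lagging_Current_Power_Factor"] / 100.0),
    name="Usage_kWh",
)

# EDA: compare skewness before and after the log transform.
skew_raw = y.skew()
skew_log = np.log1p(y).skew()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)


def evaluate(y_true, y_pred):
    """Return MSE, RMSE and R² computed on the original kWh scale."""
    mse = mean_squared_error(y_true, y_pred)
    return {"MSE": mse, "RMSE": float(np.sqrt(mse)), "R2": r2_score(y_true, y_pred)}


# Linear regression without and with a log-transformed target.
lin_raw = LinearRegression().fit(X_train, y_train)
lin_log = LinearRegression().fit(X_train, np.log1p(y_train))
pred_lin_raw = lin_raw.predict(X_test)
pred_lin_log = np.expm1(lin_log.predict(X_test))  # back-transform to kWh

# Random forest on the untransformed target; a small cross-validated grid
# over n_estimators stands in for the tuning step (grid values are assumed).
search = GridSearchCV(
    RandomForestRegressor(random_state=0, n_jobs=-1),
    param_grid={"n_estimators": [50, 100]},
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X_train, y_train)
pred_rf = search.best_estimator_.predict(X_test)

results = pd.DataFrame({
    "linear (raw target)": evaluate(y_test, pred_lin_raw),
    "linear (log target)": evaluate(y_test, pred_lin_log),
    "random forest": evaluate(y_test, pred_rf),
}).T
print(f"skewness raw={skew_raw:.2f}, log1p={skew_log:.2f}")
print(results)
```

All three models are scored on the original kWh scale, so the log-target predictions are back-transformed with `expm1` before computing metrics. Numbers printed for the synthetic data say nothing about the results on the real dataset.
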
- Linear Regression (log-transformed target): Captures the central tendency but underestimates high usage values.
- Random Forest Regression:
  - Captures nonlinearity and extreme values better.
  - RMSE is significantly lower than for linear regression, and the distribution of predictions closely matches the true data.
  - Handles skewed data without transformation.
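
One way to check a claim like "underestimates high usage values" is to compare the mean residual over all rows with the mean residual over the highest-usage rows. The sketch below is illustrative only: `residual_summary` is a hypothetical helper, and the arrays are made-up numbers chosen so that predictions are pulled toward the mean, not values from the project.

```python
import numpy as np


def residual_summary(y_true, y_pred, q=0.9):
    """Mean residual (actual - predicted) overall and for the highest-usage rows.

    A positive mean residual in the high-usage group means the model
    under-predicts those rows on average.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    high = y_true >= np.quantile(y_true, q)  # rows at or above the q-th quantile
    return {
        "mean_residual": float(residuals.mean()),
        "mean_residual_high_usage": float(residuals[high].mean()),
        "n_high_usage": int(high.sum()),
    }


# Made-up example: predictions shrunk halfway toward the mean of the actuals.
actual = np.array([10.0, 20.0, 30.0, 40.0, 200.0])
predicted = 0.5 * actual + 0.5 * actual.mean()
summary = residual_summary(actual, predicted, q=0.8)
print(summary)
```

In this toy case the overall mean residual is zero while the high-usage residual is strongly positive, which is the pattern described above for the log-target linear model. The same helper can be applied to any model's test-set predictions.
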
- Python, pandas, numpy, matplotlib, scikit-learn.
- Handling skewed distributions and feature analysis.
- Model comparison and evaluation metrics.
- Residual analysis and visualization.
- Feature importance interpretation using Random Forest.
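
Feature importance interpretation with a Random Forest might look like the following minimal sketch. The data is synthetic (real column names, made-up values, and a target that by construction depends on only one feature), and the forest settings are assumptions rather than the project's configuration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# ASSUMPTION: synthetic data; only the first column actually drives the target.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "Lagging_Current_Reactive.Power_kVarh": rng.uniform(0.0, 10.0, n),
    "Leading_Current_Power_Factor": rng.uniform(0.0, 100.0, n),
    "NSM": rng.integers(0, 96, n) * 900,
})
y = 5.0 * X["Lagging_Current_Reactive.Power_kVarh"] + rng.normal(0.0, 0.1, n)

forest = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=0)
forest.fit(X, y)

# Impurity-based importances, one per feature, normalized to sum to 1.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so `sklearn.inspection.permutation_importance` on held-out data is a common cross-check.
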