Regression Trees and Ensemble Methods — MSE and Stability

A comparative study of three tree-based learning methods — single regression trees, Bagging, and Random Forests — focusing on predictive accuracy (mean squared error) and stability under data perturbation. The analysis combines a controlled simulation on the Friedman benchmark with a real-world application to apartment construction costs in Tehran (UCI Machine Learning Repository).



Full project report

The complete report — including all derivations, theorems, proofs, and figures — is available as a PDF (in French): docs/ProjetMAT8886.pdf

This README provides an English summary of the methodology and key findings.


Academic context

This project was carried out in April 2023 as part of MAT8886 — Sujets en apprentissage automatique (Topics in Machine Learning), a graduate course at Université du Québec à Montréal (UQAM). The objective was to investigate, both theoretically and empirically, the conditions under which ensemble methods improve over single regression trees, with a particular focus on stability as a complement to traditional predictive-error metrics.

The work draws on the foundational literature of tree-based methods, including Breiman et al. (1984) for CART, Breiman (1996) for Bagging, Breiman (2001) for Random Forests, and the broader statistical-learning framework of Shalev-Shwartz & Ben-David (2014).


Methods studied

Regression trees

  • Principle. Recursive partitioning of the predictor space into homogeneous regions, with leaf-level predictions given by the mean of the training response.
  • Strengths. Interpretability and ability to capture non-linear interactions without explicit specification.
  • Limitation. High instability — small perturbations of the training data can produce substantially different tree structures, which propagates into prediction variance. Pruning via the complexity parameter mitigates overfitting but does not solve instability.
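As an illustrative sketch (not the project's own script), a regression tree can be grown and pruned with rpart; the complexity parameter cp governs pruning, chosen here at the minimum of the cross-validated error in the cp table:

```r
library(rpart)

# Toy data: one noisy non-linear predictor (illustrative only)
set.seed(1)
df <- data.frame(x = runif(200))
df$y <- sin(2 * pi * df$x) + rnorm(200, sd = 0.2)

# Grow a regression tree, then prune at the cp value that
# minimizes the cross-validated error ("xerror") in the cp table
fit     <- rpart(y ~ x, data = df, method = "anova")
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```

Refitting this on a slightly perturbed sample typically yields a visibly different tree structure, which is exactly the instability the study quantifies.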

Bagging (Bootstrap Aggregating)

  • Goal. Reduce the variance of tree-based predictions by averaging across multiple trees fitted on bootstrap resamples of the training data.
  • Mechanism. B bootstrap samples → B trees fitted independently → predictions averaged.
  • Theoretical guarantee. As shown in the report, the variance of the bagged predictor decreases when the individual tree estimators are weakly correlated, which motivates the next refinement.
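The mechanism above can be sketched as a hand-rolled bagging loop (for illustration; the project itself relies on the ipred package):

```r
library(rpart)

# Bagging sketch: fit B trees on bootstrap resamples, average predictions
bagged_predict <- function(train, test, B = 100) {
  preds <- replicate(B, {
    idx <- sample(nrow(train), replace = TRUE)        # bootstrap resample
    fit <- rpart(y ~ ., data = train[idx, ], method = "anova")
    predict(fit, newdata = test)
  })
  rowMeans(preds)                                     # aggregate the B trees
}
```

The variance reduction comes entirely from the averaging step: the more weakly correlated the B trees, the closer the averaged variance gets to 1/B of a single tree's variance.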

Random Forests

  • Refinement of Bagging. At each node split, only a random subset of m predictors is considered, decorrelating the individual trees and further reducing variance.
  • Hyperparameter selection. The value of m is selected by k-fold cross-validation in this study, in line with the recommendation m ≈ log(p) of Breiman (2001).
  • Effect. Improved predictive accuracy, robustness to outliers and missing values, and effective handling of high-dimensional settings.
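A minimal sketch of selecting m (the mtry argument) by k-fold cross-validation with the randomForest package; the helper name cv_mtry, the grid, and ntree = 100 are illustrative choices, not the report's exact settings:

```r
library(randomForest)

# Pick mtry by k-fold CV: for each candidate m, average the held-out MSE
cv_mtry <- function(train, m_grid = 1:5, k = 5) {
  folds <- sample(rep(1:k, length.out = nrow(train)))
  cv_mse <- sapply(m_grid, function(m) {
    mean(sapply(1:k, function(f) {
      fit <- randomForest(y ~ ., data = train[folds != f, ],
                          mtry = m, ntree = 100)
      mean((predict(fit, train[folds == f, ]) - train$y[folds == f])^2)
    }))
  })
  m_grid[which.min(cv_mse)]                 # m with lowest CV error
}
```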

Methodology

Simulation protocol (Friedman dataset)

The Friedman (1991) benchmark generates the response as

y = 10 sin(π x₁ x₂) + 20 (x₃ − 0.5)² + 10 x₄ + 5 x₅ + ε

with x₁,…,x₁₀ ∼ Uniform(0, 1) and ε ∼ 𝓝(0, 1). Only the first five predictors carry signal; the remaining five are noise variables included to test the methods' ability to ignore them.

Each replication uses n = 1 200 observations split into 200 for training and 1 000 for testing. The procedure is repeated J = 1 000 times to obtain Monte-Carlo estimates of the mean squared error (MSE) and its variance. Bootstrap aggregation uses B = 100 resamples for both Bagging and Random Forest.
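Under these settings, one replication of the Friedman design can be generated as follows (a base-R sketch; the mlbench package's mlbench.friedman1 offers an equivalent generator):

```r
# One Friedman (1991) replication: 10 Uniform(0,1) predictors,
# only the first five carry signal, Gaussian noise on the response
make_friedman <- function(n = 1200, p = 10, sd = 1) {
  X <- matrix(runif(n * p), n, p)
  y <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
       10 * X[, 4] + 5 * X[, 5] + rnorm(n, sd = sd)
  data.frame(y = y, X)
}

set.seed(2023)
dat   <- make_friedman()
train <- dat[1:200, ]      # 200 training observations
test  <- dat[201:1200, ]   # 1 000 test observations
```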

Stability protocol

Stability is quantified by replacing one randomly selected observation in the training sample with a synthetic one, refitting the model, and measuring the cumulative change in predictions across the dataset. Lower values indicate higher stability.
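One perturbation step can be sketched as below, assuming the synthetic replacement is drawn by resampling an existing training row and the cumulative change is a sum of squared prediction differences; both are assumptions for illustration, and the report gives the exact definition:

```r
# Illustrative stability measure: replace one training row, refit,
# and sum the squared change in predictions over the test set
stability_once <- function(fit_fun, train, test) {
  base_pred <- predict(fit_fun(train), test)
  i <- sample(nrow(train), 1)                    # observation to replace
  train[i, ] <- train[sample(nrow(train), 1), ]  # synthetic substitute (assumption)
  sum((predict(fit_fun(train), test) - base_pred)^2)
}
```

Here fit_fun can be any of the three fitting routines, e.g. function(d) rpart::rpart(y ~ ., data = d), so the same protocol applies to trees, Bagging, and Random Forests.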

Real-world application

The methodology is then applied to the Tehran residential-building dataset (Rafiei, 2018; UCI Machine Learning Repository), with 27 economic and financial explanatory variables predicting apartment construction costs, to verify that simulation findings transfer to applied settings.


Key results

Friedman simulation — predictive accuracy

Model             MSE      Var(MSE)
Regression Tree   7.786    0.407
Bagging           5.154    0.192
Random Forest     3.674    0.107

Random Forest reduces the MSE by approximately 53 % compared to a single regression tree, and exhibits roughly four times less variability in its MSE estimates across replications.

Friedman simulation — stability

Model             Stability measure
Regression Tree   4 422.368
Bagging           1 021.615
Random Forest       225.311

Random Forest is approximately 20× more stable than a single regression tree under the perturbation protocol, and 4.5× more stable than Bagging — illustrating the additional benefit of feature-level randomization on top of bootstrap resampling.

Random Forest hyperparameter

The optimal number of candidate variables per split, selected by cross-validation across the 1 000 replications, was most often m = 3 (selected in approximately 62 % of replications, with m = 4 in 17 %, m = 2 in 16 %, and m = 5 in 5 %).

Tehran real-world application

Model             MSE         Stability
Regression Tree   2 878.508   54 248 767
Bagging           2 586.330    7 610 564
Random Forest     1 733.553    4 882 612

The pattern observed in simulation is confirmed on real data: Random Forest produces both the lowest prediction error and the most stable predictions, with Bagging providing an intermediate but still substantial improvement over a single tree.


Conclusion

Single regression trees, despite their interpretability, suffer from inherent instability that limits their reliability in applied settings. Bagging and especially Random Forests provide robust solutions that simultaneously improve predictive accuracy and stability — both in controlled simulation and on real-world economic data. The choice between methods ultimately depends on the trade-off between interpretability and performance: a single tree remains useful as an exploratory or communication tool, while ensemble methods are preferable whenever predictive reliability is the priority.


Technology stack

Component       Technology
Core language   R
Modeling        rpart, randomForest, ipred
Data handling   tidyverse (dplyr, tidyr, readr)
Visualization   ggplot2
Reporting       LaTeX (PDF report)

Repository structure

regression-trees-ensemble-methods/
├── data/         # Input datasets (Friedman simulation and Tehran real data)
├── docs/         # Project report (PDF, in French)
├── figures/      # Generated plots used in the report
├── results/      # Saved numerical outputs (MSE, stability scores)
├── src/          # R source code
│   └── Code_projet.R    # Main analysis script
└── README.md

Reproduction

1. Clone the repository:

git clone https://github.com/komiayi/regression-trees-ensemble-methods.git
cd regression-trees-ensemble-methods

2. Install the required R packages (in an R or RStudio session):

install.packages(c("rpart", "randomForest", "tidyverse", "ipred"))

3. Run the analysis. Three options depending on your workflow:

  • From an interactive R console:

    setwd("src/")
    source("Code_projet.R")
  • From the terminal using Rscript:

    Rscript src/Code_projet.R
  • From RStudio: open src/Code_projet.R and run it interactively.

Note. The full simulation involves 1 000 Monte-Carlo replications and may take significant time depending on hardware.


References

  • Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140.
  • Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
  • Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees.
  • Friedman, J. H. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19(1), 1–67.
  • Rafiei, M. (2018). Residential Building Data Set. UCI Machine Learning Repository.
  • Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
  • Vapnik, V. (1999). The Nature of Statistical Learning Theory. Springer.

License

Distributed under the MIT License. See LICENSE for full terms.


Author

Komi Roger Ayi, Biostatistician and Data Scientist, Université du Québec à Montréal (Montréal, Québec, Canada)

Portfolio · LinkedIn · GitHub
