A comparative study of three tree-based learning methods — single regression trees, Bagging, and Random Forests — focusing on predictive accuracy (mean squared error) and stability under data perturbation. The analysis combines a controlled simulation on the Friedman benchmark with a real-world application to apartment construction costs in Tehran, using data from the UCI Machine Learning Repository.
The complete report — including all derivations, theorems, proofs, and figures — is available as a PDF (in French):
`docs/ProjetMAT8886.pdf`
This README provides an English summary of the methodology and key findings.
This project was carried out in April 2023 as part of MAT8886 — Sujets en apprentissage automatique (Topics in Machine Learning), a graduate course at Université du Québec à Montréal (UQAM). The objective was to investigate, both theoretically and empirically, the conditions under which ensemble methods improve over single regression trees, with a particular focus on stability as a complement to traditional predictive-error metrics.
The work draws on the foundational literature of tree-based methods, including Breiman et al. (1984) for CART, Breiman (1996) for Bagging, Breiman (2001) for Random Forests, and the broader statistical-learning framework of Shalev-Shwartz & Ben-David (2014).
- Principle. Recursive partitioning of the predictor space into homogeneous regions, with each leaf predicting the mean of the training responses it contains.
- Strengths. Interpretability and ability to capture non-linear interactions without explicit specification.
- Limitation. High instability — small perturbations of the training data can produce substantially different tree structures, which propagates into prediction variance. Pruning via the complexity parameter mitigates overfitting but does not solve instability.
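To make the pruning step concrete, here is a minimal R sketch (not taken from the report's `Code_projet.R`; the toy data are a placeholder) that grows a deep `rpart` tree and prunes it back at the cross-validated complexity parameter:

```r
library(rpart)

set.seed(1)
# Toy training sample standing in for real data
train <- data.frame(x1 = runif(200), x2 = runif(200))
train$y <- sin(2 * pi * train$x1) + train$x2 + rnorm(200, sd = 0.2)

# Grow a deliberately deep tree, then prune at the complexity
# parameter (cp) that minimizes the cross-validated error
tree    <- rpart(y ~ ., data = train, method = "anova",
                 control = rpart.control(cp = 0.001))
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree, cp = best_cp)
```

Pruning controls tree size, but as noted above it does not remove the sensitivity of the tree *structure* to the training sample, which is what the ensemble methods below address.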
- Goal. Reduce the variance of tree-based predictions by averaging across multiple trees fitted on bootstrap resamples of the training data.
- Mechanism. B bootstrap samples → B trees fitted independently → predictions averaged.
- Theoretical guarantee. As shown in the report, the variance of the bagged predictor decreases when the individual tree estimators are weakly correlated, which motivates the next refinement.
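The mechanism fits in a few lines of R. This is an illustrative sketch on toy data, not the report's code; in practice the packaged implementation `ipred::bagging()` does the same thing:

```r
library(rpart)

set.seed(1)
# Toy data: 200 training and 50 test observations
train <- data.frame(x1 = runif(200), x2 = runif(200))
train$y <- sin(2 * pi * train$x1) + train$x2 + rnorm(200, sd = 0.2)
test <- data.frame(x1 = runif(50), x2 = runif(50))

B <- 100
# B bootstrap samples -> B independently fitted trees -> averaged predictions
preds <- sapply(seq_len(B), function(b) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  fit  <- rpart(y ~ ., data = boot, method = "anova")
  predict(fit, newdata = test)
})
bagged_pred <- rowMeans(preds)
```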
- Refinement of Bagging. At each node split, only a random subset of m predictors is considered, decorrelating the individual trees and further reducing variance.
- Hyperparameter selection. The value of m is selected by k-fold cross-validation in this study, in line with the recommendation m ≈ log(p) of Breiman (2001); see the cross-validation sketch after this list.
- Effect. Improved predictive accuracy, robustness to outliers and missing values, and effective handling of high-dimensional settings.
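A minimal sketch of the k-fold selection of m (`mtry` in the `randomForest` package); the fold count, candidate grid, and toy data are illustrative assumptions, not the report's exact settings:

```r
library(randomForest)

set.seed(1)
# Toy training sample with p = 10 predictors
x <- matrix(runif(200 * 10), ncol = 10)
y <- x[, 1] + 2 * x[, 2]^2 + rnorm(200, sd = 0.3)
train <- data.frame(y = y, x)

k     <- 5
folds <- sample(rep(1:k, length.out = nrow(train)))
grid  <- 2:5  # candidate values of m

# Mean validation MSE over the k folds, for each candidate m
cv_mse <- sapply(grid, function(m) {
  mean(sapply(1:k, function(f) {
    fit <- randomForest(y ~ ., data = train[folds != f, ],
                        mtry = m, ntree = 100)
    mean((train$y[folds == f] - predict(fit, train[folds == f, ]))^2)
  }))
})
best_m <- grid[which.min(cv_mse)]
```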
The Friedman (1991) benchmark generates the response as
y = 10 sin(πx₁x₂) + 20(x₃ − 0.5)² + 10x₄ + 5x₅ + ε
with x₁,…,x₁₀ ∼ Uniform(0, 1) and ε ∼ 𝓝(0, 1). Only the first five predictors carry signal; the remaining five are noise variables included to test the methods' ability to ignore them.
Each replication uses n = 1 200 observations split into 200 for training and 1 000 for testing. The procedure is repeated J = 1 000 times to obtain Monte-Carlo estimates of the mean squared error (MSE) and its variance. Bootstrap aggregation uses B = 100 resamples for both Bagging and Random Forest.
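For concreteness, a single replication of this design might look as follows in R — a sketch consistent with the protocol above, not the report's `Code_projet.R` itself. Repeating it J = 1 000 times and averaging yields the Monte-Carlo estimates reported below.

```r
library(rpart)
library(randomForest)
library(ipred)

set.seed(42)

# Friedman (1991) data generator: 5 signal + 5 noise predictors
friedman <- function(n) {
  x <- matrix(runif(n * 10), ncol = 10)
  y <- 10 * sin(pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - 0.5)^2 +
       10 * x[, 4] + 5 * x[, 5] + rnorm(n)
  data.frame(y = y, x)
}

# One replication: 200 training / 1 000 test observations
dat   <- friedman(1200)
train <- dat[1:200, ]
test  <- dat[201:1200, ]

mse <- function(pred) mean((test$y - pred)^2)

fit_tree <- rpart(y ~ ., data = train)
fit_bag  <- bagging(y ~ ., data = train, nbagg = 100)            # B = 100 resamples
fit_rf   <- randomForest(y ~ ., data = train, mtry = 3, ntree = 100)

c(tree = mse(predict(fit_tree, test)),
  bag  = mse(predict(fit_bag,  test)),
  rf   = mse(predict(fit_rf,   test)))
```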
Stability is quantified by replacing one randomly selected observation in the training sample with a synthetic one, refitting the model, and measuring the cumulative change in predictions across the dataset. Lower values indicate higher stability.
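A sketch of one perturbation step for the single-tree case, continuing from the simulation snippet above (it reuses `train` and `dat`); the synthetic-observation scheme and the squared-difference accumulation are assumptions of this sketch, as the report may define them differently:

```r
library(rpart)

set.seed(7)
fit0 <- rpart(y ~ ., data = train)

# Replace one randomly chosen training observation with a synthetic one
perturbed <- train
i <- sample(nrow(train), 1)
perturbed[i, -1] <- runif(10)                  # new predictor values
perturbed$y[i]   <- rnorm(1, mean(train$y))    # new response value

fit1 <- rpart(y ~ ., data = perturbed)

# Cumulative change in predictions across the dataset (lower = more stable);
# squared differences are one plausible choice of accumulation
stability <- sum((predict(fit0, dat) - predict(fit1, dat))^2)
```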
The methodology is then applied to the Tehran residential-building dataset (Rafiei, 2018; UCI Machine Learning Repository), with 27 economic and financial explanatory variables predicting apartment construction costs, to verify that simulation findings transfer to applied settings.
| Model | MSE | Var(MSE) |
|---|---|---|
| Regression Tree | 7.786 | 0.407 |
| Bagging | 5.154 | 0.192 |
| Random Forest | 3.674 | 0.107 |
Random Forest reduces the MSE by approximately 53% compared to a single regression tree, and exhibits roughly four times less variability in its MSE estimates across replications.
| Model | Stability measure |
|---|---|
| Regression Tree | 4 422.368 |
| Bagging | 1 021.615 |
| Random Forest | 225.311 |
Random Forest is approximately 20× more stable than a single regression tree under the perturbation protocol, and 4.5× more stable than Bagging — illustrating the additional benefit of feature-level randomization on top of bootstrap resampling.
The optimal number of candidate variables per split, selected by cross-validation across the 1 000 replications, was most often m = 3 (selected in approximately 62% of replications, with m = 4 in 17%, m = 2 in 16%, and m = 5 in 5%).
| Model | MSE | Stability |
|---|---|---|
| Regression Tree | 2 878.508 | 54 248 767 |
| Bagging | 2 586.330 | 7 610 564 |
| Random Forest | 1 733.553 | 4 882 612 |
The pattern observed in simulation is confirmed on real data: Random Forest produces both the lowest prediction error and the most stable predictions, with Bagging providing an intermediate but still substantial improvement over a single tree.
Single regression trees, despite their interpretability, suffer from inherent instability that limits their reliability in applied settings. Bagging and especially Random Forests provide robust solutions that simultaneously improve predictive accuracy and stability — both in controlled simulation and on real-world economic data. The choice between methods ultimately depends on the trade-off between interpretability and performance: a single tree remains useful as an exploratory or communication tool, while ensemble methods are preferable whenever predictive reliability is the priority.
| Component | Technology |
|---|---|
| Core language | R |
| Modeling | rpart, randomForest, ipred |
| Data handling | tidyverse (dplyr, tidyr, readr) |
| Visualization | ggplot2 |
| Reporting | LaTeX (PDF report) |
```
regression-trees-ensemble-methods/
├── data/                  # Input datasets (Friedman simulation and Tehran real data)
├── docs/                  # Project report (PDF, in French)
├── figures/               # Generated plots used in the report
├── results/               # Saved numerical outputs (MSE, stability scores)
├── src/                   # R source code
│   └── Code_projet.R      # Main analysis script
└── README.md
```
1. Clone the repository:

   ```bash
   git clone https://github.com/komiayi/regression-trees-ensemble-methods.git
   cd regression-trees-ensemble-methods
   ```

2. Install the required R packages (in an R or RStudio session):

   ```r
   install.packages(c("rpart", "randomForest", "tidyverse", "ipred"))
   ```

3. Run the analysis. Three options depending on your workflow:

   - From an interactive R console:

     ```r
     setwd("src/")
     source("Code_projet.R")
     ```

   - From the terminal using `Rscript`:

     ```bash
     Rscript src/Code_projet.R
     ```

   - From RStudio: open `src/Code_projet.R` and run it interactively.
Note. The full simulation involves 1 000 Monte-Carlo replications and may take significant time depending on hardware.
- Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140.
- Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
- Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees. Wadsworth.
- Friedman, J. H. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19(1), 1–67.
- Rafiei, M. (2018). Residential Building Data Set. UCI Machine Learning Repository.
- Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
- Vapnik, V. (1999). The Nature of Statistical Learning Theory. Springer.
Distributed under the MIT License. See LICENSE for full terms.
Komi Roger Ayi
Biostatistician · Data Scientist
Université du Québec à Montréal · Montréal, Québec, Canada