Hi,
I wanted to better understand the behavior of `base_margin` during training and prediction of an xgboost model. I read the very nice tutorial and tried to extend it to the case of a simple train/test split. As the code below shows, I have two models that are exactly the same (one trained with `base_score = 0.5` and one with `base_margin = 0`), which makes sense according to the tutorial. But at prediction time, supplying (or not supplying) the `base_margin` results in different predictions. Is that expected behavior, or have I misunderstood something?
```r
library(xgboost)
library(testthat)

# Load data, split train/test
data(agaricus.train, package = "xgboost")
X = agaricus.train$data
y = agaricus.train$label
set.seed(42)
train_ids = sample(x = 1:nrow(X), size = 4000)
test_ids = setdiff(1:nrow(X), train_ids)
xgb_data_train = xgboost::xgb.DMatrix(data = X[train_ids, ], label = y[train_ids])
xgb_data_test = xgboost::xgb.DMatrix(data = X[test_ids, ], label = y[test_ids])

# m1: model with no base_margin (default base_score = 0.5)
m1 = xgboost::xgb.train(
  data = xgb_data_train,
  params = list(base_score = 0.5, objective = "binary:logistic"),
  nrounds = 1)

# TRAIN ----
# does offset change trained model? => YES (seems also independent of nrounds)
# try base_margin = 1 => NOT the same as base_score = 0.5
xgboost::setinfo(xgb_data_train, "base_margin", rep(1, length(train_ids)))
#> [1] TRUE
m2 = xgboost::xgb.train(
  data = xgb_data_train,
  params = list(objective = "binary:logistic"),
  nrounds = 1)
testthat::expect_equal(xgb.dump(m1), xgb.dump(m2)) # fails
#> Error: Expected `xgb.dump(m1)` to equal `xgb.dump(m2)`.
#> Differences:
#> 8/16 mismatches
#> x[4]: "3:leaf=0.541463435"
#> y[4]: "3:leaf=0.360770881"
#>
#> x[7]: "11:leaf=-0.41538465"
#> y[7]: "11:leaf=-0.712710977"
#>
#> x[8]: "12:leaf=0.495652169"
#> y[8]: "12:leaf=0.323709249"
#>
#> x[9]: "8:leaf=-0.594997048"
#> y[9]: "8:leaf=-1.10756671"
#>
#> x[12]: "9:leaf=0.582978785"
#> y[12]: "9:leaf=0.395674318"

# m3: model with base_margin = 0 => same as base_score = 0.5!!!
xgboost::setinfo(xgb_data_train, "base_margin", rep(0, length(train_ids)))
#> [1] TRUE
m3 = xgboost::xgb.train(
  data = xgb_data_train,
  # base_score doesn't matter
  params = list(base_score = 0.2, objective = "binary:logistic"),
  nrounds = 1)
testthat::expect_equal(xgb.dump(m1), xgb.dump(m3)) # models ARE THE SAME!

# PREDICT ----
# predict with no base_margin
p1 = predict(m1, newdata = xgb_data_test)
# predict with no base_margin, different model
p2 = predict(m2, newdata = xgb_data_test)
# predict with no base_margin, same model as m1 but trained with an offset (base_margin = 0)
p3 = predict(m3, newdata = xgb_data_test)
# ALL results different!!!
expect_false(all(p1 == p2))
expect_false(all(p1 == p3)) # WHY??? same models here!!! and same test data?

# add base_margin = 0 to the test data
xgboost::setinfo(xgb_data_test, "base_margin", rep(0, length(test_ids)))
#> [1] TRUE
# predict with base_margin
p11 = predict(m1, newdata = xgb_data_test)
# predict with base_margin, different model
p22 = predict(m2, newdata = xgb_data_test)
# predict with base_margin, same model as m1 but trained with an offset (base_margin = 0)
p33 = predict(m3, newdata = xgb_data_test)
expect_equal(p1, p11)
expect_equal(p2, p22)
expect_equal(p3, p33)
expect_false(all(p11 == p33)) # WHY???
```

Created on 2025-12-12 with reprex v2.1.1
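For what it's worth, my working mental model (which may well be wrong, hence this question) is that the returned probability is `sigmoid(margin + tree_output)`, where the margin is the per-row `base_margin` when one is attached to the DMatrix, and otherwise the logit of the model's stored `base_score`. If that is right, the `p1` vs `p3` gap should come entirely from the different stored intercepts (0.5 vs 0.2), which the raw scores from `predict(..., outputmargin = TRUE)` should reveal on a fresh DMatrix with no `base_margin` attached:

```r
# Fresh test DMatrix with no base_margin (xgb_data_test above now carries
# base_margin = 0 after the setinfo() call)
dtest_fresh = xgboost::xgb.DMatrix(data = X[test_ids, ])

# Raw (pre-sigmoid) scores from the two "identical" models
r1 = predict(m1, newdata = dtest_fresh, outputmargin = TRUE)
r3 = predict(m3, newdata = dtest_fresh, outputmargin = TRUE)

# If the trees are the same and only the stored base_score differs,
# r3 - r1 should be a constant equal to qlogis(0.2) - qlogis(0.5)
summary(r3 - r1)
qlogis(0.2) - qlogis(0.5)
```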