- Basic Idea
- Algorithm Explained
- Define an Objective Function
- Train Loss Part
- Regularization Part
- Optimization
- The Optimal Objective Functions
- Grow Greedily
- When to End Optimization
- Mathematical Details
- More Details
- Parameters
- Parameter Tuning
- Disadvantage & Solution
- Tends to Overfit High-Dimensional Sparse Data
- Overfitting
- Q&A
- How to Get the Weight of a Child?
- Compare GBDT and XGBoost
- The Data Format that's Suitable for XGBoost
- How Does XGBoost Accelerate Training?
- Reference
Basic Idea
XGBoost is an additive model based on boosting. It uses forward stagewise additive modeling to learn the model greedily.
During optimization, new base models are added sequentially: each one models the residual between the true labels and the current model's output and thereby reduces the objective function. The final prediction for an input is the sum of all trees' outputs.
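To make the idea concrete, here is a minimal sketch of forward stagewise residual fitting with squared error, using sklearn decision trees as stand-in base learners. It only illustrates the additive structure, not XGBoost's actual training algorithm:

```python
# Minimal sketch of forward stagewise additive modeling with squared error.
# Each new tree fits the residual (negative gradient for squared loss) of the
# current ensemble; the final prediction is the sum of all trees' outputs.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

n_trees, learning_rate = 50, 0.1
trees = []
pred = np.zeros_like(y)          # start from a constant-zero model
for _ in range(n_trees):
    residual = y - pred          # what the current ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    trees.append(tree)
    pred += learning_rate * tree.predict(X)   # shrinkage: small step per tree

print("training MSE:", np.mean((y - pred) ** 2))
```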
Algorithm Explained
In summary, like other supervised learning methods, XGBoost follows the same two steps:
- Define an objective function
- Optimize the objective function
Define an Objective Function
The objective function is defined as:

$$\text{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

where
- $n$ is the number of training samples
- $K$ is the number of trees
- $f_k$ is a CART tree function.
Train Loss Part
For the $t$-th tree, the predicted value for $x_i$ is:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$

The objective function at this point can be reduced to:

$$\text{Obj}^{(t)} = \sum_{i=1}^{n} l\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) + \text{constant}$$

By applying a second-order Taylor expansion and dropping the constant terms, the specific objective at step $t$ becomes:

$$\text{Obj}^{(t)} \simeq \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$

where $g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ and $h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ are the first- and second-order gradients of the loss.
Regularization Part
First let's refine the definition of a tree function at step $t$:

$$f_t(x) = w_{q(x)}, \quad w \in \mathbb{R}^T, \quad q: \mathbb{R}^d \to \{1, 2, \dots, T\}$$

where
- $T$: the number of leaves
- $w$: the vector of scores on the leaves
- $q(x)$: a function that assigns a data point to the corresponding leaf

The complexity of the model is defined as:

$$\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$
Optimization
The optimization procedure of XGBoost:
- With the help of the second-order Taylor expansion (a Newton-style approximation), we get the optimal objective function value for a fixed tree structure
- Grow the tree greedily based on the reduction (gain) of the objective function value, instead of searching over all possible tree structures
The Optimal Objective Functions
Group the samples by the leaf they fall into: let $I_j = \{ i \mid q(x_i) = j \}$, $G_j = \sum_{i \in I_j} g_i$, and $H_j = \sum_{i \in I_j} h_i$. For a fixed tree structure $q$, the optimal leaf weight and objective value are:

$$w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad \text{Obj}^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$

$w_j^*$ represents the score (weight) of leaf $j$.
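As a quick numeric check (toy gradients, arbitrary $\lambda$ and $\gamma$), the formulas above translate directly into a few lines of NumPy:

```python
# Given per-sample gradients g, Hessians h, and a leaf assignment q(x_i),
# compute each leaf's optimal weight w* = -G/(H + lambda) and the structure
# score Obj* = -1/2 * sum(G^2 / (H + lambda)) + gamma * T.
import numpy as np

g = np.array([0.4, -1.2, 0.3, -0.7, 0.9])   # toy first-order gradients
h = np.array([1.0, 0.8, 1.1, 0.9, 1.2])     # toy second-order gradients
leaf = np.array([0, 0, 1, 1, 1])            # q(x_i): leaf index of each sample
lam, gamma = 1.0, 0.1

T = leaf.max() + 1
G = np.array([g[leaf == j].sum() for j in range(T)])
H = np.array([h[leaf == j].sum() for j in range(T)])

w_star = -G / (H + lam)                              # optimal leaf weights
obj_star = -0.5 * np.sum(G**2 / (H + lam)) + gamma * T
print("leaf weights:", w_star, " structure score:", obj_star)
```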
Grow Greedily
When growing the $t$-th tree, it is not practical to enumerate all possible trees and pick the best one. Instead, XGBoost greedily finds the split point that brings the largest gain to the objective function. The gain is defined as:

$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$

If Gain > 0, the objective function will decrease.
Steps to get the optimal split point (a sketch of this search follows the list):
- Enumerate all the features
- Sort the values of each feature
- Find the optimal split point for each feature
- Pick the optimal split feature and feature value
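Below is a simplified sketch of the exact greedy split search on a single feature: sort by feature value, sweep prefix sums of $g$ and $h$, and score every threshold with the gain formula above. Real XGBoost does this per feature, in parallel, over pre-sorted blocks; the function name `best_split` is just for illustration.

```python
# Exact greedy split search on one feature: sort by feature value, sweep the
# prefix sums G_L/H_L, and score each candidate threshold with the gain formula.
import numpy as np

def best_split(x, g, h, lam=1.0, gamma=0.0):
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    parent = G ** 2 / (H + lam)          # score of the unsplit node

    best_gain, best_threshold = 0.0, None
    G_L = H_L = 0.0
    for i in range(len(x) - 1):
        G_L += g[i]
        H_L += h[i]
        if x[i] == x[i + 1]:             # cannot split between equal values
            continue
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L ** 2 / (H_L + lam)
                      + G_R ** 2 / (H_R + lam)
                      - parent) - gamma
        if gain > best_gain:
            best_gain = gain
            best_threshold = (x[i] + x[i + 1]) / 2
    return best_threshold, best_gain

rng = np.random.default_rng(0)
x = rng.normal(size=100)
g = np.where(x > 0.3, -1.0, 1.0) + rng.normal(scale=0.1, size=100)  # toy gradients
h = np.ones(100)
print(best_split(x, g, h))
```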
When to End Optimization
Tree growth stops when any of the criteria below is met:
- The gain a new split brings is smaller than zero
- The depth of the tree reaches max_depth
- The sum of instance weights (Hessians) in a leaf is smaller than min_child_weight
Mathematical Details
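The step connecting the Taylor-expanded loss to the optimal objective above can be summarized as follows: group samples by the leaf they fall into, so that each leaf weight becomes an independent quadratic problem.

$$
\begin{aligned}
\text{Obj}^{(t)} &\simeq \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \\
&= \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} \left( H_j + \lambda \right) w_j^2 \right] + \gamma T
\end{aligned}
$$

Minimizing each quadratic term in $w_j$ gives $w_j^* = -G_j / (H_j + \lambda)$, and substituting back yields $\text{Obj}^* = -\frac{1}{2} \sum_{j=1}^{T} G_j^2 / (H_j + \lambda) + \gamma T$.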
More Details
Parameters
- General Parameters: Guide the overall functioning (e.g. the type of base learner: tree or linear model)
- Learning Task Parameters: the learning task and learning objective
  - objective
  - eval_metric
  - ...
- Booster Parameters: Guide the individual booster (tree/linear) at each step; a usage sketch follows this list
  - The combination of trees:
    - learning rate
    - regularization terms of the objective function:
      - reg_alpha
      - reg_lambda
      - gamma
    - number of trees
  - Individual tree:
    - Complexity of the model:
      - max_depth
      - min_child_weight: minimum sum of instance weight needed in a child
      - gamma (min_split_loss): minimum loss reduction needed to make a partition
    - Randomness of the training data:
      - subsample: subsample ratio of training instances
      - colsample_bytree: subsample ratio of features
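A usage sketch tying these parameters to xgboost's scikit-learn wrapper; the values are arbitrary placeholders for illustration, not recommendations:

```python
# Illustrative parameter settings for xgboost's scikit-learn interface.
from xgboost import XGBClassifier

model = XGBClassifier(
    # Learning task parameters
    objective="binary:logistic",
    eval_metric="auc",
    # Combination of trees
    n_estimators=300,          # number of trees
    learning_rate=0.1,         # shrinkage
    reg_alpha=0.0,             # L1 term on leaf weights
    reg_lambda=1.0,            # L2 term on leaf weights
    # Individual tree complexity
    max_depth=6,
    min_child_weight=1,
    gamma=0.0,                 # min_split_loss
    # Randomness of the training data
    subsample=0.8,
    colsample_bytree=0.8,
)
# model.fit(X_train, y_train)  # X_train / y_train are assumed to exist
```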
Parameter Tuning
Referring to the Parameters section, most parameter tuning concerns the booster parameters. A common tuning order is (a sketch follows this list):
- Set the learning rate and the number of trees (e.g. learning_rate = 0.1)
- max_depth, min_child_weight
- gamma
- subsample & colsample_bytree
- Finally, increase the number of trees while decreasing the learning rate
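One possible way to implement this tuning order (an assumption about workflow, not the only option) is scikit-learn's GridSearchCV on a synthetic dataset:

```python
# Sketch of the tuning order above: fix learning_rate / n_estimators first,
# then grid-search tree complexity; later steps follow the same pattern.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

base = XGBClassifier(learning_rate=0.1, n_estimators=200, eval_metric="logloss")

# Step 2: tune max_depth and min_child_weight
grid = GridSearchCV(
    base,
    param_grid={"max_depth": [3, 5, 7], "min_child_weight": [1, 3, 5]},
    scoring="roc_auc",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)

# Later steps (gamma, subsample/colsample_bytree, then a lower learning_rate
# with more trees) are tuned the same way.
```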
Disadvantage & Solution
Tends to Overfit High-Dimensional Sparse Data
- GBDT and XGBoost tend to overfit sparse data more than LR (TODO: why?)
Solution:
- Create less sparse features (e.g. crossed features)
- XGBoost + LR: use XGBoost as a feature transformer and feed the transformed features to LR (Ref 4); a sketch follows
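A minimal sketch of the XGBoost + LR idea from Ref 4: each sample is encoded by the index of the leaf it lands in within every tree, the leaf indices are one-hot encoded, and LR is trained on top. The dataset here is synthetic and the settings are arbitrary:

```python
# GBDT/XGBoost + LR feature transformation (Ref 4).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_gbdt, X_lr, y_gbdt, y_lr = train_test_split(X, y, test_size=0.5, random_state=0)

# 1. Train XGBoost on the first half of the data.
xgb_model = XGBClassifier(n_estimators=100, max_depth=4, learning_rate=0.1)
xgb_model.fit(X_gbdt, y_gbdt)

# 2. Transform samples into leaf indices: shape (n_samples, n_trees).
leaves = xgb_model.apply(X_lr)

# 3. One-hot encode the leaf indices and train LR on the sparse result.
encoder = OneHotEncoder(handle_unknown="ignore")
lr = LogisticRegression(max_iter=1000)
lr.fit(encoder.fit_transform(leaves), y_lr)
print("LR on leaf features, train accuracy:",
      lr.score(encoder.transform(leaves), y_lr))
```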
Overfitting
- Combination of trees: increase the number of trees while decreasing the learning rate
  - Shrinkage reduces the influence of each individual tree and leaves room for later trees to improve the model
- Model complexity:
  - max_depth
  - min_child_weight
  - gamma
- Training data randomness:
  - subsample
  - colsample_bytree
Q&A
How to Get the Weight of a Child?
Refer to The Optimal Objective Functions section: the weight of leaf $j$ is $w_j^* = -\frac{G_j}{H_j + \lambda}$.
Compare GBDT and XGBoost
- Objective function
  - GBDT has no regularization term in its objective, so it is more prone to overfitting
- Optimization process
  - GBDT uses a first-order Taylor approximation of the objective, while XGBoost uses a second-order approximation
  - For GBDT, each base learner fits the negative gradient of the samples; this can be viewed as a kind of gradient descent in function space (Ref 2)
  - For XGBoost, besides tree learners, linear models such as linear regression or logistic regression are also supported as base learners; any loss function is supported as long as it has a second-order derivative
- Similar to random forest, XGBoost also supports subsampling of features and training data
- Engineering: XGBoost is faster because of parallel computing
- Missing values: XGBoost can learn the default split direction (left or right) for each node, and samples with missing values are sent to the default branch (a small sketch follows this list)
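A small sketch (synthetic data, arbitrary settings) showing that XGBoost accepts NaN directly and routes missing values along the learned default direction:

```python
# XGBoost handles missing values natively: NaNs are routed along the default
# direction learned for each split, so no imputation is required beforehand.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan      # knock out ~20% of the entries

model = XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)                            # no imputation step needed
print("train accuracy with missing values:", model.score(X, y))
```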
The Data Format that's Suitable for XGBoost
- There are four types of data for machine learning tasks:
  - image
  - sequence
  - graph
  - tabular
- For the first three types of data there are clear patterns, and we can design deep learning architectures for each of them
- But for tabular data there are no apparent patterns; most of the time a traditional ML model is enough. In addition, feature engineering is very important for tabular data
How Does XGBoost Accelerate Training?
- Split finding algorithm (a usage sketch follows this list)
  - Sort the feature values before training and store them in an in-memory data structure called a block, so sorting is not repeated during split finding
  - Percentile approximation: compute percentiles for every feature before training; to find the optimal split point, only these percentile candidates are searched
  - Search all features in parallel when looking for the optimal split point
- Column (feature) subsampling: also accelerates computation
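A hedged sketch of the library knobs that correspond to these points; tree_method, max_bin, n_jobs, and colsample_bytree are real xgboost parameters, while the values are arbitrary:

```python
# Parameters that map to the acceleration techniques above:
#   tree_method="hist"/"approx"  -> binned / percentile-based split candidates
#   max_bin                      -> number of candidate thresholds per feature
#   n_jobs                       -> parallel split search across features
#   colsample_bytree             -> column subsampling
from xgboost import XGBClassifier

fast_model = XGBClassifier(
    tree_method="hist",      # histogram-based approximate split finding
    max_bin=256,             # candidate split points per feature
    n_jobs=-1,               # use all cores for parallel split search
    colsample_bytree=0.8,    # subsample 80% of the columns per tree
    n_estimators=200,
)
# fast_model.fit(X_train, y_train)  # X_train / y_train assumed to exist
```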
Reference
1. Xgboost, GBDT超详细推导 (A very detailed derivation of XGBoost and GBDT)
2. XGBoost Documentation: Introduction to Boosted Trees
3. XGBoost: A Scalable Tree Boosting System
4. Practical Lessons from Predicting Clicks on Ads at Facebook
5. 20道XGBoost面试题 (20 XGBoost interview questions)
6. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Chapter 10
7. 既然使用神经网络也可以解决分类问题,那SVM、决策树这些算法还有什么意义呢? - 柯国霖的回答 - 知乎 (Since neural networks can also solve classification problems, what is the point of algorithms like SVM and decision trees? - Guolin Ke's answer on Zhihu)