XGBoost Explained

Tags: Data Science
Date Published: May 25, 2020

Basic Idea

XGBoost is an additive model based on boosting. It uses a forward stagewise method to learn the model greedily.

During optimization, new base models are created sequentially; each one models the residual between the true label and the output of the current model, reducing the objective function. The final predicted value for an input is the sum of all the trees' outputs.
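
A minimal sketch of this forward stagewise idea, using squared error and sklearn's DecisionTreeRegressor as the base learner purely for illustration (not XGBoost's actual implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_additive_model(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Forward stagewise boosting: each new tree fits the residual
    (the negative gradient of squared error) of the current model."""
    trees = []
    prediction = np.zeros(len(y))           # start from a zero model
    for _ in range(n_trees):
        residual = y - prediction           # what the current model misses
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)               # new base model targets the residual
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return trees

def predict_additive_model(trees, X, learning_rate=0.1):
    """Final prediction is the (shrunken) sum of all trees' outputs."""
    return learning_rate * sum(tree.predict(X) for tree in trees)
```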

Algorithm Explained

In summary, similar to other supervised learning methods, XGBoost follows the same two steps:

  1. Define an objective function
  2. Optimize the objective function

Define an Objective Function

The objective function is defined as:

obj(\theta)=\sum_{i=1}^{n}l(y_i, \hat{y}_i) + \sum_{k=1}^{K}\Omega(f_k)

where

  • $n$ is the number of training samples
  • $K$ is the number of trees
  • $f_k$ is a CART tree function.

Training Loss Part

For the $t$-th tree, the predicted value for $x_i$ is:

\hat{y}_i^{(t)} =\sum_{k=1}^{t}f_k(x_i)=\hat{y}_i^{(t-1)}+f_t(x_i)

The objective function $obj(\theta)$ at this point can be written as:

obj^{(t)}(\theta)=\sum_{i=1}^{n}l(y_i, \hat{y}_i^{(t-1)}+f_t(x_i)) + \Omega(f_t) + constant

By applying a second-order Taylor expansion and dropping the constant term, the objective at step $t$ becomes:

obj^{(t)}(\theta)\approx\sum_{i=1}^{n}\left[g_i f_t(x_i)+\frac{1}{2}h_i f_t^{2}(x_i)\right] + \Omega(f_t)

where $g_i$ and $h_i$ are the first- and second-order derivatives of $l(y_i, \hat{y}_i^{(t-1)})$ with respect to $\hat{y}_i^{(t-1)}$.
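
As a concrete example (assuming squared error loss, which is not the only choice), with $l(y_i,\hat{y}_i)=(y_i-\hat{y}_i)^2$ the derivatives are:

g_i = 2(\hat{y}_i^{(t-1)}-y_i), \qquad h_i = 2

so the $t$-th tree is effectively fit to the (scaled) residuals of the current model, matching the intuition in the Basic Idea section.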

Regularization Part

First, let's refine the definition of the tree function at step $t$:

f_t(x)=w_{q(x)},\quad w\in R^{T},\quad q(x): R^d \rightarrow \{1, 2, \cdots, T\}

where

  • $T$: the number of leaves
  • $w$: the vector of scores on the leaves
  • $q(x)$: a function that assigns a data point to its corresponding leaf

The complexity of the model is defined as:

\Omega(f)= \gamma T+\frac{1}{2}\lambda\sum_{j=1}^{T}w_j^2
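
As a quick illustration (a hypothetical tree, not taken from the text), a tree with $T=3$ leaves and weights $w=(2, 0.1, -1)$ has complexity:

\Omega(f)=3\gamma+\frac{1}{2}\lambda\left(2^2+0.1^2+(-1)^2\right)=3\gamma+2.505\lambda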

Optimization

The optimization procedure of XGBoost:

  1. With the help of the Taylor expansion and Newton's method, we get the optimal objective function value for a fixed tree structure
  2. Grow the tree greedily based on the reduction in the objective function value (the gain), instead of searching over all possible tree structures

The Optimal Objective Functions

obj^\ast = -\frac{1}{2} \sum_{j=1}^T \frac{G_j^2}{H_j+\lambda} + \gamma T

w_j^\ast = -\frac{G_j}{H_j+\lambda}

$w_j^\ast$ is the optimal score (weight) of leaf $j$, where $G_j=\sum_{i\in I_j} g_i$ and $H_j=\sum_{i\in I_j} h_i$ sum the first- and second-order gradients over the set $I_j$ of training instances assigned to leaf $j$.
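
These expressions come from grouping the Taylor-expanded objective by leaf, since every instance in leaf $j$ receives the same score $w_j$:

obj^{(t)} = \sum_{j=1}^{T}\left[G_j w_j + \frac{1}{2}(H_j+\lambda)w_j^2\right] + \gamma T

Each leaf's term is a quadratic in $w_j$; setting its derivative $G_j+(H_j+\lambda)w_j$ to zero gives $w_j^\ast$, and substituting $w_j^\ast$ back into the objective gives $obj^\ast$.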

Grow Greedily

When growing the $t$-th tree, it is not practical to enumerate all possible trees and pick the best one. Instead, XGBoost greedily finds the optimal split points that bring the largest gain to the objective function. The gain is defined as:

Gain = Obj_{L+R}-(Obj_L + Obj_R)

If $Gain>0$, the objective function will decrease.
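
Expanding $Obj$ with the leaf statistics from the previous section (a standard rewrite, stated here for completeness), the gain of splitting a node into a left child $L$ and a right child $R$ is:

Gain = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right]-\gamma

where the $-\gamma$ term accounts for the extra leaf created by the split.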

Steps to find the optimal split point (a sketch follows the list):

  1. Enumerate all the features
  2. Sort the values of every feature
  3. Find the optimal split point for every feature
  4. Pick the optimal split feature and feature value
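
A minimal sketch of this exact greedy search over pre-computed gradients (illustrative only: the names grad, hess, reg_lambda, gamma, and find_best_split are assumptions for this sketch, not XGBoost internals):

```python
import numpy as np

def leaf_objective(G, H, reg_lambda):
    """Optimal objective contribution of a leaf: -1/2 * G^2 / (H + lambda)."""
    return -0.5 * G * G / (H + reg_lambda)

def find_best_split(X, grad, hess, reg_lambda=1.0, gamma=0.0):
    """Exact greedy split finding over all features and sorted feature values."""
    G_total, H_total = grad.sum(), hess.sum()
    parent_obj = leaf_objective(G_total, H_total, reg_lambda)
    best = {"gain": 0.0, "feature": None, "threshold": None}

    for feature in range(X.shape[1]):                 # 1. enumerate all features
        order = np.argsort(X[:, feature])             # 2. sort the feature values
        G_left = H_left = 0.0
        for i in order[:-1]:                          # 3. scan candidate split points
            G_left += grad[i]
            H_left += hess[i]
            G_right, H_right = G_total - G_left, H_total - H_left
            gain = parent_obj - (leaf_objective(G_left, H_left, reg_lambda)
                                 + leaf_objective(G_right, H_right, reg_lambda)) - gamma
            if gain > best["gain"]:                   # 4. keep the best (feature, value)
                best = {"gain": gain, "feature": feature,
                        "threshold": X[i, feature]}
    return best
```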

When to End Optimization

When any of the criteria below is met, the growth of the tree stops:

  1. The gain a new split brings is smaller than zero
  2. The depth of the tree reaches max_depth
  3. The sum of instance weights (hessians) in a leaf is smaller than min_child_weight

Parameters

  1. General Parameters: Guide the overall functioning (the type of base learner, tree or linear model)
  2. Learning Task Parameters: learning task and learning objective
    1. objective function
    2. eval_metric
    3. ...
  3. Booster Parameters: Guide the individual booster (tree or linear model) at each step (an API example follows this list)
    1. The combination of trees:
      1. learning rate
      2. regularization terms for the objective function:
        1. reg_alpha
        2. reg_lambda
      3. gamma
      4. number of trees
    2. Individual tree:
      1. Complexity of the model
        1. max_depth
        2. min_child_weight: minimum sum of instance weight needed in a child
        3. gamma (min_split_loss): minimum loss reduction needed to make a partition
      2. Randomness of the training data
        1. subsample: subsample ratio of training instances
        2. colsample_bytree: subsample ratio of features
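
As a hedged illustration of how these parameters map onto the scikit-learn wrapper of the xgboost package (the values below are placeholders, not recommendations):

```python
import xgboost as xgb

# Booster + learning task parameters, grouped as in the list above.
model = xgb.XGBClassifier(
    # combination of trees
    n_estimators=200,          # number of trees
    learning_rate=0.1,
    reg_alpha=0.0,             # L1 regularization on leaf weights
    reg_lambda=1.0,            # L2 regularization on leaf weights
    # individual tree complexity
    max_depth=5,
    min_child_weight=1,
    gamma=0.0,                 # min_split_loss
    # randomness of the training data
    subsample=0.8,
    colsample_bytree=0.8,
    # learning task parameters
    objective="binary:logistic",
    eval_metric="auc",         # accepted here in recent xgboost versions; older versions pass it to fit()
)
# model.fit(X_train, y_train)  # X_train / y_train are placeholders
```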

Parameter Tuning

Referring to the Parameters section, most parameter tuning is related to the booster parameters:

  1. Set the learning rate and the number of trees: e.g., learning_rate = 0.1
  2. Tune max_depth and min_child_weight (a grid search sketch for this step follows the list)
  3. Tune gamma
  4. Tune subsample & colsample_bytree
  5. Increase the number of trees while decreasing the learning rate
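
A sketch of step 2, tuning max_depth and min_child_weight with a grid search (the grid values and CV settings are illustrative assumptions):

```python
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

# Fix the learning rate and number of trees first (step 1), then search over
# the tree-complexity parameters.
base = xgb.XGBClassifier(learning_rate=0.1, n_estimators=100,
                         objective="binary:logistic")

param_grid = {
    "max_depth": [3, 5, 7],
    "min_child_weight": [1, 3, 5],
}
search = GridSearchCV(base, param_grid, scoring="roc_auc", cv=5)
# search.fit(X_train, y_train)     # placeholders for the training data
# print(search.best_params_)       # fix these, then move on to gamma, etc.
```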

Disadvantage & Solution

Tends to Overfit on High-Dimensional Sparse Data

  1. GBDT and XGBoost tend to overfit more on sparse data than LR does (TODO: why?)

Solution:

  1. Create less sparse features (e.g., crossed features)
  2. XGBoost + LR: use XGBoost to do the feature transformation, then train LR on the transformed features (Ref 4); a sketch follows below
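
A minimal sketch of this XGBoost + LR pipeline (the synthetic data and model settings are placeholders for illustration):

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Synthetic data just to make the sketch runnable.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 1. Train XGBoost; each sample is then represented by the index of the
#    leaf it falls into in every tree.
gbm = xgb.XGBClassifier(n_estimators=100, max_depth=4, learning_rate=0.1)
gbm.fit(X, y)
leaves = gbm.apply(X)                      # shape: (n_samples, n_trees)

# 2. One-hot encode the leaf indices and feed them to logistic regression.
encoder = OneHotEncoder(handle_unknown="ignore")
lr = LogisticRegression(max_iter=1000)
lr.fit(encoder.fit_transform(leaves), y)
```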

Overfitting

  1. Combination of trees: increase the number of trees while decreasing the learning rate
    1. Shrinkage reduces the influence of each individual tree and leaves space for future trees to improve the model
  2. Model complexity:
    1. max_depth
    2. min_child_weight
    3. gamma
  3. Training data randomness:
    1. subsample
    2. colsample_bytree

Q&A

How to Get the Weight of a Child?

Refer to The Optimal Objective Functions section above.

Compare GBDT and XGBoost

  1. Objective function
    1. GBDT has no regularization term, so it is more prone to overfitting
  2. Optimization process
    1. GBDT uses a first-order Taylor approximation, while XGBoost uses a second-order Taylor approximation of the objective function
    2. In GBDT, each learner fits the negative gradient of the loss for every sample, which can be viewed as a kind of gradient descent in function space (Ref 2)
  3. Besides tree learners, XGBoost also supports linear learners such as linear regression or logistic regression; any loss function is supported as long as it has a second-order derivative
  4. Similar to random forest, XGBoost also supports subsampling features or training data
  5. Engineering: XGBoost is faster because of parallel computation (e.g., parallel split finding across features)
  6. Missing values: XGBoost learns a default split direction (left or right) for each node; when a value is missing, the sample is sent to the default branch

The Data Format that's Suitable for XGBoost

  1. There are four types of data for machine learning tasks:
    1. image
    2. sequence
    3. graph
    4. tabular
  2. For the first three types of data there are inherent patterns, and we can design deep learning architectures for each of them
  3. For tabular data there are no such apparent patterns, and most of the time a traditional ML model is enough; in addition, feature engineering is very important for tabular data

How Does XGBoost Accelerate Training?

  1. Split finding algorithm (see the example after this list)
    1. Sort the feature values before training and store them in a cached data structure called a block, which reduces repeated computation
    2. Percentile approximation: compute percentiles for every feature before training; to find the optimal split point, search only among these percentile candidates
    3. Search all features in parallel when looking for the optimal split point
  2. Column (feature) subsampling: also accelerates computation
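
In the Python package, the choice between the exact algorithm and the percentile/histogram approximations is exposed through the tree_method parameter; for example:

```python
import xgboost as xgb

# "exact": exact greedy split finding over all sorted feature values.
# "approx" / "hist": candidate splits from quantiles / histogram bins,
#                    which is much faster on large datasets.
model = xgb.XGBClassifier(tree_method="hist", n_estimators=200)
```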

Reference

  1. XGBoost and GBDT: A Detailed Derivation (Chinese)
  2. XGBoost Documentation - Introduction to Boosted Trees
  3. XGBoost: A Scalable Tree Boosting System
  4. Practical Lessons from Predicting Clicks on Ads at Facebook
  5. 20 XGBoost Interview Questions (Chinese)
  6. The Elements of Statistical Learning: Data Mining, Inference, and Prediction - Chapter 10
  7. "Since neural networks can also solve classification problems, what is the point of algorithms like SVM and decision trees?" - Guolin Ke's answer on Zhihu (Chinese)