- Basic Idea
- Algorithm Explained
- Define an Objective Function
- Train Loss Part
- Regularization Part
- Optimization
- The Optimal Objective Functions
- Grow Greedily
- When to End Optimization
- Mathematical Details
- More Details
- Parameters
- Parameter Tuning
- Disadvantage & Solution
- Tends to Overfit High-Dimensional Sparse Data
- Overfitting
- Q&A
- How to Get the Weight of a Child?
- Compare GBDT and XGBoost
- The Data Format that's Suitable for XGBoost
- How Does XGBoost Accelerate Training?
- Reference
Basic Idea
XGBoost is an additive model based on boosting. It uses forward stagewise additive modeling to learn the model greedily.
During optimization, new base models are added sequentially: each one models the residual between the true labels and the current model's output and thereby reduces the objective function. The final prediction for an input is the sum of all trees' outputs.
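To make the idea concrete, here is a minimal sketch of forward stagewise residual fitting with squared error, using sklearn decision trees as stand-in base learners. It only illustrates the additive structure, not XGBoost's actual training algorithm:

```python
# Minimal sketch of forward stagewise additive modeling with squared error.
# Each new tree fits the residual (negative gradient for squared loss) of the
# current ensemble; the final prediction is the sum of all trees' outputs.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

n_trees, learning_rate = 50, 0.1
trees = []
pred = np.zeros_like(y)          # start from a constant-zero model
for _ in range(n_trees):
    residual = y - pred          # what the current ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    trees.append(tree)
    pred += learning_rate * tree.predict(X)   # shrinkage: small step per tree

print("training MSE:", np.mean((y - pred) ** 2))
```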
Algorithm Explained
In summary, like other supervised learning methods, XGBoost follows the same two steps:
- Define an objective function
- Optimize the objective function
Define an Objective Function
The objective function is defined as:

$$\text{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

where
- $n$ is the number of training samples
- $K$ is the number of trees
- $f_k$ is a CART tree function.
Train Loss Part
For the $t$-th tree, the predicted value for $x_i$ is:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$

The objective function at this point can be reduced to:

$$\text{Obj}^{(t)} = \sum_{i=1}^{n} l\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) + \text{constant}$$

By applying a second-order Taylor expansion and dropping the constant terms, the specific objective at step $t$ becomes:

$$\text{Obj}^{(t)} \simeq \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$

where $g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ and $h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ are the first- and second-order gradients of the loss.
Regularization Part
First let's refine the definition of a tree function at step $t$:

$$f_t(x) = w_{q(x)}, \quad w \in \mathbb{R}^T, \quad q: \mathbb{R}^d \to \{1, 2, \dots, T\}$$

where
- $T$: the number of leaves
- $w$: the vector of scores on the leaves
- $q(x)$: a function that assigns a data point to the corresponding leaf

The complexity of the model is defined as:

$$\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$
Optimization
The optimization procedure of XGBoost:
- With the help of the second-order Taylor expansion (a Newton-style approximation), we get the optimal objective function value for a fixed tree structure
- Grow the tree greedily based on the reduction (gain) of the objective function value, instead of searching over all possible tree structures
The Optimal Objective Functions
Group the samples by the leaf they fall into: let $I_j = \{ i \mid q(x_i) = j \}$, $G_j = \sum_{i \in I_j} g_i$, and $H_j = \sum_{i \in I_j} h_i$. For a fixed tree structure $q$, the optimal leaf weight and objective value are:

$$w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad \text{Obj}^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$

$w_j^*$ represents the score (weight) of leaf $j$.
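As a quick numeric check (toy gradients, arbitrary $\lambda$ and $\gamma$), the formulas above translate directly into a few lines of NumPy:

```python
# Given per-sample gradients g, Hessians h, and a leaf assignment q(x_i),
# compute each leaf's optimal weight w* = -G/(H + lambda) and the structure
# score Obj* = -1/2 * sum(G^2 / (H + lambda)) + gamma * T.
import numpy as np

g = np.array([0.4, -1.2, 0.3, -0.7, 0.9])   # toy first-order gradients
h = np.array([1.0, 0.8, 1.1, 0.9, 1.2])     # toy second-order gradients
leaf = np.array([0, 0, 1, 1, 1])            # q(x_i): leaf index of each sample
lam, gamma = 1.0, 0.1

T = leaf.max() + 1
G = np.array([g[leaf == j].sum() for j in range(T)])
H = np.array([h[leaf == j].sum() for j in range(T)])

w_star = -G / (H + lam)                              # optimal leaf weights
obj_star = -0.5 * np.sum(G**2 / (H + lam)) + gamma * T
print("leaf weights:", w_star, " structure score:", obj_star)
```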
Grow Greedily
When growing the $t$-th tree, it is not practical to enumerate all possible trees and pick the best one. Instead, XGBoost greedily finds the split point that brings the largest gain to the objective function. The gain is defined as:

$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$

If Gain > 0, the objective function will decrease.
Steps to get the optimal split point (a sketch of this search follows the list):
- Enumerate all the features
- Sort the values of each feature
- Find the optimal split point for each feature
- Pick the optimal split feature and feature value
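Below is a simplified sketch of the exact greedy split search on a single feature: sort by feature value, sweep prefix sums of $g$ and $h$, and score every threshold with the gain formula above. Real XGBoost does this per feature, in parallel, over pre-sorted blocks; the function name `best_split` is just for illustration.

```python
# Exact greedy split search on one feature: sort by feature value, sweep the
# prefix sums G_L/H_L, and score each candidate threshold with the gain formula.
import numpy as np

def best_split(x, g, h, lam=1.0, gamma=0.0):
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    parent = G ** 2 / (H + lam)          # score of the unsplit node

    best_gain, best_threshold = 0.0, None
    G_L = H_L = 0.0
    for i in range(len(x) - 1):
        G_L += g[i]
        H_L += h[i]
        if x[i] == x[i + 1]:             # cannot split between equal values
            continue
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L ** 2 / (H_L + lam)
                      + G_R ** 2 / (H_R + lam)
                      - parent) - gamma
        if gain > best_gain:
            best_gain = gain
            best_threshold = (x[i] + x[i + 1]) / 2
    return best_threshold, best_gain

rng = np.random.default_rng(0)
x = rng.normal(size=100)
g = np.where(x > 0.3, -1.0, 1.0) + rng.normal(scale=0.1, size=100)  # toy gradients
h = np.ones(100)
print(best_split(x, g, h))
```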
When to End Optimization
Tree growth stops when any of the criteria below is met:
- The gain a new split brings is smaller than zero
- The depth of the tree reaches max_depth
- The sum of instance weights (Hessians) in a leaf is smaller than min_child_weight
Mathematical Details
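The step connecting the Taylor-expanded loss to the optimal objective above can be summarized as follows: group samples by the leaf they fall into, so that each leaf weight becomes an independent quadratic problem.

$$
\begin{aligned}
\text{Obj}^{(t)} &\simeq \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \\
&= \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} \left( H_j + \lambda \right) w_j^2 \right] + \gamma T
\end{aligned}
$$

Minimizing each quadratic term in $w_j$ gives $w_j^* = -G_j / (H_j + \lambda)$, and substituting back yields $\text{Obj}^* = -\frac{1}{2} \sum_{j=1}^{T} G_j^2 / (H_j + \lambda) + \gamma T$.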
More Details
Parameters
- General Parameters: Guide the overall functioning (e.g. the type of base learner: tree or linear model)
- Learning Task Parameters: the learning task and learning objective
  - objective
  - eval_metric
  - ...
- Booster Parameters: Guide the individual booster (tree/linear) at each step; a usage sketch follows this list
  - The combination of trees:
    - learning rate
    - regularization terms of the objective function:
      - reg_alpha
      - reg_lambda
      - gamma
    - number of trees
  - Individual tree:
    - Complexity of the model:
      - max_depth
      - min_child_weight: minimum sum of instance weight needed in a child
      - gamma (min_split_loss): minimum loss reduction needed to make a partition
    - Randomness of the training data:
      - subsample: subsample ratio of training instances
      - colsample_bytree: subsample ratio of features
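A usage sketch tying these parameters to xgboost's scikit-learn wrapper; the values are arbitrary placeholders for illustration, not recommendations:

```python
# Illustrative parameter settings for xgboost's scikit-learn interface.
from xgboost import XGBClassifier

model = XGBClassifier(
    # Learning task parameters
    objective="binary:logistic",
    eval_metric="auc",
    # Combination of trees
    n_estimators=300,          # number of trees
    learning_rate=0.1,         # shrinkage
    reg_alpha=0.0,             # L1 term on leaf weights
    reg_lambda=1.0,            # L2 term on leaf weights
    # Individual tree complexity
    max_depth=6,
    min_child_weight=1,
    gamma=0.0,                 # min_split_loss
    # Randomness of the training data
    subsample=0.8,
    colsample_bytree=0.8,
)
# model.fit(X_train, y_train)  # X_train / y_train are assumed to exist
```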
Parameter Tuning
Referring to the Parameters section, most parameter tuning concerns the booster parameters. A common tuning order is (a sketch follows this list):
- Set the learning rate and the number of trees (e.g. learning_rate = 0.1)
- max_depth, min_child_weight
- gamma
- subsample & colsample_bytree
- Finally, increase the number of trees while decreasing the learning rate
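One possible way to implement this tuning order (an assumption about workflow, not the only option) is scikit-learn's GridSearchCV on a synthetic dataset:

```python
# Sketch of the tuning order above: fix learning_rate / n_estimators first,
# then grid-search tree complexity; later steps follow the same pattern.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

base = XGBClassifier(learning_rate=0.1, n_estimators=200, eval_metric="logloss")

# Step 2: tune max_depth and min_child_weight
grid = GridSearchCV(
    base,
    param_grid={"max_depth": [3, 5, 7], "min_child_weight": [1, 3, 5]},
    scoring="roc_auc",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)

# Later steps (gamma, subsample/colsample_bytree, then a lower learning_rate
# with more trees) are tuned the same way.
```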
Disadvantage & Solution
Tends to Overfit High-Dimensional Sparse Data
- GBDT and XGBoost tend to overfit sparse data more than LR (TODO: why?)
Solution:
- Create less sparse features (e.g. crossed features)
- XGBoost + LR: use XGBoost as a feature transformer and feed the transformed features to LR (Ref 4); a sketch follows
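A minimal sketch of the XGBoost + LR idea from Ref 4: each sample is encoded by the index of the leaf it lands in within every tree, the leaf indices are one-hot encoded, and LR is trained on top. The dataset here is synthetic and the settings are arbitrary:

```python
# GBDT/XGBoost + LR feature transformation (Ref 4).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_gbdt, X_lr, y_gbdt, y_lr = train_test_split(X, y, test_size=0.5, random_state=0)

# 1. Train XGBoost on the first half of the data.
xgb_model = XGBClassifier(n_estimators=100, max_depth=4, learning_rate=0.1)
xgb_model.fit(X_gbdt, y_gbdt)

# 2. Transform samples into leaf indices: shape (n_samples, n_trees).
leaves = xgb_model.apply(X_lr)

# 3. One-hot encode the leaf indices and train LR on the sparse result.
encoder = OneHotEncoder(handle_unknown="ignore")
lr = LogisticRegression(max_iter=1000)
lr.fit(encoder.fit_transform(leaves), y_lr)
print("LR on leaf features, train accuracy:",
      lr.score(encoder.transform(leaves), y_lr))
```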
Overfitting
- Combination of trees: increase the number of trees while decreasing the learning rate
  - Shrinkage reduces the influence of each individual tree and leaves room for later trees to improve the model
- Model complexity:
  - max_depth
  - min_child_weight
  - gamma
- Training data randomness:
  - subsample
  - colsample_bytree
Q&A
How to Get the Weight of a Child?
Refer to The Optimal Objective Functions section: the weight of leaf $j$ is $w_j^* = -\frac{G_j}{H_j + \lambda}$.
Compare GBDT and XGBoost
- Objective function
  - GBDT has no regularization term in its objective, so it is more prone to overfitting
- Optimization process
  - GBDT uses a first-order Taylor approximation of the objective, while XGBoost uses a second-order approximation
  - For GBDT, each base learner fits the negative gradient of the samples; this can be viewed as a kind of gradient descent in function space (Ref 2)
  - For XGBoost, besides tree learners, linear models such as linear regression or logistic regression are also supported as base learners; any loss function is supported as long as it has a second-order derivative
- Similar to random forest, XGBoost also supports subsampling of features and training data
- Engineering: XGBoost is faster because of parallel computing
- Missing values: XGBoost can learn the default split direction (left or right) for each node, and samples with missing values are sent to the default branch (a small sketch follows this list)
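A small sketch (synthetic data, arbitrary settings) showing that XGBoost accepts NaN directly and routes missing values along the learned default direction:

```python
# XGBoost handles missing values natively: NaNs are routed along the default
# direction learned for each split, so no imputation is required beforehand.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan      # knock out ~20% of the entries

model = XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)                            # no imputation step needed
print("train accuracy with missing values:", model.score(X, y))
```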
The Data Format that's Suitable for XGBoost
- There are four types of data for machine learning tasks:
  - image
  - sequence
  - graph
  - tabular
- For the first three types of data there are clear patterns, and we can design deep learning architectures for each of them
- But for tabular data there are no apparent patterns; most of the time a traditional ML model is enough. In addition, feature engineering is very important for tabular data
How Does XGBoost Accelerate Training?
- Split finding algorithm (a usage sketch follows this list)
  - Sort the feature values before training and store them in an in-memory data structure called a block, so sorting is not repeated during split finding
  - Percentile approximation: compute percentiles for every feature before training; to find the optimal split point, only these percentile candidates are searched
  - Search all features in parallel when looking for the optimal split point
- Column (feature) subsampling: also accelerates computation
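A hedged sketch of the library knobs that correspond to these points; tree_method, max_bin, n_jobs, and colsample_bytree are real xgboost parameters, while the values are arbitrary:

```python
# Parameters that map to the acceleration techniques above:
#   tree_method="hist"/"approx"  -> binned / percentile-based split candidates
#   max_bin                      -> number of candidate thresholds per feature
#   n_jobs                       -> parallel split search across features
#   colsample_bytree             -> column subsampling
from xgboost import XGBClassifier

fast_model = XGBClassifier(
    tree_method="hist",      # histogram-based approximate split finding
    max_bin=256,             # candidate split points per feature
    n_jobs=-1,               # use all cores for parallel split search
    colsample_bytree=0.8,    # subsample 80% of the columns per tree
    n_estimators=200,
)
# fast_model.fit(X_train, y_train)  # X_train / y_train assumed to exist
```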
Reference
1. Xgboost, GBDT超详细推导 (A very detailed derivation of XGBoost and GBDT)
2. XGBoost Documentation: Introduction to Boosted Trees
3. XGBoost: A Scalable Tree Boosting System
4. Practical Lessons from Predicting Clicks on Ads at Facebook
5. 20道XGBoost面试题 (20 XGBoost interview questions)
6. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Chapter 10
7. 既然使用神经网络也可以解决分类问题,那SVM、决策树这些算法还有什么意义呢? - 柯国霖的回答 - 知乎 (Since neural networks can also solve classification problems, what is the point of algorithms like SVM and decision trees? - Guolin Ke's answer on Zhihu)