XGBoost: A Scalable Tree Boosting System

Abstract

Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.

Notes

A lot of the notes are taken from XGBoost's tutorial.

Gradient boosting is a machine learning technique that combines a number of weak learners into a single strong learner, adding them one at a time so that each new learner corrects the errors of the ensemble built so far. Gradient boosted trees are the special case in which the weak learners are decision trees. XGBoost uses tree ensembles built from classification and regression trees (CART). A CART splits the samples on their feature values into leaves and assigns each leaf a real-valued score; the ensemble's prediction for a sample is the sum of the leaf scores it receives across all trees, as in the sketch below.
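
To make the ensemble idea concrete, here is a minimal sketch (not XGBoost's internals) of how a prediction is formed: each tree routes the sample to one leaf, and the leaf scores are summed across trees. The feature names, splits, and scores below are made up for illustration.

```python
# Illustrative sketch: an ensemble prediction is the sum of per-tree leaf scores.

def predict_ensemble(trees, x):
    """`trees` is a list of callables, each mapping a sample to a leaf score."""
    return sum(tree(x) for tree in trees)

# Two toy stump-like "trees", each splitting on a single (hypothetical) feature.
tree1 = lambda x: 2.0 if x["age"] < 15 else -1.0              # split on age
tree2 = lambda x: 0.9 if x["uses_computer_daily"] else -0.9   # split on a second feature

sample = {"age": 10, "uses_computer_daily": True}
print(predict_ensemble([tree1, tree2], sample))  # 2.0 + 0.9 = 2.9
```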

The trees complement one another: each new tree is trained to correct the residual errors of the trees built so far. The complexity of a tree is defined as \(\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^T w_j^2\), where \(T\) is the number of leaves and \(w_j\) is the weight (score) of leaf \(j\). This term regularizes the model by penalizing trees with many leaves or large leaf weights.
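
As a sanity check, the penalty can be computed directly from a tree's leaf weights. The helper and the numbers below are purely illustrative.

```python
import numpy as np

def tree_complexity(leaf_weights, gamma=1.0, lam=1.0):
    """Omega(f) = gamma * T + 0.5 * lambda * sum_j w_j^2,
    where T is the number of leaves and w_j are the leaf weights."""
    w = np.asarray(leaf_weights, dtype=float)
    return gamma * w.size + 0.5 * lam * np.sum(w ** 2)

# A tree with three leaves, weights (2.0, 0.1, -1.0), gamma = 1, lambda = 1:
# Omega = 1 * 3 + 0.5 * (4.0 + 0.01 + 1.0) = 5.505
print(tree_complexity([2.0, 0.1, -1.0]))
```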

XGBoost implements this algorithm and has been particularly successful, appearing in many winning Kaggle solutions. It is also extremely fast thanks to a series of algorithmic and systems-level optimizations (sparsity-aware split finding, the weighted quantile sketch, cache-aware access, data compression, and sharding), which the paper reviews.
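
For reference, the \(\gamma\) and \(\lambda\) terms above surface in the library as the `gamma` and `lambda` training parameters. A minimal training call might look like the sketch below; the synthetic data and parameter values are arbitrary.

```python
import numpy as np
import xgboost as xgb

# Synthetic regression data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=500)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "reg:squarederror",
    "max_depth": 3,
    "eta": 0.1,      # learning rate
    "gamma": 1.0,    # per-leaf penalty (the gamma * T term)
    "lambda": 1.0,   # L2 penalty on leaf weights (the lambda term)
}
bst = xgb.train(params, dtrain, num_boost_round=100)
preds = bst.predict(dtrain)
```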