Gradient Boosting Machine – A Brief Introduction


One of the Predictive Analytics projects I am working on at Expedia uses Gradient Boosting Machine (GBM). This is currently one of the state-of-the-art algorithms in Machine Learning. This article provides insights on how to get started and advice for further reading.

I will focus here on the use of GBM for regression, with decision trees as the base prediction models. Building on concepts of ensemble learning (www.tinyurl.com/ens-learn), the algorithm iteratively constructs regression trees. From the second iteration onward, each new tree is fit to the errors of the current ensemble (instead of the original target), so that the algorithm focuses on the cases with the largest errors. A notion of bagging can also be incorporated (see the bagging fraction parameter below).

I would summarize GBM using the following description (from Natekin and Knoll, see the tutorial below): “In gradient boosting machines, or simply, GBMs, the learning procedure consecutively fits new models to provide a more accurate estimate of the response variable. The principle idea behind this algorithm is to construct the new base-learners to be maximally correlated with the negative gradient of the loss function, associated with the whole ensemble. The loss functions applied can be arbitrary, but to give a better intuition, if the error function is the classic squared-error loss, the learning procedure would result in consecutive error-fitting.”
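To make this consecutive error-fitting concrete, here is a minimal sketch in R of the core loop for squared-error loss, written from scratch with rpart trees rather than with the GBM package discussed below; the synthetic data, column names and parameter values are illustrative assumptions only.

    library(rpart)

    set.seed(1)
    n <- 500
    train <- data.frame(x1 = runif(n), x2 = runif(n))
    train$y <- sin(2 * pi * train$x1) + train$x2^2 + rnorm(n, sd = 0.1)

    n_trees       <- 100    # number of iterations (trees)
    learning_rate <- 0.05   # shrinkage applied to each tree's contribution
    max_depth     <- 3      # maximum depth of each tree

    pred <- rep(mean(train$y), n)       # start from a constant prediction
    for (m in seq_len(n_trees)) {
      train$r <- train$y - pred         # residuals = negative gradient of the squared-error loss
      tree <- rpart(r ~ x1 + x2, data = train,
                    control = rpart.control(maxdepth = max_depth, cp = 0))
      pred <- pred + learning_rate * predict(tree, train)   # small, shrunken correction
    }

    mean((train$y - pred)^2)            # training error decreases as trees are added

Each pass adds a shallow tree trained on what the current ensemble still gets wrong, which is exactly the behaviour the quote describes for squared-error loss.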

Examples of parameters to tune for GBM include (the sketch after the list shows how they map onto function arguments):

  • Learning rate: the lower the better; usually, the computational time will be the limit. This is the first parameter to be set.
  • Number of iterations (trees): this can be tuned using cross-validation.
  • Tree depth: maximum depth of the tree; can be set using cross-validation.
  • Bagging fraction: percent of data to sample (a value of 1 means all the data are used at each iteration, so no bagging is used).
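
As a hedged illustration, these knobs correspond roughly to the following arguments of the R GBM package I use (see below); the formula, the train data frame and the concrete values are placeholder assumptions, not recommendations.

    library(gbm)

    model <- gbm(
      y ~ .,                             # placeholder formula and data
      data              = train,
      distribution      = "gaussian",    # squared-error loss for regression
      shrinkage         = 0.01,          # learning rate: the lower the better, at the cost of more trees
      n.trees           = 3000,          # number of iterations (trees)
      interaction.depth = 4,             # maximum depth of each tree
      bag.fraction      = 0.5,           # fraction of data sampled per iteration (1 = no bagging)
      cv.folds          = 5              # cross-validation folds, used to pick n.trees
    )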

GBM provided better results than random forest for the regression problem I am involved in. Since there is no free-lunch Machine Learning algorithm (www.no-free-lunch.org), you will have to test it on your specific problem. Note that execution is much easier to parallelize for random forests, since their trees are built independently, whereas GBM builds them sequentially.

Here is a list of articles I found very useful (if you were to read only one of them, I would recommend the tutorial):

For the implementation, I use the GBM package in R (www.tinyurl.com/GBM-CRAN). Note that the execution time can be an issue. On a training set of 100k rows and 15 features, building the model can take up to 45min on a business laptop.
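
Continuing the hedged sketch above, choosing the number of trees by cross-validation and scoring new data would look roughly like this (test is a hypothetical hold-out data frame):

    # pick the iteration that minimises the cross-validated error,
    # then predict with exactly that many trees
    best_iter <- gbm.perf(model, method = "cv", plot.it = FALSE)
    preds     <- predict(model, newdata = test, n.trees = best_iter)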

This article was originally published in issue 4 of the Swiss Analytics Magazine.
