Decision Tree Ensembles in R
This is the final project in the DSO_530 Applied Modern Statistical Learning Methods class by Professor Robertas Gabrys, USC. I completed this project with four other classmates.
All images, data, and R script can be found here
Decision trees have been around for a long time and are known to suffer from bias and variance. Ensemble methods, which combine several trees, are among the most popular solutions. We will try several such methods, including bagging, random forests, and boosting, in this example.
You are hired to help a finance company identify profitable customers. Its clients are small businesses that purchase financial services, including accounting, banking, payroll, and taxes. The finance company also occasionally provides short-term loans to clients. You are asked to estimate the value of fees generated by potential clients. Clients sign an annual contract, and the finance company would like to be able to predict the total fees that will accumulate over the coming year. It currently relies on search firms, called originators, that recruit clients. The finance company would like to identify clients that will generate greater fees so that it can encourage originators to target them.
The data set consists of 3272 observations of 57 variables. There is a mix of categorical and numeric variables.
Categorical variables: We identify some categorical variables with many categories, which may cause difficulties in the modeling process:
SIC.Code: The numbers in this variable do not represent numeric values but the industry code
Industry.Category: 32 categories
Industry.Subcategory: 268 categories
Metropolitan.Statistical.Area: 232 categories
TimeZone: 6 levels
Division: 9 levels
Other categorical variables with not so many categories may bear useful information for the model:
Originator: 3 categories
Property.Ownership: 3 categories
Industry.Category: 4 main categories
Ever.Bankrupt: 2 categories
Region: 4 categories
Ambiguous variables: Variables with ambiguous meaning (without explanation from the project file):
Amt.Past.Judgments
X1.Years
Variables with NA values: We also recognize NA values in some variables using the str() function in R. Some NA values can be explained and have their own meaning, but others we assume are errors.
Time.Since.Bankruptcy: For clients that haven't declared bankruptcy, the value of this variable is NA.
Past.Due.of.Charge.Offs: For clients that haven't gone through a Charge Off, the value of this variable is NA.
Credit.Bal.of.Charge.Offs: For clients that haven't gone through a Charge Off, the value of this variable is NA.
Active.Percent.Revolving: There are 28 observations with NA values in this variable. After looking at the descriptive statistics of these observations, we don't recognize any abnormal behavior. We assume these are missing values and will remove them in the data preparation step.
Like most other projects, cleaning and preparing the data is an important and time-consuming task. We follow the steps below; detailed code is available in the R-script file.
Converting the data type of monetary-value variables from string to number. When we import the original data into R, the money symbol ($) causes these values to be read as strings, so we remove the money symbol and coerce the results to numeric (see the sketch after this list).
Removing categorical variables with too many categories.
Modifying the Industry.Category variable to have only 4 categories: retailing, service, manufacturing, and wholesale. We also remove all clients in the wholesale and manufacturing industries. This group includes only 35 observations, so removing it improves the quality of our model.
Adding dummy variables representing categorical variables with a small number of categories.
Removing three variables with NA values - Time.Since.Bankruptcy, Past.Due.of.Charge.Offs and Credit.Bal.of.Charge.Offs.
Removing 28 observations with NA values in the Active.Percent.Revolving variable.
Removing outliers/observations with extreme values.
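As a rough illustration of these steps, here is a minimal sketch in R. The file name, the data-frame name, and the industry-level labels are assumptions; the actual code is in the R-script file.

```r
library(dplyr)

# Assumed file and object names; see the R-script file for the actual code
clients <- read.csv("clients.csv", stringsAsFactors = FALSE)

# Columns read as strings because of the "$" symbol: strip "$" and "," and coerce to numeric
is_money <- sapply(clients, function(x) is.character(x) && any(grepl("\\$", x), na.rm = TRUE))
clients[is_money] <- lapply(clients[is_money], function(x) as.numeric(gsub("[$,]", "", x)))

# Drop categoricals with too many categories and the three variables with unexplained NAs
clients <- clients %>%
  select(-SIC.Code, -Industry.Subcategory, -Metropolitan.Statistical.Area,
         -Time.Since.Bankruptcy, -Past.Due.of.Charge.Offs, -Credit.Bal.of.Charge.Offs)

# Keep only retailing and service clients (level labels assumed) and
# drop the 28 observations with NA in Active.Percent.Revolving
clients <- clients %>%
  filter(Industry.Category %in% c("Retailing", "Service"),
         !is.na(Active.Percent.Revolving))

# Dummy variables for the remaining low-cardinality categoricals are created
# later via model.matrix() / factor handling inside the modeling functions
```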
After these steps, we're left with 3153 observations of 48 variables.
We chose two simple models, OLS regression and KNN, as benchmarks for the predictive performance of the Decision Tree models.
We randomly select a training set of 2000 observations. To reduce the number of predictors, we use the Best Subset Selection approach with the exhaustive method to select the most important variables.
Details are available in the R-script file. Here is the table with the RMSE, MAPE, and correlation of each model.
The model with the lowest testing RMSE is the one with 22 variables.
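A minimal sketch of the subset-selection step using the leaps package is below. The response name Fees and the train/test object names are placeholders, and the candidate subset sizes are illustrative.

```r
library(leaps)

set.seed(1)
train_idx <- sample(nrow(clients), 2000)
train <- clients[train_idx, ]
test  <- clients[-train_idx, ]

# Exhaustive best subset search ("Fees" stands in for the actual response column)
best_sub <- regsubsets(Fees ~ ., data = train, nvmax = 25,
                       method = "exhaustive", really.big = TRUE)

# Test RMSE for each subset size; pick the size with the lowest value
test_mat <- model.matrix(Fees ~ ., data = test)
rmse <- sapply(1:25, function(k) {
  coefs <- coef(best_sub, id = k)
  pred  <- test_mat[, names(coefs)] %*% coefs
  sqrt(mean((test$Fees - pred)^2))
})
which.min(rmse)   # size of the subset with the lowest test RMSE
```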
We also add two variables that management considers important: "Average.House.Value" and "Income.Per.Household".
Our OLS regression model does not perform especially well, with an R-squared of around 0.5. However, comparing testing-set MSEs, OLS regression still beats KNN: the MSE for the OLS regression model is 333,477.3, while that for KNN is 605,383.4.
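For reference, the benchmark fits might look roughly like this (knn.reg comes from the FNN package; the value k = 10 and the response name Fees are illustrative assumptions):

```r
library(FNN)

# OLS benchmark on the selected variables
ols_fit  <- lm(Fees ~ ., data = train)
ols_pred <- predict(ols_fit, newdata = test)
mean((test$Fees - ols_pred)^2)   # testing MSE

# KNN regression benchmark on standardized predictors
x_train <- scale(model.matrix(Fees ~ ., data = train)[, -1])
x_test  <- scale(model.matrix(Fees ~ ., data = test)[, -1],
                 center = attr(x_train, "scaled:center"),
                 scale  = attr(x_train, "scaled:scale"))
knn_pred <- knn.reg(train = x_train, test = x_test, y = train$Fees, k = 10)$pred
mean((test$Fees - knn_pred)^2)   # testing MSE
```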
Now we come to Decision Trees.
A good strategy is to grow a very large tree and then prune it back to obtain a subtree. We want to select a subtree that leads to the lowest test error rate. We estimate the test error rate using cross-validation (CV).
This is the summary of the regression tree using the 'tree' package:
Note that the tree does not use all of the available variables. A visualization of the tree:
Using the tree to predict on the testing set, we get a testing MSE of 439,125.2, which is not so impressive. Given the training MSE of 335,600, this substantially higher testing MSE indicates over-fitting.
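The corresponding code is roughly as follows (Fees remains a placeholder response name):

```r
library(tree)

# Grow a large regression tree on the training set
tree_fit <- tree(Fees ~ ., data = train)
summary(tree_fit)            # shows which variables the tree actually uses

# Visualize the tree
plot(tree_fit)
text(tree_fit, pretty = 0)

# Training and testing MSE
mean((train$Fees - predict(tree_fit, newdata = train))^2)
mean((test$Fees  - predict(tree_fit, newdata = test))^2)
```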
Prune the tree
Looking at the graph of CV error for each tree size, we see that there's not much improvement when the tree size increases from 8 nodes to 12 nodes. Let's use an eight-node tree.
Using the pruned tree to predict on the testing set, we get a testing MSE of 450,941.4, which is again not so impressive and still shows over-fitting.
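A sketch of the pruning step, continuing from the tree fitted above:

```r
# Cross-validation over tree sizes, then prune to eight terminal nodes
set.seed(1)
cv_fit <- cv.tree(tree_fit)
plot(cv_fit$size, cv_fit$dev, type = "b",
     xlab = "Tree size", ylab = "CV error")

pruned_fit <- prune.tree(tree_fit, best = 8)
mean((test$Fees - predict(pruned_fit, newdata = test))^2)   # testing MSE
```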
Bagging (Bootstrap Aggregation) is used when our goal is to reduce the variance of a decision tree. The idea is to create several random subsets of the training data and train a decision tree on each subset. As a result, we end up with an ensemble of different models. The average of the predictions from the different trees is used, which is more robust than a single decision tree.
In R, we use the 'randomForest' package for the bagging method. This package is also used for Random Forest in our next step.
This approach shows an impressive improvement in predictive performance. The MSE for the training set is 291,605.4, and the MSE for the testing set is 363,215.6. Reducing the number of trees from the default (500) to 200 barely changes the testing MSE (360,925.2) while substantially reducing computation time.
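A sketch of the bagging fit (object and response names assumed as before):

```r
library(randomForest)

# Bagging = random forest with mtry equal to the number of predictors
p <- ncol(train) - 1   # assumes a single response column
set.seed(1)
bag_fit <- randomForest(Fees ~ ., data = train, mtry = p, ntree = 500)
mean((test$Fees - predict(bag_fit, newdata = test))^2)   # testing MSE

# Fewer trees: similar testing MSE, much less computation time
bag_200 <- randomForest(Fees ~ ., data = train, mtry = p, ntree = 200)
mean((test$Fees - predict(bag_200, newdata = test))^2)
```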
Random Forest is an extension of bagging. It takes one extra step: in addition to taking a random subset of the data, it also takes a random selection of features rather than using all features to grow each tree. An important parameter in R for this method is mtry, the number of variables randomly sampled as candidates at each split. Note that the default values differ for classification (sqrt(p), where p is the number of variables in x) and regression (p/3). Trying different values of mtry, we find that an 11-variable model gives the lowest testing MSE.
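A sketch of tuning mtry and drawing the importance chart (the candidate mtry values are illustrative):

```r
# Try several values of mtry and compare testing MSE
mtry_grid <- c(5, 8, 11, 16)
rf_mse <- sapply(mtry_grid, function(m) {
  set.seed(1)
  fit <- randomForest(Fees ~ ., data = train, mtry = m)
  mean((test$Fees - predict(fit, newdata = test))^2)
})

# Refit with the best mtry and plot variable importance
rf_fit <- randomForest(Fees ~ ., data = train, mtry = 11, importance = TRUE)
varImpPlot(rf_fit)   # variable importance chart (%IncMSE and IncNodePurity)
```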
Unlike a single decision tree, a Random Forest is an ensemble of hundreds of decision trees, so we cannot easily read the splitting decisions at individual nodes. Instead, we interpret our Random Forest model through the Variable Importance Chart above, which shows how much each variable impacts MSE. Because the variables are ranked, this chart helps the finance company focus on a few important attributes and allows bankers to mentally prioritize and evaluate clients.
Boosting is another ensemble technique for creating a collection of predictors. In this technique, learners are trained sequentially, with early learners fitting simple models to the data and later learners analyzing the errors. In other words, we fit consecutive trees (each on a random sample), and at every step the goal is to correct the net error from the prior trees.
Unlike Random Forest, Boosting is more likely to overfit.
In R, we use the 'gbm' package, which is an implementation of extensions to Freund and Schapire's AdaBoost algorithm and Friedman's gradient boosting machine. This is the original R implementation of gradient boosted machines. The most common hyperparameters you will find in most GBM implementations include:
Number of trees: the total number of trees to fit.
Depth of trees: the number d of splits in each tree.
Learning rate (also called shrinkage): controls how quickly the algorithm proceeds down the gradient descent. Smaller values reduce the chance of overfitting but also increase the time to find the optimal fit.
Subsampling: controls whether or not you use a fraction of the available training observations for each tree.
The default settings in gbm include a learning rate (shrinkage) of 0.001. This is a very small learning rate and typically requires a large number of trees to find the minimum MSE. However, gbm uses a default of 100 trees, which is rarely sufficient. Adjusting these parameters is therefore necessary.
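A sketch of a boosted fit with adjusted hyperparameters (the specific values of n.trees, interaction.depth, and shrinkage below are illustrative, not the tuned values from our run):

```r
library(gbm)

set.seed(1)
boost_fit <- gbm(Fees ~ ., data = train,
                 distribution = "gaussian",   # squared-error loss for regression
                 n.trees = 5000,              # the 100-tree default is rarely enough
                 interaction.depth = 4,       # depth of each tree
                 shrinkage = 0.01,            # learning rate
                 bag.fraction = 0.5)          # subsampling fraction

summary(boost_fit)   # relative influence (importance) of each variable

boost_pred <- predict(boost_fit, newdata = test, n.trees = 5000)
mean((test$Fees - boost_pred)^2)   # testing MSE
```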
Showing the summary of the boosted tree, we see the ranking of variable importance.
The MSE for the testing set is 359,925.2.
In this example, Random Forest not only shows the best predictive performance among the tree-based models but also has the advantage of resisting overfitting. However, comparing it with the simple OLS regression, we notice that Random Forest still loses the prediction game. A more complicated model often does not guarantee a better result.