06. Tree Methods
A very powerful group of algorithms falls under the heading of "Tree Methods":
- Decision trees
- Random forests
- Gradient boosted trees
Decision Trees
A single decision tree makes predictions by recursively splitting the data on feature values; the ensemble methods below combine many such trees.
Random Forests
To improve performance, we can use many trees, with a random sample of features chosen as split candidates at each split.
- A new random sample of features is chosen for every single tree at every single split.
- This works for both classification and regression tasks.
Why do we need a random sample of features at every split?
- If we have one very strong feature in the data set, most of the trees will use that feature as the top split, resulting in an ensemble of similar trees that are highly correlated.
- Averaging highly correlated quantities does not significantly reduce variance.
- By randomly leaving out candidate features at each split, Random Forests "decorrelate" the trees, so that the averaging process can reduce the variance of the resulting model.
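As a rough sketch of this idea, assuming scikit-learn is available (the dataset and parameter values below are made up for illustration), the `max_features` setting controls how many randomly sampled features are considered as split candidates at each split:

```python
# Minimal sketch: random forest with a random subset of features per split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy classification data (illustrative only).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# max_features="sqrt" means each split considers a fresh random sample of
# sqrt(n_features) candidate features, which decorrelates the trees.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=42)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
```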
Gradient Boosted Trees
Gradient boosting involves three elements:
- A loss function to be optimized.
- A weak learner to make predictions.
- An additive model to add weak learners to minimize the loss function.
Loss Function:
In basic terms, a loss function is the function/equation we use to measure how 'far off' our predictions are. It must be differentiable so that gradients can be computed; squared error is a common choice for regression and log loss for classification.
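For instance, a minimal sketch of a squared-error loss using NumPy (the values are made up):

```python
import numpy as np

# Example loss: mean squared error between true values and predictions.
y_true = np.array([3.0, -0.5, 2.0, 7.0])   # illustrative values
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)      # how 'far off' the predictions are
print(mse)  # 0.375
```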
Weak Learner:
Decision trees are used as the weak learner in gradient boosting.
It is common to constrain the weak learners, for example with a maximum number of layers, nodes, splits, or leaf nodes, so that each individual tree remains weak.
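A minimal sketch of such a constrained weak learner, assuming scikit-learn (the data and the depth/leaf limits are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# A shallow, constrained decision tree used as the weak learner;
# the depth/leaf limits keep each individual tree deliberately weak.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
weak_learner = DecisionTreeRegressor(max_depth=3, max_leaf_nodes=8)
weak_learner.fit(X, y)
print("Number of leaves:", weak_learner.get_n_leaves())
```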
Additive Model:
Trees are added one at a time and existing trees in the model are not changed.
A gradient descent procedure is used to minimize the loss when adding trees: each new tree is fit so that adding its (scaled) predictions moves the ensemble in the direction of the negative gradient of the loss.
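To make the stage-wise idea concrete, here is a minimal sketch of gradient boosting with squared-error loss, assuming scikit-learn and NumPy (the dataset, learning rate, and number of stages are illustrative). Each new tree is fit to the negative gradient of the loss (for squared error, simply the residuals), and its scaled predictions are added to the ensemble while earlier trees are left unchanged.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (illustrative only).
X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

learning_rate = 0.1
n_stages = 50
prediction = np.full_like(y, y.mean(), dtype=float)  # initial constant model
trees = []

for _ in range(n_stages):
    # For squared-error loss, the negative gradient is simply the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=3)        # constrained weak learner
    tree.fit(X, residuals)
    trees.append(tree)                               # existing trees are never changed
    prediction += learning_rate * tree.predict(X)    # additive update (gradient step)

print("Final training MSE:", np.mean((y - prediction) ** 2))
```

In practice you would normally reach for a library implementation such as scikit-learn's GradientBoostingRegressor, which wraps this stage-wise procedure; the loop above is only meant to show the additive, gradient-driven structure.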