TDSM 7.19

From The Data Science Design Manual Wikia
Revision as of 19:20, 11 December 2017 by Anjul.tyagi

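The bias-variance tradeoff means that increasing model complexity does not always increase accuracy. Bias is the error that comes from modeling the problem incorrectly, for example with a model too simple to capture the underlying relationship. Variance is the error that comes from the model's sensitivity to the particular training sample it was fit on, including the noise in that sample. Simple models tend to have high bias and low variance, while complex models tend to have low bias and high variance. Put simply, if we try too hard to fit the training data, we end up fitting the noise, and the results on new data will be poor.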
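As a minimal sketch of these two error sources, the simulation below repeatedly draws noisy training sets and measures how far the average prediction is from the truth (squared bias) and how much predictions move between training sets (variance). Everything in it is an assumption made for illustration: the sine-shaped `true_f`, the noise level, and the two toy models (`fit_constant` as a stand-in for a model that is too simple, `fit_nearest` as a stand-in for one that fits the noise).

```python
import math
import random
import statistics

def true_f(x):
    # Hypothetical ground-truth function, chosen only for illustration.
    return math.sin(2 * math.pi * x)

def sample_training_set(rng, n=20, noise_sd=0.3):
    # Fixed inputs on [0, 1]; only the noise changes between training sets.
    xs = [(i + 0.5) / n for i in range(n)]
    ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys

def fit_constant(xs, ys):
    # Too simple: ignores x and always predicts the training mean.
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y

def fit_nearest(xs, ys):
    # Too flexible: copies the label of the closest training point, noise included.
    return lambda x: ys[min(range(len(xs)), key=lambda i: abs(xs[i] - x))]

def bias_variance(fit, n_trials=500, seed=0):
    """Estimate average squared bias and variance of a fitting procedure."""
    rng = random.Random(seed)
    x_test = [(i + 0.25) / 20 for i in range(20)]
    preds = []
    for _ in range(n_trials):
        model = fit(*sample_training_set(rng))
        preds.append([model(x) for x in x_test])
    bias_sq, variance = [], []
    for j, x in enumerate(x_test):
        column = [p[j] for p in preds]  # predictions at x across training sets
        bias_sq.append((statistics.fmean(column) - true_f(x)) ** 2)
        variance.append(statistics.pvariance(column))
    return statistics.fmean(bias_sq), statistics.fmean(variance)

const_bias, const_var = bias_variance(fit_constant)
nn_bias, nn_var = bias_variance(fit_nearest)
print(f"constant model:   bias^2={const_bias:.4f}  variance={const_var:.4f}")
print(f"nearest neighbor: bias^2={nn_bias:.4f}  variance={nn_var:.4f}")
```

Under these assumptions the constant model shows the larger squared bias and the nearest-neighbor model the larger variance, which is the tradeoff described above.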

Examples:

In linear regression, we can add a regularization (penalty) term, which reduces the variance but may at the same time increase the bias.

Another interesting observation comes from comparing random forests with gradient-boosted decision trees (GBDT): on the same dataset, the trees in a random forest are usually deeper than the trees in a GBDT model. Bias and variance give a simple explanation for this. In a random forest, the trees are built independently of one another, and the final prediction is obtained by averaging or voting over them (bagging). This is like making an observation many times and taking the average, so a random forest naturally has low variance. What remains to be reduced when building each individual tree is the bias, so the trees are grown deep to fit the data more precisely. In GBDT, by contrast, each new tree is fit to the negative gradient of the loss of the ensemble built so far (for squared error, the residuals), so GBDT naturally has low bias. What remains to be reduced when building each tree is the variance, so the trees are kept shallow, typically with a depth of only about 3 to 8.
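The regularization example can be sketched with a small simulation. It assumes the simplest possible setting, a single feature with no intercept, where the ridge (L2-penalized) slope has the closed form sum(x*y) / (sum(x*x) + lam). The true slope, noise level, and penalty value `lam=5.0` are hypothetical values picked only to make the effect visible.

```python
import random
import statistics

def ridge_slope(xs, ys, lam):
    # Closed-form ridge estimate for y = beta * x (one feature, no intercept).
    # lam = 0 gives ordinary least squares.
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def simulate(lam, true_beta=2.0, n=10, noise_sd=1.0, n_trials=2000, seed=0):
    """Return (bias, variance) of the slope estimate over many noisy datasets."""
    rng = random.Random(seed)
    xs = [-1 + 2 * i / (n - 1) for i in range(n)]  # fixed inputs in [-1, 1]
    estimates = []
    for _ in range(n_trials):
        ys = [true_beta * x + rng.gauss(0, noise_sd) for x in xs]
        estimates.append(ridge_slope(xs, ys, lam))
    bias = statistics.fmean(estimates) - true_beta
    return bias, statistics.pvariance(estimates)

ols_bias, ols_var = simulate(lam=0.0)
ridge_bias, ridge_var = simulate(lam=5.0)
print(f"no penalty: bias={ols_bias:+.3f}  variance={ols_var:.3f}")
print(f"lam=5.0:    bias={ridge_bias:+.3f}  variance={ridge_var:.3f}")
```

With the penalty, the slope estimates vary less from dataset to dataset but are pulled toward zero, away from the true slope. Whether that trade lowers the total error depends on the penalty strength, which in practice is usually chosen by validation.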