TDSM 7.19

From The Data Science Design Manual Wikia

The bias-variance tradeoff describes two competing sources of prediction error. A model that is too simple (high bias, because it makes strong assumptions about the form of the relationship) underfits and predicts poorly. A model that is too complex (low bias) can fit the training data almost perfectly, but it overfits: its predictions change substantially from one training sample to another (high variance). Bias is the error that comes from modeling the problem incorrectly, while variance is the error that comes from the model's sensitivity to the particular training sample, including its noise. In short, if we try too hard to fit the data, we end up fitting the noise, and the results on new data will be poor. The goal is to balance the two so that the model predicts well without overfitting the training data.
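The tradeoff described above can be estimated empirically by refitting a model on many independently drawn training sets and looking at how its predictions behave at fixed test points. The following is only a minimal sketch under assumptions that are not part of the original answer: the target function `true_f`, the noise level, the sample size, and the two models (a constant mean predictor as the "too simple" model and a 1-nearest-neighbor predictor as the "too complex" one) are all hypothetical choices made for illustration.

```python
import math
import random
import statistics


def true_f(x):
    # Hypothetical ground-truth function, chosen only for illustration.
    return math.sin(2 * math.pi * x)


def sample_training_set(rng, n=30, noise_sd=0.3):
    # Draw n noisy observations of true_f on [0, 1).
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def fit_mean(xs, ys):
    # Very simple model: always predict the mean training target.
    m = statistics.fmean(ys)
    return lambda x: m


def fit_1nn(xs, ys):
    # Very flexible model: copy the target of the nearest training point.
    pts = list(zip(xs, ys))
    return lambda x: min(pts, key=lambda p: abs(p[0] - x))[1]


def bias_variance(fit, trials=200, seed=0):
    # Refit on many fresh training sets and collect predictions at
    # a fixed grid of test points.
    rng = random.Random(seed)
    test_xs = [i / 20 for i in range(21)]
    preds = [[] for _ in test_xs]
    for _ in range(trials):
        model = fit(*sample_training_set(rng))
        for j, x in enumerate(test_xs):
            preds[j].append(model(x))
    # Squared bias: (average prediction - truth)^2, averaged over test points.
    bias2 = statistics.fmean(
        [(statistics.fmean(p) - true_f(x)) ** 2 for p, x in zip(preds, test_xs)]
    )
    # Variance: spread of the predictions across training sets.
    variance = statistics.fmean([statistics.pvariance(p) for p in preds])
    return bias2, variance


if __name__ == "__main__":
    for name, fit in (("mean", fit_mean), ("1-NN", fit_1nn)):
        b2, var = bias_variance(fit)
        print(f"{name:5s} bias^2={b2:.3f} variance={var:.3f}")
```

Under these assumptions, the mean predictor should show the larger squared bias and the 1-nearest-neighbor predictor the larger variance, which is the pattern the paragraph describes.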

Examples:

In linear regression, we can add a regularization (penalty) term, which reduces the variance but can increase the bias.

Another interesting observation comes from comparing random forests with gradient boosted decision trees (GBDT): on the same dataset, the trees in a random forest are usually deeper than the trees in a GBDT model. Bias and variance give a simple explanation. In a random forest, the trees are built independently on bootstrap samples (bagging), and the final prediction is their average or majority vote. This is like making an observation many times and taking the average, so a random forest naturally has a small variance. What remains is to reduce the bias of each individual tree, so the trees are grown deep to fit the data closely. In GBDT, by contrast, each new tree is fit to the negative gradient of the loss with respect to the current ensemble's predictions (the residuals, in the case of squared error), so the ensemble naturally has a small bias. What remains is to reduce the variance of each tree, so the trees are kept shallow, typically with a depth of only about 3 to 8.
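The regularization claim at the start of these examples can be checked numerically. This is a hedged sketch rather than anything from the original answer: it assumes a one-feature linear model with no intercept, for which the ridge estimate has the closed form w = sum(x*y) / (sum(x*x) + lambda), and it uses hypothetical values for the true slope `TRUE_W`, the noise level `NOISE_SD`, and the fixed inputs `XS`. It covers only the linear regression example, not the random forest and GBDT comparison.

```python
import random
import statistics

TRUE_W = 2.0    # hypothetical true slope (assumption for illustration)
NOISE_SD = 1.0  # hypothetical noise standard deviation
XS = [i / 5 - 1 for i in range(11)]  # fixed inputs from -1 to 1


def ridge_slope(xs, ys, lam):
    # Closed-form ridge estimate for a single feature without intercept.
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def ridge_bias_variance(lam, trials=2000, seed=0):
    # Resample the noise many times and refit, keeping the inputs fixed.
    rng = random.Random(seed)
    slopes = []
    for _ in range(trials):
        ys = [TRUE_W * x + rng.gauss(0.0, NOISE_SD) for x in XS]
        slopes.append(ridge_slope(XS, ys, lam))
    bias = statistics.fmean(slopes) - TRUE_W
    variance = statistics.pvariance(slopes)
    return bias, variance


if __name__ == "__main__":
    for lam in (0.0, 1.0, 5.0):
        bias, variance = ridge_bias_variance(lam)
        print(f"lambda={lam:3.1f} bias={bias:+.3f} variance={variance:.3f}")
```

With `lam=0.0` this is ordinary least squares, which is unbiased in this setup. As `lam` grows, the estimated slope is shrunk toward zero, so its variance across resampled datasets should drop while its bias grows in magnitude.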