Sunday, January 26, 2014

How do you determine how good your regression model is?

It comes down to the trade-off between model variance and model bias.

What's model variance?
Let's have our model fit the data as closely as it can.
If we randomly pick a certain % of our data and fit the model to that subset, we have model 1.
We do this n times and end up with n models.
The model variance is basically the variance of the predicted values (y) from these n models for a given x.
Typically, a less flexible model, say a linear one, has a smaller model variance.
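
Here is a rough sketch of that resampling procedure in Python with NumPy. Everything specific in it is my assumption for illustration, not part of the recipe above: the made-up data (one arch of a sine curve plus noise), polynomial fits standing in for "less flexible" (degree 1) and "more flexible" (degree 9) models, and the helper name `prediction_variance`.

```python
import numpy as np

def prediction_variance(x, y, degree, x0, n_models=200, frac=0.5, seed=0):
    """Variance of the predicted y at x0 across n_models fits on random subsets."""
    rng = np.random.default_rng(seed)
    size = int(frac * len(x))
    preds = []
    for _ in range(n_models):
        idx = rng.choice(len(x), size=size, replace=False)  # random % of the data
        coeffs = np.polyfit(x[idx], y[idx], degree)          # one fitted model
        preds.append(np.polyval(coeffs, x0))                 # its prediction at x0
    return float(np.var(preds))

# Made-up data (an assumption for illustration): one arch of a sine plus noise.
data_rng = np.random.default_rng(42)
x = data_rng.uniform(0, 1, 200)
y = np.sin(np.pi * x) + data_rng.normal(0, 0.5, 200)

var_linear = prediction_variance(x, y, degree=1, x0=0.5)
var_flexible = prediction_variance(x, y, degree=9, x0=0.5)
print(f"linear: {var_linear:.4f}  degree 9: {var_flexible:.4f}")
```

On data like this, the degree 9 fits should disagree with each other more at x = 0.5 than the linear fits do.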

What's model bias?
That's the difference between the average predicted y from the n models for a given x and the actual y for the same x.
Typically, a less flexible model, say a linear one, has a bigger model bias.
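
A small sketch of how you could estimate this in Python with NumPy. The assumptions are mine: the data is simulated, so the "actual y" is taken to be the noise-free curve `true_f` (with real data you would only have the observed y), and polynomial fits of degree 1 and 5 stand in for a rigid and a flexible model. `prediction_bias` is a hypothetical helper name.

```python
import numpy as np

def true_f(x):
    # Assumed ground truth, known only because the data is simulated.
    return np.sin(np.pi * x)

def prediction_bias(x, y, degree, x0, n_models=200, frac=0.5, seed=0):
    """Average prediction at x0 over n_models subset fits, minus the actual value."""
    rng = np.random.default_rng(seed)
    size = int(frac * len(x))
    preds = []
    for _ in range(n_models):
        idx = rng.choice(len(x), size=size, replace=False)  # random % of the data
        coeffs = np.polyfit(x[idx], y[idx], degree)          # one fitted model
        preds.append(np.polyval(coeffs, x0))                 # its prediction at x0
    return float(np.mean(preds) - true_f(x0))

# Made-up data: noisy samples around true_f.
data_rng = np.random.default_rng(42)
x = data_rng.uniform(0, 1, 200)
y = true_f(x) + data_rng.normal(0, 0.5, 200)

bias_linear = prediction_bias(x, y, degree=1, x0=0.5)
bias_flexible = prediction_bias(x, y, degree=5, x0=0.5)
print(f"linear: {bias_linear:.3f}  degree 5: {bias_flexible:.3f}")
```

A straight line cannot follow the hump of the curve, so its average prediction at x = 0.5 should sit well below the actual value, while the degree 5 average should land much closer.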

We'd like to arrive at a better model by trading off model variance against model bias. We don't want our model to be so flexible that it overfits the data and introduces a huge model variance, and we don't want it to be so rigid that its predicted values end up too far away from the actual values.
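
For reference, the standard way to write this trade-off down for squared error is the decomposition below. The notation is my assumption, not from the notes above: the data follow y = f(x) + noise, the noise has mean zero and variance \sigma^2 and is independent of the training data, \hat{f} is the fitted model, and the expectation is over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \sigma^2
```

The last term is noise that no model can remove, so the only things we can trade against each other are the first two.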

How do you tell if you are overfitting?
Use out-of-sample data.
Let's say you fit the training data and come up with a model with a very small mean squared error on the training data.
Have this model predict the out-of-sample data and compare its predictions with the actual results. If the mean squared error on the out-of-sample data is large, much larger than on the training data, you are probably overfitting your training data.
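
A minimal sketch of that check in Python with NumPy. The data, the 20/40 split, and the polynomial degrees are all assumptions I made up for illustration. Any model and any held-out set would do.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between actual and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))

# Made-up data: one arch of a sine curve plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 60)
y = np.sin(np.pi * x) + rng.normal(0, 0.3, 60)

# Keep the first 20 points for training and hold out the other 40.
x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

results = {}
for degree in (1, 3, 10):
    coeffs = np.polyfit(x_train, y_train, degree)           # fit on training data only
    train_mse = mse(y_train, np.polyval(coeffs, x_train))
    test_mse = mse(y_test, np.polyval(coeffs, x_test))      # out-of-sample error
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, out-of-sample MSE {test_mse:.3f}")
```

The thing to look at is the gap: a model whose training error is tiny but whose out-of-sample error is much bigger is the one that is probably overfitting.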
