Friday, May 16, 2014

Do you have too much (or not enough) data?

There's no such thing as too much data. :D

But what about when R cannot allocate memory, or the learning algorithm takes way too long? :(

PLOT THE LEARNING CURVES, i.e.:
the cost on the training set as a function of training set size
the cost on the cross-validation/test set as a function of training set size
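To make the two curves concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the synthetic data, the one-variable least-squares model, mean squared error as the cost, and the helper names `fit_line`, `mse` and `learning_curve`. Swap in your own model and cost function.

```python
import random

def fit_line(xs, ys):
    """Closed-form least squares fit of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx > 0 else 0.0
    return a, mean_y - a * mean_x

def mse(model, xs, ys):
    """Mean squared error of the fitted line on (xs, ys)."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def learning_curve(train_x, train_y, val_x, val_y, sizes):
    """For each size m, fit on the first m training rows and return
    (m, training cost, validation cost)."""
    rows = []
    for m in sizes:
        model = fit_line(train_x[:m], train_y[:m])
        rows.append((m,
                     mse(model, train_x[:m], train_y[:m]),
                     mse(model, val_x, val_y)))
    return rows

# Hypothetical data: y = 2x + 1 plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(600)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]
train_x, train_y = xs[:500], ys[:500]
val_x, val_y = xs[500:], ys[500:]

for m, train_cost, val_cost in learning_curve(
        train_x, train_y, val_x, val_y, [5, 10, 50, 100, 500]):
    print(f"{m:>5d}  train={train_cost:.3f}  val={val_cost:.3f}")
```

Plot the second and third columns against the first and you have the learning curves. Note that the validation set stays fixed while the training set grows.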

If the 2 curves have already converged, your model is not overfitting the data (not suffering from high variance).
That doesn't necessarily mean your model is good enough: that depends on the cost/error rate the curves converge to. If it is still too high, the model is underfitting (high bias), which can be addressed with more features or, in the case of a neural network, more nodes and hidden layers. Convergence only means that adding more data to your current model won't help.

If the 2 curves haven't converged (the validation cost is still well above the training cost), your model is overfitting the data (suffering from high variance). Having more data should help.
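The rule of thumb above can be written down as a tiny helper. This is only a sketch: the function name `diagnose` and the 10% relative gap threshold are assumptions, not a standard, and what counts as "converged" really depends on your problem.

```python
def diagnose(train_cost, val_cost, gap_tolerance=0.1):
    """Compare the costs at the largest training set size.

    gap_tolerance is a hypothetical threshold: the fraction of the
    validation cost that the train/validation gap may occupy before
    we call the curves 'not converged'.
    """
    gap = val_cost - train_cost
    relative_gap = gap / max(val_cost, 1e-12)  # avoid division by zero
    if relative_gap > gap_tolerance:
        return "high variance: more data should help"
    return ("converged: more data alone won't help; "
            "improve the model if the cost is still too high")

print(diagnose(train_cost=0.10, val_cost=1.00))
print(diagnose(train_cost=0.95, val_cost=1.00))
```

A single pair of numbers is a crude summary, so still look at the plot: curves that are slowly closing in on each other tell you more than one gap value.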

So, if you think you have too much data (actually, it's your system / implementation that's subpar), randomly sample a subset and plot the learning curves to decide whether you should spend your time building a better model or feeding more data to your current one.
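Drawing the random subset could look like the sketch below. The helper name `sample_subset`, the fixed seed and the toy row list are assumptions for illustration. It samples without replacement and returns the rows in random order, so any prefix of the result is itself a random sample, which is handy when you grow the training set step by step.

```python
import random

def sample_subset(rows, k, seed=0):
    """Return k rows drawn uniformly without replacement, in random order.

    If k is at least len(rows), all rows are returned (shuffled).
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    k = min(k, len(rows))
    return rng.sample(rows, k)

# Hypothetical dataset: 1,000,000 row ids standing in for real records.
all_rows = list(range(1_000_000))
subset = sample_subset(all_rows, 10_000)
print(len(subset))
```

If even the full dataset doesn't fit in memory, you would need to sample while reading the file instead; that is a different technique (e.g. reservoir sampling) and is not shown here.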

Side note: if you need a more powerful server, renting a cloud server (e.g. Amazon EC2) could be an option.
