There's no such thing as too much data. :D
But what about when R cannot allocate memory, or the learning algorithm takes way too long... :(
PLOT THE LEARNING CURVES
i.e.,
cost on the training set as a function of training set size
cost on the cross-validation/test set as a function of training set size
If the two curves have already converged, your model is not overfitting the data (not suffering from high variance).
That doesn't necessarily mean your model is good enough; that depends on the cost/error the learning curves converge to, and can be addressed by adding more features (or, in the case of a neural network, more nodes and hidden layers) to build a better model. It only means that adding more data to your current model won't help.
If the two curves haven't converged, your model is overfitting the data (suffering from high variance), and more data should help.
So, if you think you have too much data (actually, it's your system/implementation that's subpar), randomly sample a subset and plot the learning curves, as in the sketch below, to decide whether you should spend time building a better model or feeding more data to your current one.
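Here's a minimal sketch in base R. It assumes a data frame df with a numeric response column y and at least a few hundred rows; lm() is just a stand-in for whatever learner you actually use.

    # Shuffle and split into a training set and a cross-validation set.
    set.seed(42)
    n     <- nrow(df)
    idx   <- sample(n)
    split <- floor(0.7 * n)
    train <- df[idx[1:split], ]
    cv    <- df[idx[(split + 1):n], ]

    # Mean squared error of a fitted model on a given data set.
    mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

    # Refit on growing subsets of the training data, tracking both errors.
    sizes     <- floor(seq(50, nrow(train), length.out = 20))
    train_err <- numeric(length(sizes))
    cv_err    <- numeric(length(sizes))

    for (i in seq_along(sizes)) {
      sub          <- train[1:sizes[i], ]
      fit          <- lm(y ~ ., data = sub)   # swap in your own learner here
      train_err[i] <- mse(fit, sub)
      cv_err[i]    <- mse(fit, cv)
    }

    # The learning curves: CV error in red, training error in blue.
    plot(sizes, cv_err, type = "l", col = "red",
         xlab = "training set size", ylab = "MSE",
         ylim = range(c(train_err, cv_err)))
    lines(sizes, train_err, col = "blue")
    legend("topright", legend = c("cross validation", "training"),
           col = c("red", "blue"), lty = 1)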
Side note: if you need a powerful server, renting a cloud instance (e.g., EC2 on Amazon AWS) could be an option.