Stochastic gradient descent
Good choice if:
You have a continuous stream of data (way too much data to retrain the algorithm on all of it)
Your data's behavior slowly changes
(Say, TV ads used to have a dominant effect on sales, but as the number of online users grows, the TV ad effect isn't as dominant anymore)
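Roughly, the online version of the update looks like this. (A minimal sketch, assuming a linear hypothesis h(x) = theta·x, squared-error cost, and a made-up fixed learning rate alpha; not code from the course.)

```python
import numpy as np

def sgd_update(theta, x, y, alpha=0.01):
    """One stochastic gradient descent step on a single incoming example.

    Assumes a linear hypothesis h(x) = theta . x and squared-error cost.
    A fixed (non-decaying) learning rate lets theta keep drifting along
    with the data, so the model adapts as behavior slowly changes.
    """
    error = x.dot(theta) - y            # h(x) - y for this one observation
    return theta - alpha * error * x    # nudge theta toward lower cost

# Toy stream: update on each (x, y) pair as it arrives, then discard it --
# no need to store the full data set or retrain from scratch.
theta = np.zeros(3)
for x, y in [(np.array([1.0, 2.0, 0.5]), 3.0),
             (np.array([1.0, 1.5, 1.0]), 2.5)]:
    theta = sgd_update(theta, x, y)
```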
Saturday, May 17, 2014
What if you want your algorithm to adapt as new data comes in?
Friday, May 16, 2014
Learning algorithm taking too long?
1) map reduce
For batch gradient descent, each iteration requires computing the predicted value for each observation and calculating the difference from the actual value.
Map reduce basically splits the observations over multiple servers so that these calculations can be done in parallel, with the partial results summed back together (a sketch follows below).
How about splitting over multiple cores in the same server?
If you are using vectorization with a good linear algebra package, the matrix multiplication is already done in parallel for you. :D
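A minimal sketch of the map reduce idea, splitting the gradient sum over chunks of data (the four-way split, function names, and linear hypothesis are my own illustration):

```python
import numpy as np

def partial_gradient(X_chunk, y_chunk, theta):
    """'Map' step: one server's share of the batch gradient, i.e. the sum
    over its observations of (h(x) - y) * x, with h(x) = theta . x."""
    errors = X_chunk.dot(theta) - y_chunk
    return X_chunk.T.dot(errors)

def batch_gradient_mapreduce(X, y, theta, n_servers=4):
    """'Reduce' step: add the partial sums back together and divide by m.
    The result is identical to computing the gradient on one machine."""
    m = len(y)
    chunks = zip(np.array_split(X, n_servers), np.array_split(y, n_servers))
    partial_sums = [partial_gradient(Xc, yc, theta) for Xc, yc in chunks]
    return sum(partial_sums) / m
```

In a real setup each chunk would live on a different server (or core); the point is that the batch gradient is just a big sum over observations, so the sum can be split anywhere and recombined.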
2) Instead of using batch gradient descent, randomly shuffle your data and use stochastic gradient descent / mini-batch gradient descent
With m observations, each having an output y and n features denoted by a vector x of size n, gradient descent is the process of tuning the learning algorithm's parameter vector, theta, bit by bit (iteration by iteration) to minimize the cost, J, which is the sum of squared differences between the predicted output, h, and the expected output, y.
Tuning theta towards the minimum cost requires knowing how the cost changes as theta changes, in other words, the partial derivative dJ/dtheta at the current point.
For batch, this requires computing h - y for ALL observations for each descent (tuning iteration). Each iteration will result in a lower cost until the minimum is reached.
For stochastic, it only requires computing h - y for ONE observation for each descent. Each iteration will NOT necessarily result in a lower cost. However, running several passes over the entire set of observations should land AROUND the cost minimum.
For mini-batch, it computes h - y for SEVERAL observations (a mini-batch) for each descent. It takes advantage of (requires) vectorization for parallel processing to achieve an even shorter time. Perhaps the batch size should be dictated by the number of processor cores.
Start with mini-batch to get theta. Then feed the theta into batch?
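To make batch vs. stochastic vs. mini-batch concrete, a rough sketch for the linear, squared-error setup described above (the learning rate, batch size, and epoch count are illustrative guesses, not prescriptions):

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.01, batch_size=32, epochs=10):
    """Gradient descent for h(x) = theta . x with squared-error cost J.

    batch_size = m      -> batch gradient descent (ALL observations per step)
    batch_size = 1      -> stochastic gradient descent (ONE observation per step)
    1 < batch_size < m  -> mini-batch (SEVERAL observations per step, vectorized)
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = np.random.permutation(m)            # randomize the data each pass
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            errors = X[idx].dot(theta) - y[idx]     # h - y for this mini-batch
            grad = X[idx].T.dot(errors) / len(idx)  # average gradient over the batch
            theta -= alpha * grad                   # one descent step
    return theta
```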
Do you have too much (not enough) data?
You can never have too much data. :D
But when R cannot allocate memory / the learning algorithm takes way too long... :(
PLOT THE LEARNING CURVES
i.e.
Cost on the training set as a function of training set size
Cost on the cross-validation/test set as a function of training set size
If the 2 curves have already converged, your model is not overfitting the data (not suffering from high variance);
It doesn't necessarily mean your model is good enough (that depends on the cost/error rate the curves converge to, and can be addressed with more features or, in the case of a neural network, more nodes and hidden layers to build a better model). It only means adding more data to your current model won't help.
If the 2 curves haven't converged, your model is overfitting the data (suffering from high variance). Having more data should help.
So, if you think you have too much data (actually, it's your system / implementation that's subpar), randomly sample a subset and plot the learning curves to decide whether you should spend time building a better model or feeding more data to your current model.
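A rough sketch of how I'd plot these curves, using ordinary least squares as a stand-in for the actual learning algorithm and simulated data so it runs end to end (all of that scaffolding is my own, not from the course):

```python
import numpy as np
import matplotlib.pyplot as plt

def cost(theta, X, y):
    """Squared-error cost of the linear hypothesis h(x) = theta . x."""
    errors = X.dot(theta) - y
    return (errors ** 2).mean() / 2

def learning_curves(X_train, y_train, X_cv, y_cv, step=10):
    """Train on growing subsets of the (shuffled) training set and record
    training cost vs. cross-validation cost at each subset size."""
    sizes, train_cost, cv_cost = [], [], []
    for m in range(step, len(y_train) + 1, step):
        # Stand-in "training": ordinary least squares on the first m observations.
        theta, *_ = np.linalg.lstsq(X_train[:m], y_train[:m], rcond=None)
        sizes.append(m)
        train_cost.append(cost(theta, X_train[:m], y_train[:m]))
        cv_cost.append(cost(theta, X_cv, y_cv))
    return sizes, train_cost, cv_cost

# Simulated linear data just to make the script runnable end to end.
rng = np.random.default_rng(0)
X = rng.normal(size=(250, 3))
y = X.dot(np.array([1.5, -2.0, 0.5])) + rng.normal(scale=0.5, size=250)
sizes, train_cost, cv_cost = learning_curves(X[:200], y[:200], X[200:], y[200:])

plt.plot(sizes, train_cost, label="training set")
plt.plot(sizes, cv_cost, label="cross-validation set")
plt.xlabel("training set size")
plt.ylabel("cost")
plt.legend()
plt.show()
```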
Side note: if you need a powerful server, renting a cloud server (e.g., EC2 at Amazon) could be an option.
Thursday, May 15, 2014
How good is your classification model?
One way is accuracy, i.e., the % of correctly classified observations
Downside:
Say your test is to identify whether a patient suffers from a rare disease that is found in 1% of the population.
If your test is 99% accurate, is your test doing well?
Maybe, maybe not.
If my test simply classifies all patients as negative, I'll achieve 99% accuracy, but I can hardly say my test is good.
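A tiny numeric check of that point, with simulated labels (my own toy example):

```python
import numpy as np

# 1,000 patients; only 1% actually have the rare disease (label 1).
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A "test" that just calls every patient negative.
y_pred = np.zeros(1000, dtype=int)

accuracy = (y_pred == y_true).mean()         # 0.99 -- looks impressive...
recall = (y_pred[y_true == 1] == 1).mean()   # 0.0  -- ...but it finds no one
print(accuracy, recall)
```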
To get around this, we can use precision and recall.
From Wikipedia:
Precision is the probability that a (randomly selected) retrieved document is relevant.
Recall is the probability that a (randomly selected) relevant document is retrieved in a search.
The F1 score combines the two into a single number:
F1 = 2 * precision * recall / (precision + recall)
Example
When a search engine returns 30 pages only 20 of which were relevant while failing to return 40 additional relevant pages, its precision is 20/30 = 2/3 while its recall is 20/60 = 1/3.
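Worked through in code (the variable names are mine):

```python
relevant_retrieved = 20   # relevant pages the engine actually returned
retrieved = 30            # total pages returned
relevant_total = 60       # the 20 returned plus the 40 relevant pages it missed

precision = relevant_retrieved / retrieved          # 20/30 = 2/3
recall = relevant_retrieved / relevant_total        # 20/60 = 1/3
f1 = 2 * precision * recall / (precision + recall)  # 4/9, about 0.44
print(precision, recall, f1)
```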