Stochastic gradient descent
Good choice if:
You have a continuous stream of data (way too much data to retrain the algorithm on all of it)
Your data's behavior slowly changes over time
(Say, TV ads used to have a dominant effect on sales, but as the number of online users grows, the TV ad effect isn't as dominant anymore.)
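A minimal sketch in R of what that online/stochastic update could look like for linear regression (the squared-error cost, the learning rate alpha, and the function name sgd_step are my assumptions, not from the post): each time a new observation arrives, theta gets nudged using only that observation.

# One stochastic gradient descent step for linear regression
# x: feature vector (with a leading 1 for the intercept), y: observed output
# theta: current parameter vector, alpha: learning rate
sgd_step <- function(theta, x, y, alpha = 0.01) {
  h <- sum(theta * x)           # predicted output for this observation
  theta - alpha * (h - y) * x   # nudge theta against the gradient
}
# As new data streams in: theta <- sgd_step(theta, x_new, y_new)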
Saturday, May 17, 2014
What if you want your algorithm to adapt as new data comes in?
Friday, May 16, 2014
Learning algorithm taking too long?
1) map reduce
For batch gradient descent, each iteration requires computing the predicted value for each observation and calculating the difference from the actual value.
Map reduce is basically splitting the observations over multiple servers so that the calculations can be done in parallel.
How about splitting over multiple cores in the same server?
If you are using vectorization with a good linear algebra package, the matrix multiplication is already done in parallel for you. :D
2) Instead of using batch gradient descent, randomly shuffle your data and use stochastic gradient descent / mini-batch gradient descent
With m observations, each having output y and n features denoted by a vector x of size n, gradient descent is the process of tuning the learning algorithm's parameter vector, theta, bit by bit (iteration by iteration) to minimize the cost, J, which is the sum of squared differences between the predicted output, h, and the expected output, y.
Tuning theta towards the minimum cost requires knowing how the cost changes as theta changes, in other words, the partial derivative dJ/dTheta at the current point.
For batch, this requires computing h-y for ALL observations for each descent (tuning iteration). Each iteration will result in a lower cost until the minimum is reached.
For stochastic, it only requires computing h-y for ONE observation for each descent. Each iteration will NOT necessarily result in a lower cost. However, running several passes over the entire set of observations should land AROUND the cost minimum.
For mini-batch, it computes h-y for SEVERAL observations (a mini-batch) for each descent. It takes advantage of (requires) vectorization for parallel processing to achieve an even shorter time. Perhaps the batch size should be dictated by the number of processor cores.
Start with mini-batch to get a rough theta, then feed that theta into batch gradient descent?
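Here's a rough R sketch of the three update rules for linear regression with squared error (variable names and the learning rate are placeholders of mine). The vectorized batch step also shows the earlier point about parallel matrix multiplication: X %*% theta computes h for ALL observations in one shot.

# X: m x n design matrix (first column all 1s), y: length-m output vector
# theta: length-n parameter vector, alpha: learning rate
batch_step <- function(theta, X, y, alpha) {
  h <- X %*% theta                                  # predictions for ALL m observations
  as.vector(theta - alpha * t(X) %*% (h - y) / nrow(X))
}
stochastic_step <- function(theta, x_i, y_i, alpha) {
  h <- sum(x_i * theta)                             # prediction for ONE observation
  theta - alpha * (h - y_i) * x_i
}
minibatch_step <- function(theta, X_b, y_b, alpha) {
  h <- X_b %*% theta                                # predictions for a SMALL batch
  as.vector(theta - alpha * t(X_b) %*% (h - y_b) / nrow(X_b))
}
# For stochastic / mini-batch, shuffle the rows first, then loop:
# for (i in sample(nrow(X))) theta <- stochastic_step(theta, X[i, ], y[i], alpha)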
Do you have too much (or not enough) data?
There's never too much data. :D
But when R cannot allocate memory / the learning algorithm takes way too long... :(
PLOT THE LEARNING CURVES
i.e.
Cost on the training set as a function of training set size
Cost on the cross-validation/test set as a function of training set size
If the two curves have already converged, your model is NOT overfitting the data (not suffering from high variance);
It doesn't necessarily mean your model is good enough (that depends on the cost/error rate the learning curves converge to, and can be addressed with more features or, in the case of a neural network, more nodes and hidden layers to build a better model). It only means adding more data to your current model won't help.
If the two curves don't converge, your model is overfitting the data (suffering from high variance). Getting more data should help.
So, if you think you have too much data (actually, it's your system/implementation that's subpar), randomly sample a subset and plot the learning curves to decide whether you should spend time building a better model or feeding more data into your current one.
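A sketch of how one might plot these curves in R for a plain linear model (lm(), the column name y, and the mse() helper are my assumptions; substitute your own model and cost):

# Mean squared error of a fitted model on a data set
mse <- function(fit, data) mean((predict(fit, newdata = data) - data$y)^2)

learning_curves <- function(train, cv, sizes) {
  err_train <- err_cv <- numeric(length(sizes))
  for (i in seq_along(sizes)) {
    sub <- train[seq_len(sizes[i]), ]      # first m rows of the (shuffled) training set
    fit <- lm(y ~ ., data = sub)           # assumed model; swap in your own
    err_train[i] <- mse(fit, sub)          # training cost at this size
    err_cv[i]    <- mse(fit, cv)           # cross-validation cost at this size
  }
  plot(sizes, err_cv, type = "l", col = "red", ylim = range(err_train, err_cv),
       xlab = "training set size", ylab = "cost")
  lines(sizes, err_train, col = "blue")
  legend("topright", c("cv/test", "train"), col = c("red", "blue"), lty = 1)
}
# learning_curves(train, cv, sizes = seq(50, nrow(train), by = 50))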
Side note: if you need a powerful server, renting a cloud server (e.g., EC2 at Amazon) could be an option.
Thursday, May 15, 2014
How good is your classification model?
One way is accuracy, i.e., the % of correctly classified observations.
Downside:
Say your test is to identify whether a patient suffers from a rare disease that is found in 1% of the population.
If your test is 99% accurate, is your test doing well?
Maybe, maybe not.
If my test simply classifies all patients as negative, I'll achieve 99% accuracy, but I can hardly say my test is good.
To get around this, we can use precision and recall.
From Wikipedia:
Precision is the probability that a (randomly selected) retrieved document is relevant.
Recall is the probability that a (randomly selected) relevant document is retrieved in a search.
The F1 score combines the two into a single number:
F1 = 2 * precision * recall / (precision + recall)
Example
When a search engine returns 30 pages only 20 of which were relevant while failing to return 40 additional relevant pages, its precision is 20/30 = 2/3 while its recall is 20/60 = 1/3.
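Plugging the numbers in: precision = 20/30 ≈ 0.67, recall = 20/60 ≈ 0.33, so F1 = 2*(2/3)*(1/3)/(2/3 + 1/3) = 4/9 ≈ 0.44. A tiny R helper (my own sketch; tp/fp/fn are counts of true positives, false positives and false negatives):

prf <- function(tp, fp, fn) {
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1        <- 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, f1 = f1)
}
prf(tp = 20, fp = 10, fn = 40)   # search engine example: 0.667, 0.333, 0.444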
Wednesday, April 30, 2014
SVM vs logistic regression vs neural network
When to use which one?
First of all, logistic regression is very similar to an SVM with a linear kernel, and the two can be used interchangeably in practice.
Look at the number of features (n) vs. the number of training examples (m):
If n >= m, you don't have many training examples, so there's no point in using a complex algorithm to overfit the small amount of data. Use logistic regression / a linear SVM.
If n is small (say under 1k) and m is intermediate (~10k, or even more if you have enough resources / don't mind waiting...):
Use a Gaussian-kernel SVM so we can fit a complex boundary.
If m is huge: add features on your own and then use logistic regression / a linear SVM.
How about neural network?
It works for all the scenarios above, just quite a bit slower.
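For reference, a sketch of what the three options look like in R, assuming the e1071 package for SVMs (the toy data frame and the package choice are my assumptions):

library(e1071)                     # provides svm(); assumed package choice
set.seed(1)
df <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
df$label <- factor(ifelse(df$x1^2 + df$x2^2 > 1.5, "pos", "neg"))   # toy non-linear boundary

fit_logit  <- glm(label ~ ., data = df, family = binomial)          # logistic regression
fit_linsvm <- svm(label ~ ., data = df, kernel = "linear")          # ~ interchangeable with logit
fit_rbfsvm <- svm(label ~ ., data = df, kernel = "radial")          # Gaussian kernel, complex boundary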
Wednesday, April 23, 2014
back propagation in neural network
http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm
If you are wondering how we get those equations:
% WHAT DO WE WANT?
% WE WANT dJ/dTheta2 and dJ/dTheta1 for gradient descent
% ie. how much does the cost change as the theta (weights) change
% J = -y* log(h) - (1-y)* log (1-h)
% = (y-1)*log(1-h) - y*log(h)
% = (y-1)*log(1-g) - y*log(g)
% where h = g = g(zL3) and zL3 = Theta2*aL2
%
% dJ/dTheta2 = (dJ/dzL3) * dzL3/dTheta2
%
% dJ/dzL3 = (dJ/dg) * dg/dzL3
% dJ/dg = ((y-1)/(1-g))*(-1) - y/g
% = (1-y)/(1-g) - y/g
% dg/dzL3 = g*(1-g)
% dJ/dzL3 = g*(1-y) - y*(1-g)
% = g- yg - y + yg
% = g-y
%
% dzL3/dTheta2 = aL2
%
% dJ/dTheta2 = (dJ/dzL3) * dzL3/dTheta2
% = (g - y) * aL2
%
% dJ/dTheta1 is a bit more tricky
% dJ/dTheta1 = dJ/dzL2 * dzL2/dTheta1
%
% 1st term
% dJ/dzL2 = dJ/dzL3 * dzL3/dzL2
% zL3 = Theta2 * aL2
% = Theta2 * g(zL2)
% dzL3/dzL2 = dzL3/dg(zL2) * dg(zL2)/dzL2
% = Theta2 * g*(1-g) where g = g(zL2)
% dJ/dzL2 = dJ/dzL3 * dzL3/dzL2
% = dJ/dzL3 * Theta2 * g*(1-g)
% = [dJ/dzL3 * Theta2] * g'(zL2)
% note that in [dJ/dzL3 * Theta2], dJ/dzL3 is the "error term" from the next layer, and we back-propagate it by means of Theta2 to get the weighted combination
% dJ/dTheta1 = dJ/dzL2 * dzL2/dTheta1
% = [dJ/dzL3 * Theta2] * g'(zL2) * aL1
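The same chain rule as a small numeric sketch in R, for a single training example and with the bias units left out to keep the shapes simple (layer sizes and random weights are just placeholders):

g <- function(z) 1 / (1 + exp(-z))        # sigmoid

set.seed(1)
a1 <- matrix(rnorm(3))                    # input activations (3 x 1)
y  <- matrix(c(1, 0))                     # expected output   (2 x 1)
Theta1 <- matrix(rnorm(4 * 3), 4, 3)      # input  -> hidden  (4 x 3)
Theta2 <- matrix(rnorm(2 * 4), 2, 4)      # hidden -> output  (2 x 4)

# forward pass
z2 <- Theta1 %*% a1;  a2 <- g(z2)         # hidden layer aL2
z3 <- Theta2 %*% a2;  a3 <- g(z3)         # output layer, h = g(zL3)

# backward pass (the derivation above)
d3 <- a3 - y                              # dJ/dzL3 = g - y
d2 <- (t(Theta2) %*% d3) * a2 * (1 - a2)  # dJ/dzL2 = [Theta2' * dJ/dzL3] .* g'(zL2)
grad_Theta2 <- d3 %*% t(a2)               # dJ/dTheta2 = (g - y) * aL2'
grad_Theta1 <- d2 %*% t(a1)               # dJ/dTheta1 = dJ/dzL2 * aL1'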
Sunday, April 6, 2014
Too many features /predictors?
Principal component analysis (PCA)
If # of observations > # of features/predictors,
you can also:
Run a regression on all the features and pick only the ones that are significant.
Or
Regularize by adding the sum of squared regression parameters to the cost to penalize big parameters (ridge-style).
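A sketch of these routes in R: prcomp() for PCA, a plain lm() for "regress on everything and keep the significant features", and ridge-style regularization via the glmnet package (glmnet, the toy data, and the 95% variance cutoff are my assumptions):

set.seed(1)
X <- matrix(rnorm(100 * 20), 100, 20)     # toy data: 100 observations, 20 features
y <- X[, 1] - 2 * X[, 2] + rnorm(100)

# Option 1: PCA, keep enough components to explain ~95% of the variance
pca  <- prcomp(X, center = TRUE, scale. = TRUE)
k    <- which(cumsum(pca$sdev^2) / sum(pca$sdev^2) >= 0.95)[1]
Xred <- pca$x[, 1:k]                      # reduced feature set

# Option 2: regress on all features and inspect which coefficients are significant
summary(lm(y ~ X))

# Option 3: ridge regularization (penalizes the sum of squared parameters)
library(glmnet)
fit <- cv.glmnet(X, y, alpha = 0)         # alpha = 0 -> ridge penalty
coef(fit, s = "lambda.min")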