Wednesday, April 30, 2014

SVM vs logistic regression vs neural network

When to use which one?

First of all, logistic regression is very similar to an SVM with a linear kernel, and in practice the two can often be used interchangeably.

Look at the number of features (n) vs. the number of training examples (m):
n >= m: you don't have many training examples, so there is no point in using a complex algorithm that will overfit the small amount of data. Use logistic regression / linear SVM.
n is small (say under 1k) and m is intermediate (~10k, or even more if you have enough resources / don't mind waiting): use a Gaussian-kernel SVM so you can fit a complex decision boundary.
m is huge: add features on your own and then use logistic regression / linear SVM.

How about a neural network?
It works in all of these scenarios, just quite a bit slower to train.
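
Since logistic regression is the default answer in two of the three cases, here is a minimal sketch of it in Octave, fit by batch gradient descent. The toy data, learning rate, and iteration count are all made up for illustration:

g = @(z) 1 ./ (1 + exp(-z));          % sigmoid
X = [1 2; 2 1; 4 5; 5 4];             % m x n feature matrix (made-up data)
y = [0; 0; 1; 1];                     % labels
X = [ones(rows(X), 1) X];             % prepend the intercept column
theta = zeros(columns(X), 1);
alpha = 0.1;                          % learning rate (assumed)
for iter = 1:5000
  h = g(X * theta);                   % predicted probabilities
  grad = X' * (h - y) / rows(X);      % gradient of the cross-entropy cost J
  theta = theta - alpha * grad;       % gradient descent step
end
disp(round(g(X * theta)))             % predictions on the training data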

Wednesday, April 23, 2014

Back propagation in a neural network

Have a read here:
http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm

If you are wondering how we get those equations:

% WHAT DO WE WANT?
% WE WANT dJ/dTheta2 and dJ/dTheta1 for gradient descent
% ie.  how much does the cost change as the theta (weights) change

% J     = -y* log(h) - (1-y)* log (1-h)
%     = (y-1)*log(1-h) - y*log(h)
%     = (y-1)*log(1-g) - y*log(g)
% where h = g = g(zL3) and zL3 = Theta2*aL2
%
% dJ/dTheta2    = (dJ/dzL3) * dzL3/dTheta2
%
%     dJ/dzL3    = (dJ/dg) * dg/dzL3
%         dJ/dg    = ((y-1)/(1-g))*(-1) - y/g
%                 = (1-y)/(1-g) - y/g
%         dg/dzL3    = g*(1-g)
%     dJ/dzL3    = [(1-y)/(1-g) - y/g] * g*(1-g)
%             = g*(1-y) - y*(1-g)
%             = g - yg - y + yg
%             = g - y
%            
%     dzL3/dTheta2    = aL2
%
% dJ/dTheta2    = (dJ/dzL3) * dzL3/dTheta2
%             = (g - y) * aL2


%
% dJ/dTheta1 is a bit more tricky
% dJ/dTheta1 = dJ/dzL2 * dzL2/dTheta1
%
    % 1st term
    % dJ/dzL2    = dJ/dzL3 * dzL3/dzL2
        % zL3    = Theta2 * aL2
        %        = Theta2 * g(zL2)
        % dzL3/dzL2    = dzL3/dg(zL2) * dg(zL2)/dzL2
        %            = Theta2 * g*(1-g)    where g = g(zL2)
    % dJ/dzL2    = dJ/dzL3 * dzL3/dzL2
    %             = dJ/dzL3 * Theta2 * g*(1-g)
    %             = [dJ/dzL3 * Theta2] * g'(zL2)
	% note that in [dJ/dzL3 * Theta2], dJ/dzL3 is the "error term" from the next layer; we back-propagate it through Theta2 to get a weighted sum of the errors
% dJ/dTheta1     = dJ/dzL2 * dzL2/dTheta1
%                 = [dJ/dzL3 * Theta2] * g'(zL2) * aL1
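
To convince yourself the algebra is right, here is a small numeric sketch in Octave. The layer sizes and toy values are made up, bias units are omitted to match the derivation above, and the analytic gradient is compared against a finite-difference estimate:

g = @(z) 1 ./ (1 + exp(-z));                     % sigmoid

aL1 = [0.5; -1.2; 0.8];                          % 3 toy input activations
Theta1 = 0.1 * randn(4, 3);                      % layer 1 -> layer 2 weights
Theta2 = 0.1 * randn(1, 4);                      % layer 2 -> layer 3 weights
y = 1;                                           % target label

% forward pass
zL2 = Theta1 * aL1;   aL2 = g(zL2);
zL3 = Theta2 * aL2;   h = g(zL3);

% backward pass, using the results derived above
dJ_dzL3    = h - y;                              % = g - y
dJ_dTheta2 = dJ_dzL3 * aL2';                     % = (g - y) * aL2
dJ_dzL2    = (Theta2' * dJ_dzL3) .* g(zL2) .* (1 - g(zL2));  % error propagated through Theta2
dJ_dTheta1 = dJ_dzL2 * aL1';                     % outer product with the inputs

% finite-difference check on one weight of Theta2
J = @(T2) -y * log(g(T2 * aL2)) - (1 - y) * log(1 - g(T2 * aL2));
epsilon = 1e-6;
T2p = Theta2;  T2p(1) = T2p(1) + epsilon;
T2m = Theta2;  T2m(1) = T2m(1) - epsilon;
numgrad = (J(T2p) - J(T2m)) / (2 * epsilon);
disp([numgrad, dJ_dTheta2(1)])                   % the two numbers should agree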

Sunday, April 6, 2014

Too many features / predictors?

Principal components analysis
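
A minimal PCA sketch in Octave via the SVD; the toy data and the number of retained components k are made up for illustration:

X = randn(100, 10);                % 100 observations, 10 features (toy data)
Xc = X - mean(X);                  % center each column
[U, S, V] = svd(Xc, 0);            % columns of V are the principal directions
k = 3;                             % keep the top 3 components (assumed)
Z = Xc * V(:, 1:k);                % reduced k-dimensional representation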

If the number of observations > the number of features / predictors, you can also use:
Regression on all the features,
then pick only the features that are significant.
Or
Regularize the features by adding the sum of squared regression parameters to the cost, which penalizes big parameters.
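
That second option (penalizing the sum of squared parameters) is ridge regression. A minimal sketch of its closed-form solution in Octave, with made-up data and an arbitrarily chosen lambda:

X = randn(50, 5);                            % 50 observations, 5 predictors (toy data)
beta_true = [2; 0; -1; 0; 0.5];
y = X * beta_true + 0.1 * randn(50, 1);
lambda = 1.0;                                % regularization strength (assumed)
beta_ridge = (X' * X + lambda * eye(5)) \ (X' * y);  % closed-form ridge solution
beta_ols   = (X' * X) \ (X' * y);            % plain least squares, for comparison
disp([beta_ols beta_ridge])                  % ridge shrinks the coefficients toward zero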