Wednesday, April 30, 2014

SVM vs logistic regression vs neural network

When to use which one?

First of all, logistic regression is very similar to an SVM with a linear kernel, and the two can be used interchangeably in practice.

Look at the number of features (n) vs. the number of training examples (m):

n >= m: you don't have many training examples, so there is no point in using a complex algorithm to overfit the small amount of data. Use logistic regression / linear SVM.
n is small (say under 1k) and m is intermediate (~10k, or even more if you have enough resources / don't mind waiting...): use a Gaussian-kernel SVM so we can fit a complex boundary.
m is huge: add features on your own and then use logistic regression / linear SVM.

How about a neural network?
It works for all of these scenarios, just quite a bit slower.
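For concreteness, here is a loose sketch of these options in R on simulated data. The package choices (e1071 for the SVMs, nnet for the neural network) and all variable names are my own assumptions, not from the post.

# Minimal sketch, assuming the e1071 and nnet packages are installed;
# the data set is simulated purely for illustration.
library(e1071)
library(nnet)

set.seed(1)
df <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
df$y <- factor(ifelse(df$x1^2 + df$x2^2 > 1.5, 1, 0))          # non-linear boundary

logit      <- glm(y ~ x1 + x2, data = df, family = binomial)   # logistic regression
svm_linear <- svm(y ~ x1 + x2, data = df, kernel = "linear")   # ~ interchangeable with logit
svm_rbf    <- svm(y ~ x1 + x2, data = df, kernel = "radial")   # Gaussian kernel, complex boundary
nn         <- nnet(y ~ x1 + x2, data = df, size = 5, trace = FALSE)  # works everywhere, slower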

Wednesday, April 23, 2014

back propagation in neural network

Have a read here:
http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm

If you are wondering how we get those equations:

% WHAT DO WE WANT?
% WE WANT dJ/dTheta2 and dJ/dTheta1 for gradient descent
% ie.  how much does the cost change as the theta (weights) change

% J     = -y* log(h) - (1-y)* log (1-h)
%     = (y-1)*log(1-h) - y*log(h)
%     = (y-1)*log(1-g) - y*log(g)
% where h = g = g(zL3) and zL3 = Theta2*aL2
%
% dJ/dTheta2    = (dJ/dzL3) * dzL3/dTheta2
%
%     dJ/dzL3    = (dJ/dg) * dg/dzL3
%         dJ/dg    = ((y-1)/(1-g))*(-1) - y/g
%                 = (1-y)/(1-g) - y/g
%         dg/dzL3    = g*(1-g)
%     dJ/dzL3    = [(1-y)/(1-g) - y/g] * g*(1-g)
%             = g*(1-y) - y*(1-g)
%             = g - yg - y + yg
%             = g - y
%            
%     dzL3/dTheta2    = aL2
%
% dJ/dTheta2    = (dJ/dzL3) * dzL3/dTheta2
%             = (g - y) * aL2


%
% dJ/dTheta1 is a bit more tricky
% dJ/dTheta1 = dJ/dzL2 * dzL2/dTheta1
%
    % 1st term
    % dJ/dzL2    = dJ/dzL3 * dzL3/dzL2
        % zL3    = Theta2 * aL2
        %        = Theta2 * g(zL2)
        % dzL3/dzL2    = dzL3/dg(zL2) * dg(zL2)/dzL2
        %            = Theta2 * g*(1-g)    where g = g(zL2)
    % dJ/dzL2    = dJ/dzL3 * dzL3/dzL2
    %             = dJ/dzL3 * Theta2 * g*(1-g)
    %             = [dJ/dzL3 * Theta2] * g'(zL2)
    % note that in [dJ/dzL3 * Theta2], dJ/dzL3 is the "error term" from the next layer and we back-propagate it by means of Theta2 to get the weighted average
% dJ/dTheta1     = dJ/dzL2 * dzL2/dTheta1
%                 = [dJ/dzL3 * Theta2] * g'(zL2) * aL1
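To sanity-check the algebra, here is a small numeric sketch in R for a single training example with sigmoid activations (bias terms omitted; the layer sizes and data are made up, and the names simply mirror the derivation above):

sigmoid <- function(z) 1 / (1 + exp(-z))

set.seed(1)
aL1    <- matrix(rnorm(3), nrow = 3)         # input activations (3 units)
Theta1 <- matrix(rnorm(4 * 3), nrow = 4)     # layer 1 -> layer 2 weights
Theta2 <- matrix(rnorm(1 * 4), nrow = 1)     # layer 2 -> layer 3 weights
y <- 1

zL2 <- Theta1 %*% aL1;  aL2 <- sigmoid(zL2)  # forward pass
zL3 <- Theta2 %*% aL2;  g   <- sigmoid(zL3)

deltaL3 <- g - y                                      # dJ/dzL3 = g - y
dTheta2 <- deltaL3 %*% t(aL2)                         # dJ/dTheta2 = (g - y) * aL2
deltaL2 <- (t(Theta2) %*% deltaL3) * aL2 * (1 - aL2)  # error back-propagated via Theta2
dTheta1 <- deltaL2 %*% t(aL1)                         # dJ/dTheta1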

Sunday, April 6, 2014

Too many features /predictors?

Principal components analysis

If the # of observations > the # of features/predictors, you can also:
Run a regression on all the features and pick only the ones that are significant.
Or
Regularize: add the sum of squared regression parameters to the cost to penalize big parameters (ridge regression).
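A rough sketch of these two alternatives in R on made-up data (glmnet is just one common choice for the ridge penalty; nothing below is from the original post):

library(glmnet)

set.seed(1)
df <- data.frame(y = rnorm(100), matrix(rnorm(100 * 5), nrow = 100))

# 1) Fit on everything, then keep only the significant predictors
full <- lm(y ~ ., data = df)
summary(full)$coefficients          # inspect the p-values, drop the rest

# 2) Ridge regression: penalize the sum of squared coefficients (alpha = 0)
x   <- model.matrix(y ~ ., data = df)[, -1]
fit <- cv.glmnet(x, df$y, alpha = 0)
coef(fit, s = "lambda.min")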

Thursday, March 27, 2014

Unsupervised learning - PCA

We have a lot of features per observation and would like to understand/visualize/summarize them. PCA explains most of the variability in the data by
generating principal components, which are linear combinations of the features that capture the most variability.
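A minimal prcomp sketch on simulated data (the matrix X and everything else here is illustrative only):

set.seed(1)
X <- matrix(rnorm(100 * 10), nrow = 100)   # 100 observations, 10 features

pc <- prcomp(X, scale. = TRUE)
summary(pc)          # proportion of variance explained by each component
pc$rotation[, 1]     # loadings: the linear combination of features defining PC1
head(pc$x[, 1:2])    # scores on the first two components, e.g. for plotting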

Friday, March 14, 2014

Matrix and vector review and application on linear regression

Matrix is just an array of numbers
  • Take a 2d array with 2 rows and 3 columns for example 
  • It'll be a 2 by 3 matrix, or 2x3 matrix. 
  • Row dimension then column dimension 
  • (Mnemonic: Matrix is Really Cool, i.e. Rows then Columns)

Vector is just a 1-column matrix

Matrix addition and subtraction
  • The matrices have to be of same dimensions. 
  • Just add or subtract element by element and result in a matrix with same dimension

Matrix multiplication
  • Number of columns in 1st has to = number of rows in 2nd
  • Say 2x3 * 3x4 will result in a 2x4 matrix 
  • With each element = the sum of products between the corresponding row of the 1st matrix and column of the 2nd
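For example, in R (small matrices made up just to show the dimension rules):

A <- matrix(1:6,  nrow = 2)   # 2x3
B <- matrix(1:12, nrow = 3)   # 3x4
A + A                         # same dimensions: element-wise sum, still 2x3
dim(A %*% B)                  # 2x3 times 3x4 gives 2x4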

Application on linear regression
  • With Model:  h(x) = t0 + t1 * x
  • Say now, we've a bunch of x's and would like to get the corresponding y's
  • One way to program it is to loop:

for (i in 1:n) {
  y[i] <- t0 + t1 * x[i]
}
  • Another way is to do it with matrix multiplication, which is more computation efficient: 
| 1 x0 |  *  | t0 |
| 1 x1 |     | t1 |


= | t0 + t1*x0 |  
  | t0 + t1*x1 |

i.e.: data matrix * parameter vector = prediction vector

What if you have multiple hypotheses?
Then,
data matrix * parameter matrix  = prediction matrix 

Data matrix - each row corresponds to a set of data
Parameter matrix - each column corresponds to the set of parameters for a given hypothesis 
Prediction matrix - each column corresponds to the set of predictions for a given hypothesis 
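In R, both cases are a single %*% call (the numbers below are made up for illustration):

x <- c(2, 5, 8)
X <- cbind(1, x)                   # data matrix: a column of 1s plus the x's

theta <- c(1, 0.5)                 # one hypothesis: t0 = 1, t1 = 0.5
X %*% theta                        # prediction vector: t0 + t1 * x for every row

Theta <- cbind(c(1, 0.5),          # parameter matrix: one column of
               c(0, 2),            # parameters per hypothesis
               c(-1, 1))
X %*% Theta                        # prediction matrix: one column per hypothesis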

I = identity matrix: it has a diagonal of 1s and 0s elsewhere
It's the equivalent of 1 for real numbers; whatever * 1 = whatever
I * A = A * I = A

A^-1 = inverse matrix of A
A * A^-1 = I
Just like the inverse in real numbers: whatever * the inverse of itself = 1

A^T = matrix transpose of A
Rows of A become columns of A^T
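The same ideas in R, with an arbitrary invertible 2x2 matrix chosen just for illustration:

A <- matrix(c(2, 1, 1, 3), nrow = 2)
I <- diag(2)       # identity matrix
A %*% I            # equals A
Ainv <- solve(A)   # inverse of A
A %*% Ainv         # equals I (up to floating-point rounding)
t(A)               # transpose: rows of A become columns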

Friday, March 7, 2014

State-of-the-art prediction models

Random forest
1) Simple decision tree - easy to interpret, but not that accurate in predictions
2) Bagging
- generate many data sets by bootstrapping the training data, grow a tree on each data set; the new model = the average of the outcomes of the individual trees
- the idea is that with n observations, each having variance sigma squared, the variance of the mean of the n observations will be sigma squared / n
- i.e. we can reduce the model variance by averaging the results of many trees
3) Random forest
- the bagged trees described above are highly correlated and thus don't reduce the variance that much
- random forest reduces the correlation among trees by only considering a random subset of the predictors (typically the square root of the full set) at each split
- by reducing the correlation among trees, we reduce more of the model variance
- wouldn't restricting the predictors at each split hurt tree accuracy? Not really, since every predictor still gets a chance at some depth of the tree, and this arrangement gives predictors that would otherwise be dominated by stronger predictors a better chance to be used (see the sketch after this list)
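A hedged sketch of both with the randomForest package (mtry is the number of predictors tried at each split; the data and settings below are made up for illustration):

library(randomForest)

set.seed(1)
df <- data.frame(y = rnorm(200), matrix(rnorm(200 * 9), nrow = 200))

bag <- randomForest(y ~ ., data = df, mtry = 9, ntree = 500)  # mtry = all 9 predictors ~ bagging
rf  <- randomForest(y ~ ., data = df, mtry = 3, ntree = 500)  # mtry ~ sqrt(9): de-correlated trees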

Boosting
Sequentially fit shrunken trees to the RESIDUAL:
Build a tree with depth = 1, say.
Shrink the tree by a factor of lambda (0.01, say) and add the shrunken tree to the model.
Calculate the residual and fit a tree on the new residual.
Repeat the above for the number of trees you want.
The idea is that we learn slowly, correcting the residuals a little at a time because each tree is shrunken.
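A rough gbm sketch of this recipe (stumps, small shrinkage, many trees); the simulated data and the specific settings are my own assumptions for illustration:

library(gbm)

set.seed(1)
df <- data.frame(y = rnorm(200), matrix(rnorm(200 * 9), nrow = 200))

boost <- gbm(y ~ ., data = df, distribution = "gaussian",
             n.trees = 5000, interaction.depth = 1, shrinkage = 0.01)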

Wednesday, March 5, 2014

How to turn a quantitative variable into a qualitative one easily in R?

Use the indicator function I().
The example below turns wage > 250 into 1, and 0 otherwise.

glm(I(wage>250) ~ poly(age,3), data=Wage, family=binomial)