Thursday, March 27, 2014

Unsupervised learning - PCA

We have many features per observation and would like to visualize/summarize them in a way that explains most of the variability in the data.
PCA does this by generating principal components: linear combinations of the features that explain the most variability.
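
For a quick feel, a minimal sketch in base R using prcomp() on the built-in USArrests data (the same data set ISLR's PCA lab uses):

pr.out <- prcomp(USArrests, scale. = TRUE)  # scale. = TRUE standardizes the features first
pr.out$rotation            # loadings: each PC as a linear combination of the features
summary(pr.out)            # proportion of variance explained by each component
biplot(pr.out, scale = 0)  # observations and loadings in one plot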

Friday, March 14, 2014

Matrix and vector review and application to linear regression

Matrix is just an array of numbers
  • Take a 2d array with 2 rows and 3 columns for example 
  • It'll be a 2 by 3 matrix, or 2x3 matrix. 
  • Row dimension first, then column dimension 
  • (mnemonic: "Matrix is Really Cool", i.e. Rows then Columns)

Vector is just a 1-column matrix

Matrix addition and subtraction
  • The matrices have to be of the same dimensions. 
  • Just add or subtract element by element; the result is a matrix with the same dimensions

Matrix multiplication
  • The number of columns in the 1st has to equal the number of rows in the 2nd
  • Say 2x3 * 3x4, which results in a 2x4 matrix 
  • Element (i, j) = the sum of products (dot product) of row i of the 1st matrix and column j of the 2nd
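
A quick sanity check in R, where %*% is matrix multiplication:

A <- matrix(1:6, nrow = 2)   # a 2x3 matrix
B <- matrix(1:12, nrow = 3)  # a 3x4 matrix
C <- A %*% B                 # conformable: columns of A = rows of B
dim(C)                       # 2 4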

Application on linear regression
  • With the model:  h(x) = t0 + t1 * x
  • Say we now have a bunch of x's and would like to get the corresponding predictions
  • One way to program it is to loop (in R, which indexes from 1):

for (i in 1:n)
  y[i] <- t0 + t1 * x[i]
  • Another way is to do it with matrix multiplication, which is more computationally efficient: 
| 1 x0 |  *  | t0 |  =  | t0 + t1*x0 |
| 1 x1 |     | t1 |     | t0 + t1*x1 |

i.e.: data matrix * parameter vector = prediction vector
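
In R, a minimal sketch with made-up x values and parameters:

x  <- c(1, 2, 3)        # made-up feature values
t0 <- 0.5; t1 <- 2      # made-up parameters
X  <- cbind(1, x)       # data matrix: a column of 1s, then the x values
theta <- c(t0, t1)      # parameter vector
y.hat <- X %*% theta    # prediction vector, no explicit loop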

What if you have multiple hypotheses?
Then,
data matrix * parameter matrix  = prediction matrix 

Data matrix - each row corresponds to one observation
Parameter matrix - each column corresponds to the set of parameters for a given hypothesis 
Prediction matrix - each column corresponds to the set of predictions for a given hypothesis 
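
Continuing the sketch above, with one (made-up) column of parameters per hypothesis:

Theta <- cbind(h1 = c(0.5, 2), h2 = c(1, -1))  # one hypothesis per column
Y.hat <- X %*% Theta    # prediction matrix: column j = hypothesis j's predictions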

I = identity matrix: 1s on the diagonal, 0s elsewhere 
It's equivalent to 1 in real numbers; whatever * 1 = whatever 
I * A = A * I = A

A^-1 = inverse matrix of A
A * A^-1 = I 
Just like the inverse in real numbers: whatever * inverse of itself = 1
(Note: only square matrices can have inverses, and not every square one does; a square matrix with no inverse is called singular)

A^T = matrix transpose of A
Rows of A become columns of A^T
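
A small R illustration of all three, using base functions diag(), solve(), and t():

A  <- matrix(c(2, 1, 1, 3), nrow = 2)  # a square, invertible matrix
I2 <- diag(2)                          # 2x2 identity matrix
A %*% I2                               # same as A
Ainv <- solve(A)                       # solve() with one argument returns the inverse
A %*% Ainv                             # identity, up to floating-point rounding
t(A)                                   # transpose: rows become columns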

Friday, March 7, 2014

State-of-the-art prediction models

Random forest
1) simple decision tree - easy to interpret, but not that accurate in predictions
2) bagging
- generate many data sets by bootstrapping the training data, grow a tree on each data set, and take the new model = the average of the individual trees' outputs
- idea is that with n independent observations, each having variance sigma squared, the variance of the mean of the n observations is sigma squared / n
- i.e. we can reduce the model variance by averaging the results of many trees
3) Random forest
- the bagged trees described above are highly correlated (they tend to split on the same strong predictors), so averaging them doesn't reduce the variance that much
- random forest reduces the correlation among trees by considering only a random subset of predictors (typically the square root of the full set) at each split 
- by reducing the correlation among trees, we reduce more of our model variance
- wouldn't it hurt tree accuracy to reduce the predictors at each split? Not really: every predictor still gets a chance at some depth of the tree, and this arrangement gives predictors that would otherwise be dominated by stronger ones a chance to contribute
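
A minimal sketch, assuming the randomForest package and the Boston housing data from MASS (the setup ISLR's lab uses):

library(randomForest)  # assumes install.packages("randomForest") has been run
library(MASS)          # for the Boston housing data
set.seed(1)
# mtry = number of predictors considered at each split;
# sqrt(p) is the usual choice for classification, p/3 for regression
rf.fit <- randomForest(medv ~ ., data = Boston, mtry = 4, importance = TRUE)
rf.fit                 # prints the out-of-bag (OOB) error estimate
importance(rf.fit)     # which predictors matter most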

Boosting
Sequentially fit shrunken trees to the RESIDUAL
Build a tree with depth = 1, say
Shrink the tree by a factor lambda (0.01, say) and add the shrunken tree to the model
Calculate the residuals and fit a new tree to them
Repeat the above for the number of trees you want
Idea is that we learn slowly, correcting a bit of the remaining residual with each shrunken tree
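
A minimal sketch, assuming the gbm package and the same Boston data:

library(gbm)   # assumes the gbm package is installed
library(MASS)  # Boston housing data
set.seed(1)
# interaction.depth = 1 grows stumps; shrinkage is the lambda above,
# so each stump fit to the current residuals is added with weight 0.01
boost.fit <- gbm(medv ~ ., data = Boston, distribution = "gaussian",
                 n.trees = 5000, interaction.depth = 1, shrinkage = 0.01)
summary(boost.fit)  # relative influence of each predictor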

Wednesday, March 5, 2014

How to turn a quantitative variable into a qualitative one easily in R?

Use the indicator function I().
The example below codes wage > 250 as 1 and everything else as 0.

glm(I(wage>250) ~ poly(age,3), data=Wage, family=binomial)


Is one model "SIGNIFICANTLY" better than another? use ANOVA

If model A has lower MSE (mean squared error) than model B does, A is better, right?

Not necessarily: your training set is a sample, not the population, so your models don't have a complete picture.

Now what do we do?
Use ANOVA (analysis of variance).

"When given a single argument it produces a table which tests whether the model terms are significant.

When given a sequence of objects, anova tests the models against one another in the order specified."
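
For example, with ISLR's Wage data, nested polynomial fits can be compared like this:

library(ISLR)  # for the Wage data
fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
# each row tests whether the added terms improve significantly
# on the previous, simpler model
anova(fit.1, fit.2, fit.3)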

why (not) linear regression?

why?
  • simple
  • easy to interpret (more important than you can imagine, especially when you need to explain the model to someone)

why not?
  • the relationship is never linear (well, almost never)
    • example: wage usually varies roughly linearly with age, but it flattens out after a certain age
  • try a smoothing spline, which captures the non-linearity  
    • Limit the degrees of freedom to avoid excessive model variance (i.e. overfitting) 
      • smooth.spline(age,wage,df=16) 
    • "we can use LOO cross-validation to select the smoothing parameter for us automatically" 
      • smooth.spline(age,wage,cv=TRUE)
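
Putting those calls into a runnable sketch (assuming ISLR's Wage data; smooth.spline may warn about tied age values, which is harmless here):

library(ISLR)
fit    <- with(Wage, smooth.spline(age, wage, df = 16))    # fixed degrees of freedom
fit.cv <- with(Wage, smooth.spline(age, wage, cv = TRUE))  # df chosen by LOO CV
fit.cv$df  # effective degrees of freedom the cross-validation picked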

Monday, March 3, 2014

"What do we mean by the variance and bias of a statistical learning method?"

"Variance refers to the amount by which ˆ f would change if we
estimated it using a different training data set. Since the training data
are used to fit the statistical learning method, different training data sets
will result in a different ˆ f. But ideally the estimate for f should not vary
too much between training sets. However, if a method has high variance
then small changes in the training data can result in large changes in ˆ f. In
general, more flexible statistical methods have higher variance."

"On the other hand, bias refers to the error that is introduced by approximating
a real-life problem, which may be extremely complicated, by a much
simpler model. For example, linear regression assumes that there is a linear
relationship between Y and X1,X2, . . . , Xp. It is unlikely that any real-life
problem truly has such a simple linear relationship, and so performing linear
regression will undoubtedly result in some bias in the estimate of f."

From ISLR
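
A tiny simulation (my own sketch, not from ISLR, with a made-up true f) makes the variance point concrete: refit a rigid and a flexible model on many fresh training sets and see how much their predictions at one fixed point move around.

set.seed(1)
f  <- function(x) sin(2 * x)   # made-up non-linear "true" f
x0 <- 1                        # the point where we watch f-hat
preds.lin <- preds.flex <- numeric(200)
for (i in 1:200) {
  x <- runif(50, 0, 3)         # a fresh training set each iteration
  y <- f(x) + rnorm(50, sd = 0.3)
  preds.lin[i]  <- predict(lm(y ~ x), data.frame(x = x0))
  preds.flex[i] <- predict(smooth.spline(x, y, df = 20), x = x0)$y
}
var(preds.flex) / var(preds.lin)  # flexible fit: much higher variance
mean(preds.lin) - f(x0)           # rigid linear fit: systematic bias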