Thursday, March 27, 2014
Unsupervised learning - PCA
We have a lot of features per observation and would like to understand, visualize, and summarize them in a way that explains most of the variability in the data. We do this by generating principal components, which are linear combinations of the features that explain the most variability.
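A minimal sketch of this in R using base R's prcomp; USArrests is just a convenient illustrative data set:
pr <- prcomp(USArrests, scale. = TRUE)   # standardize the features before extracting components
summary(pr)                              # proportion of variance explained by each component
biplot(pr)                               # plot the data on the first two principal components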
Friday, March 14, 2014
Matrix and vector review and application on linear regression
- Take a 2d array with 2 rows and 3 columns for example
- It'll be a 2 by 3 matrix, or 2x3 matrix.
- Row dimension then column dimension
- (Matrix is Really Cool)
- Addition/subtraction: the matrices have to be of the same dimensions
- Just add or subtract element by element; the result is a matrix with the same dimensions
- Multiplication: the number of columns in the 1st matrix has to equal the number of rows in the 2nd
- e.g. a 2x3 matrix times a 3x4 matrix results in a 2x4 matrix
- Each element = the sum of products between a row of the 1st matrix and a column of the 2nd
- With Model: h(x) = t0 + t1 * x
- Say now we have a bunch of x's and would like to compute the corresponding y's
- One way to program it is to loop:
y[i] = t0 + t1* x[i]
- Another way is to do it with matrix multiplication, which is more computationally efficient (see the R sketch after this list):
| y1 |   | 1 x1 |
| y2 | = | 1 x2 | * | t0 |
| y3 |   | 1 x3 |   | t1 |
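A minimal sketch of both points in R; the t0, t1 and x values below are made up for illustration:
# dimension rule: a 2x3 matrix times a 3x4 matrix gives a 2x4 matrix
A <- matrix(1:6, nrow = 2, ncol = 3)
B <- matrix(1:12, nrow = 3, ncol = 4)
dim(A %*% B)                        # 2 4

# h(x) = t0 + t1 * x, the loop way
t0 <- 1; t1 <- 2
x <- c(0.5, 1.0, 1.5)
y <- numeric(length(x))
for (i in seq_along(x)) y[i] <- t0 + t1 * x[i]

# the matrix way: a design matrix with a column of 1s times the parameter vector
X <- cbind(1, x)                    # 3x2
theta <- c(t0, t1)                  # 2x1
y2 <- X %*% theta                   # same values as y, computed in one step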
Friday, March 7, 2014
State-of-the-art prediction models
Random forest
1) simple decision tree - easy to interpret, but not that accurate in predictions
2) bagging
- generate many data sets by bootstrapping the training data, grow a tree on each data set; the new model = the average of the outcomes of the individual trees.
- idea is that with n observations, each having variance sigma squared, the variance of the mean of n observations will be sigma squared / n
- i.e. we can reduce the model variance by averaging the results of many trees
3) Random forest
- the trees described above with bagging are highly correlated and thus don't reduce the variance that much
- random forest reduces correlation among trees by only considering a random subset of predictors (typically square root of full set) at each split
- by reducing correlation among trees, we reduce more of our model variance
- wouldn't reducing the predictors at each split hurt the accuracy of each tree? Not really: every predictor still gets a chance at some depth of the tree, and this arrangement gives predictors that would otherwise be dominated by stronger predictors a better chance to be used (see the R sketch below).
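A minimal sketch in R, assuming the randomForest package and using the Boston housing data from MASS purely as an example:
library(MASS)                 # Boston data: medv (response) plus 13 predictors
library(randomForest)
set.seed(1)
rf <- randomForest(medv ~ ., data = Boston,
                   mtry = floor(sqrt(13)),   # random subset of predictors tried at each split
                   ntree = 500)
rf                            # prints the out-of-bag estimate of the error
# bagging is the special case mtry = 13, i.e. consider all predictors at every split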
Boosting
Sequential shrunken tree on the RESIDUAL
Build a tree with depth = 1, say
Shrink the tree by a factor of lambda (0.01 , say) and add the shrunken tree to the model
Calculate the residual and fit a tree on the new residual.
Repeat the above for the number of trees you want
The idea is that we learn and correct the residual slowly, since each tree fit on the residuals is shrunken before being added (a sketch follows)
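A minimal sketch of that loop in R, assuming the rpart package for the depth-1 trees; the function and parameter names (boost_sketch, n_trees, lambda) are illustrative, and in practice a package like gbm does this for you:
library(rpart)

boost_sketch <- function(x, y, n_trees = 1000, lambda = 0.01) {
  r <- y                                   # start with residual = y
  pred <- rep(0, length(y))
  for (b in seq_len(n_trees)) {
    d <- data.frame(x = x, r = r)
    stump <- rpart(r ~ x, data = d,        # depth-1 tree fit on the current residual
                   control = rpart.control(maxdepth = 1, cp = 0))
    step <- lambda * predict(stump, d)     # shrink the tree by a factor of lambda
    pred <- pred + step                    # add the shrunken tree to the model
    r <- r - step                          # update the residual for the next tree
  }
  pred
}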
Wednesday, March 5, 2014
How to turn a quantitative variable into a qualitative one easily in R?
The example below codes wage > 250 as 1 and 0 otherwise.
glm(I(wage>250) ~ poly(age,3), data=Wage, family=binomial)
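A quick way to see what that wrapper produces, assuming the Wage data from the ISLR package as in the book's labs:
library(ISLR)
table(I(Wage$wage > 250))   # FALSE/TRUE counts; glm with family = binomial treats these as 0/1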
Is one model "SIGNIFICANTLY" better than another? Use ANOVA
A model that fits the training data better is not necessarily better, since you have a sample, not the population, as your training set, which means your models don't have the complete picture.
Now what do we do?
Use ANOVA (analysis of variance).
"When given a single argument it produces a table which tests whether the model terms are significant.
When given a sequence of objects, anova tests the models against one another in the order specified."
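For example (the Wage data and the polynomial degrees are just the usual ISLR illustration), fit nested models and hand them to anova in sequence:
library(ISLR)                          # for the Wage data
fit1 <- lm(wage ~ age, data = Wage)
fit2 <- lm(wage ~ poly(age, 2), data = Wage)
fit3 <- lm(wage ~ poly(age, 3), data = Wage)
anova(fit1, fit2, fit3)                # each row tests whether the bigger model is significantly better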
why (not) linear regression?
- simple
- easy to interpret (more important than you can imagine, especially when you need to explain the model to someone)
why not?
- relationship is never linear (well, almost never)
- for example, wage usually varies roughly linearly with age, but it flattens out after a certain age
- try a smoothing spline, which can capture the non-linearity
- Limit the degrees of freedom to avoid excessive model variance (i.e. overfitting)
- smooth.spline(age,wage,df=16)
- "we can use LOO cross-validation to select the smoothing parameter for us automatically"
- smooth.spline(age,wage,cv=TRUE)
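For example, to see what degrees of freedom LOO CV actually picks (again using the ISLR Wage data as the running example):
library(ISLR)
fit <- smooth.spline(Wage$age, Wage$wage, cv = TRUE)
fit$df    # effective degrees of freedom chosen by leave-one-out cross-validation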
Monday, March 3, 2014
"What do we mean by the variance and bias of a statistical learning method?"
"Variance refers to the amount by which f̂ would change if we estimated it using a different training data set. Since the training data are used to fit the statistical learning method, different training data sets will result in a different f̂. But ideally the estimate for f should not vary too much between training sets. However, if a method has high variance then small changes in the training data can result in large changes in f̂. In general, more flexible statistical methods have higher variance."
"On the other hand, bias refers to the error that is introduced by approximating
a real-life problem, which may be extremely complicated, by a much
simpler model. For example, linear regression assumes that there is a linear
relationship between Y and X1,X2, . . . , Xp. It is unlikely that any real-life
problem truly has such a simple linear relationship, and so performing linear
regression will undoubtedly result in some bias in the estimate of f."
From ISLR
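A small simulated illustration of the variance point; the true function, noise level, and the df = 20 spline below are all made up for illustration. Refit a flexible model and a simple linear one on many independent training sets and watch how much their predictions at a single point move around:
set.seed(1)
f <- function(x) sin(2 * pi * x)                 # hypothetical "true" f
pred_lin <- pred_spl <- numeric(200)
for (i in 1:200) {                               # 200 independent training sets
  x <- runif(100)
  y <- f(x) + rnorm(100, sd = 0.3)
  pred_lin[i] <- predict(lm(y ~ x), data.frame(x = 0.5))
  pred_spl[i] <- predict(smooth.spline(x, y, df = 20), x = 0.5)$y
}
var(pred_lin)   # the simple fit is stable across training sets (but biased: the truth is not linear)
var(pred_spl)   # the flexible fit varies much more from one training set to the next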