Tuesday, January 28, 2014

How precise are your regression estimators?

Short answer: standard error

In detail:
Let's say we want to estimate the population mean from sample data.
It makes sense to expect the sample mean to be close to the population mean.
Furthermore, the bigger our sample gets, the smaller we'd expect the error of our estimate to be.
Taking it one step further, if we drew many sample sets and their means didn't differ much, we'd expect the error of the estimate to be smaller, and vice versa.

So, roughly speaking, the standard error of the mean estimate is
* proportional to how much each sample value deviates from the sample mean (actually, the square root of the sum of squared differences, with n - 1 in the denominator inside the root)
* inversely proportional to the square root of the sample size (√n, not n - 2; n - 2 shows up in regression below)
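The two bullets above amount to SE = s / √n, with s the sample standard deviation. A minimal Python sketch (the blog's later examples use R, but the formula is the same):

```python
import math

def se_of_mean(xs):
    """Standard error of the sample mean: s / sqrt(n),
    where s is the sample standard deviation (n - 1 in the denominator)."""
    n = len(xs)
    mean = sum(xs) / n
    # numerator grows with how much each value deviates from the sample mean
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    # denominator grows with the square root of the sample size
    return s / math.sqrt(n)

print(se_of_mean([2.0, 4.0, 6.0, 8.0]))  # more spread in the data -> larger SE
```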

How about the standard error of slope estimate in linear regression?

Roughly speaking, the standard error of the slope estimate is
* proportional to how much each regression-predicted value deviates from the observed value (actually, the square root of the sum of squared residuals divided by n - 2)
* inversely proportional to the square root of the sum of squared differences of the PREDICTOR from its mean.
In other words, the wider the range of your predictor, the smaller your standard error.
It actually makes sense. Think about it: slope = (y2 - y1) / (x2 - x1).
If y2 and/or y1 is off a little bit,
* how much is the slope going to change if x2 is far away from x1?
* how much is the slope going to change if x2 is very close to x1?
So, if you are doing a controlled experiment, choose a wider range of predictor values.
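To see the effect of the predictor range, here's a small Python sketch (the 2x + 1 line and the noise values are made up for illustration) that fits the same noisy line on a narrow and a wide design:

```python
import math

def slope_and_se(xs, ys):
    """OLS slope and its standard error for simple linear regression:
    SE(slope) = sqrt(SSR / (n - 2)) / sqrt(sum((x - xbar)^2))."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    ssr = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, math.sqrt(ssr / (n - 2)) / math.sqrt(sxx)

# same noise pattern, but the second design spreads the predictor wider
noise = [0.5, -0.3, 0.2, -0.4, 0.1, -0.1]
narrow_x = [10.0, 10.2, 10.4, 10.6, 10.8, 11.0]
wide_x = [0.0, 4.0, 8.0, 12.0, 16.0, 20.0]
for xs in (narrow_x, wide_x):
    ys = [2 * x + 1 + e for x, e in zip(xs, noise)]
    b, se = slope_and_se(xs, ys)
    print(f"slope={b:.3f}  se={se:.3f}")
```

The wide design gives a much smaller standard error, even though the noise is identical.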

Closely related is the t-statistic, which is defined as
estimated value / standard error
With a given t-statistic, we can look up a corresponding p-value, which is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true.
The 95% confidence interval is roughly the estimated value +- 2 * standard error.
If the interval contains 0, you cannot reject the null hypothesis.
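As a quick Python sketch (the estimate 1.8 and standard error 0.4 are hypothetical numbers; the ±2 rule is the normal approximation, and an exact interval would use the t distribution with n - 2 degrees of freedom):

```python
def t_stat_and_ci(estimate, std_error):
    """t-statistic and a rough 95% confidence interval (estimate +/- 2*SE)."""
    t = estimate / std_error
    return t, (estimate - 2 * std_error, estimate + 2 * std_error)

# hypothetical slope estimate of 1.8 with standard error 0.4
t, (lo, hi) = t_stat_and_ci(1.8, 0.4)
print(t, lo, hi)  # the interval excludes 0, so we can reject the null at ~5%
```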

Sunday, January 26, 2014

How to determine how good your regression model is?

It comes down to the model variance / model bias trade-off.

What's model variance?
Let's have our model fit the data as closely as possible.
If we randomly pick a certain % of our data and fit it as closely as possible, we have model 1.
We do this over and over, n times, and end up with n models.
The model variance is basically the variance of the predicted value (y) of these n models for a given x.
Typically, if we use a less flexible model, say linear, the model variance is going to be smaller.

What's model bias?
That's the difference between the average predicted y from the n models for a given x and the actual y for the same x.
Typically, if we use a less flexible model, say linear, the model bias is going to be bigger.

We'd like to arrive at a better model by trading off model variance against model bias. We don't want our model to be so flexible that it overfits the data and introduces a huge model variance, but we also don't want it to be so rigid that the predicted values end up too far from the actual values.
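The repeated-sampling procedure above can be sketched in Python (the quadratic target function, noise level, and sample sizes are arbitrary choices for illustration):

```python
import random

random.seed(0)

def true_f(x):
    return x ** 2  # the (usually unknown) target function

def fit_linear(pts):
    """Least-squares line through a sample of (x, y) points."""
    n = len(pts)
    xbar = sum(x for x, _ in pts) / n
    ybar = sum(y for _, y in pts) / n
    sxx = sum((x - xbar) ** 2 for x, _ in pts)
    b = sum((x - xbar) * (y - ybar) for x, y in pts) / sxx
    a = ybar - b * xbar
    return lambda x: a + b * x

# draw n_models random training sets, fit a model on each,
# then look at the spread of predictions at one fixed x0
x0, n_models, preds = 1.5, 200, []
for _ in range(n_models):
    sample = [(x, true_f(x) + random.gauss(0, 0.1))
              for x in (random.uniform(0, 2) for _ in range(30))]
    preds.append(fit_linear(sample)(x0))

mean_pred = sum(preds) / n_models
variance = sum((p - mean_pred) ** 2 for p in preds) / n_models  # model variance
bias = mean_pred - true_f(x0)                                   # model bias
print(variance, bias)
```

The rigid linear model has a small variance across resamples but a systematic bias at x0, since a line can't track the quadratic target.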

How can you tell if you are overfitting?
Use out-of-sample data.
Let's say you fit the training data and come up with a model with a very small mean squared error on the training data.
Have this model predict the out-of-sample data and compare with the actual results. If you get a large mean squared error on the out-of-sample data, you are probably overfitting your training data.
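A minimal Python demonstration (the straight-line data and the interpolating-polynomial "model" are made-up illustrations of a maximally flexible fit):

```python
import random

random.seed(1)

def lagrange_fit(pts):
    """Interpolating polynomial through every training point: a maximally
    flexible model with zero training error."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(pts):
            term = yi
            for j, (xj, _) in enumerate(pts):
                if i != j:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

def mse(model, pts):
    return sum((model(x) - y) ** 2 for x, y in pts) / len(pts)

# noisy samples from a straight line y = 2x + 1
data = [(x, 2 * x + 1 + random.gauss(0, 0.3)) for x in [0, 1, 2, 3, 4, 5, 6, 7]]
train, test = data[::2], data[1::2]

flexible = lagrange_fit(train)
print(mse(flexible, train))  # ~0: a perfect fit on the training data
print(mse(flexible, test))   # larger: the model also chased the noise
```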

Wednesday, January 8, 2014

MySQL Stored Procedures

visit: http://dev.mysql.com/tech-resources/articles/mysql-storedprocedures.pdf

Friday, January 3, 2014

Linux VM + MySQL + R + DBI


  • Linux VM 
    • get it @c9.io
  • MySQL
    • type mysql on c9.io terminal and you'll have mysql server access
  • DBI
    • from R, install.packages('DBI')
  • RMySQL
    • wget 'http://cran.r-project.org/src/contrib/RMySQL_0.9-3.tar.gz'
    • R CMD INSTALL RMySQL_0.9-3.tar.gz 
Try it from R:
> library(DBI)
> library(RMySQL)
> con <- dbConnect(MySQL(), user="USER", dbname="DB", host="HOSTNAME")
> dbGetQuery(con, "select curdate() from dual;")
   curdate()
1 2014-01-03