Friday, May 16, 2014

Learning algorithm taking too long?

1) Map reduce
For batch gradient descent, each iteration requires computing the predicted value for every observation and calculating its difference from the actual value.
Map reduce basically splits the observations across multiple servers so that these calculations can be done in parallel.
How about splitting across multiple cores in the same server?
If you are using vectorization with a good linear algebra package, the matrix multiplication is already done in parallel for you. :D
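To make the idea concrete, here is a minimal sketch of the map-reduce split using NumPy and Python's multiprocessing module; the function and variable names are illustrative, not from any particular framework. Each worker computes the gradient contribution of its chunk of observations (the "map"), and the partial sums are then added together (the "reduce").

import numpy as np
from multiprocessing import Pool

def partial_gradient(chunk):
    # "Map" step: gradient contribution of one chunk of observations.
    X_chunk, y_chunk, theta = chunk
    h = X_chunk @ theta                  # vectorized predictions for the chunk
    return X_chunk.T @ (h - y_chunk)     # sum of (h - y) * x over the chunk

def mapreduce_gradient(X, y, theta, n_workers=4):
    # Split the observations into one chunk per worker.
    chunks = [(Xc, yc, theta)
              for Xc, yc in zip(np.array_split(X, n_workers),
                                np.array_split(y, n_workers))]
    with Pool(n_workers) as pool:
        partials = pool.map(partial_gradient, chunks)
    # "Reduce" step: add the partial sums and average over the m observations.
    return sum(partials) / len(y)

In a real map-reduce setup each chunk would live on a different server; the pool of worker processes here just stands in for that.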

2) Instead of using batch gradient descent, randomize your data and use stochastic gradient descent / mini-batch gradient descent

With m observations, each having an output y and n features denoted by a vector x of size n, gradient descent is the process of tuning the learning algorithm's parameter vector theta bit by bit (iteration by iteration) to minimize the cost J, which is the sum of squared differences between the predicted output h and the expected output y.

Tuning theta towards the minimum cost requires knowing how the cost changes as theta changes, in other words, the partial derivative dJ/dTheta at the current point.
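Written out for a linear hypothesis (assuming the common 1/(2m) scaling of the cost, which only changes the effective learning rate):

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2, \qquad h_\theta(x^{(i)}) = \theta^T x^{(i)}

\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}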

For batch, this requires computing h - y for ALL observations for each descent step (tuning iteration). Each iteration will result in a lower cost until the minimum is reached.
For stochastic, it only requires computing h - y for ONE observation per descent step. Each iteration will NOT necessarily result in a lower cost. However, running several passes over the entire (randomized) set of observations should arrive somewhere AROUND the cost minimum.
For mini-batch, it computes h - y for SEVERAL observations (a mini-batch) per descent step. It takes advantage of (requires) vectorization for parallel processing to achieve an even shorter time. Perhaps the batch size should be dictated by the number of processor cores. A sketch contrasting the three follows below.
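A minimal NumPy sketch of the same idea, assuming a linear hypothesis h = X @ theta (names and default values are illustrative): batch_size = m gives batch gradient descent, batch_size = 1 gives stochastic, and anything in between is mini-batch.

import numpy as np

def gradient_descent(X, y, theta, alpha=0.01, batch_size=32, epochs=10):
    # batch_size = len(y) -> batch GD, 1 -> stochastic GD, otherwise mini-batch.
    m = len(y)
    for _ in range(epochs):
        perm = np.random.permutation(m)           # randomize the data each pass
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            h = Xb @ theta                         # vectorized predictions for the mini-batch
            grad = Xb.T @ (h - yb) / len(idx)      # average of (h - y) * x over the mini-batch
            theta = theta - alpha * grad           # one descent step
    return theta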

Start with mini-batch to get a rough theta. Then feed that theta into batch gradient descent?
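If that warm-start idea appeals, it could look something like this, reusing the gradient_descent sketch above and assuming X, y, and the feature count n are already loaded:

theta = np.zeros(n)
theta = gradient_descent(X, y, theta, batch_size=32, epochs=5)        # cheap, approximate theta
theta = gradient_descent(X, y, theta, batch_size=len(y), epochs=50)   # refine with full batch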
