Have a read here:
http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm
If you are wondering how we get those equations, here is the derivation:
% WHAT DO WE WANT?
% WE WANT dJ/dTheta2 and dJ/dTheta1 for gradient descent
% i.e. how much the cost changes as the thetas (weights) change
% J = -y* log(h) - (1-y)* log (1-h)
% = (y-1)*log(1-h) - y*log(h)
% = (y-1)*log(1-g) - y*log(g)
% where h = g = g(zL3) and zL3 = Theta2*aL2
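% To make the notation concrete, a minimal Octave sketch of the setup.
% The 2-3-1 layer sizes, random weights, and the missing bias units are
% my own assumptions for illustration; only g, h, zL3, aL2 are from the
% derivation.
g = @(z) 1 ./ (1 + exp(-z));            % sigmoid activation
aL1 = [0.5; -1.2];                      % input activations (layer 1)
Theta1 = 0.1 * randn(3, 2);             % weights, layer 1 -> layer 2
Theta2 = 0.1 * randn(1, 3);             % weights, layer 2 -> layer 3
y = 1;                                  % target label
zL2 = Theta1 * aL1;  aL2 = g(zL2);      % hidden layer
zL3 = Theta2 * aL2;  h = g(zL3);        % output, h = g(zL3)
J = -y * log(h) - (1 - y) * log(1 - h)  % the cost J above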
%
% dJ/dTheta2 = (dJ/dzL3) * dzL3/dTheta2
%
% dJ/dzL3 = (dJ/dg) * dg/dzL3
% dJ/dg = ((y-1)/(1-g))*(-1) - y/g
% = (1-y)/(1-g) - y/g
% dg/dzL3 = g*(1-g)
% dJ/dzL3 = [(1-y)/(1-g) - y/g] * g*(1-g)
%         = g*(1-y) - y*(1-g)
%         = g - yg - y + yg
%         = g - y
%
% dzL3/dTheta2 = aL2
%
% dJ/dTheta2 = (dJ/dzL3) * dzL3/dTheta2
% = (g - y) * aL2
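% As a sanity check, an Octave sketch comparing (g - y) * aL2 against a
% finite-difference estimate of dJ/dTheta2. The vectorized form needs a
% transpose (aL2') so the shapes match; the setup values are, as above,
% illustrative assumptions.
g = @(z) 1 ./ (1 + exp(-z));
aL1 = [0.5; -1.2];  y = 1;
Theta1 = 0.1 * randn(3, 2);  Theta2 = 0.1 * randn(1, 3);
aL2 = g(Theta1 * aL1);  h = g(Theta2 * aL2);
grad2 = (h - y) * aL2';                 % dJ/dTheta2, vectorized
% finite-difference check on the first weight of Theta2
Jfun = @(T2) -y * log(g(T2 * aL2)) - (1 - y) * log(1 - g(T2 * aL2));
e = zeros(size(Theta2));  e(1) = 1e-6;
numgrad = (Jfun(Theta2 + e) - Jfun(Theta2 - e)) / 2e-6;
disp([grad2(1), numgrad])               % the two numbers should agree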
%
% dJ/dTheta1 is a bit trickier
% dJ/dTheta1 = dJ/dzL2 * dzL2/dTheta1
%
% 1st term
% dJ/dzL2 = dJ/dzL3 * dzL3/dzL2
% zL3 = Theta2 * aL2
% = Theta2 * g(zL2)
% dzL3/dzL2 = dzL3/dg(zL2) * dg(zL2)/dzL2
% = Theta2 * g*(1-g) where g = g(zL2)
% dJ/dzL2 = dJ/dzL3 * dzL3/dzL2
% = dJ/dzL3 * Theta2 * g*(1-g)
% = [dJ/dzL3 * Theta2] * g'(zL2)
% note that in [dJ/dzL3 * Theta2], dJ/dzL3 is the "error term" from the next layer; we back-propagate it through Theta2, which gives a weighted sum of the next layer's errors
% dJ/dTheta1 = dJ/dzL2 * dzL2/dTheta1
% = [dJ/dzL3 * Theta2] * g'(zL2) * aL1
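Putting both results together, here is a sketch of one full backward pass in Octave. In the vectorized form the backpropagated term [dJ/dzL3 * Theta2] becomes Theta2' * delta3, and the gradients pick up transposes on aL2 and aL1 so the shapes match; the layer sizes and weights are illustrative assumptions, not something fixed by the derivation.

g = @(z) 1 ./ (1 + exp(-z));
aL1 = [0.5; -1.2];  y = 1;                           % toy example
Theta1 = 0.1 * randn(3, 2);  Theta2 = 0.1 * randn(1, 3);
% forward pass
zL2 = Theta1 * aL1;  aL2 = g(zL2);
zL3 = Theta2 * aL2;  h = g(zL3);
% backward pass, straight from the derivation
delta3 = h - y;                                      % dJ/dzL3 = g - y
delta2 = (Theta2' * delta3) .* (aL2 .* (1 - aL2));   % dJ/dzL2
grad2 = delta3 * aL2';                               % dJ/dTheta2
grad1 = delta2 * aL1';                               % dJ/dTheta1

A gradient-descent step is then Theta2 = Theta2 - alpha * grad2, and likewise for Theta1, which is exactly what we said we wanted at the top.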