Neural Networks (Coursera Machine Learning)

Forward propagation

  • Components of a neural network
    • Neurons = computational units
    • Dendrites = input features (x_0, ..., x_n)
      • Bias unit: x_0 = 1
    • Axons = output; the result of our hypothesis function
  • Visual diagram

    [x_0 x_1 x_2 x_3] → [a_1^{(2)} a_2^{(2)} a_3^{(2)}] → h_θ(x)
    • Layer 1 (aka input layer) = input nodes
    • Layer 2 (aka hidden layer)
      • Let a sigmoid function g be the activation function
      • Node a_i^{(j)} = activation unit i in layer j
      • Matrix Θ^{(j)} = weights for the mapping from layer j to j+1
    • Layer 3 (aka output layer) = hypothesis function
  • Hidden layers enable non-linear hypotheses

    a_1^{(2)} = g(Θ_{10}^{(1)} x_0 + Θ_{11}^{(1)} x_1 + Θ_{12}^{(1)} x_2 + Θ_{13}^{(1)} x_3)
    a_2^{(2)} = g(Θ_{20}^{(1)} x_0 + Θ_{21}^{(1)} x_1 + Θ_{22}^{(1)} x_2 + Θ_{23}^{(1)} x_3)
    a_3^{(2)} = g(Θ_{30}^{(1)} x_0 + Θ_{31}^{(1)} x_1 + Θ_{32}^{(1)} x_2 + Θ_{33}^{(1)} x_3)
    h_Θ(x) = a_1^{(3)} = g(Θ_{10}^{(2)} a_0^{(2)} + Θ_{11}^{(2)} a_1^{(2)} + Θ_{12}^{(2)} a_2^{(2)} + Θ_{13}^{(2)} a_3^{(2)})
    • Each layer gets its own matrix of weights, Θ^{(j)}
    • If the NN has s_j units in layer j and s_{j+1} units in layer j+1, then Θ^{(j)} has dimension s_{j+1} × (s_j + 1)
      • The +1 comes from accounting for the bias node, x_0
      • The input side includes the bias node; the output side does not
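The dimension rule can be sanity-checked in NumPy (a minimal sketch; the 3-3-1 architecture here is a made-up example, not from the lecture):

```python
import numpy as np

# Dimension rule: with s_j units in layer j and s_{j+1} units in layer j+1,
# Θ^{(j)} has shape (s_{j+1}, s_j + 1). The +1 is the bias column.
s = [3, 3, 1]  # hypothetical units per layer, excluding bias units
thetas = [np.zeros((s[j + 1], s[j] + 1)) for j in range(len(s) - 1)]
shapes = [t.shape for t in thetas]
print(shapes)  # [(3, 4), (1, 4)]
```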

Vectorized implementation

  • Let z_k^{(j)} represent node k in layer j:

    z_k^{(j)} = Θ_{k0}^{(j-1)} x_0 + Θ_{k1}^{(j-1)} x_1 + ... + Θ_{kn}^{(j-1)} x_n

    Vectorizing the nodes into a column vector:

    z^{(j)} = Θ^{(j-1)} a^{(j-1)}

    where a^{(1)} = x = [x_0 x_1 ... x_n]^T, and a_0^{(j)} = 1 is added in as the bias unit

  • Then, the next layer’s activation nodes are computed
    • Include bias node for hidden layers
    • Do not include bias node for output layer
    a^{(j)} = g(z^{(j)})
  • So the idea is to repeat z^{(j)} → a^{(j)} → z^{(j+1)} → a^{(j+1)} → ... until we reach a^{(L)} = h_Θ(x)

    a^{(1)} = x
    z^{(2)} = Θ^{(1)} a^{(1)}
    a^{(2)} = g(z^{(2)})   (add a_0^{(2)} = 1)
    z^{(3)} = Θ^{(2)} a^{(2)}
    a^{(3)} = g(z^{(3)})   (add a_0^{(3)} = 1)
    z^{(4)} = Θ^{(3)} a^{(3)}
    a^{(4)} = h_Θ(x) = g(z^{(4)})
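The repeated z → a computation above can be written as a loop. A minimal NumPy sketch, assuming sigmoid activations; the zero weights are just placeholder values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_prop(x, thetas):
    """Repeat z^{(j)} = Θ^{(j-1)} a^{(j-1)}, a^{(j)} = g(z^{(j)}) layer by layer."""
    a = np.insert(x, 0, 1.0)              # a^{(1)} = x, with bias x_0 = 1
    for l, theta in enumerate(thetas):
        z = theta @ a                     # z^{(l+2)} = Θ^{(l+1)} a^{(l+1)}
        a = sigmoid(z)
        if l < len(thetas) - 1:           # add the bias unit, hidden layers only
            a = np.insert(a, 0, 1.0)
    return a                              # a^{(L)} = h_Θ(x)

# Made-up zero weights: every unit then outputs g(0) = 0.5.
thetas = [np.zeros((3, 4)), np.zeros((1, 4))]
h = forward_prop(np.array([1.0, 2.0, 3.0]), thetas)
print(h)  # [0.5]
```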

Multiclass classification

  • To classify data into multiple classes, we want our hypothesis function to return a vector of values
    • Ex: We want to classify an image into one of four categories: pedestrian, car, motorcycle, or truck
    • Each vector represents a class:
    y^{(i)} ∈ { [1 0 0 0]^T, [0 1 0 0]^T, [0 0 1 0]^T, [0 0 0 1]^T }
    • The process:
    [x_0 x_1 x_2 ... x_n] → [a_0^{(2)} a_1^{(2)} a_2^{(2)} ...] → [a_0^{(3)} a_1^{(3)} a_2^{(3)} ...] → ... → [h_Θ(x)_1 h_Θ(x)_2 h_Θ(x)_3 h_Θ(x)_4]
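The one-hot encoding of y^{(i)} can be sketched directly (a small sketch; the `one_hot` helper and the label strings are hypothetical names matching the four-class example above):

```python
import numpy as np

# Hypothetical label set matching the four classes in the example above.
classes = ["pedestrian", "car", "motorcycle", "truck"]

def one_hot(label):
    """Encode a class label as the 0/1 vector y^{(i)} used by the network."""
    y = np.zeros(len(classes))
    y[classes.index(label)] = 1.0
    return y

y = one_hot("motorcycle")               # [0, 0, 1, 0]
predicted = classes[int(np.argmax(y))]  # map an output vector back to a class
```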

Cost function

  • Defining variables:
    • L = # of layers
    • s_l = # of units in layer l, excluding the bias unit
    • K = # of output units (i.e. # of classes)
    • h_Θ(x)_k = the kth output of the hypothesis
  • Recall the cost function for logistic regression:

    J(θ) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(h_θ(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_θ(x^{(i)}))] + \frac{λ}{2m} \sum_{j=1}^{n} θ_j^2
  • The cost function for NN:

    J(Θ) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} [y_k^{(i)} \log((h_Θ(x^{(i)}))_k) + (1 - y_k^{(i)}) \log(1 - (h_Θ(x^{(i)}))_k)] + \frac{λ}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (Θ_{j,i}^{(l)})^2
    • Remember: y here is a vector of 1s and 0s, denoting class
    • First term: for each output node, calculate the cost for that node; sum over all K output nodes, then over all m examples
    • Second term: square each weight, and add them all up
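The two terms can be sketched in NumPy (a hedged sketch; `nn_cost`, `H`, and `Y` are hypothetical names, and the check values are made up):

```python
import numpy as np

def nn_cost(H, Y, thetas, lam):
    """Cross-entropy cost with regularization, following the formula above.

    H: (m, K) hypothesis outputs (h_Θ(x^{(i)}))_k
    Y: (m, K) one-hot labels y_k^{(i)}
    thetas: list of weight matrices Θ^{(l)}; column 0 is the bias column
    """
    m = Y.shape[0]
    # First term: node-wise cross-entropy, summed over nodes and examples.
    cost = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # Second term: square each non-bias weight and add them all up.
    reg = (lam / (2 * m)) * sum(np.sum(t[:, 1:] ** 2) for t in thetas)
    return cost + reg

# Tiny made-up check: one example, one class, a fairly confident correct output.
H = np.array([[0.9]])
Y = np.array([[1.0]])
J = nn_cost(H, Y, [np.zeros((1, 2))], lam=0.0)  # equals -log(0.9)
```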

Backpropagation

  • Def: NN terminology for minimizing the cost function
    • Goal is to find the optimal set of weights, so we need to compute their partials:

      ∂J(Θ) / ∂Θ_{i,j}^{(l)}
  • The algorithm:
    1. Set Δ_{ij}^{(l)} = 0 for all l, i, j; these accumulate the partials. (The weights Θ themselves are randomly initialized, not zero — see Random initialization below.)
    2. Let m denote the number of samples. For t=1:m
      • Set a^{(1)} := x^{(t)}
      • Perform forward propagation to compute a^{(l)} for l = 2, 3, ..., L
    3. Compute the error in each layer.
      • Last layer: δ^{(L)} = a^{(L)} − y^{(t)}
      • Then work backwards: δ^{(L-1)}, δ^{(L-2)}, ..., δ^{(2)} (there is no δ^{(1)}, since the input layer holds no error)
      • For Θ, be sure to exclude the first column (the bias column), since bias units have no incoming connections and therefore hold no error

      δ^{(l)} = ((Θ^{(l)})^T δ^{(l+1)}) .* g'(z^{(l)}) = ((Θ^{(l)})^T δ^{(l+1)}) .* a^{(l)} .* (1 − a^{(l)})
    4. Update capital deltas, the accumulators for adding up partials.

      Δ_{ij}^{(l)} := Δ_{ij}^{(l)} + a_j^{(l)} δ_i^{(l+1)}

      Vectorized: Δ^{(l)} := Δ^{(l)} + δ^{(l+1)} (a^{(l)})^T
    5. Outside the for loop, use capital deltas to calculate partial derivatives.
      • Notice: we do not regularize the bias unit.
      ∂J(Θ) / ∂Θ_{i,j}^{(l)} := \frac{1}{m} (Δ_{i,j}^{(l)} + λ Θ_{i,j}^{(l)})   if j ≠ 0
      ∂J(Θ) / ∂Θ_{i,j}^{(l)} := \frac{1}{m} Δ_{i,j}^{(l)}                        if j = 0
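Steps 1-5 above can be sketched as one NumPy function. This is a minimal sketch, assuming sigmoid activations and the same Θ layout as before; the `backprop` name and the zero-weight check values are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(X, Y, thetas, lam):
    """One pass of the algorithm above: FP, then the δ's, then the Δ accumulators."""
    m = X.shape[0]
    deltas_acc = [np.zeros_like(t) for t in thetas]   # the Δ^{(l)} accumulators
    for t in range(m):
        # Forward propagation, remembering each layer's activations.
        a = np.insert(X[t], 0, 1.0)
        activations = [a]
        for l, theta in enumerate(thetas):
            a = sigmoid(theta @ a)
            if l < len(thetas) - 1:
                a = np.insert(a, 0, 1.0)              # bias for hidden layers
            activations.append(a)
        # δ^{(L)} = a^{(L)} - y^{(t)}
        delta = activations[-1] - Y[t]
        # Walk backwards, accumulating Δ^{(l)} += δ^{(l+1)} (a^{(l)})^T.
        for l in range(len(thetas) - 1, -1, -1):
            a_l = activations[l]
            deltas_acc[l] += np.outer(delta, a_l)
            if l > 0:
                # δ^{(l)} = (Θ^{(l)})^T δ^{(l+1)} .* a^{(l)} .* (1 - a^{(l)})
                delta = (thetas[l].T @ delta) * a_l * (1 - a_l)
                delta = delta[1:]                     # drop the bias component
    # Partials: regularize everything except the bias column (j = 0).
    grads = []
    for l, t_mat in enumerate(thetas):
        g = deltas_acc[l] / m
        g[:, 1:] += (lam / m) * t_mat[:, 1:]
        grads.append(g)
    return grads

# Made-up check: one example, zero weights, so h = 0.5 and δ^{(3)} = -0.5.
thetas = [np.zeros((3, 4)), np.zeros((1, 4))]
grads = backprop(np.array([[1.0, 2.0, 3.0]]), np.array([[1.0]]), thetas, lam=0.0)
```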

Intuition

  • Calculating z_k^{(j)} in FP (see the forward propagation diagram in lecture)
  • δ_j^{(l)} = ∂cost(t) / ∂z_j^{(l)} = the error for a_j^{(l)} (unit j, layer l)
    • i.e.) How much we would like to change the NN weights in order to affect the intermediate values of computation (z), so as to affect the final output h(x) and therefore the overall cost
    • i.e.) How much that node was "responsible" for any errors in the output
      • The steeper the slope, the more incorrect we are
  • Calculating δ_j^{(l)}
    • Multiply the deltas that feed into that node from the right by their respective weights, and add (see the backpropagation diagram in lecture)
  • BP vs. gradient descent
    • BP computes the direction to go downhill, towards the minimum.
    • Gradient descent takes those small steps towards the minimum.

Random initialization

  • Problem of symmetric weights: initializing all theta weights to zero (or any one value) causes all nodes to update to the same value
    • e.g.) Θ_{ij}^{(l)} = 0 ∀ i,j,l ⇒ a_1^{(2)} = a_2^{(2)} ⇒ δ_1^{(2)} = δ_2^{(2)} ⇒ ∂J(Θ)/∂Θ_{01}^{(1)} = ∂J(Θ)/∂Θ_{02}^{(1)} ⇒ a_1^{(2)} = a_2^{(2)} even after gradient descent and BP! That's like saying we have only 1 feature in the NN
  • To break this symmetry, do random initialization
    • Initialize each Θ_{ij}^{(l)} to a random value in [−ϵ, ϵ]
      • The epsilon here is unrelated to epsilon in gradient checking
    • Choose ϵ based on the # of units in the NN: ϵ = √6 / √(s_l + s_{l+1})
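A minimal NumPy sketch of this initialization (the `rand_init` name is hypothetical; the 3-unit sizes are just the example network from earlier):

```python
import numpy as np

def rand_init(s_in, s_out, seed=None):
    """Θ^{(l)} drawn uniformly from [-ε, ε], with ε = √6 / √(s_l + s_{l+1})."""
    eps = np.sqrt(6.0) / np.sqrt(s_in + s_out)
    rng = np.random.default_rng(seed)
    return rng.uniform(-eps, eps, size=(s_out, s_in + 1))  # +1 for the bias column

Theta1 = rand_init(3, 3, seed=0)  # here ε = √6/√6 = 1
```

Because every entry is drawn independently, no two units start with identical weights, which is exactly what breaks the symmetry described above.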

Training a NN

  • Pick a neural architecture: choose how many hidden layers, how many units in each layer, etc.
    • # of input units = dimension of features
    • # of output units = number of classes
    • # of hidden layers = reasonable default is 1
    • # of hidden units = the more the better (but computationally expensive)
      • Usually more than the number of features (up to 3-4 times the number of input units)
      • If you choose >1 hidden layer, use the same # of hidden units in each layer
  1. Randomly initialize weights (Θ close to zero)
  2. Implement FP to get h_Θ(x^{(i)}) for any x^{(i)}
  3. Implement code to compute cost function J(Θ)
  4. Implement BP to compute partial derivatives of J(Θ)
  5. Use gradient descent or advanced optimization method with BP to minimize J(Θ)
  6. Note: J(Θ) is non-convex, so in theory gradient descent / advanced methods can get stuck in a local minimum instead of the global minimum. In practice this is usually not a problem; the local minima found tend to be good enough.
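Steps 1-5 can be put together in one end-to-end sketch. This is a minimal, hedged example assuming NumPy, a single hidden layer, plain batch gradient descent, and made-up XOR-style data; the layer sizes and learning rate are illustrative, not from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])   # made-up XOR-style labels
m = X.shape[0]

# 1. Randomly initialize weights close to zero.
Theta1 = rng.uniform(-0.5, 0.5, (4, 3))      # 2 inputs (+bias) -> 4 hidden units
Theta2 = rng.uniform(-0.5, 0.5, (1, 5))      # 4 hidden (+bias) -> 1 output

def forward(Theta1, Theta2):
    # 2. Forward propagation (rows are examples; bias columns prepended).
    A1 = np.hstack([np.ones((m, 1)), X])
    A2 = np.hstack([np.ones((m, 1)), sigmoid(A1 @ Theta1.T)])
    H = sigmoid(A2 @ Theta2.T)
    return A1, A2, H

def cost(H):
    # 3. Unregularized cross-entropy cost J(Θ).
    return -np.sum(y * np.log(H) + (1 - y) * np.log(1 - H)) / m

alpha = 0.5
_, _, H0 = forward(Theta1, Theta2)
J0 = cost(H0)
for _ in range(500):
    A1, A2, H = forward(Theta1, Theta2)
    # 4. Backpropagation: δ^{(3)}, then δ^{(2)} with the bias column dropped.
    d3 = H - y
    d2 = (d3 @ Theta2)[:, 1:] * A2[:, 1:] * (1 - A2[:, 1:])
    # 5. Gradient descent step using the accumulated partials.
    Theta2 -= alpha * (d3.T @ A2) / m
    Theta1 -= alpha * (d2.T @ A1) / m
_, _, H = forward(Theta1, Theta2)
J_final = cost(H)
```

After training, the cost should be lower than at initialization, which is all the sketch is meant to show.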