Classification (Coursera Machine Learning)

Binary classification

  • Def: \(y\) can take on only two possible values: \(0\) and \(1\)
    • 0 = negative class
    • 1 = positive class
  • Label = the \(y^{(i)}\) that corresponds to a given sample's feature vector \(x^{(i)}\)

Hypothesis representation

  • Since \(y\) only takes the discrete values 0 and 1, our hypothesis should satisfy \(0\leq h(x)\leq 1\) so that it can be read as a probability
    • This is accomplished by plugging the linear hypothesis \(\theta^Tx\) into the logistic function, aka the sigmoid function:
\[\begin{equation} \boxed{ h_{sigmoid} = g(z) = \frac{1}{1+e^{-z}} } \end{equation}\]

                  where \(z = \theta^Tx\), the unbounded linear hypothesis.

[Figure: the sigmoid function (Andrew Ng, Coursera)]

  • The sigmoid function maps any real number to the interval (0,1), as shown in the graph
    • e.g.) \(h_{sigmoid}(x)=0.7 \rightarrow\) there is a 70% probability that the output has label 1 (a small code sketch follows at the end of this section)
\[\begin{equation} P(y=0 | x;\theta) + P(y=1 | x;\theta) =1 \end{equation}\]
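A minimal NumPy sketch of the sigmoid hypothesis; the parameter vector and feature values below are made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z)): maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h(x) = g(theta^T x), read as P(y = 1 | x; theta)."""
    return sigmoid(x @ theta)

# Hypothetical parameters and features (x[0] = 1 is the intercept term).
theta = np.array([0.5, 1.2])
x = np.array([1.0, 0.3])

p1 = h(theta, x)        # roughly 0.70 -> about a 70% probability that y = 1
print(p1, 1.0 - p1)     # P(y=1 | x; theta) and P(y=0 | x; theta) sum to 1
```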

Decision boundary

  • In order to get our discrete 0 or 1 classification, we can translate our sigmoid output as follows:
\[\begin{aligned} h_{sigmoid}(x) \geq 0.5 \rightarrow y=1\\ h_{sigmoid}(x) < 0.5 \rightarrow y=0 \end{aligned}\]
  • Logistic function \(g(z)\) behaves s.t. when \(z\geq 0\), \(g(z) \geq 0.5\)
    • Thus, we can conclude:
\[\begin{aligned} \theta^Tx \geq 0 \rightarrow y=1\\ \theta^Tx < 0 \rightarrow y=0 \end{aligned}\]
  • Decision boundary = the line that separates the areas where \(y=0\) vs. \(y=1\)
    • This boundary is created by our hypothesis function (see the sketch below)
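A small numeric sketch of the decision rule \(\theta^Tx \geq 0 \rightarrow y=1\); the parameter values are chosen only to illustrate a boundary at \(x_1 + x_2 = 3\):

```python
import numpy as np

def predict(theta, X):
    """Predict y = 1 exactly when theta^T x >= 0, i.e. when h_sigmoid(x) >= 0.5."""
    return (X @ theta >= 0).astype(int)

# Illustrative parameters: theta = [-3, 1, 1] puts the decision boundary at x1 + x2 = 3.
theta = np.array([-3.0, 1.0, 1.0])
X = np.array([
    [1.0, 1.0, 1.0],   # x1 + x2 = 2, below the boundary -> y = 0
    [1.0, 2.5, 2.5],   # x1 + x2 = 5, above the boundary -> y = 1
])
print(predict(theta, X))   # [0 1]
```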

Logistic regression

Cost function

  • Using the squared-error cost from linear regression with the sigmoid hypothesis would make \(J(\theta)\) non-convex \(\rightarrow\) multiple local optima \(\rightarrow\) we need a new cost function for logistic regression:
\[\begin{equation} J(\theta)=\frac{1}{m}\sum_{i=1}^mCost(h_{sigmoid}(x^{(i)}),y^{(i)}) \end{equation}\] \[\begin{aligned} \text{where} \begin{cases} Cost(h_{sigmoid}(x), y) &= -\log(h_{sigmoid}(x)) \text{ if } y=1\\ Cost(h_{sigmoid}(x), y) &= -\log(1-h_{sigmoid}(x)) \text{ if } y=0 \end{cases} \end{aligned}\] \[\begin{equation} \downarrow \small\text{combining the two cases into one equation}\\ \end{equation}\] \[\begin{aligned} \boxed{ J(\theta)=-\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}\log(h_{sigmoid}(x^{(i)})) + (1-y^{(i)})\log(1-h_{sigmoid}(x^{(i)})) \right] } \end{aligned}\]
  • Notice:
    • When \(y=0\), the first term becomes zero
    • When \(y=1\), the second term becomes zero
  • Vectorized (a code sketch follows at the end of this section):
\[\begin{aligned} h &= g(X\theta)\\[5pt] J(\theta) &= \frac{1}{m}\left(-y^T\log(h)-(1-y)^T\log(1-h)\right) \end{aligned}\]
  • Plotting the cost against \(h_{sigmoid}(x)\): [Figure: sigmoid hypothesis vs. cost (Andrew Ng, Coursera)]
    • For \(y=1\), the cost decreases as \(h(x)\rightarrow 1\)
    • For \(y=0\), the cost decreases as \(h(x)\rightarrow 0\)
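A sketch of the vectorized cost, assuming a design matrix X whose first column is all ones and labels y in {0, 1}:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Vectorized J(theta) = (1/m) * (-y^T log(h) - (1 - y)^T log(1 - h)) with h = g(X theta)."""
    m = len(y)
    h = sigmoid(X @ theta)
    return float(-y @ np.log(h) - (1.0 - y) @ np.log(1.0 - h)) / m
```

As a quick sanity check, with \(\theta = 0\) every \(h_{sigmoid}(x^{(i)}) = 0.5\), so the cost evaluates to \(\log 2 \approx 0.693\) regardless of the labels.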

Gradient descent

  • Recall the general form of gradient descent:
\[\begin{aligned} &\small\text{Repeat}\{\\ &\theta_j := \theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)\\ &\} \end{aligned}\]
  • If you take partials of the cost function above (using \(g'(z)=g(z)(1-g(z))\)), you end up with the same update rule as in linear regression, just with the sigmoid hypothesis:
\[\begin{aligned} &\small\text{Repeat until convergence} \normalsize\{\\ &\theta_j := \theta_j-\alpha\frac{1}{m}\sum_{i=1}^m\left[(h_{sigmoid}(x^{(i)})-y^{(i)})x_j^{(i)}\right]\\ &\}\\[10pt] \end{aligned}\] \[\begin{equation} \downarrow \small\text{vectorized}\\[10pt] \end{equation}\] \[\begin{equation} \theta := \theta-\frac{\alpha}{m}X^T\left[g(X\theta)-y\right] \end{equation}\]
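A minimal batch gradient descent sketch using the vectorized update above; the learning rate and iteration count are arbitrary illustration values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Repeatedly apply theta := theta - (alpha / m) * X^T (g(X theta) - y)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        theta -= (alpha / m) * (X.T @ (sigmoid(X @ theta) - y))
    return theta
```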

Multiclass classification

  • Instead of having \(y=\{0,1\}\), we now have \(y=\{0,1,...,n\}\)
    • So we divide our problem into n+1 binary classification problems
    • (n+1 because \(y\) starts at 0)
  • One-vs-all (aka one-vs-rest)
    1. Choose one class (i.e. one \(y\) label)
    2. Lump all other labels into a single second class
    3. Do this repeatedly, applying binary logistic regression to each case
    4. Use the hypothesis that returns the highest probability as our prediction


[Figure: one-vs-all classification (Andrew Ng, Coursera)]

  • To summarize:
    • To train your \(n+1\) different classifiers (one per label):
      1. Treat the chosen label as the positive class and lump every other label into the negative class
      2. Fit a logistic regression classifier \(h^{(i)}(x)\); its sigmoid output is the probability that \(y=i\)
    • Then, to predict which class the test data belongs to:
      1. Plug the test data into all trained logistic regression classifiers
      2. The classifier that returns the highest probability determines the predicted class (see the sketch below)
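To tie the summary together, a sketch of one-vs-all training and prediction that reuses the gradient descent update from above; labels are assumed to be the integers 0 through num_labels - 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, num_labels, alpha=0.1, num_iters=2000):
    """Fit one logistic regression classifier per class; returns one row of parameters per class."""
    m, n = X.shape
    Theta = np.zeros((num_labels, n))
    for c in range(num_labels):
        y_c = (y == c).astype(float)   # chosen class = 1, all other classes lumped into 0
        theta = np.zeros(n)
        for _ in range(num_iters):
            theta -= (alpha / m) * (X.T @ (sigmoid(X @ theta) - y_c))
        Theta[c] = theta
    return Theta

def predict_one_vs_all(Theta, X):
    """For each example, pick the class whose classifier outputs the highest probability."""
    return np.argmax(sigmoid(X @ Theta.T), axis=1)
```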