Binary classification
- Def: \(y\) can take on only two possible values: \(0\) and \(1\)
- 0 = negative class
- 1 = positive class
- Label = the \(y^{(i)}\) that corresponds to a given sample with features \(x^{(i)}\)
Hypothesis representation
- Since the y-values are discrete, we know that our hypothesis must satisfy \(0\leq h(x)\leq 1\)
- This is accomplished by plugging \(h(x)=\theta^Tx\) into the logistic function, aka sigmoid function:
\[\begin{equation}
\boxed{
h_{sigmoid}(x) = g(z) = \frac{1}{1+e^{-z}}
}
\end{equation}\]
where \(z = \theta^Tx\), the unbounded hypothesis.
[Graph: the sigmoid function \(g(z)\) (Andrew Ng | Coursera)]
- The sigmoid function maps any real number to the interval \((0,1)\), as shown in the graph above
- e.g.) \(h_{sigmoid}(x)=0.7 \rightarrow\) there is a 70% probability that the output has label 1
\[\begin{equation}
P(y=0 | x;\theta) + P(y=1 | x;\theta) =1
\end{equation}\]
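- A minimal NumPy sketch of the sigmoid hypothesis (the function and variable names below are illustrative, not from the course):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z}); applied elementwise to arrays."""
    return 1.0 / (1.0 + np.exp(-z))

# g(0) = 0.5; large positive z -> close to 1; large negative z -> close to 0
print(sigmoid(0.0))                      # 0.5
print(sigmoid(np.array([-10.0, 10.0])))  # ~[4.5e-05, 0.99995]
```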
Decision boundary
- In order to get our discrete 0 or 1 classification, we can translate our sigmoid output as follows:
\[\begin{aligned}
h_{sigmoid}(x) \geq 0.5 \rightarrow y=1\\
h_{sigmoid}(x) < 0.5 \rightarrow y=0
\end{aligned}\]
- Logistic function \(g(z)\) behaves s.t. when \(z\geq 0\), \(g(z) \geq 0.5\), so the rule above becomes:
\[\begin{aligned}
\theta^Tx \geq 0 \rightarrow y=1\\
\theta^Tx < 0 \rightarrow y=0
\end{aligned}\]
- Decision boundary = the boundary (\(\theta^Tx = 0\)) that separates the region where we predict \(y=1\) from the region where we predict \(y=0\)
- This boundary is created by the hypothesis function (its parameters \(\theta\)), as in the sketch below
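- A minimal prediction sketch, assuming a design matrix `X` whose first column is all ones (intercept term) and a hypothetical parameter vector `theta`:

```python
import numpy as np

def predict(theta, X):
    """Predict 1 exactly when theta^T x >= 0, i.e. when g(theta^T x) >= 0.5."""
    return (X @ theta >= 0).astype(int)

# Hypothetical boundary x1 + x2 = 3, i.e. theta = [-3, 1, 1] with an intercept column
theta = np.array([-3.0, 1.0, 1.0])
X = np.array([[1.0, 1.0, 1.0],    # 1 + 1 < 3  -> predict 0
              [1.0, 2.0, 2.0]])   # 2 + 2 >= 3 -> predict 1
print(predict(theta, X))          # [0 1]
```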
Logistic regression
Cost function
- Plugging the nonlinear sigmoid into the squared-error cost from linear regression makes \(J(\theta)\) non-convex \(\rightarrow\) multiple local optima \(\rightarrow\) gradient descent is not guaranteed to find the global minimum \(\rightarrow\) new cost function for logistic regression:
\[\begin{equation}
J(\theta)=\frac{1}{m}\sum_{i=1}^mCost(h_{sigmoid}(x^{(i)}),y^{(i)})
\end{equation}\]
\[\begin{aligned}
\text{where}
\begin{cases}
Cost(h_{sigmoid}(x), y) &= -\log(h_{sigmoid}(x)) \text{ if } y=1\\
Cost(h_{sigmoid}(x), y) &= -\log(1-h_{sigmoid}(x)) \text{ if } y=0
\end{cases}
\end{aligned}\]
\[\begin{equation}
\downarrow \small\text{combining the two cases into one equation}\\
\end{equation}\]
\[\begin{aligned}
\boxed{
J(\theta)=-\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}\log(h_{sigmoid}(x^{(i)})) +
(1-y^{(i)})\log(1-h_{sigmoid}(x^{(i)})) \right]
}
\end{aligned}\]
- Notice:
- When \(y=0\), the first term becomes zero
- When \(y=1\), the second term becomes zero
- Vectorized:
\[\begin{aligned}
h &= g(X\theta)\\[5pt]
J(\theta) &= \frac{1}{m}\left(-y^T\log(h)-(1-y)^T\log(1-h)\right)
\end{aligned}\]
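- A minimal NumPy sketch of the vectorized cost above (function and variable names are illustrative; `X` is assumed to include an intercept column):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Vectorized J(theta) = (1/m) * (-y^T log(h) - (1-y)^T log(1-h)) with h = g(X theta)."""
    m = len(y)
    h = sigmoid(X @ theta)
    return (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m

# With theta = 0, h = 0.5 everywhere, so J(theta) = log(2) ~ 0.693 regardless of the data
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])
print(cost(np.zeros(2), X, y))   # ~0.693
```

- (In practice you may want to clip \(h\) away from exactly 0 and 1 to avoid \(\log(0)\).)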
- Plotting the cost \(Cost(h_{sigmoid}(x),y)\) vs. \(h_{sigmoid}(x)\):
[Graph: \(Cost(h_{sigmoid}(x),y)\) vs. \(h_{sigmoid}(x)\) for \(y=1\) and \(y=0\) (Andrew Ng | Coursera)]
- For \(y=1\), the cost decreases as \(h(x)\rightarrow 1\)
- For \(y=0\), the cost decreases as \(h(x)\rightarrow 0\)
Gradient descent
- Recall the general form of gradient descent:
\[\begin{aligned}
&\small\text{Repeat}\{\\
&\theta_j := \theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)\\
&\}
\end{aligned}\]
- Taking the partial derivatives of the cost function above yields the same update rule as in linear regression (though \(h\) is now the sigmoid hypothesis):
\[\begin{aligned}
&\small\text{Repeat until convergence} \normalsize\{\\
&\theta_j := \theta_j-\alpha\frac{1}{m}\sum_{i=1}^m\left(h_{sigmoid}(x^{(i)})-y^{(i)}\right)x_j^{(i)}\\
&\}\\[10pt]
\end{aligned}\]
\[\begin{equation}
\downarrow \small\text{vectorized}\\[10pt]
\end{equation}\]
\[\begin{equation}
\theta := \theta-\frac{\alpha}{m}X^T\left[g(X\theta)-y\right]
\end{equation}\]
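- A minimal sketch of the vectorized update (the learning rate, iteration count, and function names are arbitrary placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Repeat theta := theta - (alpha/m) * X^T (g(X theta) - y)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        theta = theta - alpha * grad
    return theta
```

- In practice, monitor \(J(\theta)\) across iterations to confirm it is decreasing and that \(\alpha\) is not too large.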
Multiclass classification
- Instead of having \(y\in\{0,1\}\), we now have \(y\in\{0,1,...,n\}\)
- So we divide our problem into n+1 binary classification problems
- (n+1 because \(y\) starts at 0)
- One-vs-all (aka one-vs-rest)
- Choose one class (i.e. one \(y\) label)
- Lump all other labels into a single second class
- Do this repeatedly, applying binary logistic regression to each case
- Use the hypothesis that returns the highest probability as our prediction
[Figure: one-vs-all illustration (Andrew Ng | Coursera)]
- To summarize:
- To train your \(n+1\) different classifiers:
- For each label, fit \(h(x)\) treating that label as the positive class (\(y=1\)) and all other labels as the negative class (\(y=0\))
- The sigmoid output is then the probability that the input belongs to the chosen label
- Then, to predict which class the test data belongs to:
- Plug the test data into all trained logistic regression classifiers
- Predict the class whose classifier returns the highest probability (a minimal one-vs-all sketch follows below)
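- A minimal one-vs-all sketch under the same assumptions as the earlier sketches (labels are the integers \(0,...,n\); `num_labels` is \(n+1\); all function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X, y, alpha=0.1, num_iters=1000):
    """Plain gradient descent on the logistic regression cost (as in the previous section)."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        theta -= alpha / len(y) * (X.T @ (sigmoid(X @ theta) - y))
    return theta

def one_vs_all(X, y, num_labels, alpha=0.1, num_iters=1000):
    """Train one binary classifier per label, treating label i as 1 and all other labels as 0."""
    all_theta = np.zeros((num_labels, X.shape[1]))
    for i in range(num_labels):
        all_theta[i] = fit_binary(X, (y == i).astype(float), alpha, num_iters)
    return all_theta

def predict_one_vs_all(all_theta, X):
    """For each example, pick the class whose classifier outputs the highest probability."""
    return np.argmax(sigmoid(X @ all_theta.T), axis=1)   # probabilities have shape (m, num_labels)
```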