Overfitting the data
- Adding more features \(\rightarrow\) danger of overfitting
- Underfitting = the model does not capture the structure of the data
- Leads to high bias
- Caused by a function that is too simple or uses too few features
- Overfitting = the model fits the training data well but does not generalize to new examples
- Leads to high variance
- Caused by a complicated function that creates many unnecessary curves
- Two ways to address overfitting:
- Reduce the # of features
- Manually select which features to keep
- Regularization
- Keep all the features but reduce the magnitude of parameters \(\theta_j\)
- Regularization works well when we have a lot of slightly useful features
Regularization
- When a model overfits, we can reduce the magnitude of its parameters by adding a penalty on those parameters to the cost function
- Ex: we want to make the following (quartic) hypothesis behave more like a quadratic:
\[\theta_0+\theta_1x+\theta_2x^2+\theta_3x^3+\theta_4x^4\]
- To reduce the influence of \(\theta_3x^3\) and \(\theta_4x^4\), modify the cost function:
\[\min_{\theta} \frac{1}{2m}\sum_{i=1}^m (h_{\theta}(x^{(i)})-y^{(i)})^2 + 1000\cdot \theta_3^2 + 1000\cdot \theta_4^2\]
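As a sanity check on this idea, here is a minimal NumPy/SciPy sketch (the dataset, noise level, and the penalty weight of 1000 are all made up for the illustration): minimizing the penalized cost drives \(\theta_3\) and \(\theta_4\) toward zero, leaving an approximately quadratic fit.

```python
# Sketch only: penalizing theta_3 and theta_4 shrinks them toward zero.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.1, size=30)  # roughly quadratic data

X = np.column_stack([x**p for p in range(5)])   # columns: 1, x, x^2, x^3, x^4
m = len(y)

def penalized_cost(theta):
    residual = X @ theta - y
    return (residual @ residual) / (2 * m) + 1000 * theta[3]**2 + 1000 * theta[4]**2

theta_hat = minimize(penalized_cost, np.zeros(5)).x
print(theta_hat)   # theta_3 and theta_4 come out close to zero; the fit is nearly quadratic
```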
Linear regression
- Cost function with regularization:
- Notice: the intercept term (\(j=0\)) is not penalized
\[\begin{equation}
\boxed{
\min_{\theta}\frac{1}{2m}\left[\sum_{i=1}^m\left(h_{\theta}(x^{(i)})-y^{(i)}\right)^2+\lambda\sum_{j=1}^n\theta_j^2\right]
}
\end{equation}\]
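A direct NumPy translation of the boxed cost (a sketch: it assumes X already carries a leading column of ones so that theta[0] is the intercept; the names are mine):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized linear-regression cost from the boxed equation above."""
    m = len(y)
    residual = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)      # theta[0], the intercept, is not penalized
    return (residual @ residual + penalty) / (2 * m)
```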
- Gradient descent
- Remember not to penalize \(\theta_0\), the intercept term; regularize the rest of the parameters
\[\begin{equation}
\boxed{
\theta_j := \theta_j\left(1-\alpha\frac{\lambda}{m}\right)-\alpha\frac{1}{m}\sum_{i=1}^m\left[h_{\theta}(x^{(i)})-y^{(i)}\right]x_j^{(i)}
}
\end{equation}\]
with the convention that \(\lambda=0\) when \(j=0\), i.e., the intercept update has no shrinkage factor
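A sketch of the corresponding batch update loop (the learning rate and iteration count are arbitrary defaults; note the intercept is updated without the \(1-\alpha\lambda/m\) shrinkage factor):

```python
import numpy as np

def gradient_descent_reg(X, y, lam, alpha=0.01, iters=1000):
    """Batch gradient descent for regularized linear regression (theta[0] not regularized)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m                    # gradient of the squared-error term
        theta[0] -= alpha * grad[0]                         # intercept: plain update
        theta[1:] = theta[1:] * (1 - alpha * lam / m) - alpha * grad[1:]
    return theta
```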
- Normal equation
- Closed-form solution with regularization:
\[\begin{equation}
\boxed{
\theta=\left(X^TX+\lambda\cdot L\right)^{-1}X^Ty
\qquad\text{where}\qquad
L=\begin{bmatrix}0&&&\\&1&&\\&&\ddots&\\&&&1\end{bmatrix}
}
\end{equation}\]
- The zero in the corner excludes \(x_0\), the intercept, from regularization
- \(\dim(L)=(n+1)\times(n+1)\)
- Regularization also solves the issue of non-invertibility: for \(\lambda>0\), \(X^TX+\lambda\cdot L\) is invertible even when \(X^TX\) is not (e.g., when \(m\le n\))
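The closed form translates almost line-for-line into NumPy (a sketch; np.linalg.solve is used instead of an explicit inverse for numerical stability):

```python
import numpy as np

def normal_equation_reg(X, y, lam):
    """Regularized normal equation: theta = (X^T X + lam * L)^{-1} X^T y."""
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0                      # the zero in the corner: do not regularize the intercept
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```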
Logistic regression
- Cost function with regularization (here \(h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}}\) is the sigmoid hypothesis):
- Again, the intercept term \(\theta_0\) is excluded from the penalty
\[\begin{equation}
\boxed{
J(\theta)=-\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}\log\left(h_{\theta}(x^{(i)})\right) +
(1-y^{(i)})\log\left(1-h_{\theta}(x^{(i)})\right) \right]+\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2
}
\end{equation}\]
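A NumPy sketch of this cost (again assuming a leading column of ones in X and labels y in {0, 1}; the names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost_reg(theta, X, y, lam):
    """Regularized logistic-regression cost; theta[0] is excluded from the penalty."""
    m = len(y)
    h = sigmoid(X @ theta)                                   # predicted probabilities
    cross_entropy = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    return cross_entropy + lam / (2 * m) * np.sum(theta[1:] ** 2)
```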
- Gradient descent
- The update rule is identical in form to the linear regression case, but \(h_\theta(x)\) is now the sigmoid hypothesis rather than the linear one
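A sketch of that update for the logistic case (the sigmoid helper is repeated so the snippet stands alone; learning rate and iteration count are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_logistic_reg(X, y, lam, alpha=0.1, iters=5000):
    """Same update shape as the linear case, but the hypothesis is the sigmoid."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        theta[0] -= alpha * grad[0]                          # intercept: no shrinkage
        theta[1:] = theta[1:] * (1 - alpha * lam / m) - alpha * grad[1:]
    return theta
```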