Overfitting the data
  - Adding more features \(\rightarrow\) danger of overfitting
    
      - Underfitting = the model does not capture the underlying structure of the data
        
          - Leads to high bias
 
          - Caused by a function that is too simple or uses too few features
 
        
       
      - Overfitting = the model fits the training data well but fails to generalize to new data (illustrated in the sketch below)
        
          - Leads to high variance
 
          - Caused by a complicated function that creates many unnecessary curves
 
        
       
    
   
  - Two ways to address overfitting:
    
      - Reduce the # of features
        
          - Manually select which features to keep
 
        
       
      - Regularization
        
          - Keep all the features but reduce the magnitude of parameters \(\theta_j\)
 
          - Regularization works well when we have a lot of slightly useful features
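The contrast between the two failure modes is easy to reproduce numerically. The sketch below is not from the lecture: it uses synthetic data from a noisy quadratic (values chosen arbitrarily for illustration) and `np.polyfit` to fit polynomials of degree 1, 2, and 8, then compares training error with error on held-out points; the low degree typically shows high bias and the high degree high variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a noisy quadratic (illustrative values only)
x_train = np.linspace(0.0, 1.0, 10)
y_train = 1.0 + 2.0 * x_train - 3.0 * x_train**2 + rng.normal(0.0, 0.1, x_train.size)
x_test = np.linspace(0.05, 0.95, 50)
y_test = 1.0 + 2.0 * x_test - 3.0 * x_test**2 + rng.normal(0.0, 0.1, x_test.size)

for degree in (1, 2, 8):
    coeffs = np.polyfit(x_train, y_train, degree)                 # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # degree 1 tends to underfit (high bias); degree 8 tends to overfit (high variance)
    print(f"degree={degree}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```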
 
        
       
    
   
Regularization
  - When a model overfits, we can reduce the magnitude of its parameters by making large parameter values more expensive in the cost function
    
      - Ex: Suppose we want the following quartic hypothesis to behave more like a quadratic:
 
    
   
\[h_\theta(x)=\theta_0+\theta_1x+\theta_2x^2+\theta_3x^3+\theta_4x^4\]
  - To reduce the influence of \(\theta_3x^3\) and \(\theta_4x^4\), modify the cost function:
 
\[\min_{\theta} \frac{1}{2m}\sum_{i=1}^m (h_{\theta}(x^{(i)})-y^{(i)})^2 + 1000\cdot \theta_3^2 + 1000\cdot \theta_4^2\]
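As a sketch of how this penalty enters the computation (the function name `penalized_cost` and the assumption that the columns of `X` are \(1, x, x^2, x^3, x^4\) are mine):

```python
import numpy as np

def penalized_cost(theta, X, y, penalty=1000.0):
    """Squared-error cost plus a large extra penalty on theta_3 and theta_4.

    X is assumed to have columns [1, x, x^2, x^3, x^4]; the weight of 1000
    is the illustrative value used in the notes.
    """
    m = y.size
    residual = X @ theta - y
    return (residual @ residual) / (2 * m) + penalty * (theta[3] ** 2 + theta[4] ** 2)
```

Minimizing this cost drives \(\theta_3\) and \(\theta_4\) toward zero, leaving an approximately quadratic fit.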
Linear regression
  - Cost function with regularization:
    
      - Notice: the intercept term (\(j=0\)) is not penalized
 
    
   
\[\begin{equation}
  \boxed{
    \min_{\theta}\frac{1}{2m}\left[\sum_{i=1}^m\left(h_{\theta}(x^{(i)})-y^{(i)}\right)^2+\lambda\sum_{j=1}^n\theta_j^2\right]
  }
\end{equation}\]
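A minimal numpy sketch of the boxed cost (variable names are mine; `X` is assumed to be the \(m\times(n+1)\) design matrix whose first column is all ones):

```python
import numpy as np

def linreg_cost_regularized(theta, X, y, lam):
    """Regularized linear regression cost J(theta); theta[0] is not penalized."""
    m = y.size
    residual = X @ theta - y                     # h_theta(x) - y for every example
    return (residual @ residual + lam * np.sum(theta[1:] ** 2)) / (2 * m)
```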
  - Gradient descent
    
      - Remember not to penalize \(\theta_0\), the intercept term; regularize the rest of the parameters.
 
    
   
\[\begin{equation}
  \boxed{
    \theta_j := \theta_j\left(1-\alpha\frac{\lambda}{m}\right)-\alpha\frac{1}{m}\sum_{i=1}^m\left[h_{\theta}(x^{(i)})-y^{(i)}\right]x_j^{(i)}
  }
\end{equation}\]
        
        
where \(\lambda\) is taken to be \(0\) for \(j=0\), so the intercept \(\theta_0\) is not shrunk
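A vectorized sketch of this update, under the same design-matrix assumptions as above (the helper name is mine):

```python
import numpy as np

def linreg_gradient_step(theta, X, y, alpha, lam):
    """One regularized gradient-descent update for linear regression."""
    m = y.size
    grad = X.T @ (X @ theta - y) / m             # (1/m) * sum_i (h_theta(x) - y) * x_j
    shrink = np.ones_like(theta)
    shrink[1:] = 1.0 - alpha * lam / m           # theta_0 keeps factor 1 (lambda = 0 for j = 0)
    return shrink * theta - alpha * grad
```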
  - Normal equation
    
      - The regularized normal equation is shown below; the zero in the corner of \(L\) excludes \(x_0\) (the intercept) from the penalty
 
      - \(\dim(L)=(n+1)\times(n+1)\)
 
      - Regularization also solves the issue of non-invertibility: \(X^TX\) may be singular (e.g. when \(m\le n\)), but \(X^TX+\lambda L\) is invertible for \(\lambda>0\)
 
\[\begin{equation}
  \boxed{
    \theta=\left(X^TX+\lambda L\right)^{-1}X^Ty,
    \qquad
    L=\begin{bmatrix}0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1\end{bmatrix}
  }
\end{equation}\]
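A minimal sketch of this closed-form solution (the function name is mine; `np.linalg.solve` is used rather than forming the inverse explicitly):

```python
import numpy as np

def normal_equation_regularized(X, y, lam):
    """theta = (X^T X + lambda * L)^(-1) X^T y, with L = identity whose top-left entry is zeroed."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0                                # do not regularize the intercept theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```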
 
    
   
Logistic regression
  - Cost function with regularization:
    
      - Again, the intercept term \(\theta_0\) is excluded from the penalty; here \(h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}}\) is the sigmoid hypothesis
 
    
   
\[\begin{equation}
  \boxed{
    J(\theta)=-\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}\log\left(h_{\theta}(x^{(i)})\right) +
    (1-y^{(i)})\log\left(1-h_{\theta}(x^{(i)})\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2
  }
\end{equation}\]
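A sketch of this cost in numpy (names are mine; `sigmoid` is applied elementwise to \(X\theta\)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_cost_regularized(theta, X, y, lam):
    """Regularized logistic regression cost J(theta); theta[0] is not penalized."""
    m = y.size
    h = sigmoid(X @ theta)                       # h_theta(x) for every example
    cross_entropy = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    return cross_entropy + (lam / (2 * m)) * np.sum(theta[1:] ** 2)
```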
  - Gradient descent
    
      - The update rule has the same form as in the linear regression case; only the hypothesis changes, with \(h_\theta(x)\) now the sigmoid (see the sketch below)
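A sketch of that update for logistic regression, assuming the same conventions as the linear-regression snippets above (the helper name is mine):

```python
import numpy as np

def logreg_gradient_step(theta, X, y, alpha, lam):
    """One regularized gradient-descent update for logistic regression."""
    m = y.size
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))       # sigmoid hypothesis
    grad = X.T @ (h - y) / m
    grad[1:] += (lam / m) * theta[1:]            # regularization term, skipping theta_0
    return theta - alpha * grad
```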