What is the difference between L1 and L2 regularization?
L1 and L2 regularization are techniques used to prevent overfitting in machine learning models by adding a penalty term to the loss function. This penalty term discourages the model from fitting too closely to the training data, which can help it generalize better to new data. The key difference between L1 and L2 regularization lies in the form of the penalty term they add to the loss function.
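To make this concrete, here is a minimal sketch of what a regularized loss looks like for a linear model, written in NumPy (the function name, data shapes, and `lam` value are illustrative assumptions, not a fixed API):

```python
import numpy as np

def regularized_loss(w, X, y, lam, penalty="l1"):
    """Mean-squared-error loss plus an L1 or L2 penalty on the weights w.

    lam is the regularization strength (the lambda in the formulas below).
    """
    mse = np.mean((X @ w - y) ** 2)
    if penalty == "l1":
        return mse + lam * np.sum(np.abs(w))  # L1: sum of absolute values
    return mse + lam * np.sum(w ** 2)         # L2: sum of squares
```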
L1 Regularization (Lasso Regularization)
- Penalty Term: The penalty term added to the loss function is the sum of the absolute values of the model parameters (weights). Mathematically, it can be expressed as:
  $$\text{L1 penalty} = \lambda \sum_{i} |w_i|$$
  where $\lambda$ is a hyperparameter that controls the strength of the regularization, and $w_i$ are the model parameters.
- Effect on Weights: L1 regularization tends to drive some weights to exactly zero, effectively performing feature selection. This means that L1 regularization can produce sparse models where some features are entirely ignored.
- Use Cases: L1 regularization is useful when you have a high-dimensional dataset with many features and you suspect that only a small subset of these features is relevant (see the sketch below).
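To see the sparsity effect in practice, here is a minimal sketch using scikit-learn's `Lasso` on synthetic data where only 3 of 20 features are relevant (the data, `alpha` value, and feature counts are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))     # 100 samples, 20 candidate features
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.5, 0.5]      # only the first 3 features matter
y = X @ true_w + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1)           # alpha plays the role of lambda
lasso.fit(X, y)
print("non-zero weights:", np.sum(lasso.coef_ != 0))  # typically close to 3
```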
L2 Regularization (Ridge Regularization)
- Penalty Term: The penalty term added to the loss function is the sum of the squares of the model parameters (weights). Mathematically, it can be expressed as:
  $$\text{L2 penalty} = \lambda \sum_{i} w_i^2$$
  where $\lambda$ is a hyperparameter that controls the strength of the regularization, and $w_i$ are the model parameters.
- Effect on Weights: L2 regularization tends to shrink the weights but does not drive them to exactly zero. This results in models where all features are considered, but their impact is reduced.
- Use Cases: L2 regularization is useful when you believe all features might be relevant but you want to prevent any single feature from having an excessive influence on the model (see the sketch below).
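For contrast with the Lasso sketch above, here is the same synthetic setup fit with scikit-learn's `Ridge`; the weights shrink but generally stay non-zero (again, the data and `alpha` are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.5, 0.5]
y = X @ true_w + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=10.0)          # alpha plays the role of lambda
ridge.fit(X, y)
print("non-zero weights:", np.sum(ridge.coef_ != 0))  # typically all 20
print("largest |weight|:", np.abs(ridge.coef_).max()) # shrunk, not zeroed
```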
Comparison Summary
- Penalty Type: L1 uses the absolute values of the weights, while L2 uses the squared values of the weights (both penalties are computed concretely in the sketch after this list).
- Effect on Weights: L1 can produce sparse models with some weights exactly zero, while L2 produces dense models with small weights.
- Feature Selection: L1 performs feature selection by driving some weights to zero. L2 does not perform feature selection but rather regularizes all weights to prevent overfitting.
- Optimization: The L1 penalty is non-differentiable at zero, which can make optimization more challenging (subgradient or proximal methods are commonly used). The L2 penalty is smooth and therefore easier to optimize.
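As a quick numerical illustration of the two penalty types, the following computes both terms for a small, made-up weight vector (the weights and `lam` are arbitrary):

```python
import numpy as np

w = np.array([0.5, -2.0, 0.0, 1.5])
lam = 0.1

l1_penalty = lam * np.sum(np.abs(w))  # lambda * sum_i |w_i|  -> 0.1 * 4.0 = 0.4
l2_penalty = lam * np.sum(w ** 2)     # lambda * sum_i w_i^2  -> 0.1 * 6.5 = 0.65

print(l1_penalty, l2_penalty)  # ~0.4 and ~0.65 (up to float rounding)
```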
Combined Regularization
In practice, L1 and L2 regularization can be combined into a technique known as Elastic Net regularization, which includes both penalty terms:
$$\text{Elastic Net penalty} = \alpha \left( \lambda_1 \sum_{i} |w_i| + \lambda_2 \sum_{i} w_i^2 \right)$$
Here, $\alpha$, $\lambda_1$, and $\lambda_2$ are hyperparameters that control the strength of the L1 and L2 penalties. Elastic Net regularization leverages the benefits of both L1 and L2 regularization, providing a balance between feature selection and weight regularization.
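A minimal Elastic Net sketch using scikit-learn follows. Note that scikit-learn parameterizes the penalty slightly differently from the formula above, as `alpha * (l1_ratio * Σ|w_i| + 0.5 * (1 - l1_ratio) * Σ w_i^2)`, i.e. one overall strength plus a mixing ratio rather than separate $\lambda_1$ and $\lambda_2$ (the data and values below are illustrative):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.5, 0.5]
y = X @ true_w + 0.1 * rng.normal(size=100)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5)  # equal mix of L1 and L2
enet.fit(X, y)
print("non-zero weights:", np.sum(enet.coef_ != 0))  # often sparser than Ridge
                                                     # but denser than pure Lasso
```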