L2 Ridge Regression from Lagrange Multipliers

by Marcus Generalization

🔍

## What is L2 Ridge Regression?

Today, we'll be discussing the ubiquitous L2 Ridge Regression formula, where we have the ordinary least squares loss with a regularizer term that acts to minimize the parameter weights of the model using the Frobenius norm, i.e.

$$ L(W, \lambda) = \underbrace{\|Wx - b\|^2}_{\text{Squared Error Loss}} + \lambda \underbrace{\|W\|^2_F}_{\text{L2 Regularization Term}}$$
- $L(W) \in \mathbb{R}$ is the scalar loss
- $W \in \mathbb{R}^{m \times n}$ is our weight matrix 
    - Assuming that this is a linear regression type model/one layer neural network
    - Idea typically extended to $\theta$ where $\theta$ is *all* of the parameters of the network (or some subset)
- $b \in \mathbb{R}^m$ is our ground truth output for input $x$
    - Often put as $y$, but wanted to explicitly connect to L2 Loss so used $b$
- $x \in \mathbb{R}^n$ is our input 
- $\lambda \in \mathbb{R}$ is a scalar that scales $||W||^2$
    - How much weight do we give to the constraint to minimize the size of weights
    - Note $\lambda \in \mathbb{R}$ since we are taking the Frobenius norm of $W$, $||W||^2_F \in \mathbb{R}$
- $||W||^2_F \in \mathbb{R}$ is the Frobenius norm of $W$
    - $||W||^2_F = vec(W)^T \cdot vec(W) \in \mathbb{R}$
    - How large is the vectorized weight matrix $W$?

For simplicity we show a single data point $x$, in practice we can extend this to datasets with 
$$ L(W, \lambda) = ||XW^T - B||^2 + \lambda ||W^T||^2_F$$
- $X \in \mathbb{R}^{d \times n}$, where we have $D$ data samples each of degree $n$
    - From the $\|Wx - b\|^2$ case, $\forall x_i$, transpose and stack them 
- $W^T \in \mathbb{R}^{n \times m}$ which is the transpose of $W$ in the previous convention 
    - $XW^T \in \mathbb{R}^{d \times m}$ producing $d$ vectors of size $\mathbb{R}^m$
- $B \in \mathbb{R}^{d \times m}$ stating our output is $d$ entries of size $\mathbb{R}^m$

Note that in statistics and data science the **row vector** convention is quite common, which is why we stated it this way here (although **I** prefer the **column vector** convention).

## Lagrange Multiplier Equivalent with Constant Constraint

The Lagrange Multiplier method states that given we have 
- Optimization function $f: U \subset \mathbb{R}^n \rightarrow \mathbb{R}$
- Constraint function $g: U \subset \mathbb{R}^n \rightarrow \mathbb{R}$
- Input $x \in \mathbb{R}^n$ that is our input to $f,g$
- Constraint $c \in \mathbb{R}$ such that $g(x) = c$ 
    - $g(x) = c$ defines a level set for $g$ such that all solutions must live on this level set
- Lagrange Multiplier $\lambda \in \mathbb{R}$ is the factor by which $\nabla f(x_0) = \lambda \nabla g(x_0)$ for optimal $x_0$

If $f|S$ ($f$ restricted to set $S$) has a local extrema on $S$ at $x_0$, then there's a real number $\lambda$ such that the gradient of $f$ and $g$ are parallel, for $g$ defined as the level set at value $c$, i.e. $g(x) = c$. 
$$ \nabla f(x_0) = \lambda \nabla g(x_0) $$

The reason why the gradient of $f$ and $g$ must be parallel is because at a constrained optimum you cannot improve $f$ by moving along the constraint surface. The only direction left for $\nabla f$ is normal to the constraint surface $g(x) = c$.

<img src="/media/images/20260421_235938_Screenshot_2026-04-21_at_4.57.21_PM.png" alt="Screenshot 2026 04 21 at 4.57.21 PM" style="max-width: min(650px, 80%); height: auto;">
<center> 
$\textbf{Figure 1}$ Lagrange Multiplier Condition showing $\nabla f(x_0) = \lambda g(x_0)$ at Minimum $f(x)$ on Constraint Surface $g(x) = k$.

$f$ is a parabola centered at $x_{free}*$, which is the minimum point

$g = ||x||^2 = k$ is the squared norm of $x$ equal to k, i.e. a circle with radius $\sqrt{k}$ centered at 0
</center>

As we can see in **Figure 1** we have that if we are at the optimal point $x_0$ that the gradient of $f$ and $g$ are parallel since there is no component for which we can move along the constraint surface that will produce a smaller value. Note that for the suboptimal $x_1$ point chosen, we have that $\nabla f(x_1) \neq \lambda \nabla g(x_1)$, and as we can see by the green line, we can move in that direction along $g(x) = k$ for a smaller value on $f$.

Using these conditions $\nabla f(x_0) = \lambda \nabla g(x_0)$ and $g(x) = c$, let's define function $h$ that combines these conditions. Mainly that $\nabla f(x_0) - \lambda \nabla g(x_0) = 0$ and that $g(x) = c$  
$$ h(x, \lambda) = f(x) - \lambda (g(x) -c) $$

Note that we can also phrase it as 
$$ h(x, \lambda) = f(x) + \lambda (g(x) -c) $$
since $\lambda$ itself could be negative. We are going to stick with the latter convention.

We can see that when we set $\nabla h(x) = 0$ that we find the extrema values of $f$ on constraint function $g(x) = c$

$$ \nabla_x h(x, \lambda) = \nabla_x f(x) + \lambda \nabla_x g(x) = 0 \rightarrow \nabla_x f(x) = \lambda \nabla_x g(x)$$

Note that since $\lambda$ is a parameter, that we must consider the equation 
$$g(x) = c$$

Since for $x \in \mathbb{R}^n$, $\nabla_x f = \lambda \nabla_x g$ gives us $n$ equations for $n$ parameters in $x$, however $h(x, \lambda)$ has $n+1$ free parameters, so we need $n+1$ equations to exactly solve for them all.

## Lagrange Multipliers $\rightarrow$ Lagrangian

Now let's suppose that for our system of equations 
$$ h(x, \lambda) = f(x) + \lambda (g(x) -c) $$

Instead of setting the parameter $c$ and having $\lambda$ determined by it and the other parameters of $h$, we specify $\lambda$ and implicitly set $c$ from this. This is commonly done since it is much more convenient implementationally and pragmatically. Note that under convexity assumptions, the constrained and penalized formulations are equivalent up to a mapping between $c$ and $\lambda$.

The objective of using the regularizer in the first place is to constrain the optimization of our loss function $h(x,\lambda)$, which in our L2 Ridge Regression loss is stated as $L(W,\lambda)$ such that the magnitude of the weights $W$ constrains the solution. Furthermore, we consider $\lambda$ to be a hyperparameter, that we the user set such that we can specify *how much* we care about the constraint $||W||^2_F$, with higher $\lambda$ having higher importance for this metric.

Since now we have our function $h$ with $\lambda$ as a specified constant, $h$ has $n$ free parameters ($x \in \mathbb{R}^n$), we can restate the Lagrangian version of $h$ as follows:
$$ h(x, \lambda) = f(x) + \lambda g(x) $$

Giving us that 
$$ \nabla_x h(x, \lambda) = \nabla_x f(x) + \lambda \nabla_x g(x) $$

For our L2 Ridge Regression Loss function, we care about the gradient produced by the loss $L(W,\lambda)$ since this is used to optimize the weights. Notice here the similarity of form to the above equation, where for $L(W, \lambda)$ we are optimizing $W$ with input $x$.

$$ L(W, \lambda) = \|Wx - b\|^2 + \lambda \|W\|^2_F$$

<img src="/media/images/20260422_012240_Screenshot_2026-04-21_at_6.21.55_PM.png" alt="Screenshot 2026 04 21 at 6.21.55 PM" style="max-width: min(650px, 80%); height: auto;">
<center>
$\textbf{Figure 2}$ Same functions $f,g$ as in $\textbf{Figure 1}$, however now $\lambda$ is a Specified hyperparameter, instead of a variable implicitly determined by constraint $k$.

For different values of $\lambda$, level sets of $g(x) = ||x||^2$ with the optimal value specified by objective function $f(x) = ||x - x_{free}*||$, where $x_{free}*$ is the unconstrained optimal value.

$\lambda = \infty$ corresponds to $h(x,\lambda) = g(x)$, $\lambda = 0$ corresponds to $h(x,\lambda) = f(x)$ as seen in diagram
</center>

As we can see in **Figure 2** with this formulation where $\lambda$ is specified as a hyperparameter, we can see that the optimal value of the function is reached when we are at the point closest to $x_{free}*$ (optimal unconstrained $f(x)$) with $\nabla_x f(x) = \nabla g(x)$.

This directly relates to our Ridge Regression formulation, showing how the magnitude of $\lambda$ affects the strength of the regularizer.

<img src="/media/images/20260422_014112_lagrange_ridge_regression_spin_thumbnail.gif" alt="lagrange ridge regression spin thumbnail" style="max-width: min(650px, 80%); height: auto;">
<center>
$\textbf{Figure 3}$ Same functions as $\textbf{Figure 2}$ explicitly stating that it's L2 Ridge Regression, with the constraint boundary specified by constraint $k$ and not by $\lambda$.

3D Animation showing clearly what the Constrained Loss Landscape looks like 
</center>

We can also in **Figure 3** (and the title picture) what the Loss Landscape looks like for L2 Ridge Regression, with constraint function $||w||^2 = k$. Note that here the constraint function is specified through $||w||^2 = k$ not through setting the value of $\lambda$. However, these are both equivalent statements of the same underlying equation.