L2 Ridge Regression from Lagrange Multipliers Take AI Test by Marcus Generalization 🔍 ## What is L2 Ridge Regression? Today, we'll be discussing the ubiquitous L2 Ridge Regression formula, where we have the ordinary least squares loss with a regularizer term that acts to minimize the parameter weights of the model using the Frobenius norm, i.e. $$ L(W, \lambda) = \underbrace{\|Wx - b\|^2}_{\text{Squared Error Loss}} + \lambda \underbrace{\|W\|^2_F}_{\text{L2 Regularization Term}}$$ - $L(W) \in \mathbb{R}$ is the scalar loss - $W \in \mathbb{R}^{m \times n}$ is our weight matrix - Assuming that this is a linear regression type model/one layer neural network - Idea typically extended to $\theta$ where $\theta$ is *all* of the parameters of the network (or some subset) - $b \in \mathbb{R}^m$ is our ground truth output for input $x$ - Often put as $y$, but wanted to explicitly connect to L2 Loss so used $b$ - $x \in \mathbb{R}^n$ is our input - $\lambda \in \mathbb{R}$ is a scalar that scales $||W||^2$ - How much weight do we give to the constraint to minimize the size of weights - Note $\lambda \in \mathbb{R}$ since we are taking the Frobenius norm of $W$, $||W||^2_F \in \mathbb{R}$ - $||W||^2_F \in \mathbb{R}$ is the Frobenius norm of $W$ - $||W||^2_F = vec(W)^T \cdot vec(W) \in \mathbb{R}$ - How large is the vectorized weight matrix $W$? For simplicity we show a single data point $x$, in practice we can extend this to datasets with $$ L(W, \lambda) = ||XW^T - B||^2 + \lambda ||W^T||^2_F$$ - $X \in \mathbb{R}^{d \times n}$, where we have $D$ data samples each of degree $n$ - From the $\|Wx - b\|^2$ case, $\forall x_i$, transpose and stack them - $W^T \in \mathbb{R}^{n \times m}$ which is the transpose of $W$ in the previous convention - $XW^T \in \mathbb{R}^{d \times m}$ producing $d$ vectors of size $\mathbb{R}^m$ - $B \in \mathbb{R}^{d \times m}$ stating our output is $d$ entries of size $\mathbb{R}^m$ Note that in statistics and data science the **row vector** convention is quite common, which is why we stated it this way here (although **I** prefer the **column vector** convention). ## Lagrange Multiplier Equivalent with Constant Constraint The Lagrange Multiplier method states that given we have - Optimization function $f: U \subset \mathbb{R}^n \rightarrow \mathbb{R}$ - Constraint function $g: U \subset \mathbb{R}^n \rightarrow \mathbb{R}$ - Input $x \in \mathbb{R}^n$ that is our input to $f,g$ - Constraint $c \in \mathbb{R}$ such that $g(x) = c$ - $g(x) = c$ defines a level set for $g$ such that all solutions must live on this level set - Lagrange Multiplier $\lambda \in \mathbb{R}$ is the factor by which $\nabla f(x_0) = \lambda \nabla g(x_0)$ for optimal $x_0$ If $f|S$ ($f$ restricted to set $S$) has a local extrema on $S$ at $x_0$, then there's a real number $\lambda$ such that the gradient of $f$ and $g$ are parallel, for $g$ defined as the level set at value $c$, i.e. $g(x) = c$. $$ \nabla f(x_0) = \lambda \nabla g(x_0) $$ The reason why the gradient of $f$ and $g$ must be parallel is because at a constrained optimum you cannot improve $f$ by moving along the constraint surface. The only direction left for $\nabla f$ is normal to the constraint surface $g(x) = c$. <img src="/media/images/20260421_235938_Screenshot_2026-04-21_at_4.57.21_PM.png" alt="Screenshot 2026 04 21 at 4.57.21 PM" style="max-width: min(650px, 80%); height: auto;"> <center> $\textbf{Figure 1}$ Lagrange Multiplier Condition showing $\nabla f(x_0) = \lambda g(x_0)$ at Minimum $f(x)$ on Constraint Surface $g(x) = k$. $f$ is a parabola centered at $x_{free}*$, which is the minimum point $g = ||x||^2 = k$ is the squared norm of $x$ equal to k, i.e. a circle with radius $\sqrt{k}$ centered at 0 </center> As we can see in **Figure 1** we have that if we are at the optimal point $x_0$ that the gradient of $f$ and $g$ are parallel since there is no component for which we can move along the constraint surface that will produce a smaller value. Note that for the suboptimal $x_1$ point chosen, we have that $\nabla f(x_1) \neq \lambda \nabla g(x_1)$, and as we can see by the green line, we can move in that direction along $g(x) = k$ for a smaller value on $f$. Using these conditions $\nabla f(x_0) = \lambda \nabla g(x_0)$ and $g(x) = c$, let's define function $h$ that combines these conditions. Mainly that $\nabla f(x_0) - \lambda \nabla g(x_0) = 0$ and that $g(x) = c$ $$ h(x, \lambda) = f(x) - \lambda (g(x) -c) $$ Note that we can also phrase it as $$ h(x, \lambda) = f(x) + \lambda (g(x) -c) $$ since $\lambda$ itself could be negative. We are going to stick with the latter convention. We can see that when we set $\nabla h(x) = 0$ that we find the extrema values of $f$ on constraint function $g(x) = c$ $$ \nabla_x h(x, \lambda) = \nabla_x f(x) + \lambda \nabla_x g(x) = 0 \rightarrow \nabla_x f(x) = \lambda \nabla_x g(x)$$ Note that since $\lambda$ is a parameter, that we must consider the equation $$g(x) = c$$ Since for $x \in \mathbb{R}^n$, $\nabla_x f = \lambda \nabla_x g$ gives us $n$ equations for $n$ parameters in $x$, however $h(x, \lambda)$ has $n+1$ free parameters, so we need $n+1$ equations to exactly solve for them all. ## Lagrange Multipliers $\rightarrow$ Lagrangian Now let's suppose that for our system of equations $$ h(x, \lambda) = f(x) + \lambda (g(x) -c) $$ Instead of setting the parameter $c$ and having $\lambda$ determined by it and the other parameters of $h$, we specify $\lambda$ and implicitly set $c$ from this. This is commonly done since it is much more convenient implementationally and pragmatically. Note that under convexity assumptions, the constrained and penalized formulations are equivalent up to a mapping between $c$ and $\lambda$. The objective of using the regularizer in the first place is to constrain the optimization of our loss function $h(x,\lambda)$, which in our L2 Ridge Regression loss is stated as $L(W,\lambda)$ such that the magnitude of the weights $W$ constrains the solution. Furthermore, we consider $\lambda$ to be a hyperparameter, that we the user set such that we can specify *how much* we care about the constraint $||W||^2_F$, with higher $\lambda$ having higher importance for this metric. Since now we have our function $h$ with $\lambda$ as a specified constant, $h$ has $n$ free parameters ($x \in \mathbb{R}^n$), we can restate the Lagrangian version of $h$ as follows: $$ h(x, \lambda) = f(x) + \lambda g(x) $$ Giving us that $$ \nabla_x h(x, \lambda) = \nabla_x f(x) + \lambda \nabla_x g(x) $$ For our L2 Ridge Regression Loss function, we care about the gradient produced by the loss $L(W,\lambda)$ since this is used to optimize the weights. Notice here the similarity of form to the above equation, where for $L(W, \lambda)$ we are optimizing $W$ with input $x$. $$ L(W, \lambda) = \|Wx - b\|^2 + \lambda \|W\|^2_F$$ <img src="/media/images/20260422_012240_Screenshot_2026-04-21_at_6.21.55_PM.png" alt="Screenshot 2026 04 21 at 6.21.55 PM" style="max-width: min(650px, 80%); height: auto;"> <center> $\textbf{Figure 2}$ Same functions $f,g$ as in $\textbf{Figure 1}$, however now $\lambda$ is a Specified hyperparameter, instead of a variable implicitly determined by constraint $k$. For different values of $\lambda$, level sets of $g(x) = ||x||^2$ with the optimal value specified by objective function $f(x) = ||x - x_{free}*||$, where $x_{free}*$ is the unconstrained optimal value. $\lambda = \infty$ corresponds to $h(x,\lambda) = g(x)$, $\lambda = 0$ corresponds to $h(x,\lambda) = f(x)$ as seen in diagram </center> As we can see in **Figure 2** with this formulation where $\lambda$ is specified as a hyperparameter, we can see that the optimal value of the function is reached when we are at the point closest to $x_{free}*$ (optimal unconstrained $f(x)$) with $\nabla_x f(x) = \nabla g(x)$. This directly relates to our Ridge Regression formulation, showing how the magnitude of $\lambda$ affects the strength of the regularizer. <img src="/media/images/20260422_014112_lagrange_ridge_regression_spin_thumbnail.gif" alt="lagrange ridge regression spin thumbnail" style="max-width: min(650px, 80%); height: auto;"> <center> $\textbf{Figure 3}$ Same functions as $\textbf{Figure 2}$ explicitly stating that it's L2 Ridge Regression, with the constraint boundary specified by constraint $k$ and not by $\lambda$. 3D Animation showing clearly what the Constrained Loss Landscape looks like </center> We can also in **Figure 3** (and the title picture) what the Loss Landscape looks like for L2 Ridge Regression, with constraint function $||w||^2 = k$. Note that here the constraint function is specified through $||w||^2 = k$ not through setting the value of $\lambda$. However, these are both equivalent statements of the same underlying equation. ## What is L2 Ridge Regression? Today, we'll be discussing the ubiquitous L2 Ridge Regression formula, where we have the ordinary least squares loss with a regularizer term that acts to minimize the parameter weights of the model using the Frobenius norm, i.e. $$ L(W, \lambda) = \underbrace{\|Wx - b\|^2}_{\text{Squared Error Loss}} + \lambda \underbrace{\|W\|^2_F}_{\text{L2 Regularization Term}}$$ - $L(W) \in \mathbb{R}$ is the scalar loss - $W \in \mathbb{R}^{m \times n}$ is our weight matrix - Assuming that this is a linear regression type model/one layer neural network - Idea typically extended to $\theta$ where $\theta$ is *all* of the parameters of the network (or some subset) - $b \in \mathbb{R}^m$ is our ground truth output for input $x$ - Often put as $y$, but wanted to explicitly connect to L2 Loss so used $b$ - $x \in \mathbb{R}^n$ is our input - $\lambda \in \mathbb{R}$ is a scalar that scales $||W||^2$ - How much weight do we give to the constraint to minimize the size of weights - Note $\lambda \in \mathbb{R}$ since we are taking the Frobenius norm of $W$, $||W||^2_F \in \mathbb{R}$ - $||W||^2_F \in \mathbb{R}$ is the Frobenius norm of $W$ - $||W||^2_F = vec(W)^T \cdot vec(W) \in \mathbb{R}$ - How large is the vectorized weight matrix $W$? For simplicity we show a single data point $x$, in practice we can extend this to datasets with $$ L(W, \lambda) = ||XW^T - B||^2 + \lambda ||W^T||^2_F$$ - $X \in \mathbb{R}^{d \times n}$, where we have $D$ data samples each of degree $n$ - From the $\|Wx - b\|^2$ case, $\forall x_i$, transpose and stack them - $W^T \in \mathbb{R}^{n \times m}$ which is the transpose of $W$ in the previous convention - $XW^T \in \mathbb{R}^{d \times m}$ producing $d$ vectors of size $\mathbb{R}^m$ - $B \in \mathbb{R}^{d \times m}$ stating our output is $d$ entries of size $\mathbb{R}^m$ Note that in statistics and data science the **row vector** convention is quite common, which is why we stated it this way here (although **I** prefer the **column vector** convention). ## Lagrange Multiplier Equivalent with Constant Constraint The Lagrange Multiplier method states that given we have - Optimization function $f: U \subset \mathbb{R}^n \rightarrow \mathbb{R}$ - Constraint function $g: U \subset \mathbb{R}^n \rightarrow \mathbb{R}$ - Input $x \in \mathbb{R}^n$ that is our input to $f,g$ - Constraint $c \in \mathbb{R}$ such that $g(x) = c$ - $g(x) = c$ defines a level set for $g$ such that all solutions must live on this level set - Lagrange Multiplier $\lambda \in \mathbb{R}$ is the factor by which $\nabla f(x_0) = \lambda \nabla g(x_0)$ for optimal $x_0$ If $f|S$ ($f$ restricted to set $S$) has a local extrema on $S$ at $x_0$, then there's a real number $\lambda$ such that the gradient of $f$ and $g$ are parallel, for $g$ defined as the level set at value $c$, i.e. $g(x) = c$. $$ \nabla f(x_0) = \lambda \nabla g(x_0) $$ The reason why the gradient of $f$ and $g$ must be parallel is because at a constrained optimum you cannot improve $f$ by moving along the constraint surface. The only direction left for $\nabla f$ is normal to the constraint surface $g(x) = c$. <img src="/media/images/20260421_235938_Screenshot_2026-04-21_at_4.57.21_PM.png" alt="Screenshot 2026 04 21 at 4.57.21 PM" style="max-width: min(650px, 80%); height: auto;"> <center> $\textbf{Figure 1}$ Lagrange Multiplier Condition showing $\nabla f(x_0) = \lambda g(x_0)$ at Minimum $f(x)$ on Constraint Surface $g(x) = k$. $f$ is a parabola centered at $x_{free}*$, which is the minimum point $g = ||x||^2 = k$ is the squared norm of $x$ equal to k, i.e. a circle with radius $\sqrt{k}$ centered at 0 </center> As we can see in **Figure 1** we have that if we are at the optimal point $x_0$ that the gradient of $f$ and $g$ are parallel since there is no component for which we can move along the constraint surface that will produce a smaller value. Note that for the suboptimal $x_1$ point chosen, we have that $\nabla f(x_1) \neq \lambda \nabla g(x_1)$, and as we can see by the green line, we can move in that direction along $g(x) = k$ for a smaller value on $f$. Using these conditions $\nabla f(x_0) = \lambda \nabla g(x_0)$ and $g(x) = c$, let's define function $h$ that combines these conditions. Mainly that $\nabla f(x_0) - \lambda \nabla g(x_0) = 0$ and that $g(x) = c$ $$ h(x, \lambda) = f(x) - \lambda (g(x) -c) $$ Note that we can also phrase it as $$ h(x, \lambda) = f(x) + \lambda (g(x) -c) $$ since $\lambda$ itself could be negative. We are going to stick with the latter convention. We can see that when we set $\nabla h(x) = 0$ that we find the extrema values of $f$ on constraint function $g(x) = c$ $$ \nabla_x h(x, \lambda) = \nabla_x f(x) + \lambda \nabla_x g(x) = 0 \rightarrow \nabla_x f(x) = \lambda \nabla_x g(x)$$ Note that since $\lambda$ is a parameter, that we must consider the equation $$g(x) = c$$ Since for $x \in \mathbb{R}^n$, $\nabla_x f = \lambda \nabla_x g$ gives us $n$ equations for $n$ parameters in $x$, however $h(x, \lambda)$ has $n+1$ free parameters, so we need $n+1$ equations to exactly solve for them all. ## Lagrange Multipliers $\rightarrow$ Lagrangian Now let's suppose that for our system of equations $$ h(x, \lambda) = f(x) + \lambda (g(x) -c) $$ Instead of setting the parameter $c$ and having $\lambda$ determined by it and the other parameters of $h$, we specify $\lambda$ and implicitly set $c$ from this. This is commonly done since it is much more convenient implementationally and pragmatically. Note that under convexity assumptions, the constrained and penalized formulations are equivalent up to a mapping between $c$ and $\lambda$. The objective of using the regularizer in the first place is to constrain the optimization of our loss function $h(x,\lambda)$, which in our L2 Ridge Regression loss is stated as $L(W,\lambda)$ such that the magnitude of the weights $W$ constrains the solution. Furthermore, we consider $\lambda$ to be a hyperparameter, that we the user set such that we can specify *how much* we care about the constraint $||W||^2_F$, with higher $\lambda$ having higher importance for this metric. Since now we have our function $h$ with $\lambda$ as a specified constant, $h$ has $n$ free parameters ($x \in \mathbb{R}^n$), we can restate the Lagrangian version of $h$ as follows: $$ h(x, \lambda) = f(x) + \lambda g(x) $$ Giving us that $$ \nabla_x h(x, \lambda) = \nabla_x f(x) + \lambda \nabla_x g(x) $$ For our L2 Ridge Regression Loss function, we care about the gradient produced by the loss $L(W,\lambda)$ since this is used to optimize the weights. Notice here the similarity of form to the above equation, where for $L(W, \lambda)$ we are optimizing $W$ with input $x$. $$ L(W, \lambda) = \|Wx - b\|^2 + \lambda \|W\|^2_F$$ <img src="/media/images/20260422_012240_Screenshot_2026-04-21_at_6.21.55_PM.png" alt="Screenshot 2026 04 21 at 6.21.55 PM" style="max-width: min(650px, 80%); height: auto;"> <center> $\textbf{Figure 2}$ Same functions $f,g$ as in $\textbf{Figure 1}$, however now $\lambda$ is a Specified hyperparameter, instead of a variable implicitly determined by constraint $k$. For different values of $\lambda$, level sets of $g(x) = ||x||^2$ with the optimal value specified by objective function $f(x) = ||x - x_{free}*||$, where $x_{free}*$ is the unconstrained optimal value. $\lambda = \infty$ corresponds to $h(x,\lambda) = g(x)$, $\lambda = 0$ corresponds to $h(x,\lambda) = f(x)$ as seen in diagram </center> As we can see in **Figure 2** with this formulation where $\lambda$ is specified as a hyperparameter, we can see that the optimal value of the function is reached when we are at the point closest to $x_{free}*$ (optimal unconstrained $f(x)$) with $\nabla_x f(x) = \nabla g(x)$. This directly relates to our Ridge Regression formulation, showing how the magnitude of $\lambda$ affects the strength of the regularizer. <img src="/media/images/20260422_014112_lagrange_ridge_regression_spin_thumbnail.gif" alt="lagrange ridge regression spin thumbnail" style="max-width: min(650px, 80%); height: auto;"> <center> $\textbf{Figure 3}$ Same functions as $\textbf{Figure 2}$ explicitly stating that it's L2 Ridge Regression, with the constraint boundary specified by constraint $k$ and not by $\lambda$. 3D Animation showing clearly what the Constrained Loss Landscape looks like </center> We can also in **Figure 3** (and the title picture) what the Loss Landscape looks like for L2 Ridge Regression, with constraint function $||w||^2 = k$. Note that here the constraint function is specified through $||w||^2 = k$ not through setting the value of $\lambda$. However, these are both equivalent statements of the same underlying equation. Comments (0) Please log in to comment. No comments yet. Be the first to comment! ← Back to Training
Comments (0)
Please log in to comment.
No comments yet. Be the first to comment!