In a previous post, we demonstrated that the usual least-squares cost function can fail horribly in the presence of heavy-tailed noise. We also saw that using the $\ell^1$ cost (or a smoothed-out version thereof) leads to much better results.

Intuitively, it is clear that $\ell^1$ is less sensitive to outliers than squared $\ell^2$. But why do we need linear growth of cost? Why is it not enough to have growth at, say, a power of $\frac{3}{2}$? In this post, we attempt to go into this in a little more depth.

First we’ll state the general problem, then review the analysis for least-squares regression, and finally look into the $\ell^1$ case.

Setup

Suppose we have an $n \times m$ matrix $X$ of training data (where $n$ is the number of samples and $m$ is the number of features) and a target vector $y$ of $n$ observations. We believe $y$ obeys

$$y = X \beta^\ast + \xi,$$

where $\beta^\ast$ is an $m$-dimensional vector and $\xi$ is noise.

The goal is to find the ground truth $\beta^\ast$.

In this post, we will analyze the following ansatz: we take as our estimate the $\beta$ that minimizes

$$\sum_{j=1}^{n} F\Big( \big( y_j - (X\beta)_j \big)^2 \Big),$$

where $F : \mathbb{R}_+ \to \mathbb{R}$ is everywhere differentiable and strictly monotonically increasing: $F' > 0$. (For least-squares regression, we simply have $F(z) = z$. For $\ell^1$ minimization, $F(z) = \sqrt{z}$.) Setting the gradient of the above expression to zero and noting that the residuals $r_j = y_j - (X\beta)_j$ satisfy

$$r = y - X\beta = X(\beta^\ast - \beta) + \xi,$$

we arrive after a little rearranging at the equation

\begin{equation}\label{key-eq}
\sum_{j=1}^{n} X_{jk}\, F'\big(r_j^2\big)\, \big(X(\beta - \beta^\ast)\big)_j \;=\; \sum_{j=1}^{n} X_{jk}\, F'\big(r_j^2\big)\, \xi_j,
\qquad k = 1, \dotsc, m.
\end{equation}
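For concreteness, here is a minimal numerical sketch of this ansatz using scipy.optimize.minimize. This is not the code from the earlier posts; the helper fit and the smoothed $\ell^1$ choice of $F$ below are illustrative assumptions:

import numpy as np
from scipy.optimize import minimize

def fit(X, y, F, beta0=None):
    # minimize sum_j F((y_j - (X beta)_j)^2) over beta
    n, m = X.shape
    if beta0 is None:
        beta0 = np.zeros(m)

    def cost(beta):
        r = y - np.dot(X, beta)
        return np.sum(F(r * r))

    return minimize(cost, beta0).x

F_least_squares = lambda z: z                        # F(z) = z
F_l1_smoothed   = lambda z: np.sqrt(z + 1e-8)        # smoothed version of F(z) = sqrt(z)

For the least-squares choice of $F$ this reproduces ordinary linear regression (up to optimizer tolerance).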
The squared $\ell^2$ cost function

Let us first consider least-squares regression, i.e. the case $F(z) = z$. Then $F'(z) = 1$ and equation $\eqref{key-eq}$ simplifies to

$$X^T X\, (\beta - \beta^\ast) = X^T \xi.$$
To ease analysis of this equation, we first note that, since $X^T X$ is symmetric and positive semi-definite, we can use an orthogonal change of basis in the feature space $\mathbb{R}^m$ to diagonalize it.

More explicitly, $X^T X$ has a singular-value decomposition of the form

$$X^T X = \mathcal{O}\, D\, \mathcal{O}^T$$

for some diagonal matrix $D$ with non-negative entries and some orthogonal $m \times m$ matrix $\mathcal{O}$. Thus, replacing $X$ by $X \mathcal{O}$ (and, correspondingly, $\beta$ and $\beta^\ast$ by $\mathcal{O}^T \beta$ and $\mathcal{O}^T \beta^\ast$), we have

$$X^T X = D = \operatorname{diag}\big(\lambda_1^2, \dotsc, \lambda_m^2\big)$$

for some real values $\lambda_1, \dotsc, \lambda_m$, and so $\eqref{key-eq}$ becomes simply

$$\lambda_k^2\, \big(\beta_k - \beta_k^\ast\big) = \sum_{j=1}^{n} X_{jk}\, \xi_j, \qquad k = 1, \dotsc, m.$$
Assuming $\lambda_k^2 > 0$ for all $k$, we can further simplify. Note that

$$\lambda_k^2 = \big(X^T X\big)_{kk} = \sum_{j=1}^{n} X_{jk}^2 = ||X_{\cdot k}||_{\ell^2}^2.$$

So if we $\ell^2$-normalize the $m$ feature vectors $X_{\cdot k}$ (which rescales $\beta_k$ and $\beta_k^\ast$ by $\lambda_k$), we have the simple relationship

\begin{equation}\label{mielc}
\beta_k - \beta_k^\ast = \sum_{j=1}^{n} X_{jk}\, \xi_j, \qquad k = 1, \dotsc, m.
\end{equation}
Just to be clear: this simple form is valid only after the following two preprocessing steps (described above, and they do not commute!):

  1. Rotate feature space to make $X^T X$ diagonal.
  2. $\ell^2$-normalize the columns of $X$.

In Python, these steps look something like this:

import numpy as np
from scipy import linalg

def normalize(X):
    # step 1: rotate feature space to diagonalize X^T X
    # (for the symmetric matrix X^T X, the SVD yields the orthogonal matrix O
    # of eigenvectors; the singular values themselves are not needed here)
    O, _, _ = linalg.svd(np.dot(X.T, X))
    X = np.dot(X, O)

    # step 2: l2-normalize the columns of X
    X = X / np.sqrt(np.sum(X * X, axis=0))

    return X
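A quick sanity check (assuming the imports and normalize above): after both steps, $X^T X$ should be the identity.

X = np.random.randn(100, 5)
X = normalize(X)
print(np.allclose(np.dot(X.T, X), np.eye(X.shape[1])))   # True (up to rounding)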

Equation $\eqref{mielc}$ relates the noise $\xi$ to how far our recovered $\beta$ is from the ground truth $\beta^\ast$. Note that if the components of $\xi$ are mean-zero random variables, then taking the expectation of this equation gives $\mathbb{E}[\beta] = \beta^\ast$, so the minimizer $\beta$ is an unbiased estimator of $\beta^\ast$.
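A small simulation is consistent with this (again using the code above; the particular numbers are arbitrary):

rng = np.random.default_rng(0)
n, m = 50, 3
X = normalize(rng.standard_normal((n, m)))
beta_star = np.array([1.0, -2.0, 0.5])
estimates = []
for _ in range(2000):
    xi = rng.standard_normal(n)       # mean-zero noise
    y = np.dot(X, beta_star) + xi
    estimates.append(np.dot(X.T, y))  # the least-squares minimizer, since X^T X = Id
print(np.mean(estimates, axis=0))     # close to beta_star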

Assume $n \geq m$. Then if $X^T X = \mathrm{Id}$, the matrix $X X^T$ is simply the orthogonal projection onto the linear space spanned by the columns of $X$. Thus, we have

$$||\beta - \beta^\ast||_{\ell^2} = ||X^T \xi||_{\ell^2} = ||X X^T \xi||_{\ell^2} \leq ||\xi||_{\ell^2},$$

with equality iff $\xi$ lies in the column space of $X$.

This tells us how to adversarially pick $\xi$ so as to maximize the error in the recovery of $\beta$, and this worst case is always a possibility. Probabilistically, however, we would typically expect the error to be smaller. For example, if the entries of $\xi$ are i.i.d. normal random variables with mean zero, then $\xi$ has no preferred direction. If the variance of the random variables is $\sigma^2$, then the expected (squared) magnitude of the projection onto the $m$-dimensional column space is just

$$\mathbb{E}\, ||X X^T \xi||_{\ell^2}^2 = \frac{m}{n}\, \mathbb{E}\, ||\xi||_{\ell^2}^2 = m\, \sigma^2,$$

and hence

$$\mathbb{E}\, ||\beta - \beta^\ast||_{\ell^2}^2 = \frac{m}{n}\, \mathbb{E}\, ||\xi||_{\ell^2}^2.$$

This is one version of the familiar result that the error, relative to the overall size of the noise $||\xi||_{\ell^2} \approx \sigma \sqrt{n}$, decreases like $\frac{1}{\sqrt{n}}$ as we collect more samples.
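As a rough empirical check (working here with raw, unnormalized Gaussian features, whose column norms grow like $\sqrt{n}$, so the absolute error shrinks like $1/\sqrt{n}$; np.linalg.lstsq is just a stand-in for the least-squares minimization):

rng = np.random.default_rng(1)
m, sigma = 3, 1.0
beta_star = np.array([1.0, -2.0, 0.5])
for n in [100, 1000, 10000]:
    X = rng.standard_normal((n, m))
    y = np.dot(X, beta_star) + sigma * rng.standard_normal(n)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    print(n, np.linalg.norm(beta - beta_star))   # shrinks roughly like 1/sqrt(n)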

However, note that we assumed here finite variance: $\sigma^2 < \infty$. For heavy-tailed noise all bets are off. As we saw in a previous post, Cauchy noise indeed wreaks havoc with least-squares regression.

Aside 1: Regularization and well-posedness

From the formula

\begin{equation}\label{jecio}
X^T X\, (\beta - \beta^\ast) = X^T \xi
\end{equation}

we have

$$\beta = \beta^\ast + \big(X^T X\big)^{+}\, X^T \xi,$$

where $(\cdot)^+$ denotes the pseudo-inverse of the matrix. (Theoretically, if $X^T X$ had strictly positive eigenvalues, we could just take the regular inverse of the matrix. Numerically, problems arise if any of the $\lambda_k^2$'s is even just close to zero.)

One purpose of regularization is to avoid the problem of small eigenvalues. If we add a regularization term $C\, ||\beta||_{\ell^2}^2$ to our minimization problem, then equation $\eqref{jecio}$ instead becomes

$$\big(X^T X + C\, \mathrm{Id}\big)\, (\beta - \beta^\ast) = X^T \xi - C\, \beta^\ast.$$

The eigenvalues of $X^T X + C\, \mathrm{Id}$ are going to be at least $C > 0$, and so we can comfortably take the inverse:

$$\beta = \beta^\ast + \big(X^T X + C\, \mathrm{Id}\big)^{-1}\, \big(X^T \xi - C\, \beta^\ast\big).$$
Mathematically, the problem is ill-posed if any of the eigenvalues of $X^T X$ vanishes. The pseudo-inverse singles out one particular solution (the one of minimal norm), but in principle many other solutions exist, too. Adding some regularization guarantees that all the relevant eigenvalues will be positive and hence that the problem will be well-posed.
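In code, the regularized problem is just a well-conditioned linear solve. Here is a minimal sketch (the name ridge_fit is mine; numpy is assumed imported as above):

def ridge_fit(X, y, C):
    # solve (X^T X + C Id) beta = X^T y
    m = X.shape[1]
    A = np.dot(X.T, X) + C * np.eye(m)
    return np.linalg.solve(A, np.dot(X.T, y))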

Aside 2: Regularization and the bias-variance tradeoff

Let us again transform $X$ as we did for equation $\eqref{mielc}$ above and then add a regularization term $C\, ||\beta||_{\ell^2}^2$ to our minimization problem. Then equation $\eqref{mielc}$ becomes

$$\beta_k - \beta_k^\ast + C\, \beta_k = \sum_{j=1}^{n} X_{jk}\, \xi_j.$$

Subtracting $C\, \beta_k^\ast$ from both sides and dividing by $1 + C$, this says

$$\beta_k - \beta_k^\ast = \frac{\sum_{j=1}^{n} X_{jk}\, \xi_j - C\, \beta_k^\ast}{1 + C}.$$
Dividing by the factor $1 + C$ reduces the variance for $C > 0$, but at the expense of adding a bias.
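Concretely, taking the expectation for mean-zero noise isolates the bias, while the fluctuating part is scaled down by $1 + C$:

$$\mathbb{E}[\beta_k] - \beta_k^\ast = -\frac{C}{1 + C}\, \beta_k^\ast, \qquad \beta_k - \mathbb{E}[\beta_k] = \frac{\sum_j X_{jk}\, \xi_j}{1 + C},$$

so for i.i.d. noise of variance $\sigma^2$ (and unit-norm columns) the variance of each $\beta_k$ drops from $\sigma^2$ to $\frac{\sigma^2}{(1+C)^2}$, at the price of a bias that grows with $C$.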

The $\ell^1$ cost function

Intuitively, it is clear that a linearly growing cost function should be less sensitive to outliers than a quadratically growing one: under the quadratic cost, outliers literally cost much more, and so they dominate the fit. But how exactly does this work out, and why do we need linear growth as opposed to, say, just growth at a power of $\frac{3}{2}$? I'll try to shed some light on this here.

First let us return briefly to the case of general $F$. Recall equation $\eqref{key-eq}$, which tells us that the minimizer $\beta$ satisfies

\begin{equation}\label{key-eq-restated}
\sum_{j=1}^{n} X_{jk}\, F'\big(r_j^2\big)\, \big(X(\beta - \beta^\ast)\big)_j \;=\; \sum_{j=1}^{n} X_{jk}\, F'\big(r_j^2\big)\, \xi_j,
\qquad k = 1, \dotsc, m.
\end{equation}
In the least-squares case, we simplified the analysis by transforming $X$ so as to make $X^T X = \mathrm{Id}$ hold. In the more general case, we note that the matrix $A$ with entries

$$A_{kl} = \sum_{j=1}^{n} X_{jk}\, F'\big(r_j^2\big)\, X_{jl}$$

is still positive-definite due to our assumption that $F$ is increasing, i.e. $F' > 0$. So in principle we could do similar transformations as before to turn $A$ into the identity matrix. Unfortunately, these transformations now depend on the residuals

$$r_j = y_j - (X\beta)_j = \big(X(\beta^\ast - \beta)\big)_j + \xi_j,$$

which makes the analysis less useful.

Instead, let us focus for now just on the right-hand side of equation $\eqref{key-eq-restated}$. In the case of the $\ell^1$ cost, we can take $F(z) = 2\sqrt{z}$ (the constant factor does not affect the minimizer) and hence $F'(z) = \frac{1}{\sqrt{z}}$. Thus, the right-hand side is

$$\sum_{j=1}^{n} X_{jk}\, \frac{\xi_j}{|r_j|}.$$

Here finally we can see why the $\ell^1$ cost function deals better with outliers: when $|\xi_j|$ is very large, we expect $|r_j| \sim |\xi_j|$, and hence

$$\Big| X_{jk}\, \frac{\xi_j}{|r_j|} \Big| \sim |X_{jk}|,$$

which is much smaller than the potential contribution $X_{jk}\, \xi_j$ we would have seen in the least-squares case. The cancellation is not exact, but in practice it is usually enough to make $\ell^1$ regression work in the presence of severe outliers.

Note that we just needed that $|F'(r^2)| \sim \frac{1}{|r|}$ for large $|r|$. This is true for smoothed-out versions of the $\ell^1$ cost function as well. All we require is that the cost function be asymptotically linear.

Finally, note that for cost functions that asymptotically grow as some power $p > 1$ of the residual, we have $|F'(r^2)| \sim |r|^{p-2}$, and hence for a large outlier $\xi_j$ we would have a term like

$$\big| X_{jk}\, F'\big(r_j^2\big)\, \xi_j \big| \sim |X_{jk}|\, |\xi_j|^{p-1}$$

on the right-hand side of equation $\eqref{key-eq-restated}$. Compared to least squares, the influence of the outlier is certainly reduced for $1 < p < 2$, but only at $p = 1$ is its influence (approximately) bounded.
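A tiny numerical illustration of this comparison (the function name influence and the sample values are my own, for illustration): for a cost growing like $|r|^p$, the contribution of a residual to the gradient scales like $|r|^{p-1}$, so only $p = 1$ keeps huge outliers on a leash.

import numpy as np

def influence(r, p):
    # for a cost ~ |r|^p we have F(z) ~ z^(p/2), so F'(r^2) * r ~ sign(r) * |r|^(p-1)
    return np.sign(r) * np.abs(r) ** (p - 1)

for outlier in [10.0, 100.0, 1000.0]:
    print(outlier, [float(influence(outlier, p)) for p in (1.0, 1.5, 2.0)])
# p = 1 stays at 1, p = 1.5 grows like sqrt(|r|), p = 2 grows like |r|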

While this analysis is not completely rigorous, I hope it gives a better sense of just why the $\ell^1$ cost works so well for heavy-tailed noise and extreme outliers.