
Measuring Space & Invariants (Part 2)

This series of posts covers different types of distances and their invariant properties.

Coordinate Frames Relative To Distributions

Now instead of the $xy$ plane transforming underneath points, let’s consider probability distributions moving beneath points. For example, let’s say we are modeling the spatial distribution of stars in a (2D) galaxy. We may want to model a galaxy where stars can stretch uniformly:

And another where stars on an ellipse can rotate:

We want a quantity that can measure distance with respect to the distribution of stars. That is:

  • If two points are located at opposite extremes of the galaxy, the distance between them should be large.
  • As the galaxy expands outwards, leaving those same two points relatively close to the center of the galaxy, the distance should update and become smaller.

Euclidean distance does not meet these criteria because it stays the same (it is independent of the distribution) as the distribution transforms.

Mahalanobis Distance

We can update our notion of distance to something called the Mahalanobis distance: \(\begin{equation} d_M(\vec x, \vec y; \mathcal D) = \sqrt{(\vec x - \vec y)^TS^{-1}(\vec x - \vec y)} \end{equation}\) where $S$ is the positive-definite covariance matrix of distribution $\mathcal D$. We can see that as the distribution transforms, the Mahalanobis distance updates to reflect the change:

Since $S^{-1}$ decomposes as $S^{-1} = W^TW$ (by the spectral theorem), we can relate this back to Euclidean distance: \(\begin{equation} d_M(\vec x, \vec y; \mathcal D) = ||W(\vec x - \vec y)|| \end{equation}\) This reveals why Mahalanobis distance works: it measures the Euclidean distance between two points with respect to the geometry of the distribution (as captured by $W$).
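
To make this concrete, here is a minimal NumPy sketch (with a hypothetical 2×2 covariance matrix $S$ and two arbitrary points) that computes the Mahalanobis distance both ways: directly from the quadratic form, and as a Euclidean norm after applying $W = S^{-1/2}$.

```python
import numpy as np

# Hypothetical covariance of the star distribution (positive definite).
S = np.array([[4.0, 1.2],
              [1.2, 1.0]])

# Two arbitrary points.
x = np.array([2.0, 0.5])
y = np.array([-1.0, 1.0])
diff = x - y

# Direct quadratic-form definition of the Mahalanobis distance.
d_direct = np.sqrt(diff @ np.linalg.inv(S) @ diff)

# Equivalent form: Euclidean distance after applying W, where W^T W = S^{-1}.
# Here W = S^{-1/2}, built from the eigendecomposition S = U D U^T.
evals, evecs = np.linalg.eigh(S)
W = evecs @ np.diag(evals ** -0.5) @ evecs.T

d_whitened = np.linalg.norm(W @ diff)

print(d_direct, d_whitened)  # the two values agree up to floating-point error
```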

Whitening Transformation

The whitening transformation generalizes Mahalanobis distance to a random vector $X$: \(\begin{equation} Y = W X \end{equation}\) If $X \sim \mathcal D(\vec 0, S)$, then $Y \sim \mathcal D(\vec 0, I)$ where $I$ is the identity matrix.

Starting with the definition of $Cov[Y]$:

\[\begin{flalign*} Cov[Y] &= Cov[WX]\\ &= W \, Cov[X] \, W^T\\ &= W S W^T \end{flalign*}\]

We can show that $W^T W = S^{-1}$ is sufficient for $Cov[Y] = I$ (assuming $W$ is invertible):

\[\begin{flalign*} W S W^T & = I \\ \iff W S W^T W &= W \\ \iff S W^T W &= I \\ \iff W^T W &= S^{-1} \end{flalign*}\]

Commonly, we will use $W = S^{-1/2}$, the inverse matrix square root of $S$, as the whitening matrix; this choice corresponds directly to the Mahalanobis distance. If we instead use the eigendecomposition of the positive-definite matrix $S = UDU^{T}$ and set $W = D^{-1/2}U^{T}$, the constraint is also satisfied:

\[\begin{flalign*} Cov[Y] &= W S W^T \\ &= (D^{-1/2}U^{T}) (UDU^T) (UD^{-1/2}) \\ &= D^{-1/2} (U^TU)D(U^TU)D^{-1/2} \\ &= D^{-1/2} D D^{-1/2} \\ &= I \end{flalign*}\]
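
As a quick numerical check (using a hypothetical covariance matrix $S$ and NumPy), the sketch below builds both whitening matrices, the Mahalanobis choice $W = S^{-1/2}$ and the eigendecomposition choice $W = D^{-1/2}U^T$, and verifies that whitened samples have approximately identity covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical covariance of the star distribution (positive definite).
S = np.array([[4.0, 1.2],
              [1.2, 1.0]])

# Eigendecomposition S = U D U^T.
d, U = np.linalg.eigh(S)

# Two valid whitening matrices (both satisfy W^T W = S^{-1}).
W_maha = U @ np.diag(d ** -0.5) @ U.T   # Mahalanobis whitening, W = S^{-1/2}
W_pca = np.diag(d ** -0.5) @ U.T        # eigendecomposition (PCA) whitening

# Sample X ~ N(0, S), whiten, and check the empirical covariance of Y = WX.
X = rng.multivariate_normal(mean=np.zeros(2), cov=S, size=100_000)
for W in (W_maha, W_pca):
    Y = X @ W.T                        # applies W to each sample (row)
    print(np.cov(Y, rowvar=False))     # approximately the identity matrix
```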

Indeed, we can rotate any of these solutions by an arbitrary orthogonal matrix $Q$:

\[\begin{flalign*} W^* &= QW \\ {W^*}^T W^* &= W^TQ^TQW \\ &= W^TW \\ &= S^{-1} \end{flalign*}\]

Or use the rotated whitening matrix in the Mahalanobis distance:

\[\begin{flalign*} ||W^*(\vec x - \vec y)||^2 &= (\vec x - \vec y)^TW^TQ^TQW(\vec x - \vec y) \\ &= (\vec x - \vec y)^TW^TW(\vec x - \vec y) \\ &= ||W(\vec x - \vec y)||^2 \end{flalign*}\]

and still achieve the desired result. See this paper for more choices of $W$ and a longer discussion of this rotational freedom in whitening.
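
To see this rotational freedom numerically, here is a short NumPy sketch (with a hypothetical covariance matrix and a random orthogonal $Q$ taken from a QR decomposition) confirming that $W^* = QW$ still satisfies ${W^*}^TW^* = S^{-1}$ and produces the same distance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical covariance matrix and Mahalanobis whitening W = S^{-1/2}.
S = np.array([[4.0, 1.2],
              [1.2, 1.0]])
d, U = np.linalg.eigh(S)
W = U @ np.diag(d ** -0.5) @ U.T

# A random orthogonal matrix Q from the QR decomposition of a random matrix.
Q, _ = np.linalg.qr(rng.standard_normal((2, 2)))
W_star = Q @ W

x, y = np.array([2.0, 0.5]), np.array([-1.0, 1.0])

print(np.allclose(W_star.T @ W_star, np.linalg.inv(S)))  # True
print(np.linalg.norm(W @ (x - y)),
      np.linalg.norm(W_star @ (x - y)))                  # equal distances
```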

For the rest of this post, I will use the Mahalanobis whitening, $W = S^{-1/2}$.

Regardless of which $W$ we choose, we can sphere the entire space (that is, transform it so the distribution has identity covariance) with respect to $W$ and then compute the Euclidean distance:

After completing the whitening transformation, the resulting Euclidean distance computations are invariant (and equal to $d_M$). Regardless of which distribution $\mathcal D$ we started from, $d_M$ is a scale-free measure of distance. This allows us to directly compare distances across distributions.
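
As a rough illustration of this scale-free property, the sketch below uses two hypothetical distributions of very different scales; a point one standard deviation from the center has Mahalanobis distance 1 under both, so the two numbers are directly comparable.

```python
import numpy as np

def mahalanobis(x, y, S):
    """Mahalanobis distance between x and y under covariance S."""
    diff = x - y
    return np.sqrt(diff @ np.linalg.inv(S) @ diff)

# Two hypothetical galaxies with very different scales.
S_small = np.diag([1.0, 0.5])
S_large = np.diag([100.0, 25.0])

origin = np.zeros(2)
p_small = np.array([1.0, 0.0])    # one standard deviation out under S_small
p_large = np.array([10.0, 0.0])   # one standard deviation out under S_large

# Both distances are 1: the measure is scale-free across distributions.
print(mahalanobis(p_small, origin, S_small),
      mahalanobis(p_large, origin, S_large))
```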

Linear Discriminant Analysis (LDA) and Gaussian Mixture Models (GMMs) use whitening to amortize repeated, expensive, high-dimensional distance calculations with respect to multivariate normal distributions. Other use cases include anomaly detection and image processing. See Section 4.3.2 of The Elements of Statistical Learning for more details on the connection between LDA and whitening transformations.

Whitening in Higher Dimensions

Whitening isn't limited to two dimensions or to distance calculations. For example, we can whiten a three-dimensional distribution and measure the invariant length of a ring:
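
As a rough numerical illustration, the sketch below stretches a ring in 3D with a hypothetical "galaxy" transform $A$ (so the transformed distribution has covariance $S = AA^T$) and then whitens with $W = S^{-1/2}$; since $WA$ is orthogonal, the ring's original length is recovered.

```python
import numpy as np

def ring_length(points):
    """Approximate the length of a closed polyline by summing segment lengths."""
    closed = np.vstack([points, points[:1]])
    return np.sum(np.linalg.norm(np.diff(closed, axis=0), axis=1))

# A unit circle in the xy-plane of 3D space, sampled as a polyline.
t = np.linspace(0.0, 2.0 * np.pi, 2_000, endpoint=False)
ring = np.stack([np.cos(t), np.sin(t), np.zeros_like(t)], axis=1)

# Hypothetical "galaxy" transform A; it stretches the distribution (and the
# ring along with it), giving covariance S = A A^T.
A = np.array([[3.0, 0.5, 0.0],
              [0.0, 1.0, 0.2],
              [0.0, 0.0, 0.5]])
S = A @ A.T

# Whitening matrix W = S^{-1/2}; W A is orthogonal, so lengths are restored.
d, U = np.linalg.eigh(S)
W = U @ np.diag(d ** -0.5) @ U.T

stretched_ring = ring @ A.T
whitened_ring = stretched_ring @ W.T

print(ring_length(ring))            # ~2*pi
print(ring_length(stretched_ring))  # changed by the stretch
print(ring_length(whitened_ring))   # ~2*pi again: the length is invariant
```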

Transforming entire spaces (not just distances or rings) will be an important feature of the next post on Spacetime intervals.
