Why is there a projection in linear regression?
** this blog is still a work in progress - thanks to Quang-Vinh Do for the edits **
In classic linear regression, we are given a matrix $X$, where the rows are observations and the columns are features. We also have a column of responses $y$.
Our goal is to estimate $\beta$ so that $X \beta$ most closely approximates $y$. We are told that the squared difference between $X \beta$ and $y$ is minimized when we set:
\[\beta = (X^{T} X)^{-1}X^{T}y\]… a formula that feels intimidating, and seems to come from nowhere. (We could bash it out by taking the derivative of the loss function minimizing mean-squared error $(y - X \beta) ^T(y-X \beta)$, but that is not very insightful.)
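As a sanity check, here is a minimal numpy sketch (the data is synthetic and the variable names are mine): the closed-form $\beta$ matches what a standard least-squares solver returns.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 observations, 3 features
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Closed-form solution: beta = (X^T X)^{-1} X^T y
beta = np.linalg.inv(X.T @ X) @ X.T @ y

# Compare against numpy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta, beta_lstsq))   # expect True
```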
At a high level, the formula does feel reasonable: we would indeed expect $\beta$ to depend on both $X$ and $y$. Though if we view matrices as linear transformations, it is not obvious why we can get $\beta$ by transforming $y$.
The $\beta$ formula also resembles that of a projection matrix $P = A(A^T A)^{-1}A^T$ where $Pb$ would project $b$ into the space spanned by $A$. It doesn’t match completely, but matches enough for us to suspect there is a projection hidden somewhere.
The projection formula itself also doesn’t feel that approachable, though we do get some intuition from
\[\require{cancel} P^2 = A\cancel{(A^T A)^{-1}}\cancel{A^T A}(A^T A)^{-1}A^T = P\], since re-projecting into the same space shouldn’t do anything additional.
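That idempotence is easy to confirm numerically; here is a small sketch with an arbitrary tall matrix $A$ (the data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 2))            # tall matrix with independent columns

# Projection onto the space spanned by the columns of A
P = A @ np.linalg.inv(A.T @ A) @ A.T

# Projecting twice is the same as projecting once: P @ P == P
print(np.allclose(P @ P, P))           # expect True
```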
The goal here is to connect these two formulas, and build intuition on where they came from.
First, where is it that we are doing a projection when doing linear regression?
- To me, projection suggests taking a vector that can’t be expressed in some subspace and finding the closest approximation expressible in that subspace.
- In linear regression, it is generally not possible to find a $\beta$, so that $X \beta$ is exactly $y$ (further explained in appendix).
- Instead, we have to settle on $X\beta$ being equal to some $\hat{y}$ that is as close to $y$ as possible.
- This is the same as saying we want to project $y$ into the space that is spanned by the columns of $X$.
- Further rephrased: we are constrained to using linear combinations of the features, while trying to get as close to $y$ as possible.
Let $\hat{y}$ be that closest approximation of $y$, and let $y_{err}$ be the portion of $y$ that can’t be represented in $X$ (and is therefore perpendicular to $X$). We have:
\[y = \hat{y} + y_{err}\]Since $y_{err}$ is outside of the space spanned by $X$’s columns, its dot product with every column of $X$ is going to be 0. We can write that as:
\[X^{T} y_{err} = 0\]If somehow, we can determine the value of $\hat{y}$ or $y_{err}$, then it is pretty straightforward to work backward and find $\beta$. To that end, let’s take advantage of the property above, and get:
\[\require{cancel} X^T y = X^T(\hat{y} + y_{err}) = X^T \hat{y} + \cancel{X^T y_{err}} = X^T \hat{y}\]If we could just get the inverse of $X^T$, then we’d be able to isolate $\hat{y}$. However, that inverse does not exist. Because there are many more columns in $X^T$ than there are rows, applying $X^T$ to a vector maps it from a higher-dimensional space to a lower-dimensional one. Said another way, $X^T$ maps many different inputs to the same output (namely, $y$ and $\hat{y}$ themselves, as the equation above shows).
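This orthogonality is easy to verify numerically. A small sketch (jumping ahead and using numpy's least-squares solver on illustrative random data): the residual of the fit has zero dot product with every column of $X$.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
y_err = y - y_hat

# The residual is perpendicular to every column of X: X^T y_err ≈ 0
print(np.allclose(X.T @ y_err, 0))     # expect True
```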
We will have to look elsewhere for an isolating inverse. To do that, let’s ponder on the properties of $\hat{y}$. We know $\hat{y}$ can be represented as some linear combination of the columns of $X$, and so there must exist some $\beta$ where $\hat{y} = X\beta$. We express this as:
\[X^Ty = X^T X \beta\]From there, we note: $X^TX$ is a symmetric square matrix that is generally invertible. This is because the columns of $X$ are generally linearly independent (in appendix), and through that independence, we can show $X^TX$ is also invertible (also in appendix).
Using that inverse, we can get:
\[\require{cancel} (X^TX)^{-1}X^Ty = \cancel{(X^TX)^{-1}}\cancel{ X^T X} \beta = \beta\]and to get $\hat{y}$ from there, we just multiply $\beta$ with $X$ to get:
\[\hat{y} = X\beta = X(X^TX)^{-1}X^Ty\]We can see that $X(X^TX)^{-1}X^T$ is in the format of the previously mentioned projection matrix. It serves to project $y$ into the space spanned by the columns of $X$.
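To see the two routes agree, here is a small numerical sketch (random data for illustration): applying $X(X^TX)^{-1}X^T$ to $y$ gives the same fitted values as solving for $\beta$ first and then multiplying by $X$.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))
y = rng.normal(size=30)

# Route 1: apply the projection matrix directly to y
P = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat_proj = P @ y

# Route 2: solve for beta first, then multiply by X
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(y_hat_proj, X @ beta))  # expect True
```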
Hopefully, the formula for $\beta$ now feels a bit less arbitrary, but I have to say all these manipulations make me feel a bit uncomfortable - it all seems somewhat circular, and as though we are just messing around with definitions.
We substituted $\hat{y}$ with $X\beta$, and then we inverted the $X$ away; yet at the end, we multiply the $X$ back to get back our $\hat{y}$, and all that to sidestep the fact that $X^T$ is not invertible…
It might help to look at this from a different perspective. What is it that we are doing when we apply a projection matrix to $y$ in order to get to $\hat{y}$?
There is room to expand on the full intuition, and we glossed a bit over the role of $X^T X$ (the Gram matrix), whose invertibility was what motivated our algebraic manipulations. The Gram matrix is an interesting matrix, since if $X$ is centered, it is proportional to the covariance matrix of the feature columns: $x_1, x_2, …, x_n$.
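A quick numerical check of that claim (synthetic data; note that numpy's `np.cov` uses the $n-1$ normalization):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
Xc = X - X.mean(axis=0)                 # center each feature column

# For centered X, the Gram matrix is (n - 1) times the sample covariance
gram = Xc.T @ Xc
cov = np.cov(X, rowvar=False)           # sample covariance of the columns
print(np.allclose(gram / (X.shape[0] - 1), cov))  # expect True
```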
In a future blog, I hope to expand on the connection between projections and covariances.
Why is it generally not possible to find a $\beta$ so that $X\beta$ is exactly $y$?
If we increase the number of features (columns in the matrix), the task intuitively feels easier, because we can be more expressive using the larger number of vectors available in the space. Conversely, if we increase the number of rows (i.e., get more data), we make the task harder. That claim feels counterintuitive at first, since it feels like it’d be easier to do regression well and get a better $\beta$ estimate with more data. However, having more data makes it more difficult to fit each observation perfectly, given that the only dials we have are the values we choose for $\beta$. We can think of adding each new row or observation as stacking on an additional constraint, making it harder to get an exact match. Generally, the number of rows far exceeds the number of columns, and it is thus impossible to get to $y$ perfectly through a linear combination of the columns of $X$.
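A minimal numpy sketch of this (the matrices and seed are illustrative): with a square system we can generically match $y$ exactly, while a tall system leaves leftover error no matter what $\beta$ we pick.

```python
import numpy as np

rng = np.random.default_rng(5)

# Square system: as many equations as unknowns -- an exact solve generically exists
X_sq = rng.normal(size=(3, 3))
y_sq = rng.normal(size=3)
print(np.allclose(X_sq @ np.linalg.solve(X_sq, y_sq), y_sq))  # expect True

# Tall system: more observations than features -- only an approximation is possible
X_tall = rng.normal(size=(50, 3))
y_tall = rng.normal(size=50)
beta, residuals, *_ = np.linalg.lstsq(X_tall, y_tall, rcond=None)
print(residuals[0] > 0)                 # leftover squared error is strictly positive
```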
Why are the columns of $X$ linearly independent?
In the linear regression world, linear dependence would mean one of the features is a total duplicate of another. It is not something that should happen, since (1) it would be sensible to remove the redundant feature before regressing, and (2) when we have many observations, even a small amount of noise will prevent two columns from matching entirely (though highly correlated features are nonetheless troublesome and unstable).
Why does the linear independence of $X$ imply $X^TX$ is invertible?
A consequence of linear independence is that for any vector $u$, $Xu = 0$ only if $u = 0$.
Now, suppose $X^TX$ is not invertible. Then there exists a $w \ne 0$ such that $X^TXw = 0$ (a null space containing only the zero vector would imply invertibility), which would also mean that $w^T (X^TXw) = w^T(0) = 0$ for a non-zero $w$.
Yet $w^T(X^TX)w = (Xw)^T(Xw) = ||Xw||^2$, which is 0 only if $w = 0$. We have a contradiction, so $X^TX$ must be invertible.
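The key identity can also be checked numerically (illustrative random data): for a generic $X$ and a non-zero $w$, the quadratic form $w^T(X^TX)w$ equals $||Xw||^2$ and is strictly positive.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(40, 3))           # columns are generically independent
w = rng.normal(size=3)                 # an arbitrary non-zero vector

# w^T (X^T X) w equals ||Xw||^2, which is strictly positive for w != 0,
# so X^T X is positive definite and hence invertible
quad = w @ (X.T @ X) @ w
print(np.allclose(quad, np.linalg.norm(X @ w) ** 2))  # expect True
print(quad > 0)                        # expect True
```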