On the other hand, in the presence of the spurious feature, the full model can fit the training data perfectly with a smaller norm by assigning weight \(1\) to the feature \(s\) (\(\|\theta^{\text{-s}}\|_2^2 = 4\) while \(\|\theta^{\text{+s}}\|_2^2 + w^2 = 2 < 4\)).
Generally, in the overparameterized regime, since the number of training examples is less than the number of features, there are some directions of data variation that are not observed in the training data. In this example, we do not observe any information about the second and third features. However, the non-zero weight for the spurious feature leads to a different assumption for the unseen directions. In particular, the full model does not assign weight \(0\) to the unseen directions. Indeed, by substituting \(s\) with \({\beta^\star}^\top z\), we can view the full model as not using \(s\) but implicitly assigning weight \(\beta^\star_2=2\) to the second feature and \(\beta^\star_3=-2\) to the third feature (unseen directions at training).
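The numbers quoted above can be reproduced with a small sketch. It assumes a single training point \(z = [1, 0, 0]\) with target \(y = 2\) and \(\beta^\star = [1, 2, -2]\); the first coordinate of \(\beta^\star\) and the helper names `dot` and `min_norm_fit` are assumptions made for illustration, chosen so that the norms and weights match the ones stated in the text.

```python
# Minimum-norm interpolation with one training example (illustrative sketch).
# Assumed setup: z = [1, 0, 0], y = 2, beta_star = [1, 2, -2].
# beta_star[0] = 1 is an assumption; only beta_star[1:] is quoted in the text.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def min_norm_fit(x, y):
    """Minimum-norm theta satisfying dot(theta, x) == y for a single example x."""
    scale = y / dot(x, x)
    return [scale * xi for xi in x]

z = [1.0, 0.0, 0.0]
y = 2.0
beta_star = [1.0, 2.0, -2.0]
s = dot(beta_star, z)  # value of the spurious feature on the training point

# Core model: uses z only.
theta_core = min_norm_fit(z, y)

# Full model: uses z with s appended as a fourth feature.
theta_full_aug = min_norm_fit(z + [s], y)
theta_full, w = theta_full_aug[:3], theta_full_aug[3]

# Substituting s = beta_star^T z gives the weights implicitly placed on z.
implied = [t + w * b for t, b in zip(theta_full, beta_star)]

print("core squared norm:", dot(theta_core, theta_core))
print("full squared norm:", dot(theta_full_aug, theta_full_aug))
print("weight on s:", w)
print("implied weights on z:", implied)
```

Under these assumptions the full model has the smaller norm, and its implied weights on the second and third features are non-zero even though neither feature varies in the training data.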
In this example, removing \(s\) reduces the error for a test distribution with high deviations from zero on the second feature, whereas removing \(s\) increases the error for a test distribution with high deviations from zero on the third feature.
As we saw in the previous example, by using the spurious feature, the full model incorporates \(\beta^\star\) into its estimate. The true target parameter (\(\theta^\star\)) and the true spurious feature parameter (\(\beta^\star\)) agree on some of the unseen directions and disagree on the others. Thus, depending on which unseen directions are weighted heavily at test time, removing \(s\) can increase or decrease the error.
More formally, the weight assigned to the spurious feature is proportional to the projection of \(\theta^\star\) on \(\beta^\star\) in the seen directions. If this number is closer to the projection of \(\theta^\star\) on \(\beta^\star\) in the unseen directions than \(0\) is, removing \(s\) increases the error; otherwise, removing \(s\) decreases it. Note that since we assume noiseless linear regression and choose models that fit the training data, the model predicts perfectly in the seen directions, and only variations in the unseen directions contribute to the error.
(Left) The projection of \(\theta^\star\) on \(\beta^\star\) is positive in the seen directions, but it is negative in the unseen directions; thus, removing \(s\) decreases the error. (Right) The projection of \(\theta^\star\) on \(\beta^\star\) is similar in both the seen and unseen directions; thus, removing \(s\) increases the error.
Let’s now formalize the conditions under which removing the spurious feature (\(s\)) increases the error. Let \(\Pi = Z^\top(ZZ^\top)^{-1}Z\) denote the projection onto the span of the training examples, i.e., the rows of \(Z\) (seen directions); thus \(I-\Pi\) is the projection onto the null space of the training data (unseen directions). The equation below determines when removing the spurious feature decreases the error.
The left side is the difference between the projection of \(\theta^\star\) on \(\beta^\star\) in the seen directions and their projection in the unseen directions, scaled by the test-time covariance. The right side is the difference between \(0\) (i.e., not using the spurious feature) and the projection of \(\theta^\star\) on \(\beta^\star\) in the unseen directions, scaled by the test-time covariance. Removing \(s\) helps if the left side is greater than the right side.
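The comparison described above can be checked numerically on a toy case. The sketch below is one way to write it under the noiseless, minimum-norm assumptions of this section, and it is not necessarily the exact normalization used in the formal statement. With \(u = (I-\Pi)\theta^\star\) and \(v = (I-\Pi)\beta^\star\), it takes the seen-direction quantity to be \(w = {\beta^\star}^\top \Pi \theta^\star / (1 + {\beta^\star}^\top \Pi \beta^\star)\) and the unseen-direction quantity to be \(c = u^\top \Sigma v / v^\top \Sigma v\), and compares \(|w - c|\) with \(|0 - c|\). The names `compare` and `project_seen`, the test covariances `sigma`, and the value of `theta_star` are all hypothetical choices for illustration; only the second and third entries of `beta_star` come from the text.

```python
# Toy check: "|w - c| > |c|" versus a direct comparison of test errors.
# All concrete values are illustrative assumptions, not taken from the article.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(m, v):
    return [dot(row, v) for row in m]

def project_seen(z, v):
    """Project v onto the span of a single training example z."""
    scale = dot(z, v) / dot(z, z)
    return [scale * zi for zi in z]

def compare(theta_star, beta_star, z, sigma):
    seen_theta = project_seen(z, theta_star)
    seen_beta = project_seen(z, beta_star)
    u = [a - b for a, b in zip(theta_star, seen_theta)]  # unseen part of theta_star
    v = [a - b for a, b in zip(beta_star, seen_beta)]    # unseen part of beta_star

    # Weight the min-norm full model puts on s (seen-direction quantity).
    w = dot(beta_star, seen_theta) / (1.0 + dot(beta_star, seen_beta))
    # Unseen-direction quantity, scaled by the test covariance.
    # Assumes v^T sigma v > 0; otherwise s cannot change the test error.
    c = dot(u, matvec(sigma, v)) / dot(v, matvec(sigma, v))

    resid = [ui - w * vi for ui, vi in zip(u, v)]
    return {
        "lhs": abs(w - c),
        "rhs": abs(0.0 - c),
        "err_without_s": dot(u, matvec(sigma, u)),
        "err_with_s": dot(resid, matvec(sigma, resid)),
    }

z = [1.0, 0.0, 0.0]
beta_star = [1.0, 2.0, -2.0]    # first entry is an assumption
theta_star = [2.0, 0.5, -2.5]   # hypothetical value

# Test distribution varying only in the second feature.
on_second = compare(theta_star, beta_star, z,
                    [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]])
# Test distribution varying only in the third feature.
on_third = compare(theta_star, beta_star, z,
                   [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]])

for name, r in (("second", on_second), ("third", on_third)):
    helps = r["lhs"] > r["rhs"]
    print(name, "removing s helps:", helps,
          "| error without s:", r["err_without_s"],
          "| error with s:", r["err_with_s"])
```

In both cases the inequality agrees with the direct error comparison: for this hypothetical \(\theta^\star\), removing \(s\) lowers the error when the test variation is on the second feature and raises it when the variation is on the third.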
While the theory applies only to linear models, we now show that in non-linear models trained on real-world datasets, removing a spurious feature reduces the accuracy and affects groups disproportionately.