Regression Diagnostics

Another old robust method is the so-called regression diagnostics. It tries to iteratively detect possibly wrong data and reject them through analysis of the globally fitted model. The classical approach works as follows:

1. Determine an initial fit to the whole set of data through least squares.
2. Compute the residual for each datum.
3. Reject all data whose residuals exceed a predetermined threshold; if no data have been removed, then stop.
4. Determine a new fit to the remaining data, and go to step 2.
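
The classical loop above can be sketched for the simplest case of fitting a line y = a*x + b. This is only an illustration under stated assumptions: the line model, the names `fit_line` and `reject_by_residual`, and the `min_points` safeguard are hypothetical and do not come from the text, and the x values are assumed not to be all identical.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    a = sxy / sxx  # assumes the x values are not all equal
    return a, my - a * mx


def reject_by_residual(points, threshold, min_points=3):
    """Refit and reject until no residual exceeds `threshold`.

    `min_points` is an added safeguard (not in the text) that stops the
    loop before too few data remain to fit the model.
    """
    kept = list(points)
    while True:
        a, b = fit_line(kept)  # step 1 on the first pass, step 4 afterwards
        # Steps 2 and 3: compute residuals and keep only the data within the threshold.
        inliers = [(x, y) for x, y in kept if abs(y - (a * x + b)) <= threshold]
        if len(inliers) == len(kept) or len(inliers) < min_points:
            return (a, b), kept  # nothing removed, or too few data left
        kept = inliers


if __name__ == "__main__":
    data = [(x, 2 * x + 1) for x in range(10)]
    data[5] = (5, 19)  # one gross error
    (a, b), kept = reject_by_residual(data, threshold=2.0)
    print(a, b, len(kept))
```

Note that every pass starts from a least-squares fit to all data still kept, so a poor initial fit can cause good data to be rejected along with the bad.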

Clearly, the success of this method depends heavily on the quality of the initial fit. If the initial fit is very poor, the residuals computed from it are meaningless, and so is any diagnosis based on them for outlier rejection. As pointed out by Barnett and Lewis, with least-squares techniques even one or two outliers in a large set can wreak havoc! This technique therefore does not guarantee a correct solution. However, experience has shown that it works well for problems with a moderate percentage of outliers and, more importantly, with outliers whose gross errors are smaller than the size of the good data.

The threshold on residuals can be chosen from experience, for example with graphical methods (plotting the residuals at different scales). A better approach is to use an a priori statistical noise model of the data together with a chosen confidence level. Let $r_i$ be the residual of the $i$-th datum and $\sigma_i^2$ the predicted variance of the $i$-th residual, based on the characteristics of the data noise and the fit; the standard test statistic $e_i = r_i/\sigma_i$ can then be used. If $e_i$ is not acceptable, the corresponding datum should be rejected.
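
As a rough sketch of such a test, each residual can be divided by its predicted standard deviation and compared with a two-sided quantile of the standard normal distribution. The Gaussian noise model is an assumption made here for illustration (the text only asks for an a priori noise model and a confidence level), and the function name `standardized_outliers` is hypothetical.

```python
from statistics import NormalDist


def standardized_outliers(residuals, sigmas, confidence=0.95):
    """Return indices of data whose standardized residual is not acceptable.

    Assumes zero-mean Gaussian residuals with known standard deviations
    `sigmas`; this noise model is an illustrative assumption.
    """
    # Two-sided acceptance bound, e.g. about 1.96 for confidence = 0.95.
    k = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return [
        i
        for i, (r, s) in enumerate(zip(residuals, sigmas))
        if abs(r / s) > k
    ]


if __name__ == "__main__":
    print(standardized_outliers([0.1, -0.3, 2.5, 0.2], [1.0, 1.0, 1.0, 1.0]))
```

The same residual can be accepted or rejected depending on its predicted variance, which is what distinguishes this test from a single fixed threshold.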

One improvement to the above technique uses influence measures to pinpoint potential outliers. These measures assess the extent to which a particular datum influences the fit by determining the change in the solution when that datum is omitted. The refined technique works as follows:

1. Determine an initial fit to the whole set of data through least squares.
2. Conduct a statistical test of whether the measure of fit $f$ (e.g. the sum of squared residuals) is acceptable; if it is, then stop.
3. For each datum $i$, delete it from the data set and determine the new fit, which gives a measure of fit denoted by $f_i$. Hence determine the change in the measure of fit, $\Delta f_i = f - f_i$, when datum $i$ is deleted.
4. Delete the datum $i$ for which $\Delta f_i$ is the largest, and go to step 2.
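
The refined procedure can be sketched in the same line-fitting setting. The assumptions here are loud ones: the line model, the names `fit_line`, `sse` and `reject_by_influence`, the `min_points` safeguard, and above all the plain comparison of the sum of squared residuals with a fixed bound `acceptable_sse`, which merely stands in for the statistical test of step 2.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    a = sxy / sxx  # assumes the x values are not all equal
    return a, my - a * mx


def sse(points, a, b):
    """Measure of fit: sum of squared residuals."""
    return sum((y - (a * x + b)) ** 2 for x, y in points)


def reject_by_influence(points, acceptable_sse, min_points=3):
    """Delete, one at a time, the datum whose removal improves the fit most."""
    kept = list(points)
    while len(kept) > min_points:
        a, b = fit_line(kept)
        f = sse(kept, a, b)
        if f <= acceptable_sse:  # stand-in for the test in step 2
            break
        # Step 3: change in the measure of fit when each datum is left out.
        changes = []
        for i in range(len(kept)):
            rest = kept[:i] + kept[i + 1:]
            a_i, b_i = fit_line(rest)
            changes.append(f - sse(rest, a_i, b_i))
        # Step 4: delete the datum with the largest change, then repeat.
        worst = max(range(len(kept)), key=changes.__getitem__)
        del kept[worst]
    return fit_line(kept), kept


if __name__ == "__main__":
    data = [(x, 2 * x + 1) for x in range(10)]
    data[5] = (5, 19)  # one gross error
    (a, b), kept = reject_by_influence(data, acceptable_sse=1.0)
    print(a, b, len(kept))
```

This version refits the model once per remaining datum on every pass, so it costs considerably more than thresholding residuals; it removes only one datum per pass.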

It can be shown [23] that the above two techniques agree with each other to first order: the datum with the largest residual is also the datum inducing the maximum change in the measure of fit in a first-order expansion. The difference is that whereas the first technique simply rejects the datum that deviates most from the current fit, the second technique rejects the point whose exclusion will result in the best fit on the next iteration. In other words, the second technique looks ahead to the next fit to see what improvements will actually materialize.
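
One way to see this agreement, offered only as an illustration, is the special case of linear least squares with the sum of squared residuals as the measure of fit. This restriction is an assumption made here, not necessarily the argument of [23]. The notation is also introduced for this sketch: $A$ is the design matrix, $h_{ii}$ the leverage of datum $i$ (the $i$-th diagonal entry of the hat matrix $H$), $r_i$ its residual, and $f$, $f_i$ the measures of fit with and without datum $i$. The standard leave-one-out identity then reads:

```latex
H = A (A^\top A)^{-1} A^\top, \qquad
f - f_i = \frac{r_i^2}{1 - h_{ii}} \approx r_i^2
\quad \text{when } h_{ii} \ll 1 .
```

When all leverages are small and comparable, ranking the data by squared residual and ranking them by the change in the measure of fit give the same order. The two rankings can differ for data with high leverage.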

Note that the regression diagnostics approach depends heavily on a priori knowledge in choosing the thresholds for outlier rejection.

Zhengyou Zhang
Thu Feb 8 11:42:20 MET 1996