A note on using R: Residuals from a linear model with missing values

(Not på svenska: Det här är något jag kanske kommer göra då och då — skriva en liten praktiskt inriktad kommentar om något jag upptäckt i arbetet med något visst (oftast datorbaserat) verktyg — något jag skulle velat hitta när jag googlade problemet.)

(This is something I might do more often: posting a small practical thing I’ve found useful, as an attempt to help a fellow user who’s trying to google his or her way to a solution.)

Occasionally when analysing data, you feel the need to pull out the residuals from a linear model — e.g. when trying to control for a bunch of covariates. In R, you can do this very easily with the residuals() function. This works fine with no NAs:

> data(BOD)
   Time demand
1    1    8.3
2    2   10.3
3    3   19.0
4    4   16.0
5    5   15.6
6    7   19.8
> residuals(lm(demand ~ Time, data=BOD))
         1          2         3         4          5          6
-1.9428571 -1.6642857 5.3142857 0.5928571 -1.5285714 -0.7714286

However, there’s a slight difficulty when there are NAs in the data. If you assume the residuals will have the same dimensions and order of elements as the input data, your stuff might break.

> BOD$demand[5] <- NA
residuals(lm(demand ~ Time, data=BOD))
         1          2         3         4          6 
-1.9716981 -1.8084906 5.0547170 0.2179245 -1.4924528

I used to use a small work-around that used the fact that residuals() saves the row names of the original data as names in the residual vector. Then I found that you could get the desired behaviour — at least, what I want is usually for the function to return the a vector of the same length as the input, where NA data points give NA residual values — by simply putting in a na.action argument, like so:

> residuals(lm(demand ~ Time, data=BOD, na.action=na.exclude))
         1          2         3        4         5          6
-1.9716981 -1.8084906 5.0547170 0.2179245       NA -1.4924528

Sometimes, R is pretty neat.