This is your brain on Haskell

For the last week or so I’ve been playing a little with Haskell, which seems to be great fun. At pretty low intensity, that is, in the evenings. I’ve never done any functional programming before, though I’ve taken courses that involved a little recursion, a bit of algorithms, etc., so not everything is completely foreign to me. Well, it still feels very foreign, a bit like trying to read French (I don’t read French).

Anyway, I felt like it was time to write a little program that actually does something, so I’ll try to make a script for bootstrapping. Even though bootstrapping is not the best thing in the world, a bootstrap comparison between the means of two groups seems a nice task to try: it is not completely trivial, but will make me try useful things like pseudorandom numbers and reading a csv file.

Of course, I haven’t got that far yet. Here is the script on GitHub. (Yes, I’m also trying to get used to GitHub.) When trying to do anything I quickly find myself thinking in terms of procedures, states and side effects, or otherwise thinking wrong. On the other hand, yesterday evening just before sleep, I had one of those code epiphanies. “Of course, I just need to do that ...” So, while the half-baked script is not that impressive, the process of writing it is probably the most fun I’ve had with a computer since Starcraft II: Wings of Liberty.

Right now, it can make bootstrap replicates of a list of integers, using a sequence of pseudorandom numbers (derived from the seed 42, so not random at all, but I’m sure you can get entropy through some monad later). I think the next thing will be something to get data into the script.
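For the curious, here is a minimal sketch of what making bootstrap replicates from a fixed seed could look like. It is not the script on GitHub: the names nextState, randomIndices, replicateOnce and bootstrap are made up for illustration, and the little linear congruential generator is only a stand-in, so that the example needs nothing beyond the Prelude.

```haskell
-- A sketch under assumptions, not the actual script from the post.
-- The generator below is a simple stand-in for a real pseudorandom source.

-- | One step of a simple linear congruential generator.
nextState :: Int -> Int
nextState s = (1103515245 * s + 12345) `mod` 2147483648

-- | Infinite stream of indices in [0, n), derived from a seed.
-- The low bits are discarded because they are the least random ones.
randomIndices :: Int -> Int -> [Int]
randomIndices seed n =
  [ (s `div` 65536) `mod` n | s <- drop 1 (iterate nextState seed) ]

-- | Draw one replicate (sampling with replacement) and return the
-- indices that were not used, so the next replicate can continue.
replicateOnce :: [a] -> [Int] -> ([a], [Int])
replicateOnce xs idxs = (map (xs !!) picked, rest)
  where
    (picked, rest) = splitAt (length xs) idxs

-- | Draw k bootstrap replicates of xs, starting from the given seed.
bootstrap :: Int -> Int -> [a] -> [[a]]
bootstrap seed k xs = go k (randomIndices seed (length xs))
  where
    go m idxs
      | m <= 0    = []
      | otherwise = r : go (m - 1) rest
      where
        (r, rest) = replicateOnce xs idxs
```

Calling bootstrap 42 3 [10, 20, 30, 40] would give three lists of four values each, all drawn from the original list, and the same three lists every time, since everything follows from the seed.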

I’m just remembering how to do recursion. Like this little thing, that puts data in some given order. It just handles the first element, then calls itself to deal with the rest. I think this is why I like the split-apply-combine approach in R so much, because it really feels like you’re doing almost no work at all.

applyShuffle x shuffle =
  if shuffle == [] then
    []
  else
    [x !! head shuffle] ++ applyShuffle x (tail shuffle)
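The same recursion can also be written with pattern matching on the index list, or without explicit recursion at all. This is only a sketch of alternatives: the names applyShuffle' and applyShuffleMap are made up here, and both assume the indices are zero-based and within range, just like the version above.

```haskell
-- | Reorder xs according to a list of zero-based indices,
-- matching on the index list instead of using head and tail.
applyShuffle' :: [a] -> [Int] -> [a]
applyShuffle' _  []       = []
applyShuffle' xs (i : is) = xs !! i : applyShuffle' xs is

-- | The same reordering expressed as a map over the indices.
applyShuffleMap :: [a] -> [Int] -> [a]
applyShuffleMap xs = map (xs !!)
```

For example, applyShuffle' "abc" [2, 0, 1] evaluates to "cab". The map version is the "almost no work at all" style: the recursion is still there, but it lives inside map.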

To be continued, I guess.

A note on using R: Residuals from a linear model with missing values

(This is something I might do now and then: post a small, practical note about something I’ve found while working with a particular, usually computer-based, tool. It is the kind of thing I would have wanted to find when googling the problem myself, and an attempt to help a fellow user who’s trying to google his or her way to a solution.)

Occasionally when analysing data, you feel the need to pull out the residuals from a linear model — e.g. when trying to control for a bunch of covariates. In R, you can do this very easily with the residuals() function. This works fine with no NAs:

> data(BOD)
> BOD
   Time demand
1    1    8.3
2    2   10.3
3    3   19.0
4    4   16.0
5    5   15.6
6    7   19.8
> residuals(lm(demand ~ Time, data=BOD))
         1          2          3          4          5          6
-1.9428571 -1.6642857  5.3142857  0.5928571 -1.5285714 -0.7714286

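However, there’s a slight difficulty when there are NAs in the data. By default, lm() drops the rows with missing values (the default na.action is na.omit), so the residual vector is shorter than the input. If you assume the residuals will have the same dimensions and order of elements as the input data, your stuff might break. Note that row 5 is simply gone: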

> BOD$demand[5] <- NA
> residuals(lm(demand ~ Time, data=BOD))
         1          2          3          4          6
-1.9716981 -1.8084906  5.0547170  0.2179245 -1.4924528

I used to use a small work-around that relied on the fact that residuals() saves the row names of the original data as names in the residual vector. Then I found that you can get the desired behaviour simply by passing an na.action argument. What I usually want is for the function to return a vector of the same length as the input, where data points with NAs give NA residuals, like so:

> residuals(lm(demand ~ Time, data=BOD, na.action=na.exclude))
         1          2          3          4          5          6
-1.9716981 -1.8084906  5.0547170  0.2179245         NA -1.4924528

Sometimes, R is pretty neat.