More Haskell: a bootstrap

So my playing around with Haskell goes on. You can follow the progress of the little bootstrap exercise on github. Now it’s gotten to the point where it actually does a bootstrap interval for the mean of a sample. Consider the following R script:

n <- 100
fake.data <- data.frame(group=rep(1, n), data=rpois(n, 10))
write.table(fake.data, quote=F, row.names=F, col.names=F,
            sep=",", file="fake_data.csv")

library(plyr)
bootstrap.replicates <- llply(vector("list", 100),
                              sample, x=fake.data$data,
                              replace=T, size=n)
bootstrap.means <- unlist(llply(bootstrap.replicates, mean))
print(mean(fake.data$data))
print(quantile(bootstrap.means, c(0.025, 0.975)))
[1] 10.31
    2.5%    97.5% 
 9.72475 10.85200

So, that was a simple bootstrap in R: we get some draws from a Poisson distribution, sample 100 times from the data with replacement, and summarise the replicates.  This is my Haskell thing running in GHCi:

*Main> main
"boot"
"will eventually bootstrap, if martin knows his stuff"
fake_data.csv
[8,6,11,16,5,11,12,12,7,9,13,13,12,7,13,7,7,11,9,14,14,13,10,14,17,12,8,
10,15,12,13,13,7,10,9,6,7,8,10,12,10,10,10,12,11,8,16,12,13,13,12,15,7,
7,8,9,5,7,13,10,12,11,8,6,12,14,12,14,6,9,10,9,10,6,9,7,6,12,13,7,11,7,
13,15,10,10,9,12,12,6,10,6,8,10,13,8,9,13,12,13]
10.31
(9.8,10.83)

It’s certainly not the prettiest thing in the world (for one thing, it will crash if there is an extra line break at the end of the file). Next stop: type declarations! Haskell will infer the types for me, but it is probably a good idea to declare the intended types. Or at least to be able to do so is. Then the plan is to make some use of the first column in the data file, i.e. group the sample belongs to, to add a second sample and make a comparison between the means. And then it’s pretty much done and maybe I’ll move on to something more useful. I’m thinking that implementing least squares linear models would be a decent exercise?

More Haskell fun: the regress of a function

I thought this was pretty funny. Let’s follow the development of one part of my toy bootstrap script. (You have to keep in mind that I’m playing around with functional programming for the first time and that I’m completely utterly horrible at this!) So far, using pseudorandom draws with replacement from a list of data, my script will produce bootstrap replicates of that list. This is the part that, after the random draws have been partitioned into lists of sufficient length goes on to pull out the right elements of the data. The first iteration (heh) goes like this with explicit recursion:

applyShuffle x shuffle =
  if shuffle == [] then
    []
  else
    [x !! head shuffle] ++ applyShuffle x (tail shuffle)

applyShuffleOrder x shuffleOrder =
  if shuffleOrder == [] then
    []
  else
    [applyShuffle x (head shuffleOrder)] ++
       applyShuffleOrder x (tail shuffleOrder)

So, a ”shuffle” in the above is a list of indices of elements of data that form a bootstrap replicate. A ”shuffle order” is a list of such lists. Next step, why not try some pattern matching? Also, the (x:xs) notation for lists makes it look a little more Haskell-y:

applyShuffle x [] = []
applyShuffle x (index:indices) =
    [x !! index] ++ applyShuffle x indices

applyShuffleOrder x [] = []
applyShuffleOrder x (shuffle:shuffles) =
    [applyShuffle x shuffle] ++
        applyShuffleOrder x shuffles

Enter the map function! Map really just applies a function recursively to a list. (For friends of R: it’s like llply or lapply.) It’s still recursion, but it kind of reads like what I would say if I was to tell somebody what the function does: ”map the applyShuffle function to the shuffles we’ve already generated”. This is the second function again, but with map:

applyShuffleOrder x shuffles =
  map f shuffles
  where f = applyShuffle x

At this point you might realise that the functions can be easily combined into something much shorter. That is, you dear reader might have realised that before, but for me it came at this point. I also decided to stop talking about shuffle orders, and simply call them shuffles. At the end, this is all that remains of the above functions.

applyShuffles x shuffles =
  map applyShuffle shuffles
    where applyShuffle = map (x !!)

This is your brain on Haskell

For the last week or so I’ve been playing a little with Haskell, which seems to be great fun. At pretty low intensity, that is, in the evenings. I’ve never done anything with functional programming before, though I’ve taken courses that involved a little recursion, a bit of algorithms etc, so everything is not completely foreign to me. Well, it feels very foreign, a bit like trying to read French (I don’t read French).

Anyway, I felt like it was time to write a little program that actually does something, so I’ll try to make a script for bootstrapping. Even though bootstrapping is not the best thing in the world, a bootstrap comparison between the means of two groups seems a nice task to try: it is not completely trivial, but will make me try useful things like pseudorandom numbers and reading a csv file.

Of course, I haven’t got that far yet. Here is the script on github. (Yes, I’m also trying to get used to github.) When trying to do anything I quickly find myself thinking in terms of procedures, states and side-effects, or otherwise thinking wrong. On the other hand, yesterday evening just before sleep, I had one of those code epiphanies. ”Of course, I just need to do that . . . ” So, while the half-baked script is not that impressive, the process of writing it is probably the most fun I’ve had with a computer since Starcraft II: Wings of Liberty.

Right now, it can make bootstrap replicates of a list of integers, using a sequence of pseudorandom numbers (derived from the seed 42, so not random at all, but I’m sure you can get entropy through some monad later). I think the next thing will something to get data into the script.

I’m just remembering how to do recursion. Like this little thing, that puts data in some given order. It just handles the first element, then calls itself to deal with the rest. I think this is why I like the apply-split-combine approach in R so much, because it really feels like you’re doing almost no work at all.

applyShuffle x shuffle =
  if shuffle == [] then
    []
  else
    [x !! head shuffle] ++ applyShuffle x (tail shuffle)

To be continued, I guess.