# Using R: plyr to purrr, part 1

This is the second post about my journey towards writing more modern Tidyverse-style R code; here is the previous one. We will look at the common case of taking subset of data out of a data frame, making some complex R object from them, and then extracting summaries from those objects.

I miss the plyr package. Especially ddply, ldply, and dlply, my favourite trio of R functions of all times. Yes, the contemporary Tidyverse package dplyr is fast and neat. And those plyr functions operating on arrays were maybe overkill; I seldom used them. But plyr was so smooth, so beautiful, and — after you’ve bashed your head against it for some time and it changed your mind — so intuitive. The package still exists, but it’s retired, and we shouldn’t keep writing R code like it’s 2009, should we?

I used to like to do something like this: take a data frame, push it through a function that returns some complex object, store those objects in a list, and then map various functions over that list to extract the parts I care about. What is the equivalent in the new idiom? To do the same thing but with the purrr package, of course! purrr replaces the list-centric parts of plyr, while dplyr covers data frame-centric summarisation, mutation and so on.

For this example we will be using the lm function on subsets of data and store the model object. It’s the simple case that everyone reaches for to demonstrate these features, but it’s a bit dubious. If you find yourself fitting a lot of linear models to subsets, maybe there are other models you should think about Especially here, when the fake data just happens to come from a multilevel model with varying intercepts … But in this case, let’s simulate a simple linear regression and look at the coefficients we get out.

set.seed(20210807)

n_groups <- 10
group_sizes <- rpois(n_groups, 10)
n <- sum(group_sizes)

fake_data <- tibble(id = 1:n,
group = rep(1:n_groups,
times = group_sizes),
predictor = runif(n, 0, 1))
group_intercept <- rnorm(n_groups, 0, 1)

fake_data$response <- fake_data$predictor * 10 +
group_intercept[fake_data$group] + rnorm(n)  And here is the plyr code: First, dlply takes us from a data frame, splitting it by group, to a list of linear models. Then, ldply takes us from the list of models to a data frame of coefficients. tidy is a function from the wonderful broom package which extracts the same information as you would get in the rather unwieldy object from summary(lm), but as a data frame. library(plyr) library(broom) fit_model <- function(data) { lm(response ~ predictor, data) } models <- dlply(fake_data, "group", fit_model) result <- ldply(models, tidy)  This is what the results might looks like. Notice how ldply adds the split labels nicely into the group column, so we know which rows came from which subset.  group term estimate std.error statistic p.value 1 1 (Intercept) -0.2519167 0.5757214 -0.4375670 6.732729e-01 2 1 predictor 10.6136902 1.0051490 10.5593207 5.645878e-06 3 2 (Intercept) 3.1528489 0.6365294 4.9531864 7.878498e-04 4 2 predictor 8.2075766 1.1458702 7.1627452 5.292586e-05 5 3 (Intercept) -0.8103777 0.6901212 -1.1742542 2.786901e-01 ...  # split/map: The modern synthesis If we pull out purrr, we can get the exact same table like so. The one difference is that we get a tibble (that is, a contemporary, more well-behaved data frame) out of it instead of a base R data.frame. library(purrr) models <- map(split(fake_data, fake_data$group),
fit_model)
result <- map_df(models,
tidy,
.id = "group")

# A tibble: 80 x 6
group term        estimate std.error statistic  p.value

1 1     (Intercept)     1.67     0.773      2.16 6.32e- 2
2 1     predictor       8.67     1.36       6.39 2.12e- 4
3 2     (Intercept)     4.11     0.566      7.26 4.75e- 5
4 2     predictor       8.19     1.11       7.36 4.30e- 5
5 3     (Intercept)    -7.50     0.952     -7.89 9.99e- 5
6 3     predictor      11.5      1.75       6.60 3.03e- 4
7 4     (Intercept)   -19.8      0.540    -36.7  7.32e-13
8 4     predictor      11.5      0.896     12.8  5.90e- 8
9 5     (Intercept)   -12.4      1.03     -12.0  7.51e- 7
10 5     predictor       9.69     1.82       5.34 4.71e- 4
# … with 70 more rows


First, the base function split lets us break the data into subsets based on the values of a variable, which in this case is our group variable. The output of this function is a list of data frames, one for each group.

Second, we use map to apply a function to each element of that list. The function is the same modelling function that we used above, which shoves the data into lm. We now have our list of linear models.

Third, we apply the tidy function to each element of that list of models. Because we want the result to be one single data frame consolidating the output from each element, we use map_df, which will combine the results for us. (If we’d just use map again, we would get a list of data frames.) The .id argument tells map to add the group column that indicates what element of the list of models each row comes from. We want this to be able to identify the coefficients.

If we want to be fancy, we can express with the Tidyverse-related pipe and dot notation:

library(magrittr)

result <- fake_data %>%
values = colours$colour)  # Using R: 10 years with R Yesterday, 29 Feburary 2020, was the 20th anniversary of the release R 1.0.0. Jozef Hajnala’s blog has a cute anniversary post with some trivia. I realised that it is also (not to the day, but to the year) my R anniversary. I started using R in 2010, during my MSc project in Linköping. Daniel Nätt, who was a PhD student there at the time, was using it for gene expression and DNA methylation work. I think that was the reason he was pulled into R; he needed the Bioconductor packages for microarrays. He introduced me. Thanks, Daniel! I think I must first have used it to do something with qPCR melting curves. I remember that I wrote some function to reshape/pivot data between long and wide format. It was probably an atrocity of nested loops and hard bracket indexing. Coming right from an undergraduate programme with courses using Ada and C++, even if we had also used Minitab for statistics and Matlab for engineering, I spoke R with a strong accent. At any rate, I was primed to think that doing my data analysis with code was a good idea, and jumped at the opportunity to learn a tool for it. Thanks, undergraduate programme! I think the easiest thing to love about R is the package system. You can certainly end up in dependency hell with R and metaphorically shoot your own foot, especially on a shared high performance computing system. But I wouldn’t run into any of that until after several years. I was, and still am, impressed by how packages just worked, and could do almost anything. So, the Bioconductor packages were probably, indirectly, why I was introduced to R, and after that, my R story can be told in a series of packages. Thanks, CRAN! The next package was R/qtl, that I relied on for my PhD. I had my own copy of the R/qtl book. For a period, I probably wrote thing every day: library(qtl) cross <- read.cross(file = "F8_geno_trim.csv", format = "csv")  R/qtl is one of my favourite pieces or research software, relatively friendly and with lots of documentation. Thanks, R/qtl developers! Of course it was Dom Wright, who was my PhD supervisor, who introduced me to R/qtl, and I think it was also he who introduced me to ggplot2. At least he used it, and at some point we were together trying to fix the formatting of a graph, probably with some ugly hack. I decided to use ggplot2 as much as possible, and as it is wont to, ggplot2 made me care about rearranging data, thus leading to reshape2 and plyr. ”The magic is not in plotting the data but in tidying and rearranging the data for plotting.” After a while, most everything I wrote used the ddply function in some way. Thank you, Hadley Wickham! Then came the contemporary tidyverse. For the longest time, I was uneasy with tidyr, and I’m still not a regular purrr user, but one can’t avoid loving dplyr. How much? My talk at the Swedish Bioinformatics Workshop in 2016 had a slide expressing my love of the filter function. It did not receive the cheers that the function deserves. Maybe the audience were Python users. With new file reading functions, new data frames and functions to manipulate data frames, modern R has become smoother and friendlier. Thanks, tidyverse developers! The history of R on this blog started in 2011, originally as a way to make notes for myself or, ”a fellow user who’s trying to google his or her way to a solution”. This turned into a series of things to help teach R to biologists around me. There was the Slightly different introduction to R series of blog posts. It used packages that feel somewhat outdated, and today, I don’t think there’s anything even slightly different about advocating RStudio, and teaching ggplot2 from the beginning. This spawned a couple of seminars in course for PhD students, which were updated for the Wright lab computation lunches, and eventually turned into a course of its own given in 2017. It would be fun to update it and give it again. The last few years, I’ve been using R for reasonably large genome datasets in a HPC environment, and gotten back to the beginnings, I guess, by using Bioconducor a lot more. However, the package that I think epitomises the last years of my R use is AlphaSimR, developed by colleagues in Edinburgh. It’s great to be able throw together a quick simulation to check how some feature of genetics behaves. AlphaSimR itself is also an example of how far the R/C++ integration has come with RCpp and RCppArmadillo. Thanks, Chris! In summary, R is my tool of choice for almost anything. I hope we’ll still be using it, in new and interesting ways, in another ten years. Thank you, R core team! # Using R: from plyr to purrr, part 0 out of however many This post is me thinking out loud about applying functions to vectors or lists and getting data frames back. Using R is an ongoing process of finding nice ways to throw data frames, lists and model objects around. While tidyr has arrived at a comfortable way to reshape dataframes with pivot_longer and pivot_wider, I don’t always find the replacements for the good old plyr package as satisfying. Here is an example of something I used to like to do with plyr. Don’t laugh! Assume we have a number of text files, all in the same format, that we need to read and combine. This arises naturally if you run some kind of analysis where the dataset gets split into chunks, like in genetics, where chunks might be chromosomes. ## Generate vector of file names files <- paste("data/chromosome", 1:20, ".txt", sep = "") library(plyr) library(readr) genome <- ldply(files, read_tsv)  This gives us one big data frame, containing the rows from all those files. If we want to move on from plyr, what are our options? We can go old school with base R functions lapply and Reduce. library(readr) chromosomes <- lapply(files, read_tsv) genome <- Reduce(rbind, chromosomes)  Here, we first let lapply read each file and store it in a list. Then we let Reduce fold the list with rbind, which binds the data frames in the list together, one below the other. If that didn’t make sense, here it is again: lapply maps a function to each element of a vector or list, collecting the results in a list. Reduce folds the elements in a list together, using a function that takes in two arguments. The first argument will be the results it’s accumulated so far, and the second argument will be the next element of the list. In the end, this leaves us, as with ldply, with one big data frame. We can also use purrr‘s map_dfr. This seems to be the contemporary most elegant solution: library(purrr) library(readr) genome <- map_dfr(files, read_tsv)  map_dfr, like good old ldply will map over a vector or list, and collect resulting data frames. The ”r” in the name means adding the next data frame as rows. There is also a ”c” version (map_dfc) for adding as columns. # Using R: Animal model with hglm and Stan (with Cholesky trick) A few weeks ago I posted about fitting the quantitative genetic animal model with MCMCglmm and R-INLA. Since then, I listened to a talk by Lars Rönnegård, one of the creators of the hglm package, and this paper was published in GSE about animal models in Stan. # hglm The hglm package fits hierarchical generalised linear models. That includes the animal model with pedigree or genomic relatedness. Hierarchical generalised linear models also allow you to model the dispersion of random effects, which lets you do tricks like variance QTL mapping (Rönnegård & Valdar 2011), breeding values for variances (Rönnegård et al. 2010) or genomic prediction models with predictors of marker variance (Mouresan, Selle & Rönnegård 2019). But let’s not get ahead of ourselves. How do we fit an animal model? Here is the matrix formulation of the animal model that we skim through in every paper. It’s in this post because we will use the design matrix interface to hglm, which needs us to give it these matrices (this is not a paper, so we’re not legally obliged to include it): $\mathbf{y} = \mu + \mathbf{X} \mathbf{b} + \mathbf{Z} \mathbf{a} + \mathbf{e}$ The terms are the the trait value, intercept, fixed coefficients and their design matrix, genetic coefficients and their design matrix, and the residual. The design matrix Z will contain one row and column for each individual, with a 1 to indicate its position in the phenotype table and pedigree and the rest zeros. If we sort our files, it’s an identity matrix. The trick with the genetic coefficients is that they’re correlated, with a specific known correlation structure that we know from the pedigree (or in genomic models, from markers). It turns out (Lee, Nelder & Pawitan 2017, chapter 8) that you can change the Z matrix around so that it lets you fit the model with an identity covariance matrix, while still accounting for the correlations between relatives. You replace the random effects for relatedness with some transformed random effects that capture the same structure. One way to do this is with Cholesky decomposition. $\mathbf{Z_{fudged}} = \mathbf{Z_0} \mathbf{L}$ As an example of what the Cholesky decomposition does, here is slice of the additive relationship matrix of 100 simulated individuals (the last generation of one replicate of these simulations) and the resulting matrix from Cholesky decomposition. So instead of $\mathbf{a} \sim N(0, \mathbf{A} \sigma)$ We can fit $\mathbf{a_{fudged}} \sim N(0, \mathbf{I} \sigma)$ This lets us fit the animal model with hglm, by putting in a modified Z matrix. Assuming we have data frames with a pedigree and a phenotype (like, again, from these simulations): library(AGHmatrix) library(hglm) A <- Amatrix(ped) Z0 <- diag(1000) L <- t(chol(A)) Z <- Z0 %*% L X <- model.matrix(~1, pheno) model <- hglm(y = pheno$pheno,
X = X,
Z = Z,
conv = 1e-8)

est_h2  <- model$varRanef / (model$varRanef + model$varFix)  (I found the recommendation to decrease the convergence criterion from the default for animal models in a YouTube video by Xia Chen.) # Stan When we turn to Stan, we will meet the Cholesky trick again. Stan is a software for Markov Chain Monte Carlo, built to fit hierarchical linear models, and related high-dimensional models, more effectively than other sampling strategies (like Gibbs). rstan is a helpful package for running Stan from within R. Nishio & Arakawa (2019) recently published a Stan script to fit an animal model, comparing Stan to a Gibbs sampler (and a related MCMC sampler that they also didn’t publish the code for). If we look into their Stan model code, they also do a Cholesky decomposition to be able to use an identity matrix for the variance. First, they decompose the additive relationship matrix that the program takes in: transformed data{ matrix[K,K] LA; LA = cholesky_decompose(A); }  And then, they express the model like this: vector[N] mu; vector[K] a; a_decompose ~ normal(0, 1); a = sigma_G * (LA * a_decompose); mu = X * b + Z * a; Y ~ normal(mu, sigma_R);  We can add this line to the generated quantities block of the Stan program to get heritability estimates directly: real h2; h2 = sigma_U / (sigma_U + sigma_E)  Here, we’ve saved their model to a stan file, and now we can run it from R: pheno$scaled_pheno <- as.vector(scale(pheno$pheno)) model_stan <- stan(file = "nishio_arakawa.stan", data = list(Y = pheno$scaled_pheno,
X = X,
A = A,
Z = Z0,
J = 1,
K = 1000,
N = 1000))

est_h2_stan <- summary(model_stan, pars = "h2")$summary  Important note that I always forget: It's important to scale your traits before you run this model. If not, the priors might be all wrong. The last line pulls out the summary for the heritability parameter (that we added above). This gives us an estimate and an interval. The paper also contains this entertaining passage about performance, which reads as if it was a response to a comment, actual or anticipated: R language is highly extensible and provides a myriad of statistical and graphical techniques. However, R language has poor computation time compared to Fortran, which is especially well suited to numeric computation and scientific computing. In the present study, we developed the programs for GS and HMC in R but did not examine computation time; instead, we focused on examining the performance of estimating genetic parameters and breeding values. Yes, two of their samplers (Gibbs and HMC) were written in R, but the one they end up advocating (and the one used above), is in Stan. Stan code gets translated into C++ and then compiled to machine code. # Stan with brms If rstan lets us run Stan code from R and examine the output, brms lets us write down models in relatively straightforward R syntax. It’s like the MCMCglmm of the Stan world. We can fit an animal model with brms too, by directly plugging in the relationship matrix: model_brms <- brm(scaled_pheno ~ 1 + (1|animal), data = pheno, family = gaussian(), cov_ranef = list(animal = A), chains = 4, cores = 1, iter = 2000)  Then, we can pull out the posterior samples for the variability, here expressed as standard errors, compute the heritability and then get the estimates (and interval, if we want): posterior_brms <- posterior_samples(model_brms, pars = c("sd_animal", "sigma")) h2_brms <- posterior_brms[,1]^2 / (posterior_brms[,1]^2 + posterior_brms[,2]^2) est_h2_brms <- mean(h2_brms)  (Code is on GitHub: both for the graphs above, and the models.) # Using R: From gather to pivot Since version 1.0.0, released in September, the tidyr package has a new replacement for the gather/spread pair of functions, called pivot_longer/pivot_wider. (See the blog post about the release. It can do a lot of cool things.) Just what we needed, another pair of names for melt/cast, right? Yes, I feel like this might just be what we need! My journey started with reshape2, and after a bit of confusion, I internalised the logic of melt/cast. Look at this beauty: library(reshape2) fake_data <- data.frame(id = 1:20, variable1 = runif(20, 0, 1), variable2 = rnorm(20)) melted <- melt(fake_data, id.vars = "id")  This turns a data frame that looks like this …  id variable1 variable2 1 1 0.10287737 -0.21740708 2 2 0.04219212 1.36050438 3 3 0.78119150 0.09808656 4 4 0.44304613 0.48306900 5 5 0.30720140 -0.45028374 6 6 0.42387957 1.16875579  … into a data frame that looks like this:  id variable value 1 1 variable1 0.10287737 2 2 variable1 0.04219212 3 3 variable1 0.78119150 4 4 variable1 0.44304613 5 5 variable1 0.30720140 6 6 variable1 0.42387957  This is extremely useful. Among other things it comes up all the time when using ggplot2. Then, as I detailed in a post two years ago, I switched to tidyr as that became the replacement package. ”Gather” and ”spread” made no sense to me as descriptions of operations on a data frame. To be fair, ”melt” and ”cast” felt equally arbitrary, but by that time I was used to them. Getting the logic of the arguments, the order, what needed quotation marks and not, took some staring at examples and a fair bit of trial and error. Here are some examples. If you’re not used to these functions, just skip ahead, because you will want to learn the pivot functions instead! library(tidyr) melted <- gather(fake_data, variable, value, 2:3) ## Column names instead of indices melted <- gather(fake_data, variable, value, variable1, variable2) ## Excluding instead of including melted <- gather(fake_data, variable, value, -1) ## Excluding using column name melted <- gather(fake_data, variable, value, -id)  Enter the pivot functions. Now, I have never used pivot tables in any spreadsheet software, and in fact, the best way to explain them to me was to tell me that they were like melt/cast (and summarise) … But pivot_longer/pivot_wider are friendlier on first use than gather/spread. The naming of both the functions themselves and their arguments feel like a definite improvement. long <- pivot_longer(fake_data, 2:3, names_to = "variable", values_to = "value")  # A tibble: 40 x 3 id variable value 1 1 variable1 0.103 2 1 variable2 -0.217 3 2 variable1 0.0422 4 2 variable2 1.36 5 3 variable1 0.781 6 3 variable2 0.0981 7 4 variable1 0.443 8 4 variable2 0.483 9 5 variable1 0.307 10 5 variable2 -0.450 # … with 30 more rows  We tell it into what column we want the names to go, and into what column we want the values to go. The function is named after a verb that is associated with moving things about in tables all the way to matrix algebra, followed by an adjective (in my opinion the most descriptive, out of the alternatives) that describes the layout of the data that we want. Or, to switch us back again: wide <- pivot_wider(long, names_from = "variable", values_from = "value")  # A tibble: 20 x 3 id variable1 variable2 1 1 0.103 -0.217 2 2 0.0422 1.36 3 3 0.781 0.0981 4 4 0.443 0.483 5 5 0.307 -0.450 6 6 0.424 1.17  Here, instead, we tell it where we want the new column names taken from and where we want the new values taken from. None of this is self-explanatory, by any means, but they are thoughtful choices that make a lot of sense. We’ll see what I think after trying to explain them to beginners a few times, and after I’ve fought warning messages involving list columns for some time, but so far: well done, tidyr developers! # #TidyTuesday: horror films, squirrels and commuters Tidy Tuesday is a fun weekly activity where a lot of R enthusiasts make different visualisations, and possibly modelling, of the same dataset. You can read more about it at their Github page. I participated for three weeks, and here is a recap. I will show excerpts of the code, but you can read the whole thing by clicking through to Github. # 2019-10-22 Horror films In time for Halloween, we got a dataset with horror film data from IMDB. (Yes, I will be mixing the terms ”film” and ”movie” wildly.) The first week, I started with making a pretty boring plot, the way I’d normally plot things (white background, small multiples, you know the drill). I wanted to look at distribution over the year, so I plotted what month films are released and the distribution of review scores and budgets each month. After thinking about it for a while, I thought a logarithmic scale would make sense for budgets, that span a huge range. Also, after realising that the budget column actually didn’t contain dollars, but a mix of currencies, I decided not to try to convert, but use only the US dollar budgets. I don’t often run into dates, to using the date functions from readr and lubridate was new to me, as was the built-in vector month.abb: library(dplyr) library(egg) library(ggplot2) library(ggimage) library(lubridate) library(readr) library(stringr) movies <- read_csv("horror_movies.csv") ## Parse dates movies$release_parsed  <- parse_date(movies$release_date, format = "%d-%b-%y", locale = locale("en")) movies$release_year <- ifelse(is.na(movies$release_parsed), movies$release_date,
year(movies$release_parsed)) movies$release_month  <- month.abb[month(movies$release_parsed)]  Here, we parse the release data, and extract the release year, treating films that only have a release year separately. I also put in means with confidence intervals, like so, and a line for the mean review rating: model <- lm(review_rating ~ release_month, movies) fit <- data.frame(release_month = month.abb, predict(model, newdata = data.frame(release_month = month.abb), interval = "confidence"), stringsAsFactors = FALSE) grand_mean_rating <- mean(movies$review_rating,
na.rm = TRUE)


As an example of the plotting code, here is the middle panel for ratings. As usual with ggplot2, we layer geometries on top of each other (here: violin plots, points with range bars, and a horizontal line, followed by a lot of formatting.

plot_rating <- ggplot() +
geom_violin(aes(x = release_month,
y = review_rating),
fill = "grey",
colour = NA,
data = movies) +
scale_x_discrete(limits = month.abb) +
geom_pointrange(aes(x = release_month,
y = fit,
ymax = upr,
ymin = lwr),
data = fit) +
geom_hline(yintercept = grand_mean_rating,
linetype = 2,
colour = "red") +
ylim(0, 10) +
theme_bw(base_size = 12) +
theme(panel.grid = element_blank()) +
xlab("") +
ylab("Review rating")


There is similar code for the other two panels. Finally, I used ggarrange from the egg package to put everything together. In summary, most horror films are released in October, probably around Halloween. The review ratings of films released in this horror season are also a tiny bit higher than during the rest of the year, but there is not much of a difference in the budgets.

After that, and after seeing some of the fun horror-themed graphs other people made, I decided to make something more colourful. Here is a plot on the same theme, showing each day and year separately, an appropriately horrendous colour scheme, and a pumpkin icon to indicate the date of Halloween. I like this plot better because it shows more of the data. It shows the increase at Halloween. We also see some spikes at other dates, like 1 January of some years. It also shows how the dataset ends at Halloween 2017.

The code for this plot is mostly a lot of theme formatting. The ggplot2 theme function takes a lot of arguments I’ve never used before.

movies$yday <- yday(movies$release_parsed)

daycount <- summarise(group_by(movies, yday, release_year), n = n())


First, we turn dates into days of the year, and count the number of film releases.

halloween  <-  yday("2019-10-31")

pumpkin_data  <- data.frame(x = halloween,
y = -1,
image = "pumpkin.png",
stringsAsFactors = FALSE)


Then, we set up the date of Halloween and a data frame for the pumpkin icon. We’re going to use geom_image from the ggimage package to add this icon to each subplot.

breaks  <- yday(paste("2019-", 1:12, "-01", sep = ""))

plot_year <- ggplot() +
geom_point(aes(x = yday,
y = n),
colour = "green",
data = na.exclude(dc)) +
geom_image(aes(x = x,
y = y,
image = image),
data = pumpkin_data) +
facet_wrap(~ release_year,
ncol = 2) +
scale_x_continuous(breaks = breaks,
labels = month.abb) +
ylim(-3, NA) +
labs(caption = "Pumpkin icon by Good Ware from www.flatiron.com.") +
theme(panel.grid = element_blank(),
strip.background = element_blank(),
text = element_text(family = "mono",
colour = "grey",
size = 16),
axis.text = element_text(family = "mono",
colour = "green",
size = 14),
axis.ticks = element_line(colour = "green"),
strip.text = element_text(family = "mono",
colour = "grey",
size = 16),
plot.background = element_rect(fill = "black"),
panel.background = element_rect(fill = "black")) +
xlab("") +
ylab("Horror films released on this day") +
ggtitle("When horror films are released")


A lot of other people made graphs that highlight the increase in horror film releases around Halloween in different ways. Here are some that I like:

And, looking deeper, there is a pattern within months too:

Finally, I also like this plot, that makes a case for a U-shaped relationship between budget and rating:

And for contrast, another that makes a different case with the same data:

This seems to be a recurrent theme when it comes to interpretation and quantitative analysis in the Tidy Tuesday datasets. People make different modeling choices, or visualisation choices (which are modeling choices) about what to lump together, what to separate into bins, how to transform the data, and how to show uncertainty. In some cases, as with the pattern of film releases around Halloween, they all find similar results. In some other cases, they don’t.

# 2019-10-28 NYC Squirrel Census

This week, the data was about the location and activities of squirrels in New York central park on certain times. I had this vision of an animated map of squirrel locations. I ended up with an animation, but no map. The colour of the squirrel icon shows the main fur colour of the squirrels (grey, black, cinnamon), and the size shows adults and juveniles.

I had never used gganimate before (only animation, as in this post about the Game of Life), but I had seen Thomas Lin Pedersen tweet about it, and I wanted to try.

library(dplyr)
library(gganimate)
library(ggimage)
library(ggplot2)

## Parse the date
squirrels$date_parsed <- parse_date(as.character(squirrels$date), format = "%m%d%Y")

## Give each observation a unique ID (to use as group in the
## animation, so as to not have points turn into one another but fade
squirrels$key <- 1:nrow(squirrels) ## Associate the different squirrel colours with the filenames of ## icons in different colours (manually filled with GIMP). squirrels$image  <- "squirrel.png"
squirrels$image[squirrels$primary_fur_color == "Cinnamon"]  <- "squirrel_cinnamon.png"
squirrels$image[squirrels$primary_fur_color == "Gray"]  <- "squirrel_grey.png"
squirrels$image[is.na(squirrels$primary_fur_colour)]  <- NA


Again, we need to parse the date. We already have latitude and longitude. We need a unique identifier for each observation, to tell gganimate that we want each squirrel to be in its own group. Then, we associate squirrel colours with three different files with a squirrel icon in different colours.

First, we make two image scatterplot layers, setting the sizes of adults and juveniles manually. The colour is deal with by mapping the image column containing the file names to the image aesthetic. We add some formatting, and then, the transition_states layer, which is where the graph turns from still and boring to magical moving pictures. This will animate a series of discrete ”states”, which here consist of the date pasted together with the shift (AM or PM squirrel observation shift). The special ”{closest_state}” variable in the title string puts this state name as plot title.

plot_colour <- ggplot() +
geom_image(aes(y = long, x = lat, image = image, group = key),
size = 0.04,
data = filter(squirrels, age == "Adult")) +
geom_image(aes(y = long, x = lat, image = image, group = key),
size = 0.03,
data = filter(squirrels, age == "Juvenile")) +
theme_bw(base_size = 16) +
theme(panel.grid = element_blank()) +
xlab("Latitude") +
ylab("Longitude") +
labs(title = "{closest_state}",
caption = "Data from NYC Squirrel Census. Squirrel icon made by Freepik from www.flatiron.com.") +
transition_states(paste(date_parsed, shift),
state_length = 2,
transition_length = 1)

## Render it and write to file
animate(plot_colour,
fps = 10,
nframes = 400,
end_pause = 20,
rewind = FALSE,
width = 1000,
height = 1000)


I was faffing around unsuccessfully with different map packages to try to find something of Central Park. It seems ggmaps is the way to go. Several other participants made nice maps:

However, I think this was my favourite:

https://github.com/ryantimpe/TidyTuesday/blob/master/2019w44/2019w44.R

The original Squirrel Census Report seems to be amazing object, too, with a beautiful map.

# 2019-11-05 Biking and walking to work in the US (and Sweden)

This week I felt I had to make a map. The end result doesn’t look like much, but it took a while. Here are the average percentages of commuters who walk and bike to work in different US states 2008-2012 with data from the American Community Survey:

library(dplyr)
library(ggplot2)
library(usmap)

## Map data from the usmap package
state_map  <- us_map(regions = "state")

## There are some incompletely labelled states; fix them
missing  <- setdiff(commute$state, state_map$full)

commute$state_modified <- commute$state
commute$state_modified[commute$state == "Ca"] <- "California"
commute$state_modified[commute$state == "Massachusett"]  <- "Massachusetts"


We get map coordinates for the US states from the usmap package (because the one in maps doesn’t have Alaska and Hawaii).

Then we fix some mislabelling in the data.

## Get the average per state
state_average  <- summarise(group_by(commute, state_modified, mode),
average = sum(percent * n)/sum(n))

## Combine averages and coordinates
combined  <- inner_join(state_average,
state_map,
by = c("state_modified" = "full"))


We take a weighted average of the percentages per state and join the state averages with the state map coordinates. The map I posted on Twitter didn’t weight the average, but I think that is a bit better. There is still the issue that states have different populations and different distributions of large and small cities, but that’s the nature of things. In summary, there is not much biking going on, but some more walking to work.

plot_map  <- ggplot() +
geom_polygon(aes(x = x, y = y, fill = average, group = group),
colour = "black",
data = combined) +
facet_wrap(~ mode) +
scale_fill_continuous(low = "white",
high = "blue",
name = "Percent commuters") +
theme_bw(base_size = 16) +
theme(panel.grid = element_blank(),
strip.background = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
legend.position = "bottom") +
xlab("") +
ylab("") +
labs(caption = "Cycling and walking to work 2008-2012 in the American Community Survey.")


The US seems to live up to its reputation as a motorised country. But I have no feeling for the scale of the data. For comparision, here is a map of Sweden with some not too recent data (2005-2006, from this VTI report>). The map is from the swemap package.

# Exploratory analysis of a banana

This post is just me amusing myself by exploring a tiny data set I have lying around. The dataset and the code is on Github.

In 2014 (I think), I was teaching the introductory cell biology labs (pictures in the linked post) in Linköping. We were doing a series of simple preparations to look at cells and organelles: a cheek swab gives you a view of dead mammalian cells with bacteria on them; Elodea gives you a nice chloroplast view; a red bell pepper gives you chromoplasts; and a banana stained with iodine gives you amyloplasts. Giving the same lab six times in a row, it became apparent how the number of stained amyloplasts decreased as the banana ripened.

I took one banana, sliced in into five pieces (named A-E), and left it out to ripen. Then I stained (with Lugol’s iodine solution) and counted the number of amyloplasts per cell in a few cells (scraped off with a toothpick) from the end of each piece at day 1, 5, and 9.

First, here is an overview of the data. On average, we go from 17 stained amyloplasts on day 1, to 5 on day five and 2 on day nine.

If we break the plot up by slices, we see decline in every slice and variability between them. Because I only sampled each slice once per day, there is no telling whether this is variation between parts of the banana or between samples taken (say, hypothetically, because I might have stuck the toothpick in more or less deeply, or because the ripeness varies from the middle to the peel).

How can we model this? Let’s first fit a linear model where the number of amyloplasts decline at a constant rate per day, allowing for different starting values and different declines for each slice. We can anticipate that a Gaussian linear model will have some problems in this situation.

We fit a linear model and pull out the fitted values for each day–slice combination:

model_lm  <- lm(amyloplasts ~ day * slice,
data = banana)

levels <- expand.grid(slice = unique(banana$slice), day = unique(banana$day),
stringsAsFactors = FALSE)

pred_lm  <- cbind(levels,
predict(model_lm,
newdata = levels,
interval = "confidence"))


Then, to investigate the model’s behaviour, we can simulate data from the model, allowing for uncertainty in the fitted parameters, with the sim function from the arm package.

We make a function to simulate data from the linear model given a set of parameters, then simulate parameters and feed the first parameter combination to the function to get ourselves a simulated dataset.

y_rep_lm  <- function(coef_lm, sigma, banana) {
slice_coef  <- c(0, coef_lm[3:6])
names(slice_coef)  <- c("A", "B", "C", "D", "E")

slice_by_day_coef  <- c(0, coef_lm[7:10])
names(slice_by_day_coef)  <- c("A", "B", "C", "D", "E")

banana$sim_amyloplasts <- coef_lm[1] + slice_coef[banana$slice] +
banana$day * (coef_lm[2] + slice_by_day_coef[banana$slice]) +
rnorm(nrow(banana), 0, sigma)
banana
}

sim_lm  <- sim(model_lm)

sim_banana  <- y_rep_lm(sim_lm@coef[1,], sim_lm@sigma[1], banana)


The result looks like this (black dots) compared with the real data (grey dots).

The linear model doesn’t know that the number of amyloplasts can’t go below zero, so it happily generates absurd negative values. While not apparent from the plots, the linear model also doesn’t know that amyloplasts counts are restricted to be whole numbers. Let’s fit a generalized linear model with a Poisson distribution, which should be more suited to this kind of discrete data. The log link function will also turn the linear decrease into an exponential decline, which seems appropriate for the decline in amyloplasts.

model_glm <- glm(amyloplasts ~ day * slice,
data = banana,

pred_glm <- predict(model_glm,
newdata = levels,
se.fit = TRUE)

results_glm <- data.frame(levels,
average = pred_glm$fit, se = pred_glm$se.fit,
stringsAsFactors = FALSE)

y_rep_glm  <- function(coef_glm, banana) {
slice_coef  <- c(0, coef_glm[3:6])
names(slice_coef)  <- c("A", "B", "C", "D", "E")

slice_by_day_coef  <- c(0, coef_glm[7:10])
names(slice_by_day_coef)  <- c("A", "B", "C", "D", "E")

latent  <- exp(coef_glm[1] +
slice_coef[banana$slice] + banana$day * (coef_glm[2] + slice_by_day_coef[banana$slice])) banana$sim_amyloplasts  <- rpois(n = nrow(banana),
lambda = latent)
banana
}

sim_glm  <- sim(model_glm)

sim_banana_glm  <- y_rep_glm(sim_glm@coef[2,], banana)


This code is the same deal as above, with small modifications: glm instead of lm, with some differences in the interface. Then a function to simulate data from a Poisson model with an logarithmic link, that we apply to one set of parameters values.

There are no impossible zeros anymore. However, there seems to be many more zeros in the real data than in the simulated data, and consequently, as the number of amyloplasts grow small, we overestimate how many there should be.

Another possibility among the standard arsenal of models is a generalised linear model with a negative binomial distribution. As opposed to the Poisson, this allows greater spread among the values. We can fit a negative binomial model with Stan.

library(rstan)

model_nb  <- stan(file = "banana.stan",
data = list(n = nrow(banana),
n_slices = length(unique(banana$slice)), n_days = length(unique(banana$day)),
amyloplasts = banana$amyloplasts, day = banana$day - 1,
slice = as.numeric(factor(banana$slice)), prior_phi_scale = 1)) y_rep <- rstan::extract(model_nb, pars = "y_rep")[[1]]  Here is the Stan code in banana.stan: data { int n; int n_slices; int <lower = 0> amyloplasts[n]; real <lower = 0> day[n]; int <lower = 1, upper = n_slices> slice[n]; real prior_phi_scale; } parameters { real initial_amyloplasts[n_slices]; real decline[n_slices]; real < lower = 0> phi_rec; } model { phi_rec ~ normal(0, 1); for (i in 1:n) { amyloplasts[i] ~ neg_binomial_2_log(initial_amyloplasts[slice[i]] + day[i] * decline[slice[i]], (1/phi_rec)^2); } } generated quantities { vector[n] y_rep; for (i in 1:n) { y_rep[i] = neg_binomial_2_rng(exp(initial_amyloplasts[slice[i]] + day[i] * decline[slice[i]]), (1/phi_rec)^2); } }  This model is similar to the Poisson model, except that the negative binomial allows an overdispersion parameter, a small value of which corresponds to large variance. Therefore, we put the prior on the reciprocal of the square root of the parameter. Conveniently, Stan can also make the simulated replicated data for us in the generated quantities block. What does the simulated data look like? Here we have a model that allows for more spread, but in the process, generates some extreme data, with hundreds of amyloplasts per cell in some slices. We can try to be procrustean with the prior and constrain the overdispersion to smaller values instead: model_nb2 <- stan(file = "banana.stan", data = list(n = nrow(banana), n_slices = length(unique(banana$slice)),
n_days = length(unique(banana$day)), amyloplasts = banana$amyloplasts,
day = banana$day - 1, slice = as.numeric(factor(banana$slice)),
prior_phi_scale = 0.1))

y_rep2  <- rstan::extract(model_nb2, pars = "y_rep")[[1]]


That looks a little better. Now, we’ve only looked at single simulated datasets, but we can get a better picture by looking at replicate simulations. We need some test statistics, so let us count how many zeroes there are in each dataset, what the maximum value is, and the sample variance, and then do some visual posterior predictive checks.


check_glm  <- data.frame(n_zeros = numeric(1000),
max_value = numeric(1000),
variance = numeric(1000),
model = "Poisson",
stringsAsFactors = FALSE)

check_nb  <- data.frame(n_zeros = numeric(1000),
max_value = numeric(1000),
variance = numeric(1000),
model = "Negative binomial",
stringsAsFactors = FALSE)

check_nb2  <- data.frame(n_zeros = numeric(1000),
max_value = numeric(1000),
variance = numeric(1000),
model = "Negative binomial 2",
stringsAsFactors = FALSE)

for (sim_ix in 1:1000) {
y_rep_data  <- y_rep_glm(sim_glm@coef[sim_ix,], banana)
check_glm$n_zeros[sim_ix] <- sum(y_rep_data$sim_amyloplasts == 0)
check_glm$max_value[sim_ix] <- max(y_rep_data$sim_amyloplasts)
check_glm$variance[sim_ix] <- var(y_rep_data$sim_amyloplasts)

check_nb$n_zeros[sim_ix] <- sum(y_rep[sim_ix,] == 0) check_nb$max_value[sim_ix]  <- max(y_rep[sim_ix,])
check_nb$variance[sim_ix] <- var(y_rep[sim_ix,]) check_nb2$n_zeros[sim_ix]  <- sum(y_rep2[sim_ix,] == 0)
check_nb2$max_value[sim_ix] <- max(y_rep2[sim_ix,]) check_nb2$variance[sim_ix]  <- var(y_rep2[sim_ix,])
}

check  <- rbind(check_glm,
check_nb,
check_nb2)

melted_check  <- gather(check, "variable", "value", -model)

check_data  <- data.frame(n_zeros = sum(banana$amyloplasts == 0), max_value = max(banana$amyloplasts),
variance = var(banana$amyloplasts))  Here is the resulting distribution of these three discrepancy statistics in 1000 simulated datasets for the three models (generalised linear model with Poisson distribution and the two negative binomial models). The black line is the value for real data. When viewed like this, it becomes apparent how the negative binomial models do not fit that well. The Poisson model struggles with the variance and the number of zeros. The negative binomial models get closer to the number of zeros in the real data, they still have too few, while at the same time having way too high maximum values and variance. Finally, let’s look at the fitted means and intervals from all the models. We can use the predict function for the linear model and Poisson model, and for the negative binomial models, we can write our own: pred_stan <- function(model, newdata) { samples <- rstan::extract(model) initial_amyloplasts <- data.frame(samples$initial_amyloplasts)
decline  <- data.frame(samples$decline) names(initial_amyloplasts) <- names(decline) <- c("A", "B", "C", "D", "E") ## Get posterior for levels pred <- matrix(0, ncol = nrow(newdata), nrow = nrow(initial_amyloplasts)) for (obs in 1:ncol(pred)) { pred[,obs] <- initial_amyloplasts[,newdata$slice[obs]] +
(newdata$day[obs] - 1) * decline[,newdata$slice[obs]]
}

## Get mean and interval
newdata$fit <- exp(colMeans(pred)) intervals <- lapply(data.frame(pred), quantile, probs = c(0.025, 0.975)) newdata$lwr  <- exp(unlist(lapply(intervals, "[", 1)))
newdata\$upr  <- exp(unlist(lapply(intervals, "[", 2)))

newdata
}

pred_nb <- pred_stan(model_nb, levels)
pred_nb2 <- pred_stan(model_nb2, levels)


In summary, the three generalised linear models with log link function pretty much agree about the decline of amyloplasts during the later days, which looks more appropriate than a linear decline. They disagree about the uncertainty about the numbers on the first day, which is when there are a lot. Perhaps coincidentally, this must also be where the quality of my counts are the lowest, because it is hard to count amyloplasts on top of each other.