Using R: 10 years with R

Yesterday, 29 Feburary 2020, was the 20th anniversary of the release R 1.0.0. Jozef Hajnala’s blog has a cute anniversary post with some trivia. I realised that it is also (not to the day, but to the year) my R anniversary.

I started using R in 2010, during my MSc project in Linköping. Daniel Nätt, who was a PhD student there at the time, was using it for gene expression and DNA methylation work. I think that was the reason he was pulled into R; he needed the Bioconductor packages for microarrays. He introduced me. Thanks, Daniel!

I think I must first have used it to do something with qPCR melting curves. I remember that I wrote some function to reshape/pivot data between long and wide format. It was probably an atrocity of nested loops and hard bracket indexing. Coming right from an undergraduate programme with courses using Ada and C++, even if we had also used Minitab for statistics and Matlab for engineering, I spoke R with a strong accent. At any rate, I was primed to think that doing my data analysis with code was a good idea, and jumped at the opportunity to learn a tool for it. Thanks, undergraduate programme!

I think the easiest thing to love about R is the package system. You can certainly end up in dependency hell with R and metaphorically shoot your own foot, especially on a shared high performance computing system. But I wouldn’t run into any of that until after several years. I was, and still am, impressed by how packages just worked, and could do almost anything. So, the Bioconductor packages were probably, indirectly, why I was introduced to R, and after that, my R story can be told in a series of packages. Thanks, CRAN!

The next package was R/qtl, that I relied on for my PhD. I had my own copy of the R/qtl book. For a period, I probably wrote thing every day:

library(qtl)

cross <- read.cross(file = "F8_geno_trim.csv", format = "csv")

R/qtl is one of my favourite pieces or research software, relatively friendly and with lots of documentation. Thanks, R/qtl developers!

Of course it was Dom Wright, who was my PhD supervisor, who introduced me to R/qtl, and I think it was also he who introduced me to ggplot2. At least he used it, and at some point we were together trying to fix the formatting of a graph, probably with some ugly hack. I decided to use ggplot2 as much as possible, and as it is wont to, ggplot2 made me care about rearranging data, thus leading to reshape2 and plyr. ”The magic is not in plotting the data but in tidying and rearranging the data for plotting.” After a while, most everything I wrote used the ddply function in some way. Thank you, Hadley Wickham!

Then came the contemporary tidyverse. For the longest time, I was uneasy with tidyr, and I’m still not a regular purrr user, but one can’t avoid loving dplyr. How much? My talk at the Swedish Bioinformatics Workshop in 2016 had a slide expressing my love of the filter function. It did not receive the cheers that the function deserves. Maybe the audience were Python users. With new file reading functions, new data frames and functions to manipulate data frames, modern R has become smoother and friendlier. Thanks, tidyverse developers!

The history of R on this blog started in 2011, originally as a way to make notes for myself or, ”a fellow user who’s trying to google his or her way to a solution”. This turned into a series of things to help teach R to biologists around me.

There was the Slightly different introduction to R series of blog posts. It used packages that feel somewhat outdated, and today, I don’t think there’s anything even slightly different about advocating RStudio, and teaching ggplot2 from the beginning.

This spawned a couple of seminars in course for PhD students, which were updated for the Wright lab computation lunches, and eventually turned into a course of its own given in 2017. It would be fun to update it and give it again.

The last few years, I’ve been using R for reasonably large genome datasets in a HPC environment, and gotten back to the beginnings, I guess, by using Bioconducor a lot more. However, the package that I think epitomises the last years of my R use is AlphaSimR, developed by colleagues in Edinburgh. It’s great to be able throw together a quick simulation to check how some feature of genetics behaves. AlphaSimR itself is also an example of how far the R/C++ integration has come with RCpp and RCppArmadillo. Thanks, Chris!

In summary, R is my tool of choice for almost anything. I hope we’ll still be using it, in new and interesting ways, in another ten years. Thank you, R core team!

Using R: from plyr to purrr, part 0 out of however many

This post is me thinking out loud about applying functions to vectors or lists and getting data frames back.

Using R is an ongoing process of finding nice ways to throw data frames, lists and model objects around. While tidyr has arrived at a comfortable way to reshape dataframes with pivot_longer and pivot_wider, I don’t always find the replacements for the good old plyr package as satisfying.

Here is an example of something I used to like to do with plyr. Don’t laugh!

Assume we have a number of text files, all in the same format, that we need to read and combine. This arises naturally if you run some kind of analysis where the dataset gets split into chunks, like in genetics, where chunks might be chromosomes.

## Generate vector of file names
files <- paste("data/chromosome", 1:20, ".txt", sep = "")

library(plyr)
library(readr)
genome <- ldply(files, read_tsv)

This gives us one big data frame, containing the rows from all those files.

If we want to move on from plyr, what are our options?

We can go old school with base R functions lapply and Reduce.

library(readr)

chromosomes <- lapply(files, read_tsv)
genome <- Reduce(rbind, chromosomes)

Here, we first let lapply read each file and store it in a list. Then we let Reduce fold the list with rbind, which binds the data frames in the list together, one below the other.

If that didn’t make sense, here it is again: lapply maps a function to each element of a vector or list, collecting the results in a list. Reduce folds the elements in a list together, using a function that takes in two arguments. The first argument will be the results it’s accumulated so far, and the second argument will be the next element of the list.

In the end, this leaves us, as with ldply, with one big data frame.

We can also use purrr‘s map_dfr. This seems to be the contemporary most elegant solution:

library(purrr)
library(readr)

genome <- map_dfr(files, read_tsv)

map_dfr, like good old ldply will map over a vector or list, and collect resulting data frames. The ”r” in the name means adding the next data frame as rows. There is also a ”c” version (map_dfc) for adding as columns.

Scripting for data analysis (with R)

Course materials (GitHub)

This was a PhD course given in the spring of 2017 at Linköping University. The course was organised by the graduate school Forum scientium and was aimed at people who might be interested in using R for data analysis. The materials developed from a part of a previous PhD course from a couple of years ago, an R tutorial given as part of the Behaviour genetics Masters course, and the Wright lab computation lunches.

Around twenty people attended the seminars, and a couple of handfuls of people completed the homeworks. I don’t know how much one should read into the course evaluation form, but the feedback was mostly positive. Some people had previous exposure to R, and did the first homework in an hour. Others had never programmed in any language, and had a hard time getting started.

There is certainly scope for improvement. For example, some of the packages used could be substituted for more contemporary tools. One could say that the course is slouching towards the tidyverse. But I worry a bit about making the participants feel too boxed in. I don’t want them to feel that they’re taught a way that will solve some anticipated type of problems very neatly, but that may not generalize. Once I’ve made the switch to dplyr and tidyr (and maybe even purr … though I hesitate) fully myself, I would probably use them in teaching too. Another nice plus would be to be able to use R for data science as course literature. The readings now are scattered; maybe a monolithic book would be good.

I’ve tried, in every iteration, to emphasize the importance of writing scripts, even when working interactively with R. I still think I need to emphasize it even more. There is also a kind of ”do as I say, not as I do” issue, since in the seminars, I demo some things by just typing them into the console. I’ll force myself to write them into a script instead.

Possible alternate flavours for the course include: A longer version expanding on the same topics. I don’t think one should cram more contents in. I’d like to have actual projects where the participants can analyze, visualize and present data and simulations.

This is the course plan we sent out:

1. A crash course in R

Why do data analysis with a scripting language
The RStudio interface
Using R as a calculator
Working interactively and writing code
Getting help
Reading and looking at data
Installing useful packages
A first graph with ggplot2

Homework for next time: The Unicorn Dataset, exercises in reading data, descriptive statistics, linear models and a few statistical graphs.

2. Programming for data analysis

Programming languages one may encounter in science
Common concepts and code examples
Data structures in R
Vectors
Data frames
Functions
Control flow

Homework for next time: The Unicorn Expression Dataset, exercises in data wrangling and more interesting graphs.

3. Working with moderately large data

Exercise followup
More about functions
Lists
Objects
Functional and imperative programming
Doing things many times, loops and plyr
Simulating data
Working on a cluster

Final homework: Design analysis by simulation: pick a data analysis project that you care about; simulate data based on a model and reasonable effect size; implement the data analysis; and apply it to simulated data with and without effects to estimate power and other design characteristics. This ties together skills from all seminars.

Using R: When using do in dplyr, don’t forget the dot

There will be a few posts about switching from plyr/reshape2 for data wrangling to the more contemporary dplyr/tidyr.

My most common use of plyr looked something like this: we take a data frame, split it by some column(s), and use an anonymous function to do something useful. The function takes a data frame and returns another data frame, both of which could very possibly have only one row. (If, in fact, it has to have only one row, I’d suggest an assert_that() call as the first line of the function.)

library(plyr)
results <- ddply(some_data, "key", function(x) {
  ## do something; return data.frame()
})

Or maybe, if I felt serious and thought the function would ever be used again, I’d write:

calculate <- function(x) {
  ## do something; return data.frame()
}
result <- ddply(some_data, "key", calculate)

Rinse and repeat over and over again. For me, discovering ddply was like discovering vectorization, but for data frames. Vectorization lets you think of operations on vectors, without having to think about their elements. ddply lets you think about operations on data frames, without having to think about rows and columns. It saves a lot of thinking.

The dplyr equivalent would be do(). It looks like this:

library(dplyr)
grouped <- group_by(some_data, key)
result <- do(grouped, calculate(.))

Or once again with magrittr:

library(magrittr)
some_data %>%
  group_by(key) %>%
  do(calculate(.)) -> result

(Yes, I used the assignment arrow from the left hand side to the right hand side. Roll your eyes all you want. I think it’s in keeping with the magrittr theme of reading from left to right.)

One important thing here, which got me at first: There has to be a dot! Just passing the function name, as one would have done with ddply, will not work:

grouped <- group_by(some_data, key)
## will not work: Error: Results are not data frames at positions ...
try(result <- do(grouped, calculate))

Don’t forget the dot!

Using R: quickly calculating summary statistics from a data frame

A colleague asked: I have a lot of data in a table and I’d like to pull out some summary statistics for different subgroups. Can R do this for me quickly?

Yes, there are several pretty convenient ways. I wrote about this in the recent post on the barplot, but as this is an important part of quickly getting something useful out of R, just like importing data, I’ll break it out into a post of its own. I will present a solution that uses the plyr and reshape2 packages. You can do the same with base R, and there’s nothing wrong with base R, but I find that plyr and reshape2 makes things convenient and easy to remember. The apply family of functions in base R does the same job as plyr, but with a slightly different interface. I strongly recommend beginners to begin with plyr or the apply functions, and not what I did initially, which was nested for loops and hard bracket indexing.

We’ll go through and see what the different parts do. First, simulate some data. Again, when you do this, you usually have a table already, and you can ignore the simulation code. Usually a well formed data frame will look something this: a table where each observation is a unit such as an individual, and each column gives the data about the individual. Here, we imagine two binary predictors (sex and treatment) and two continuous response variables.

data <- data.frame(sex = c(rep(1, 1000), rep(2, 1000)),
                   treatment = rep(c(1, 2), 1000),
                   response1 = rnorm(2000, 0, 1),
                   response2 = rnorm(2000, 0, 1))
head(data)
  sex treatment   response1   response2
1   1         1 -0.15668214 -0.13663012
2   1         2 -0.40934759 -0.07220426
3   1         1  0.07103731 -2.60549018
4   1         2  0.15113270  1.81803178
5   1         1  0.30836910  0.32596016
6   1         2 -1.41891407  1.12561812

Now, calculating a function of the response in some group is straightforward. Most R functions are vectorised by default and will accept a vector (that is, a column of a data frame). The subset function lets us pull out rows from the data frame based on a logical expression using the column names. Say that we want mean, standard deviation and a simple standard error of the mean. I will assume that we have no missing values. If you have, you can add na.rm=T to the function calls. And again, if you’ve got a more sophisticated model, these might not be the standard errors you want. Then pull them from the fitted model instead.

mean(subset(data, sex == 1 & treatment == 1)$response1)

sd(subset(data, sex == 1 & treatment == 1)$response1)

sd(subset(data, sex == 1 & treatment == 1)$response1)/
  sqrt(nrow(subset(data, sex == 1 & treatment == 1)))

Okay, but doing this for each combination of the predictors and responses is no fun and requires a lot of copying and pasting. Also, the above function calls are pretty messy with lots of repetition. There is a better way, and that’s where plyr and reshape2 come in. We load the packages. The first time you’ll have to run install.packages, as usual.

library(plyr)
library(reshape2)

First out, the melt function from rehape2. Look at the table above. It’s reasonable in many situations, but right now, it would be better if we put both the response variables in the same column. If it doesn’t seem so useful, trust me and see below. Melt will take all the columns except the ones we single out as id variables and put them in the same column. It makes sense to label each row with the sex and treatment of the individual. If we had an actual unit id column, it would go here as well:

melted <- melt(data, id.vars=c("sex", "treatment"))

The resulting ”melted” table looks like this. Instead of the response variables separately we get a column of values and a column indicating which variable the value comes from.

  sex treatment  variable       value
1   1         1 response1 -0.15668214
2   1         2 response1 -0.40934759
3   1         1 response1  0.07103731
4   1         2 response1  0.15113270
5   1         1 response1  0.30836910
6   1         2 response1 -1.41891407

Now it’s time to calculate the summary statistics again. We will use the same functions as above to do the actual calculations, but we’ll use plyr to automatically apply them to all the subsets we’re interested in. This is sometimes called the split-apply-combine approach: plyr will split the data frame into subsets, apply the function of our choice, and then collect the results for us. The first thing to notice is the function name. All the main plyr functions are called something with -ply. The letters stand for the input and return data type: ddply works on a data frame and returns a data frame. It’s probably the most important member of the family.

The arguments to ddply are the data frame to work on (melted), a vector of the column names to split on, and a function. The arguments after the function name are passed on to the function. Here we want to split in subsets for each sex, treatment and response variable. The function we apply is summarise, which makes a new data frame with named columns based on formulas, allowing us to use the column names of the input data frame in formulas. In effect it does exactly what the name says, summarises a data frame. And in this instance, we want to calculate the mean, standard deviation and standard error of the mean, so we use the above function calls, using value as the input. Run the ddply call, and we’re done!

ddply(melted, c("sex", "treatment", "variable"), summarise,
      mean = mean(value), sd = sd(value),
      sem = sd(value)/sqrt(length(value)))
  sex treatment  variable         mean        sd        sem
1   1         1 response1  0.021856280 1.0124371 0.04527757
2   1         1 response2  0.045928150 1.0151670 0.04539965
3   1         2 response1 -0.065017971 0.9825428 0.04394065
4   1         2 response2  0.011512867 0.9463053 0.04232006
5   2         1 response1 -0.005374208 1.0095468 0.04514830
6   2         1 response2 -0.051699624 1.0154782 0.04541357
7   2         2 response1  0.046622111 0.9848043 0.04404179
8   2         2 response2 -0.055257295 1.0134786 0.04532414

Using R: writing a table with odd lines (again)

Let’s look at my gff track headers again. Why not do it with plyr instead?

d_ply splits the data frame by the feature column and applies a nameless function that writes subsets to the file (and returns nothing, hence the ”_” in the name). This isn’t shorter or necessarily better, but it appeals to me.

library(plyr)
connection <- file("separate_tracks_2.gff", "w")
d_ply(gff, "feature", function(x) {
  writeLines(paste("track name=", x$feature[1], sep=""), connection)
  write.table(x, sep="\t", row.names=F, col.names=F,
              quote=F, file=connection)
})
close(connection)