Using R: plotting the genome on a line

Imagine you want to make a Manhattan-style plot or anything else where you want a series of intervals laid out on one axis after one another. If it’s actually a Manhattan plot you may have a friendly R package that does it for you, but here is how to cobble the plot together ourselves with ggplot2.

We start by making some fake data. Here, we have three contigs (this could be your chromosomes, your genomic intervals or whatever) divided into one, two and three windows, respectively. Each window has a value that we’ll put on the y-axis.

library(dplyr)
library(ggplot2)

data <- data_frame(contig = c("a", "a", "a", "b", "b", "c"),
                   start = c(0, 500, 1000, 0, 500, 0),
                   end = c(500, 1000, 1500, 500, 1000, 200),
                   value = c(0.5, 0.2, 0.4, 0.5, 0.3, 0.1))

We will need to know how long each contig is. In this case, if we assume that the windows cover the whole thing, we can get this from the data. If not, say if the windows don’t go up to the end of the chromosome, we will have to get this data from elsewhere (often some genome assembly metadata). This is also where we can decide in what order we want the contigs.

contig_lengths <- summarise(group_by(data, contig), length = max(end))

Now, we need to transform the coordinates on each contig to coordinates on our new axis, where we lay the contings after one another. What we need to do is to add an offset to each point, where the offset is the sum of the lengths of the contigs we’ve layed down before this one. We make a function that takes three arguments: two vectors containing the contig of each point and the position of each point, and also the table of lengths we just made.

flatten_coordinates <- function(contig, coord, contig_lengths) {
    coord_flat <- coord
    offset <- 0

    for (contig_ix in 1:nrow(contig_lengths)) {
        on_contig <- contig == contig_lengths$contig[contig_ix]
        coord_flat[on_contig] <- coord[on_contig] + offset
        offset <- offset + contig_lengths$length[contig_ix]
    }

    coord_flat
}

Now, we use this to transform the start and end of each window. We also transform the vector of the length of the contigs, so we can use it to add vertical lines between the contigs.

data$start_flat <- flatten_coordinates(data$contig,
                                       data$start,
                                       contig_lengths)
data$end_flat <- flatten_coordinates(data$contig,
                                     data$end,
                                     contig_lengths)
contig_lengths$length_flat <- flatten_coordinates(contig_lengths$contig,
                                                  contig_lengths$length,
                                                  contig_lengths)

It would be nice to label the x-axis with contig names. One way to do this is to take the coordinates we just made for the vertical lines, add a zero, and shift them one position, like so:

axis_coord <- c(0, contig_lengths$length_flat[-nrow(contig_lengths)])

Now it’s time to plot! We add one layer of points for the values on the y-axis, where each point is centered on the middle of the window, followed by a layer of vertical lines at the borders between contigs. Finally, we add our custom x-axis, and also some window dressing.

plot_genome <- ggplot() +
    geom_point(aes(x = (start_flat + end_flat)/2,
                   y = value),
               data = data) +
    geom_vline(aes(xintercept = length_flat),
               data = contig_lengths) +
    scale_x_continuous(breaks = axis_coord,
                       labels = contig_lengths$contig,
                       limits = c(0, max(contig_lengths$length_flat))) +
    xlab("Contig") + ylim(0, 1) + theme_bw()

And this is what we get:

I’m sure your plot will look more impressive, but you get the idea.

Using R: When using do in dplyr, don’t forget the dot

There will be a few posts about switching from plyr/reshape2 for data wrangling to the more contemporary dplyr/tidyr.

My most common use of plyr looked something like this: we take a data frame, split it by some column(s), and use an anonymous function to do something useful. The function takes a data frame and returns another data frame, both of which could very possibly have only one row. (If, in fact, it has to have only one row, I’d suggest an assert_that() call as the first line of the function.)

library(plyr)
results <- ddply(some_data, "key", function(x) {
  ## do something; return data.frame()
})

Or maybe, if I felt serious and thought the function would ever be used again, I’d write:

calculate <- function(x) {
  ## do something; return data.frame()
}
result <- ddply(some_data, "key", calculate)

Rinse and repeat over and over again. For me, discovering ddply was like discovering vectorization, but for data frames. Vectorization lets you think of operations on vectors, without having to think about their elements. ddply lets you think about operations on data frames, without having to think about rows and columns. It saves a lot of thinking.

The dplyr equivalent would be do(). It looks like this:

library(dplyr)
grouped <- group_by(some_data, key)
result <- do(grouped, calculate(.))

Or once again with magrittr:

library(magrittr)
some_data %>%
  group_by(key) %>%
  do(calculate(.)) -> result

(Yes, I used the assignment arrow from the left hand side to the right hand side. Roll your eyes all you want. I think it’s in keeping with the magrittr theme of reading from left to right.)

One important thing here, which got me at first: There has to be a dot! Just passing the function name, as one would have done with ddply, will not work:

grouped <- group_by(some_data, key)
## will not work: Error: Results are not data frames at positions ...
try(result <- do(grouped, calculate))

Don’t forget the dot!

It seems dplyr is overtaking correlation heatmaps

(… on my blog, that is.)

For a long time, my correlation heatmap with ggplot2 was the most viewed post on this blog. It still leads the overall top list, but by far the most searched and visited post nowadays is this one about dplyr (followed by it’s sibling about plyr).

I fully support this, since data wrangling and reorganization logically comes before plotting (especially in the ggplot2 philosophy).

But it’s also kind of a shame, because it’s not a very good dplyr post, and the one about the correlation heatmap is not a very good ggplot2 post. Thankfully, there is a new edition of the ggplot2 book by Hadley Wickham, and a new book by him and Garrett Grolemund about data analysis with modern R packages. I’m looking forward to reading them.

Personally, I still haven’t made the switch from plyr and reshape2 to dplyr and tidyr. But here is the updated tidyverse-using version of how to quickly calculate summary statistics from a data frame:

library(tidyr)
library(dplyr)
library(magrittr)

data <- data.frame(sex = c(rep(1, 1000), rep(2, 1000)),
                   treatment = rep(c(1, 2), 1000),
                   response1 = rnorm(2000, 0, 1),
                   response2 = rnorm(2000, 0, 1))

gather(data, response1, response2, value = "value", key = "variable") %>%
  group_by(sex, treatment, variable) %>%
  summarise(mean = mean(value), sd = sd(value))

Row by row we:

1-3: Load the packages.

5-8: Simulate some nonsense data.

10: Transform the simulated dataset to long form. This means that the two variables response1 and response2 get collected to one column, which will be called ”value”. The column ”key” will indicate which variable each row belongs to. (gather is tidyr’s version of melt.)

11: Group the resulting dataframe by sex, treatment and variable. (This is like the second argument to d*ply.)

12: Calculate the summary statistics.

Source: local data frame [8 x 5]
Groups: sex, treatment [?]

    sex treatment  variable        mean        sd
  (dbl)     (dbl)     (chr)       (dbl)     (dbl)
1     1         1 response1 -0.02806896 1.0400225
2     1         1 response2 -0.01822188 1.0350210
3     1         2 response1  0.06307962 1.0222481
4     1         2 response2 -0.01388931 0.9407992
5     2         1 response1 -0.06748091 0.9843697
6     2         1 response2  0.01269587 1.0189592
7     2         2 response1 -0.01399262 0.9696955
8     2         2 response2  0.10413442 0.9417059