# Showing a difference in mean between two groups, take 2

A couple of years ago, I wrote about the paradoxical difficulty of visualising a difference in means between two groups, while showing both the data and some uncertainty interval. I still feel like many ills in science come from our inability to interpret a simple comparison of means. Anything with more than two groups or a predictor that isn’t categorical makes things worse, of course. It doesn’t take much to overwhelm the intuition.

My suggestion at the time was something like this — either a panel that shows the data an another panel with coefficients and uncertainty intervals; or a plot that shows the with lines that represent the magnitude of the difference at the upper and lower limit of the uncertainty interval. Alternative 1 (left), with separate panels for data and coefficient estimates, and alternative 2 (right), with confidence limits for the difference shown as vertical lines. For details, see the old post about these graphs.

Here is the fake data and linear model we will plot. If you want to follow along, the whole code is on GitHub. Group 0 should have a mean of 4, and the difference between groups should be two, and as the graphs above show, our linear model is not too far off from these numbers.

```library(broom)

data <- data.frame(group = rep(0:1, 20))
data\$response <- 4 + data\$group * 2 + rnorm(20)

model <- lm(response ~ factor(group), data = data)
result <- tidy(model)
```

Since the last post, a colleague has told me about the Gardner-Altman plot. In a paper arguing that confidence intervals should be used to show the main result of studies, rather than p-values, Gardner & Altman (1986) introduced plots for simultaneously showing confidence intervals and data.

Their Figure 1 shows (hypothetical) systolic blood pressure data for a group of diabetic and non-diabetic people. The left panel is a dot plot for each group. The right panel is a one-dimensional plot (with a different scale than the right panel; zero is centred on the mean of one of the groups), showing the difference between the groups and a confidence interval as a point with error bars.

There are functions for making this kind of plot (and several more complex alternatives for paired comparisons and analyses of variance) in the R package dabestr from Ho et al. (2019). An example with our fake data looks like this: Alternative 3: Gardner-Altman plot with bootstrap confidence interval.

```library(dabestr)

bootstrap <- dabest(data,
group,
response,
idx = c("0", "1"),
paired = FALSE)

bootstrap_diff <- mean_diff(bootstrap)

plot(bootstrap_diff)
```

While this plot is neat, I think it is a little too busy — I’m not sure that the double horizontal lines between the panels or the shaded density for the bootstrap confidence interval add much. I’d also like to use other inference methods than bootstrapping. I like how the scale of the right panel has the same unit as the left panel, but is offset so the zero is at the mean of one of the groups.

Here is my attempt at making a minimalistic version: Alternative 4: Simplified Garner-Altman plot.

This piece of code first makes the left panel of data using ggbeeswarm (which I think looks nicer than the jittering I used in the first two alternatives), then the right panel with the estimate and approximate confidence intervals of plus/minus two standard errors of the mean), adjusts the scale, and combines the panels with patchwork.

```library(ggbeeswarm)
library(ggplot2
library(patchwork)

ymin <- min(data\$response)
ymax <- max(data\$response)

plot_points_ga <- ggplot() +
geom_quasirandom(aes(x = factor(group), y = response),
data = data) +
xlab("Group") +
ylab("Response") +
theme_bw() +
scale_y_continuous(limits = c(ymin, ymax))

height_of_plot <- ymax-ymin

group0_fraction <- (coef(model) - ymin)/height_of_plot

diff_min <- - height_of_plot * group0_fraction

diff_max <- (1 - group0_fraction) * height_of_plot

plot_difference_ga <- ggplot() +
geom_pointrange(aes(x = term, y = estimate,
ymin = estimate - 2 * std.error,
ymax = estimate + 2 * std.error),
data = result[2,]) +
scale_y_continuous(limits = c(diff_min, diff_max)) +
ylab("Difference") +
xlab("Comparison") +
theme_bw()

(plot_points_ga | plot_difference_ga) +
plot_layout(widths = c(0.75, 0.25))
```

Literature

Gardner, M. J., & Altman, D. G. (1986). Confidence intervals rather than P values: estimation rather than hypothesis testing. British Medical Journal

Ho, J., Tumkaya, T., Aryal, S., Choi, H., & Claridge-Chang, A. (2019). Moving beyond P values: data analysis with estimation graphics. Nature methods.

# Using R: setting a colour scheme in ggplot2

Note to self: How to quickly set a colour scheme in ggplot2.

Imagine we have a series of plots that all need a uniform colour scale. The same category needs to have the same colour in all graphics, made possibly with different packages and by different people. Instead of hard-coding the colours and the order of categories, we can put them in a file, like so:

```library(readr)
colours <- read_csv("scale_colours.csv")
```
```# A tibble: 5 x 2
name   colour

1 blue   #d4b9da
2 red    #c994c7
3 purple #df65b0
4 green  #dd1c77
5 orange #980043```

Now a plot with default colours, using some made-up data:

```x <- 1:100

beta <- rnorm(5, 1, 0.5)

stroop <- data.frame(x,
sapply(beta, function(b) x * b + rnorm(100, 1, 10)))
colnames(stroop)[2:6] <- c("orange", "blue", "red", "purple", "green")

data_long <- pivot_longer(stroop, -x)

plot_y <- qplot(x = x,
y = value,
colour = name,
data = data_long) +
theme_minimal() +
theme(panel.grid = element_blank())
``` Now we can add the custom scale like this:

```plot_y_colours <- plot_y +
scale_colour_manual(limits = colours\$name,
values = colours\$colour)

``` # Using R: simple Gantt chart with ggplot2

Jeremy Yoder’s code for a simple Gantt chart on the Molecular Ecologist blog uses geom_line and gather to prepare the data structure. I like using geom_linerange and a coord_flip, which lets you use start and end columns directly without pivoting.

Here is a very serious data frame of activities:

```# A tibble: 6 x 4
activity       category        start               end

1 Clean house    preparations    2020-07-01 00:00:00 2020-07-03 00:00:00
2 Pack bags      preparations    2020-07-05 10:00:00 2020-07-05 17:00:00
3 Run to train   travel          2020-07-05 17:00:00 2020-07-05 17:15:00
4 Sleep on train travel          2020-07-05 17:15:00 2020-07-06 08:00:00
5 Procrastinate  procrastination 2020-07-01 00:00:00 2020-07-05 00:00:00
6 Sleep          vacation        2020-07-06 08:00:00 2020-07-09 00:00:00
```

And here is the code:

```
library(ggplot2)
library(readr)

activities <- read_csv("activities.csv")

## Set factor level to order the activities on the plot
activities\$activity <- factor(activities\$activity,
levels = activities\$activity[nrow(activities):1])

plot_gantt <- qplot(ymin = start,
ymax = end,
x = activity,
colour = category,
geom = "linerange",
data = activities,
size = I(5)) +
scale_colour_manual(values = c("black", "grey", "purple", "yellow")) +
coord_flip() +
theme_bw() +
theme(panel.grid = element_blank()) +
xlab("") +
ylab("") +
ggtitle("Vacation planning")
``` # Using R: 10 years with R

Yesterday, 29 Feburary 2020, was the 20th anniversary of the release R 1.0.0. Jozef Hajnala’s blog has a cute anniversary post with some trivia. I realised that it is also (not to the day, but to the year) my R anniversary.

I started using R in 2010, during my MSc project in Linköping. Daniel Nätt, who was a PhD student there at the time, was using it for gene expression and DNA methylation work. I think that was the reason he was pulled into R; he needed the Bioconductor packages for microarrays. He introduced me. Thanks, Daniel!

I think I must first have used it to do something with qPCR melting curves. I remember that I wrote some function to reshape/pivot data between long and wide format. It was probably an atrocity of nested loops and hard bracket indexing. Coming right from an undergraduate programme with courses using Ada and C++, even if we had also used Minitab for statistics and Matlab for engineering, I spoke R with a strong accent. At any rate, I was primed to think that doing my data analysis with code was a good idea, and jumped at the opportunity to learn a tool for it. Thanks, undergraduate programme!

I think the easiest thing to love about R is the package system. You can certainly end up in dependency hell with R and metaphorically shoot your own foot, especially on a shared high performance computing system. But I wouldn’t run into any of that until after several years. I was, and still am, impressed by how packages just worked, and could do almost anything. So, the Bioconductor packages were probably, indirectly, why I was introduced to R, and after that, my R story can be told in a series of packages. Thanks, CRAN!

The next package was R/qtl, that I relied on for my PhD. I had my own copy of the R/qtl book. For a period, I probably wrote thing every day:

```library(qtl)

cross <- read.cross(file = "F8_geno_trim.csv", format = "csv")
```

R/qtl is one of my favourite pieces or research software, relatively friendly and with lots of documentation. Thanks, R/qtl developers!

Of course it was Dom Wright, who was my PhD supervisor, who introduced me to R/qtl, and I think it was also he who introduced me to ggplot2. At least he used it, and at some point we were together trying to fix the formatting of a graph, probably with some ugly hack. I decided to use ggplot2 as much as possible, and as it is wont to, ggplot2 made me care about rearranging data, thus leading to reshape2 and plyr. ”The magic is not in plotting the data but in tidying and rearranging the data for plotting.” After a while, most everything I wrote used the ddply function in some way. Thank you, Hadley Wickham!

Then came the contemporary tidyverse. For the longest time, I was uneasy with tidyr, and I’m still not a regular purrr user, but one can’t avoid loving dplyr. How much? My talk at the Swedish Bioinformatics Workshop in 2016 had a slide expressing my love of the filter function. It did not receive the cheers that the function deserves. Maybe the audience were Python users. With new file reading functions, new data frames and functions to manipulate data frames, modern R has become smoother and friendlier. Thanks, tidyverse developers!

The history of R on this blog started in 2011, originally as a way to make notes for myself or, ”a fellow user who’s trying to google his or her way to a solution”. This turned into a series of things to help teach R to biologists around me.

There was the Slightly different introduction to R series of blog posts. It used packages that feel somewhat outdated, and today, I don’t think there’s anything even slightly different about advocating RStudio, and teaching ggplot2 from the beginning.

This spawned a couple of seminars in course for PhD students, which were updated for the Wright lab computation lunches, and eventually turned into a course of its own given in 2017. It would be fun to update it and give it again.

The last few years, I’ve been using R for reasonably large genome datasets in a HPC environment, and gotten back to the beginnings, I guess, by using Bioconducor a lot more. However, the package that I think epitomises the last years of my R use is AlphaSimR, developed by colleagues in Edinburgh. It’s great to be able throw together a quick simulation to check how some feature of genetics behaves. AlphaSimR itself is also an example of how far the R/C++ integration has come with RCpp and RCppArmadillo. Thanks, Chris!

In summary, R is my tool of choice for almost anything. I hope we’ll still be using it, in new and interesting ways, in another ten years. Thank you, R core team!

# #TidyTuesday: horror films, squirrels and commuters

Tidy Tuesday is a fun weekly activity where a lot of R enthusiasts make different visualisations, and possibly modelling, of the same dataset. You can read more about it at their Github page. I participated for three weeks, and here is a recap. I will show excerpts of the code, but you can read the whole thing by clicking through to Github.

# 2019-10-22 Horror films

In time for Halloween, we got a dataset with horror film data from IMDB. (Yes, I will be mixing the terms ”film” and ”movie” wildly.)

The first week, I started with making a pretty boring plot, the way I’d normally plot things (white background, small multiples, you know the drill). I wanted to look at distribution over the year, so I plotted what month films are released and the distribution of review scores and budgets each month. After thinking about it for a while, I thought a logarithmic scale would make sense for budgets, that span a huge range. Also, after realising that the budget column actually didn’t contain dollars, but a mix of currencies, I decided not to try to convert, but use only the US dollar budgets.

I don’t often run into dates, to using the date functions from readr and lubridate was new to me, as was the built-in vector month.abb:

```library(dplyr)
library(egg)
library(ggplot2)
library(ggimage)
library(lubridate)
library(readr)
library(stringr)

movies <- read_csv("horror_movies.csv")

## Parse dates

movies\$release_parsed  <- parse_date(movies\$release_date,
format = "%d-%b-%y",
locale = locale("en"))

movies\$release_year <- ifelse(is.na(movies\$release_parsed),
movies\$release_date,
year(movies\$release_parsed))

movies\$release_month  <- month.abb[month(movies\$release_parsed)]
```

Here, we parse the release data, and extract the release year, treating films that only have a release year separately.

I also put in means with confidence intervals, like so, and a line for the mean review rating:

```model  <- lm(review_rating ~ release_month, movies)

fit  <- data.frame(release_month = month.abb,
predict(model,
newdata = data.frame(release_month = month.abb),
interval = "confidence"),
stringsAsFactors = FALSE)

grand_mean_rating  <- mean(movies\$review_rating,
na.rm = TRUE)
```

As an example of the plotting code, here is the middle panel for ratings. As usual with ggplot2, we layer geometries on top of each other (here: violin plots, points with range bars, and a horizontal line, followed by a lot of formatting.

```plot_rating <- ggplot() +
geom_violin(aes(x = release_month,
y = review_rating),
fill = "grey",
colour = NA,
data = movies) +
scale_x_discrete(limits = month.abb) +
geom_pointrange(aes(x = release_month,
y = fit,
ymax = upr,
ymin = lwr),
data = fit) +
geom_hline(yintercept = grand_mean_rating,
linetype = 2,
colour = "red") +
ylim(0, 10) +
theme_bw(base_size = 12) +
theme(panel.grid = element_blank()) +
xlab("") +
ylab("Review rating")
```

There is similar code for the other two panels. Finally, I used ggarrange from the egg package to put everything together. In summary, most horror films are released in October, probably around Halloween. The review ratings of films released in this horror season are also a tiny bit higher than during the rest of the year, but there is not much of a difference in the budgets. After that, and after seeing some of the fun horror-themed graphs other people made, I decided to make something more colourful. Here is a plot on the same theme, showing each day and year separately, an appropriately horrendous colour scheme, and a pumpkin icon to indicate the date of Halloween. I like this plot better because it shows more of the data. It shows the increase at Halloween. We also see some spikes at other dates, like 1 January of some years. It also shows how the dataset ends at Halloween 2017. The code for this plot is mostly a lot of theme formatting. The ggplot2 theme function takes a lot of arguments I’ve never used before.

```movies\$yday  <- yday(movies\$release_parsed)

daycount <- summarise(group_by(movies, yday, release_year), n = n())
```

First, we turn dates into days of the year, and count the number of film releases.

```halloween  <-  yday("2019-10-31")

pumpkin_data  <- data.frame(x = halloween,
y = -1,
image = "pumpkin.png",
stringsAsFactors = FALSE)
```

Then, we set up the date of Halloween and a data frame for the pumpkin icon. We’re going to use geom_image from the ggimage package to add this icon to each subplot.

```breaks  <- yday(paste("2019-", 1:12, "-01", sep = ""))

plot_year <- ggplot() +
geom_point(aes(x = yday,
y = n),
colour = "green",
data = na.exclude(dc)) +
geom_image(aes(x = x,
y = y,
image = image),
data = pumpkin_data) +
facet_wrap(~ release_year,
ncol = 2) +
scale_x_continuous(breaks = breaks,
labels = month.abb) +
ylim(-3, NA) +
labs(caption = "Pumpkin icon by Good Ware from www.flatiron.com.") +
theme(panel.grid = element_blank(),
strip.background = element_blank(),
text = element_text(family = "mono",
colour = "grey",
size = 16),
axis.text = element_text(family = "mono",
colour = "green",
size = 14),
axis.ticks = element_line(colour = "green"),
strip.text = element_text(family = "mono",
colour = "grey",
size = 16),
plot.background = element_rect(fill = "black"),
panel.background = element_rect(fill = "black")) +
xlab("") +
ylab("Horror films released on this day") +
ggtitle("When horror films are released")
```

A lot of other people made graphs that highlight the increase in horror film releases around Halloween in different ways. Here are some that I like:

And, looking deeper, there is a pattern within months too:

Finally, I also like this plot, that makes a case for a U-shaped relationship between budget and rating:

And for contrast, another that makes a different case with the same data:

This seems to be a recurrent theme when it comes to interpretation and quantitative analysis in the Tidy Tuesday datasets. People make different modeling choices, or visualisation choices (which are modeling choices) about what to lump together, what to separate into bins, how to transform the data, and how to show uncertainty. In some cases, as with the pattern of film releases around Halloween, they all find similar results. In some other cases, they don’t.

# 2019-10-28 NYC Squirrel Census

This week, the data was about the location and activities of squirrels in New York central park on certain times. I had this vision of an animated map of squirrel locations. I ended up with an animation, but no map. The colour of the squirrel icon shows the main fur colour of the squirrels (grey, black, cinnamon), and the size shows adults and juveniles. I had never used gganimate before (only animation, as in this post about the Game of Life), but I had seen Thomas Lin Pedersen tweet about it, and I wanted to try.

```library(dplyr)
library(gganimate)
library(ggimage)
library(ggplot2)
library(readr)

squirrels <- read_csv("nyc_squirrels.csv")

## Parse the date
squirrels\$date_parsed  <- parse_date(as.character(squirrels\$date), format = "%m%d%Y")

## Give each observation a unique ID (to use as group in the
## animation, so as to not have points turn into one another but fade
## instead.
squirrels\$key  <- 1:nrow(squirrels)

## Associate the different squirrel colours with the filenames of
## icons in different colours (manually filled with GIMP).
squirrels\$image  <- "squirrel.png"
squirrels\$image[squirrels\$primary_fur_color == "Cinnamon"]  <- "squirrel_cinnamon.png"
squirrels\$image[squirrels\$primary_fur_color == "Gray"]  <- "squirrel_grey.png"
squirrels\$image[is.na(squirrels\$primary_fur_colour)]  <- NA
```

Again, we need to parse the date. We already have latitude and longitude. We need a unique identifier for each observation, to tell gganimate that we want each squirrel to be in its own group. Then, we associate squirrel colours with three different files with a squirrel icon in different colours.

First, we make two image scatterplot layers, setting the sizes of adults and juveniles manually. The colour is deal with by mapping the image column containing the file names to the image aesthetic. We add some formatting, and then, the transition_states layer, which is where the graph turns from still and boring to magical moving pictures. This will animate a series of discrete ”states”, which here consist of the date pasted together with the shift (AM or PM squirrel observation shift). The special ”{closest_state}” variable in the title string puts this state name as plot title.

```plot_colour <- ggplot() +
geom_image(aes(y = long, x = lat, image = image, group = key),
size = 0.04,
data = filter(squirrels, age == "Adult")) +
geom_image(aes(y = long, x = lat, image = image, group = key),
size = 0.03,
data = filter(squirrels, age == "Juvenile")) +
theme_bw(base_size = 16) +
theme(panel.grid = element_blank()) +
xlab("Latitude") +
ylab("Longitude") +
labs(title = "{closest_state}",
caption = "Data from NYC Squirrel Census. Squirrel icon made by Freepik from www.flatiron.com.") +
transition_states(paste(date_parsed, shift),
state_length = 2,
transition_length = 1)

## Render it and write to file
animate(plot_colour,
fps = 10,
nframes = 400,
end_pause = 20,
rewind = FALSE,
width = 1000,
height = 1000)
```

I was faffing around unsuccessfully with different map packages to try to find something of Central Park. It seems ggmaps is the way to go. Several other participants made nice maps:

However, I think this was my favourite:

https://github.com/ryantimpe/TidyTuesday/blob/master/2019w44/2019w44.R

The original Squirrel Census Report seems to be amazing object, too, with a beautiful map.

# 2019-11-05 Biking and walking to work in the US (and Sweden)

This week I felt I had to make a map. The end result doesn’t look like much, but it took a while. Here are the average percentages of commuters who walk and bike to work in different US states 2008-2012 with data from the American Community Survey: ```library(dplyr)
library(ggplot2)
library(readr)
library(usmap)

commute <- read_csv("commute.csv")

## Map data from the usmap package
state_map  <- us_map(regions = "state")

## There are some incompletely labelled states; fix them
missing  <- setdiff(commute\$state, state_map\$full)

commute\$state_modified <- commute\$state
commute\$state_modified[commute\$state == "Ca"] <- "California"
commute\$state_modified[commute\$state == "Massachusett"]  <- "Massachusetts"
```

We get map coordinates for the US states from the usmap package (because the one in maps doesn’t have Alaska and Hawaii).

Then we fix some mislabelling in the data.

```## Get the average per state
state_average  <- summarise(group_by(commute, state_modified, mode),
average = sum(percent * n)/sum(n))

## Combine averages and coordinates
combined  <- inner_join(state_average,
state_map,
by = c("state_modified" = "full"))
```

We take a weighted average of the percentages per state and join the state averages with the state map coordinates. The map I posted on Twitter didn’t weight the average, but I think that is a bit better. There is still the issue that states have different populations and different distributions of large and small cities, but that’s the nature of things. In summary, there is not much biking going on, but some more walking to work.

```plot_map  <- ggplot() +
geom_polygon(aes(x = x, y = y, fill = average, group = group),
colour = "black",
data = combined) +
facet_wrap(~ mode) +
scale_fill_continuous(low = "white",
high = "blue",
name = "Percent commuters") +
theme_bw(base_size = 16) +
theme(panel.grid = element_blank(),
strip.background = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
legend.position = "bottom") +
xlab("") +
ylab("") +
labs(caption = "Cycling and walking to work 2008-2012 in the American Community Survey.")
```

The US seems to live up to its reputation as a motorised country. But I have no feeling for the scale of the data. For comparision, here is a map of Sweden with some not too recent data (2005-2006, from this VTI report>). The map is from the swemap package. # Using R: Correlation heatmap with ggplot2

(This post was originally written on 2013-03-23. Since then, it has persistently remained one of my most visited posts, and I’ve decided to revisit and update it. I may do the same to some other old R-related posts that people still arrive on through search engines. There was also this follow-up, which I’ve now incorporated here.)

Just a short post to celebrate when I learned how incredibly easy it is to make a heatmap of correlations with ggplot2 (with some appropriate data preparation, of course). Here is a minimal example using the reshape2 package for preparation and the built-in attitude dataset:

```library(ggplot2)
library(reshape2)
qplot(x = Var1, y = Var2,
data = melt(cor(attitude)),
fill = value,
geom = "tile")
``` What is going on in that short passage?

• cor makes a correlation matrix with all the pairwise correlations between variables (twice; plus a diagonal of ones).
• melt takes the matrix and creates a data frame in long form, each row consisting of id variables Var1 and Var2 and a single value.
• We then plot with the tile geometry, mapping the indicator variables to rows and columns, and value (i.e. correlations) to the fill colour.

However, there is one more thing that is really need, even if just for the first quick plot one makes for oneself: a better scale. The default scale is not the best for correlations, which range from -1 to 1, because it’s hard to tell where zero is. Let’s use the airquality dataset for illustration as it actually has some negative correlations. In ggplot2, a scale that has a midpoint and a different colour in each direction is called scale_colour_gradient2, and we just need to add it. I also set the limits to -1 and 1, which doesn’t change the colour but fills out the legend for completeness. Done!

```data <- airquality[,1:4]
qplot(x = Var1, y = Var2,
data = melt(cor(data, use = "p")),
fill = value,
geom = "tile") +
scale_fill_gradient2(limits = c(-1, 1))
``` Finally, if you’re anything like me, you may be phasing out reshape2 in favour of tidyr. If so, you’ll need another function call to turn the matrix into a data frame, like so:

```library(tidyr)

correlations <- data.frame(cor(data, use = "p"))
correlations\$Var1 <- rownames(correlations)
melted <- gather(correlations, "value", "Var2", -Var1)

qplot(x = Var1, y = Var2,
data = melted,
fill = value,
geom = "tile") +
scale_fill_gradient2(limits = c(-1, 1))
```

The data preparation is no longer a oneliner, but, honestly, it probably shouldn’t be.

Okay, you won’t stop reading until we’ve made a solution with pipes? Sure, we can do that! It will be pretty gratuitous and messy, though. From the top!

```library(magrittr)

airquality %>%
'['(1:4) %>%
data.frame %>%
transform(Var1 = rownames(.)) %>%
gather("Var2", "value", -Var1) %>%
ggplot() +
geom_tile(aes(x = Var1,
y = Var2,
fill = value)) +
scale_fill_gradient2(limits = c(-1, 1))
```

# ‘Simulating genetic data with R: an example with deleterious variants (and a pun)’

A few weeks ago, I gave a talk at the Edinburgh R users group EdinbR on the RAGE paper. Since this is an R meetup, the talk concentrated on the mechanics of genetic data simulation and with the paper as a case study. I showed off some of what Chris Gaynor’s AlphaSimR can do, and how we built on that to make the specifics of this simulation study. The slides are on the EdinbR Github.

Genetic simulation is useful for all kinds of things. Sure, they’re only as good as the theory that underpins them, but the willingness to try things out in simulations is one of the things I always liked about breeding research.

This is my description of the logic of genetic simulation: we think of the genome as a large table of genotypes, drawn from some distribution of allele frequencies. To make an utterly minimal simulation, we could draw allele frequencies from some distribution (like a Beta distribution), and then draw the genotypes from a binomial distribution. Done! However, there is a ton of nuance we would like to have: chromosomes, linkage between variants, sexes, mating, selection … AlphaSimR addresses all of this, and allows you to throw individuals and populations around to build pretty complicated designs. Here is the small example simulation I used in the talk.

```
library(AlphaSimR)
library(ggplot2)

## Generate founder chromsomes

FOUNDERPOP <- runMacs(nInd = 1000,
nChr = 10,
segSites = 5000,
inbred = FALSE,
species = "GENERIC")

## Simulation parameters

SIMPARAM <- SimParam\$new(FOUNDERPOP)
SIMPARAM\$addTraitA(nQtlPerChr = 100,
mean = 100,
var = 10)
SIMPARAM\$addSnpChip(nSnpPerChr = 1000)
SIMPARAM\$setGender("yes_sys")

## Founding population

pop <- newPop(FOUNDERPOP,
simParam = SIMPARAM)

pop <- setPheno(pop,
varE = 20,
simParam = SIMPARAM)

## Breeding

print("Breeding")
breeding <- vector(length = 11, mode = "list")
breeding[] <- pop

for (i in 2:11) {
print(i)
sires <- selectInd(pop = breeding[[i - 1]],
gender = "M",
nInd = 25,
trait = 1,
use = "pheno",
simParam = SIMPARAM)

dams <- selectInd(pop = breeding[[i - 1]],
nInd = 500,
gender = "F",
trait = 1,
use = "pheno",
simParam = SIMPARAM)

breeding[[i]] <- randCross2(males = sires,
females = dams,
nCrosses = 500,
nProgeny = 10,
simParam = SIMPARAM)
breeding[[i]] <- setPheno(breeding[[i]],
varE = 20,
simParam = SIMPARAM)
}

## Look at genetic gain and shift in causative variant allele frequency

mean_g <- unlist(lapply(breeding, meanG))
sd_g <- sqrt(unlist(lapply(breeding, varG)))

plot_gain <- qplot(x = 1:11,
y = mean_g,
ymin = mean_g - sd_g,
ymax = mean_g + sd_g,
geom = "pointrange",
main = "Genetic mean and standard deviation",
xlab = "Generation", ylab = "Genetic mean")

start_geno <- pullQtlGeno(breeding[], simParam = SIMPARAM)
start_freq <- colSums(start_geno)/(2 * nrow(start_geno))

end_geno <- pullQtlGeno(breeding[], simParam = SIMPARAM)
end_freq <- colSums(end_geno)/(2 * nrow(end_geno))

plot_freq_before <- qplot(start_freq, main = "Causative variant frequency before")
plot_freq_after <- qplot(end_freq, main = "Causative variant frequency after")
```

This code builds a small livestock population, breeds it for ten generations, and looks at the resulting selection response in the form of a shift of the genetic mean, and the changes in the underlying distribution of causative variants. Here are the resulting plots: # What single step does with relationship

We had a journal club about the single step GBLUP method for genomic evaluation a few weeks ago. In this post, we’ll make a few graphs of how the single step method models relatedness between individuals.

Imagine you want to use genomic selection in a breeding program that already has a bunch of historical pedigree and trait information. You could use some so-called multistep evaluation that uses one model for the classical pedigree + trait quantitative genetics and one model for the genotype + trait genomic evaluation, and then mix the predictions from them together. Or you could use the single-step method, which combines pedigree, genotypes and traits into one model. It does this by combining the relationship estimates from pedigree and genotypes. That matrix can then go into your mixed model.

We’ll illustrate this with a tiny simulated population: five generations of 100 individuals per generation, where ten random pairings produce the next generation, with families of ten individuals. (The R code is on Github and uses AlphaSimR for simulation and AGHmatrix for matrices). Here is a heatmap of the pedigree-based additive relationship matrix for the population: What do we see? In the lower left corner are the founders, and not knowing anything about their heritage, the matrix has them down as unrelated. The squares of high relatedness along the diagonal are the families in each generation. As we go upwards and to the right, relationship is building up.

Now, imagine the last generation of the population also has been genotyped with a SNP chip. Here is a heatmap of their genomic relationship matrix: Genomic relationship is more detailed. We can still discern the ten families within the last generation, but no longer are all the siblings equally related to each other and to their ancestors. The genotyping helps track segregation within families, pointing out to us when relatives are more or less related than the average that we get from the pedigree.

Enter the single-step relationship matrix. The idea is to put in the genomic relationships for the genotyped individuals into the big pedigree-based relationship matrix, and then adjust the rest of the matrix to propagate that extra information we now have from the genotyped individuals to their ungenotyped relatives. Here is the resulting heatmap: You can find the matrix equations in Legarra, Aguilar & Misztal (2009). The matrix, called H, is broken down into four partitions called H11, H12, H21, and H22. H22 is the part that pertains to the genotyped animals, and it’s equal to the genomic relationship matrix G (after some rescaling). The others are transformations of G and the corresponding parts of the additive relationship matrix, spreading G onto A.

To show what is going on, here is the difference between the additive relationship matrix and the single-step relationship matrix, with lines delineating the genotyped animals and breaking the matrix into the four partitions: What do we see? In the top right corner, we have a lot of difference, where the genomic relationship matrix has been plugged in. Then, fading as we go from top to bottom and from right to left, we see the influence of the genomic relationship on relatives, diminishing the further we get from the genotyped individuals.

Literature

Legarra, Andres, I. Aguilar, and I. Misztal. ”A relationship matrix including full pedigree and genomic information.” Journal of dairy science 92.9 (2009): 4656-4663.

# Using R: plotting the genome on a line

Imagine you want to make a Manhattan-style plot or anything else where you want a series of intervals laid out on one axis after one another. If it’s actually a Manhattan plot you may have a friendly R package that does it for you, but here is how to cobble the plot together ourselves with ggplot2.

We start by making some fake data. Here, we have three contigs (this could be your chromosomes, your genomic intervals or whatever) divided into one, two and three windows, respectively. Each window has a value that we’ll put on the y-axis.

```library(dplyr)
library(ggplot2)

data <- data_frame(contig = c("a", "a", "a", "b", "b", "c"),
start = c(0, 500, 1000, 0, 500, 0),
end = c(500, 1000, 1500, 500, 1000, 200),
value = c(0.5, 0.2, 0.4, 0.5, 0.3, 0.1))
```

We will need to know how long each contig is. In this case, if we assume that the windows cover the whole thing, we can get this from the data. If not, say if the windows don’t go up to the end of the chromosome, we will have to get this data from elsewhere (often some genome assembly metadata). This is also where we can decide in what order we want the contigs.

```contig_lengths <- summarise(group_by(data, contig), length = max(end))
```

Now, we need to transform the coordinates on each contig to coordinates on our new axis, where we lay the contings after one another. What we need to do is to add an offset to each point, where the offset is the sum of the lengths of the contigs we’ve layed down before this one. We make a function that takes three arguments: two vectors containing the contig of each point and the position of each point, and also the table of lengths we just made.

```flatten_coordinates <- function(contig, coord, contig_lengths) {
coord_flat <- coord
offset <- 0

for (contig_ix in 1:nrow(contig_lengths)) {
on_contig <- contig == contig_lengths\$contig[contig_ix]
coord_flat[on_contig] <- coord[on_contig] + offset
offset <- offset + contig_lengths\$length[contig_ix]
}

coord_flat
}
```

Now, we use this to transform the start and end of each window. We also transform the vector of the length of the contigs, so we can use it to add vertical lines between the contigs.

```data\$start_flat <- flatten_coordinates(data\$contig,
data\$start,
contig_lengths)
data\$end_flat <- flatten_coordinates(data\$contig,
data\$end,
contig_lengths)
contig_lengths\$length_flat <- flatten_coordinates(contig_lengths\$contig,
contig_lengths\$length,
contig_lengths)
```

It would be nice to label the x-axis with contig names. One way to do this is to take the coordinates we just made for the vertical lines, add a zero, and shift them one position, like so:

```axis_coord <- c(0, contig_lengths\$length_flat[-nrow(contig_lengths)])
```

Now it’s time to plot! We add one layer of points for the values on the y-axis, where each point is centered on the middle of the window, followed by a layer of vertical lines at the borders between contigs. Finally, we add our custom x-axis, and also some window dressing.

```plot_genome <- ggplot() +
geom_point(aes(x = (start_flat + end_flat)/2,
y = value),
data = data) +
geom_vline(aes(xintercept = length_flat),
data = contig_lengths) +
scale_x_continuous(breaks = axis_coord,
labels = contig_lengths\$contig,
limits = c(0, max(contig_lengths\$length_flat))) +
xlab("Contig") + ylim(0, 1) + theme_bw()
```

And this is what we get: I’m sure your plot will look more impressive, but you get the idea.

# Scripting for data analysis (with R)

Course materials (GitHub)

This was a PhD course given in the spring of 2017 at Linköping University. The course was organised by the graduate school Forum scientium and was aimed at people who might be interested in using R for data analysis. The materials developed from a part of a previous PhD course from a couple of years ago, an R tutorial given as part of the Behaviour genetics Masters course, and the Wright lab computation lunches.

Around twenty people attended the seminars, and a couple of handfuls of people completed the homeworks. I don’t know how much one should read into the course evaluation form, but the feedback was mostly positive. Some people had previous exposure to R, and did the first homework in an hour. Others had never programmed in any language, and had a hard time getting started.

There is certainly scope for improvement. For example, some of the packages used could be substituted for more contemporary tools. One could say that the course is slouching towards the tidyverse. But I worry a bit about making the participants feel too boxed in. I don’t want them to feel that they’re taught a way that will solve some anticipated type of problems very neatly, but that may not generalize. Once I’ve made the switch to dplyr and tidyr (and maybe even purr … though I hesitate) fully myself, I would probably use them in teaching too. Another nice plus would be to be able to use R for data science as course literature. The readings now are scattered; maybe a monolithic book would be good.

I’ve tried, in every iteration, to emphasize the importance of writing scripts, even when working interactively with R. I still think I need to emphasize it even more. There is also a kind of ”do as I say, not as I do” issue, since in the seminars, I demo some things by just typing them into the console. I’ll force myself to write them into a script instead.

Possible alternate flavours for the course include: A longer version expanding on the same topics. I don’t think one should cram more contents in. I’d like to have actual projects where the participants can analyze, visualize and present data and simulations.

This is the course plan we sent out:

1. A crash course in R

Why do data analysis with a scripting language
The RStudio interface
Using R as a calculator
Working interactively and writing code
Getting help
Reading and looking at data
Installing useful packages
A first graph with ggplot2

Homework for next time: The Unicorn Dataset, exercises in reading data, descriptive statistics, linear models and a few statistical graphs.

2. Programming for data analysis

Programming languages one may encounter in science
Common concepts and code examples
Data structures in R
Vectors
Data frames
Functions
Control flow

Homework for next time: The Unicorn Expression Dataset, exercises in data wrangling and more interesting graphs.

3. Working with moderately large data

Exercise followup
More about functions
Lists
Objects
Functional and imperative programming
Doing things many times, loops and plyr
Simulating data
Working on a cluster

Final homework: Design analysis by simulation: pick a data analysis project that you care about; simulate data based on a model and reasonable effect size; implement the data analysis; and apply it to simulated data with and without effects to estimate power and other design characteristics. This ties together skills from all seminars.