Boring meta-post of the year

Publicerat i dear diary, english av mrtnj

Really, it’s the second boring meta-post of the year, since I’ve already posted this one.

There were some rumours recently that the Scienceblogs blog network would shut down the site. It appears to still be up, and there are still blogs going there, so I don’t know about that, but this reminded me that Scienceblogs existed. I don’t think I’ve read anything on Scienceblogs in years, but it was one of my inspirations when I started blogging. It’s not that I wanted to be a science writer, but Scienceblogs and the also now defunct ResearchBlogging RSS feed (Fausto & al 2012) made me figure out that blogging about science was a thing people did.

Slowly, this thing took shape and became a ”science community blog”, in the terminology of Saunders & al (2017). That is, this blog is not so much about outreach or popular science, but ”aimed at the academic community”. I think of it as part of a conversation about genetics, even if it may be largely a conversation with myself.

So what is the state of the blog now? In September 2016, I decided to try to post once or twice a month (and also to make sure that both posts weren’t pointless filler posts). This panned out pretty well up until October 2017, when I ran out of steam for a while. Probably unrelated to that, 2017 was also the year my blog traffic suddenly increased by more than a factor of two. I don’t know for sure why, but looking at the numbers of individual posts, it seems the increase is because a lot of R users are looking for tidyverse-related things. If I went by viewer statistics, I would post less about genetics and more about hip R packages.

Instead, 2018 I will:

Attempt to keep up the pace of writing one or two things every month. Some, but not all, of them will be pointless fillers.
Hopefully produce a couple of posts about papers, if those things get out of the pipeline eventually. The problem with this, as anyone who writes papers knows, is that once something is out of the pipeline, one has grown so enormously tired of it.
Write a few more posts about other scientific papers I read. I’ve heard that there is limited interest in that sorts of thing, but I enjoy it, and writing should make me think harder about what I read.

17Dec2017

Using R: reshape2 to tidyr

Publicerat i computer stuff, data analysis, english av mrtnj

Tidy data — it’s one of those terms that tend to confuse people, and certainly confused me. It’s Codd’s third normal form, but you can’t go around telling that to people and expect to be understood. One form is ”long”, the other is ”wide”. One form is ”melted”, another ”cast”. One form is ”gathered”, the other ”spread”. To make matters worse, I often botch the explanation and mix up at least two of the terms.

The word is also associated with the tidyverse suite of R packages in a somewhat loose way. But you don’t need to write in a tidyverse-style (including the %>%s and all) to enjoy tidy data.

But Hadley Wickham’s definition is straightforward:

In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

In practice, I don’t think people always take their data frames all the way to tidy. For example, to make a scatterplot, it is convenient to keep a couple of variables as different columns. The key is that we need to move between different forms rapidly (brain time-rapidly, more than computer time-rapidly, I might add).

And not everything should be organized this way. If you’re a geneticist, genotypes are notoriously inconvenient in normalized form. Better keep that individual by marker matrix.

The first serious piece of R code I wrote for someone else was a function to turn data into long form for plotting. I suspect plotting is often the gateway to tidy data. The function was like what you’d expect from R code written by a beginner who comes from C-style languages: It reinvented the wheel, and I bet it had nested for loops, a bunch of hard bracket indices, and so on. Then I discovered reshape2.

library(reshape2)
fake_data <- data.frame(id = 1:20,
                        variable1 = runif(20, 0, 1),
                        variable2 = rnorm(20))
melted <- melt(fake_data, id.vars = "id")

The id.vars argument is to tell the function that the id column is the key, a column that tells us which individual each observation comes from. As the name suggests, id.vars can name multiple columns in a vector.

So the is the data before:

  id   variable1    variable2
1  1 0.938173781  0.852098580
2  2 0.408216233  0.261269134
3  3 0.341325188  1.796235963
4  4 0.958889279 -0.356218000

And this is after. We go from 20 rows to 40: two variables times 20 individuals.

  id  variable       value
1  1 variable1 0.938173781
2  2 variable1 0.408216233
3  3 variable1 0.341325188
4  4 variable1 0.958889279

And now: tidyr. tidyr is the new tidyverse package for rearranging data like this.

The tidyr equivalent of the melt function is called gather. There are two important differences that messed with my mind at first.

The melt and gather functions take the opposite default assumption about what columns should be treated as keys and what columns should be treated as containing values. In melt, as we saw above, we need to list the keys to keep them with each observation. In gather, we need to list the value columns, and the rest will be treated as keys.

Also, the second and third arguments (and they would be the first and second if you piped something into it), are the variable names that will be used in the long form data. In this case, to get a data frame that looks exactly the same as the first, we will stick with ”variable” and ”value”.

Here are five different ways to get the same long form data frame as above:

library(tidyr)
melted <- gather(fake_data, variable, value, 2:3)

## Column names instead of indices
melted <- gather(fake_data, variable, value, variable1, variable2)

## Excluding instead of including
melted <- gather(fake_data, variable, value, -1)

## Excluding using column name
melted <- gather(fake_data, variable, value, -id)

## With pipe
melted <- fake_data %>% gather(variable, value, -id)

Usually, this is the transformation we need: wide to long. If we need to go the other way, we can use plyr’s cast functions, and tidyr’s gather. This code recovers the original data frame:

## plyr
dcast(melted, id ~  variable)

## tidyr
spread(melted, variable, value)

10Dec2017

Peerage of Science Reviewer Prize 2017

Publicerat i dear diary, english av mrtnj

I won a prize! Hurrah! I’m obviously very happy.

If you want to hear me answer a couple of questions and see the Peerage of Science crew engaged in some amusing video editing, look at the interview.

How did that happen? After being told, about a year ago to check out the peer review platform Peerage of Science, I decided to keep reviewing manuscripts that showed up and were relevant to my interests. Reading and commenting on unpublished manuscripts is stimulating, and I thought it would help improve my reviewing and, maybe, writing.

Maybe this is a testament to the power of gamification. I admit that I’ve occasionally been checking my profile to see what the score is even without thinking of any reviewer prize.

On unicorns and genes

Martin Johnsson's blog about genetics and sundry things

Månadsarkiv: december 2017

Boring meta-post of the year

Using R: reshape2 to tidyr

Peerage of Science Reviewer Prize 2017