Tidy data — it’s one of those terms that tend to confuse people, and certainly confused me. It’s Codd’s third normal form, but you can’t go around telling that to people and expect to be understood. One form is ”long”, the other is ”wide”. One form is ”melted”, another ”cast”. One form is ”gathered”, the other ”spread”. To make matters worse, I often botch the explanation and mix up at least two of the terms.
The word is also associated with the tidyverse suite of R packages in a somewhat loose way. But you don’t need to write in a tidyverse-style (including the %>%s and all) to enjoy tidy data.
But Hadley Wickham’s definition is straightforward:
In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
In practice, I don’t think people always take their data frames all the way to tidy. For example, to make a scatterplot, it is convenient to keep a couple of variables as different columns. The key is that we need to move between different forms rapidly (brain time-rapidly, more than computer time-rapidly, I might add).
And not everything should be organized this way. If you’re a geneticist, genotypes are notoriously inconvenient in normalized form. Better keep that individual by marker matrix.
The first serious piece of R code I wrote for someone else was a function to turn data into long form for plotting. I suspect plotting is often the gateway to tidy data. The function was like what you’d expect from R code written by a beginner who comes from C-style languages: It reinvented the wheel, and I bet it had nested for loops, a bunch of hard bracket indices, and so on. Then I discovered reshape2.
fake_data <- data.frame(id = 1:20,
variable1 = runif(20, 0, 1),
variable2 = rnorm(20))
melted <- melt(fake_data, id.vars = "id")
The id.vars argument is to tell the function that the id column is the key, a column that tells us which individual each observation comes from. As the name suggests, id.vars can name multiple columns in a vector.
So the is the data before:
id variable1 variable2
1 1 0.938173781 0.852098580
2 2 0.408216233 0.261269134
3 3 0.341325188 1.796235963
4 4 0.958889279 -0.356218000
And this is after. We go from 20 rows to 40: two variables times 20 individuals.
id variable value
1 1 variable1 0.938173781
2 2 variable1 0.408216233
3 3 variable1 0.341325188
4 4 variable1 0.958889279
And now: tidyr. tidyr is the new tidyverse package for rearranging data like this.
The tidyr equivalent of the melt function is called gather. There are two important differences that messed with my mind at first.
The melt and gather functions take the opposite default assumption about what columns should be treated as keys and what columns should be treated as containing values. In melt, as we saw above, we need to list the keys to keep them with each observation. In gather, we need to list the value columns, and the rest will be treated as keys.
Also, the second and third arguments (and they would be the first and second if you piped something into it), are the variable names that will be used in the long form data. In this case, to get a data frame that looks exactly the same as the first, we will stick with ”variable” and ”value”.
Here are five different ways to get the same long form data frame as above:
melted <- gather(fake_data, variable, value, 2:3)
## Column names instead of indices
melted <- gather(fake_data, variable, value, variable1, variable2)
## Excluding instead of including
melted <- gather(fake_data, variable, value, -1)
## Excluding using column name
melted <- gather(fake_data, variable, value, -id)
## With pipe
melted <- fake_data %>% gather(variable, value, -id)
Usually, this is the transformation we need: wide to long. If we need to go the other way, we can use plyr’s cast functions, and tidyr’s gather. This code recovers the original data frame:
dcast(melted, id ~ variable)
spread(melted, variable, value)