R in genomics @ SciLifeLab, Solna

Dear diary,

I went to the Stockholm R useR group meetup on R in genomics at the Stockholm node of SciLifeLab. It was nice. If I worked a bit closer, I would attend meetups all the time. I even got to be pretentious with my notebook while waiting for the train.

The speakers were:

Jakub Orzechowski Westholm on R and genomics in general. He demonstrated genome browser-style tracks with Gviz, some GenomicRanges, and a couple of common plots of gene expression data. I have been on the fence about what package I should use for drawing genes and variants along the genome. I should play with Gviz.

Daniel Klevebring on clinical sequencing and how he uses R (not that much) in sequencing pipelines aimed at targeting the right therapy to patients based on the mutations in their cancer cells. He mentioned some getopt snippets for getting R to play nicely on the command line, which is something I should definitely try more!
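Note to self, so that I actually try it: a minimal sketch of what such a snippet could look like with the getopt package. The option names (input, depth) and the file name are made up by me for illustration and are not from Daniel's talk. The arguments are passed in explicitly here so that the example also runs in an interactive session.

```r
library(getopt)

# Each row: long name, short flag, argument mask (0 = no argument,
# 1 = required, 2 = optional), and the type the value is converted to.
# The options themselves are hypothetical.
spec <- matrix(c(
  "input", "i", 1, "character",
  "depth", "d", 1, "integer",
  "help",  "h", 0, "logical"
), byrow = TRUE, ncol = 4)

# In a real script, leave out `opt` to parse commandArgs(TRUE),
# that is, whatever follows the script name on the Rscript command line.
opt <- getopt(spec, opt = c("--input", "variants.vcf", "-d", "10"))

# Flags that were not given are simply absent from the result.
if (!is.null(opt$help)) {
  cat(getopt(spec, usage = TRUE))
  quit(status = 0)
}

cat("Reading", opt$input, "with minimum depth", opt$depth, "\n")
```

Saved as a script, this would presumably be called as something like `Rscript script.R --input variants.vcf -d 10`.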

Finally, Arvind Singh Mer on predictive modelling for clinical genomics (like the above-mentioned ClinSeq data). He showed the caret package for machine learning, with an elastic net regression.
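For my own reference, here is roughly what an elastic net fit with caret looks like. This is my sketch on simulated data, not the code from the talk: the gene names, the effect sizes and the tuning grid are all assumptions, and it needs the glmnet package installed alongside caret.

```r
library(caret)

set.seed(1)

# Simulated stand-in for an expression matrix: 100 samples, 20 "genes",
# of which only the first two affect the outcome.
x <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)
colnames(x) <- paste0("gene", 1:20)
y <- x[, 1] - 0.5 * x[, 2] + rnorm(100)

# Elastic net via glmnet. alpha mixes ridge (0) and lasso (1) penalties,
# lambda sets the overall penalty strength. Both are tuned by
# 5-fold cross-validation over a small, arbitrary grid.
fit <- train(
  x = x, y = y,
  method = "glmnet",
  trControl = trainControl(method = "cv", number = 5),
  tuneGrid = expand.grid(alpha = c(0, 0.5, 1),
                         lambda = 10^seq(-3, 0, length.out = 10))
)

# The combination of alpha and lambda with the lowest cross-validated RMSE.
fit$bestTune
```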

I don’t know the rest of the audience, so maybe the choice to gear talks towards the non-bio* person was spot on, but that made things a bit less interesting for me. For instance, in Jakub’s talk about gene expression, I would’ve preferred more about the messy stuff: how to make that nice gene-by-sample matrix in the first place, and whether R can be of any help in that process; also, at the other end, what models one would use after that first pass of visualisation. But this isn’t a criticism of the presenters; time and complexity constraints apply. (If I were asked to present how I use R, any demos would be toy analyses of clean datasets. That is the way these things go.)

We also heard repeated praise for and recommendations of the hadleyverse and data.table. I’m not a data.tabler myself, but I probably should be. And I completely agree about the value of dplyr: there’s this one analysis where a couple of lines with dplyr changed it from “argh, do I have to rewrite this in C?” to being workable. I think we also saw all three plotting systems (base graphics, ggplot2 and lattice) in action.

Bibliometrics and I

Dear diary,

I’m attending a course about scientific publishing, and the other day there was a lecture about bibliometrics by Lovisa Österlund and David Lawrence from the Linköping University library. I don’t think I know anyone who particularly likes bibliometrics, but I guess it makes sense that if one needs to evaluate research without trying to understand what it is about, there are only citations, the reputation of the publication channel and the CV of the researcher to look at. I imagine it’s a bit like reviewing a novel in a language one doesn’t know. A couple of things occurred to me, though.

What to do when different instruments of evaluation give different results? Take the two papers (so far) published during my PhD: they both deal with the genetics of chicken comb size; one is published in PLOS Genetics and one in Molecular Ecology. If we look at journal impact factors (and we shouldn’t, but say that we do), PLOS Genetics comes out ahead with an impact factor of 8.5 against 6.3. For those who do not know, the journal impact factor is the mean number of citations received in a given year by the papers the journal published in the two preceding years, calculated by Thomson Reuters in their own secret way. However, Linköping University has for some reason decided to use the Norwegian index for evaluating publication channels. I don’t know why, and I don’t think it matters that much for me personally, since the system will change soon and I will finish in about a year and a half. In the Norwegian system journals are ranked as level one or two, where two is better and is supposed to represent the top 20% of that subject area. According to their database, Molecular Ecology is level 2, while PLOS Genetics is level 1. The source of the discrepancy is probably that PLOS Genetics is counted as biomedicine, while Molecular Ecology is biology, according to the Norwegian database.
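Written out, the two-year impact factor described above looks like this. The symbols are my own shorthand: C_Y(Y-1) is the number of citations received in year Y by items the journal published in year Y-1, and N_{Y-1} is the number of citable items it published that year. What exactly counts as a citable item is the secret part.

```latex
\mathrm{IF}_Y = \frac{C_Y(Y-1) + C_Y(Y-2)}{N_{Y-1} + N_{Y-2}}
```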

They also mentioned Altmetrics, and I don’t know what to make of it. On one hand, I guess it’s good to keep tabs on social media. On the other hand, what do numbers of tweets really tell you, except that one of the authors has a Twitter account? One of the examples in the lecture was the metrics page for this paper that I happen to be a contributor to. It is actually pretty strange. It shows 3 tweets or 11 tweets, depending on where on the page you look. Also, when I accessed this page earlier today it linked to a blog. Now it doesn’t. That says something about the ephemeral nature of internet media. Regardless, when I first saw the page I thought perhaps the metrics page had picked up on my post about the paper, but that was not the case. I don’t know how altmetric.com defines a “science blog”; maybe the blog has to be listed on some aggregation site or another, and I’m not pretending my post is particularly insightful or important. Still, it’s a little strange that the altmetrics page doesn’t list a post by one of the authors about the paper, but listed a post that referred to the paper with only two sentences and was mistaken about the conclusion.

About blogging

Dear diary,

I’ve had this blog since 2010, but it was not until last year that I started writing anything other than popular science in Swedish. There is lots of discussion on academic blogs about whether PhD students, or any academics, should write on blogs or not, and also quite a bit of fear, uncertainty and doubt going around. This is what I think: I don’t think my blog is such a big deal. It’s just a small hobby project that makes me happy. And while I hope it doesn’t hurt my research or my chances to continue doing science, I don’t think it helps them much either.

Do I have a target audience? There was recently a small survey to find out what academics blog about and why; they found that most blogs were directed at peers, not at outreach. I’m not surprised. As I’ve already mentioned, my posts in Swedish are more popular science, less technical, and sometimes deal with things published in Swedish media. I think the target audience is still geeks of some kind, but not necessarily genetics geeks. My posts in English are more directed at academic things, either related to my research and work as a PhD student or about the R language. So my posts are a mix of languages and themes. Is that a problem? From a popularity or readership perspective, probably yes. I can see little reason not to split the posts into two blogs, each concentrated on one theme, except that I don’t feel like running two blogs.

Does blogging hurt me because it hurts my work? I hardly think so. First, blogging is not part of my duties at the university, and I don’t do it instead of writing, working in the lab or analysing data. I do it in the evening after work, or in the case of some posts in the morning before. I’m not convinced blogging makes me in any way a better scientist, but it can hardly make me worse. Thinking about science or how to explain it for another hour now and then can’t hurt. And yes, the time spent blogging could theoretically be spent writing papers or something, but so could theoretically the time spent at the gym, with family or friends. If we grant that academics do other things, blogging could be one of those activities. My blog is not completely disconnected from my work, but I think it’s disconnected enough to be regarded as a fun pastime.

Does blogging hurt my reputation because people might read my blog and disapprove? I don’t think that many people read my blog; actually, I know that not many people do. Still, it is certainly possible that some of the readers might be important to my career and that they don’t like what they see. It will be found when people look me up with a search engine. Maybe someone thinks that I’m wasting my time, or maybe I’ve written something controversial — or more likely, something stupid. I think and say things that are mistaken all the time, and some of those mistakes might end up in a blog post. The point is, though, that expressing my opinion about things I care about is not something I do because I think it’ll further my career. I do it because I want to. If my writing is successful, the things on my blog will be the kinds of things I honestly know, think and believe about science.

From Evolution in Sweden 2014, Uppsala

Dear diary,

A couple of weeks ago I attended the Evolution in Sweden meeting in Uppsala. As expected, it was a very nice meeting with lots of interesting things. My last conference was ESEB last summer, which was great because it was a huge conference with so much to see and so many people. Evolution in Sweden was great because it wasn’t huge, so it was quite possible to see everything, recognise familiar faces and talk with people. I had a poster on the behaviour genetics of chicken domestication (of course!).

Here are some of my personal highlights, in no particular order:

Kerstin Johannesson’s talk, an “advertisement for marine organisms”, was probably the most fun and engaging. I was very convinced that evolutionary research in the Baltic Sea is a great idea! Among other things she mentioned salinity gradients, the sexual and asexual reproduction of Fucus brown algae, Littorina saxatilis of course, and the IMAGO project to sequence and assemble reference genomes for eight different species from the Baltic.

We have a great infrastructure for evolutionary research: the Baltic Sea. [quoted from memory]

Claudia Köhler spoke about why triploids in Arabidopsis thaliana fail, which is an interesting story involving genomic imprinting and the endosperm, which in a triploid seed turns out tetraploid. They screened for mutants able to form triploid seeds and found a paternally imprinted gene that is dosage-sensitive and causes the failure of triploid seeds (Kradolfer & al 2013).

Anna Qvarnström and Hans Ellegren talked about different flycatcher projects. I don’t have much that is clever to say about this right now, except that both projects are really fascinating and impressive. Everyone who cares about genomics in the wild should keep an eye on this.

There were two talks from Umeå Plant Science Centre: Stefan Jansson’s about association mapping in aspen (SwAsp), which sounds fun but difficult with tons of genetic variation, and Pär K. Ingvarsson’s about the Norway spruce genome (Nystedt & al 2013). An interesting observation from the latter was that its gigantic genome size (~20 Gb) apparently isn’t due to whole-genome duplications, but to unchecked transposable element activity. A nice nugget to remember: about half of the sequence, or three to four human genomes, consists of LTR-type repeats.

I’m afraid you will never read very much from me about theory talks. I am an engineer after all, so I don’t fear the equations that much, but most of the time I don’t have the necessary context to have any clue where this particular model fits into the grand scheme of things. However, Jessica Abbott gave a fun talk presenting a model for sexual conflict in hermaphrodites that deserves a special mention.

I did see quite a few genomic plots of Fst outliers, and I believe the question that needs answering about them is: What do they really mean? One can do comparisons of comparisons (like in Roger Butlin’s talk and their paper on parallel evolution of morphs in Littorina; Butlin & al 2013), but when it comes to picking out the most differentiated loci on a genome-wide level, are they really the most interesting loci? Are the loci of highest differentiation the loci of adaptation; are they the loci of speciation? (Ellegren’s talk and the flycatcher genome paper; Ellegren & al 2012). It’s a bit like the problem faced by QTL mappers (“now that we’ve got a few genomic regions, what do we do with them?”), with the added complication that we don’t have a phenotype associated with them.

Genetic architecture wasn’t an explicit theme of the meeting, but it always comes up, doesn’t it? Will traits be massively polygenic, dooming researchers to a lifetime search for missing heritability, or relatively simple with a handful of loci? And under what circumstances will either architecture occur? Jon Ågren talked about the fantastic Arabidopsis thaliana in situ QTL mapping experiment. I think it is best illustrated with the video he showed last time I heard him talk about this, Lost in transplantation.

Folmer Bokma used Lego dinosaurs to great effect to illustrate developmental constraints. A large part of the talk was also quotes from different famous evolutionary biologists. Very memorable, but I’m not sure I understood where he was heading. I was expecting him to start talking about the need for G matrix methods any moment. My lack of understanding is of course my fault as well, not just the speaker’s, and there were a few graphs of gene duplications and gene expression data in primates, but I don’t feel that he showed “how phylogenetic analyses of genomic data can shed new light on these ideas”, as promised in the abstract.

Possibly the best expression of the meeting: Erik Svensson’s “next generation fieldwork”. I’m not a fan of the inflation of words ending in -omics (and I sometimes feel “genomics” should just be “genetics”), but if we have genomics and proteomics, phenomics is also justified, I guess. As a tongue-in-cheek version, “next generation fieldwork” is spot on. And very true: clever phenotyping strategies in natural populations and natural settings are even more important than rapid sequencing and genotyping. By the way, Erik Svensson, Jessica Abbott, Maren Wellenreuther and their groups have a lab blog which seems nice and active.

And finally, the thing that wasn’t so great, which coincidentally is the same thing that wasn’t so great at ESEB: the gender balance. Only 7 out of 28 speakers were women. I don’t know to what extent that ratio reflects the gender ratio of Swedish evolutionary biology, but regardless it is too low.

It’s been a while since mid-January, but I’ve been busy (with some fun things — will tell you more later). And maybe we’ll see each other at the next Evolution in Sweden in Lund.

Fall is the data analysis season

Dear diary,

I spent a lot of my summer in the lab, and my fall has been mostly data analysis, with a little writing and a couple of courses thrown in there. Data analysis means writing code, and nowadays I do most of my work with the help of R. R has even replaced Python and Perl for most ad hoc scripting. Case in point: I recently wrote an R script to generate and run a (long) series of tar commands for me. It might sound weird, but R can do these silly tasks just as well as any scripting language, and even when its statistical functions play no role, its tabular data structures often come in handy.
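To give an idea of what I mean, here is a sketch along the lines of that script, not the script itself. The directory names are hypothetical, and it only prints the commands unless `run` is switched on.

```r
# Hypothetical directories to archive; in practice this table could
# come from list.files() or a sample sheet.
jobs <- data.frame(
  directory = c("run_01", "run_02", "run_03"),
  stringsAsFactors = FALSE
)
jobs$archive <- paste0(jobs$directory, ".tar.gz")

# Build one tar command per row. shQuote() protects names with spaces.
jobs$command <- sprintf("tar -czf %s %s",
                        shQuote(jobs$archive),
                        shQuote(jobs$directory))

# Dry run by default: print the commands, and only execute them
# when `run` is set to TRUE.
run <- FALSE
for (command in jobs$command) {
  if (run) {
    status <- system(command)
    if (status != 0) warning("Command failed: ", command)
  } else {
    cat(command, "\n")
  }
}
```

Keeping the jobs in a data frame is the point: it is easy to filter, sort or add columns (say, an output folder per project) before any command is run.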

Working on multiple similar but not identical projects also means I’ve got to reread and rework some old scripts, and I often find that when I return to reuse some piece of code, I’ve learned enough to rewrite it in a better way. Inspired by this paper, I’m trying to slowly improve my programming practices. The assertthat package is a new friend, and the next step is getting better testing routines going, probably with the aid of testthat. (Speaking of learning R, did you know that you get the underscore sign in ESS by double tapping the key? Just pressing it once makes an assignment arrow. I didn’t realise until the other day and I feel very stupid for it.)
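A small sketch of how the two packages could fit together: assertthat checks the input inside a function, and testthat checks the function from the outside. The `standardise` helper is a made-up example, not something from my actual scripts.

```r
library(assertthat)
library(testthat)

# Hypothetical helper: scale a numeric vector to mean 0 and sd 1.
standardise <- function(x) {
  # Fail early, with a readable message, on input the function
  # was never meant to handle.
  assert_that(is.numeric(x), length(x) > 1)
  assert_that(noNA(x), msg = "x contains missing values")
  (x - mean(x)) / sd(x)
}

test_that("standardise centres and scales", {
  z <- standardise(c(1, 2, 3, 4))
  expect_equal(mean(z), 0)
  expect_equal(sd(z), 1)
})

test_that("standardise rejects bad input", {
  expect_error(standardise("a"))
  expect_error(standardise(c(1, NA, 3)), "missing values")
})
```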

We’ve been running a second season of the introduction to R seminars with the lab, also including some gene expression and massively parallel resequencing data. (The latter not so much with R, though.) I’ve learned quite a bit, and hopefully refined my R teaching skills a little. I have the impression that doing lots of in-seminar exercises has been helpful, and this time around I put a lot more emphasis on organising analysis code into scripts.

I’ve also gotten to play a bit more with quantitative genetics models with MCMCglmm, which is great fun. Speaking of MCMC, Gelman & co’s Bayesian Data Analysis 3rd edition has come out! My copy is on its way, and I’ve also bought Dirk Eddelbuettel’s Rcpp book. Looking forward to that.

During November, my blog hits set a new record (almost doubling the previous most visited month), thanks to links from Matt Asher’s Probability and statistics blog and Sam Clifford’s blog. It’s very flattering to be linked by two statistics bloggers that I’ve read, one of whom was already in my RSS reader.

By the way, I will be at the Evolution in Sweden meeting in Uppsala in January. If you’re there, say hi!

Morning coffee: alpha level 0.005

Valen Johnson recently published a paper in PNAS about Bayes factors and p-values. In null hypothesis testing, p-values measure the probability of seeing data this extreme or more extreme, if the null hypothesis is true. A Bayes factor is the ratio between the probability of the data under the alternative hypothesis and the probability of the data under the null hypothesis. Multiplied by the prior odds, it gives the ratio between the posterior probability of the alternative hypothesis and the posterior probability of the null hypothesis, and with equal prior probabilities the two ratios are the same. The words ‘probability of the hypothesis’ tell us we’re in Bayes land, but of course, that posterior probability comes from combining the prior probability with the likelihood, which is the probability of generating the data under the hypothesis. So the Bayes factor considers not only what happens if the null is true, but also what happens if the alternative is true. That is one source of discrepancies between them. Johnson has found a way to construct Bayes factors so that they correspond to certain common hypothesis tests (including an approximation for the t-test, so there goes most of biology), and found that for many realistic test situations a p-value of 0.05 corresponds to pretty weak support in terms of Bayes factors. Therefore, he suggests the alpha level of hypothesis tests should be reduced to 0.005 or below. I don’t know enough about Bayes factors to really appreciate Johnson’s analysis. However, I do know that some responses to the paper make things seem a bit too easy. Johnson writes:

Of course, there are costs associated with raising the bar for statistical significance. To achieve 80% power in detecting a standardized effect size of 0.3 on a normal mean, for instance, decreasing the threshold for significance from 0.05 to 0.005 requires an increase in sample size from 69 to 130 in experimental designs. To obtain a highly significant result, the sample size of a design must be increased from 112 to 172.
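The numbers in the quote can be reproduced with a few lines of R. This is my own back-of-the-envelope sketch, not Johnson's code. It assumes a one-sided z-test of a normal mean with known unit variance, and it reads “highly significant” as moving the threshold from 0.01 to 0.001, since those are the assumptions under which the arithmetic gives the quoted sample sizes.

```r
# Sample size for a one-sided z-test of a normal mean with known unit
# variance (an assumption on my part; it is the setup that reproduces
# the numbers in the quote).
sample_size <- function(effect, alpha, power) {
  z_alpha <- qnorm(1 - alpha)  # critical value of the test
  z_power <- qnorm(power)      # quantile giving the desired power
  ceiling(((z_alpha + z_power) / effect)^2)
}

sample_size(effect = 0.3, alpha = 0.05,  power = 0.8)  # 69
sample_size(effect = 0.3, alpha = 0.005, power = 0.8)  # 130
sample_size(effect = 0.3, alpha = 0.01,  power = 0.8)  # 112
sample_size(effect = 0.3, alpha = 0.001, power = 0.8)  # 172
```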

If one does not also increase the sample sizes to preserve (or, I guess, preferably improve) power, just reducing the alpha level to 0.005 will only make matters worse. With low power comes, as Andrew Gelman likes to put it, a high Type M (magnitude) error rate. That is, if power is bad enough, not only will there be few significant findings, but all of them will be overestimates.
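A quick simulation makes the magnitude error concrete. Again this is a sketch under my own assumptions: the same one-sided z-test with known unit variance, a true standardised effect of 0.3, and a hypothetical sample size of 20 that is far too small for an alpha level of 0.005.

```r
set.seed(1)

true_effect <- 0.3   # standardised effect, as in the quote
n <- 20              # deliberately too small a sample (hypothetical)
alpha <- 0.005
n_studies <- 10000

# With known unit variance, the sample mean of n observations is
# normal with mean true_effect and standard deviation 1 / sqrt(n),
# so each simulated study is a single draw of its estimate.
estimates <- rnorm(n_studies, mean = true_effect, sd = 1 / sqrt(n))

# One-sided z-test: significant if the estimate exceeds this threshold.
threshold <- qnorm(1 - alpha) / sqrt(n)
significant <- estimates > threshold

mean(significant)                          # proportion significant, i.e. power
mean(estimates[significant])               # average significant estimate
all(estimates[significant] > true_effect)  # TRUE: every one is an overestimate
```

The last line is true by construction: with this sample size the significance threshold lies above the true effect, so an estimate can only be significant by overshooting.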

Morning coffee: tables

(Note: ‘Morning coffee’ will be short musings about science-related topics.)

I don’t like tables. Or, more precisely: I don’t like tables that I have to read, but I love telling my computer to read tables for me. Tables made for human eyes tend to have certain features (I don’t know whether they really help human understanding, but people seem to think they do), such as merged cells or omission of repeated values, footnotes indicated by superscript symbols and sometimes colouring that conveys meaning. There is a conflict between keeping the number of columns small enough to be readable and putting in all the statistics that readers want. Someone might want the coefficient of determination, while someone of an information-theoretic persuasion wants the AIC. It is more convenient for the human reader to see the table close to the text, while the computer user would probably like it in a text file. Some journals do this almost right: right below the table there is a link to download it as comma-separated values. I think ideally any data would be presented as a summary table (or even better, a graph!) and the underlying computer-readable data would be the click of a link away.
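As an illustration of the kind of cleanup a human-oriented table needs before the computer can use it, here is a small R sketch. The table is entirely made up, and the `fill_down` helper is my own. It deals with only one of the features above, the omission of repeated values.

```r
# A made-up table laid out for human eyes: repeated values in the
# first column are left blank.
human_table <- "trait,sex,n,mean
comb mass,female,50,2.1
,male,48,10.3
body mass,female,50,1500
,male,48,1900"

tab <- read.csv(text = human_table, stringsAsFactors = FALSE)

# Fill blanks with the most recent non-blank value above them.
# A blank in the very first row becomes NA.
fill_down <- function(x) {
  has_value <- !is.na(x) & x != ""
  c(NA, x[has_value])[cumsum(has_value) + 1]
}

tab$trait <- fill_down(tab$trait)

# Now every row stands on its own and the table can be summarised,
# here as the total sample size per trait.
aggregate(n ~ trait, data = tab, FUN = sum)
```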