A partial success

In 2010, Poliseno & co published some results on the regulation of a gene by a transcript from a pseudogene. Now, Kerwin & co have published a replication study, the protocol for which came out in 2015 (Khan et al). An editor summarises it like this in an accompanying commentary (Calin 2020):

The partial success of a study to reproduce experiments that linked pseudogenes and cancer proves that understanding RNA networks is more complicated than expected.

I guess he means “partial success” in the sense that they partially succeeded in performing the replication experiments they wanted. These experiments did not reproduce the gene regulation results from 2010.

Seen from the outside (I have no insight into what is going on here or who the people involved are), something is not working. If it takes five years to get from paper to replication effort, and then another five years to a replication study accompanied by an editorial commentary that subtly undermines it, we can't expect replication studies to update the literature, can we?


What’s the moral of the story, according to Calin?

What are the take-home messages from this Replication Study? One is the importance of fruitful communication between the laboratory that did the initial experiments and the lab trying to repeat them. The lack of such communication – which should extend to the exchange of protocols and reagents – was the reason why the experiments involving microRNAs could not be reproduced. The original paper did not give catalogue numbers for these reagents, so the wrong microRNA reagents were used in the Replication Study. The introduction of reporting standards at many journals means that this is less likely to be an issue for more recent papers.

There is something right and something wrong about this. On the one hand, talking to your colleagues in the field obviously makes life easier. We would like researchers to put all pertinent information in writing, and we would like there to be good communication channels for the cases where that information turns out not to be what the reader needed. On the other hand, we don't want science to be esoteric. We would like experiments to be reproducible without any special artifact or secret sauce, if nothing else because people's time and willingness to provide tech support for their old papers might be limited. Of course, this is hard in a world where the reproducibility of an experiment might depend on the length of digestion (Hines et al. 2014) or that little plastic thingamajig you need for the washing step.

Another take-home message is that it is finally time for the research community to make raw data obtained with quantitative real-time PCR openly available for papers that rely on such data. This would be of great benefit to any group exploring the expression of the same gene/pseudogene/non-coding RNA in the same cell line or tissue type.

This is true. You know how doctored, or just poor, Western blots are a notorious issue in the literature? I don't think that's because the Western blot is an exceptionally bad technique, but because there is a culture of showing the raw data (the gel), so people can notice problems. However, even though I'm all for showing real-time PCR amplification curves (as well as melting curves, standard curves, and the actual batch and plate information from the runs), I doubt it will be possible to troubleshoot PCR retrospectively from those curves. Maybe sometimes one would be able to spot a PCR that looks iffy, but beyond that, I'm not sure what we would learn. PCR issues are likely to involve subtle things like primer design, reaction conditions and handling that can only really be tackled in the lab.

The world is messy, alright

Both the commentary and the replication study (Kerwin et al. 2020) are cautious when presenting their results. I think it reads as if the authors themselves either don't truly believe their failure to replicate or are bending over backwards to acknowledge everything that could have gone wrong.

The original study reported that overexpression of PTEN 3’UTR increased PTENP1 levels in DU145 cells (Figure 4A), whereas the Replication Study reports that it does not. …

However, the original study and the Replication Study both found that overexpression of PTEN 3’UTR led to a statistically significant decrease in the proliferation of DU145 cells compared to controls.

In the original study Poliseno et al. reported that two microRNAs – miR-19b and miR-20a – suppress the transcription of both PTEN and PTENP1 in DU145 prostate cancer cells (Figure 1D), and that the depletion of PTEN or PTENP1 led to a statistically significant reduction in the corresponding pseudogene or gene (Figure 2G). Neither of these effects were seen in the Replication Study. There are many possible explanations for this. For example, although both studies used DU145 prostate cancer cells, they did not come from the same batch, so there could be significant genetic differences between them: see Andor et al. (2020) for more on cell lines acquiring mutations during cell cultures. Furthermore, one of the techniques used in both studies – quantitative real-time PCR – depends strongly on the reagents and operating procedures used in the experiments. Indeed, there are no widely accepted standard operating procedures for this technique, despite over a decade of efforts to establish such procedures (Willems et al., 2008; Schwarzenbach et al., 2015).

That is, both the commentary and the replication study seem to subscribe to a view of the world where biology is so rich and complex that both results might be right, conditional on unobserved moderating variables. This is true, but it throws us into a discussion of generalisability. If a result only holds in some genotypes of DU145 prostate cancer cells, which might very well be the case, does it generalise enough to be useful for cancer research?

Power underwhelming

There is another possible view of the world, though. Biology is indeed rich and complicated, but in the absence of accurate estimates, we don't know which of all these potential moderating variables actually do anything. The first order of business, before we start imagining scenarios that might explain the discrepancy, is to get a really good estimate of it. How do we do that? It's hard, but how about starting with a cell size greater than N = 5?

The registered report contains power calculations, which is commendable. As far as I can see, though, it does not describe how the assumed effect sizes were arrived at. Power estimates for a study design depend on the assumed effect sizes, and small studies tend to exaggerate effect sizes (because in a small study, only large estimates can reach significance). This means that taking previous estimates as starting effect sizes might leave you with a design that is still unable to detect a true effect of reasonable size.

I don't know what effect sizes one should expect in these kinds of experiments, but even if you think you can get good power with a handful of samples per cell, can't you please run a couple more? We are all limited by resources and time, but if you're running something like qPCR, the cost per sample must be much smaller than the cost of doing one run of the experiment in the first place. It's really not as simple as adding one row to a plate, but almost.
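The small-study exaggeration problem is easy to demonstrate by simulation. Here is a minimal sketch in Python (not the actual experiments; the true effect of half a standard deviation, the group size of N = 5, and the alpha level of 0.05 are made-up values for illustration): draw two small groups, keep only the estimates that pass a two-sample t-test, and compare their average size to the truth.

```python
import random
import statistics

def exaggeration(true_diff=0.5, n=5, sims=20000, t_crit=2.306, seed=1):
    """Simulate two groups of size n (sd = 1) whose means differ by
    true_diff, keep the estimates that pass a two-sample t-test
    (df = 2n - 2 = 8, two-sided alpha = 0.05, critical value 2.306),
    and report power plus the average significant estimate relative
    to the true difference."""
    random.seed(seed)
    significant = []
    for _ in range(sims):
        a = [random.gauss(0, 1) for _ in range(n)]
        b = [random.gauss(true_diff, 1) for _ in range(n)]
        diff = statistics.mean(b) - statistics.mean(a)
        # pooled standard error for equal group sizes
        se = ((statistics.variance(a) + statistics.variance(b)) / n) ** 0.5
        if abs(diff / se) > t_crit:
            significant.append(diff)
    power = len(significant) / sims
    factor = statistics.mean(abs(d) for d in significant) / true_diff
    return power, factor

power, factor = exaggeration()
# with a true difference of half a standard deviation and N = 5,
# power is low and the significant estimates are severalfold too large
```

Under these assumptions, the handful of runs that do reach significance overstate the effect by a factor of two or more, which is exactly why small pilot estimates make treacherous inputs to power calculations.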


Calin, George A. "Reproducibility in Cancer Biology: Pseudogenes, RNAs and new reproducibility norms." eLife 9 (2020): e56397.

Hines, William C., et al. "Sorting out the FACS: a devil in the details." Cell Reports 6.5 (2014): 779-781.

Kerwin, John, and Israr Khan. "Replication Study: A coding-independent function of gene and pseudogene mRNAs regulates tumour biology." eLife 9 (2020): e51019.

Khan, Israr, et al. "Registered report: A coding-independent function of gene and pseudogene mRNAs regulates tumour biology." eLife 4 (2015): e08245.

Poliseno, Laura, et al. "A coding-independent function of gene and pseudogene mRNAs regulates tumour biology." Nature 465.7301 (2010): 1033-1038.

Morning coffee: alpha level 0.005


Valen Johnson recently published a paper in PNAS about Bayes factors and p-values. In null hypothesis testing, p-values measure the probability of seeing data this extreme or more extreme if the null hypothesis is true. A Bayes factor measures the ratio of the posterior probability of the alternative hypothesis to the posterior probability of the null hypothesis. The words 'probability of the hypothesis' tell us we're in Bayes land, but of course, that posterior probability comes from combining the prior probability with the likelihood, which is the probability of generating the data under the hypothesis. So the Bayes factor considers not only what happens if the null is true, but also what happens if the alternative is true. That is one source of discrepancies between them.

Johnson has found a way to construct Bayes factors so that they correspond to certain common hypothesis tests (including an approximation for the t-test, so there goes most of biology), and found that for many realistic test situations, a p-value of 0.05 corresponds to pretty weak support in terms of Bayes factors. Therefore, he suggests that the alpha level of hypothesis tests should be reduced to at least 0.005. I don't know enough about Bayes factors to really appreciate Johnson's analysis. However, I do know that some responses to the paper make things seem a bit too easy. Johnson writes:

Of course, there are costs associated with raising the bar for statistical significance. To achieve 80% power in detecting a standardized effect size of 0.3 on a normal mean, for instance, decreasing the threshold for significance from 0.05 to 0.005 requires an increase in sample size from 69 to 130 in experimental designs. To obtain a highly significant result, the sample size of a design must be increased from 112 to 172.

If one does not also increase the sample sizes to preserve (or, I guess, preferably improve) power, just reducing the alpha level to 0.005 will only make matters worse. With low power comes, as Andrew Gelman likes to put it, a high Type M or magnitude error rate. That is, if power is bad enough, not only will there be few significant findings, but all of them will be overestimates.
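Johnson's sample sizes line up with the standard normal-approximation power formula for a one-sided test on a normal mean with known variance (a sketch; that this is the exact calculation behind his numbers is my inference, not something spelled out in the quote):

```python
from math import ceil
from statistics import NormalDist

def n_for_power(effect, alpha, power):
    """Smallest n such that a one-sided z-test on a normal mean
    (known sd = 1) detects a standardized effect at the given alpha
    with the given power: n = ((z_alpha + z_power) / effect) ** 2."""
    z = NormalDist()
    return ceil(((z.inv_cdf(1 - alpha) + z.inv_cdf(power)) / effect) ** 2)

n_for_power(0.3, 0.05, 0.8)   # → 69
n_for_power(0.3, 0.005, 0.8)  # → 130
```

So moving the threshold from 0.05 to 0.005 roughly doubles the required sample size at this effect size, which is the cost Johnson is acknowledging.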

Power overwhelming II: how misleading are small studies?

Everyone knows that sample size matters: large surveys and experiments are generally more credible than small ones. The question is just: how large, and how credible? In the previous post I wrote a little about classical power calculations, taking as an example a study by Raison et al. on anti-inflammatory antibodies against depression. (Spoiler: they don't seem to have much of an effect.) I just saw a short preprint by Andrew Gelman and John Carlin that develops a somewhat different way of looking at power (or design analysis, as they write), with two new measures of how bad a study is. Imagine an experiment that works roughly like the antibody one: two groups of depressed patients receive either a new drug (here, the antibody infliximab) or placebo, and what interests us is the difference between the groups after a period of treatment.

Classical power calculations are about controlling the risk of so-called Type 2 errors, which means missing a real difference between the groups. A Type 1 error is seeing a difference that isn't actually there. Gelman doesn't have much patience for this way of reasoning. He (and many others) like to point out that we usually know from the start that the difference isn't zero: what we need to know is not whether there is a difference between those who received infliximab and the others, but in which direction the difference goes (are the treated patients healthier or sicker?), and whether it is large enough to be credible and meaningful.

Therefore, Gelman & Carlin suggest that we look at two other types of error, which they call Type S, sign errors, and Type M, magnitude errors. A sign error is claiming that the difference goes in one direction when it actually goes in the other. A magnitude error is claiming that a difference is large when it is actually small; Gelman & Carlin measure magnitude error with an exaggeration factor, which is the estimated difference, in the cases where it is large enough to be considered significantly different from zero, divided by the true difference.

Let us take the infliximab and depression example again. Gelman & Carlin stress how important it is not to pull your effect size assumptions out of thin air, so I have looked up some articles that summarise many studies of antidepressants. If we assume a baseline of 24 units on the Hamilton depression scale (roughly what the patients had at the start of the experiment), the mean effect in Khan & co's systematic review corresponds to a difference of 2.4 units. That agrees fairly well with Gibbons & co's meta-analysis of fluoxetine and venlafaxine, where the overall difference was 2.6 units. Storosum & co's meta-analysis of tricyclic antidepressants has a difference of 2.8 units. It is of course impossible to know how large an effect infliximab would have if it worked, but it seems reasonable to assume something on the same order of magnitude as drugs that do work. In the antibody study's power calculation, the authors concluded that they should have a good chance of detecting a difference of 5 units. That estimate seems to have been rather optimistic.

Just as with the first power calculation, I have run simulations. I tried differences of 1 to 5 units, with 60 participants in each group, just as in the experiment, and the same variation the authors used for their power calculation. I let the computer generate simulated data sets with those parameters, and then it is simply a matter of calculating the risk of a sign error and the exaggeration factor.
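A design analysis like this can be sketched in a few lines of Python (my reconstruction for illustration, not the actual simulation code; in particular, the standard deviation of 10 Hamilton units is an assumed value, chosen to be roughly consistent with the power figures discussed here, since I don't quote the authors' exact number):

```python
import random
import statistics

def design_analysis(true_diff, sd=10.0, n=60, sims=10000, crit=1.98, seed=1):
    """Simulate two groups of n patients whose Hamilton-scale means
    differ by true_diff (sd is an assumed value), test each simulated
    difference with a two-sample t-test (critical value ~1.98 for
    df = 118, two-sided alpha = 0.05), and summarise power, the
    Type S (sign) error rate, and the exaggeration factor among
    the significant results."""
    random.seed(seed)
    significant = []
    for _ in range(sims):
        a = [random.gauss(0, sd) for _ in range(n)]
        b = [random.gauss(true_diff, sd) for _ in range(n)]
        diff = statistics.mean(b) - statistics.mean(a)
        # pooled standard error for equal group sizes
        se = ((statistics.variance(a) + statistics.variance(b)) / n) ** 0.5
        if abs(diff / se) > crit:
            significant.append(diff)
    power = len(significant) / sims
    type_s = sum(d < 0 for d in significant) / len(significant)
    exaggeration = statistics.mean(abs(d) for d in significant) / true_diff
    return power, type_s, exaggeration
```

Running `design_analysis(d)` for true differences d from 1 to 5 units gives the power, sign error rate and exaggeration curves discussed below.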


This diagram shows the chance of getting a significant result (that is, the power) and the risk of a sign error at different true effect sizes. The grey line marks 2.5 units. Compared with Gelman & Carlin's examples, the risk of a sign error does not look too bad: it is very close to zero for realistic effects. The power is still so-so, barely 30% for an effect of 2.5 units.


This diagram shows the exaggeration factor at different effect sizes; I tried to demonstrate the same thing with histograms in the previous post. At 5 units, the effect size the authors assumed, the curve has flattened out near one, that is, no major exaggeration. But at 2.5 units we should still expect the estimated difference to look twice as large as it really is. In summary: the authors should be able to rule out large effects, like five units on the Hamilton scale, but today's antidepressants seem to have considerably smaller effects than that. So there is a risk of missing realistic effects, and it only gets worse, of course, once they start splitting the experiment into smaller subgroups.


Gelman A & Carlin J (2013) Design analysis, prospective or retrospective, using external information. Manuscript on Gelman's homepage.

Storosum JG, Elferink AJA, van Zwieten BJ, van den Brink W, Gersons BPR, van Strik R, Broekmans AW (2001) Short-term efficacy of tricyclic antidepressants revisited: a meta-analytic study European Neuropsychopharmacology 11 pp. 173-180 http://dx.doi.org/10.1016/S0924-977X(01)00083-9.

Gibbons RD, Hur K, Brown CH, Davis JM, Mann JJ (2012) Benefits From Antidepressants. Synthesis of 6-Week Patient-Level Outcomes From Double-blind Placebo-Controlled Randomized Trials of Fluoxetine and Venlafaxine. Archives of General Psychiatry 69 pp. 572-579 doi:10.1001/archgenpsychiatry.2011.2044

Khan A, Faucett J, Lichtenberg P, Kirsch I, Brown WA (2012) A Systematic Review of Comparative Efficacy of Treatments and Controls for Depression. PLoS ONE e41778 doi:10.1371/journal.pone.0041778

Raison CL, Rutherford RE, Woolwine BJ, Shuo C, Schettler P, Drake DF, Haroon E, Miller AH (2013) A Randomized Controlled Trial of the Tumor Necrosis Factor Antagonist Infliximab for Treatment-Resistant Depression. JAMA Psychiatry 70 pp. 31-41. doi:10.1001/2013.jamapsychiatry.4


Gelman & Carlin mention an R function for the error calculations, but I can't find it. For my simulation, see github.