This list of gradually more radical conceptions represents my attempt at understanding the term ”data-driven science”. What does it mean?
Level 0: ”All science is already data-driven”
This is the non-answer that I expect from researchers who take ”data-driven” to mean ”empirical science as usual”, potentially distinguishing it from ”theory”, which they may dislike.
Level 1: ”My science is already data-driven”
This is the answer I expect from the ”milk recordings are Big Data” crew, i.e., researchers who are used to using large datasets, and don’t see anything truly new in the ”Big Data” movement (but might like to take part in the hype with their own research).
For example, Leonelli (2014) writes about developments in model organism databases and their relation to ”Big Data”. She argues that working with large datasets was nothing new for this community but that the Big Data discourse brought attention to the importance of data and data management:
There is strong continuity with practices of large data collection and assemblage conducted since the early modern period; and the core methods and epistemic problems of biological research, including exploratory experimentation, sampling and the search for causal mechanisms, remain crucial parts of inquiry in this area of science – particularly given the challenges encountered in developing and applying curatorial standards for data other than the high-throughput results of ‘omics’ approaches. Nevertheless, the novel recognition of the relevance of data as a research output, and the use of technologies that greatly facilitate their dissemination and re-use, provide an opportunity for all areas in biology to reinvent the exchange of scientific results and create new forms of inference and collaboration.
Level 2: Re-using data as much as possible
This answer is the notion that data-driven research is about making as much use as possible of the data that is already out there, putting it in new research contexts and possibly aggregating data from many sources.
This is the emphasis of the SciLife & Wallenberg program for ”data-driven life science”:
Experiments generate data, which can be analyzed to address specific hypotheses. The data can also be used and combined with other data, into larger and more complex sets of information, and generate new discoveries and new scientific models. Which, in their turn, can be addressed with new experiments.
Researchers in life science often collect large amounts of data to answer their research questions. Often, the same data can be used to answer other research questions, posed by other research groups. Perhaps many, many other questions, and many, many other research groups. This means that data analysis, data management and data sharing is central to each step of the research process …
In my opinion, taking this stance does not necessarily imply any ban on modelling, hypothesising or theorising (which is the next level). It’s a methodological emphasis on data, not a commitment to a particular philosophy of science.
Level 3: A philosophy of science that prioritises data exploration
This answer understands ”data-driven science” as a philosophy of science position that prioritises data exploration over modelling, hypothesising and theorising. Leonelli (2012) argues that this is one of the key features of data-driven research.
The extreme example is Chris Anderson’s ”End of Theory” essay, published in Wired in 2008, which argues that ”petabyte-scale” data has made theories, hypotheses, models, and the search for causes obsolete: in his view, ”correlation is enough”.
The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.
But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. Consider physics: Newtonian models were crude approximations of the truth (wrong at the atomic level, but still useful). A hundred years ago, statistically based quantum mechanics offered a better picture — but quantum mechanics is yet another model, and as such it, too, is flawed, no doubt a caricature of a more complex underlying reality. The reason physics has drifted into theoretical speculation about n-dimensional grand unified models over the past few decades (the ”beautiful story” phase of a discipline starved of data) is that we don’t know how to run the experiments that would falsify the hypotheses — the energies are too high, the accelerators too expensive, and so on.
There is now a better way. Petabytes allow us to say: ”Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
I believe that this extreme position is incoherent. We can’t infer directly from data without a theory or model; the only choice is whether we acknowledge and analyse the models we use. But there are less extreme versions that make more sense.
For example, van Helden (2013) argued that omic studies involve broader, more generic hypotheses about what kinds of mechanisms and patterns one might expect to find in a large dataset, rather than hypotheses about the identity of the genes, pathways or risk factors themselves.
Equipped with such huge data sets, we can perform data mining in an objective way. For some purists, this approach to data acquisition is anathema, as it is not ‘hypothesis-driven’. However, I submit that it is. In this case, the original hypothesis is broad or generic–we generate data, assess it and probably find something useful for elucidating our research problem. /…/ The hypothesis is that one will design an algorithm and find a pattern, which allows us to distinguish between cases and controls.
Anderson might be right that the world as revealed by data is too complicated for our puny brains to write down equations about; maybe machines can help. That brings us to the next level.
Level 4: Machine learning to replace scientific reasoning
This answer is the attitude that complex relationships in large data are too much for the human mind, and traditional theoretical models, to handle. Instead, we should use machine learning models to find patterns and make predictions using automated inference processes that are generally too intricate for humans to interpret intuitively. This is also one of Leonelli’s key features of data-driven research (2012).
Again, Anderson’s ”End of Theory” is an extreme example of this view, when he argues that the complex interactions between genes, epigenetics and environment make theoretical models of genetics useless.
Now biology is heading in the same direction. The models we were taught in school about ”dominant” and ”recessive” genes steering a strictly Mendelian process have turned out to be an even greater simplification of reality than Newton’s laws. The discovery of gene-protein interactions and other aspects of epigenetics has challenged the view of DNA as destiny and even introduced evidence that environment can influence inheritable traits, something once considered a genetic impossibility.
In short, the more we learn about biology, the further we find ourselves from a model that can explain it.
Anderson overstates how helpless genetic models are against complexity, but there is a point here, for example with regard to animal breeding. Part of what made genomics useful in modern animal breeding was giving up on the piecemeal identification of individual markers of large genetic effect, and instead throwing whole-genome information into a large statistical model. We estimate the effects of all markers, without regard for their individual effects, without looking for genetic causes (”genetics without genes”, as Lowe & Bruce (2019) put it), and turn them into a ranking of animals for selection decisions. However, this is level 3 stuff, because there is a human-graspable quantitative genetic theory underlying the whole machinery, with marker effects, breeding values, and equations that can be interpreted. A more radical vision of machine learning-based breeding would take that away too: no notion of genetic relationships between individuals, genetic correlations between traits, or separation of genotype from phenotype. Will this work? I don’t know, but there are tendencies like this in the field, for example when researchers try to predict useful traits from high-dimensional omics or sensor data.
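To make the ”estimate them all” idea concrete, here is a toy sketch of whole-genome prediction in the spirit of ridge regression (SNP-BLUP): all marker effects are shrunk jointly, no marker is singled out as causal, and the output is simply a ranking of animals. The data are simulated and the shrinkage parameter is picked by hand, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 200 animals genotyped at 1000 markers (allele counts 0/1/2),
# phenotypes generated from 20 true causal markers plus noise.
n_animals, n_markers = 200, 1000
X = rng.integers(0, 3, size=(n_animals, n_markers)).astype(float)
X -= X.mean(axis=0)  # centre the genotype codes
true_effects = np.zeros(n_markers)
true_effects[rng.choice(n_markers, 20, replace=False)] = rng.normal(0, 1, 20)
y = X @ true_effects + rng.normal(0, 1, n_animals)

# Ridge regression: estimate all marker effects at once, shrinking them
# towards zero, with no attempt to identify the "real" causal variants.
lam = 100.0  # shrinkage parameter, chosen by hand for this example
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(n_markers), X.T @ y)

# Genomic estimated breeding values: predicted genetic merit per animal,
# which is all that is needed to rank candidates for selection.
gebv = X @ beta_hat
ranking = np.argsort(-gebv)  # best animals first
```

The point of the sketch is that nothing in it names a gene: the markers go in as a block, and what comes out is a ranking, which is why the approach still rests on interpretable quantities (marker effects, breeding values) rather than a black box.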
Apart from those pro-machine learning, anti-theory stances, you could also have a pro-machine, pro-theory stance. There are several fields of research that try to use automated reasoning to help do theory. One interesting subfield that I only learned about recently is research on ”symbolic regression” and ”machine scientists”, where you try to find mathematical expressions that fit the data while allowing for a large variety of different expressions. It’s a form of model selection with a lot of flexibility about what kind of model you consider.
There is a whole field of this, but here is an example I read recently: Guimerà et al. (2020) used a Bayesian method that explores different expressions with Markov chain Monte Carlo, putting a prior on mathematical expressions based on equations from Wikipedia. This seems like something that could be part of the toolbox of a data-driven scientist who still won’t give up on the idea that we can understand what’s going on, even if we might need help from a computer.
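A drastically simplified sketch of the idea (nothing like the actual Guimerà et al. machinery, which searches a full expression grammar with MCMC): fit a handful of candidate expression forms to toy data and keep the one with the best goodness-of-fit score. The data, the candidate forms, and the scoring are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data generated from a hidden "law": y = 3 * x**2, plus noise.
x = np.linspace(1.0, 5.0, 50)
y = 3 * x**2 + rng.normal(0, 0.5, x.size)

# A tiny library of candidate expression forms y = a * f(x).
# A real machine scientist searches a vastly richer space of expressions.
candidates = {
    "a*x": lambda x: x,
    "a*x**2": lambda x: x**2,
    "a*exp(x)": lambda x: np.exp(x),
    "a*log(x)": lambda x: np.log(x),
}

def fit_and_score(f):
    """Least-squares fit of the coefficient a, scored by a crude
    fit criterion (log of the residual variance; lower is better)."""
    fx = f(x)
    a = (fx @ y) / (fx @ fx)
    residuals = y - a * fx
    return a, np.log(np.mean(residuals**2))

fits = {name: fit_and_score(f) for name, f in candidates.items()}
best = min(fits, key=lambda name: fits[name][1])
```

In this toy case the search recovers the quadratic form with a coefficient near 3; the output is a human-readable equation, not just a prediction, which is what makes this a pro-theory use of machines.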
Anderson C., 2008 The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. Wired.
van Helden P., 2013 Data-driven hypotheses. EMBO reports 14: 104–104.
Leonelli S., 2012 Introduction: Making sense of data-driven research in the biological and biomedical sciences. Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences 43: 1–3.
Leonelli S., 2014 What difference does quantity make? On the epistemology of Big Data in biology. Big Data & Society 1: 2053951714534395.
Lowe J. W. E., Bruce A., 2019 Genetics without genes? The centrality of genetic markers in livestock genetics and genomics. History and Philosophy of the Life Sciences 41 (4).
Guimerà R., Reichardt I., Aguilar-Mogas A., Massucci F. A., Miranda M., Pallarès J., Sales-Pardo M., 2020 A Bayesian machine scientist to aid in the solution of challenging scientific problems. Science Advances 6 (5): eaav6971.