
Wednesday, 22 January 2025

Big Data and the Future of the Scientific Method

[This blog post is based on a chapter in my new book ‘REVISITING THE SCIENTIFIC METHOD: The Need to Make Science More Inclusive in Scope’ (Wadhawan 2025).]

A problem, as well as a boon, of modern times is that we have an overwhelming amount of data at our disposal. The term Big Data has been coined for this. Big Data refers to complex and voluminous collections of structured, unstructured, and semi-structured datasets that are challenging to manage with traditional data-processing tools, and that require additional infrastructure to govern, analyse, and convert into insights or knowledge. Big Data is characterised by ‘volume’, ‘velocity’, ‘variety’, ‘veracity’, and ‘value’ (BasuMallik 2022; Tiao 2024).

[Image credit: The Role of Big Data in Scientific Research | LinkedIn]

The volume of data at our disposal is now measured in zettabytes (10²¹ bytes) and yottabytes (10²⁴ bytes), and it is growing exponentially. ‘A Big Data system has two closely related parts: storage and analysis. Analysis is based on storage, which facilitates access. Storage is based on analysis, which reduces volume. Analytical solutions that really respond to this problem have two features: induction and speed’ (Malle 2013).

A feature of Big Data is that just about any correlation can be found in it. This means that the dictum ‘correlation does not necessarily imply causation’ can be under threat, and we have to find sophisticated ways of deciding whether to take a correlation seriously or not.

‘The end of theory: The data deluge makes the scientific method obsolete’ was the attention-grabbing title of a provocative and much-discussed article by the then Editor-in-Chief of Wired magazine, Chris Anderson (2008). The article begins by quoting the statistician George Box, who had written in 1976 that ‘all models are wrong, but some are useful’. The justification for such a statement comes from the nature, or rather the limitation, of deductive logic. Any model or hypothesis or theory is a generalisation based on the assumption that its underlying axioms are true. From the model we draw conclusions by deductive logic, and then go about checking their validity/falsifiability against the available information. As more and more information pours in with the passage of time, the model may fall by the wayside, to be replaced by a better model. This has happened several times in the history of science. Some examples: Newton’s models of gravity and of space (replaced by Einstein’s theory of spacetime); the classical notion of the absolute simultaneity of events (replaced by Einstein’s theory of relativity); classical mechanics (superseded by quantum mechanics); theories of human behaviour (modified or replaced again and again). The new models may, in turn, prove to be inadequate or ‘wrong’, to be replaced by still better or newer models in the light of the additional information that keeps pouring in. And so on. So, any model is liable to be ‘wrong’.

As information has gone on piling up, a stage has come when ‘information is not a matter of simple three- and four-dimensional taxonomy and order but of dimensionally agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. It forces us to view data mathematically first and establish a context for it later’ (Anderson 2008). Anderson gave the example of the success achieved by Google in the advertising world: ‘Google conquered the advertising world with nothing more than applied mathematics. It didn't pretend to know anything about the culture and conventions of advertising — it just assumed that better data, with better analytical tools, would win the day. And Google was right. Google's founding philosophy is that we don't know why this page is better than that one: If the statistics of incoming links say it is, that's good enough. No semantic or causal analysis is required. That's why Google can translate languages without actually "knowing" them (given equal corpus data, Google can translate Klingon into Farsi as easily as it can translate French into German). And why it can match ads to content without any knowledge or assumptions about the ads or the content’. He quoted Google's research director Peter Norvig as saying: ‘All models are wrong, and increasingly you can succeed without them’.

This is a far cry from the traditional way of doing science, which is based on testable hypotheses. But we must admit that there is indeed a stalemate of sorts in physics at present. No big (conceptual) breakthroughs have come for quite some time now. We have string theory, which we are unable to verify. It has not been possible to unify quantum mechanics with the theory of gravity. And then there is dark matter. And so on. Do we need to change tracks altogether? Many people think that we do. Here is Anderson’s (2008) take on this: ‘The reason physics has drifted into theoretical speculation about n-dimensional grand unified models over the past few decades (the "beautiful story" phase of a discipline starved of data) is that we don't know how to run the experiments that would falsify the hypotheses — the energies are too high, the accelerators too expensive, and so on. Now biology is heading in the same direction. The models we were taught in school about "dominant" and "recessive" genes steering a strictly Mendelian process have turned out to be an even greater simplification of reality than Newton's laws. The discovery of gene-protein interactions and other aspects of epigenetics has challenged the view of DNA as destiny and even introduced evidence that environment can influence inheritable traits, something once considered a genetic impossibility’. He is overstating a bit, but still has a point there.

Anderson’s way out was to assert that correlations are enough, and that knowledge about the causes behind the observed correlations can take a back seat to start with. He advocated a new kind of science: data-driven science. In it, we can stop looking for models or hypotheses. We can analyse the data without any preconceived hypotheses or biases about what it might show. We throw the numbers into big computing clusters and let sophisticated statistical algorithms find patterns where the existing science cannot. Such an approach is particularly relevant for bioinformatics, systems biology, epidemiology, ecology, etc. But can it replace the usual hypothesis-driven way of doing science? Not really, though it can be an important adjunct to it. Below I discuss some reactions to the points made by Anderson.
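As an illustration of what ‘letting the algorithms find patterns’ can look like in practice, here is a minimal sketch using k-means clustering on synthetic, unlabeled data (numpy and scikit-learn are assumed to be available; the dataset and the choice of three clusters are invented for the illustration). The algorithm proposes groupings without any prior hypothesis; interpreting them, and asking whether they mean anything, comes afterwards.

```python
# A minimal sketch of 'hypothesis-free' pattern finding: cluster unlabeled
# data and only afterwards ask what (if anything) the groups mean.
# Assumes numpy and scikit-learn are installed; the synthetic data and the
# choice of k = 3 are purely illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Unlabeled 'measurements': three hidden groups the analyst knows nothing about.
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(200, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(200, 2)),
    rng.normal(loc=[0, 5], scale=0.5, size=(200, 2)),
])

# Let the algorithm propose structure; no prior hypothesis about the groups.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
print("cluster sizes:", np.bincount(kmeans.labels_))
print("cluster centres:\n", kmeans.cluster_centers_)
```

Whether the three recovered groups correspond to anything real is exactly the question the data-driven approach postpones.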

Mazzocchi (2015) examined the issue from the epistemological point of view. He asked: ‘Is data-driven research a genuine mode of knowledge production, or is it above all a tool to identify potentially useful information? Given the amount of scientific data available, is it now possible to dismiss the role of theoretical assumptions and hypotheses?’ He began by pointing out that, long before Anderson (2008), it was Francis Bacon who, as far back as 1620, had argued in his Novum Organum that ‘scientific knowledge should not be based on preconceived notions but on experimental data. Deductive reasoning, he argued, is eventually limited because setting a premise in advance of an experiment would constrain the reasoning so as to match that premise. Instead, he advocated a bottom-up approach: In contrast to deductive reasoning, which has dominated science since Aristotle, inductive reasoning should be based on facts to generalize their meaning, drawing inferences from observations and data’.

Thus Big-Data-based science has renewed the primacy of inductive reasoning, in the form of technology-based empiricism. Some believe that this hypothesis-neutral way of creating knowledge will replace the hypothesis-driven way of doing research. Data mining can throw up unexpected correlations and patterns, which can then be used to generate new hypotheses about the causes behind those correlations (see Hassanien et al. 2015). So the new computational route ends up doing hypothesis-generating research rather than hypothesis-testing research.

Inductive algorithms occupy centre-stage here. Mazzocchi quotes Malle (2013): ‘Inductive reasoning generally produces no finished status. The results of inferences are likely to alter the inferences already made. It is possible to continue the reasoning indefinitely. The best inductive algorithms can evolve: they “learn”, they refine their way of processing data according to the most appropriate use which can be made. Permanent learning, never completed, produces imperfect but useful knowledge. Any resemblance with the human brain is certainly not a coincidence’. [As Malle explains, ‘induction, unlike deduction, is a mechanism used by the human brain at almost every moment. Indeed, despite the fact that deduction is considered as cleaner, more scientific, it occupies only a small portion of the processing time of our brain. It is particularly relevant when analysing a situation out of its context’.]
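The ‘never finished’ character of inductive, learning-based algorithms that Malle describes can be illustrated with a minimal streaming estimator: a linear model that is revised a little with every new observation, so its conclusions remain permanently provisional. This is a toy sketch; the hidden ‘law’, the noise level, and the learning rate are assumptions made up for the illustration.

```python
# A minimal sketch of 'never finished' inductive learning: a linear model
# updated one observation at a time, so every new data point can revise
# what was inferred before. The data-generating rule, noise level, and
# learning rate are illustrative assumptions, not taken from the post.
import numpy as np

rng = np.random.default_rng(1)
w_true, b_true = 2.0, -1.0   # hidden 'law' the learner never sees directly
w, b = 0.0, 0.0              # current (always provisional) estimate
lr = 0.01                    # learning rate

for t in range(10_000):
    x = rng.uniform(-3, 3)
    y = w_true * x + b_true + rng.normal(scale=0.3)   # noisy observation
    err = (w * x + b) - y
    w -= lr * err * x        # refine the inference in the light of new data
    b -= lr * err

print(f"estimate after streaming data: w = {w:.2f}, b = {b:.2f}")
```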

Mazzocchi and many others have expressed concerns that unscrupulous agents can influence what kind of Big Data is generated and made public, to the exclusion of information inimical to their interests or designs. Think of the Dark Web. Perhaps we never get the ‘full’ picture at any point in our history. And even when the intentions are not questionable and there is no manipulation, it is wrong to presume data-neutrality, even in good science. Data are not collected randomly. Experiments are designed and carried out within theoretical, methodological and instrumental limitations. There are always hypotheses and assumptions at play. Data collection is seldom unbiased. ‘Scientific research is carried out by human beings whose cognitive stance has been formed by many years of incorporating and developing cultural, social, rational, disciplinary ideas, preconceptions and values, together with practical knowledge. Scientists form their ideas and hypotheses based on specific theoretical and disciplinary backgrounds, which again are the result of decades or even centuries of history of scientific and philosophical thought’ (Mazzocchi 2015).

Calude and Longo (2017) delivered a severe blow to the Big Data enthusiasts who had been predicting the ‘end of the scientific method’, asserting that hypothesis-free, computer-discovered correlations, found in an apparently ‘unbiased’ way from Big Data, are enough for the advancement of knowledge, and that we can ignore all questions of causation behind an observation or dataset. Calude and Longo used classical results from ergodic theory, Ramsey theory, and algorithmic information theory to prove that very large databases must contain arbitrary correlations, and that these correlations or ‘regularities’ appear only because of the large size, and not the nature, of the data: such correlations can be found even in ‘randomly generated, large-enough databases’. In other words, most correlations are spurious. The scientific method can be enriched by data mining in very large databases, but not replaced by it.

I discuss next the work of Succi and Coveney (2019), which I believe to be particularly important because it addresses the impact of Big Data on the science of complex systems. A very large chunk of activity in modern science is about complex systems. This century is the century of complexity, as was predicted by Stephen Hawking. Succi and Coveney point out that ‘once the most extravagant claims of Big Data are properly discarded, a synergistic merging of Big Data with big theory offers considerable potential to spawn a new scientific paradigm capable of overcoming some of the major barriers confronted by the modern scientific method originating with Galileo. These obstacles are due to the presence of nonlinearity, nonlocality and hyperdimensions which one encounters frequently in multiscale modelling’. They make the following four points:

1. Complex systems do not (generally) obey Gaussian statistics. This is almost a defining feature of complex systems: their strongly correlated nature makes them generally obey power-law statistics (see Wadhawan 2018). This means that the law of large numbers (so characteristic of Gaussian statistics) generally does not hold for complex systems: for them, an increase in sample size (a major feature of Big Data) is no guarantee that the error or uncertainty in the estimated or mean value decreases monotonically. In a system obeying Gaussian statistics, the chances of occurrence of an event far from the mean value are small; the bell-shaped Gaussian curve has a small tail. Not so for power-law statistics; the tail may be far from small. ‘This explains why the Big Data trumpets should be toned down: when rare events are not so rare, convergence rates can be frustratingly slow even in the face of petabytes of data’. (A small numerical sketch of this point appears after point 4 below.)

2. No data is big enough for systems with strong sensitivity to data inaccuracies. The evolution of a complex or chaotic system is well known to be a very sensitive function of ‘initial conditions’ or ‘data inaccuracies’ (see Wadhawan 2018). Think of the ‘butterfly effect’. This flies in the face of Big Data radicalism: the main claim of Big Data enthusiasts is that we can extract patterns from data, or discover correlations between phenomena we never thought of as connected, simply because of the large sample sizes. (A sketch of this sensitivity also appears after point 4 below.)

3. Correlation does not imply causation, and the link between the two becomes exponentially fainter with increasing data size. Correlations between two sets of data or signals can be either true correlations (TCs) or false correlations (FCs). A TC indicates a causative relationship or connection. An FC is one that just happens to be observed, for no underlying reason; it is a ‘spurious’ correlation. But distinguishing between a true and a false correlation can be tough at times. And the problem is compounded by the fact that as data sizes grow, false correlations become more and more common (as proved by Calude and Longo 2017). It is also true, as shown by Meng (2014), that to make statistically reliable inferences one needs access to more than 50% of the data on which to perform one’s machine learning for detecting patterns or correlations. According to Succi and Coveney (2019), what we need are ‘many more theorems that reliably specify the domain of validity of the methods and the amounts of data to produce statistically reliable conclusions’. They cite the paper by Karbalayghareh et al. (2018) as a step in the right direction. (A sketch of how spurious correlations multiply appears after point 4 below.)

4. In a finite-capacity world, too much data is just as bad as no data. We extract information from data, knowledge from information, and wisdom from knowledge. The step from knowledge to wisdom may well involve hypothesising a model for the underlying cause(s). And the wisdom gained can be utilised for optimising the model by a repetitive, circular process of reasoning. Offhand we might think that an expanded database should lead to a corresponding increase in information, knowledge, and wisdom, in a linear sort of way. But the linearity is not guaranteed, particularly for complex systems. Succi and Coveney (2019) make the point that for finite complex systems a state of nonlinear saturation is reached sooner or later as more and more data pour in: ‘This is the very general competition-driven phenomenon by which increasing data supply leads to saturation and sometimes even loss of information; adding further data actually destroys information. … Beyond a certain threshold, further data does not add any information, simply because additional data contain less and less new information, and ultimately no new information at all. … We speculate, without proof, that this is a general rule in the natural world’. [Destruction of information occurs if new and old data are mutually contradictory.]
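Point 1 can be made concrete with a minimal numerical sketch (in Python with numpy; the distributions and parameters are illustrative assumptions, not taken from Succi and Coveney). Both samples below have a true mean of 10, but the running mean of thin-tailed Gaussian data settles quickly, while the running mean of heavy-tailed Pareto data is still drifting after a million observations.

```python
# A minimal sketch of why the law of large numbers behaves so differently
# for Gaussian and for heavy-tailed (power-law) data: the running mean of
# Gaussian samples settles quickly, while for a heavy-tailed Pareto sample
# with the same true mean it is still drifting after a million samples.
# The distributions and parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000
checkpoints = [10**k for k in range(1, 7)]

gaussian = rng.normal(loc=10.0, scale=1.0, size=n)   # thin-tailed, mean 10
heavy = rng.pareto(1.1, size=n)                      # heavy-tailed, mean 1/(1.1 - 1) = 10

for m in checkpoints:
    print(f"n = {m:>9}:  Gaussian mean = {gaussian[:m].mean():7.3f}   "
          f"Pareto mean = {heavy[:m].mean():9.3f}")
```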
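Point 2, the sensitivity to initial conditions, can be illustrated with the logistic map in its chaotic regime; this is a generic textbook example, not a model taken from the paper. Two trajectories that start 10⁻¹⁰ apart become completely unrelated within a few dozen steps, so no realistic amount of data about the initial state pins down the long-term behaviour.

```python
# A minimal sketch of sensitivity to initial conditions (the 'butterfly
# effect') using the logistic map x -> r*x*(1 - x) in its chaotic regime.
# Two trajectories starting 1e-10 apart diverge completely within a few
# dozen iterations; the parameters are illustrative.
r = 4.0
x, y = 0.2, 0.2 + 1e-10   # two 'indistinguishable' initial conditions

for step in range(1, 61):
    x = r * x * (1 - x)
    y = r * y * (1 - y)
    if step % 10 == 0:
        print(f"step {step:2d}:  x = {x:.6f}   y = {y:.6f}   |x - y| = {abs(x - y):.2e}")
```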
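Point 3 can be illustrated by deliberately searching purely random data for correlations: the more independent variables one screens, the stronger the ‘best’ correlation found, even though no causal link exists anywhere. This is a minimal sketch with numpy; the sample sizes and variable counts are arbitrary assumptions.

```python
# A minimal sketch of the Calude-Longo point: in purely random data, the
# strongest pairwise correlation found grows with the number of variables
# searched, even though no variable is causally related to any other.
# Sample sizes and variable counts are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(7)
n_obs = 100   # observations per variable

for n_vars in (10, 100, 1000):
    data = rng.normal(size=(n_vars, n_obs))   # mutually independent variables
    corr = np.corrcoef(data)                  # all pairwise correlations
    np.fill_diagonal(corr, 0.0)               # ignore self-correlation
    print(f"{n_vars:5d} independent variables -> "
          f"strongest 'pattern' found: r = {np.abs(corr).max():.2f}")
```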
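And point 4, the saturation of information, can be illustrated very crudely by streaming symbols from a small, finite system and counting how many distinct length-3 patterns have been seen: beyond a certain amount of data, essentially no new patterns appear. This is only a toy illustration under assumed parameters, not the competition-driven mechanism Succi and Coveney describe.

```python
# A minimal sketch of information saturation in a finite system: as more
# data stream in from a fixed 8-state source, the number of distinct
# length-3 patterns observed levels off, so additional data eventually
# contain no new information. Alphabet size and sample sizes are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
alphabet, k = 8, 3                 # finite system: 8 states, patterns of length 3
stream = rng.integers(0, alphabet, size=100_000)
seen = set()

for i in range(k, len(stream) + 1):
    seen.add(tuple(stream[i - k:i]))
    if i in (100, 1_000, 10_000, 100_000):
        print(f"after {i:>7} data points: {len(seen):3d} of {alphabet**k} "
              f"possible patterns already seen")
```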

Here are some final conclusions in the essay by Succi and Coveney (2019): ‘There is no doubt that the “Big Data/machine learning/artificial intelligence” approach has plenty of scope to play a creative and important role in addressing major scientific problems. Among the applications, pattern recognition is particularly powerful in detecting patterns which might otherwise remain hidden indefinitely (modulo the problem of false positives). Possibly the most important role is likely to be in establishing patterns which then demand further explanation, where scientific theories are required to make sense of what is discovered.  … Instead of rendering theory, modelling and simulation obsolete, Big Data should and will ultimately be used to complement and enhance it. Examples are flourishing in the current literature, with machine learning techniques being embedded to assist large-scale simulations of complex systems in materials science, turbulence, and also to provide major strides towards personalised medicine, a prototypical problem for which statistical knowledge will never be a replacement for patient-specific modelling. It is not hard to predict that major progress may result from an inventive blend of the two, perhaps emerging as a new scientific methodology’.

In the beginning we had theoretical science and experimental science. The advent of computers gave rise to a third kind of science, namely computational science. It often happens that the mathematical formulation of a model involves differential equations that are too difficult, if not impossible, to solve analytically. But the availability of powerful computers enables us to formulate the model in terms of difference equations rather than differential equations, and the difference equations can be solved to a desired degree of accuracy. The cellular-automata approach is an example of computational science (see Wadhawan 2018). And now the data deluge has given rise to a fourth type of science, data-driven science: find the correlations first, and then go looking for the reasons or causes behind the observed correlations. Big Data analytics plays a big role in this (see What is Big Data Analytics? | IBM).
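As a minimal example of the difference-equation idea (not drawn from Wadhawan 2018; the parameters are arbitrary): the decay equation dN/dt = −λN, whose analytical solution is N(t) = N₀ e^(−λt), can instead be stepped forward numerically with the update N ← N − λN·Δt.

```python
# A minimal sketch of the 'difference equation' idea behind computational
# science: the decay equation dN/dt = -lam*N is replaced by the update
# N[i+1] = N[i] - lam*N[i]*dt (forward Euler) and stepped numerically, then
# compared with the exact solution N0*exp(-lam*t). Step size and parameters
# are illustrative assumptions.
import math

lam, N0 = 0.5, 1000.0     # decay constant and initial amount
dt, t_end = 0.01, 10.0    # time step and total simulated time

N = N0
steps = round(t_end / dt)
for _ in range(steps):
    N -= lam * N * dt     # the difference-equation update

exact = N0 * math.exp(-lam * t_end)
print(f"numerical N(10) = {N:.3f},  exact N(10) = {exact:.3f}")
```

Shrinking the step size Δt brings the numerical answer arbitrarily close to the analytical one, which is the whole point of the approach.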

Big Data theory is already a field of research in its own right. It is a set of generalized principles that explain the foundations, knowledge, and methods used in the practice of data-driven science (Big Data Theory | SpringerLink).

Not only ‘Big Data theory’, but also theories from Big Data! Yes, physical theories. ‘The Next Einstein: New AI Can Develop New Theories of Physics’ is the title of an article by Juelich (2024), in which the work of Merger et al. (2023) is described. Many scientists produce a large amount of data through their research. We may call them data producers. Then, once in a while, comes along a smart scientist who is able to see in the published data a common trend or pattern of great fundamental importance, leading to a leap of progress in science. A great example is that of J. C. Maxwell, whose Maxwell equations are textbook stuff. He not only unified the work of many earlier scientists, but also discovered ‘displacement current’ in the process. Some other examples are those of Newton, Einstein, and (P. W.) Anderson. They were data users rather than data producers. Efforts are afoot at present to develop AI that can play the game-changing role of the data-user type of scientist. The work of Merger et al. (2023) is a step in that direction. Their paper has the title ‘Learning Interacting Theories from Data’. I quote from its Abstract: ‘One challenge of physics is to explain how collective properties arise from microscopic interactions. Indeed, interactions form the building blocks of almost all physical theories and are described by polynomial terms in the action. The traditional approach is to derive these terms from elementary processes and then use the resulting model to make predictions for the entire system. But what if the underlying processes are unknown? Can we reverse the approach and learn the microscopic action by observing the entire system? We use invertible neural networks (INNs) to first learn the observed data distribution. By the choice of a suitable nonlinearity for the neuronal activation function, we are then able to compute the action from the weights of the trained model; a diagrammatic language expresses the change of the action from layer to layer. This process uncovers how the network hierarchically constructs interactions via nonlinear transformations of pairwise relations.’

In other words, it is a top-down approach to doing science, rather than a bottom-up one. The observed data lie at the top of the hierarchy; at the bottom are the microscopic interactions that gave rise to them. Machine learning has been used to generate a theory without any prior knowledge about the nature of the microscopic interactions involved. I quote from their paper again: ‘Key to this approach is the use of a generative neural network, which maps a complicated data distribution to a simpler one. By decomposing this mapping into interactions between simpler features, we can better understand how and why models make predictions. We hence unravel the complex, hierarchical structure that has been learned by a neural network and explain it in a form that is central to physics: interactions between degrees of freedom’.
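To give a flavour of how a neural network can be exactly invertible, here is a minimal numpy sketch of an affine ‘coupling layer’, a standard building block of RealNVP-style invertible networks. This is a generic illustration with random, untrained weights, not the specific architecture or training procedure of Merger et al. (2023): half of the input passes through unchanged, the other half is scaled and shifted by quantities computed from the first half, so the mapping can be undone exactly.

```python
# A minimal numpy sketch of an affine 'coupling layer', the generic building
# block of invertible neural networks. Weights are random and untrained; this
# is an illustration of invertibility, not the model of Merger et al. (2023).
import numpy as np

rng = np.random.default_rng(0)
d = 4                                         # data dimension (even, so it splits in half)
W = rng.normal(scale=0.5, size=(d, d // 2))   # weights of a tiny fixed 'subnetwork'

def subnet(u):
    """Map the first half of the input to a log-scale s and a shift t."""
    h = np.tanh(W @ u)                        # nonlinearity chosen for illustration
    return h[: d // 2], h[d // 2:]            # s (log-scale), t (shift)

def forward(x):
    x1, x2 = x[: d // 2], x[d // 2:]
    s, t = subnet(x1)
    return np.concatenate([x1, x2 * np.exp(s) + t])   # x1 passes through unchanged

def inverse(z):
    z1, z2 = z[: d // 2], z[d // 2:]
    s, t = subnet(z1)                         # z1 equals x1, so s and t can be recomputed
    return np.concatenate([z1, (z2 - t) * np.exp(-s)])

x = rng.normal(size=d)
z = forward(x)
print("max reconstruction error:", np.abs(inverse(z) - x).max())   # ~1e-16: exactly invertible
```

Because every layer is invertible in this way, the weights of the trained network retain, in principle, full information about the mapping from data back to the underlying description, which is what allows the action to be read off from the trained model.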

INNs are potentially applicable in a variety of other fields as well: genomics, epidemiology, condensed-matter physics, astrophysics, climate modelling, ecology, economics, sociology, and neuroscience.

‘Big Data marked a break in the evolution of information systems from three points of view: the explosion of available data, the increasing variety of these data, and their constant renewal. Processing these data demands more than just computing power. It requires a complete break from Cartesian logic. It calls for the non-scientific part of human thought: inductive reasoning’ (Malle 2013).

‘Big Data, distributed computing and sophisticated data analysis all played a crucial role in the discovery of the Higgs boson—and perhaps in finding new ‘patterns’ they might also generate new hypotheses in this field. But the discovery of the Higgs boson was not data-driven. The collider experiments were mostly driven by theoretical predictions: It is because scientists were attempting to confirm the Standard Model of elementary particles that the discovery of the Higgs boson—the only missing piece—could occur’ (Mazzocchi 2015).

==

References cited

Anderson, C. (2008). ‘The end of theory: The data deluge makes the scientific method obsolete’. Wired magazine, 16(7), 16. The End of Theory: The Data Deluge Makes the Scientific Method Obsolete (statlit.org)

BasuMallik, C. (2022). ‘What Is Big Data? Definition, Types, Importance, and Best Practices’. What Is Big Data? Definition and Best Practices (spiceworks.com).

Calude, C.S. and G. Longo (2017). ‘The Deluge of Spurious Correlations in Big Data’. Foundations of Science, 22, 595–612. https://doi.org/10.1007/s10699-016-9489-4. The Deluge of Spurious Correlations in Big Data | Foundations of Science (springer.com)

Malle, J.-P. (2013). ‘Big Data: Farewell to Cartesian Thinking?’. Paris-Tech Review. http://www.paristechreview.com/2013/03/15/big-data-cartesian-thinking/

Mazzocchi, F. (2015). ‘Could big data be the end of theory in science?: A few remarks on the epistemology of data-driven science’. EMBO reports, 16(10), 1250–1255. Could Big Data be the end of theory in science? (embopress.org)

Hassanien, A. E. et al. (Eds.) (2015). Big Data in Complex Systems: Challenges and Opportunities. Big Data in Complex Systems: Challenges and Opportunities | SpringerLink.

Juelich, F. (March 12, 2024). ‘The Next Einstein: New AI Can Develop New Theories of Physics’. SciTechDaily. The Next Einstein: New AI Can Develop New Theories of Physics (scitechdaily.com)

Karbalayghareh, A., X. Qian, and E. R. Dougherty (2018). ‘Optimal Bayesian transfer learning’. IEEE Transactions on Signal Processing, 66 (14).

Meng, X. L. (2014). ‘A trio of inference problems that could win you a Nobel prize in statistics (if you help fund it)’. In X. Lin, C. Genest, D. L. Banks, G. Molenberghs, D. W. Scott, and J. -L. Wang (Eds.):  Past, Present, and Future of Statistical Science, pp. 537–562. CRC Press, Boca Raton, FL.

Merger, C. et al. (2023). ‘Learning Interacting Theories from Data’. Physical Review X, 13, 041033. DOI: 10.1103/PhysRevX.13.041033.

Succi, S. and P. V. Coveney (2019). ‘Big data: the end of the scientific method?’. Philosophical Transactions of the Royal Society A, 377(2142), 20180145. 1807.09515 (arxiv.org)

Tiao, S. (2024). ‘What is Big data?’. What Is Big Data? | Oracle

Wadhawan, V. K. (2018). Understanding Natural Phenomena: Self-Organization and Emergence in Complex Systems. CreateSpace Independent Publishing Platform, SC, USA.
