Rice University statistician Genevera Allen issued a grave warning at the annual meeting of the American Association for the Advancement of Science (AAAS) this week: scientists are leaning on machine learning algorithms to find patterns in data even when the algorithms are merely fixating on noise that won’t be reproduced by another experiment.
“There is general recognition of a reproducibility crisis in science right now,” she told the BBC. “I would venture to argue that a huge part of that does come from the use of machine learning techniques in science.”
The problem, according to Allen, can arise when scientists collect a large amount of genome data and then use poorly understood machine learning algorithms to find clusters of similar genomic profiles.
“Often these studies are not found out to be inaccurate until there’s another real big dataset that someone applies these techniques to and says ‘oh my goodness, the results of these two studies don’t overlap,'” she told the BBC.
The problem with machine learning, according to Allen, is that it’s built to find patterns even where none exist. The solution, she suspects, will come from next-generation algorithms that are better able to evaluate the reliability of their own predictions.
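The failure mode Allen describes is easy to reproduce. Here's a minimal sketch (our illustration, not an example from Allen's talk): a tiny one-dimensional k-means run on pure uniform noise. The algorithm dutifully returns k "clusters" no matter what, even though the data contain no real structure, and a second random sample would yield entirely different groupings.

```python
import random

def kmeans_1d(points, k=3, iters=20, seed=0):
    """Toy 1-D k-means: partitions `points` into k groups by nearest center."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers from the data
    for _ in range(iters):
        # Assign each point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: abs(p - centers[i]))
            groups[idx].append(p)
        # Recompute each center as its group's mean (keep old center if empty).
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups

# Pure uniform noise: by construction, there are no true clusters here.
rng = random.Random(42)
noise = [rng.random() for _ in range(300)]

centers, groups = kmeans_1d(noise, k=3)
print(len(groups))  # k-means still reports exactly 3 "clusters"
```

Nothing in the algorithm asks whether a three-cluster description is *warranted*; it only optimizes the partition it was told to produce, which is why such results can fail to overlap across datasets.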
“The question is, ‘Can we really trust the discoveries that are currently being made using machine-learning techniques applied to large data sets?'” Allen said in a press release. “The answer in many situations is probably, ‘Not without checking,’ but work is underway on next-generation machine-learning systems that will assess the uncertainty and reproducibility of their predictions.”
READ MORE: AAAS: Machine learning ‘causing science crisis’ [BBC]
More on machine learning: Should Coma Patients Live or Die? Machine Learning Will Help Decide.