Are computers smart enough to make scientific discoveries? Research by the US Department of Energy’s Lawrence Berkeley National Laboratory suggests the answer may be “yes”. It shows that an algorithm with no training in materials science can scan the text of millions of papers and uncover new scientific knowledge, according to a team led by Anubhav Jain.
They collected 3.3 million abstracts of published materials-science papers and fed them into an algorithm called Word2vec, which analysed relationships between words. The algorithm was then able to predict discoveries of new thermoelectric materials years in advance, and to suggest as-yet unknown materials as candidates for thermoelectric applications.
The experiment included having the algorithm perform tasks “in the past”; that is, giving it abstracts up to a certain year, then assessing how its predictions panned out.
“Without telling it anything about materials science, it learned concepts like the periodic table and the crystal structure of metals,” says Jain.
“That hinted at the potential of the technique. But probably the most interesting thing we figured out is, you can use this algorithm to address gaps in materials research, things that people should study but haven’t studied so far.”
The findings are published in the journal Nature.
The team collected abstracts from papers published in more than 1000 journals between 1922 and 2018. The algorithm turned each of the approximately 500,000 distinct words in those abstracts into a 200-dimensional vector, or an array of 200 numbers.
“What’s important is not each number but using the numbers to see how words are related to one another,” says Jain.
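That idea of "using the numbers to see how words are related" is usually measured with cosine similarity: words whose vectors point in nearly the same direction occur in similar contexts. A minimal sketch of the calculation is below; the vectors here are randomly generated stand-ins (real Word2vec embeddings are learned from the corpus, and the word names are purely illustrative):

```python
import math
import random

random.seed(0)

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: close to 1.0 means the
    words appear in near-identical contexts; near 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 200-dimensional vectors (invented for illustration):
# two "related" words share most of their components.
base = [random.gauss(0, 1) for _ in range(200)]
vec_thermoelectric = [x + random.gauss(0, 0.1) for x in base]
vec_seebeck = [x + random.gauss(0, 0.1) for x in base]
vec_unrelated = [random.gauss(0, 1) for _ in range(200)]

print(cosine_similarity(vec_thermoelectric, vec_seebeck))   # high, close to 1
print(cosine_similarity(vec_thermoelectric, vec_unrelated)) # near 0
```

In the Berkeley work, a high similarity between a material's name and the word "thermoelectric", even when the two never co-occurred in any abstract, is what flagged the material as a candidate.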
When trained on materials science text, the algorithm was able to learn the meaning of scientific terms and concepts such as the crystal structure of metals based simply on the positions of the words in the abstracts and their co-occurrence with other words.
It was even able to learn the relationships between elements on the periodic table when the vector for each chemical element was projected onto two dimensions.
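In a well-trained embedding, a relationship like "next element down a group of the periodic table" tends to show up as a roughly constant vector offset, which is why the projected elements arrange themselves sensibly. A toy sketch of that analogy arithmetic, with small hand-invented vectors rather than real 200-dimensional embeddings:

```python
# Hedged sketch: in real Word2vec embeddings, analogies such as
# "lithium is to sodium as sodium is to potassium" appear as nearly
# parallel vector offsets. These 3-dimensional vectors are invented
# for illustration only.

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def nearest(query, vocab):
    """Return the word whose vector is closest (squared Euclidean) to query."""
    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(query, v))
    return min(vocab, key=lambda w: dist(vocab[w]))

# Invented vectors: the alkali metals are spaced along one direction.
vocab = {
    "lithium":   [1.0, 0.0, 0.2],
    "sodium":    [1.0, 1.0, 0.2],
    "potassium": [1.0, 2.0, 0.2],
    "oxygen":    [0.0, 0.5, 3.0],
}

# "sodium" plus the offset ("sodium" - "lithium") lands on "potassium".
query = add(vocab["sodium"], sub(vocab["sodium"], vocab["lithium"]))
print(nearest(query, vocab))  # potassium
```

The researchers' two-dimensional projection is the same idea viewed geometrically: compressing the 200 dimensions down to 2 makes these consistent offsets visible as a periodic-table-like layout.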
The researchers say the project was motivated by the difficulty scientists have making sense of the overwhelming amount of published studies.
“In every research field there’s 100 years of past research literature, and every week dozens more studies come out,” says Berkeley’s Gerbrand Ceder.
“A researcher can access only a fraction of that. We thought, can machine learning do something to make use of all this collective knowledge in an unsupervised manner, without needing guidance from human researchers?”