Massive public database of over 2,000 languages created

June 16, 2022

Evrim Yazgin

Cosmos science journalist

Apart from their compilation of stories – which have become a pillar of Western culture – the brothers Grimm (Wilhelm and Jacob) were intensely interested in linguistics. Jacob in particular made a significant contribution in his book Deutsche Grammatik, published in 1819, in which he documented the relationships among Indo-European languages.

The similarities between languages have raised the possibility that we might follow the links between linguistic families down the language tree all the way to some root language (see the 1989 essay “Grimm’s Greatest Tale” by Stephen Jay Gould for further discussion). Other questions surround the possible parallel evolution of languages and their diversity.

Now, a team of linguists, computational scientists and psychologists at the Max Planck Institute for Evolutionary Anthropology in Germany has created a massive public database to study these and other questions about the evolution and diversity of language.

They present their research in a paper published in the journal Scientific Data.

“When our Department of Linguistic and Cultural Evolution was founded in 2014, I presented my colleagues with an ambitious goal. There are more than 7,000 languages in the world: create databases with the most extensive documentation of the linguistic diversity as possible,” says the paper’s co-author and Max Planck director Russell Gray.

“Our inspiration came from Genbank – a large genetic database where biologists from all over the world have deposited genomic data,” Gray continues. “Genbank was a game changer. The large amount of freely available sequence data revolutionised the ways we can analyse biological diversity. We hope that the first of our global linguistic databases, Lexibank, will start to revolutionise our knowledge of linguistic diversity in a similar way.”

Lexibank stores data in the form of standardised wordlists for more than 2,000 language varieties.

“The work on Lexibank coincided with a push towards more consistent data formats in linguistic databases. Thus, Lexibank can serve both as a large-scale example of the benefits of standardisation and a catalyst for further standardisation,” reports co-author Robert Forkel, who led the computational part of the data collection. “We decided to create our own standards, called Cross-Linguistic Data Formats, which have now been used successfully in a multitude of projects in which our department is involved.”

“We have designed new computer-assisted workflows that enable existing language datasets to be made comparable,” says co-author Johann-Mattis List, who led the practical data curation. “With these workflows, we have dramatically increased the efficiency of data standardisation and data curation.”

Using new computational techniques, the team showed how languages are alike or differ according to 60 different criteria.

“Thanks to our standardised representation of language data, it is now easy to check how many languages use words like ‘mama’ and ‘papa’ for ‘mother’ and ‘father’,” says List.

“It turns out that this pattern can indeed be found in many languages of the world and in very different regions,” adds Simon J. Greenhill, one of the founders of the Lexibank project. “Since all the languages with this pattern are not closely related to each other, it reflects independent parallel evolution, just as the great linguist Roman Jakobson suggested in 1968.”
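To give a feel for the kind of check List describes, here is a minimal Python sketch. It is not the Lexibank workflow: the language names and word forms are invented, and the two regular expressions (a "mother" word starting with "ma" or "na", a "father" word starting with "pa", "ba", "ta" or "da") are an assumed stand-in for whatever criteria the researchers actually used.

```python
import re

# Invented toy forms for illustration only; none of this is Lexibank data.
KIN_TERMS = {
    "LangA": {"MOTHER": "mama", "FATHER": "papa"},
    "LangB": {"MOTHER": "nana", "FATHER": "tata"},
    "LangC": {"MOTHER": "eki", "FATHER": "oru"},
    "LangD": {"MOTHER": "ama", "FATHER": "suli"},
}

# Assumed, simplified criteria (hypothetical, not taken from the paper).
MOTHER_LIKE = re.compile(r"^[mn]a")
FATHER_LIKE = re.compile(r"^[pbtd]a")


def has_mama_papa_pattern(terms):
    """Return True if both kin terms match the assumed sound patterns."""
    mother = terms.get("MOTHER", "")
    father = terms.get("FATHER", "")
    return bool(MOTHER_LIKE.match(mother) and FATHER_LIKE.match(father))


# Languages in the toy table that show the pattern.
matches = sorted(
    lang for lang, terms in KIN_TERMS.items() if has_mama_papa_pattern(terms)
)
print(f"{len(matches)} of {len(KIN_TERMS)} toy languages match: {matches}")
```

Because every language is stored in the same shape, the check is a single loop over the table, which is the practical benefit of standardised wordlists that List points to.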

Other patterns that the dataset and computational tools have found warrant further probing, say the authors.

“When investigating which languages use the same word for ‘arm’ and ‘hand’, we found that these languages typically also use the same word for ‘leg’ and ‘foot’,” List reports. “While this may seem to be a silly coincidence, it shows that the lexicon of human languages is often much more structured than one might assume when investigating one language in isolation.”
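The arm/hand and leg/foot comparison can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: the rows below are invented, the (language, concept, form) layout is a simplified stand-in for a standardised wordlist, and the function name `languages_sharing_word` is hypothetical.

```python
from collections import defaultdict

# Toy wordlist rows: (language, concept, form). All values are invented
# for illustration and are not taken from Lexibank.
ROWS = [
    ("LangA", "ARM", "kala"), ("LangA", "HAND", "kala"),
    ("LangA", "LEG", "tepu"), ("LangA", "FOOT", "tepu"),
    ("LangB", "ARM", "mino"), ("LangB", "HAND", "sut"),
    ("LangB", "LEG", "rab"), ("LangB", "FOOT", "oke"),
    ("LangC", "ARM", "wi"), ("LangC", "HAND", "wi"),
    ("LangC", "LEG", "dano"), ("LangC", "FOOT", "dano"),
    ("LangD", "ARM", "pel"), ("LangD", "HAND", "pel"),
    ("LangD", "LEG", "gor"), ("LangD", "FOOT", "nim"),
]


def languages_sharing_word(rows, concept_a, concept_b):
    """Return the languages that use at least one identical form for both concepts."""
    forms = defaultdict(dict)
    for language, concept, form in rows:
        forms[language].setdefault(concept, set()).add(form)
    return {
        language
        for language, by_concept in forms.items()
        if by_concept.get(concept_a, set()) & by_concept.get(concept_b, set())
    }


arm_hand = languages_sharing_word(ROWS, "ARM", "HAND")
leg_foot = languages_sharing_word(ROWS, "LEG", "FOOT")
both = arm_hand & leg_foot

print("Same word for arm and hand:", sorted(arm_hand))
print("Same word for leg and foot:", sorted(leg_foot))
print("Both patterns together:", sorted(both))
```

With real data, the interesting question is how much larger the overlap between the two sets is than chance would predict, which the toy table is far too small to show.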

The researchers say the next phase of the project will be the expansion of their dataset, and probing further questions on linguistic diversity and language evolution. “Nobody thinks that the analysis must stop with the examples we give in our paper,” says List. “On the contrary, we hope that linguists, psychologists, and evolutionary scientists will feel encouraged to build on our example by expanding the data and developing new methods,” adds Forkel.
