US researchers have turned to the Bible to develop an algorithm that can convert written works into different styles for different audiences.
They were inspired, in this case, by the fact that it is “a large, previously untapped dataset of aligned parallel text”.
The collection of more than 31,000 verses allowed the team from Dartmouth College in New Hampshire to produce around 1.5 million unique pairings of source and target verses for machine-learning training sets.
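The pairing step the researchers describe can be sketched in a few lines. The verse texts and version names below are a hypothetical miniature corpus, not the team's actual data: each verse that appears in several versions contributes every ordered (source, target) combination as a training example.

```python
from itertools import permutations

# Hypothetical miniature corpus: one verse rendered in three
# stylistically different versions (version name -> verse text).
versions = {
    "KJV": "In the beginning God created the heaven and the earth.",
    "BBE": "At the first God made the heaven and the earth.",
    "WEB": "In the beginning, God created the heavens and the earth.",
}

# Each ordered pair of versions yields one (source, target) training
# example, so n versions of a verse give n * (n - 1) unique pairings.
pairs = [(versions[a], versions[b]) for a, b in permutations(versions, 2)]

print(len(pairs))  # 3 versions -> 6 ordered pairs
```

Multiplied across tens of thousands of verses and dozens of versions, this simple combinatorial expansion is how a modest corpus becomes millions of training pairs.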
Internet tools that translate between languages are widely available, but translators that keep text in the same language but transform the style have been slower to emerge, in part because of difficulties acquiring the necessary amount of appropriate data.
In the past, alternatives as diverse as the works of Shakespeare or entries in Wikipedia have been used, but researchers Keith Carlson, Allen Riddell and Daniel Rockmore say the resulting datasets are either much smaller or not as well suited to the task.
The Bible has the added benefit of being thoroughly indexed by the consistent use of book, chapter and verse numbers. Predictable organisation of the text across versions eliminates the risk of alignment errors caused by automatic methods of matching different versions of the same text.
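The advantage of that consistent numbering can be illustrated with a sketch (the verse tables here are hypothetical stand-ins, not the study's data): because every version shares the same book, chapter and verse references, alignment reduces to an exact key join rather than a fuzzy sentence-matching problem.

```python
# Hypothetical miniature verse tables keyed by (book, chapter, verse).
kjv = {
    ("Genesis", 1, 1): "In the beginning God created the heaven and the earth.",
    ("Genesis", 1, 2): "And the earth was without form, and void.",
}
bbe = {
    ("Genesis", 1, 1): "At the first God made the heaven and the earth.",
}

# Shared reference numbering makes alignment an exact key intersection;
# no probabilistic alignment algorithm is needed.
aligned = {ref: (kjv[ref], bbe[ref]) for ref in kjv.keys() & bbe.keys()}

print(len(aligned))  # only Genesis 1:1 appears in both tables
```

Verses missing from one version simply drop out of the intersection, so the join never produces a misaligned pair.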
“Humans have been performing the task of organising Bible texts for centuries, so we didn’t have to put our faith into less reliable alignment algorithms,” says computer scientist Rockmore.
To define “style” for the study, the researchers referenced sentence length, the use of passive or active voices, and word choice that could result in texts with varying degrees of simplicity or formality.
“Different wording may convey different levels of politeness or familiarity with the reader, display different cultural information about the writer, be easier to understand for certain populations,” they write in a paper published in the journal Royal Society Open Science.
Rockmore and colleagues used 34 stylistically distinct versions of the Bible that vary in linguistic complexity. The texts were fed into two algorithms: a statistical machine translation system called Moses and a neural network framework commonly used in machine translation, called Seq2Seq.
The hope is that such systems can ultimately be developed to translate the style of any written text for a chosen audience. A passage from Moby Dick, for example, could be rendered for young readers or for non-native English speakers.
“Text simplification is only one specific type of style transfer,” says Carlson. “More broadly, our systems aim to produce text with the same meaning as the original, but to do so with different words.”