Computers have become much more adept at translating from one language into another in recent years, thanks to the application of neural networks. However, these AI systems usually need a large amount of human-translated content to learn from. Two new papers now demonstrate that it's possible to develop a system that doesn't rely on such parallel texts.
Mikel Artetxe, a computer scientist at the University of the Basque Country (UPV) and the author of one of these papers, compares the situation to giving someone various books in Chinese and various books in Arabic, without any of the same texts overlapping. A human would find it very difficult to learn how to translate from Chinese into Arabic in this scenario, but a computer might not.
In a typical machine-learning process, the AI system would be supervised. This means that it would make its attempt at the right answer for any given problem, a human would tell it whether or not that's correct, and it would amend its activity as needed. That isn't the case with these two papers.
Instead, they hinge on the fact that words are connected in similar ways across different languages – for instance, 'table' and 'chair' are frequently used together, no matter the language. By mapping out these connections for each language and then comparing the maps, it's possible to get a decent idea of which terms correspond to one another. This process is not supervised by a human.
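As a rough illustration of comparing those connection maps, the Python sketch below matches words across two languages using only the pattern of similarities each word has within its own language. It is a toy under stated assumptions, not the method from either paper: the two-dimensional vectors are invented, the "French" space is fabricated as an exact rotation of the English one, and the function names are hypothetical. Real systems learn their vectors from large monolingual corpora, where the two maps line up only approximately.

```python
import math

def rotate(vec, angle):
    """Rotate a 2-D vector; used here only to fabricate the second language."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def similarity_profile(word, vectors):
    """Sorted dot products between `word` and every word in the same language."""
    wx, wy = vectors[word]
    return sorted(wx * ox + wy * oy for ox, oy in vectors.values())

def match_words(src_vectors, tgt_vectors):
    """Pair each source word with the target word whose profile is closest."""
    tgt_profiles = {w: similarity_profile(w, tgt_vectors) for w in tgt_vectors}
    matches = {}
    for src_word in src_vectors:
        src_profile = similarity_profile(src_word, src_vectors)
        matches[src_word] = min(
            tgt_profiles,
            key=lambda t: sum(
                (a - b) ** 2 for a, b in zip(src_profile, tgt_profiles[t])
            ),
        )
    return matches

# Invented toy vectors: 'table' and 'chair' sit close together, as do
# 'river' and 'boat'.
english = {
    "table": (1.0, 0.2),
    "chair": (0.9, 0.4),
    "river": (-0.3, 1.0),
    "boat": (-0.5, 0.8),
}
# Assumption: the French space has the same geometry, merely rotated.
labels = {"table": "table", "chair": "chaise", "river": "fleuve", "boat": "bateau"}
french = {labels[w]: rotate(v, 1.1) for w, v in english.items()}

matches = match_words(english, french)
```

No dictionary is consulted at any point: the pairing comes entirely from how each word relates to its neighbours, which is what lets the approach work without parallel text.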
The systems can translate full sentences, rather than just individual words, using two complementary training strategies. In back translation, a sentence in one language is roughly translated into the other and then back into the first, with the system adjusting its parameters if the result doesn't match the original. Denoising is a similar process, but instead of passing through the other language, the sentence is corrupted, for example by removing or rearranging words, and the system tries to reconstruct the original. Working in tandem, these methods give the machine a better grasp of how language is structured.
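The two training signals can be sketched in a few lines of Python. This is a simplified illustration rather than either team's code: the word-for-word dictionaries stand in for a neural translator, the mismatch count stands in for a proper training loss, and every name and default value below is an assumption made for the example.

```python
import random

def add_noise(words, drop_prob=0.1, max_shift=2, rng=None):
    """Corrupt a sentence: drop some words, then shuffle the rest locally."""
    rng = rng or random.Random(0)
    kept = [w for w in words if rng.random() >= drop_prob]
    if not kept and words:
        kept = [words[0]]  # never return an empty sentence
    # Sort by position plus a small random offset, so each word can only
    # drift a few slots from where it started.
    keyed = [(i + rng.uniform(0, max_shift), w) for i, w in enumerate(kept)]
    return [w for _, w in sorted(keyed)]

def round_trip_mismatches(sentence, to_target, to_source):
    """Translate to the other language and back; count words that changed."""
    restored = to_source(to_target(sentence))
    changed = sum(a != b for a, b in zip(sentence, restored))
    return changed + abs(len(sentence) - len(restored))

# Hypothetical toy translators; FR_EN is deliberately imperfect.
EN_FR = {"the": "le", "cat": "chat", "sleeps": "dort"}
FR_EN = {"le": "the", "chat": "cat", "dort": "sleep"}

def en_to_fr(words):
    return [EN_FR.get(w, w) for w in words]

def fr_to_en(words):
    return [FR_EN.get(w, w) for w in words]

sentence = ["the", "cat", "sleeps"]
mismatches = round_trip_mismatches(sentence, en_to_fr, fr_to_en)
noisy = add_noise(sentence, rng=random.Random(42))
```

Here the round trip returns "sleep" instead of "sleeps", so one mismatch is counted; a trainable system would use that kind of error to adjust itself. The noisy sentence plays the same role for denoising, with the original sentence as the target to recover.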
The two systems – one developed at UPV and the other by Facebook computer scientist Guillaume Lample – are yet to be peer-reviewed, but they have shown promising results in early testing.
The only way to compare the two systems directly is through their performance translating between English and French, on text drawn from a shared pool of around 30 million sentences. Both achieved a bilingual evaluation understudy (BLEU) score of around 15.
Google Translate, which uses supervised machine learning, scores around 40 by this measure, whereas human translators can score 50. However, the unsupervised scores are a significant improvement over basic, word-for-word translation.
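For readers curious about what a bilingual evaluation understudy (BLEU) score measures, the Python sketch below computes a simplified version for a single sentence: it checks how many word sequences of length one to four in the machine's output also appear in a human reference translation, and penalises output that is too short. It is an illustration only. Published scores are calculated over a whole test set, often against several references, and the function names here are made up for the example.

```python
import math
from collections import Counter

def ngrams(words, n):
    """Count every run of n consecutive words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU on a 0-100 scale (no smoothing)."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clip each n-gram count by how often it appears in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish candidates shorter than the reference.
    if len(candidate) >= len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(candidate))
    return 100 * penalty * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat".split()
candidate = "the cat sat on a mat".split()
score = bleu(candidate, reference)
```

Changing a single word in this six-word sentence drops the score from 100 to roughly 54, which gives a sense of how demanding the measure is and why even skilled human translators do not approach a perfect score.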
Indeed, the researchers behind both papers agree that each system could be improved by drawing on the other's work. The systems could also be made more capable if they were semi-supervised, by adding a few thousand parallel sentences to their training data – which would still cut down on the time and data required to learn the ropes compared with fully supervised training.