
No reliance on English data: Facebook launches M2M-100, a model that translates directly between any of 100 languages

雷锋网 ·  Nov 19, 2020 11:24

Original title: No need to rely on English data: 100 languages translate directly into one another as Facebook launches the "M2M-100" model

Translator: AI Research Society



For the first time, Facebook AI has introduced a multilingual machine translation (MMT) model, M2M-100, that can translate between any pair of 100 languages without relying on English data. The project has been open-sourced.

Because English training data is so abundant, earlier Chinese-French translation systems would train separate Chinese-English and English-French models and pivot through English. The model presented here is trained directly on Chinese-French data, which preserves semantics better: measured by BLEU, it scores roughly 10 points higher than English-pivoted systems.
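
To make "direct translation without an English pivot" concrete, here is a minimal sketch that runs the publicly released M2M-100 weights through the Hugging Face transformers wrappers. The wrapper classes and the small facebook/m2m100_418M checkpoint are assumptions for illustration only; the article itself points to the fairseq release, and the production model described here is far larger.

```python
# Minimal sketch: direct Chinese -> French translation with a released
# M2M-100 checkpoint, no English pivot. Assumes the Hugging Face
# `transformers` port of M2M-100 (facebook/m2m100_418M), not the 15B
# model described in the article.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

zh_text = "生活就像一盒巧克力。"   # source sentence (Chinese)
tokenizer.src_lang = "zh"          # tell the tokenizer the source language
encoded = tokenizer(zh_text, return_tensors="pt")

# Force the decoder to start with the French language token,
# so the model translates zh -> fr directly.
generated = model.generate(
    **encoded, forced_bos_token_id=tokenizer.get_lang_id("fr")
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```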

In total, M2M-100 is trained on 2,200 language directions, 10 times more than the best previous English-centric multilingual models. The M2M-100 model will help with translation for billions of people, and the gains are most significant for low-resource languages.
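
BLEU, the metric quoted above and throughout this article, can be computed on held-out translations with an off-the-shelf scorer. The sketch below uses the sacrebleu package as an assumed tool; the article does not say which implementation Facebook used.

```python
# Minimal sketch: scoring system output with corpus-level BLEU using
# sacrebleu (an assumption; the article does not name a specific tool).
import sacrebleu

hypotheses = ["The cat sits on the mat.", "He reads a book every evening."]
# One reference stream: references[0][i] corresponds to hypotheses[i].
references = [["The cat is sitting on the mat.", "He reads a book each evening."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```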

After years of work on machine translation, Facebook AI has reached this milestone. Below we describe the research in detail, including the translation training data gathered for 100 languages and the model and training details. We are also open-sourcing the model and releasing its training and evaluation setup so that other researchers can reproduce the work and build on it to further advance multilingual models.

Machine translation (MT) can break down language barriers, bring together people who speak different languages, and deliver authoritative information about COVID-19 to help people avoid infection. Thanks to our recent advances in low-resource machine translation and translation quality evaluation, we are now able to serve nearly 20 billion translations per day on Facebook News Feed.

A typical MT system builds separate translation models for different languages and tasks. That approach does not scale for Facebook, where billions of posts are published in more than 160 languages. Current multilingual systems can handle multiple languages at once, but they lose accuracy by using English data as the bridge between the source and target languages. We therefore need a true multilingual machine translation (MMT) model that can translate directly between any pair of languages, which will serve our community better.

We have been researching MT at Facebook for many years, and we can now proudly announce that we have built, for the first time, a large-scale MMT model that translates directly between 100 different languages without relying on English as a pivot. Our multilingual model performs no worse than traditional bilingual models, and scores as much as 10 BLEU points higher than English-centric multilingual models.

Using a novel mining strategy, we constructed the first truly "many-to-many" translation dataset, with 7.5 billion sentence pairs covering 100 languages. On top of it, we built a general model with 15 billion parameters that captures information from related languages and learns more diverse linguistic and morphological features. See the open-source release linked below.

Mining hundreds of millions of training sentences in different languages

One of the biggest obstacles to building a many-to-many MMT model is the training data: direct, high-quality translation data between each pair of languages, rather than data that passes through English. In reality, Chinese-English and English-French translation data are far easier to obtain than direct French-Chinese data. Moreover, the amount of data required grows roughly quadratically with the number of supported languages: if each direction needs 10M sentence pairs, then 10 languages need about 1B sentence pairs and 100 languages need about 100B.
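
The scaling argument above can be written out in a few lines: with a fixed budget of sentence pairs per translation direction, the total grows roughly quadratically with the number of languages. The 10M-pairs-per-direction figure is the article's own example; the script below is just illustrative arithmetic.

```python
# Minimal sketch of the data-scaling argument: with a fixed budget of
# sentence pairs per translation direction, total data grows roughly
# quadratically with the number of supported languages.
PAIRS_PER_DIRECTION = 10_000_000  # 10M sentence pairs per direction (the article's example)

for n_languages in (10, 50, 100):
    directions = n_languages * (n_languages - 1)   # ordered pairs, excluding self-translation
    total_pairs = directions * PAIRS_PER_DIRECTION
    print(f"{n_languages:>3} languages -> {directions:>5} directions, "
          f"{total_pairs / 1e9:.1f}B sentence pairs")
```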

Building a many-to-many MMT dataset of 7.5 billion sentence pairs across 100 languages is a daunting task, but it became feasible because we have accumulated data-mining resources over the years, including ccAligned, ccMatrix, and LASER. For this project we created LASER 2.0 and improved fastText language identification to raise mining quality, and the related training and evaluation scripts will be open-sourced. All of this data, of course, is open source and legal to use.

The many-to-many multilingual model proposed by Facebook AI is the fruit of years of research, with groundbreaking MT models, data resources, and optimization techniques; this article focuses on the major results. We built a huge training dataset by mining ccNet, which builds on fastText; CCMatrix, which uses LASER to embed sentences in a multilingual embedding space; and CCAligned, which aligns documents based on URL matches. Building on these, we developed the improved LASER 2.0.
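
As a rough picture of what LASER-style mining does, the sketch below embeds sentences from two languages into a shared space and ranks candidate pairs with a margin-based cosine score. The encode() function is a hypothetical stand-in for a real multilingual encoder such as LASER, and the scoring details are heavily simplified; none of this is the actual CCMatrix pipeline.

```python
# Heavily simplified sketch of margin-based bitext mining in a shared
# multilingual embedding space (the idea behind LASER/CCMatrix-style mining).
# `encode` is a hypothetical stand-in for a real multilingual sentence encoder.
import numpy as np

def encode(sentences):
    # Placeholder: a real system would call a multilingual encoder here.
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(sentences), 128))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def margin_scores(src_emb, tgt_emb, k=2):
    # Cosine similarity between every source/target sentence pair.
    sim = src_emb @ tgt_emb.T
    # Average similarity to each side's k nearest neighbours (the "margin").
    src_knn = np.sort(sim, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    tgt_knn = np.sort(sim, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return sim / (0.5 * (src_knn + tgt_knn))

src = ["一个例句。", "另一个例句。"]
tgt = ["Une phrase d'exemple.", "Une autre phrase."]
scores = margin_scores(encode(src), encode(tgt), k=2)
# Keep the best-scoring target for each source sentence as a candidate pair.
for i, j in enumerate(scores.argmax(axis=1)):
    print(src[i], "<->", tgt[j], f"(score {scores[i, j]:.2f})")
```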

Even with advanced technology such as LASER 2.0, mining training data for any pair among 100 different languages (4,450 possible language pairs) requires enormous computation. Given the scale of the data, and to keep the mining manageable, we focused first on the languages with the most translation requests: we prioritized mining directions by data size and data quality, and deprioritized the rarest language pairs, such as Icelandic-Nepali or Sinhala-Javanese.

Next, we introduced a new bridge-language mining strategy that divides the languages into 14 groups based on geographic and cultural similarity. The rationale is that people in the same country or region communicate with one another more, so the translation data is of higher quality. For example, the languages spoken in India form one group, including Bengali, Hindi, Marathi, Nepali, Tamil, and Urdu. Our system mines all language pairs within each group.

To connect the languages of different groups, we chose a small number of bridge languages from each group, usually one to three major languages. In the example above, Hindi, Bengali, and Tamil serve as the bridge languages for the Indo-Aryan group. We then mined data for all 2,200 bridge-language combinations in parallel, ultimately obtaining a training set of 7.5 billion sentence pairs. Because mined data can train both directions of a language pair (e.g., en->fr and fr->en), this strategy lets us mine sparsely yet still cover all 100x100 (9,900 in total) directions with one model.
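
The bridge-language strategy can be illustrated by enumerating which unordered pairs actually get mined: every pair inside a group, plus cross-group pairs that involve bridge languages. The tiny groups and bridge choices below are made up for illustration and are not Facebook's real 14-group partition.

```python
# Minimal sketch of the bridge-language mining strategy: mine every pair
# inside a language group, and connect groups only through bridge languages.
# The groups and bridges below are illustrative, not the real 14-group split.
from itertools import combinations

groups = {
    "india":   ["hi", "bn", "ta", "mr", "ne", "ur"],
    "romance": ["fr", "es", "it", "pt"],
    "sinitic": ["zh", "yue"],
}
bridges = {"india": ["hi", "bn", "ta"], "romance": ["fr", "es"], "sinitic": ["zh"]}

mined = set()
# 1) all pairs within each group
for langs in groups.values():
    mined.update(combinations(sorted(langs), 2))
# 2) across groups, only pairs of bridge languages
all_bridges = sorted(b for bs in bridges.values() for b in bs)
for a, b in combinations(all_bridges, 2):
    mined.add(tuple(sorted((a, b))))

all_langs = sorted(l for ls in groups.values() for l in ls)
total_pairs = len(list(combinations(all_langs, 2)))
print(f"mined {len(mined)} of {total_pairs} possible unordered pairs")
```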

Parallel mining also yields some low-quality, low-resource translation data. To address this, we augment such data with back-translation, the method that helped us take first place in the WMT international machine translation competitions in 2018 and 2019. Concretely, if the goal is to train a Chinese-to-French model, we first train a French-to-Chinese model and use it to back-translate monolingual French into synthetic Chinese. We found this method especially effective at large scale (hundreds of millions of sentences). In this work we used back-translated synthetic data to augment the mined dataset, and we also used back-translation to create training data for language pairs with no labeled data at all.
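
The back-translation recipe above amounts to a small data-augmentation loop: a reverse-direction model turns monolingual target-side text into synthetic source sentences, and the synthetic pairs are appended to the real training data. In the sketch below, translate_fr_to_zh is a hypothetical stand-in for a trained French-to-Chinese model; it is not part of any released code.

```python
# Minimal sketch of back-translation for a zh -> fr system.
# `translate_fr_to_zh` is a hypothetical stand-in for an already trained
# reverse-direction (fr -> zh) model; plug any real model in here.
from typing import Callable, List, Tuple

def back_translate(
    monolingual_fr: List[str],
    translate_fr_to_zh: Callable[[str], str],
    real_pairs: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return the real (zh, fr) pairs augmented with synthetic ones."""
    synthetic = [(translate_fr_to_zh(fr), fr) for fr in monolingual_fr]
    # The synthetic source side is noisy, but the target side is genuine
    # French, which is what makes back-translation effective at scale.
    return real_pairs + synthetic

# Usage with a dummy reverse "model":
pairs = back_translate(
    ["Bonjour le monde."],
    translate_fr_to_zh=lambda s: "<synthetic zh for: " + s + ">",
    real_pairs=[("你好,世界。", "Bonjour le monde.")],
)
print(pairs)
```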

Overall, compared with a model trained only on mined data, combining the bridge strategy with back-translated training data improved BLEU by about 1.7 on average across the 100 back-translated directions. With this rich, high-quality training set, a many-to-many translation model becomes possible.

We also found that zero-shot translation works remarkably well for language pairs that have no training data. For example, if a model is trained only on French-English and German-Swedish data, zero-shot translation lets it translate between French and Swedish. Our M2M-100 model likewise shows that, for language pairs without training data, a multilingual model translating zero-shot outperforms a multilingual model that pivots through English.

The MMT model: 15 billion parameters for fast, accurate translation

One challenge in multilingual translation is that a single model must capture information from many different languages. The usual remedy is to scale up the model and add parameters dedicated to specific languages. At the same time, a model trained on very large amounts of data accumulates some parameters that are irrelevant to a given translation task; pruning them both compresses the model and keeps those parameters from interfering with translation. When we scaled the model up to 12 billion parameters, BLEU improved by about 1.2 on average across language directions, but further dense scaling brought diminishing returns. In the end, the general multilingual translation model contains 12 billion dense parameters plus 3.2 billion sparse, language-specific parameters, for a final model of 15 billion parameters.
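
One way to picture "dense shared parameters plus sparse language-specific parameters" is a shared trunk with small per-language adapters that are only active for the language being translated. The PyTorch toy below is a conceptual sketch under that assumption, not the actual 15B-parameter M2M-100 architecture.

```python
# Conceptual PyTorch sketch of a shared trunk plus sparse, language-specific
# parameters: only the adapter of the current language is used in a forward
# pass. This is an illustration, not the real M2M-100 15B architecture.
import torch
import torch.nn as nn

class SharedTrunkWithLanguageAdapters(nn.Module):
    def __init__(self, d_model=512, languages=("zh", "fr", "hi")):
        super().__init__()
        # Dense parameters shared across all languages.
        self.trunk = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU())
        # Sparse parameters: one small adapter per language, used on demand.
        self.adapters = nn.ModuleDict({
            lang: nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, d_model))
            for lang in languages
        })

    def forward(self, x, lang):
        h = self.trunk(x)
        return h + self.adapters[lang](h)   # residual language-specific refinement

model = SharedTrunkWithLanguageAdapters()
x = torch.randn(2, 512)
print(model(x, lang="fr").shape)   # torch.Size([2, 512])
```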

[Figure: BLEU comparison of a bilingual baseline, an English-centric multilingual model, and M2M-100 models with 1.2 billion and 12 billion parameters]

The figure above compares this model with bilingual baseline models and with a multilingual model that pivots through English. The first row is a 1.2-billion-parameter baseline composed of 24 encoder layers and 24 decoder layers, and the second row is the English-pivoted multilingual model. The following rows are M2M-100 models with 1.2 billion and 12 billion parameters; the larger model improves BLEU by 1.2.

By increasing the number of Transformer layers and the width of each layer, we trained a much larger model that still trains efficiently and converges quickly. Notably, this many-to-many system is the first to use Fairscale, a new PyTorch library for pipeline and tensor parallelism. We built a general architecture on Fairscale to train large models in parallel, avoiding the limits of a single GPU, and we applied the ZeRO optimizer, intra-layer model parallelism, and pipeline model parallelism to speed up training.
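
To show what pipeline-style model parallelism means in practice, the sketch below splits a toy network across two GPUs by hand. Fairscale's pipeline utilities automate this (adding micro-batching and scheduling), but to avoid depending on its exact API the snippet sticks to plain PyTorch; treat it as an illustration only.

```python
# Minimal hand-rolled model-parallel sketch: the first half of the network
# lives on cuda:0 and the second half on cuda:1, so a model too large for
# one GPU can still run. Fairscale's pipeline utilities automate and
# optimize this pattern; this is only a conceptual illustration.
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self, d=1024):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(d, d), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Sequential(nn.Linear(d, d), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        h = self.stage0(x.to("cuda:0"))
        return self.stage1(h.to("cuda:1"))   # activations hop between devices

if torch.cuda.device_count() >= 2:
    model = TwoStageModel()
    out = model(torch.randn(8, 1024))
    print(out.device)   # cuda:1
```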

However, a multilingual translation model with 12 billion parameters is not enough; we want a model that is both more accurate and more efficient. Much existing work relies on multi-model ensembling, training several models and applying them all to the same source sentence. To reduce the complexity and compute of training multiple models, we introduce a multi-source self-ensembling technique that translates a source sentence into multiple languages to improve translation quality. Drawing on LayerDrop and Depth-Adaptive training, we train a model with a common trunk plus sets of parameters dedicated to specific languages. This approach lets the model be divided into blocks by language pair or language family, which suits a many-to-many model very well. Finally, by combining the dense multilingual model parameters (12B) with language-specific parameters (about 3B), our model gains the capacity of a large model while still specializing for individual languages.
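
LayerDrop, mentioned above, is structured dropout over whole layers: during training each layer is skipped with some probability, which later makes it possible to prune sub-networks at inference time. The toy stack below shows only that core idea; it is not the fairseq implementation.

```python
# Minimal sketch of the LayerDrop idea: during training, each layer in the
# stack is skipped with probability p, so the model learns to be robust to
# dropped layers (and can be pruned at inference). Toy code, not fairseq's.
import torch
import torch.nn as nn

class LayerDropStack(nn.Module):
    def __init__(self, num_layers=6, d=256, p_drop=0.2):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, d), nn.ReLU()) for _ in range(num_layers)]
        )
        self.p_drop = p_drop

    def forward(self, x):
        for layer in self.layers:
            # At training time, randomly skip whole layers (structured dropout).
            if self.training and torch.rand(1).item() < self.p_drop:
                continue
            x = x + layer(x)   # residual connection keeps skipped layers harmless
        return x

stack = LayerDropStack()
print(stack(torch.randn(4, 256)).shape)   # torch.Size([4, 256])
```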

Try to break down the barriers between different languages

For years, artificial intelligence researchers have been trying to build a single universal model that can understand all languages. A universal model that supports every language and dialect would serve everyone better; reliable translation would break down language barriers for billions of people and give them a more equal understanding of the world. This work brings us closer to that goal.

Over years of research we have made rapid progress in pretrained language models, fine-tuning, and self-supervised learning, and the results are encouraging. This line of work will further improve our systems' ability to use unlabeled data to understand text in low-resource languages. For example, XLM-R is a powerful multilingual model that can learn from data in a single language and then generalize to 100 languages; mBART was one of the first models pretrained for multilingual BART tasks; and we recently proposed CRISS, a new self-supervised method that uses unlabeled data in many different languages to mine parallel sentences across languages and iteratively train better multilingual models.

We will continue to follow cutting-edge developments, absorb the latest techniques, and explore MT system deployment and more specialized computing architectures in order to keep improving our translation models.

GitHub

https://github.com/pytorch/fairseq/tree/master/examples/m2m_100

The AI Research Society is an online community for technical exchange between young AI academics and AI developers. In cooperation with universities, academic institutions, and industry, we provide learning, practice, and job-hunting services, aiming to build a one-stop platform for exchange, mutual assistance, and career development for young AI academics and developers, and to become the largest gathering place for scientific and technological innovation talent in China.

If you are also an AI enthusiast who loves to share, you are welcome to join the translation station to learn new things and grow with us.

Lei Feng Net's copyrighted articles may not be reprinted without authorization. For details, please see the reprint instructions.



