
No reliance on English data: Facebook launches M2M-100, a model that translates directly between any of 100 languages

雷锋网 ·  Nov 19, 2020 11:24

Original title: No need to rely on English data: 100 languages translate directly into one another as Facebook launches the "M2M-100" model

Translator: AI Research Society



For the first time, Facebook AI has introduced a multilingual machine translation (MMT) model, M2M-100, that can translate between any pair of 100 languages without relying on English data. The project has been open-sourced.

Because English training data is so abundant, earlier Chinese-French translation systems would train separate Chinese-English and English-French models and pivot through English. The model presented here is trained directly on Chinese-French data, which preserves semantics better: measured by BLEU, it scores roughly 10 points higher than English-pivoted systems.
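
To make "direct translation without an English pivot" concrete, here is a minimal sketch that runs the publicly released M2M-100 weights through the Hugging Face transformers wrappers. The wrapper classes and the small facebook/m2m100_418M checkpoint are assumptions for illustration only; the article itself points to the fairseq release, and the production model described here is far larger.

```python
# Minimal sketch: direct Chinese -> French translation with a released
# M2M-100 checkpoint, no English pivot. Assumes the Hugging Face
# `transformers` port of M2M-100 (facebook/m2m100_418M), not the 15B
# model described in the article.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

zh_text = "生活就像一盒巧克力。"   # source sentence (Chinese)
tokenizer.src_lang = "zh"          # tell the tokenizer the source language
encoded = tokenizer(zh_text, return_tensors="pt")

# Force the decoder to start with the French language token,
# so the model translates zh -> fr directly.
generated = model.generate(
    **encoded, forced_bos_token_id=tokenizer.get_lang_id("fr")
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```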

In total, M2M-100 is trained on 2,200 language directions, 10 times more than the best previous English-centric multilingual models. The M2M-100 model will help with translation for billions of people, and the gains are most significant for low-resource languages.
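
BLEU, the metric quoted above and throughout this article, can be computed on held-out translations with an off-the-shelf scorer. The sketch below uses the sacrebleu package as an assumed tool; the article does not say which implementation Facebook used.

```python
# Minimal sketch: scoring system output with corpus-level BLEU using
# sacrebleu (an assumption; the article does not name a specific tool).
import sacrebleu

hypotheses = ["The cat sits on the mat.", "He reads a book every evening."]
# One reference stream: references[0][i] corresponds to hypotheses[i].
references = [["The cat is sitting on the mat.", "He reads a book each evening."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```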

After years of work on machine translation, Facebook AI has reached this milestone. Below we describe the research in detail, including the translation training data gathered for 100 languages and the model and training details. We are also open-sourcing the model and releasing its training and evaluation setup so that other researchers can reproduce the work and build on it to further advance multilingual models.

Machine translation (MT) can break down language barriers, bring together people who speak different languages, and deliver authoritative information about COVID-19 to help people avoid infection. Thanks to our recent advances in low-resource machine translation and translation quality evaluation, we are now able to serve nearly 20 billion translations per day on Facebook News Feed.

A typical MT system builds separate translation models for different languages and tasks. That approach does not scale for Facebook, where billions of posts are published in more than 160 languages. Current multilingual systems can handle multiple languages at once, but they lose accuracy by using English data as the bridge between the source and target languages. We therefore need a true multilingual machine translation (MMT) model that can translate directly between any pair of languages, which will serve our community better.

We have been researching MT at Facebook for many years, and we can now proudly announce that we have built, for the first time, a large-scale MMT model that translates directly between 100 different languages without relying on English as a pivot. Our multilingual model performs no worse than traditional bilingual models, and scores as much as 10 BLEU points higher than English-centric multilingual models.

Using a novel mining strategy, we constructed the first truly "many-to-many" translation dataset, with 7.5 billion sentence pairs covering 100 languages. On top of it, we built a general model with 15 billion parameters that captures information from related languages and learns more diverse linguistic and morphological features. See the open-source release linked below.

Mining hundreds of millions of training sentences in different languages

One of the biggest obstacles to building a many-to-many MMT model is the training data: direct, high-quality translation data between each pair of languages, rather than data that passes through English. In reality, Chinese-English and English-French translation data are far easier to obtain than direct French-Chinese data. Moreover, the amount of data required grows roughly quadratically with the number of supported languages: if each direction needs 10M sentence pairs, then 10 languages need about 1B sentence pairs and 100 languages need about 100B.
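
The scaling argument above can be written out in a few lines: with a fixed budget of sentence pairs per translation direction, the total grows roughly quadratically with the number of languages. The 10M-pairs-per-direction figure is the article's own example; the script below is just illustrative arithmetic.

```python
# Minimal sketch of the data-scaling argument: with a fixed budget of
# sentence pairs per translation direction, total data grows roughly
# quadratically with the number of supported languages.
PAIRS_PER_DIRECTION = 10_000_000  # 10M sentence pairs per direction (the article's example)

for n_languages in (10, 50, 100):
    directions = n_languages * (n_languages - 1)   # ordered pairs, excluding self-translation
    total_pairs = directions * PAIRS_PER_DIRECTION
    print(f"{n_languages:>3} languages -> {directions:>5} directions, "
          f"{total_pairs / 1e9:.1f}B sentence pairs")
```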

Building a many-to-many MMT dataset of 7.5 billion sentence pairs across 100 languages is a daunting task, but it became feasible because we have accumulated data-mining resources over the years, including ccAligned, ccMatrix, and LASER. For this project we created LASER 2.0 and improved fastText language identification to raise mining quality, and the related training and evaluation scripts will be open-sourced. All of this data, of course, is open source and legal to use.

The many-to-many multilingual model proposed by Facebook AI is the fruit of years of research, with groundbreaking MT models, data resources, and optimization techniques; this article focuses on the major results. We built a huge training dataset by mining ccNet, which builds on fastText; CCMatrix, which uses LASER to embed sentences in a multilingual embedding space; and CCAligned, which aligns documents based on URL matches. Building on these, we developed the improved LASER 2.0.
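
As a rough picture of what LASER-style mining does, the sketch below embeds sentences from two languages into a shared space and ranks candidate pairs with a margin-based cosine score. The encode() function is a hypothetical stand-in for a real multilingual encoder such as LASER, and the scoring details are heavily simplified; none of this is the actual CCMatrix pipeline.

```python
# Heavily simplified sketch of margin-based bitext mining in a shared
# multilingual embedding space (the idea behind LASER/CCMatrix-style mining).
# `encode` is a hypothetical stand-in for a real multilingual sentence encoder.
import numpy as np

def encode(sentences):
    # Placeholder: a real system would call a multilingual encoder here.
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(sentences), 128))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def margin_scores(src_emb, tgt_emb, k=2):
    # Cosine similarity between every source/target sentence pair.
    sim = src_emb @ tgt_emb.T
    # Average similarity to each side's k nearest neighbours (the "margin").
    src_knn = np.sort(sim, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    tgt_knn = np.sort(sim, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return sim / (0.5 * (src_knn + tgt_knn))

src = ["一个例句。", "另一个例句。"]
tgt = ["Une phrase d'exemple.", "Une autre phrase."]
scores = margin_scores(encode(src), encode(tgt), k=2)
# Keep the best-scoring target for each source sentence as a candidate pair.
for i, j in enumerate(scores.argmax(axis=1)):
    print(src[i], "<->", tgt[j], f"(score {scores[i, j]:.2f})")
```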

Even with advanced technology such as LASER 2.0, mining training data for any pair among 100 different languages (4,450 possible language pairs) requires enormous computation. Given the scale of the data, and to keep the mining manageable, we focused first on the languages with the most translation requests: we prioritized mining directions by data size and data quality, and deprioritized the rarest language pairs, such as Icelandic-Nepali or Sinhala-Javanese.

Next, we introduced a new bridge-language mining strategy that divides the languages into 14 groups based on geographic and cultural similarity. The rationale is that people in the same country or region communicate with one another more, so the translation data is of higher quality. For example, the languages spoken in India form one group, including Bengali, Hindi, Marathi, Nepali, Tamil, and Urdu. Our system mines all language pairs within each group.

To connect the languages of different groups, we chose a small number of bridge languages from each group, usually one to three major languages. In the example above, Hindi, Bengali, and Tamil serve as the bridge languages for the Indo-Aryan group. We then mined data for all 2,200 bridge-language combinations in parallel, ultimately obtaining a training set of 7.5 billion sentence pairs. Because mined data can train both directions of a language pair (e.g., en->fr and fr->en), this strategy lets us mine sparsely yet still cover all 100x100 (9,900 in total) directions with one model.
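
The bridge-language strategy can be illustrated by enumerating which unordered pairs actually get mined: every pair inside a group, plus cross-group pairs that involve bridge languages. The tiny groups and bridge choices below are made up for illustration and are not Facebook's real 14-group partition.

```python
# Minimal sketch of the bridge-language mining strategy: mine every pair
# inside a language group, and connect groups only through bridge languages.
# The groups and bridges below are illustrative, not the real 14-group split.
from itertools import combinations

groups = {
    "india":   ["hi", "bn", "ta", "mr", "ne", "ur"],
    "romance": ["fr", "es", "it", "pt"],
    "sinitic": ["zh", "yue"],
}
bridges = {"india": ["hi", "bn", "ta"], "romance": ["fr", "es"], "sinitic": ["zh"]}

mined = set()
# 1) all pairs within each group
for langs in groups.values():
    mined.update(combinations(sorted(langs), 2))
# 2) across groups, only pairs of bridge languages
all_bridges = sorted(b for bs in bridges.values() for b in bs)
for a, b in combinations(all_bridges, 2):
    mined.add(tuple(sorted((a, b))))

all_langs = sorted(l for ls in groups.values() for l in ls)
total_pairs = len(list(combinations(all_langs, 2)))
print(f"mined {len(mined)} of {total_pairs} possible unordered pairs")
```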

Parallel mining also yields some low-quality, low-resource translation data. To address this, we augment such data with back-translation, the method that helped us take first place in the WMT international machine translation competitions in 2018 and 2019. Concretely, if the goal is to train a Chinese-to-French model, we first train a French-to-Chinese model and use it to back-translate monolingual French into synthetic Chinese. We found this method especially effective at large scale (hundreds of millions of sentences). In this work we used back-translated synthetic data to augment the mined dataset, and we also used back-translation to create training data for language pairs with no labeled data at all.
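
The back-translation recipe above amounts to a small data-augmentation loop: a reverse-direction model turns monolingual target-side text into synthetic source sentences, and the synthetic pairs are appended to the real training data. In the sketch below, translate_fr_to_zh is a hypothetical stand-in for a trained French-to-Chinese model; it is not part of any released code.

```python
# Minimal sketch of back-translation for a zh -> fr system.
# `translate_fr_to_zh` is a hypothetical stand-in for an already trained
# reverse-direction (fr -> zh) model; plug any real model in here.
from typing import Callable, List, Tuple

def back_translate(
    monolingual_fr: List[str],
    translate_fr_to_zh: Callable[[str], str],
    real_pairs: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return the real (zh, fr) pairs augmented with synthetic ones."""
    synthetic = [(translate_fr_to_zh(fr), fr) for fr in monolingual_fr]
    # The synthetic source side is noisy, but the target side is genuine
    # French, which is what makes back-translation effective at scale.
    return real_pairs + synthetic

# Usage with a dummy reverse "model":
pairs = back_translate(
    ["Bonjour le monde."],
    translate_fr_to_zh=lambda s: "<synthetic zh for: " + s + ">",
    real_pairs=[("你好,世界。", "Bonjour le monde.")],
)
print(pairs)
```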

Overall, compared with a model trained only on mined data, combining the bridge strategy with back-translated training data improved BLEU by about 1.7 on average across the 100 back-translated directions. With this rich, high-quality training set, a many-to-many translation model becomes possible.

We also found that zero-shot translation works remarkably well for language pairs that have no training data. For example, if a model is trained only on French-English and German-Swedish data, zero-shot translation lets it translate between French and Swedish. Our M2M-100 model likewise shows that, for language pairs without training data, a multilingual model translating zero-shot outperforms a multilingual model that pivots through English.

The MMT model: 15 billion parameters for fast, accurate translation

One challenge in multilingual translation is that a single model must capture information from many different languages. The usual remedy is to scale up the model and add parameters dedicated to specific languages. At the same time, a model trained on very large amounts of data accumulates some parameters that are irrelevant to a given translation task; pruning them both compresses the model and keeps those parameters from interfering with translation. When we scaled the model up to 12 billion parameters, BLEU improved by about 1.2 on average across language directions, but further dense scaling brought diminishing returns. In the end, the general multilingual translation model contains 12 billion dense parameters plus 3.2 billion sparse, language-specific parameters, for a final model of 15 billion parameters.
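
One way to picture "dense shared parameters plus sparse language-specific parameters" is a shared trunk with small per-language adapters that are only active for the language being translated. The PyTorch toy below is a conceptual sketch under that assumption, not the actual 15B-parameter M2M-100 architecture.

```python
# Conceptual PyTorch sketch of a shared trunk plus sparse, language-specific
# parameters: only the adapter of the current language is used in a forward
# pass. This is an illustration, not the real M2M-100 15B architecture.
import torch
import torch.nn as nn

class SharedTrunkWithLanguageAdapters(nn.Module):
    def __init__(self, d_model=512, languages=("zh", "fr", "hi")):
        super().__init__()
        # Dense parameters shared across all languages.
        self.trunk = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU())
        # Sparse parameters: one small adapter per language, used on demand.
        self.adapters = nn.ModuleDict({
            lang: nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, d_model))
            for lang in languages
        })

    def forward(self, x, lang):
        h = self.trunk(x)
        return h + self.adapters[lang](h)   # residual language-specific refinement

model = SharedTrunkWithLanguageAdapters()
x = torch.randn(2, 512)
print(model(x, lang="fr").shape)   # torch.Size([2, 512])
```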

[Figure: BLEU comparison of a bilingual baseline, an English-centric multilingual model, and M2M-100 models with 1.2 billion and 12 billion parameters]

The figure above compares this model with bilingual baseline models and with a multilingual model that pivots through English. The first row is a 1.2-billion-parameter baseline composed of 24 encoder layers and 24 decoder layers, and the second row is the English-pivoted multilingual model. The following rows are M2M-100 models with 1.2 billion and 12 billion parameters; the larger model improves BLEU by 1.2.

By increasing the number of Transformer layers and the width of each layer, we trained a much larger model that still trains efficiently and converges quickly. Notably, this many-to-many system is the first to use Fairscale, a new PyTorch library for pipeline and tensor parallelism. We built a general architecture on Fairscale to train large models in parallel, avoiding the limits of a single GPU, and we applied the ZeRO optimizer, intra-layer model parallelism, and pipeline model parallelism to speed up training.
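
To show what pipeline-style model parallelism means in practice, the sketch below splits a toy network across two GPUs by hand. Fairscale's pipeline utilities automate this (adding micro-batching and scheduling), but to avoid depending on its exact API the snippet sticks to plain PyTorch; treat it as an illustration only.

```python
# Minimal hand-rolled model-parallel sketch: the first half of the network
# lives on cuda:0 and the second half on cuda:1, so a model too large for
# one GPU can still run. Fairscale's pipeline utilities automate and
# optimize this pattern; this is only a conceptual illustration.
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self, d=1024):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(d, d), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Sequential(nn.Linear(d, d), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        h = self.stage0(x.to("cuda:0"))
        return self.stage1(h.to("cuda:1"))   # activations hop between devices

if torch.cuda.device_count() >= 2:
    model = TwoStageModel()
    out = model(torch.randn(8, 1024))
    print(out.device)   # cuda:1
```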

However, a multilingual translation model with 12 billion parameters is not enough; we want a model that is both more accurate and more efficient. Much existing work relies on multi-model ensembling, training several models and applying them all to the same source sentence. To reduce the complexity and compute of training multiple models, we introduce a multi-source self-ensembling technique that translates a source sentence into multiple languages to improve translation quality. Drawing on LayerDrop and Depth-Adaptive training, we train a model with a common trunk plus sets of parameters dedicated to specific languages. This approach lets the model be divided into blocks by language pair or language family, which suits a many-to-many model very well. Finally, by combining the dense multilingual model parameters (12B) with language-specific parameters (about 3B), our model gains the capacity of a large model while still specializing for individual languages.
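
LayerDrop, mentioned above, is structured dropout over whole layers: during training each layer is skipped with some probability, which later makes it possible to prune sub-networks at inference time. The toy stack below shows only that core idea; it is not the fairseq implementation.

```python
# Minimal sketch of the LayerDrop idea: during training, each layer in the
# stack is skipped with probability p, so the model learns to be robust to
# dropped layers (and can be pruned at inference). Toy code, not fairseq's.
import torch
import torch.nn as nn

class LayerDropStack(nn.Module):
    def __init__(self, num_layers=6, d=256, p_drop=0.2):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, d), nn.ReLU()) for _ in range(num_layers)]
        )
        self.p_drop = p_drop

    def forward(self, x):
        for layer in self.layers:
            # At training time, randomly skip whole layers (structured dropout).
            if self.training and torch.rand(1).item() < self.p_drop:
                continue
            x = x + layer(x)   # residual connection keeps skipped layers harmless
        return x

stack = LayerDropStack()
print(stack(torch.randn(4, 256)).shape)   # torch.Size([4, 256])
```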

Try to break down the barriers between different languages

For years, artificial intelligence researchers have been trying to build a single universal model that can understand all languages. A universal model that supports every language and dialect would serve everyone better; reliable translation would break down language barriers for billions of people and give them a more equal understanding of the world. This work brings us closer to that goal.

Over years of research we have made rapid progress in pretrained language models, fine-tuning, and self-supervised learning, and the results are encouraging. This line of work will further improve our systems' ability to use unlabeled data to understand text in low-resource languages. For example, XLM-R is a powerful multilingual model that can learn from data in a single language and then generalize to 100 languages; mBART was one of the first models pretrained for multilingual BART tasks; and we recently proposed CRISS, a new self-supervised method that uses unlabeled data in many different languages to mine parallel sentences across languages and iteratively train better multilingual models.

We will continue to follow cutting-edge developments, absorb the latest techniques, and explore MT system deployment and more specialized computing architectures in order to keep improving our translation models.

GitHub

https://github.com/pytorch/fairseq/tree/master/examples/m2m_100

The AI Research Society is an online community for technical exchange between young AI academics and AI developers. In cooperation with universities, academic institutions, and industry, we provide learning, practice, and job-hunting services, aiming to build a one-stop platform for exchange, mutual assistance, and career development for young AI academics and developers, and to become the largest gathering place for scientific and technological innovation talent in China.

If you are also an AI enthusiast who loves to share, you are welcome to join the translation station to learn new things and grow with us.

Lei Feng Net's copyrighted articles may not be reprinted without authorization. For details, please see the reprint instructions.



