A statistical approach was initially chosen for developing the machine translation system. This set two priority tasks: building a morphological analyzer for the Tatar language, to make the models less sensitive to the complexity of Tatar morphology, and accumulating the parallel Russian-Tatar sentences needed to train statistical models. The work on a statistical Russian-Tatar translator using phrase-based MT technology resulted in the publicly available translator from Yandex, launched in 2015. The first version of Yandex.Translate for this language pair was trained using, among other resources, the morphological analyzer and parallel corpus developed at the Institute.
However, the greatest improvement in the quality of machine translation systems in recent years has come from neural network approaches. Starting in 2016, machine translation systems based on neural networks surpassed statistical systems in quality. Since then, many companies have developed neural versions of their translators for the most popular language pairs. To create a machine translation system that performs well for the low-resource Tatar-Russian language pair, the latest achievements in machine learning were used. The current system combines tools for supplementing the training data, text preprocessing algorithms, and an attention-based translation algorithm.
Building a Tatar-Russian translation system requires a sufficiently large parallel corpus. The limiting factor is the small number of sources from which such a corpus can be built. One of the main sources of bilingual information is the websites of ministries and other government agencies. Another source is literary works: printed books with available translations. The collected texts were aligned using the ABBYY Aligner 2.0 tool and filtered according to the following criteria: both the source and translated sentences must contain between 1 and 80 words, and duplicate sentences were removed.
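The two filtering criteria above can be illustrated with a minimal sketch. It assumes that words are counted by whitespace splitting and that a duplicate means an exact repeat of the same sentence pair; the text does not specify either detail, and the function name `filter_pairs` is hypothetical.

```python
from typing import Iterable, List, Tuple


def filter_pairs(
    pairs: Iterable[Tuple[str, str]],
    min_words: int = 1,
    max_words: int = 80,
) -> List[Tuple[str, str]]:
    """Keep aligned pairs where both sides have 1..80 words; drop exact duplicates."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Assumption: a "word" is a whitespace-separated token.
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if not (min_words <= n_src <= max_words):
            continue
        if not (min_words <= n_tgt <= max_words):
            continue
        # Assumption: duplicates are identical (source, target) pairs.
        key = (src, tgt)
        if key in seen:
            continue
        seen.add(key)
        kept.append(key)
    return kept
```

Calling `filter_pairs` on a list of `(tatar, russian)` tuples returns the surviving pairs in their original order, which keeps the output aligned with any document-level metadata.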
Main stages of preparing the parallel corpus for training the Tatar-Russian translation direction:
Formation of a base corpus based on Internet resources and literary works;
Filtering, automatic alignment, and manual correction of results (328 thousand sentence pairs);
Building intermediate versions of the Russian-Tatar translator;
Manual and semi-automatic translation of Russian-language news texts (189 thousand sentence pairs);
Building the final version of the Russian-Tatar translator;
Expanding the training corpus using the back-translation algorithm: an additional 409 thousand parallel sentence pairs were automatically prepared using the Russian-Tatar translator;
Building the final version of the Tatar-Russian translator.
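The back-translation stage in the list above can be sketched as follows. This is an illustration under stated assumptions, not the project's pipeline: it assumes the input is monolingual Russian text, that each synthetic Tatar sentence is paired with its original Russian sentence, and that the three corpus parts are simply concatenated. `toy_ru_to_tt` is a hypothetical stand-in for the real Russian-Tatar translator.

```python
from typing import Callable, Iterable, List, Tuple

# Corpus sizes stated in the text, in thousands of sentence pairs.
CORPUS_PARTS_K = {
    "aligned_web_and_literature": 328,
    "translated_news": 189,
    "back_translated": 409,
}
# Assumption: the final training set is the plain union of the three parts.
TOTAL_K = sum(CORPUS_PARTS_K.values())


def back_translate(
    russian_sentences: Iterable[str],
    ru_to_tt: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build (synthetic Tatar source, original Russian target) training pairs."""
    synthetic = []
    for ru in russian_sentences:
        tt = ru_to_tt(ru)
        if tt.strip():  # skip sentences the translator returned nothing for
            synthetic.append((tt, ru))
    return synthetic


def toy_ru_to_tt(sentence: str) -> str:
    """Hypothetical word-for-word stand-in for a trained Russian-Tatar model."""
    lexicon = {"привет": "сәлам", "мир": "дөнья"}
    return " ".join(lexicon.get(w, w) for w in sentence.lower().split())
```

The design point is that the human-written Russian side stays on the target side, so the Tatar-Russian model learns to produce fluent Russian even when the synthetic Tatar input is noisy.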
Tatar is an agglutinative language with rich morphology, so a limited vocabulary size and limited training data lead to a large number of out-of-vocabulary words. To overcome this problem, subword units built with the byte-pair encoding (BPE) algorithm were used as the basic translation units. The model for splitting words into subword units was applied to the combined Russian-Tatar corpus.
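A minimal sketch of byte-pair encoding follows, to show how frequent character sequences are merged into subword units. It is a simplified version of the commonly described algorithm, not the project's implementation: the end-of-word marker `</w>`, the tie-breaking rule, and the function names are assumptions, and learning the merges on one joint word list is one way to read "combined corpus".

```python
from collections import Counter
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def merge_pair(symbols: Tuple[str, ...], pair: Pair) -> Tuple[str, ...]:
    """Replace every left-to-right occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words: Iterable[str], num_merges: int) -> List[Pair]:
    """Learn up to `num_merges` merges from a (joint Russian + Tatar) word list."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges: List[Pair] = []
    for _ in range(num_merges):
        pair_counts: Counter = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab: Counter = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word: str, merges: List[Pair]) -> List[str]:
    """Split a word into subword units by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Because any unseen word can still be split down to single characters, a word form that never occurred in training is represented by known subword units instead of an out-of-vocabulary token.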
The results of building a machine translation system for the Russian-Tatar language pair show that modern neural network algorithms and approaches can solve the translation task at a sufficiently high level. The prepared training corpus of parallel Russian-Tatar texts, together with the data augmentation methods, makes further research in this area possible. The resulting translation system achieves a higher BLEU score than the only other existing translation system for this language pair, the one from Yandex.
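For readers unfamiliar with the metric, the sketch below shows how a corpus-level BLEU score is typically computed: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It assumes whitespace tokenization, one reference per sentence, and no smoothing; the text does not say which BLEU implementation or settings were used for the comparison, so this is illustrative only.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count all n-grams of order n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Unsmoothed corpus-level BLEU on a 0..100 scale, one reference per sentence."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ng, r_ng = ngram_counts(h, n), ngram_counts(r, n)
            # Clipped matches: each n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((h_ng & r_ng).values())
            totals[n - 1] += sum(h_ng.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes output that is shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)
```

Two systems are compared by scoring their outputs against the same reference translations of the same test set; scores computed with different tokenization or test data are not comparable.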
The project is carried out within the framework of the State Program "Preservation, Study, and Development of the State Languages of the Republic of Tatarstan and Other Languages in the Republic of Tatarstan for 2014 – 2020."
Last updated: 8 December 2025, 16:42
All content on this site is licensed under
Creative Commons Attribution 4.0 International