The "Tugan Tel" National Corpus is a linguistic resource for the modern literary Tatar language, intended for a wide range of users: linguists, specialists in Tatar, Turkic, and general linguistics, typologists, teachers of Tatar, cultural figures, and anyone who studies or is interested in the language. The electronic corpus is a core component of a software toolkit for research and development in Turkic languages.
As part of the project, several models of Tatar morphology have been developed: generative, paradigmatic, and hybrid. The generative model, based on inflection rules, provides complete wordform analysis: it takes full account of the language's agglutinative nature and recognizes wordforms of potentially unlimited length. A second, morpheme-based analyzer is under active development within the project; it is intended to support a multifunctional, multilingual internet service built on the structural-parametric functional model of the Turkic morpheme.
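To illustrate what a rule-based generative approach means for an agglutinative language, here is a minimal sketch in Python. It is not the project's analyzer: the function names, the tag names (PL, LOC), and the two suffix rules are illustrative assumptions, and the rules are a deliberately simplified account of Tatar vowel harmony and consonant assimilation that ignores loanword exceptions and every other suffix.

```python
# Toy rule-based generator/analyzer for two Tatar noun suffixes.
# All names and rules here are simplified assumptions for illustration only.

BACK_VOWELS = set("аоуы")
FRONT_VOWELS = set("әөүеиэ")
NASALS = set("мнң")
VOICELESS = set("пфктшсчщхһц")


def is_back(stem: str) -> bool:
    """Harmony class is taken from the last vowel of the stem."""
    for ch in reversed(stem):
        if ch in BACK_VOWELS:
            return True
        if ch in FRONT_VOWELS:
            return False
    return True  # no vowel found: default to back (arbitrary choice)


def plural(stem: str) -> str:
    """Plural: -лар/-ләр, or -нар/-нәр after a nasal consonant."""
    onset = "н" if stem[-1] in NASALS else "л"
    return stem + onset + ("ар" if is_back(stem) else "әр")


def locative(stem: str) -> str:
    """Locative: -да/-дә, or -та/-тә after a voiceless consonant."""
    onset = "т" if stem[-1] in VOICELESS else "д"
    return stem + onset + ("а" if is_back(stem) else "ә")


RULES = {"PL": plural, "LOC": locative}


def generate(stem, tags):
    """Apply suffix rules left to right; each rule sees the form built so far."""
    form = stem
    for tag in tags:
        form = RULES[tag](form)
    return form


def analyze(wordform, lexicon,
            tag_sequences=((), ("PL",), ("LOC",), ("PL", "LOC"))):
    """Analysis by generation: keep every (stem, tags) pair that yields the form."""
    return [(stem, tags)
            for stem in lexicon
            for tags in tag_sequences
            if generate(stem, tags) == wordform]
```

For example, generate("китап", ["PL", "LOC"]) builds the form one suffix at a time, and analyze("өйләрдә", ["китап", "өй"]) recovers the stem and tag sequence. A real generative model would enumerate suffix chains from a grammar rather than from a fixed list of tag sequences, which is what allows wordforms of unbounded length.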
The Russian-Tatar translator was developed using recent advances in machine learning. Results from building the machine translation system for the Russian-Tatar language pair show that neural network algorithms can handle the translation task at a sufficiently high level. The prepared training corpus of parallel Russian-Tatar texts, together with data augmentation methods, makes further research in this area possible. The resulting system outperforms the only other existing translation system for this language pair, from Yandex, in closeness to professional translation.
Creating a speech interface to the computer is an important task in the field of intelligent information technologies. To build a universal speech recognition system, a database of voice recordings from more than 400 speakers, with a total duration of about 60 hours, has been collected. The necessary programs and models have been created, and the first experimental version of the recognition system, which recognizes 200 thousand Tatar words, has been launched. The results are comparable to those of similar systems worldwide and make it possible to interact with a computer using voice commands.
The Atlas of Tatar Folk Dialects covers all major Tatar settlement areas and records information on the phonetics, morphology, lexicon, and syntax of the Tatar language, collected in 28 regions of Russia. The database of the electronic atlas consists of an attribute part, which records the distribution of values for 215 linguistic phenomena across 1,047 settlements, and a cartographic part. The maps show features of Tatar dialects, grouped into sections on phonetics, morphology, lexicon, and syntax.
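The attribute part described above can be pictured as a table of (settlement, phenomenon, value) records. The following Python sketch shows one possible way to represent such records and to compute the distribution of values for a phenomenon. The class, the function names, and the placeholder records are assumptions made for illustration; they do not reflect the atlas's actual schema or data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One attribute record: which value a phenomenon takes in a settlement."""
    settlement: str
    phenomenon: str
    value: str


def value_distribution(observations, phenomenon):
    """Count how many settlements show each value of the given phenomenon."""
    return Counter(o.value for o in observations if o.phenomenon == phenomenon)


def settlements_with(observations, phenomenon, value):
    """List the settlements where the phenomenon takes the given value."""
    return sorted({o.settlement for o in observations
                   if o.phenomenon == phenomenon and o.value == value})


# Placeholder records only; these are not real atlas data.
SAMPLE = [
    Observation("settlement_A", "phenomenon_1", "variant_x"),
    Observation("settlement_B", "phenomenon_1", "variant_y"),
    Observation("settlement_C", "phenomenon_1", "variant_x"),
    Observation("settlement_A", "phenomenon_2", "variant_z"),
]
```

Calling value_distribution(SAMPLE, "phenomenon_1") counts settlements per variant, and settlements_with(SAMPLE, "phenomenon_1", "variant_x") returns the names that a cartographic layer could then plot.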
A morphological analyzer is a fundamental component of any full-featured linguistic processor. This analyzer has been integrated into the "Tugan Tel" Tatar National Corpus system, is used in the University Information System RUSSIA and in the Yandex.Translate internet service, and is used in teaching the "Philology: Applied Philology" program at Kazan Federal University.
Last updated: 8 December 2025, 16:38