Portal "Turkic Morpheme"

In information retrieval tasks, morphological analysis of the Tatar language serves the following purposes:

initial morphological analysis of texts to build a search index;
morphological analysis of a user's query, so that the analyzed query can be matched against the search index;
highlighting of query words within the texts found.
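
The three uses listed above can be sketched as follows. This is a minimal illustration, not the TatParser or ConsoleParser implementation: the lexicon, the suffix list, and the function names (lemma, build_index, search, highlight) are assumptions made for the example, and the toy suffix stripping stands in for a full morphological analyzer.

```python
import re
from collections import defaultdict

# Hypothetical toy data: a tiny lexicon and suffix list, not the Institute's model.
LEXICON = {"китап", "мәктәп"}
SUFFIXES = ["лар", "ләр", "да", "дә"]
TOKEN = re.compile(r"\w+")

def lemma(word):
    """Return the lexicon stem of `word`, or the lowercased word if unknown."""
    word = word.lower()
    if word in LEXICON:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = lemma(word[:-len(suffix)])
            if stem in LEXICON:
                return stem
    return word

def build_index(docs):
    """Task 1: map each lemma to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in TOKEN.findall(text):
            index[lemma(token)].add(doc_id)
    return index

def search(index, query):
    """Task 2: analyze the query and intersect the postings of its lemmas."""
    lemmas = [lemma(token) for token in TOKEN.findall(query)]
    if not lemmas:
        return set()
    result = set(index.get(lemmas[0], set()))
    for item in lemmas[1:]:
        result &= index.get(item, set())
    return result

def highlight(text, query):
    """Task 3: bracket every word whose lemma matches a query lemma."""
    wanted = {lemma(token) for token in TOKEN.findall(query)}
    def mark(match):
        word = match.group()
        return f"[{word}]" if lemma(word) in wanted else word
    return TOKEN.sub(mark, text)

docs = {1: "Китапларда хикәяләр бар", 2: "Мәктәп зур"}
index = build_index(docs)
print(search(index, "китап"))
print(highlight(docs[1], "китаплар"))
```

Because both the texts and the query are reduced to lemmas, a query in one word form finds and highlights documents that contain other forms of the same word.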

These tasks are handled by the morphological modules TatParser and ConsoleParser, which are based on a two-level model of Tatar morphology. The first version of the Tatar morphological analyzer was implemented with the PC-KIMMO software toolkit, in which a language's morphological model is described by its alphabet, phonological rules, and morphotactic rules. This implementation had limitations, however. In particular, the language could be described only in the Latin alphabet, so input Cyrillic text first had to be converted into the Latin alphabet by intermediate procedures, which slowed parsing. The next version of the analyzer therefore included a library of PC-KIMMO functions for working with Cyrillic characters, which brought query parsing performance to an acceptable level. The optimized morphological analyzer processes a 100-byte text in 0.014 seconds on average.
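
The intermediate conversion step used by the first version can be pictured roughly as below. This sketch does not call PC-KIMMO: the transliteration table is a deliberately partial, hypothetical mapping, and analyze_latin is a stand-in for a Latin-only analyzer. It only shows why every word had to make a round trip through the Latin alphabet.

```python
# Hypothetical, partial Cyrillic-to-Latin table (an assumption for illustration).
CYR_TO_LAT = {"а": "a", "ә": "ä", "д": "d", "и": "i", "к": "k",
              "л": "l", "п": "p", "р": "r", "т": "t"}
LAT_TO_CYR = {lat: cyr for cyr, lat in CYR_TO_LAT.items()}
TO_LATIN = str.maketrans(CYR_TO_LAT)
TO_CYRILLIC = str.maketrans(LAT_TO_CYR)

def to_latin(text):
    """Convert lowercased Cyrillic input into the Latin working alphabet."""
    return text.lower().translate(TO_LATIN)

def to_cyrillic(text):
    """Convert analyzer output back into Cyrillic."""
    return text.translate(TO_CYRILLIC)

def analyze_latin(word):
    """Stand-in for a Latin-only analyzer: splits off one plural suffix."""
    if word.endswith("lar"):
        return [word[:-3], "lar"]
    return [word]

def analyze_cyrillic(word):
    """Wrap the Latin-only analyzer with conversion on the way in and out."""
    return [to_cyrillic(part) for part in analyze_latin(to_latin(word))]

print(analyze_cyrillic("китаплар"))
```

Each call pays for two extra passes over the text, which is the overhead that working with Cyrillic characters inside the analyzer removes.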

Several models of Tatar morphology have been developed to date: generative, paradigmatic, and hybrid. The generative model, which is based on inflection rules, is slower than the other models but ensures complete analysis of word forms: it fully accounts for the agglutinative nature of the language and recognizes word forms of potentially unlimited length. This model, developed at the Institute, was used to create the Tatar text corrector TatCorr. The same model is included in ABBYY's text recognition software, starting with OCR FineReader version 4.0. The paradigmatic model, also developed at the Institute of Applied Semiotics, is used in Microsoft's word processor, starting with MS Word 2007.
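
How a rule-based model can accept word forms of unbounded length may be easier to see in a small sketch. The state names, tags, and the two stems below are assumptions chosen for illustration, and phonological alternations are ignored (both stems end in a back vowel, so the suffix shapes stay fixed). It is not the Institute's generative model.

```python
# Hypothetical toy lexicon: stem -> starting state.
STEMS = {"бакча": "N", "бала": "N"}

# Hypothetical morphotactics: state -> list of (suffix, tag, next state).
# The attributive suffix leads back to the noun state, so chains can repeat.
MORPHOTACTICS = {
    "N": [("лар", "PL", "N_PL"), ("да", "LOC", "N_CASE")],
    "N_PL": [("да", "LOC", "N_CASE")],
    "N_CASE": [("гы", "ATTR", "N")],
}

def _walk(rest, state, path, results):
    """Consume suffixes allowed in `state` until the word is used up."""
    if not rest:
        results.append("+".join(path))
        return
    for suffix, tag, next_state in MORPHOTACTICS.get(state, []):
        if rest.startswith(suffix):
            _walk(rest[len(suffix):], next_state, path + [tag], results)

def analyze(word):
    """Return every analysis of `word` as 'stem+TAG+TAG...'."""
    results = []
    for stem, state in STEMS.items():
        if word.startswith(stem):
            _walk(word[len(stem):], state, [stem], results)
    return results

print(analyze("бакчадагыларда"))
```

The cycle from the case state back to the noun state is what lets the rules cover arbitrarily long suffix chains without listing any word form in advance. Every step consumes at least one character, so the analysis always terminates.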

Another morphological analyzer model, based on morphemes, has been under active development since 2014 as part of a project to create a multifunctional, multilingual internet service built on a structural-parametric functional model of the Turkic morpheme. The morpheme model describes the properties of morphemes at different linguistic levels and defines rules for classifying and combining linguistic units. The service's database makes it possible to create morphological analysis and synthesis programs for various Turkic languages.
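
A morpheme record with properties and combination rules, and a synthesis routine built on it, might look like the sketch below. The field names, the sound classes, and the allomorph tables are simplified assumptions for illustration. The portal's actual database schema is not described here, and letters such as я, ю, and ё are not handled.

```python
from dataclasses import dataclass

BACK_VOWELS = set("аоуы")
FRONT_VOWELS = set("әөүеиэ")
VOICELESS = set("пфктшсчщхһц")
NASALS = set("мнң")

@dataclass(frozen=True)
class Morpheme:
    tag: str          # grammatical meaning, e.g. "PL"
    category: str     # classification, e.g. "number" or "case"
    allomorphs: dict  # (final sound class, vowel harmony) -> surface form

PLURAL = Morpheme("PL", "number", {
    ("nasal", "back"): "нар", ("nasal", "front"): "нәр",
    ("other", "back"): "лар", ("other", "front"): "ләр",
})
LOCATIVE = Morpheme("LOC", "case", {
    ("voiceless", "back"): "та", ("voiceless", "front"): "тә",
    ("other", "back"): "да", ("other", "front"): "дә",
})

def harmony(form):
    """Classify the form as back or front by its last vowel."""
    for char in reversed(form):
        if char in BACK_VOWELS:
            return "back"
        if char in FRONT_VOWELS:
            return "front"
    raise ValueError(f"no vowel found in {form!r}")

def final_class(form):
    """Classify the last sound of the form."""
    last = form[-1]
    if last in NASALS:
        return "nasal"
    if last in VOICELESS:
        return "voiceless"
    return "other"

def synthesize(stem, morphemes):
    """Attach morphemes in order, choosing the allomorph that fits the context."""
    form = stem
    for morpheme in morphemes:
        vowels = harmony(form)
        surface = morpheme.allomorphs.get((final_class(form), vowels))
        if surface is None:
            # No context-specific variant: fall back to the default class.
            surface = morpheme.allomorphs[("other", vowels)]
        form += surface
    return form

print(synthesize("китап", [PLURAL, LOCATIVE]))
```

Keeping the allomorph tables as data rather than code is what would let one program serve several languages: a different Turkic language would supply different records to the same routine.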

Last updated: 8 December 2025, 16:41