National Corpus of the Tatar Language "Tugan Tel"

Website: tugantel.tatar

The National Corpus of the Tatar Language "Tugan Tel" is a linguistic resource for the modern literary Tatar language. It is intended for a wide range of users: linguists; specialists in Tatar, Turkic, and general linguistics; typologists; teachers of Tatar; cultural figures; and anyone who studies or is interested in the language. The electronic corpus is a core component of a software toolkit for research and development on Turkic languages.

The electronic corpus development project includes:

Developing a conceptual model for the corpus of Tatar texts;
Compiling the text collection and normalizing it technically and orthographically;
Developing a methodology for the morphological annotation of Tatar texts and methods for automating it;
Implementing corpus search functions and the visualization of search results;
Developing a methodology for semantic-syntactic annotation and methods for automating it.

A dedicated linguistic data management system, the "corpus manager", has been developed to manage the corpus data. It is designed primarily for Turkic languages but can also be used with electronic corpora of other languages. The corpus search system supports searching by:

Word form;
Lemma (lexeme);
Set of morphological parameters.
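
The three search criteria above can be illustrated with a minimal Python sketch over a toy in-memory token list. Everything here is an assumption made for illustration: the `Token` class, the `search` function, the sample tokens, and the tag names (`N`, `PL`, `DAT`, and so on) are hypothetical and do not reproduce the corpus manager's actual data model or tagset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated token: surface form, lemma, and morphological tags."""
    form: str
    lemma: str
    tags: frozenset


# Toy data; the tag names are illustrative, not the corpus's real tagset.
TOKENS = [
    Token("китаплар", "китап", frozenset({"N", "PL", "NOM"})),
    Token("китапка", "китап", frozenset({"N", "SG", "DAT"})),
    Token("мәктәпкә", "мәктәп", frozenset({"N", "SG", "DAT"})),
    Token("укый", "укы", frozenset({"V", "PRES"})),
]


def search(tokens, form=None, lemma=None, tags=None):
    """Return tokens matching every criterion that was supplied.

    form  -- exact word form
    lemma -- exact lemma (lexeme)
    tags  -- morphological parameters that must all be present
    """
    required = frozenset(tags or ())
    return [
        t for t in tokens
        if (form is None or t.form == form)
        and (lemma is None or t.lemma == lemma)
        and required <= t.tags
    ]


by_lemma = [t.form for t in search(TOKENS, lemma="китап")]
by_tags = [t.form for t in search(TOKENS, tags={"N", "DAT"})]
```

In this sketch the criteria combine conjunctively, so a lemma and a set of morphological parameters can be supplied together to narrow the result.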

The search system also supports excluding specified words from the results (negative words), partial-word search, queries built from logical formulas, and phrase search, so users can formulate complex queries tailored to their research needs.
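
As a rough illustration of how such query types can be combined, the following Python sketch models each query as a predicate over a tokenized sentence. The helper names (`has`, `partial`, `phrase`, `NOT`, `AND`, `OR`, `run`) and the toy sentences are hypothetical, and partial-word search is assumed here to mean prefix matching; the corpus's actual query syntax may differ.

```python
# Toy tokenized sentences used only for demonstration.
SENTENCES = [
    ["мин", "китап", "укыйм"],
    ["ул", "мәктәпкә", "бара"],
    ["без", "китапханәгә", "барабыз"],
]


def has(word):
    """Match sentences that contain the exact word form."""
    return lambda s: word in s


def partial(prefix):
    """Match sentences with at least one word starting with the prefix."""
    return lambda s: any(w.startswith(prefix) for w in s)


def phrase(*words):
    """Match sentences containing the words as a contiguous sequence."""
    n = len(words)
    return lambda s: any(
        tuple(s[i:i + n]) == words for i in range(len(s) - n + 1)
    )


def NOT(q):
    """Negate a query (the 'negative word' case when q is has(...))."""
    return lambda s: not q(s)


def AND(*qs):
    return lambda s: all(q(s) for q in qs)


def OR(*qs):
    return lambda s: any(q(s) for q in qs)


def run(query, sentences):
    """Return the sentences that satisfy the query."""
    return [s for s in sentences if query(s)]


# Words starting with "китап", but excluding sentences containing "укыйм".
excluded = run(AND(partial("китап"), NOT(has("укыйм"))), SENTENCES)
# A two-word phrase query.
phrased = run(phrase("мәктәпкә", "бара"), SENTENCES)
```

Because every query is just a function, logical formulas of any depth can be built by nesting `AND`, `OR`, and `NOT`.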

To make search results quick and convenient to extract for further processing in other software, Corpus API tools have been developed: a set of API functions that extract corpus samples matching specified criteria and present them in various formats.
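
The idea of presenting one sample in several formats can be sketched as follows. This is not the Corpus API itself: the `export_hits` function, the record fields, and the choice of JSON and CSV as output formats are assumptions made purely for illustration.

```python
import csv
import io
import json

# Hypothetical search hits; field names and tag strings are illustrative.
HITS = [
    {"form": "китаплар", "lemma": "китап", "tags": "N+PL"},
    {"form": "китапка", "lemma": "китап", "tags": "N+DAT"},
]


def export_hits(hits, fmt="json"):
    """Serialize a list of hit records as a JSON or CSV string."""
    if fmt == "json":
        # ensure_ascii=False keeps Cyrillic text readable in the output.
        return json.dumps(hits, ensure_ascii=False, indent=2)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(
            buf, fieldnames=["form", "lemma", "tags"], lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(hits)
        return buf.getvalue()
    raise ValueError(f"unsupported format: {fmt}")


as_json = export_hits(HITS, "json")
as_csv = export_hits(HITS, "csv")
```

Keeping serialization separate from the search step means a further output format can be added without touching the query logic.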

The project is carried out within the framework of the State Program "Preservation, Study, and Development of the State Languages of the Republic of Tatarstan and Other Languages in the Republic of Tatarstan for 2014-2020".

The corpus includes Tatar texts of various genres with a total volume of over 180 million word occurrences (tokens) as of December 2019.

Last updated: 8 December 2025, 16:40