Frequency Dictionary of the Modern Kalmyk Language: Rules for Analyzing Textual Material

Bibliographic Details
Title:	Частотный словарь современного калмыцкого языка: правила анализа текстового материала [Frequency Dictionary of the Modern Kalmyk Language: Rules for Analyzing Textual Material]
Source:	Вестник Калмыцкого института гуманитарных исследований РАН.
Publisher Information:	ФГБУН «Калмыцкий научный центр Российской академии наук», 2014.
Publication Year:	2014
Subject Terms:	CORPUS LINGUISTICS, QUANTITATIVE METHODS IN LINGUISTICS, FREQUENCY DICTIONARY, KALMYK LANGUAGE, RULES FOR LEMMATIZATION
Description:	The article describes the rules for analyzing text material in order to create the Frequency Dictionary of the Kalmyk language on the basis of the National Corpus of the Kalmyk Language (www.kalmcorpora.ru), which includes literary works published in the second half of the 20th and the beginning of the 21st centuries, as well as newspaper articles and transcripts of spoken language. The volume of fiction (prose and poetry) exceeds 10 million words. The texts in the Corpus, as well as certain elements of the texts (word forms, punctuation marks, paragraphs, etc.), carry special annotations. The Frequency Dictionary created on the basis of the Corpus is a pilot model, as it is the first attempt to develop a dictionary of this type. In our opinion, the size of the Corpus makes it possible to describe the language in terms of the usage frequency of language units and meanings: word forms, words, constructions (2- and 3-grams), grammatical meanings, letters, etc. In 2013, an experimental version of the National Corpus of the Kalmyk Language was launched without morphological or semantic annotation, although the closed data already had these types of annotation. The annotated material will be opened once the analyzer's program code has been adjusted and its efficiency reaches 90%. At present, the model of the morphological parser's algorithm for the Kalmyk language successfully analyzes 70% of any text, producing a single unambiguous parse. About 20% of the text receives multiple possible automatic analyses, and about 10% receives no parse because the dictionary contains no stems for those words (mostly Russian loanwords not included in the dictionary edited by B.D. Muniev [1977], and some proper names).
The main idea behind the Frequency Dictionary is that the most frequently used language units are the most significant ones in any language, while infrequent elements are equally significant from another point of view. They can carry traces of historical development, or they can belong to various terminological systems, which indicates that a lexical unit is out of use in speech. The frequency of language units and meanings has not been studied in Kalmyk linguistics; therefore, to research the frequency characteristics of Kalmyk speech, one should first identify and justify the parameters for distinguishing and describing them. Thus the aim of this article is to describe the rules for analyzing lexical units in order to develop the Frequency Dictionary of the Kalmyk language, where the observation unit is a lemma, that is, the initial form of a word without its lexical and grammatical annotations. This does not mean that the dictionary ignores Kalmyk grammar: the processing of word forms and the compilation of the lemma vocabulary are governed by the rules of a formalized description of Kalmyk grammar, with a separate description for each part of speech. The basic issue is to define the boundaries of the notions of a word and a lemma (the initial form of a word). The rules for textual material analysis provided in the article are built on the principles used for “The Frequency Dictionary of the Russian Language” [Frequency Dictionary … 1977] and “The Grammar Dictionary of the Russian Language” [Zalizniak 1987], revised for the purposes of the Kalmyk language, while for units that do not exist in the literary written language the rules have been developed anew.
Each part of speech has its own set of rules which regulates the work of the morphological parser to process lineal
The article describes the rules for analyzing text material in order to create a frequency dictionary of the Kalmyk language based on the National Corpus of the Kalmyk Language (www.kalmcorpora.ru), which consists of literary texts from the second half of the 20th and the beginning of the 21st century, as well as newspaper articles and transcripts of spoken language. The volume of the literary (prose and poetry) texts exceeds 10 million word tokens. The texts in the corpus, as well as individual text elements (word forms, punctuation marks, paragraphs, etc.), are specially annotated. The frequency dictionary of the Kalmyk language being created will be a pilot project, since it is the first attempt to develop a dictionary of this type. In our view, the size of the Kalmyk corpus makes it possible to describe the language in terms of the usage frequency of language units and meanings: word forms, words, constructions (2- and 3-grams), grammatical meanings, letters, and others.
Document Type:	Article
File Description:	text/html
Language:	Russian
ISSN:	2410-7670 2075-7794
Access URL:	http://cyberleninka.ru/article/n/chastotnyy-slovar-sovremennogo-kalmytskogo-yazyka-pravila-analiza-tekstovogo-materiala http://cyberleninka.ru/article_covers/16983788.png
Accession Number:	edsair.od......2806..6e0bbf3d7cf5d94f672d5427a34b8424
Database:	OpenAIRE
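
The Description above names the lemma as the observation unit, lists 2- and 3-gram constructions among the counted units, and distinguishes tokens with one unambiguous parse, with several candidate parses, and with no parse. The following is only a minimal Python sketch of that bookkeeping, not the authors' parser: the `ANALYSES` table, the function names, and all tokens are invented placeholders rather than real Kalmyk data, and skipping ambiguous tokens in the lemma count is an assumption made here for simplicity.

```python
from collections import Counter

# Hypothetical analyzer table: word form -> candidate lemmas.
# All entries are invented placeholders, not real Kalmyk word forms.
ANALYSES = {
    "w1a": ["w1"],
    "w1b": ["w1"],
    "w2a": ["w2"],
    "w3a": ["w3", "w4"],  # two candidate lemmas: ambiguous
}


def classify(tokens, analyses):
    """Count lemmas for unambiguous tokens and tally the parse status of every token."""
    lemma_counts = Counter()
    status_counts = Counter()
    for token in tokens:
        candidates = analyses.get(token, [])
        if len(candidates) == 1:
            lemma_counts[candidates[0]] += 1
            status_counts["unambiguous"] += 1
        elif candidates:
            # Assumption: ambiguous tokens are tallied but not assigned a lemma.
            status_counts["ambiguous"] += 1
        else:
            # No stem in the table, e.g. a loanword or a proper name.
            status_counts["unparsed"] += 1
    return lemma_counts, status_counts


def ngram_counts(tokens, n):
    """Count contiguous n-grams of word forms."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


tokens = ["w1a", "w2a", "w1b", "w3a", "xyz", "w1a", "w2a"]
lemma_counts, status_counts = classify(tokens, ANALYSES)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

print(lemma_counts.most_common())
print(dict(status_counts))
print(bigrams.most_common(1))
```

Dividing each entry of `status_counts` by `len(tokens)` gives coverage shares comparable in kind to the 70% / 20% / 10% split reported in the Description.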
