A STATISTICAL INDEX CALCULATED USING THE TF-IDF FOR TEXTS IN THE UZBEK LANGUAGE CORPUS

15.12.2022 International Scientific Journal "Science and Innovation". Series B. Volume 1 Issue 8

B. Elov, Z. Xusainova, N. Xudayberganov

Abstract. TF-IDF is one of the most common methods for processing textual data. Google's search engine has used the TF-IDF method for many years to rank content relevant to user queries. According to the results of the conducted research, the Google system paid more attention to the frequency of terms than to the calculation of keywords. The value computed by the TF-IDF method represents the relevance of a keyword within the language corpus. Using the TF-IDF method, a numeric vector is generated for each document in the corpus. This numeric vector is a measure used in information retrieval (IR) and machine learning (ML) to represent the importance of string representations (words, phrases, lemmas, etc.) to a document. In this article, we consider the process of ranking documents in the Uzbek language corpus by keyword using the TF-IDF method.

Keywords: TF-IDF, BoW, numeric vectors, corpus, term frequency, inverse document frequency, tokenization, lemmatization.
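
The keyword-based ranking described in the abstract can be illustrated with a minimal sketch. It assumes one common TF-IDF variant (term frequency as the count divided by document length, inverse document frequency as the natural logarithm of N / df), which may differ from the weighting used in the article. The function names `tf_idf_scores` and `rank_documents` and the three-sentence toy corpus are hypothetical and serve only as an illustration; tokenization is plain whitespace splitting, with no lemmatization.

```python
import math
from collections import Counter


def tf_idf_scores(docs, keyword):
    """Return the TF-IDF score of `keyword` for each tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain the keyword.
    df = sum(1 for tokens in docs if keyword in tokens)
    if df == 0:
        # The keyword occurs nowhere, so every document scores zero.
        return [0.0] * n_docs
    idf = math.log(n_docs / df)
    scores = []
    for tokens in docs:
        tf = Counter(tokens)[keyword] / len(tokens) if tokens else 0.0
        scores.append(tf * idf)
    return scores


def rank_documents(docs, keyword):
    """Return document indices sorted by descending TF-IDF score."""
    scores = tf_idf_scores(docs, keyword)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Toy corpus invented for illustration, tokenized by whitespace only.
corpus = [
    "kitob o'qish foydali".split(),
    "kitob va yana kitob".split(),
    "bugun havo issiq".split(),
]

ranking = rank_documents(corpus, "kitob")
```

In this toy corpus the second document contains the keyword twice in four tokens, so it ranks ahead of the first document, and the third document, which lacks the keyword, ranks last. A real pipeline for Uzbek text would normally lemmatize tokens first so that inflected forms of a word are counted together.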