A STATISTICAL INDEX CALCULATED USING THE TF-IDF FOR TEXTS IN THE UZBEK LANGUAGE CORPUS

B. Elov; Z. Xusainova; N. Xudayberganov

doi:10.5281/zenodo.7440059

15.12.2022 International Scientific Journal "Science and Innovation". Series B. Volume 1 Issue 8

Abstract. TF-IDF is one of the most common methods of processing textual data. Google's search engine has used the TF-IDF method for many years to rank content relevant to user queries. Research on the subject indicates that Google's system gave more weight to term frequency than to keyword counts. A TF-IDF value expresses how relevant a keyword is within the language corpus. Applying the TF-IDF method produces a numeric vector for each document in the corpus. This vector is a measure used in information retrieval (IR) and machine learning (ML) to represent the importance of string units (words, phrases, lemmas, etc.) to a document. This article examines how documents in the Uzbek language corpus are ranked by keyword using the TF-IDF method.

Keywords: TF-IDF, BoW, numeric vectors, corpus, term frequency, inverse document frequency, tokenization, lemmatization.
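
The ranking process summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state which weighting variant is used, so the sketch assumes term frequency as count divided by document length and inverse document frequency as ln(N / df). The function names, the whitespace tokenizer (no lemmatization), and the three sample sentences are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase + whitespace split; a real pipeline
    # for Uzbek would tokenize and lemmatize properly.
    return text.lower().split()


def tf_idf_vectors(docs):
    """Return one {term: weight} dictionary per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Assumed variant: tf = count / length, idf = ln(N / df).
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def rank_by_keyword(docs, keyword):
    """Sort document indices by the keyword's TF-IDF weight, highest first."""
    key = keyword.lower()
    scores = [(i, vec.get(key, 0.0)) for i, vec in enumerate(tf_idf_vectors(docs))]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample sentences, not taken from the corpus.
docs = [
    "kitob o'qish foydali",
    "men kitob va kitob oldim",
    "bugun havo issiq",
]
ranking = rank_by_keyword(docs, "kitob")
```

Here `ranking` lists each document index with its score for the keyword "kitob". The second sentence ranks first because the keyword makes up a larger share of its tokens, and the third sentence scores zero because the keyword does not occur in it.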
