Abstract. Uzbek is an agglutinative language that forms words by attaching affixes to roots, with inflectional endings expressing various morphological features. This property yields a large number of possible word-ending combinations, which greatly increases vocabulary size and causes data sparseness problems for statistical models. This paper presents a morphological analysis model that performs stemming, lemmatization, and extraction of morphological information while accounting for morpho-phonetic exceptions. The core of the model is a complete set of word endings with assigned morphological information, together with additional datasets for morphological analysis. The proposed model was evaluated on a curated test set of 5.3K words. It achieved a word-level accuracy of over 91%, as determined by linguistic experts who manually verified the correctness of the stems, lemmas, and morphological features. The tool built on the proposed methodology is available as an open-source Python package and as a web-based application with a public API.
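The ending-based approach described above can be illustrated with a minimal sketch. It is not the package's actual API or data: the function name `analyze`, the tables `ENDINGS` and `STEM_EXCEPTIONS`, the feature labels, and the handful of endings are all hypothetical and chosen only for illustration. The sketch assumes a longest-match lookup of a word ending in a table that maps endings to morphological features, followed by a lookup in an exception table to recover the lemma when the stem has changed form.

```python
# Toy table: word ending -> morphological features.
# Illustrative only; the model described above relies on a complete set.
ENDINGS = {
    "lar": {"Number": "Plur"},
    "ni": {"Case": "Acc"},
    "ning": {"Case": "Gen"},
    "ga": {"Case": "Dat"},
    "da": {"Case": "Loc"},
    "dan": {"Case": "Abl"},
    "larni": {"Number": "Plur", "Case": "Acc"},
    "larga": {"Number": "Plur", "Case": "Dat"},
    "i": {"Poss": "3"},
}

# Hypothetical morpho-phonetic exception table: surface stem -> lemma.
# Example: "shahar" (city) drops a vowel before the possessive ending: "shahri".
STEM_EXCEPTIONS = {"shahr": "shahar"}


def analyze(word, min_stem_len=2):
    """Return stem, lemma, ending, and features using longest-ending match."""
    word = word.lower()
    # Try longer endings first so that "larni" wins over "ni" and "i".
    for ending in sorted(ENDINGS, key=len, reverse=True):
        stem = word[: -len(ending)]
        if word.endswith(ending) and len(stem) >= min_stem_len:
            lemma = STEM_EXCEPTIONS.get(stem, stem)
            return {
                "stem": stem,
                "lemma": lemma,
                "ending": ending,
                "features": dict(ENDINGS[ending]),
            }
    # No known ending: treat the whole word as its own stem and lemma.
    return {"stem": word, "lemma": word, "ending": "", "features": {}}


if __name__ == "__main__":
    for w in ("kitoblarni", "shahri", "kitob"):
        print(w, analyze(w))
```

A table-driven design like this keeps the linguistic knowledge in data rather than in code, so coverage can grow by extending the ending and exception tables. A real analyzer would also need to handle ambiguity, since a short ending such as "i" can coincide with the final letters of an uninflected root.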