A BI-TECHNICAL ANALYSIS FOR ARABIC STOP-WORDS DETECTION

Driss Namly, Karim Bouzoubaa, Abdellah Yousfi

Abstract


Stop words are defined as words that appear frequently in texts without carrying any significant information. For the Arabic language, existing works suffer from two main drawbacks: (i) they use only proprietary corpora and (ii) they rely solely on the frequency metric. Our approach to automatic Arabic stop-word detection uses a new metric based on a supervised machine learning process and a vector space representation. It can be applied to any corpus and takes into account both domain-independent and domain-dependent stop-words. Experiments conducted to evaluate the proposed approach show a significant improvement, with the detection rate reaching 91.85% as measured by the F-measure.
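
As a rough illustration of the ideas named in the abstract, the sketch below represents each word of a toy corpus as a small feature vector and scores a detected stop-word list with the balanced F-measure. It is not the authors' method: the feature choice (relative term frequency and document-frequency ratio), the threshold rule standing in for a trained classifier, the toy corpus, the reference list, and the function names `word_features` and `f_measure` are all assumptions made for this example.

```python
from collections import Counter


def word_features(docs):
    """Map each word to (relative term frequency, document-frequency ratio)."""
    tokenized = [doc.split() for doc in docs]
    total_tokens = sum(len(tokens) for tokens in tokenized)
    term_counts = Counter(tok for tokens in tokenized for tok in tokens)
    doc_counts = Counter(tok for tokens in tokenized for tok in set(tokens))
    return {
        word: (term_counts[word] / total_tokens, doc_counts[word] / len(tokenized))
        for word in term_counts
    }


def f_measure(predicted, reference):
    """Balanced F-measure (F1) of a predicted stop-word set against a reference set."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical three-document toy corpus, whitespace-tokenized.
docs = ["في البيت كتاب", "الولد في المدرسة", "كتاب من المكتبة"]
features = word_features(docs)

# Assumed threshold rule in place of a trained supervised classifier:
# flag words that occur in at least two of the three documents.
detected = {word for word, (tf, df) in features.items() if df >= 2 / 3}

# Hypothetical reference stop-word list for the toy corpus.
reference = {"في", "من"}
score = f_measure(detected, reference)
```

On this toy corpus the rule flags one true stop word and one content word, which shows why a document-frequency threshold alone is a weak detector and why combining several features in a learned model is attractive.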

Keywords


NLP, Stop-words, Supervised machine learning, Arabic language

DOI: http://dx.doi.org/10.6084/ijact.v8i5.880

Copyright (c) 2019 COMPUSOFT: An International Journal of Advanced Computer Technology