Meeting Challenges of Modern Standard Arabic and Saudi Dialect Identification

Abstract

Dialect identification is a prerequisite for learning the lexical and morphological knowledge of a language variety, which can benefit natural language processing (NLP) and downstream AI tasks. In this paper, we present the first work on sentence-level Modern Standard Arabic (MSA) and Saudi Dialect (SD) identification, in which we trained and tested three classifiers (Logistic Regression, Multinomial Naïve Bayes, and Support Vector Machine) on datasets collected from Saudi Twitter and automatically labeled as MSA or SD. The model for each configuration was built using two levels of language model, unigram and bigram, as feature sets for training the systems. Under 10-fold cross-validation, the models achieved an average accuracy of 98.98%. The classifiers were then evaluated on a separate, unseen, manually annotated dataset, where the best performance was achieved by Multinomial Naïve Bayes, reporting 89%.
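
To make the feature setup concrete, the following is a minimal sketch of a Multinomial Naïve Bayes classifier over word unigram and bigram counts. It is not the authors' implementation: the class name `ToyMultinomialNB`, the helper `ngram_features`, whitespace tokenization, add-one smoothing, and the six example sentences are all assumptions made for illustration, and the sentences are not taken from the paper's Twitter dataset.

```python
import math
from collections import Counter, defaultdict


def ngram_features(sentence):
    """Return word unigrams plus word bigrams (whitespace tokenization is an assumption)."""
    tokens = sentence.split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


class ToyMultinomialNB:
    """Hypothetical, minimal Multinomial Naive Bayes with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for sentence, label in zip(sentences, labels):
            self.feature_counts[label].update(ngram_features(sentence))
        self.vocab = {f for counts in self.feature_counts.values() for f in counts}
        return self

    def predict(self, sentence):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for feat in ngram_features(sentence):
                if feat in self.vocab:  # features unseen in training are skipped
                    score += math.log((counts[feat] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Illustrative toy sentences only; not from the paper's data.
train = [
    ("ماذا تريد الآن", "MSA"),
    ("لماذا لم تذهب إلى المدرسة", "MSA"),
    ("أريد أن أذهب إلى السوق", "MSA"),
    ("وش تبي الحين", "SD"),
    ("ليش ما رحت المدرسة", "SD"),
    ("أبغى أروح السوق", "SD"),
]
sentences, labels = zip(*train)
model = ToyMultinomialNB().fit(sentences, labels)

print(model.predict("وش تبي"))     # SD
print(model.predict("ماذا تريد"))  # MSA
```

In practice, a library implementation with a proper tokenizer and k-fold cross-validation would replace this toy class; the sketch only shows how unigram and bigram counts feed the classifier.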

Publication
In 11th International Conference on Embedded Systems and Applications (EMSA 2022)
Mohamed Mohana
Artificial Intelligence Research Engineer

My research interests include Artificial Intelligence, Computer Vision, Classical Machine Learning, AI for Environment, AI in Renewable Energy, and Feature Selection.