Meeting Challenges of Modern Standard Arabic and Saudi Dialect Identification

Yahya Aseri, Khalid Alreemy, Salem Alelyani, Mohamed Mohana

March, 2022

Abstract

Dialect identification is a prerequisite for learning the lexical and morphological knowledge of a language variety, which can benefit natural language processing (NLP) and downstream AI tasks. In this paper, we present the first work on sentence-level Modern Standard Arabic (MSA) and Saudi Dialect (SD) identification, in which we trained and tested three classifiers (Logistic Regression, Multinomial Naïve Bayes, and Support Vector Machine) on datasets collected from Saudi Twitter and automatically labeled as MSA or SD. The model for each configuration was built using two levels of language models, i.e., unigram and bigram, as feature sets for training the systems. The model reported high accuracy under 10-fold cross-validation, with an average of 98.98%. This model was then evaluated on another unseen, manually annotated dataset. The best performance among the classifiers was achieved by Multinomial Naïve Bayes, reporting 89%.
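To make the setup in the abstract concrete, the following is a minimal sketch of one of the three classifier families (Multinomial Naïve Bayes) trained on word unigram and bigram counts. It is not the authors' implementation: the `ngrams` helper, the `MultinomialNB` class, the add-one smoothing value, and the four toy sentences are all assumptions made up for illustration, and the toy data does not come from the paper's Twitter datasets.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, n_max=2):
    """Word n-grams (n = 1..n_max) of a whitespace-tokenized sentence."""
    tokens = text.split()
    feats = []
    for n in range(1, n_max + 1):
        feats.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return feats


class MultinomialNB:
    """Hypothetical minimal Multinomial Naive Bayes with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant (assumed, not from the paper)

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feat_counts[label].update(ngrams(text))
        # Vocabulary is the union of n-grams seen in any class.
        self.vocab = {f for counts in self.feat_counts.values() for f in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feat_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for feat in ngrams(text):
                if feat in self.vocab:  # n-grams unseen in training are skipped
                    score += math.log((counts[feat] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy sentences, loosely labeled for illustration only.
train_texts = ["ماذا تريد الآن", "إلى أين تذهب اليوم", "وش تبي الحين", "وين رايح اليوم"]
train_labels = ["MSA", "MSA", "SD", "SD"]

model = MultinomialNB().fit(train_texts, train_labels)
print(model.predict("وش تبي"))
print(model.predict("ماذا تريد"))
```

In a setup like the one the abstract describes, the same feature extraction would feed each of the three classifiers, and accuracy would be averaged over the 10 cross-validation folds before testing on the separate manually annotated set.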

Type

Conference paper

Publication

In 11th International Conference on Embedded Systems and Applications (EMSA 2022)

