Enhancing Cybersecurity: Machine Learning and Natural Language Processing for Arabic Phishing Email Detection

Salloum, Said

Abstract

Phishing is a significant threat to the modern world, causing considerable financial losses. Although email has proven to be a valuable asset worldwide, facilitating communication for large corporations and individuals alike, it has also brought its own set of problems. Scammers exploit these by sending bogus emails to susceptible people in order to gain access to their personal information. Phishing email detection is an important research field, and the research community has worked hard to address the problem in widely studied languages such as English. Other important languages, such as Arabic, have received little attention when it comes to phishing detection. Arabic is the native language of more than 300 million people and is ranked as the fifth most widely used language in the world. In terms of content-based phishing email detection, there has been relatively little research on Arabic-language phishing emails. To narrow this gap, this study presents an English-Arabic Phishing Detection (EAPD) model developed at the word level (Term Frequency-Inverse Document Frequency (TF-IDF), Document-Term Matrix (DTM), and FastText embedding) and at the character level with a convolutional neural network (CharEmbedding). It is among the first studies to explore the extent to which machine learning (ML) and natural language processing (NLP) methods can be used to develop models for detecting English/Arabic phishing attacks. An English-Arabic parallel phishing email corpus was developed using the English and Arabic text provided by the leading security and privacy analytics anti-phishing shared task (IWSPA-AP 2018). To evaluate the effectiveness of the EAPD model, a balanced collection of 1,258 emails in Arabic and English, with equal proportions of legitimate and phishing emails, was used.
The experiments indicate that, with the Multilayer Perceptron (MLP) classifier combined with TF-IDF, the EAPD model achieved an accuracy of 95.3% on the Arabic dataset. The English text, by comparison, reached 95.7% accuracy when paired with the Support Vector Machine (SVM) classifier and TF-IDF. Salloum's list, a new set of Arabic stop words, was also introduced. Traditional ML classifiers remained largely unaffected by it, whereas deep learning (DL) models with FastText embedding, especially LSTM, showed a significant 14% variance following the integration of this extended list. Overall, this study presents a promising approach for detecting phishing emails in both English and Arabic, with high accuracy and efficiency.
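
The abstract does not describe how the word-level features were computed. As a rough illustration only, the following minimal Python sketch shows one common TF-IDF variant applied to whitespace-tokenized email text in both languages. The function name `tfidf`, the sample emails, and the exact weighting formula (term frequency times `ln(N / df) + 1`) are assumptions made for this example and are not taken from the thesis.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = ln(N / df) + 1.
    The thesis may use a different weighting or tokenizer.
    """
    # Whitespace tokenization works for both English and Arabic text.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * (math.log(n_docs / df[term]) + 1.0)
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample emails, not drawn from the IWSPA-AP 2018 corpus.
emails = [
    "verify your account now",
    "your meeting agenda for monday",
    "تحقق من حسابك الآن",
]
vectors = tfidf(emails)
```

In this sketch, a term that appears in several emails (such as "your") receives a lower weight than a term unique to one email (such as "verify"). The resulting weight dictionaries would then be converted to fixed-length vectors and passed to a classifier such as an SVM or MLP.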

Thesis Type	Thesis
Deposit Date	Jan 17, 2024
Publicly Available Date	Feb 27, 2024
Award Date	Jan 26, 2024
