An exploratory and automated study of sarcasm detection and classification in app stores using fine-tuned deep learning classifiers

Eman Fatima, Hira Kanwal, Javed Ali Khan, Nek Dil Khan

Research output: Contribution to journalArticlepeer-review

Abstract

App stores enable users to provide insightful feedback on apps, which developers can use for future software application enhancement and evolution. However, finding user reviews that are valuable and relevant for quality improvement and app enhancement is challenging because of increasing end-user feedback. Also, to date, according to our knowledge, the existing sentiment analysis approaches lack in considering sarcasm and its types when identifying sentiments of end-user reviews for requirements decision-making. Moreover, no work has been reported on detecting sarcasm by analyzing app reviews. This paper proposes an automated approach by detecting sarcasm and its types in end-user reviews and identifying valuable requirements-related information using natural language processing (NLP) and deep learning (DL) algorithms to help software engineers better understand end-user sentiments. For this purpose, we crawled 55,000 end-user comments on seven software apps in the Play Store. Then, a novel sarcasm coding guideline is developed by critically analyzing end-user reviews and recovering frequently used sarcastic types such as Irony, Humor, Flattery, Self-Deprecation, and Passive Aggression. Next, using coding guidelines and the content analysis approach, we annotated the 10,000 user comments and made them parsable for the state-of-the-art DL algorithms. We conducted a survey at two different universities in Pakistan to identify participants’ accuracy in manually identifying sarcasm in the end-user reviews. We developed a ground truth to compare the results of DL algorithms. We then applied various fine-tuned DL classifiers to first detect sarcasm in the end-user feedback and then further classified the sarcastic reviews into more fine-grained sarcastic types. For this, end-user comments are first pre-processed and balanced with the instances in the dataset. Then, feature engineering is applied to fine-tune the DL classifiers. We obtain an average accuracy of 97%, 96%, 96%, 96%, 96%, 86%, and 90% with binary classification and 90%, 91%, 92%, 91%, 91%, 75%, and 89% with CNN, LSTM, BiLSTM, GRU, BiGRU, RNN, and BiRNN classifiers, respectively. Such information would help improve the performance of sentiment analysis approaches to understand better the associated sentiments with the identified new features or issues.

Original languageEnglish
Article number69
JournalAutomated Software Engineering
Volume31
Issue number2
Early online date27 Aug 2024
DOIs
Publication statusE-pub ahead of print - 27 Aug 2024

Keywords

  • Apps reviews
  • CrowdRE
  • Deep learning
  • Sarcasm detection & classification
  • Sentiment analysis

Cite this