Detecting duplicate questions in Stack Overflow via source code modeling

Wei GAO, Jian WU, Guandong XU

Research output: Contribution to journal › Article › peer-review

4 Citations (Scopus)

Abstract

Stack Overflow is one of the most popular question-answering sites for programmers. However, it faces the problem of question duplication, where newly created questions are identical to previous questions. Existing work on duplicate question detection in Stack Overflow extracts a set of textual features from question pairs and uses supervised learning to classify pairs as duplicates. However, it does not consider the source code in the questions, even though in some cases the intent of a question is conveyed mainly by its source code. In this paper, we aim to learn the semantics of a question by combining text features and source code features. We use word embeddings and convolutional neural networks to extract textual features from questions and so overcome the lexical gap issue. We use tree-based convolutional neural networks to extract structural and semantic features from source code. In addition, we perform multi-task learning by combining the duplicate question detection task with a question tag prediction side task. We conduct extensive experiments on the Stack Overflow dataset and show that our approach detects duplicate questions with higher recall and MRR than baseline approaches for the Python and Java programming languages. Copyright © 2022 World Scientific.
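The abstract reports results in terms of recall and MRR (mean reciprocal rank). As a rough illustration of how such ranking metrics are commonly computed, the sketch below assumes each query question comes with a ranked list of candidate question IDs and a set of known duplicates. The function names, the data layout, and the particular recall@k definition are assumptions made for illustration; the paper's exact evaluation protocol may differ.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the known duplicates that appear among the top-k candidates."""
    if not relevant:
        return 0.0
    hits = sum(1 for qid in ranked[:k] if qid in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / position of the first known duplicate, or 0.0 if none is retrieved."""
    for position, qid in enumerate(ranked, start=1):
        if qid in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    rankings: Dict[str, List[str]], gold: Dict[str, Set[str]]
) -> float:
    """Average reciprocal rank over all query questions."""
    if not rankings:
        return 0.0
    total = sum(
        reciprocal_rank(ranked, gold.get(query, set()))
        for query, ranked in rankings.items()
    )
    return total / len(rankings)


# Hypothetical toy data: two query questions with ranked candidate IDs.
example_rankings = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
example_gold = {"q1": {"b"}, "q2": {"d"}}
example_mrr = mean_reciprocal_rank(example_rankings, example_gold)
```

In the toy data, the duplicate of q1 is ranked second and the duplicate of q2 is ranked first, so the reciprocal ranks are 1/2 and 1, and their mean is 0.75.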

Original language: English
Pages (from-to): 227-255
Journal: International Journal of Software Engineering and Knowledge Engineering
Volume: 32
Issue number: 2
DOIs: https://doi.org/10.1142/S0218194022500073
Publication status: Published - Feb 2022

Citation

Gao, W., Wu, J., & Xu, G. (2022). Detecting duplicate questions in Stack Overflow via source code modeling. International Journal of Software Engineering and Knowledge Engineering, 32(2), 227-255. https://doi.org/10.1142/S0218194022500073
