TY - GEN
T1 - Text Similarity Between Concepts Extracted from Source Code and Documentation
AU - Pauzi, Zaki
AU - Capiluppi, Andrea
N1 - Publisher Copyright:
© 2020, Springer Nature Switzerland AG.
PY - 2020/10/27
Y1 - 2020/10/27
AB - Context: Constant evolution in software systems often results in their documentation losing sync with the content of the source code. The traceability research field has often helped in the past by aiming to recover links between code and documentation when the two fall out of sync. Objective: The aim of this paper is to compare the concepts contained within the source code of a system with those extracted from its documentation, in order to detect how similar these two sets are. If the two sets are vastly different, the difference might indicate considerable ageing of the documentation and a need to update it. Methods: In this paper we reduce the source code of each of 50 software systems to a set of key terms containing the concepts of that system. At the same time, we reduce the documentation of each system to another set of key terms. We then use four different approaches for set comparison to detect how similar the sets are. Results: Using the well-known Jaccard index as the benchmark for the comparisons, we have discovered that the cosine distance has excellent comparative power, depending on the pre-training of the machine learning model. In particular, the SpaCy and the FastText embeddings offer up to 80% and 90% similarity scores, respectively. Conclusion: For most of the sampled systems, the source code and the documentation tend to contain very similar concepts. Given the accuracy of one pre-trained model (e.g., FastText), it also becomes evident that a few systems show a measurable drift between the concepts contained in the documentation and in the source code.
KW - Information Retrieval
KW - Natural language processing
KW - Text similarity
UR - http://www.scopus.com/inward/record.url?scp=85097388214&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-62362-3_12
DO - 10.1007/978-3-030-62362-3_12
M3 - Conference contribution
AN - SCOPUS:85097388214
SN - 978-3-030-62361-6
SN - 978-3-030-62362-3
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 124
EP - 135
BT - Intelligent Data Engineering and Automated Learning – IDEAL 2020 - 21st International Conference, 2020, Proceedings
A2 - Analide, Cesar
A2 - Novais, Paulo
A2 - Camacho, David
A2 - Yin, Hujun
PB - Springer Science and Business Media Deutschland GmbH
CY - Cham
T2 - 21st International Conference on Intelligent Data Engineering and Automated Learning, IDEAL 2020
Y2 - 4 November 2020 through 6 November 2020
ER -