Text Similarity Between Concepts Extracted from Source Code and Documentation

Zaki Pauzi*, Andrea Capiluppi*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review



Context: Constant evolution in software systems often results in their documentation losing sync with the content of the source code. Traceability research has often helped in the past by recovering links between code and documentation once the two have fallen out of sync. Objective: The aim of this paper is to compare the concepts contained within the source code of a system with those extracted from its documentation, in order to detect how similar these two sets are. If they are vastly different, the difference between the two sets might indicate a considerable ageing of the documentation, and a need to update it. Methods: In this paper we reduce the source code of each of 50 software systems to a set of key terms that contains the concepts of that system. At the same time, we reduce the documentation of each system to another set of key terms. We then use four different approaches for set comparison to detect how similar the sets are. Results: Using the well-known Jaccard index as the benchmark for the comparisons, we have discovered that the cosine distance has excellent comparative powers, depending on the pre-training of the machine learning model. In particular, the SpaCy and the FastText embeddings offer up to 80% and 90% similarity scores, respectively. Conclusion: For most of the sampled systems, the source code and the documentation tend to contain very similar concepts. Given the accuracy for one pre-trained model (e.g., FastText), it also becomes evident that a few systems show a measurable drift between the concepts contained in the documentation and in the source code.
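The set comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names and the example term lists are hypothetical, and the cosine measure here is computed over plain term-count vectors as a stand-in, whereas the paper relies on pre-trained embeddings (SpaCy, FastText) that are not reproduced here.

```python
import math
from collections import Counter


def jaccard_index(terms_a, terms_b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


def term_count_vectors(terms_a, terms_b):
    """Build aligned term-count vectors over the shared vocabulary.

    Assumption: bag-of-words counts replace the embedding vectors
    used in the paper, purely to keep the sketch self-contained.
    """
    vocab = sorted(set(terms_a) | set(terms_b))
    counts_a, counts_b = Counter(terms_a), Counter(terms_b)
    return [counts_a[t] for t in vocab], [counts_b[t] for t in vocab]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_u * norm_v)


# Hypothetical key terms for one system (illustrative only, not from the paper).
code_terms = ["parser", "token", "cache", "network"]
doc_terms = ["parser", "token", "cache", "install"]

u, v = term_count_vectors(code_terms, doc_terms)
print(round(jaccard_index(code_terms, doc_terms), 2))  # 0.6
print(round(cosine_similarity(u, v), 2))               # 0.75
```

In this toy case three of the five distinct terms are shared, so the Jaccard index is 3/5. A low score on either measure for a real system would be the kind of signal the abstract associates with documentation drift.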

Original language: English
Title of host publication: Intelligent Data Engineering and Automated Learning – IDEAL 2020 - 21st International Conference, 2020, Proceedings
Editors: Cesar Analide, Paulo Novais, David Camacho, Hujun Yin
Place of Publication: Cham
Publisher: Springer Science and Business Media Deutschland GmbH
Number of pages: 12
ISBN (Print): 9783030623616, 978-3-030-62362-3
Publication status: Published - 27-Oct-2020
Event: 21st International Conference on Intelligent Data Engineering and Automated Learning, IDEAL 2020 - Guimaraes, Portugal
Duration: 4-Nov-2020 to 6-Nov-2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12489 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: 21st International Conference on Intelligent Data Engineering and Automated Learning, IDEAL 2020


  • Information Retrieval
  • Natural language processing
  • Text similarity
