Extracting and Comparing Concepts Emerging from Software Code, Documentation and Tests

Zaki Pauzi*, Andrea Capiluppi

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review



Traceability in software engineering is the ability to connect different artifacts that have been built or designed at various points in time. Given the variety of tasks, tools and formats in the software lifecycle, an outstanding challenge for traceability studies is dealing with the heterogeneity of the artifacts, of the links between them and of the means to extract each. Using a unified approach for extracting keywords from textual information, this paper compares the concepts extracted from three software artifacts of the same system: source code, documentation and tests. The objectives are to detect similarities among the concepts that emerge, and to show how well aligned and synchronised the artifacts are. Using the components of three projects from the Apache Software Foundation, this paper extracts the concepts from 'base' source code, documentation, and tests (separated from the source code). The extraction is based on the keywords present in each artifact; we then run multiple comparisons (by calculating cosine similarities on features extracted by word embeddings) to detect how far the sets of concepts are similar or overlap. For similarities between code and tests, we found that using pre-trained language models of increasing dimension and corpus size correlates with higher similarity scores, giving higher averages and smaller ranges. FastText pre-trained embeddings scored the highest average, 97.33%, with the lowest range, 21.8, across all projects. Our approach was also able to quickly detect outliers, possibly indicating drifts in traceability within modules. For similarities involving documentation, the similarity score dropped considerably compared to that between code and tests per module, down to below 5%.
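The comparison step described in the abstract can be illustrated with a minimal sketch. It assumes that each artifact is represented by the mean of its keyword vectors, which the abstract does not state; the tiny three-dimensional embedding table, the keyword lists and the function names are all hypothetical stand-ins for pre-trained vectors such as FastText and for the keywords the paper actually extracts.

```python
import math

# Hypothetical 3-dimensional "embeddings" for illustration only.
# A real run would load pre-trained vectors (e.g. FastText) instead.
TOY_EMBEDDINGS = {
    "parser": [0.9, 0.1, 0.0],
    "token": [0.8, 0.2, 0.1],
    "stream": [0.1, 0.9, 0.2],
    "assert": [0.7, 0.1, 0.3],
    "install": [0.0, 0.2, 0.9],
}


def mean_vector(keywords, embeddings):
    """Average the vectors of the keywords that have an embedding.

    Averaging is an assumption of this sketch, not a detail taken
    from the paper.
    """
    vectors = [embeddings[k] for k in keywords if k in embeddings]
    if not vectors:
        raise ValueError("no keyword has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def artifact_similarity(keywords_a, keywords_b, embeddings=TOY_EMBEDDINGS):
    """Similarity between two artifacts, each given as a keyword list."""
    return cosine_similarity(
        mean_vector(keywords_a, embeddings),
        mean_vector(keywords_b, embeddings),
    )


# Hypothetical keyword sets for one module.
code_keywords = ["parser", "token", "stream"]
test_keywords = ["parser", "token", "assert"]
doc_keywords = ["install"]

code_vs_tests = artifact_similarity(code_keywords, test_keywords)
code_vs_docs = artifact_similarity(code_keywords, doc_keywords)
print(f"code vs tests: {code_vs_tests:.3f}")
print(f"code vs docs:  {code_vs_docs:.3f}")
```

With these made-up vectors the code and test keyword sets score higher than code and documentation, which only mirrors the direction of the reported finding; the numbers themselves carry no meaning.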

Original language: English
Title of host publication: 20th Belgium-Netherlands Software Evolution Workshop, BENEVOL 2021V
Subtitle of host publication: CEUR Workshop Proceedings
Editors: G. Catolino, D. Di Nucci, D.A. Tamburri
Publisher: CEUR Workshop Proceedings
Publication status: Published - 2021
Event: 20th Belgium-Netherlands Software Evolution Workshop, BENEVOL 2021 - Virtual, 's-Hertogenbosch, Netherlands
Duration: 7-Dec-2021 to 8-Dec-2021

Publication series

Name: CEUR Workshop Proceedings
ISSN (Print): 1613-0073


Conference: 20th Belgium-Netherlands Software Evolution Workshop, BENEVOL 2021
City: Virtual, 's-Hertogenbosch


  • Information retrieval
  • Natural language processing
  • Software traceability
  • Textual analysis
