TY - GEN
T1 - Extracting and Comparing Concepts Emerging from Software Code, Documentation and Tests
AU - Pauzi, Zaki
AU - Capiluppi, Andrea
N1 - Publisher Copyright:
© 2021 CEUR-WS. All rights reserved.
PY - 2021
Y1 - 2021
N2 - Traceability in software engineering is the ability to connect different artifacts that have been built or designed at various points in time. Given the variety of tasks, tools and formats in the software lifecycle, an outstanding challenge for traceability studies is to deal with the heterogeneity of the artifacts, the links between them and the means to extract each. Using a unified approach for extracting keywords from textual information, this paper aims to compare the concepts extracted from three software artifacts: source code, documentation and tests from the same system. The objectives are to detect similarities in the concepts emerged, and to show the degree of alignment and synchronisation the artifacts possess. Using the components of three projects from the Apache Software Foundation, this paper extracts the concepts from 'base' source code, documentation, and tests (separated from the source code). The extraction is done based on the keywords present in each artifact: we then run multiple comparisons (through calculating cosine similarities on features extracted by word embeddings) in order to detect how the sets of concepts are similar or overlap. For similarities between code and tests, we discovered that using pre-trained language models (with increasing dimension and corpus size) correlates to the increase in magnitude, with higher averages and smaller ranges. FastText pre-trained embeddings scored the highest average of 97.33% with the lowest range of 21.8 across all projects. Also, our approach was able to quickly detect outliers, possibly indicating drifts in traceability within modules. For similarities involving documentation, there was a considerable drop in similarity score compared to between code and tests per module - down to below 5%.
AB - Traceability in software engineering is the ability to connect different artifacts that have been built or designed at various points in time. Given the variety of tasks, tools and formats in the software lifecycle, an outstanding challenge for traceability studies is to deal with the heterogeneity of the artifacts, the links between them and the means to extract each. Using a unified approach for extracting keywords from textual information, this paper aims to compare the concepts extracted from three software artifacts: source code, documentation and tests from the same system. The objectives are to detect similarities in the concepts emerged, and to show the degree of alignment and synchronisation the artifacts possess. Using the components of three projects from the Apache Software Foundation, this paper extracts the concepts from 'base' source code, documentation, and tests (separated from the source code). The extraction is done based on the keywords present in each artifact: we then run multiple comparisons (through calculating cosine similarities on features extracted by word embeddings) in order to detect how the sets of concepts are similar or overlap. For similarities between code and tests, we discovered that using pre-trained language models (with increasing dimension and corpus size) correlates to the increase in magnitude, with higher averages and smaller ranges. FastText pre-trained embeddings scored the highest average of 97.33% with the lowest range of 21.8 across all projects. Also, our approach was able to quickly detect outliers, possibly indicating drifts in traceability within modules. For similarities involving documentation, there was a considerable drop in similarity score compared to between code and tests per module - down to below 5%.
KW - Information retrieval
KW - Natural language processing
KW - Software traceability
KW - Textual analysis
UR - http://www.scopus.com/inward/record.url?scp=85123481087&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85123481087
T3 - CEUR Workshop Proceedings
BT - 20th Belgium-Netherlands Software Evolution Workshop, BENEVOL 2021V
A2 - Catolino , G.
A2 - Di Nucci , D.
A2 - Tamburri , D.A.
PB - CEUR Workshop Proceedings
T2 - 20th Belgium-Netherlands Software Evolution Workshop, BENEVOL 2021
Y2 - 7 December 2021 through 8 December 2021
ER -