TY - JOUR
T1 - Detecting Java software similarities by using different clustering techniques
AU - Capiluppi, Andrea
AU - Di Ruscio, Davide
AU - Di Rocco, Juri
AU - Nguyen, Phuong T.
AU - Ajienka, Nemitari
PY - 2020/6/1
Y1 - 2020/6/1
N2 - Background: Research on empirical software engineering has increasingly been conducted by analysing and measuring vast amounts of software systems. Hundreds, thousands and even millions of systems have been (and are) considered by researchers, often within the same study, in order to test theories, demonstrate approaches or run prediction models. A much less investigated aspect is whether the collected metrics might be context-specific, or whether systems should be better analysed in clusters. Objective: The objectives of this study are (i) to define a set of clustering techniques that might be used to group similar software systems, and (ii) to evaluate whether a suite of well-known object-oriented metrics is context-specific, and whether its values differ across the defined clusters. Method: We group software systems based on three different clustering techniques, and we collect the values of the metrics suite in each cluster. We then test whether clusters are statistically different from each other, using the Kolmogorov-Smirnov (KS) hypothesis test. Results: Our results show that, for two of the used techniques, the KS null hypothesis (i.e., that the clusters come from the same population) is rejected for most of the metrics chosen: the clusters that we extracted, based on application domains, show statistically different structural properties. Conclusions: The implications for researchers can be profound: metrics and their interpretation might be more sensitive to context than acknowledged so far, and application domains represent a promising filter to cluster similar systems.
KW - Application Domains
KW - Expert Opinions
KW - FOSS (Free and open-source software)
KW - Latent Dirichlet Allocation
KW - Machine Learning
KW - OO (object-oriented)
UR - https://www.mendeley.com/catalogue/bce22923-02cc-34ff-8594-c52ab84dadb7/
DO - 10.1016/j.infsof.2020.106279
M3 - Article
VL - 122
JO - Information and Software Technology
JF - Information and Software Technology
SN - 0950-5849
M1 - 106279
ER -