Multi-granular software annotation using file-level weak labelling

Cezar Sas*, Andrea Capiluppi

*Corresponding author for this work

Research output: Contribution to journal > Article > Academic > peer-review

1 Citation (Scopus)
65 Downloads (Pure)

Abstract

Context: One of the most time-consuming tasks for developers is the comprehension of new code bases. An effective approach to aid this process is to label source code files with meaningful annotations, which can help developers understand the content and functionality of a code base more quickly. However, most existing solutions for code annotation focus on project-level classification; manually labelling individual files is time-consuming, error-prone and hard to scale.

Objective: The work presented in this paper aims to automate the annotation of files by leveraging project-level labels, and to use the file-level annotations to annotate items at coarser levels of granularity, for example packages and the whole project.

Method: We propose a novel approach that annotates source code files using weak labelling, followed by hierarchical aggregation. We investigate whether this approach is effective in achieving multi-granular annotations of software projects.

Results: Our evaluation combines human assessment and automated metrics to assess the quality of the annotations. Our approach correctly annotated 50% of files and more than 50% of packages. Moreover, the information captured at the file level allowed us to identify, on average, three new relevant labels for any given project.

Conclusions: We can conclude that the proposed approach is a convenient and promising way to generate noisy (not precise) annotations for files. Furthermore, hierarchical aggregation effectively preserves the information captured at the file level, and this information can be propagated to packages and to the overall project itself.
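
The abstract does not spell out how file-level annotations are aggregated, so the following is only a minimal sketch of one plausible form of hierarchical aggregation, not the authors' implementation. It assumes each file already carries weak (noisy) label scores, and it averages those scores over every enclosing directory up to the project root. The function names `aggregate` and `top_labels`, the example paths, the labels and the scores are all hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def aggregate(file_scores):
    """Average file-level label scores over each enclosing package.

    file_scores maps a relative file path to a {label: score} dict.
    The result maps each ancestor directory (the project root is ".")
    to the mean score per label over all files beneath it.
    Averaging is an assumption made for illustration only.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for path, scores in file_scores.items():
        # Walk every ancestor directory, ending at "." for the project.
        for parent in PurePosixPath(path).parents:
            package = str(parent)
            counts[package] += 1
            for label, score in scores.items():
                sums[package][label] += score
    return {
        package: {label: total / counts[package] for label, total in labels.items()}
        for package, labels in sums.items()
    }


def top_labels(scores, k=1):
    """Return the k labels with the highest aggregated score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical weak labels for three files in a small project.
file_scores = {
    "src/db/conn.py": {"database": 0.75, "networking": 0.25},
    "src/db/query.py": {"database": 1.0},
    "src/ui/view.py": {"gui": 1.0},
}
result = aggregate(file_scores)
print(top_labels(result["src/db"]))  # package-level annotation
print(top_labels(result["."], k=2))  # project-level annotation
```

In this sketch a label that is only weakly present in individual files can still surface at the package or project level, which is one way the file-level information could yield labels beyond those originally assigned to the project.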

Original language: English
Article number: 12
Number of pages: 34
Journal: Empirical Software Engineering
Volume: 29
Issue number: 1
Early online date: 30-Nov-2023
Publication status: Published - Jan-2024

Keywords

  • File-level labelling
  • Program comprehension
  • Software classification
  • Weak labelling
