Using the lexicon from source code to determine application domain

Andrea Capiluppi, Nemitari Ajienka, Nour Ali, Mahir Arzoky, Steve Counsell, Giuseppe Destefanis, Alina Miron, Bhaveet Nagaria, Rumyana Neykova, Martin Shepperd

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; Academic; peer-review



Context: The vast majority of software engineering research is reported independently of the application domain: the use of techniques and tools is reported without any domain context. As reported in previous research, this has not always been so: early in the computing era, the research focus was frequently specific to an application domain (for example, scientific computing or data processing). Objective: We believe determining the research context is often important. Therefore, we propose a code-based approach to identify the application domain of a software system via its lexicon. We compare its use against the plain textual description attached to the same system. Method: Using a sample of 50 Java projects, we obtained i) the description of each project (e.g., its ReadMe file), ii) the lexicon extracted from its source code, and iii) a list of its main topics extracted with the Latent Dirichlet Allocation (LDA) modelling technique. We assigned a random subset of these data items to different researchers (i.e., 'experts'), and asked them to assign each item to one or more application domains. We then evaluated the precision and accuracy of the three techniques. Results: Using the agreement levels between experts, we observed that the 'baseline' dataset (i.e., the ReadMe files) obtained the highest average agreement, but we also observed that the three techniques had the same mode and median agreement levels. Additionally, in the cases where no agreement was reached for the baseline dataset, the two other techniques provided sufficient additional support. Conclusions: We conclude that the source code is sufficient for determining the application domain, so that classification is possible without special documentation requirements.
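The abstract does not say how the lexicon was extracted from the source code. As a rough illustration only, the sketch below shows one common way to do it: collect identifiers from Java source text, split them on camelCase and underscores, and count the resulting terms. The function names, the regular expressions, and the small keyword stop list are assumptions made for this example, not the authors' pipeline.

```python
import re
from collections import Counter

# Hypothetical stop list: a few Java keywords that carry no domain meaning.
JAVA_KEYWORDS = {
    "public", "private", "protected", "class", "static", "final", "void",
    "int", "double", "new", "return", "import", "package", "this", "if",
    "else", "for", "while", "string",
}


def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase terms."""
    terms = []
    for part in identifier.split("_"):
        # Matches acronyms (e.g. "HTTP"), capitalised words, and lowercase runs.
        terms.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part))
    return [t.lower() for t in terms]


def extract_lexicon(java_source):
    """Count the non-keyword terms found in identifiers of a Java source string."""
    identifiers = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", java_source)
    counts = Counter()
    for identifier in identifiers:
        for term in split_identifier(identifier):
            # Drop keywords and very short fragments (assumed threshold).
            if term not in JAVA_KEYWORDS and len(term) > 2:
                counts[term] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "public class InvoiceService { private double taxRate; "
        "void computeInvoiceTotal() {} }"
    )
    print(extract_lexicon(sample).most_common(3))
```

In a term list like this, words such as "invoice" and "tax" hint at a finance or accounting domain. This simple tokeniser also picks up words from comments and string literals, which a real extraction step might treat separately.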

Original language: English
Title of host publication: Proceedings of The International Conference on Evaluation and Assessment in Software Engineering (EASE 2020), Trondheim, Norway, 15-17 April 2020
Publisher: ACM Press
Number of pages: 10
ISBN (Electronic): 9781450377317
Publication status: Published - 20-Apr-2020
Externally published: Yes
EASE 2020: International Conference on Evaluation and Assessment in Software Engineering
- Trondheim, Norway
Duration: 15-Apr-2020 to 17-Apr-2020




  • application domains
  • expert judgement
  • latent dirichlet allocation
  • source code
