The ‘age of abundance’ of historical newspaper material poses new challenges to historical research. Historical approaches to selecting and analyzing newspapers, rooted in the assumption that available material is scarce, had to be replaced with social scientific methods (Nicholson 2013; Broersma 2009). Yet these manual quantitative methods are still highly time-consuming and can cover only a small part of the available material (Harbers 2014). Automated analysis could potentially alleviate this issue. However, although automated methods have great appeal to researchers (Allen, Waldstein & Zhu 2008; Grimmer & Stewart 2013), such research is mostly done in information science and linguistics and seldom takes a press-historical perspective (Broersma 2009; Arbesman 2013). Moreover, the emphasis has mostly been on topic modeling (Lee & Myaeng 2002), whereas attention to the automatic classification of style and genre is scarce. More attention to style and genre would benefit a range of research fields, as it offers insight into the mode of expression of newspapers and sheds light on their discursive context (Handford 2010).
The DJS project aims to 1) connect existing metadata from a large-scale manual content analysis of three Dutch newspapers to the corresponding digitized articles in Delpher, and subsequently 2) explore the possibilities of automating the analysis of the historical development (1880s-1930s) of journalistic style through (supervised/validated) machine learning, focusing on the classification of genre as an indicator of style (Grimmer & Stewart 2013; Ikonomakis, Kotsiantis & Tampakas 2005). It follows up on the NWO-funded research project into the historical development of journalistic styles (1880-2005) (VIDI project Broersma, 2008-2013). Furthermore, DJS functions as a pilot for a larger research proposal on the automatic classification of newspaper styles, which I intend to write with Prof. Broersma.
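By way of illustration, the supervised and validated set-up could look roughly like the sketch below: a classifier is trained on manually coded articles and then checked against coded articles it has not seen. The multinomial naive Bayes model, the function names, the genre labels and the toy texts are all assumptions made for this example; the project does not commit to this algorithm, and the real input would be the Delpher article texts linked to the existing codes.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Deliberately naive tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def train(articles):
    """Fit a multinomial naive Bayes model on (text, genre) pairs."""
    word_counts = defaultdict(Counter)  # genre -> word frequencies
    genre_counts = Counter()            # genre -> number of coded articles
    for text, genre in articles:
        genre_counts[genre] += 1
        word_counts[genre].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, genre_counts, vocab


def predict(model, text):
    """Return the genre with the highest posterior log score."""
    word_counts, genre_counts, vocab = model
    total_articles = sum(genre_counts.values())
    best_genre, best_score = None, -math.inf
    for genre in genre_counts:
        # Log prior plus log likelihoods with add-one (Laplace) smoothing.
        score = math.log(genre_counts[genre] / total_articles)
        denom = sum(word_counts[genre].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[genre][word] + 1) / denom)
        if score > best_score:
            best_genre, best_score = genre, score
    return best_genre


def accuracy(model, held_out):
    """Share of held-out coded articles whose genre is predicted correctly."""
    hits = sum(predict(model, text) == genre for text, genre in held_out)
    return hits / len(held_out)


# Toy, invented examples with hypothetical genre labels (not project data).
coded_articles = [
    ("the minister said the council met yesterday", "report"),
    ("the council voted on the budget yesterday", "report"),
    ("we asked him why he wrote the book", "interview"),
    ("he told us why he left the stage", "interview"),
]
held_out_articles = [
    ("the council met on the budget", "report"),
    ("why he left", "interview"),
]

model = train(coded_articles)
print(accuracy(model, held_out_articles))
```

The held-out comparison is the step that makes the approach "validated": the manual codes serve both as training material and as the benchmark against which automated genre labels are judged.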
The VIDI research entailed a manual quantitative content analysis of 9 newspapers in 3 countries (NL, GB, FR). This resulted in a database with coded metadata about 105,000 articles (ca. 33,000 Dutch articles) in 6 sample years (2 constructed weeks for each of the 9 dailies in 1885, 1905, 1925, 1965, 1985, 2005). The articles were coded for a range of manifest and latent variables (size, sourcing, topic, genre, author, images) to map the nuances of a general shift from a reflective reporting style to an event-centered style (Harbers 2014).
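For readers unfamiliar with constructed-week sampling, the following minimal sketch shows the general idea: for every publication weekday, dates are drawn at random from the whole sample year, so that each constructed week contains one Monday, one Tuesday, and so on. The function name, the Monday-to-Saturday default and the fixed random seed are assumptions for this illustration; the exact sampling procedure of the VIDI project is not described here.

```python
import random
from datetime import date, timedelta


def constructed_weeks(year, n_weeks=2, weekdays=range(6), seed=0):
    """Draw n_weeks constructed weeks from one calendar year.

    weekdays follows date.weekday() numbering (0 = Monday ... 6 = Sunday).
    The Monday-Saturday default is an assumption about publication days.
    """
    rng = random.Random(seed)

    # Group every date in the year by its weekday.
    by_weekday = {d: [] for d in weekdays}
    day = date(year, 1, 1)
    while day.year == year:
        if day.weekday() in by_weekday:
            by_weekday[day.weekday()].append(day)
        day += timedelta(days=1)

    # For each weekday, draw n_weeks distinct dates without replacement.
    picks = {d: rng.sample(dates, n_weeks) for d, dates in by_weekday.items()}

    # Week i combines the i-th pick of every weekday, sorted by calendar date.
    return [sorted(picks[d][i] for d in by_weekday) for i in range(n_weeks)]


# Two constructed weeks for one sample year.
for week in constructed_weeks(1885):
    print([d.isoformat() for d in week])
```

Drawing each weekday separately spreads the sample over the year while keeping weekly publication rhythms, such as larger Saturday editions, represented in every constructed week.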