Background

The field of genre and text analysis has been shaped by influential figures who contributed through non-computational, theoretical approaches. Eagleton (2017) addresses the fundamental question “What Is Literature?”, offering theoretical insights into the nature of literary works in general. The book Theory and Practice in the Eighteenth Century explores how genres were perceived and used during the Enlightenment (Sadow, 2008). Similarly, Duff’s “Romanticism and the Uses of Genre” examines the role and evolution of literary genres in 18th-century England (Duff, 2009). Moretti’s “Distant Reading” (2013) introduces the concept of distant reading, which emphasizes macroscopic perspectives for understanding literary trends and patterns. Offered as an alternative to traditional close reading, the approach encourages a broader examination of literary corpora to discern larger patterns and structures. This theoretical shift laid the groundwork for embracing computational methods in literary analysis.
Computational methods were later applied to genre analysis. Underwood’s (2016) computational work on genre theory and historicism exemplifies how theoretical frameworks can be connected with computational techniques. Zhang et al.’s (2022) study of sequential genre change in 18th-century texts is another example. One computational approach pursued for genre classification is Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018). Kenton and Toutanova (2019) applied BERT to language understanding and highlighted the model’s effectiveness in handling diverse language datasets. Furthermore, Akalp et al. (2021) demonstrated how BERT can be used for texts specific to the music domain.
Moreover, some scholars have investigated the potential impact of Optical Character Recognition (OCR) quality on genre classification with BERT, for example in the domain classification of book excerpts (Jiang et al., 2021) or of eighteenth-century books (Hill & Hengchen, 2019). These analyses underscore the importance of addressing data quality issues when applying advanced language models, an issue we also faced in this project.