Autumn 2023: Section-Level Genre Analysis

Division of Labour and Reflection on Learning

Antwan: I focused mainly on developing the classifier and on familiarising myself with the existing codebase. I started by reading the paper on the ECCO-BERT classifier and then attempted to replicate its functionality with our dataset. Creating a working classifier was the most time-consuming part of my project: bugs and subtleties kept popping up, the reference codebase was hard to read and frankly quite confusing, and the mapping of numerical values to their corresponding genres was convoluted. However, once training and prediction started working, modifying the code became easy. Matti and Enes were also very helpful with the coding process: Matti contributed by assigning weights to the genres, and Enes spotted bugs in the code.

In addition, I thought about which plots would be useful to include. This was a rewarding project altogether. I have no background in digital humanities, so seeing how data science is applied and how it can contribute to science was interesting. All in all, I enjoyed the project, despite not being physically present at a single meeting.

Enes: Like my fellow project members, I contributed to the labeling. To reduce noise in the data, I wrote code to filter some of it out. After the model predicted labels for the sections, I formalized research questions for the project. Then, in collaboration with Matti, I looked for suitable ways of visualizing the results so that they could be interpreted, and with his great help we prepared some plots. In general, my role was that of a connector between the humanities and computational sides of the project: I focused on how to integrate the data scientists' contributions while taking the specific requirements of the humanities into account. I also worked on the analysis and edited the text after all project members had written their own parts. I encountered some challenges during the project. One was finding interesting analysis topics; another was interpreting the results from a humanities perspective. There were many possible analysis topics to pursue, and it took me a while to put the results into a frame. The project gave me insight into how a digital humanities project works in practice, and it was a good experience to see how it developed over time. I also gained a small but highly valuable amount of experience in dealing with noisy data in digital humanities projects. I would like to thank all of my friends in the project. It was a pleasure to work with them.

Matti: My primary role was to help with whatever issues came up. I created scripts for tasks such as exploring, processing, and visualizing data. For example, I made scripts for finding interesting books by genre or author, which all of us could use in our research. I also made some adjustments to improve our classifier's ability to predict genres. In addition, I wrote the introduction and conclusion chapters. I'd recommend this course to students from any background. It gave me practical experience in addressing real-world problems by collaborating with people from diverse academic backgrounds, and it allowed me to get familiar with the entire process of using a large language model.

Ville: My role in this project was twofold. I had the unofficial role of project leader, which mainly consisted of steering the discussion during our group meetings, coordinating the group members' assignments for each week, and keeping the project on schedule so that it finished on time. The second and bigger part was that I was responsible for the data. Once the labeling process was finished, I was responsible for getting the training data from our Label Studio project to the classifier, for exploratory data analysis, and for obtaining the 3000-book library to run our classifier on.

The data wrangling for this project was often far from simple: the data came in multiple formats, some of them challenging, and was stored in several different places. This gave me valuable insight into how data can be used to build full-scale language classifiers and into the compromises that data limitations force on projects of this type. I also learned how to work with people whose areas of expertise differ from mine and how to make the best use of these specialities to finish a project on time without compromising on quality.

Tatiana: In this project I focused on the possibilities of stylometry and endeavored to identify patterns in genre categorization by measuring punctuation frequency. My contributions also included labeling data with my colleagues and documenting this process in the study. I actively engaged in researching the general questions that guided our project and tried to gauge the potential of the ECCO-BERT classifier.

This experience has not only broadened my knowledge but also allowed me to work in a team with diverse academic backgrounds and to contribute to a collaborative effort in a digital humanities context.

I am grateful for the enriching experience gained during this project, and I extend my heartfelt thanks to our group, my colleagues Antwan, Enes, Matti and Ville, and our curators Lidia and Mikko. This study has been instrumental in advancing my skills and providing practical insights into the opportunities of digital humanities research. I look forward to applying these skills and insights in my future endeavors and to continuing my research on the rhythmic footprint of texts.


University of Helsinki
