Division of Labor

Table of Contents

  1. Introduction and Background
  2. Data
  3. Methods and Key Measures
  4. Results
  5. Discussion
  6. Limitations
  7. Implications and Conclusions
  8. Division of Labor
  9. References

Division of Labor and Learning Reflection

Ilia

(1) Explained NORPPA in the blog post. (2) Annotated the Tonson dataset in the format used by NORPPA. (3) Adapted NORPPA (originally a content-based image recognition system) to clustering, as it contains only unsupervised steps. (4) Implemented embedding generation with NORPPA on Puhti for headpieces and initials from the Tonson dataset, although, unfortunately, there was no time to test the embeddings due to problems with Puhti. (5) Compiled a guide on how to use NORPPA in the project, so future teams can catch up. (6) Compiled a guide on Puhti best practices, which should make it easier for future groups to start working on the project smoothly, without the troubles we had to go through.

Moving to the reflection on learning, I believe the most notable thing I learned during the project was connected more to the DevOps side than to math, statistics, or the use of data science algorithms. Right from the start I had trouble installing packages for NORPPA to work correctly, as it required not only specific packages (for example, to create Fisher Vectors) but also a specific Python version to install those packages. For instance, one Sunday Victor and I met to work on the project for the whole day, which I instead had to spend fighting with Puhti and figuring out how to set up a working environment. I had to go through the documentation extensively and infer the correct way to work on Puhti, which I compiled into a guide that will hopefully help future groups start working smoothly. I believe the experience of installing packages will also help me in my future professional life, as I know first-hand from my job that there are always problems with installing packages.

Another thing I did during the project was apply knowledge gained in my statistics and ML courses. The entry bar for image clustering is high and requires a substantial amount of background knowledge. Luckily, I had worked on NORPPA, which was originally designed for content-based image recognition, and the skills in advanced statistics and machine learning I developed at the University of Helsinki helped me identify that the same approach could be used for image clustering. I also figured out how the advanced math behind the approach works; for instance, how Fisher Vectors combine the local features of patches into global features of whole images. Finally, I identified the advantages and shortcomings of such an approach applied to the problem at hand: printmark classification.
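Since the pipeline builds on Fisher Vectors, a minimal sketch may help illustrate how local patch descriptors get aggregated into one global image vector. This is a simplified textbook formulation (gradients with respect to the GMM means only), not NORPPA's actual implementation; the dimensions and data are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Aggregate local descriptors (N x D) into one global image vector.

    Simplified Fisher Vector: gradient statistics w.r.t. the GMM means
    only, normalized by the mixture weights.
    """
    q = gmm.predict_proba(descriptors)            # (N, K) soft assignments
    diff = descriptors[:, None, :] - gmm.means_   # (N, K, D) offsets
    fv = (q[:, :, None] * diff / np.sqrt(gmm.covariances_)).sum(axis=0)
    fv /= descriptors.shape[0] * np.sqrt(gmm.weights_)[:, None]
    return fv.ravel()                             # (K * D,) global vector

# Toy example: 200 random "local patch" descriptors, 16-D each
rng = np.random.default_rng(0)
local = rng.normal(size=(200, 16))
gmm = GaussianMixture(n_components=4, covariance_type="diag",
                      random_state=0).fit(local)
global_vec = fisher_vector(local, gmm)
print(global_vec.shape)  # (64,) = components x descriptor dimension
```

The key point is that the output dimension depends only on the GMM size and the descriptor dimension, not on how many patches an image has, which is what makes the global vectors comparable across images.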

Citlali

My three main responsibilities in this project were the preparation of data, visualizations of the metadata, and preparation of the blog. I worked on the database connection and exploration, and during this phase helped others on the team access the data through the connection (credentials, running the script, SQL queries, etc.). I then explored the various tables available in order to identify potential options for research from a humanities perspective, working with Annika to provide datasets that she deemed interesting given her expertise. Given a request in plain English, I would transform it into a SQL query, fetch the data, and export it as a CSV so that she could use it. I also wrote scripts to identify the ECCO IDs for all the images in the collection and then paired them with the various databases I had identified in order to develop the ~2,000 image:metadata pairings that we could do analysis on.

In the next step, I took the metadata for these selected images and worked on fully developing the Data section of the report, as it is not only important from the humanities perspective (to understand that most documents are in English, nearly entirely from London, etc.), but also for contextualizing how our audience receives the clusters and the cluster images, for example by understanding that these images did indeed come from before the year 1800 and that some (but very few) come from India and the United States. Creating the interactive geographical visualization was a very exciting venture, as I got to use Tableau, and I'm very pleased with the outcome: not only visualizing the given data, but manipulating it so that users can see useful insights, like the dates represented by each city and the total number of documents coming from that city. These descriptions help set the stage, of course, for the clusters and their analysis.
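The request-to-CSV workflow can be sketched roughly as follows. The table and column names here are hypothetical rather than the actual ECCO/ESTC schema, and an in-memory SQLite database stands in for the project's real connection:

```python
import sqlite3
import pandas as pd

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (ecco_id TEXT, city TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [("0001", "London", 1750), ("0002", "Dublin", 1782)],
)

# A plain-English request ("all documents printed before 1800") as SQL...
query = "SELECT ecco_id, city, year FROM documents WHERE year < 1800"
df = pd.read_sql_query(query, conn)

# ...exported as CSV (here to a string; in the project, to a file)
csv_text = df.to_csv(index=False)
print(csv_text)
```

Going through a DataFrame rather than raw cursor rows keeps the column names attached, so the exported CSV is self-describing for whoever receives it.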
My initial goal was to do a visualization of the resulting clusters, but since results would come very late in the term (just one to two weeks before compiling the blog), I decided I needed to switch paths, which is why I pivoted towards visualizing the metadata, and especially putting time into crafting the interactive visualization. Additionally, I took responsibility for the blog early on, requesting access for the full team and preparing a short demo during a team meeting of how to make changes in WordPress. I also created an initial outline for the blog in Google Docs, with suggestions on how to craft the story throughout all of our individual parts. Finally, I was responsible for re-formatting the blog (e.g. numbering of figures, creation of galleries) and porting it into WordPress.

Throughout the course, a lot of additional work went in that doesn't fall into these categories, such as being present in the online team meetings and making suggestions or offering help. As I had never done any computer vision work before, I did my best to keep up with the progress updates made by Olavi and Victor, looking over their results and requesting one-on-ones with them and Annika to ensure that I could use their insights in pulling the different parts of the blog together so that there would be continuity. I think our team did a good job working together despite some mishaps. The biggest learning for me was how to use a supercomputer. At the beginning I was very intimidated by CSC, but by the end I was very comfortable spinning up instances, creating notebooks (with libraries like pyplot, sklearn, seaborn, numpy, and pandas), pulling and manipulating data, and shutting things down. The project also helped me solidify the idea that initial ideas don't always work out, so you have to keep your eyes open and pivot quickly so that you can keep contributing to the team effort.
I also learned about working in an interdisciplinary group, and that having your own responsibility (while making sure you pull your weight, of course) can be a significant plus compared to a group of people who are all doing, and interested in, the same thing. During the course I also learned a lot about working with massive datasets and what comes with that (a requirement for patience, CSC instances crashing, computers overheating, trial and error). I am also really happy to have become familiar with the computer vision workflow and to understand the motivation for contextualizing results (from Annika) and for formulating a research question prior to the technical work. Finally, if I were doing this over, I would say everyone should individually have a plan B and a plan C, as there were several meetings where a pivot was required and we were left with our hands in the air; this is a practice I will adopt in future projects like this one.

Victor

(1) Evaluated the previous approach to embedding generation done by Ruilin, reducing it to the necessary components and documenting them. (2) Explored other embedding generation and clustering techniques; the main techniques examined were SPICE (a deep clustering algorithm) and neural embedding generation. (3) Embedding generation: explored generation with pre-trained neural networks and subsequent dimensionality reduction with PCA. Tested networks: ResNet50 pre-trained on ImageNet (from PyTorch) and the YOLOv8 model trained by Olavi. (4) Evaluated different clustering techniques on the data, along with hyperparameter tuning for the final model (OPTICS). Tested algorithms: k-means, DBSCAN, OPTICS. (5) Created visualizations for the PCA dimensionality evaluation and the obtained clusters.
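Step (3) above, reducing pre-trained network embeddings with PCA, can be sketched as follows. Random vectors stand in for the ResNet50 embeddings (2048-dimensional in the project), and the target dimension of 50 is illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for embeddings from a pre-trained network: in the project
# these came from ResNet50 (2048-D); random data keeps the sketch runnable.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 2048))

pca = PCA(n_components=50, random_state=0).fit(embeddings)
reduced = pca.transform(embeddings)

# Cumulative explained variance is one way to evaluate the target dimension
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(reduced.shape)  # (500, 50)
```

Plotting `cumulative` against the component count is the usual way to judge how much structure survives the reduction before handing the vectors to a clustering algorithm.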

Learning reflections: To be honest, I didn't expect the challenges I faced when I started the project. I don't mean this in a bad way; on the contrary, learning comes with challenges, and I feel like I had my fair share of learning through my work here. Although I have previous experience with computer vision, this was my first venture into using unsupervised learning (UL) with images, and it proved to be quite challenging.

I would summarize my main challenges in three areas: data preparation (and, more generally, documentation), unsupervised learning (and, more generally, learning), and general resource organization. On the first point, we started with no documentation other than asking Ruilin where things were, or just diving deep into the many lines of code written before us. This made the ramp-up quite time-consuming, and it can also hurt team morale, as we faced many questions about what means what, what we have to do, and so on. We were able to overcome these challenges eventually, in part thanks to the weekly meetings (which gave me motivation to push at least something out each week), and thanks to what some may call sisu. Communication was a crucial part of this project, and although we had our ups and downs, it helped at least me understand what to do. In particular, in our weekly meetings everyone could speak about what had confused them, and many times another person could come in and clarify what was going on; this was a very valuable component of our group work.

On the second point, unsupervised learning hadn't been the focus of my professional development, and I never thought it could be this complicated to understand. I struggled to even understand what OPTICS was doing, and on top of this we also had to do a lot of multidisciplinary learning, as we had to put the clustering in its historical context. Again, communication was an important part of this step, as talking with Annika and Citlali helped me understand what was going on with the data and eventually helped me land on how to tune our final approach. I have learned a lot about unsupervised learning, and I appreciate the help I received from others to get there.
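For readers in the same position, a minimal OPTICS example may help build intuition: on two dense, well-separated blobs with some scattered points, the algorithm typically recovers the blobs as clusters and labels sparse points as noise (-1). The data here is synthetic and the parameters are illustrative, not the values tuned for our dataset:

```python
import numpy as np
from sklearn.cluster import OPTICS

# Synthetic data: two dense, well-separated blobs plus scattered points
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.3, size=(50, 2))
scatter = rng.uniform(-2.0, 7.0, size=(10, 2))
X = np.vstack([blob_a, blob_b, scatter])

# min_samples controls how many neighbors a point needs to count as dense;
# points that never join a dense region are labeled -1 (noise)
labels = OPTICS(min_samples=10).fit(X).labels_
print(sorted(set(labels)))
```

Unlike k-means, OPTICS does not need the number of clusters up front and tolerates noise, which is why it suited our data better than algorithms that force every image into some cluster.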

Finally, the bane of anyone who has to deal with technology: Puhti can sometimes be hard to work with. Mainly, if you want access to a GPU, you will sometimes have to queue for hours before you can start a session. Once I got in, I would sometimes run into strange problems with packages, and so on. However, I would attribute most of my problems simply to not having enough experience in an HPC environment, and I have learned a lot in this regard as well.
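For future groups: GPU access on Puhti goes through the Slurm batch system. A sketch of a minimal batch script is below; the project number, resource limits, and script name are placeholders to adapt, not our actual values:

```shell
#!/bin/bash
#SBATCH --account=project_XXXXXXX   # placeholder: your CSC project number
#SBATCH --partition=gpu             # Puhti's GPU partition
#SBATCH --gres=gpu:v100:1           # request one V100 GPU
#SBATCH --time=02:00:00             # wall-time limit; shorter queues faster
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G

module load pytorch                 # CSC's pre-built PyTorch environment
srun python generate_embeddings.py  # hypothetical script name
```

Submitted with `sbatch script.sh`; a shorter `--time` request generally spends less time waiting in the queue.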

All in all, this project has been a great learning experience (and hey, now I can tell my friends about 18th-century books and their characteristics). Although I wish we could have gotten better results, sometimes we have to make do with what we get, and I hope that future groups are able to take as much out of this project as I did.

Olavi

(1) Tested access to the ECCO metadata with Annika. (2) Created a page-level mapping from Dataset 1 to the ECCO metadata as a single CSV file. (3) All YOLO object detection tasks: i) studied YOLO models; ii) installed YOLOv8; iii) transformed the old annotation format of Dataset 1 to the YOLO annotation format; iv) created training and validation sets with data augmentation to increase the quantity of some classes; v) trained and validated a YOLOv8s model for 40 epochs to detect 16 categories, using pre-trained COCO128 weights. (4) With the help of the annotations, extracted printmarks from Dataset 1 as separate images for the clustering task. (5) Generated the actual embeddings of the Dataset 1 printmarks for clustering, based on Victor's code. (6) Extracted the backbone of the trained YOLOv8s model for Victor to generate embeddings and clusters (this didn't work that well). (7) Attempted to find a way to easily extract more data from the 1 TB ECCO zip file, but with no success; left this task for Ruilin and future projects.
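Step iii) above, converting annotations to the YOLO format, boils down to expressing each bounding box as a normalized center point plus width and height. A small sketch (the helper name and example numbers are our own, not from the project code):

```python
def to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert a pixel bounding box (corner coordinates) to the YOLO
    annotation format: (x_center, y_center, width, height), each value
    normalized by the image size into [0, 1]."""
    x_c = (xmin + xmax) / 2 / img_w
    y_c = (ymin + ymax) / 2 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return x_c, y_c, w, h

# A 200x100 px box in the top-left corner of an 800x400 page scan
print(to_yolo(0, 0, 200, 100, 800, 400))  # (0.125, 0.125, 0.25, 0.25)
```

In the label files, each line then carries a class index followed by these four numbers, which is what makes the annotations resolution-independent.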

Learning reflections. First, I learned about the printing practices of old books and especially about different types of printmarks. For me, this was a completely new angle on the domain of old literature, which made me hesitant at first, but in the end it was a great experience. Secondly, it's always interesting (and a bit stressful) to form a group and work with people you didn't know before. Typically, the challenge is to establish practices that work for the group and get the flow going. Luckily our group turned out to be motivated, and communication was easy. For the technical part, I dove into the deep end on purpose and chose to implement the data preprocessing functionality and the classification model with technology I hadn't used before (but hey, YOLO, I guess :)). My backup plan was that if the original task didn't work out, I would focus on supporting others in their tasks. In the end, I did both, which was nice, since at times data wrangling and repeatedly testing the machine learning model is excruciating. I'm happy with our final output and that we overcame lots of challenges as a group.

Annika

(1) Conducted the literature review. (2) Found a suitable humanities question. (3) Tested access to the ECCO metadata with Olavi. (4) Searched for and collected possible biases in the dataset. (5) Searched for information about individual printers expected to be of relevance (most of them were not in the meaningful clusters, so this work was for the most part not used). (6) Enriched the meaningful clusters with relevant metadata from ESTC and ECCO. (7) Analyzed the results of the clustering with Dataset 1.

During the project course, I especially improved my communication skills with regard to exchanging ideas with students from a completely different background. After all, I would say that being the only humanities student posed some problems for me, because I was not able to discuss my problems with anybody and struggled entirely on my own with regard to the humanities questions. I probably should have discussed my problems more with the other students in the group, but I was often not able to explain my questions and struggles in a way that students without a humanities background could understand. Moreover, not being able to discuss my problems with somebody else often gave me a feeling of uncertainty, which prevented me from further discussing aspects even though they would have needed to be discussed. I think I grew quite a lot in this regard over the course of the project, but I still need to work on it further.

Working my way into a totally new topic (printing in the 18th century) gave me many new topic-related insights about the spread of information, its prerequisites, and its historical context. Given that my background is not historical at all, I had to work myself into a new way of writing and conducting research.

Additionally, I learned more about the use of clustering and detection algorithms and the fine-grained differences between models, such as style- versus content-based clustering, and their implications for the clustering results. This now allows me to better assess which models should be used in which context. The analysis of the clustering results was difficult to conduct because of the poor quality and the lack of transparency of the clustering algorithm. Had the results contained more analyzable material, I would have needed more time (not only one week) to think of other ways of analyzing them. The analysis of image clusters still seems relatively unsystematic and difficult to me.
