The National Library of Finland has been a partner in the Fin-Clariah project of 2022-2023. Last Friday there was a joint workshop across all work packages to share news about what is going on and how things have progressed as the project nears its end.
The aim of the project was to create infrastructure, data, tools and documentation for researchers who use Finnish cultural heritage material from various sources. The idea was to investigate the flow of data into a high-capability computing environment, in which CSC (IT Center for Science) was a crucial core part. However, to keep the voice of researchers close, there were also a few research teams within the project, doing actual research now or in the future, either with the data or on the project itself. In the workshop we even talked about researcher inception: researchers researching the research process itself :).
The keynotes were super interesting and relevant to many participants of the session. Extracting image data and using image contents together with textual contents is a topic that deserves a bit more reading. The TurkuNLP group's presentation on the state of AI and the basics of large language models (LLMs) is super important at the current moment. We are all waiting for the Poro model to be ready (but even the current 50% done version can be useful). This enables new ways to utilize AI even in the context of a small language (one without many related languages that could be used). More about FinGPT is on arXiv. Unfortunately, old digitized materials with OCR of varying quality have not been integrated into the models; news websites, newer ebooks and periodicals, however, have been.
It was also interesting to hear about the status and learnings of each work package. There will be more of these in December, when the demos will be shown via videos. Some code is already open, with instructions on how to use the dataset from the National Library of Finland:
and some additional information about the dataset is available in the data catalog.