The project

Dialectology is concerned with the study of language variation across space. Current dialectological datasets typically consist of
interviews with informants. These interviews cannot easily be compared with each other as they differ considerably in length and
content. If informant A does not use word x, this does not necessarily mean that the word does not exist in A’s dialect. It may just be
that A chose to talk about topics that did not require the use of word x. This project aims to introduce comparability in dialect corpora.
In particular, we will use machine translation techniques to normalize the dialect texts, i.e., to transform them to standardized spelling.
However, we are not only interested in the result of this normalization process, but also in the transformation operations that the
normalization model learns. These model parameters will allow us to provide new visualisations of dialect landscapes and to confirm or
challenge traditional dialect classifications.

The CorCoDial project is funded by the Academy of Finland during the period 2021-2025.

The full research plan is available here.

The UH research portal provides a summary of the project activities.

People

Principal investigator: Yves Scherrer
Postdoctoral researchers: Aleksandra Miletić, Olli Kuparinen
PhD student: Janine Siewert
Affiliated and visiting students: Noëmi Aepli (University of Zurich), Dana Roemling (University of Birmingham), Erofili Psaltaki (University of Crete)
External collaborators and advisors: Jack Grieve, Nikola Ljubešić, Tanja Samardžić, Benedikt Szmrecsanyi

News

May 2024: After two years of waiting and confusion, our JLG paper is finally published: Corpus-based dialectometry with topic models
May 2024: Three project-related papers accepted at LREC-COLING and one at the HumEval workshop. Aleksandra and Janine participate on-site, Olli participates remotely.
- Murre24: Dialect Identification of Finnish Internet Forum Messages
- Loflòc: A Morphological Lexicon for Occitan using Universal Dependencies
- The Low Saxon LSDC Dataset at Universal Dependencies
- A Gold Standard with Silver Linings: Scaling Up Annotation for Distinguishing Bosnian, Croatian, Montenegrin and Serbian, HumEval
Apr 2024: Aleksandra’s postdoc employment has ended, but she remains affiliated with the project.
Mar-May 2024: CorCoDial hosts two visiting students: Erofili Psaltaki (University of Crete) and Dana Roemling (University of Birmingham)
Jan 2024: Janine is now funded by the CorCoDial project.
Dec 2023: Three papers accepted at EMNLP 2023 and co-located workshops:
- Changing usage of Low Saxon auxiliary and modal verbs, LChange workshop
- The Helsinki-NLP Submissions at NADI 2023 Shared Task: Walking the Baseline, ArabicNLP conference
Sep 2023: Olli starts as a postdoctoral researcher in Finnish language at Tampere University.
Aug 2023: Yves starts a new position as an Associate Professor at the University of Oslo. He remains affiliated with the University of Helsinki and continues leading the CorCoDial project.
July 2023: Yves presents Character alignment methods for dialect-to-standard normalization at SIGMORPHON 2023.
May-July 2023: Janine is on another research visit at the University of Groningen.
May 2023: Olli presents Murreviikko-korpus: murteen mukaan annotoitu ja yleiskielistetty Twitter-aineisto at the Finnish Conference of Linguistics.
May 2023: Three papers accepted at VarDial 2023! Janine, Olli and Yves participate on-site at the workshop.
Nov-Dec 2022: Janine is on a research visit at the University of Groningen.
16 Oct 2022: Aleksandra, Noëmi and Yves participate in the VarDial workshop co-located with COLING.
5-6 Oct 2022: Aleksandra and Yves participate in the Estonian Digital Humanities conference, presenting joint work with Olli and Janine.
Sept-Oct 2022: Together with the GramAdapt project, we host HSSH Visiting Professor T. Mark Ellison from the University of Cologne.
Sept-Dec 2022: Olli is on a research visit in the QLVL group at KU Leuven.
1-5 Aug 2022: Janine, Olli and Yves participate in Methods in Dialectology.
26 May 2022: Janine presents Low Saxon dialect distances at the orthographic and syntactic level at the ACL LChange workshop.
12 May 2022: Olli presents Vaihtoehto Kettuselle? Murrekorpusten laskennallinen mallinnus at the Finnish Conference of Linguistics.
3 May 2022: Aleksandra Miletić has joined our team as a second post-doc.
29 Apr 2022: Thanks for joining our workshop and making it such a success! The final program and presentation slides are available on the workshop page.
11 Feb 2022: We will organize a workshop on Corpus-based and computational approaches to linguistic variation on 27-28 April. Further information on this page.
9 Dec 2021: Olli presents A topic modeling approach to dialect analysis and visualization in the language technology research seminar (Slides) (Video)

Yves Scherrer

Associate professor in Natural Language Processing (University of Oslo) — Docent in language technology (University of Helsinki)

CorCoDial – Corpus-based computational dialectology
