Our trip to India part III. – data collection

On the 10th of December, both teams traveled from Guwahati to Kohima, Nagaland. During the journey, more practical preparations for data collection were discussed, including meeting with a local guide experienced with fieldwork and recording speech in the area. On the 11th of December, the UH team went through the mandatory check-ins with the Nagaland police required for entering the area. Social media guidelines and the promotion of our project on the Internet were also carefully discussed. It is always important for us to obtain consent from our informants, not only for elicitation but also for taking photos of them. The teams also had in-depth discussions on the reality of fieldwork, time management, planning, and expectations of what can be done in a limited time period. It was established that this project, like others, wouldn’t be without its challenges.

The final preparations for data collection were made, and a small-scale pilot experiment was also conducted. The stimuli were prepared to be presented using a slideshow on a laptop rather than, for instance, using paper and printed pictures. The stimuli were also backed up to multiple computers. Settings for the recording equipment were investigated and calibrated to suitable levels. Factors that could interfere with the speech signal, such as background noise, interference from technology and cellular signals, and room echo, were discussed, along with ways to avoid them.

No pilot was conducted for the sentence reading task, since it had already been run multiple times in the past. The picture naming task was piloted with 10 words, with our local guide, who is also an experienced fieldworker and informant, as the participant. He suggested presenting the carrier sentences in English alongside the pictures, which turned the task into a translation task from the English sentence into the speaker’s dialect of Angami. This modification was made.

It was also decided that participants should first complete the picture naming task, in which they use their own dialect, and only then the sentence reading task, in which they use standard Angami, so that reading standard Angami doesn’t influence the speaker’s tones in the dialect task. The possible transfer of tones between tasks, and between carrier sentence and target word, was also discussed, as were tonal realizations that might be speaker-specific rather than dialect-specific and the ways grammatical markers might influence tonal realizations.

On the 12th of December, one part of the team, including Juraj Šimko, Priyankoo Sarmah, Joona Rajala, and Monika Murgová, headed to Nagaland University to give lectures on the current speech research conducted at both UH and IITG.

The 12th was also the first day of data collection, which was conducted in the village of Jotsoma, located some 9 kilometers west of Kohima. The fieldwork team included Anna Busheva, Ida-Lotta Myllylä, Daniel Pietschke, Kyan Wayman, and local Ph.D. candidates Viya and Zhoeni. The day started with meeting and getting to know the locals and visiting different parts of the village. The data collection was conducted in two sessions. During the first session, five speakers took part in both speech tasks, first completing the dialect picture naming task and then the sentence reading task. All speakers were from the same khel. The second session included one speaker, meaning that in total, six speakers from Jotsoma village were recorded.

The recording equipment was, as described earlier, a Tascam recorder and a headset microphone positioned 1 inch from the speaker’s mouth. The stimuli were presented using a laptop placed in front of the speaker at a sufficient distance. For each speaker, the order of the stimuli was randomized. The recordings were conducted outside to minimize room echo, and cellular interference was avoided by turning off nearby mobile phones.
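For readers curious how a per-speaker randomization like this could be set up, here is a minimal Python sketch. It is not the script we used in the field: the function name, the speaker IDs, the site label, and the seeding scheme are all assumptions made for illustration, and the stimulus labels are placeholders.

```python
import random

def randomized_order(stimuli, speaker_id, site="jotsoma"):
    """Return a shuffled copy of the stimuli for one speaker.

    Seeding the generator with a site/speaker string (an assumed scheme)
    makes each speaker's order reproducible, so it can be regenerated
    later when the recordings are annotated.
    """
    rng = random.Random(f"{site}-{speaker_id}")  # string seeds are deterministic
    order = list(stimuli)                        # copy, leave the master list intact
    rng.shuffle(order)
    return order

# Placeholder labels standing in for the pictured words (hypothetical).
stimuli = [f"picture_{i:02d}" for i in range(1, 27)]

for speaker in ["S01", "S02"]:
    print(speaker, randomized_order(stimuli, speaker)[:5])
```

Using a separate generator per speaker, rather than the global `random` state, keeps one speaker's order independent of how many other lists were shuffled before it.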

Each participant was carefully informed about the background of the project, the purpose of data collection, and the nature of the task before the recording started. The instructions made clear that the participant could pause or stop the experiment at any time without any consequences. The recordings did not commence without oral consent from the participant. For the sentence reading task, participants could read the sentences to themselves to get acquainted with them and to avoid excessive hesitations and reading intonation during recording. Of course, the speaker could at any time repeat a sentence or word if they so wished, or entirely skip words they didn’t know. Each recording included the name, age, and home village of the speaker. No written materials were collected. Each recording lasted approximately 3-7 minutes, depending on the speaker and the number of repetitions and corrections. During the picture naming task, most of the recordings include discussions between the participant and researcher about possible synonyms for the word the participant said.

After both sessions, the stimulus randomizations were prepared for the following day, and the data were backed up.

On the 13th of December, the rest of the data was collected in Kigwema village, located some 16 kilometers south of Kohima. The fieldwork team consisted of Ida-Lotta Myllylä, Anna Busheva, Joona Rajala, and Kyan Wayman. As in Jotsoma, the team went around the village to get to know the local people and the surroundings. In total, four speakers were recorded during this day, two in the first session and two in the second. The recordings followed the same procedure as described above.

Yesterday was primarily dedicated to getting to know the speakers of Angami better. Our team went around Kigwema village, talked to the local people, and observed the interviews with the speakers, who shared with us their knowledge about their dialect and language and gladly talked to us about their families, customs, and life in Nagaland. We also gave a few books about Finland and Helsinki to the community library so that children and adults can learn about our country.

When the interviews were over, we visited the village student union director and talked about Naga cultural traditions, community spirit, and the importance of language preservation.

Our last day was spent on a safari! With our Indian friends, we experienced Kaziranga National Park near Golaghat in the state of Assam. This national park hosts two-thirds of the world’s great one-horned rhinoceroses. Kaziranga is a UNESCO World Heritage Site where we saw not only beautiful rhinoceroses but also elephants (even the baby ones! 🐘), swamp deer, and many different species of birds. Kaziranga has achieved notable success in bird and mammal conservation within India.

And now back to Finland to look at the data 🔎


Our trip to India part II. – first days

Our trip to India took place from the 5th to the 17th of December, 2022. The first two days consisted of planning and preparatory discussions between the leaders of team UH and team IITG, Juraj Šimko and Priyankoo Sarmah, respectively. Juraj traveled to India ahead of the group for this, as there was a need to discuss logistical preparations and do administrative work. Part of this also included ensuring both teams’ successful travels to Nagaland and back.

On December 7th, the rest of the UH team arrived in Guwahati. The following two days were spent continuing the preparations mentioned above. Data collection, responsibilities, and the essential aspects of data management were also discussed. Data pre-processing was planned with regard to what type of metadata would be required for the machine learning (ML) systems intended for data analysis. Parts of the code for ML-based data analysis were also introduced. On the 8th, both teams toured the IITG campus, including the Phonetics and Phonology laboratory. We also successfully completed the applications for the essential permits for the visit to Nagaland.

On December 9th, a workshop led by Juraj Šimko about the capabilities of machine learning for data and speech analysis using deep learning methods was held at IITG. The central questions included the necessary data pre-processing and speech data types required for ML, such that the input resulted in a desired output. The focus was on analyzing speech prosody using deep learning methods. Some of the other questions discussed were: what kind of parameters do the ML systems extract from a speech signal, how does one input data into the systems in general, what is the significance of file naming and embeddings, and what is the importance of well-controlled data collection and compatible datasets.

Attention was turned to the practicalities of data collection during the latter portion of the workshop. The division of tasks and responsibilities was planned, as well as how fieldwork would be conducted. It was decided that the phoneticians with some experience, Anna Busheva and Ida-Lotta Myllylä, would take the lead. If the team needed to be split, they would each lead a group. Data management was discussed. Most importantly, the realities of fieldwork and realistic expectations about how much data could be collected were talked about. Estimates of 8 speakers were presented. The type of stimuli that would be used was debated, and ultimately, two sets of stimuli were decided upon.

The first task planned was a naming task, in which participants were asked to name an object in a picture (e.g., a cat, house, snake, or shoe) using their own dialect of Angami and then repeat the word within two carrier sentences. It was designed such that all tones of Angami would be represented with a diverse vowel and tone distribution. An example of the task:

I said ___
He / She sees one ___
(Angami doesn’t specify gender)

This task included 26 words in two carrier sentences, meaning the experiment consisted of 52 instances of picture naming. This task was created to collect data for Busheva’s MA thesis.
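The arithmetic behind the 52 trials can be sketched in a few lines of Python. This is only an illustration, not the slideshow we actually used: the two carrier frames are the English ones quoted above, while the target labels and the `build_trials` helper are hypothetical placeholders.

```python
from itertools import product

# Carrier frames quoted from the example above; "___" marks the target slot.
CARRIERS = ["I said ___", "He / She sees one ___"]

# Placeholder labels standing in for the 26 pictured target words (hypothetical).
TARGETS = [f"word_{i:02d}" for i in range(1, 27)]

def build_trials(targets, carriers):
    """Pair every target word with every carrier frame."""
    return [
        {"target": t, "carrier": c, "prompt": c.replace("___", t)}
        for t, c in product(targets, carriers)
    ]

trials = build_trials(TARGETS, CARRIERS)
print(len(trials))  # 26 targets x 2 carriers = 52
```

Keeping the target and the carrier as separate fields in each trial makes it easy to group recordings later by word or by sentence frame.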

The other task was a repetition of a previous experiment designed to collect the standard Angami corpus, in which participants read 25 sentences written in standard Angami. Both tasks were based on the expertise of native speakers, our colleagues at IITG, who deemed all words usable.

Lastly, at the end of the workshop, recording equipment was prepared and checked. Instructions were given by one of our colleagues at IITG to ensure that the entire team could use the equipment correctly. We used two Tascam audio recording devices and two headset microphones.

Yesterday we experienced India for the first time. We visited the museum of the river Brahmaputra, saw the campus of the Indian Institute of Technology in Guwahati, and met our Indian colleagues for the first time offline. The campus itself looked like a small town. Around 12,000 people study and live there, including the teaching staff. We ate a lot of Indian food, especially typical Assamese specialties, such as Assamese rice “Thaali” and the alkaline puree called “Khaar.”

Our trip to India part I. – preparations

Hello, world!
We are a group of students from the Linguistic Diversity and Digital Humanities program at the University of Helsinki (UH), attending the Experimental Laboratory course in Phonetics. On this account, we are documenting our progress for the project on Digital Language Typology. The project consists of collecting and analyzing speech data on the Angami language spoken in North-East India, and it is organized between UH and our colleagues at the Indian Institute of Technology in Guwahati (IITG).

After weeks of preparations and weekly meetings with our colleagues at IITG, we packed our bags and began our journey to India. We wrote everything down on Instagram and in our project diary, and now we will merge the two in the following posts.

From September to December 2022, a regular weekly schedule for remote meetings was established, and both teams (UH and IITG) attended the meetings. One part of the meetings was used to plan and set up the excursion for fieldwork and data collection in December 2022. This excursion took place from the 5th to the 17th of December, 2022, during which the UH team traveled to Guwahati, Assam, India, and then both teams went together to Nagaland, India, specifically Kohima and its surrounding areas. During these meetings, we discussed the many different practical preparations related to traveling, such as visa applications, vaccinations, flights, and accommodations. We also created an Instagram page, which has been actively updated since.

Other aspects of the meetings were preparing for and planning data collection methods. We debated questions about how much data is realistic to collect, where the fieldwork was to happen, who the participants were going to be, and what kind of stimuli we would use to build an experiment. Most importantly, we discussed the possible research questions we would explore. We realised there were many possible questions we could attempt to answer, but our focus quickly shifted to Angami tones and their realizations among different dialects.

Other than this, the meetings included introductions to the Angami language and the people, as well as introductions to machine learning and data processing. We had decided early on that the data collected should be compatible with an existing corpus of standard Angami collected by the IITG team. We also investigated some samples from this corpus for the purpose of finding the right tools for annotation and automatic segmentation.

It was also established during these meetings that one of the UH team members, Anna Busheva, would conduct her MA thesis research in parallel with this project using some of the data collected during the excursion.

Priyankoo Sarmah

I am Priyankoo, a Professor of Linguistics at the Indian Institute of Technology Guwahati (IIT-G), India. I studied linguistics and earned my M.A. and M.Phil. from the Central Institute of English and Foreign Languages, Hyderabad, India. I completed my Ph.D. at the University of Florida, USA, on tones in Rabha and Dimasa, two Tibeto-Burman tone languages spoken in North East India. Before joining IIT-G, I worked at the Hankuk University of Foreign Studies, Seoul, South Korea, teaching undergrad and grad students phonetics and phonology.

Since joining IIT-G in 2011, I have been working on the phonetics of tones and vowels of several North-East Indian languages. I have been interested in the acoustic properties of tones, perception of tones, and statistical modeling of tones. Over the years, I have also ventured into the domain of speech technology development for resource-poor languages and have worked on speech recognition and on automatic language and dialect identification. An in-depth understanding of the acoustic characteristics of speech enables one to build better language technologies for resource-poor languages. At the same time, I believe that using machine learning techniques in speech helps us uncover new speech features in these languages.

I am glad that the Digital Language Typology project was born out of the passion for speech studies that I shared with Juraj and Daniil. It is also heartening to see several students from Helsinki and Guwahati working together on this project! Let’s keep moving ahead!


Juraj Šimko

My name is Juraj, and I am a University Lecturer in Phonetics at the University of Helsinki. Originally from Slovakia, I studied Mathematics in my home country, worked in the IT industry, and eventually completed a Ph.D. in Cognitive Science at University College Dublin.

I am passionate about speech, both in daily life (well, I do speak quite a lot) and as a research subject. In my scientific career, I have explored many aspects of speech communication: minute details of articulation, phonological quantity in several Finnic languages, speaking in a noisy environment, speech prosody, speech technology, and some more. One of my primary interests in recent years has been figuring out how speech technology can be used as a tool for theoretical investigation in language typology, in particular for less researched, endangered languages.

This project was born from many conversations over the years among three phoneticians, good friends with shared passions and complementary skills. Let’s keep following its progress!


Joona Rajala

My name is Joona, and I’m a first-year student in Phonetics. Since my main interest lies in Speech Technology, I’m also taking courses in the Language Technology track of our master’s program. I have previously studied French and English linguistics as well as Pedagogy and have worked as a language teacher for the past few years. In our project, I’m responsible for applying machine learning to our data collected in India. In my free time, I enjoy going out to restaurants, jogging, and traveling.

Kyan Wayman

My name is Kyan. I have a minor degree in Global Studies and Politics, and I received my BA in Linguistics with a minor in Asian Languages and Cultures from Rutgers University in the USA. I did part of my degree during an exchange at the International Christian College in Tokyo.
At the University of Helsinki, I am working towards an MA in General Linguistics within the Linguistic Diversity and Digital Humanities program. I have yet to decide on a research topic. Still, my interests include indigenous language rights, minority & endangered languages, language death, and the intersection of gender and LGBTQ+ studies with linguistics.
Within this research project, I am participating in fieldwork and creating interview materials.


Anna Busheva

My name is Anna, and I am a second-year MA student at the University of Helsinki. I am studying phonetics in the Linguistic Diversity and Digital Humanities program, and this project is the basis for my thesis, which is very exciting! I have always been interested in field research and experimental phonetics, so in this project, I am responsible for phonetic background research of Angami and observing fieldwork. I love traveling to unusual places and talking to people, which aligns perfectly with my passion for dialectal phonetics, so this project is an excellent opportunity for me.


Monika Murgová

My name is Monika, and I come from Slovakia. I gained my BA in Dutch language and literature and Baltic studies with a specialization in Finnish language and literature at Masaryk University in the Czech Republic. I am now studying General Linguistics in the Linguistic Diversity and Digital Humanities program at the University of Helsinki. My research concentrates on language attitudes, language death, and the revitalization of minority and endangered languages. I like to paint, read and do yoga in my free time. In this project, I’m a social media person and a photographer, so that’s the reason I’m usually not in the photos.


Daniel Pietschke

Hello there! My name is Daniel. I studied Cognitive Science in Germany, focusing on computational linguistics, machine learning, and artificial intelligence.

At the University of Helsinki, I am studying in the Linguistic Diversity and Digital Humanities program as part of the Language Technology track, where I hope to deepen my knowledge and understanding of computational linguistics applications that can be created and optimized.

After my studies, I wish to work with machine translation and develop technologies that help people understand each other better in order to overcome obstacles on the way to creating a more globally connected community. My part of the research project is to work with machine learning models and algorithms and to deliver one-liners whenever someone needs a laugh.