Our trip to India part II. – first days

Our trip to India took place from the 5th to the 17th of December, 2022. The first two days consisted of planning and preparatory discussions between the leaders of team UH and team IITG, Juraj Šimko and Priyankoo Sarmah, respectively. Juraj travelled to India ahead of the group to discuss logistical preparations and handle administrative work, which included ensuring that both teams could travel to Nagaland and back.

On December 7th, the rest of the UH team arrived in Guwahati. The following two days were spent continuing the preparations mentioned above. Data collection, responsibilities, and the essential aspects of data management were also discussed. Data pre-processing was planned with regard to the type of metadata required by the machine learning (ML) systems intended for data analysis, and parts of the code for ML-based data analysis were introduced. On the 8th, both teams toured the IITG campus, including the Phonetics and Phonology laboratory. We also completed and submitted the applications for the permits required to visit Nagaland.

On December 9th, Juraj Šimko led a workshop at IITG on the capabilities of machine learning for data and speech analysis, with a focus on analyzing speech prosody using deep learning methods. The central questions concerned the data pre-processing and the types of speech data that ML requires for the input to produce the desired output. Other questions discussed were: what kind of parameters do the ML systems extract from a speech signal, how does one input data into the systems in general, what is the significance of file naming and embeddings, and what is the importance of well-controlled data collection and compatible datasets?

During the latter portion of the workshop, attention turned to the practicalities of data collection. We planned the division of tasks and responsibilities, as well as how fieldwork would be conducted. It was decided that the phoneticians with some experience, Anna Busheva and Ida-Lotta Myllylä, would take the lead. If the team needed to be split, they would each lead a group. Data management was also discussed. Most importantly, we talked about the realities of fieldwork and realistic expectations about how much data could be collected. Estimates of 8 speakers were presented. The type of stimuli to be used was debated, and ultimately, two sets of stimuli were decided upon.

The first task planned was a naming task, in which participants were asked to name an object in a picture (e.g., a cat, house, snake, or shoe) using their own dialect of Angami and then repeat the word within two carrier sentences. It was designed such that all tones of Angami would be represented with a diverse vowel and tone distribution. An example of the task:

I said ___
He / She sees one ___
(Angami doesn’t specify gender)

This task included 26 words in two carrier sentences, meaning the experiment consisted of 52 instances of picture naming. This task was created to collect data for Busheva’s MA thesis.
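For readers who like to see the arithmetic, here is a minimal sketch of how the trial list for the naming task could be generated. It is only an illustration: the word labels are placeholders (the actual 26-word list is not given here), the carrier sentences are the English glosses rather than the Angami originals, and the file-naming scheme is a hypothetical one, not the scheme the team used.

```python
from itertools import product

# Placeholder items: the real 26-word Angami list is not reproduced here,
# so these labels are hypothetical stand-ins.
words = [f"word{i:02d}" for i in range(1, 27)]

# English glosses of the two carrier sentences, with a slot for the target word.
carriers = ["I said {}", "He / She sees one {}"]


def build_trials(words, carriers):
    """Pair every word with every carrier sentence."""
    return [
        {"word": w, "carrier_id": c_idx, "prompt": carrier.format(w)}
        for w, (c_idx, carrier) in product(words, enumerate(carriers, start=1))
    ]


def recording_name(speaker_id, trial):
    """Hypothetical file-naming scheme: speaker, word, carrier sentence."""
    return f"{speaker_id}_{trial['word']}_c{trial['carrier_id']}.wav"


trials = build_trials(words, carriers)
print(len(trials))                          # 26 words x 2 carriers
print(recording_name("spk01", trials[0]))   # name for the first trial
```

Encoding the speaker, word, and carrier sentence in each file name is one simple way to keep the metadata attached to the recording, which matters when the data is later fed into ML systems.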

The other task was a repetition of a previous experiment designed to collect the standard Angami corpus, in which participants read 25 sentences written in standard Angami. Both tasks were based on the expertise of native speakers – our colleagues at IITG – who deemed all words usable.

Lastly, at the end of the workshop, recording equipment was prepared and checked. Instructions were given by one of our colleagues at IITG to ensure that the entire team could use the equipment correctly. We used two Tascam audio recording devices and two headset microphones.

Yesterday we experienced India for the first time. We visited the museum of the river Brahmaputra, saw the campus of the Indian Institute of Technology in Guwahati, and met our Indian colleagues in person for the first time. The campus itself looked like a small town. Around 12 000 people study and live there, including the teaching staff. We ate a lot of Indian food, especially typical Assamese specialties, such as Assamese rice “Thaali” and the alkaline puree called “Khaar.”

Our trip to India part I. – preparations

Hello, world!
We are a group of students from the Linguistic Diversity and Digital Humanities program at the University of Helsinki (UH), attending the Experimental Laboratory course in Phonetics. On this account, we are documenting our progress for the project on Digital Language Typology. The project consists of collecting and analyzing speech data on the Angami language spoken in North-East India, and it is organized between UH and our colleagues at the Indian Institute of Technology in Guwahati (IITG).

After weeks of preparations and weekly meetings with our colleagues at IITG, we packed our bags and began our journey to India. We wrote everything down on Instagram and in our project diary, and now we will merge the two in the following posts.

From September to December 2022, both teams (UH and IITG) attended regular weekly remote meetings. One part of the meetings was used to plan and set up the excursion for fieldwork and data collection. This excursion took place from the 5th to the 17th of December, 2022: the UH team traveled to Guwahati, Assam, India, and then both teams went together to Nagaland, India, specifically Kohima and its surrounding areas. We discussed the many practical preparations related to traveling, such as visa applications, vaccinations, flights, and accommodation. During this period, we also created an Instagram page, which has been actively updated since.

Other aspects of the meetings were preparing for and planning data collection methods. We debated questions about how much data is realistic to collect, where the fieldwork was to happen, who the participants were going to be, and what kind of stimuli we would use to build an experiment. Most importantly, we discussed the possible research questions we would explore. We realised there were many possible questions we could attempt to answer, but our focus quickly shifted to Angami tones and their realizations among different dialects.

Other than this, the meetings included introductions to the Angami language and the people, as well as introductions to machine learning and data processing. We had decided early on that the data collected should be compatible with an existing corpus of standard Angami collected by the IITG team. We also investigated some samples from this corpus for the purpose of finding the right tools for annotation and automatic segmentation.

It was also established during these meetings that one of the UH team members, Anna Busheva, would conduct her MA thesis research in parallel with this project using some of the data collected during the excursion.