Publicly available Finnish speech resources are becoming plentiful, with, for example, a large parliamentary corpus and the recent Lahjoita puhetta (“donate speech”) campaign. These are eminently suitable for many speech technology applications, such as speech recognition, but of limited value for speech synthesis training, where large, high-quality single-speaker corpora are ideal. So far, such corpora have been absent for the Finnish language.
To fill this gap, we at the Helsinki phonetics group designed and recorded a new speech corpus in autumn 2021, intended for Finnish speech synthesis research and applications. We collected a dataset of about 60 hours of speech by recording two voice talents almost daily over a one-month period.
Tentatively titled FinSyn, the corpus has a quite ambitious design. Instead of collecting only carefully pronounced, phonetically rich sentences (a practice that is still regrettably common), we aimed to capture many aspects of the huge stylistic variety of spoken communication in our recordings. Most of the material is still read-aloud speech, as text-to-speech is the most common application of speech synthesis, but we varied the formality of the text genres to induce different speaking styles, including movie subtitles, blogs, discussion forum texts, parliamentary speeches, novels and fairy tales. In addition, we recorded acted spontaneous speech based on YouTube closed captions, task-based semi-spontaneous dialogues and fully spontaneous dialogues between the voice talents.
A small sample of stylistic variation within the corpus:
The FinSyn corpus will be made publicly available through Kielipankki. While the details are still being worked out, the license will be permissive and will likely allow commercial use.
Bonus section, tips for corpus design and recordings:
- A close-talking microphone is preferable for suppressing room acoustics and achieving consistent recordings between sessions.
- Don’t record just isolated sentences. Even for professional speakers, it is very difficult to avoid repetitive prosodic patterns and boredom.
- Limit the use of 19th-century public domain books. Even if your language has not changed much, obsolete agricultural terminology confuses modern readers.
- Try to select texts based on the interests of the voice talents; prosody is more natural when the readers are engaged.
- Keep the voice talents content and the recording sessions short; boredom is easily perceived in speech.
- In addition, avoid correcting the voice talents for small mispronunciations or lexical mistakes. While post-processing will be a bit more involved, nicely flowing recordings are more valuable.
- Record conversations and small talk between the actual sessions too; this provides material for spontaneous speech synthesis.
- Don’t forget to record plenty of short utterances like “hi” and “yes”; TTS systems often struggle with these.
- The voice changes a bit from day to day. If you are interested in styles, try to record each genre every day; otherwise it will be difficult to disentangle style from recording date.
- A good speech corpus will not become obsolete, and technological improvements will not make human recordings significantly easier. So do not restrict yourself to the limitations of current TTS methods. Record small samples of shouting, whispering, singing, laughing, foreign languages and so on, even if you see no immediate value in such data. Future improvements in synthesis methods might make such data usable.
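
The tip about recording each genre every day can be turned into a simple session plan. The sketch below is only an illustration and not the schedule we actually used for FinSyn: the function name `rotating_schedule` and the short genre labels are made up for this example. It assumes that every session covers all genres once and that the starting genre shifts by one each day, so that no genre always ends up at the tired end of a session.

```python
from collections import deque

# Hypothetical short labels for the read-aloud genres mentioned above.
GENRES = ["subtitles", "blogs", "forum", "parliament", "novels", "fairy_tales"]


def rotating_schedule(genres, n_days):
    """Return one genre order per day.

    Every day contains each genre exactly once, and the order is
    rotated by one position per day, so style is not tied to a
    particular date or to a fixed position within the session.
    """
    order = deque(genres)
    schedule = []
    for _ in range(n_days):
        schedule.append(list(order))
        order.rotate(-1)  # the next day starts with the following genre
    return schedule


if __name__ == "__main__":
    for day, plan in enumerate(rotating_schedule(GENRES, 6), start=1):
        print(f"day {day}: {', '.join(plan)}")
```

With six genres, the order repeats every six days, which over a month of recording gives each genre roughly the same number of turns in every slot of the session.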