Strategies in Transfer Learning for Low-Resource Speech Synthesis: Phone Mapping, Features Input, and Source Language Selection
Phat Do¹, Matt Coler¹, Jelske Dijkstra², Esther Klabbers³
¹ Language, Technology and Culture Department, Campus Fryslân, University of Groningen (the Netherlands)
² Fryske Akademy / Mercator Research Centre (the Netherlands)
³ ReadSpeaker (the Netherlands)
This is the companion webpage for our paper "Strategies in Transfer Learning for Low-Resource Speech Synthesis: Phone Mapping, Features Input, and Source Language Selection", which has been accepted for presentation at the Speech Synthesis Workshop 2023.
Below we present some of the synthetic audio samples from our experiment in the six low-resource languages (Bulgarian, Georgian, Kazakh, Swahili, Urdu, and Uzbek). For each language, we synthesized 100 utterances for evaluation, selected so that their phoneme distribution matched that of the language's training data (10 minutes, 160 to 200 utterances) as closely as possible; a sketch of one possible selection procedure is given after the legend below. To avoid cluttering the page, we include only 10 randomly selected utterances per language as examples.
Text - Orthographic text of the utterance
Ground-truth - Human recording of the utterance
English (en-US) - Source language (language of the pre-trained model)
Nomap - Sample from model fine-tuned without phoneme mapping
Map - Sample from model fine-tuned with phoneme mapping
Feature - Sample from model fine-tuned with phonological features input
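The following is a minimal, hypothetical sketch of how an evaluation set could be chosen so that its phoneme distribution approximates that of the training data, as mentioned above. It is not the exact procedure from the paper: the greedy selection strategy, the total variation distance, and all function names here are illustrative assumptions.

```python
"""Hypothetical sketch: greedily pick k utterances whose combined phone
distribution stays close to a target (training-data) distribution.
Not the paper's exact procedure; distance measure and strategy are assumptions."""
from collections import Counter


def phone_distribution(utterances):
    """Relative frequency of each phone across a list of phone sequences."""
    counts = Counter(p for utt in utterances for p in utt)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}


def distance(dist_a, dist_b):
    """Total variation distance between two phone distributions."""
    phones = set(dist_a) | set(dist_b)
    return 0.5 * sum(abs(dist_a.get(p, 0.0) - dist_b.get(p, 0.0)) for p in phones)


def select_utterances(candidates, target_dist, k=100):
    """Greedily add the candidate that keeps the selected set's
    phone distribution closest to target_dist, until k are chosen."""
    selected = []
    remaining = list(candidates)
    for _ in range(min(k, len(remaining))):
        best_idx, best_score = None, float("inf")
        for i, utt in enumerate(remaining):
            score = distance(phone_distribution(selected + [utt]), target_dist)
            if score < best_score:
                best_idx, best_score = i, score
        selected.append(remaining.pop(best_idx))
    return selected


if __name__ == "__main__":
    # Toy example with made-up phone sequences.
    training = [["a", "b", "a"], ["b", "c"], ["a", "c", "c"]]
    candidates = [["a", "b"], ["c", "c"], ["a", "a", "a"], ["b", "c", "a"]]
    target = phone_distribution(training)
    print(select_utterances(candidates, target, k=2))
```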