The Effects of Input Type and Pronunciation Dictionary Usage in Transfer Learning for Low-Resource Text-to-Speech

Phat Do, Matt Coler*, Jelske Dijkstra, Esther Klabbers

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review


Abstract

We compare phone labels and articulatory features as input for cross-lingual transfer learning in text-to-speech (TTS) for low-resource languages (LRLs). Experiments with FastSpeech 2 and the LRL West Frisian show that using articulatory features outperforms using phone labels in both intelligibility and naturalness. For LRLs without pronunciation dictionaries, we propose two novel approaches: a) using a massively multilingual model for grapheme-to-phone (G2P) conversion in both training and synthesis, and b) using a universal phone recognizer to create a makeshift dictionary. Results show that the G2P approach performs largely on par with using a ground-truth dictionary, and that the phone recognition approach, while generally performing worse, remains a viable option for LRLs less suited to the G2P approach. Within each approach, using articulatory features as input outperforms using phone labels.

Original language: English
Title of host publication: Proceedings of Interspeech 2023
Publisher: ISCA
Publication status: Published - 20-Aug-2023
Event: Interspeech 2023 - Dublin, Ireland
Duration: 20-Aug-2023 → 24-Aug-2023

Conference

Conference: Interspeech 2023
Country/Territory: Ireland
City: Dublin
Period: 20/08/2023 → 24/08/2023
