The Effects of Input Type and Pronunciation Dictionary Usage in Transfer Learning for Low-Resource Text-to-Speech

Phat Do, Matt Coler*, Jelske Dijkstra, Esther Klabbers

*Corresponding author for this work

Research output: Academic, peer review

Abstract

We compare phone labels and articulatory features as input for cross-lingual transfer learning in text-to-speech (TTS) for low-resource languages (LRLs). Experiments with FastSpeech 2 and the LRL West Frisian show that using articulatory features outperforms using phone labels in both intelligibility and naturalness. For LRLs without pronunciation dictionaries, we propose two novel approaches: a) using a massively multilingual model for grapheme-to-phone (G2P) conversion in both training and synthesis, and b) using a universal phone recognizer to create a makeshift dictionary. Results show that the G2P approach performs largely on par with using a ground-truth dictionary, and that the phone recognition approach, while generally performing worse, remains a viable option for LRLs less suitable for the G2P approach. Within each approach, using articulatory features as input outperforms using phone labels.

Original language: English
Title: Proceedings of Interspeech 2023
Publisher: ISCA
Status: Published - 20 Aug 2023
Event: Interspeech 2023 - Dublin, Ireland
Duration: 20 Aug 2023 to 24 Aug 2023

Conference

Conference: Interspeech 2023
Country/Region: Ireland
City: Dublin
Period: 20/08/2023 to 24/08/2023
