A Hybrid approach to neural network based speech segmentation

[Conference] Frontiers of Research in Speech and Music (FRSM), Rourkela, India, December 2017


Arun Baby, Jeena Prakash and Hema A. Murthy


Building speech synthesis systems for Indian languages is challenging because digital resources for these languages are scarce. Vocabulary-independent speech synthesis requires that a given text be split at the level of the smallest sound unit, namely the phone. The waveforms or models of phones are concatenated to produce speech. When digital resources are scarce, the waveforms corresponding to the phones are obtained manually (by listening and marking). Manual labeling of data can lead to inconsistencies, as the duration of a phoneme can be as short as 10 ms. The most common approach to automatic segmentation of speech is to perform forced alignment using monophone HMM models obtained by embedded re-estimation after flat-start initialization. These alignments are then used in a DNN/CNN framework to build better acoustic models for speech synthesis. Segmentation using this approach requires large amounts of data and does not work well for low-resource languages. To address the paucity of data, signal processing cues are used. The final waveforms are then used in an HMM-based statistical parametric synthesis framework to build speech synthesis systems for five Indian languages. Qualitative assessments indicate a significant improvement in the quality of synthesis.
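To illustrate the forced-alignment step the abstract refers to, the sketch below runs a Viterbi alignment of a fixed left-to-right phone sequence against per-frame scores. This is a toy illustration, not the authors' implementation: the `log_likes` matrix stands in for monophone HMM emission scores, which in practice would come from a trained acoustic model (e.g. after flat-start initialization and embedded re-estimation), and the single-state-per-phone topology is a simplification.

```python
import math

def forced_align(log_likes, num_phones):
    """Viterbi forced alignment of a left-to-right phone sequence to frames.

    log_likes[t][p] is the (toy) log-likelihood of frame t under phone p.
    Returns, for each frame, the index of the phone it is aligned to.
    The phone order is fixed; each phone covers at least one frame, so
    the number of frames must be >= num_phones.
    """
    T, P = len(log_likes), num_phones
    NEG = -math.inf
    # delta[t][p]: best score over alignments of frames 0..t ending in phone p
    delta = [[NEG] * P for _ in range(T)]
    back = [[0] * P for _ in range(T)]
    delta[0][0] = log_likes[0][0]       # alignment must start in the first phone
    for t in range(1, T):
        for p in range(P):
            stay = delta[t - 1][p]                       # remain in phone p
            advance = delta[t - 1][p - 1] if p > 0 else NEG  # move on from p-1
            if advance > stay:
                delta[t][p] = advance + log_likes[t][p]
                back[t][p] = p - 1
            else:
                delta[t][p] = stay + log_likes[t][p]
                back[t][p] = p
    # backtrack from the last phone at the last frame
    path = [P - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Toy scores: frames 0-1 favour phone 0, frames 2-3 phone 1, frames 4-5 phone 2
ll = [[0, -5, -5], [0, -5, -5],
      [-5, 0, -5], [-5, 0, -5],
      [-5, -5, 0], [-5, -5, 0]]
print(forced_align(ll, 3))  # [0, 0, 1, 1, 2, 2]
```

Phone boundaries are then read off wherever the returned phone index changes; the hybrid approach described above would refine these HMM-derived boundaries using signal processing cues.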