Resources for Indian languages

[Conference] Community-based Building of Language Resources (CBBLR), Brno, Czech Republic, September 2016


Arun Baby, Anju Leela Thomas, NL Nishanthi, TTS Consortium


This paper discusses a consortium effort to design a database for a high-quality corpus, primarily for building text-to-speech (TTS) synthesis systems for 13 major Indian languages. The importance of language corpora has long been recognized in many countries. The amount of work in the speech domain for Indian languages is comparatively lower than that for other languages, which creates a need for speech corpora for Indian languages. The corpus presented here is a database of speech audio files and corresponding text transcriptions. Various criteria are addressed while building the database for these languages, including optimal text selection, speaker selection, pronunciation variation, recording specification, and text correction for handling out-of-vocabulary words. Furthermore, various characteristics that affect speech synthesis quality, such as encoding, sampling rate, and channel, are considered so that the collected data will be of high quality with defined standards. Databases and text-to-speech synthesizers are built for all 13 languages, namely Assamese, Bengali, Bodo, Gujarati, Hindi, Kannada, Malayalam, Manipuri, Marathi, Odiya, Rajasthani, Tamil and Telugu.


    title = {Resources for {I}ndian languages},
    author = {Arun Baby and Anju Leela Thomas and Nishanthi, N. L. and TTS Consortium},
    booktitle = {CBBLR -- Community-Based Building of Language Resources},
    pages = {37--43},
    publisher = {Tribun EU},
    address = {Brno, Czech Republic},
    year = {2016},
    month = {Sep},
    day = {12},
    isbn = {978-80-263-1084-6},