Phonetically balanced code-mixed speech corpus for Hindi-English automatic speech recognition

Ayushi Pandey, B. M.L. Srivastava, Rohit Kumar, B. T. Nellore, K. S. Teja, S. V. Gangashetty

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The paper presents the development of a phonetically balanced read-speech corpus of code-mixed Hindi-English. Phonetic balance was achieved by selecting sentences that contain triphones whose frequency falls below a predefined threshold. The assumption behind the compulsory inclusion of such rare units is that high-frequency triphones will inevitably be included as well. With this selection criterion, the Pearson's correlation coefficient between the phonetically balanced corpus and a large code-mixed reference corpus was 0.996. The data for corpus creation was extracted from selected sections of Hindi newspapers. These sections contain frequent English insertions in a matrix of Hindi sentences. Statistics on the phone and triphone distributions are presented to display graphically the phonetic likeness between the reference corpus and the corpus sampled through our method.
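The selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`triphones`, `select_sentences`, `pearson`), the toy phone sequences, the threshold value, and the choice to compute the correlation over raw triphone counts are all assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def triphones(phones):
    """Return the overlapping triphones (3-phone windows) of a phone sequence."""
    return [tuple(phones[i:i + 3]) for i in range(len(phones) - 2)]


def triphone_counts(corpus):
    """Count every triphone across a list of phone sequences."""
    return Counter(t for phones in corpus for t in triphones(phones))


def select_sentences(corpus, threshold):
    """Keep each sentence containing at least one triphone whose corpus
    frequency is below `threshold` (hypothetical parameter)."""
    counts = triphone_counts(corpus)
    rare = {t for t, c in counts.items() if c < threshold}
    return [p for p in corpus if any(t in rare for t in triphones(p))]


def pearson(reference, sample):
    """Pearson's r between the triphone counts of two corpora, taken over
    the triphone inventory of the reference (an assumed convention)."""
    ref, sam = triphone_counts(reference), triphone_counts(sample)
    keys = sorted(ref)
    x = [ref[k] for k in keys]
    y = [sam[k] for k in keys]  # Counter returns 0 for unseen triphones
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant counts")
    return cov / (sx * sy)


# Toy reference corpus: each sentence is a list of phone symbols (made up).
corpus = [
    ["m", "e", "r", "a"],
    ["m", "e", "r", "a"],
    ["m", "e", "r", "i"],
    ["f", "o", "n"],
]
selected = select_sentences(corpus, threshold=2)
r = pearson(corpus, selected)
```

On a corpus this small the correlation is not meaningful; the sketch only shows where the threshold and the comparison against the reference corpus enter the procedure.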

Original language: English
Title of host publication: LREC 2018 - 11th International Conference on Language Resources and Evaluation
Editors: Hitoshi Isahara, Bente Maegaard, Stelios Piperidis, Christopher Cieri, Thierry Declerck, Koiti Hasida, Helene Mazo, Khalid Choukri, Sara Goggi, Joseph Mariani, Asuncion Moreno, Nicoletta Calzolari, Jan Odijk, Takenobu Tokunaga
Publisher: European Language Resources Association (ELRA)
Pages: 1480-1484
Number of pages: 5
ISBN (Electronic): 9791095546009
Publication status: Published - 01-01-2019
Externally published: Yes
Event: 11th International Conference on Language Resources and Evaluation, LREC 2018 - Miyazaki, Japan
Duration: 07-05-2018 to 12-05-2018

Publication series

Name: LREC 2018 - 11th International Conference on Language Resources and Evaluation

Conference

Conference: 11th International Conference on Language Resources and Evaluation, LREC 2018
Country: Japan
City: Miyazaki
Period: 07-05-18 to 12-05-18

All Science Journal Classification (ASJC) codes

  • Linguistics and Language
  • Education
  • Library and Information Sciences
  • Language and Linguistics

Cite this

Pandey, A., Srivastava, B. M. L., Kumar, R., Nellore, B. T., Teja, K. S., & Gangashetty, S. V. (2019). Phonetically balanced code-mixed speech corpus for Hindi-English automatic speech recognition. In H. Isahara, B. Maegaard, S. Piperidis, C. Cieri, T. Declerck, K. Hasida, H. Mazo, K. Choukri, S. Goggi, J. Mariani, A. Moreno, N. Calzolari, J. Odijk, & T. Tokunaga (Eds.), LREC 2018 - 11th International Conference on Language Resources and Evaluation (pp. 1480-1484). (LREC 2018 - 11th International Conference on Language Resources and Evaluation). European Language Resources Association (ELRA).