Phonetically balanced code-mixed speech corpus for Hindi-English automatic speech recognition

Ayushi Pandey, B. M.L. Srivastava, Rohit Kumar, B. T. Nellore, K. S. Teja, S. V. Gangashetty

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The paper presents the development of a phonetically balanced read speech corpus of code-mixed Hindi-English. Phonetic balance in the corpus has been created by selecting sentences that contained triphones lower in frequency than a predefined threshold. The assumption with a compulsory inclusion of such rare units was that the high frequency triphones will inevitably be included. Using this metric, Pearson's correlation coefficient of the phonetically balanced corpus with a large code-mixed reference corpus was recorded to be 0.996. The data for corpus creation has been extracted from selected sections of Hindi newspapers. These sections contain frequent English insertions in a matrix of Hindi sentences. Statistics on the phone and triphone distribution have been presented, to graphically display the phonetic likeness between the reference corpus and the corpus sampled through our method.
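The selection idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy phone sequences, and the threshold value are all hypothetical, and the record does not state the actual phone set, threshold, or selection procedure. The sketch counts triphones in a reference corpus, keeps sentences that contain at least one triphone below the threshold, and compares the two triphone distributions with Pearson's correlation coefficient.

```python
from collections import Counter
from math import sqrt


def triphones(phones):
    """Return every overlapping phone triple in one phone sequence."""
    return [tuple(phones[i:i + 3]) for i in range(len(phones) - 2)]


def triphone_counts(sentences):
    """Count triphone occurrences over a list of phone sequences."""
    counts = Counter()
    for phones in sentences:
        counts.update(triphones(phones))
    return counts


def select_rare(sentences, ref_counts, threshold):
    """Keep sentences with at least one triphone rarer than `threshold`.

    `threshold` is a hypothetical parameter; the paper's value is not
    given in this record.
    """
    return [s for s in sentences
            if any(ref_counts[t] < threshold for t in triphones(s))]


def pearson(ref_counts, sub_counts):
    """Pearson's r between two count tables over the reference inventory.

    Undefined (division by zero) if either count vector is constant.
    """
    keys = sorted(ref_counts)
    x = [ref_counts[k] for k in keys]
    y = [sub_counts.get(k, 0) for k in keys]
    n = len(keys)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Toy reference corpus (made-up phone sequences, for illustration only).
reference = [["k", "a", "m", "a"], ["k", "a", "m", "a"], ["b", "a", "s"]]
ref_counts = triphone_counts(reference)
sampled = select_rare(reference, ref_counts, threshold=2)
r = pearson(ref_counts, triphone_counts(sampled))
```

A corpus this small says nothing about the reported 0.996. On real data the reference corpus would be large, and the sampled subset would be expected to pick up frequent triphones as a side effect of including sentences with rare ones.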

Original language: English
Title of host publication: LREC 2018 - 11th International Conference on Language Resources and Evaluation
Editors: Hitoshi Isahara, Bente Maegaard, Stelios Piperidis, Christopher Cieri, Thierry Declerck, Koiti Hasida, Helene Mazo, Khalid Choukri, Sara Goggi, Joseph Mariani, Asuncion Moreno, Nicoletta Calzolari, Jan Odijk, Takenobu Tokunaga
Publisher: European Language Resources Association (ELRA)
Pages: 1480-1484
Number of pages: 5
ISBN (Electronic): 9791095546009
Publication status: Published - 01-01-2019
Externally published: Yes
Event: 11th International Conference on Language Resources and Evaluation, LREC 2018 - Miyazaki, Japan
Duration: 07-05-2018 to 12-05-2018

Publication series

Name: LREC 2018 - 11th International Conference on Language Resources and Evaluation

Conference

Conference: 11th International Conference on Language Resources and Evaluation, LREC 2018
Country: Japan
City: Miyazaki
Period: 07-05-18 to 12-05-18

Fingerprint

phonetics
newspaper
statistics
inclusion
Automatic Speech Recognition

All Science Journal Classification (ASJC) codes

  • Linguistics and Language
  • Education
  • Library and Information Sciences
  • Language and Linguistics

Cite this

Pandey, A., Srivastava, B. M. L., Kumar, R., Nellore, B. T., Teja, K. S., & Gangashetty, S. V. (2019). Phonetically balanced code-mixed speech corpus for Hindi-English automatic speech recognition. In H. Isahara, B. Maegaard, S. Piperidis, C. Cieri, T. Declerck, K. Hasida, H. Mazo, K. Choukri, S. Goggi, J. Mariani, A. Moreno, N. Calzolari, J. Odijk, ... T. Tokunaga (Eds.), LREC 2018 - 11th International Conference on Language Resources and Evaluation (pp. 1480-1484). (LREC 2018 - 11th International Conference on Language Resources and Evaluation). European Language Resources Association (ELRA).
Pandey, Ayushi ; Srivastava, B. M.L. ; Kumar, Rohit ; Nellore, B. T. ; Teja, K. S. ; Gangashetty, S. V. / Phonetically balanced code-mixed speech corpus for Hindi-English automatic speech recognition. LREC 2018 - 11th International Conference on Language Resources and Evaluation. editor / Hitoshi Isahara ; Bente Maegaard ; Stelios Piperidis ; Christopher Cieri ; Thierry Declerck ; Koiti Hasida ; Helene Mazo ; Khalid Choukri ; Sara Goggi ; Joseph Mariani ; Asuncion Moreno ; Nicoletta Calzolari ; Jan Odijk ; Takenobu Tokunaga. European Language Resources Association (ELRA), 2019. pp. 1480-1484 (LREC 2018 - 11th International Conference on Language Resources and Evaluation).
@inproceedings{05063b0cdd8c4f9c8a699507ebdc25bd,
title = "Phonetically balanced code-mixed speech corpus for Hindi-English automatic speech recognition",
abstract = "The paper presents the development of a phonetically balanced read speech corpus of code-mixed Hindi-English. Phonetic balance in the corpus has been created by selecting sentences that contained triphones lower in frequency than a predefined threshold. The assumption with a compulsory inclusion of such rare units was that the high frequency triphones will inevitably be included. Using this metric, Pearson's correlation coefficient of the phonetically balanced corpus with a large code-mixed reference corpus was recorded to be 0.996. The data for corpus creation has been extracted from selected sections of Hindi newspapers. These sections contain frequent English insertions in a matrix of Hindi sentences. Statistics on the phone and triphone distribution have been presented, to graphically display the phonetic likeness between the reference corpus and the corpus sampled through our method.",
author = "Ayushi Pandey and Srivastava, {B. M.L.} and Rohit Kumar and Nellore, {B. T.} and Teja, {K. S.} and Gangashetty, {S. V.}",
year = "2019",
month = "1",
day = "1",
language = "English",
series = "LREC 2018 - 11th International Conference on Language Resources and Evaluation",
publisher = "European Language Resources Association (ELRA)",
pages = "1480--1484",
editor = "Hitoshi Isahara and Bente Maegaard and Stelios Piperidis and Christopher Cieri and Thierry Declerck and Koiti Hasida and Helene Mazo and Khalid Choukri and Sara Goggi and Joseph Mariani and Asuncion Moreno and Nicoletta Calzolari and Jan Odijk and Takenobu Tokunaga",
booktitle = "LREC 2018 - 11th International Conference on Language Resources and Evaluation",

}

Pandey, A, Srivastava, BML, Kumar, R, Nellore, BT, Teja, KS & Gangashetty, SV 2019, Phonetically balanced code-mixed speech corpus for Hindi-English automatic speech recognition. in H Isahara, B Maegaard, S Piperidis, C Cieri, T Declerck, K Hasida, H Mazo, K Choukri, S Goggi, J Mariani, A Moreno, N Calzolari, J Odijk & T Tokunaga (eds), LREC 2018 - 11th International Conference on Language Resources and Evaluation. LREC 2018 - 11th International Conference on Language Resources and Evaluation, European Language Resources Association (ELRA), pp. 1480-1484, 11th International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, 07-05-18.

Phonetically balanced code-mixed speech corpus for Hindi-English automatic speech recognition. / Pandey, Ayushi; Srivastava, B. M.L.; Kumar, Rohit; Nellore, B. T.; Teja, K. S.; Gangashetty, S. V.

LREC 2018 - 11th International Conference on Language Resources and Evaluation. ed. / Hitoshi Isahara; Bente Maegaard; Stelios Piperidis; Christopher Cieri; Thierry Declerck; Koiti Hasida; Helene Mazo; Khalid Choukri; Sara Goggi; Joseph Mariani; Asuncion Moreno; Nicoletta Calzolari; Jan Odijk; Takenobu Tokunaga. European Language Resources Association (ELRA), 2019. p. 1480-1484 (LREC 2018 - 11th International Conference on Language Resources and Evaluation).


TY - GEN
T1 - Phonetically balanced code-mixed speech corpus for Hindi-English automatic speech recognition
AU - Pandey, Ayushi
AU - Srivastava, B. M.L.
AU - Kumar, Rohit
AU - Nellore, B. T.
AU - Teja, K. S.
AU - Gangashetty, S. V.
PY - 2019/1/1
Y1 - 2019/1/1
AB - The paper presents the development of a phonetically balanced read speech corpus of code-mixed Hindi-English. Phonetic balance in the corpus has been created by selecting sentences that contained triphones lower in frequency than a predefined threshold. The assumption with a compulsory inclusion of such rare units was that the high frequency triphones will inevitably be included. Using this metric, Pearson's correlation coefficient of the phonetically balanced corpus with a large code-mixed reference corpus was recorded to be 0.996. The data for corpus creation has been extracted from selected sections of Hindi newspapers. These sections contain frequent English insertions in a matrix of Hindi sentences. Statistics on the phone and triphone distribution have been presented, to graphically display the phonetic likeness between the reference corpus and the corpus sampled through our method.
UR - http://www.scopus.com/inward/record.url?scp=85059880521&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85059880521&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85059880521
T3 - LREC 2018 - 11th International Conference on Language Resources and Evaluation
SP - 1480
EP - 1484
BT - LREC 2018 - 11th International Conference on Language Resources and Evaluation
A2 - Isahara, Hitoshi
A2 - Maegaard, Bente
A2 - Piperidis, Stelios
A2 - Cieri, Christopher
A2 - Declerck, Thierry
A2 - Hasida, Koiti
A2 - Mazo, Helene
A2 - Choukri, Khalid
A2 - Goggi, Sara
A2 - Mariani, Joseph
A2 - Moreno, Asuncion
A2 - Calzolari, Nicoletta
A2 - Odijk, Jan
A2 - Tokunaga, Takenobu
PB - European Language Resources Association (ELRA)
ER -

Pandey A, Srivastava BML, Kumar R, Nellore BT, Teja KS, Gangashetty SV. Phonetically balanced code-mixed speech corpus for Hindi-English automatic speech recognition. In Isahara H, Maegaard B, Piperidis S, Cieri C, Declerck T, Hasida K, Mazo H, Choukri K, Goggi S, Mariani J, Moreno A, Calzolari N, Odijk J, Tokunaga T, editors, LREC 2018 - 11th International Conference on Language Resources and Evaluation. European Language Resources Association (ELRA). 2019. p. 1480-1484. (LREC 2018 - 11th International Conference on Language Resources and Evaluation).