Script identification for a Tri-lingual document

Prakash K. Aithal, G. Rajesh, Dinesh U. Acharya, M. Krishnamoorthi, N. V. Subbareddy

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)

Abstract

India is a multilingual, multi-script country. Indian states follow a three-language formula, so a document may be printed in English, Hindi, and the official language of the state. For Optical Character Recognition (OCR) of such a multilingual document, the script of each text line must be identified before the line is passed to the OCR engine for that script. This paper presents a simple and efficient technique for identifying the script of Tamil, Hindi, and English text lines in a printed document. The proposed system distinguishes the three scripts using features extracted from the horizontal projection profile of each text line. The knowledge base of the system was developed from 20 document images containing about 600 text lines. The system was tested on 20 different document images containing about 200 text lines of each script, and an overall classification rate of 100% was achieved.
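
As an illustration only, the horizontal projection profile of a text line can be computed by counting the ink pixels in each row of the binarized line image. The abstract does not say which features the authors derive from the profile, so the `profile_features` function below, its two feature names, and the toy image are assumptions made for this sketch and are not the paper's method. The toy line contains one fully inked row near the top, loosely imitating the headline stroke that is characteristic of Devanagari (Hindi) text.

```python
def horizontal_projection_profile(line_img):
    """Return the number of ink pixels in each row of a binary text-line image.

    line_img is a list of rows; a non-zero value marks an ink pixel.
    """
    return [sum(1 for px in row if px) for row in line_img]


def profile_features(line_img):
    """Hypothetical features (not taken from the paper): strength and position of the peak row."""
    profile = horizontal_projection_profile(line_img)
    width = len(line_img[0])
    peak_row = max(range(len(profile)), key=lambda r: profile[r])
    last_row = len(profile) - 1
    return {
        # Fraction of the line width covered by ink in the densest row.
        "peak_ratio": profile[peak_row] / width,
        # 0.0 = top row of the line, 1.0 = bottom row.
        "peak_position": peak_row / last_row if last_row > 0 else 0.0,
    }


# Toy 6 x 8 text line: a solid bar in row 1 and three vertical strokes below it.
toy = [[0] * 8 for _ in range(6)]
toy[1] = [1] * 8
for r in range(2, 5):
    for c in (1, 4, 6):
        toy[r][c] = 1

print(horizontal_projection_profile(toy))
print(profile_features(toy))
```

A classifier built on such features would compare them against thresholds or stored reference values for each script; how the paper's knowledge base does this is not stated in the abstract.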

Original language: English
Title of host publication: Computer Networks and Information Technologies - Second International Conference on Advances in Communication, Network, and Computing, CNC 2011, Proceedings
Pages: 434-439
Number of pages: 6
Volume: 142 CCIS
DOIs: https://doi.org/10.1007/978-3-642-19542-6_82
Publication status: Published - 2011
Event: 2nd International Conference on Advances in Communication, Network, and Computing, CNC 2011 - Bangalore, India
Duration: 10-03-2011 to 11-03-2011

Publication series

Name: Communications in Computer and Information Science
Volume: 142 CCIS
ISSN (Print): 1865-0929

Conference

Conference: 2nd International Conference on Advances in Communication, Network, and Computing, CNC 2011
Country: India
City: Bangalore
Period: 10-03-11 to 11-03-11

Fingerprint

Optical character recognition
Feature extraction

All Science Journal Classification (ASJC) codes

  • Computer Science (all)

Cite this

Aithal, P. K., Rajesh, G., Acharya, D. U., Krishnamoorthi, M., & Subbareddy, N. V. (2011). Script identification for a Tri-lingual document. In Computer Networks and Information Technologies - Second International Conference on Advances in Communication, Network, and Computing, CNC 2011, Proceedings (Vol. 142 CCIS, pp. 434-439). (Communications in Computer and Information Science; Vol. 142 CCIS). https://doi.org/10.1007/978-3-642-19542-6_82
@inproceedings{ef1fd5a776ae458083657da97b5f7f7c,
title = "Script identification for a Tri-lingual document",
abstract = "India is a multilingual, multi-script country. Indian states follow a three-language formula, so a document may be printed in English, Hindi, and the official language of the state. For Optical Character Recognition (OCR) of such a multilingual document, the script of each text line must be identified before the line is passed to the OCR engine for that script. This paper presents a simple and efficient technique for identifying the script of Tamil, Hindi, and English text lines in a printed document. The proposed system distinguishes the three scripts using features extracted from the horizontal projection profile of each text line. The knowledge base of the system was developed from 20 document images containing about 600 text lines. The system was tested on 20 different document images containing about 200 text lines of each script, and an overall classification rate of 100{\%} was achieved.",
author = "Aithal, {Prakash K.} and G. Rajesh and Acharya, {Dinesh U.} and M. Krishnamoorthi and Subbareddy, {N. V.}",
year = "2011",
doi = "10.1007/978-3-642-19542-6_82",
language = "English",
isbn = "9783642195419",
volume = "142 CCIS",
series = "Communications in Computer and Information Science",
pages = "434--439",
booktitle = "Computer Networks and Information Technologies - Second International Conference on Advances in Communication, Network, and Computing, CNC 2011, Proceedings",

}

TY - GEN

T1 - Script identification for a Tri-lingual document

AU - Aithal, Prakash K.

AU - Rajesh, G.

AU - Acharya, Dinesh U.

AU - Krishnamoorthi, M.

AU - Subbareddy, N. V.

PY - 2011

Y1 - 2011

AB - India is a multilingual, multi-script country. Indian states follow a three-language formula, so a document may be printed in English, Hindi, and the official language of the state. For Optical Character Recognition (OCR) of such a multilingual document, the script of each text line must be identified before the line is passed to the OCR engine for that script. This paper presents a simple and efficient technique for identifying the script of Tamil, Hindi, and English text lines in a printed document. The proposed system distinguishes the three scripts using features extracted from the horizontal projection profile of each text line. The knowledge base of the system was developed from 20 document images containing about 600 text lines. The system was tested on 20 different document images containing about 200 text lines of each script, and an overall classification rate of 100% was achieved.

UR - http://www.scopus.com/inward/record.url?scp=79953013459&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79953013459&partnerID=8YFLogxK

U2 - 10.1007/978-3-642-19542-6_82

DO - 10.1007/978-3-642-19542-6_82

M3 - Conference contribution

SN - 9783642195419

VL - 142 CCIS

T3 - Communications in Computer and Information Science

SP - 434

EP - 439

BT - Computer Networks and Information Technologies - Second International Conference on Advances in Communication, Network, and Computing, CNC 2011, Proceedings

ER -
