Text line script identification for a tri-lingual document

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

11 Citations (Scopus)

Abstract

India is a multilingual, multi-script country whose states follow a three-language formula: a document may be printed in English, Hindi and the official language of the state. In Karnataka, for example, a document may contain text lines in English, Hindi and Kannada. For Optical Character Recognition (OCR) of such a multilingual document, the script of each text line must be identified before the line is passed to the OCR engine for that script. This paper presents a simple and efficient technique for identifying the script of Kannada, Hindi and English text lines in a printed document. The proposed system distinguishes the three scripts using features extracted from the horizontal projection profile of each text line. The knowledge base of the system was built from 15 document images containing about 450 text lines. For a new text line, the necessary features are extracted from its horizontal projection profile and compared with the stored knowledge base to classify the script. The system was tested on 20 document images containing about 200 text lines of each script and achieved an overall classification rate of 99.83%.
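The abstract does not list the specific features or the comparison rule, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a binarized text-line image given as rows of 0/1 pixels. The two features (relative position of the peak row and peak-to-mean ratio) and the nearest-neighbour comparison are hypothetical choices made for illustration, and the knowledge-base values and labels are toy placeholders.

```python
from math import dist


def horizontal_projection_profile(line_image):
    """Count ink pixels (value 1) in each pixel row of a binarized text line."""
    return [sum(row) for row in line_image]


def profile_features(profile):
    """Hypothetical features (an assumption, not taken from the paper):
    relative vertical position of the peak row, and peak-to-mean ratio."""
    total = sum(profile)
    if total == 0:
        raise ValueError("text line contains no ink pixels")
    peak = max(profile)
    peak_row = profile.index(peak)
    rel_pos = peak_row / (len(profile) - 1) if len(profile) > 1 else 0.0
    mean = total / len(profile)
    return (rel_pos, peak / mean)


def classify(features, knowledge_base):
    """Return the label of the stored feature vector closest to `features`.
    Nearest-neighbour matching is an assumed comparison rule."""
    nearest = min(knowledge_base, key=lambda entry: dist(entry[0], features))
    return nearest[1]


# Toy 5x6 text line: one dense row near the top, sparse rows below it.
toy_line = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

# Toy knowledge base: made-up feature vectors with placeholder labels.
toy_knowledge_base = [
    ((0.25, 3.0), "script_a"),
    ((0.50, 1.5), "script_b"),
]

profile = horizontal_projection_profile(toy_line)
label = classify(profile_features(profile), toy_knowledge_base)
print(profile, label)
```

In a real system the knowledge base would hold feature vectors extracted from labelled training lines of each script, and the feature set would be chosen so that the three scripts separate well.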

Original language: English
Title of host publication: 2010 2nd International Conference on Computing, Communication and Networking Technologies, ICCCNT 2010
DOIs: https://doi.org/10.1109/ICCCNT.2010.5592562
Publication status: Published - 2010
Event: 2010 2nd International Conference on Computing, Communication and Networking Technologies, ICCCNT 2010 - Karur, India
Duration: 29-07-2010 to 31-07-2010

Conference

Conference: 2010 2nd International Conference on Computing, Communication and Networking Technologies, ICCCNT 2010
Country: India
City: Karur
Period: 29-07-10 to 31-07-10

Fingerprint

Optical character recognition
Feature extraction

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications

Cite this

Aithal, P. K., Rajesh, G., Acharya, D. U., & Subbareddy, K. M. N. V. (2010). Text line script identification for a tri-lingual document. In 2010 2nd International Conference on Computing, Communication and Networking Technologies, ICCCNT 2010 [5592562] https://doi.org/10.1109/ICCCNT.2010.5592562
Aithal, Prakash K. ; Rajesh, G. ; Acharya, Dinesh U. ; Subbareddy, Krishnamoorthi M.N.V. / Text line script identification for a tri-lingual document. 2010 2nd International Conference on Computing, Communication and Networking Technologies, ICCCNT 2010. 2010.
@inproceedings{e97a3e4bcff949e1a36e28c3b258c5de,
title = "Text line script identification for a tri-lingual document",
author = "Aithal, {Prakash K.} and G. Rajesh and Acharya, {Dinesh U.} and Subbareddy, {Krishnamoorthi M.N.V.}",
year = "2010",
doi = "10.1109/ICCCNT.2010.5592562",
language = "English",
isbn = "9781424465910",
booktitle = "2010 2nd International Conference on Computing, Communication and Networking Technologies, ICCCNT 2010",

}

Aithal, PK, Rajesh, G, Acharya, DU & Subbareddy, KMNV 2010, Text line script identification for a tri-lingual document. in 2010 2nd International Conference on Computing, Communication and Networking Technologies, ICCCNT 2010., 5592562, 2010 2nd International Conference on Computing, Communication and Networking Technologies, ICCCNT 2010, Karur, India, 29-07-10. https://doi.org/10.1109/ICCCNT.2010.5592562

Text line script identification for a tri-lingual document. / Aithal, Prakash K.; Rajesh, G.; Acharya, Dinesh U.; Subbareddy, Krishnamoorthi M.N.V.

2010 2nd International Conference on Computing, Communication and Networking Technologies, ICCCNT 2010. 2010. 5592562.

TY - GEN

T1 - Text line script identification for a tri-lingual document

AU - Aithal, Prakash K.

AU - Rajesh, G.

AU - Acharya, Dinesh U.

AU - Subbareddy, Krishnamoorthi M.N.V.

PY - 2010

Y1 - 2010

UR - http://www.scopus.com/inward/record.url?scp=78549264084&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=78549264084&partnerID=8YFLogxK

U2 - 10.1109/ICCCNT.2010.5592562

DO - 10.1109/ICCCNT.2010.5592562

M3 - Conference contribution

SN - 9781424465910

BT - 2010 2nd International Conference on Computing, Communication and Networking Technologies, ICCCNT 2010

ER -

Aithal PK, Rajesh G, Acharya DU, Subbareddy KMNV. Text line script identification for a tri-lingual document. In 2010 2nd International Conference on Computing, Communication and Networking Technologies, ICCCNT 2010. 2010. 5592562 https://doi.org/10.1109/ICCCNT.2010.5592562