TY - GEN
T1 - Script identification for a Tri-lingual document
AU - Aithal, Prakash K.
AU - Rajesh, G.
AU - Acharya, Dinesh U.
AU - Krishnamoorthi, M.
AU - Subbareddy, N. V.
PY - 2011
Y1 - 2011
N2 - India is a multilingual, multi-script country. States of India follow a three-language formula. A document may be printed in English, Hindi and the official language of the state. For Optical Character Recognition (OCR) of such a multilingual document, it is necessary to identify the script before feeding the text lines to the OCRs of the individual scripts. In this paper, a simple and efficient technique of script identification for Tamil, Hindi and English text lines from a printed document is presented. The proposed system uses the horizontal projection profile to distinguish the three scripts. Feature extraction is based on the horizontal projection profile of each text line. The knowledge base of the system is developed from 20 different document images containing about 600 text lines. The proposed system is tested on 20 different document images containing about 200 text lines of each script, and an overall classification rate of 100% is achieved.
UR - http://www.scopus.com/inward/record.url?scp=79953013459&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79953013459&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-19542-6_82
DO - 10.1007/978-3-642-19542-6_82
M3 - Conference contribution
AN - SCOPUS:79953013459
SN - 9783642195419
VL - 142 CCIS
T3 - Communications in Computer and Information Science
SP - 434
EP - 439
BT - Computer Networks and Information Technologies - Second International Conference on Advances in Communication, Network, and Computing, CNC 2011, Proceedings
T2 - 2nd International Conference on Advances in Communication, Network, and Computing, CNC 2011
Y2 - 10 March 2011 through 11 March 2011
ER -