Species identification based on approximate matching

Nagamma Patil, Durga Toshniwal, Kumkum Garg

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

1 Citation (Scopus)

Abstract

Genomic data mining and knowledge extraction are important problems in bioinformatics. Existing methods for species identification are based on n-grams. In this paper, we propose a novel approach to species identification. Given a database of genomic sequences, the proposed method extracts all candidate subsequences that satisfy three constraints: length greater than or equal to a given minimum length, a given number of mismatches, and support greater than or equal to a user-specified threshold. These patterns are used as features for a classifier. Genome sequences are classified using three data mining techniques: Naive Bayes, support vector machine, and nearest neighbor. Individual classifier accuracies are reported. We also show the effect of sampling size on classification accuracy; accuracy was observed to increase with sampling size. Genome data from two species, E. coli and yeast, are used to verify the proposed method.
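The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it is a brute-force illustration that assumes mismatches are counted as substitutions (Hamming distance) and that support is the fraction of database sequences containing an approximate occurrence of the pattern. All function names, and the `max_length` cap on pattern length, are hypothetical and introduced here only for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_approx(pattern, sequence, max_mismatches):
    """True if some window of `sequence` matches `pattern` with at most
    `max_mismatches` substitutions (assumed mismatch model)."""
    k = len(pattern)
    return any(
        hamming(pattern, sequence[i:i + k]) <= max_mismatches
        for i in range(len(sequence) - k + 1)
    )


def mine_patterns(sequences, min_length, max_length, max_mismatches, min_support):
    """Brute-force search for frequent approximate patterns.

    Candidates are the substrings actually present in the database with
    length in [min_length, max_length]. Support is assumed to be the
    fraction of sequences with an approximate occurrence.
    """
    frequent = {}
    for k in range(min_length, max_length + 1):
        candidates = {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}
        for p in sorted(candidates):
            hits = sum(occurs_approx(p, s, max_mismatches) for s in sequences)
            support = hits / len(sequences)
            if support >= min_support:
                frequent[p] = support
    return frequent


def featurize(sequence, patterns, max_mismatches):
    """Binary feature vector: 1 if the pattern occurs approximately, else 0."""
    return [int(occurs_approx(p, sequence, max_mismatches)) for p in patterns]


def nearest_neighbor(train_vectors, train_labels, query):
    """1-nearest-neighbor label under Hamming distance on feature vectors."""
    distances = [hamming(v, query) for v in train_vectors]
    return train_labels[distances.index(min(distances))]
```

In use, `mine_patterns` would be run on the training sequences, `featurize` applied to every sequence with the sorted pattern list, and the resulting vectors passed to a classifier. The nearest-neighbor function above stands in for the three classifiers named in the abstract; Naive Bayes and a support vector machine would consume the same feature vectors.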

Original language: English
Title of host publication: Compute 2011 - 4th Annual ACM Bangalore Conference
DOI: https://doi.org/10.1145/1980422.1980452
Publication status: Published - 09-06-2011
Externally published: Yes
Event: 4th Annual ACM Bangalore Conference, Compute 2011 - Bangalore, India
Duration: 25-03-2011 to 26-03-2011

Conference

Conference: 4th Annual ACM Bangalore Conference, Compute 2011
Country: India
City: Bangalore
Period: 25-03-11 to 26-03-11

Fingerprint

Data mining
Classifiers
Genes
Sampling
Bioinformatics
Yeast
Escherichia coli
Support vector machines

All Science Journal Classification (ASJC) codes

  • Computational Theory and Mathematics
  • Computer Science Applications

Cite this

Patil, N., Toshniwal, D., & Garg, K. (2011). Species identification based on approximate matching. In Compute 2011 - 4th Annual ACM Bangalore Conference [30] https://doi.org/10.1145/1980422.1980452
@inproceedings{44fa879a91434f84b02f9be4b0f7697a,
title = "Species identification based on approximate matching",
author = "Nagamma Patil and Durga Toshniwal and Kumkum Garg",
year = "2011",
month = "6",
day = "9",
doi = "10.1145/1980422.1980452",
language = "English",
isbn = "9781450307505",
booktitle = "Compute 2011 - 4th Annual ACM Bangalore Conference",

}


TY - GEN

T1 - Species identification based on approximate matching

AU - Patil, Nagamma

AU - Toshniwal, Durga

AU - Garg, Kumkum

PY - 2011/6/9

Y1 - 2011/6/9

UR - http://www.scopus.com/inward/record.url?scp=79957996336&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79957996336&partnerID=8YFLogxK

U2 - 10.1145/1980422.1980452

DO - 10.1145/1980422.1980452

M3 - Conference contribution

AN - SCOPUS:79957996336

SN - 9781450307505

BT - Compute 2011 - 4th Annual ACM Bangalore Conference

ER -
