TY - JOUR
T1 - Hamlet-Pattern-Based Automated COVID-19 and Influenza Detection Model Using Protein Sequences
AU - Erten, Mehmet
AU - Acharya, Madhav R.
AU - Kamath, Aditya P.
AU - Sampathila, Niranjana
AU - Bairy, G. Muralidhar
AU - Aydemir, Emrah
AU - Barua, Prabal Datta
AU - Baygin, Mehmet
AU - Tuncer, Ilknur
AU - Dogan, Sengul
AU - Tuncer, Turker
N1 - Publisher Copyright:
© 2022 by the authors.
PY - 2022/12
Y1 - 2022/12
N2 - SARS-CoV-2 and Influenza-A can present similar symptoms. Computer-aided diagnosis can help facilitate screening for the two conditions, and may be especially relevant and useful in the current COVID-19 pandemic because seasonal Influenza-A infection can still occur. We have developed a novel text-based classification model for discriminating between the two conditions using protein sequences of varying lengths. We downloaded viral protein sequences of SARS-CoV-2 and Influenza-A with varying lengths (all 100 or greater) from the NCBI database and randomly selected 16,901 SARS-CoV-2 and 19,523 Influenza-A sequences to form a two-class study dataset. We used a new feature extraction function based on a unique pattern, HamletPat, generated from the text of Shakespeare’s Hamlet, and a signum function to extract local binary pattern-like bits from overlapping fixed-length (27) blocks of the protein sequences. The bits were converted to decimal map signals from which histograms were extracted and concatenated to form a final feature vector of length 1280. The iterative Chi-square function selected the 340 most discriminative features to feed to an SVM with a Gaussian kernel for classification. The model attained 99.92% and 99.87% classification accuracy rates using hold-out (75:25 split ratio) and five-fold cross-validations, respectively. The excellent performance of the lightweight, handcrafted HamletPat-based classification model suggests that it can be a valuable tool for screening protein sequences to discriminate between SARS-CoV-2 and Influenza-A infections.
AB - SARS-CoV-2 and Influenza-A can present similar symptoms. Computer-aided diagnosis can help facilitate screening for the two conditions, and may be especially relevant and useful in the current COVID-19 pandemic because seasonal Influenza-A infection can still occur. We have developed a novel text-based classification model for discriminating between the two conditions using protein sequences of varying lengths. We downloaded viral protein sequences of SARS-CoV-2 and Influenza-A with varying lengths (all 100 or greater) from the NCBI database and randomly selected 16,901 SARS-CoV-2 and 19,523 Influenza-A sequences to form a two-class study dataset. We used a new feature extraction function based on a unique pattern, HamletPat, generated from the text of Shakespeare’s Hamlet, and a signum function to extract local binary pattern-like bits from overlapping fixed-length (27) blocks of the protein sequences. The bits were converted to decimal map signals from which histograms were extracted and concatenated to form a final feature vector of length 1280. The iterative Chi-square function selected the 340 most discriminative features to feed to an SVM with a Gaussian kernel for classification. The model attained 99.92% and 99.87% classification accuracy rates using hold-out (75:25 split ratio) and five-fold cross-validations, respectively. The excellent performance of the lightweight, handcrafted HamletPat-based classification model suggests that it can be a valuable tool for screening protein sequences to discriminate between SARS-CoV-2 and Influenza-A infections.
UR - http://www.scopus.com/inward/record.url?scp=85144847094&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85144847094&partnerID=8YFLogxK
U2 - 10.3390/diagnostics12123181
DO - 10.3390/diagnostics12123181
M3 - Article
AN - SCOPUS:85144847094
VL - 12
JO - Diagnostics
JF - Diagnostics
SN - 2075-4418
IS - 12
M1 - 3181
ER -