12 Citations (Scopus)

Abstract

Public health care systems routinely collect health-related data from the population. This data can be analyzed using data mining techniques to find novel, interesting patterns, which could help formulate effective public health policies and interventions. Chronic illness is rare in the population, and the effect of this class imbalance on the performance of various classifiers was studied. The objective of this work is to identify the best classifiers for class-imbalanced health datasets through a cost-based comparison of classifier performance. The popular open-source data mining tool WEKA was used to build a variety of core classifiers as well as classifier ensembles, and to evaluate the classifiers' performance. The unequal misclassification costs were represented in a cost matrix, and cost-benefit analysis was also performed. In another experiment, various sampling methods such as under-sampling, over-sampling, and SMOTE were applied to balance the class distribution in the dataset, and the costs were compared. The Bayesian classifiers performed well, with high recall and a low number of false negatives, and were not affected by the class imbalance. Results confirm that the total cost of Bayesian classifiers can be further reduced using cost-sensitive learning methods. Classifiers built using the randomly under-sampled dataset showed a dramatic drop in costs and high classification accuracy.
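
As a rough illustration of the cost-based comparison and random under-sampling mentioned in the abstract, the sketch below scores a confusion matrix against a cost matrix and balances a labelled dataset by discarding majority-class rows. It is a minimal sketch in plain Python, not the WEKA workflow used in the paper; the function names, the confusion counts, and the 10:1 cost ratio are all hypothetical values chosen for illustration.

```python
import random


def total_cost(confusion, cost):
    """Sum count * cost over every (actual, predicted) cell."""
    return sum(
        confusion[a][p] * cost[a][p]
        for a in range(len(confusion))
        for p in range(len(confusion[a]))
    )


def random_under_sample(rows, labels, minority, seed=0):
    """Keep all minority rows plus an equally sized random subset of the rest."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    n_keep = min(len(minority_idx), len(majority_idx))
    kept = rng.sample(majority_idx, n_keep)
    idx = sorted(minority_idx + kept)
    return [rows[i] for i in idx], [labels[i] for i in idx]


# Hypothetical numbers: rows are the actual class, columns the predicted class
# (index 0 = healthy, index 1 = chronically ill).
confusion = [[900, 50],
             [20, 30]]
# Assumed costs: a false negative (missed illness) costs 10x a false positive.
cost = [[0, 1],
        [10, 0]]

print(total_cost(confusion, cost))  # 50*1 + 20*10 = 250

# Toy dataset with 8 majority rows and 2 minority rows.
rows = list(range(10))
labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = random_under_sample(rows, labels, minority=1)
print(balanced_labels.count(0), balanced_labels.count(1))  # 2 2
```

With unequal costs, two classifiers with similar accuracy can differ sharply in total cost, which is why a cost matrix rather than accuracy alone is used for the comparison.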

Original language: English
Pages (from-to): 2215-2222
Number of pages: 8
Journal: International Journal of Electrical and Computer Engineering
Volume: 7
Issue number: 4
DOIs: 10.11591/ijece.v7i4.pp2215-2222
Publication status: Published - 2017

Fingerprint

Public health
Classifiers
Costs
Sampling
Data mining
Health
Cost benefit analysis
Health care

All Science Journal Classification (ASJC) codes

  • Computer Science(all)
  • Hardware and Architecture
  • Computer Networks and Communications
  • Electrical and Electronic Engineering

Cite this

@article{a892c34126d3404c99ca1a79f4ead465,
title = "Learning from a class imbalanced public health dataset: A cost-based comparison of classifier performance",
abstract = "Public health care systems routinely collect health-related data from the population. This data can be analyzed using data mining techniques to find novel, interesting patterns, which could help formulate effective public health policies and interventions. Chronic illness is rare in the population, and the effect of this class imbalance on the performance of various classifiers was studied. The objective of this work is to identify the best classifiers for class-imbalanced health datasets through a cost-based comparison of classifier performance. The popular open-source data mining tool WEKA was used to build a variety of core classifiers as well as classifier ensembles, and to evaluate the classifiers' performance. The unequal misclassification costs were represented in a cost matrix, and cost-benefit analysis was also performed. In another experiment, various sampling methods such as under-sampling, over-sampling, and SMOTE were applied to balance the class distribution in the dataset, and the costs were compared. The Bayesian classifiers performed well, with high recall and a low number of false negatives, and were not affected by the class imbalance. Results confirm that the total cost of Bayesian classifiers can be further reduced using cost-sensitive learning methods. Classifiers built using the randomly under-sampled dataset showed a dramatic drop in costs and high classification accuracy.",
author = "Rao, Rohini R. and Makkithaya, Krishnamoorthi",
year = "2017",
doi = "10.11591/ijece.v7i4.pp2215-2222",
language = "English",
volume = "7",
pages = "2215--2222",
journal = "International Journal of Electrical and Computer Engineering",
issn = "2088-8708",
publisher = "Institute of Advanced Engineering and Science (IAES)",
number = "4",

}

TY - JOUR

T1 - Learning from a class imbalanced public health dataset

T2 - A cost-based comparison of classifier performance

AU - Rao, Rohini R.

AU - Makkithaya, Krishnamoorthi

PY - 2017

Y1 - 2017

AB - Public health care systems routinely collect health-related data from the population. This data can be analyzed using data mining techniques to find novel, interesting patterns, which could help formulate effective public health policies and interventions. Chronic illness is rare in the population, and the effect of this class imbalance on the performance of various classifiers was studied. The objective of this work is to identify the best classifiers for class-imbalanced health datasets through a cost-based comparison of classifier performance. The popular open-source data mining tool WEKA was used to build a variety of core classifiers as well as classifier ensembles, and to evaluate the classifiers' performance. The unequal misclassification costs were represented in a cost matrix, and cost-benefit analysis was also performed. In another experiment, various sampling methods such as under-sampling, over-sampling, and SMOTE were applied to balance the class distribution in the dataset, and the costs were compared. The Bayesian classifiers performed well, with high recall and a low number of false negatives, and were not affected by the class imbalance. Results confirm that the total cost of Bayesian classifiers can be further reduced using cost-sensitive learning methods. Classifiers built using the randomly under-sampled dataset showed a dramatic drop in costs and high classification accuracy.

UR - http://www.scopus.com/inward/record.url?scp=85030844630&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85030844630&partnerID=8YFLogxK

U2 - 10.11591/ijece.v7i4.pp2215-2222

DO - 10.11591/ijece.v7i4.pp2215-2222

M3 - Article

VL - 7

SP - 2215

EP - 2222

JO - International Journal of Electrical and Computer Engineering

JF - International Journal of Electrical and Computer Engineering

SN - 2088-8708

IS - 4

ER -