Data Deduplication techniques and analysis

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

11 Citations (Scopus)

Abstract

Data warehouses are repositories of data collected from several data sources and form the backbone of most decision support applications. Because the data sources are independent, they may adopt independent and potentially inconsistent conventions. In data warehousing applications, during ETL (Extraction, Transformation and Loading), and even in OLTP (On-Line Transaction Processing) applications, we often encounter duplicate records in tables. Moreover, data entry mistakes at any of these sources introduce further errors. Since high-quality data is essential for gaining the confidence of users of decision support applications, ensuring high data quality is critical to the success of data warehouse implementations. Significant amounts of time and money are therefore spent on detecting and correcting errors and inconsistencies, a process referred to as data cleaning. To make table data consistent and accurate, these duplicate records must be removed. In this paper we discuss different deduplication strategies, along with their pros and cons, and some of the methods used to prevent duplication in a database. In addition, we report a performance evaluation with Microsoft SQL Server 2008 on the Food Mart and AdventureDB warehouses.
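
This record does not reproduce the queries evaluated in the paper, but the basic operations the abstract describes, detecting and deleting duplicate rows and then preventing their reintroduction, can be sketched in T-SQL. The snippet below is a minimal illustration only, not code from the paper: the dbo.Customers table and its Name and City columns are hypothetical, and the ROW_NUMBER() window function used here is one common duplicate-elimination technique available in the SQL Server 2008 environment the paper uses.

    -- Hypothetical table and columns, for illustration only;
    -- the paper's actual schemas and queries are not shown in this record.
    -- Keep one row per (Name, City) group and delete the rest.
    WITH Ranked AS (
        SELECT Name, City,
               ROW_NUMBER() OVER (PARTITION BY Name, City
                                  ORDER BY (SELECT NULL)) AS rn
        FROM dbo.Customers
    )
    DELETE FROM Ranked  -- deleting through the CTE removes rows
    WHERE rn > 1;       -- from the underlying table

    -- One way to prevent exact duplicates from reappearing:
    ALTER TABLE dbo.Customers
        ADD CONSTRAINT UQ_Customers_Name_City UNIQUE (Name, City);

A constraint of this kind only blocks exact duplicates; approximate duplicates caused by inconsistent conventions or data entry mistakes, the harder case the abstract points to, require the matching-based deduplication strategies the paper surveys.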

Original language: English
Title of host publication: Proceedings - 3rd International Conference on Emerging Trends in Engineering and Technology, ICETET 2010
Pages: 664-668
Number of pages: 5
DOIs: https://doi.org/10.1109/ICETET.2010.42
Publication status: Published - 2010
Event: 3rd International Conference on Emerging Trends in Engineering and Technology, ICETET 2010 - Goa, India
Duration: 19-11-2010 – 21-11-2010

Conference

Conference: 3rd International Conference on Emerging Trends in Engineering and Technology, ICETET 2010
Country: India
City: Goa
Period: 19-11-10 – 21-11-10

Fingerprint

  • Data warehouses
  • Cleaning
  • Warehouses
  • Data acquisition
  • Servers
  • Processing

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Control and Systems Engineering
  • Electrical and Electronic Engineering

Cite this

Maddodi, S., Attigeri, G. V., & Karunakar, A. K. (2010). Data Deduplication techniques and analysis. In Proceedings - 3rd International Conference on Emerging Trends in Engineering and Technology, ICETET 2010 (pp. 664-668). [5698409] https://doi.org/10.1109/ICETET.2010.42
Maddodi, Srivatsa ; Attigeri, Girija V. ; Karunakar, A. K. / Data Deduplication techniques and analysis. Proceedings - 3rd International Conference on Emerging Trends in Engineering and Technology, ICETET 2010. 2010. pp. 664-668
@inproceedings{2acf1041d52645d3919b2af1d02f903f,
title = "Data Deduplication techniques and analysis",
abstract = "Data warehouses are the repositories of data collected from several data sources, which form the backbone of most of the decision support applications. As the data sources are independent, they may adopt independent and potentially inconsistent conventions. In data warehousing applications during ETL (Extraction, Transformation and Loading) or even in OLTP (On Line Transaction Processing) applications we are often encountered with duplicate records in table. Moreover, data entry mistakes at any of these sources introduce more errors. Since high quality data is essential for gaining the confidence of users of decision support applications, ensuring high data quality is critical to the success of data warehouse implementations. Therefore, significant amount of time and money are spent on the process of detecting and correcting errors and inconsistencies. The process of cleaning dirty data is often referred to as data cleaning. To make the table data consistent and accurate we need to get rid of these duplicate records from the table. In this paper we discuss different strategies of Deduplication along with their pros and cons and some of methods used to prevent duplication in database. In addition, we have made performance evaluation with Microsoft SQL-Server 2008 on Food Mart and AdventureDB Warehouses.",
author = "Srivatsa Maddodi and Attigeri, {Girija V.} and Karunakar, {A. K.}",
year = "2010",
doi = "10.1109/ICETET.2010.42",
language = "English",
isbn = "9780769542461",
pages = "664--668",
booktitle = "Proceedings - 3rd International Conference on Emerging Trends in Engineering and Technology, ICETET 2010",

}

Maddodi, S, Attigeri, GV & Karunakar, AK 2010, Data Deduplication techniques and analysis. in Proceedings - 3rd International Conference on Emerging Trends in Engineering and Technology, ICETET 2010., 5698409, pp. 664-668, 3rd International Conference on Emerging Trends in Engineering and Technology, ICETET 2010, Goa, India, 19-11-10. https://doi.org/10.1109/ICETET.2010.42

TY - GEN

T1 - Data Deduplication techniques and analysis

AU - Maddodi, Srivatsa

AU - Attigeri, Girija V.

AU - Karunakar, A. K.

PY - 2010

Y1 - 2010

AB - Data warehouses are repositories of data collected from several data sources and form the backbone of most decision support applications. Because the data sources are independent, they may adopt independent and potentially inconsistent conventions. In data warehousing applications, during ETL (Extraction, Transformation and Loading), and even in OLTP (On-Line Transaction Processing) applications, we often encounter duplicate records in tables. Moreover, data entry mistakes at any of these sources introduce further errors. Since high-quality data is essential for gaining the confidence of users of decision support applications, ensuring high data quality is critical to the success of data warehouse implementations. Significant amounts of time and money are therefore spent on detecting and correcting errors and inconsistencies, a process referred to as data cleaning. To make table data consistent and accurate, these duplicate records must be removed. In this paper we discuss different deduplication strategies, along with their pros and cons, and some of the methods used to prevent duplication in a database. In addition, we report a performance evaluation with Microsoft SQL Server 2008 on the Food Mart and AdventureDB warehouses.

UR - http://www.scopus.com/inward/record.url?scp=79952337104&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79952337104&partnerID=8YFLogxK

U2 - 10.1109/ICETET.2010.42

DO - 10.1109/ICETET.2010.42

M3 - Conference contribution

AN - SCOPUS:79952337104

SN - 9780769542461

SP - 664

EP - 668

BT - Proceedings - 3rd International Conference on Emerging Trends in Engineering and Technology, ICETET 2010

ER -

Maddodi S, Attigeri GV, Karunakar AK. Data Deduplication techniques and analysis. In Proceedings - 3rd International Conference on Emerging Trends in Engineering and Technology, ICETET 2010. 2010. p. 664-668. 5698409 https://doi.org/10.1109/ICETET.2010.42