A knowledge acquisition method for improving data quality in services engagements
Abstract
Poor data quality is a serious problem for enterprises. Enterprise databases are too large for manual data cleansing to be feasible, so it is natural to cleanse the data automatically, and this has led to the development of commercial tools for automatic cleansing. However, offering data cleansing as a service has remained a challenge because the tool must be customized for each dataset: current commercial systems lack the ability to incorporate the unique exceptions of different data sources, which makes it difficult to migrate the underlying data cleansing algorithms from one dataset to another. In this paper we look specifically at the address standardization task. We use the Ripple Down Rules (RDR) framework to lower the manual effort required to rewrite the rules when moving from one source to another. The RDR framework allows us to incrementally patch existing rules or add exceptions without breaking other rules. We compare the RDR approach with a conditional random field (CRF) address standardization system and with an existing commercially available data cleansing tool. We demonstrate that RDR is an effective knowledge acquisition method and that its adoption can allow data cleansing to be offered as a service. © 2010 IEEE.
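To illustrate what patching rules incrementally or adding exceptions without breaking other rules can look like, here is a minimal sketch of a single-classification RDR tree applied to labelling address tokens. It is not the system described in the paper: the `Rule` class, the `classify` function, the token labels, and the six-digit postcode rule are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    """One node of a single-classification RDR tree (hypothetical structure)."""
    cond: Callable[[str], bool]
    conclusion: str
    except_: Optional["Rule"] = None  # tried when cond holds (refines this rule)
    else_: Optional["Rule"] = None    # tried when cond fails (alternative rule)


def classify(root: Rule, token: str) -> Optional[str]:
    """Return the conclusion of the last rule that fired on the path."""
    result = None
    node = root
    while node is not None:
        if node.cond(token):
            result = node.conclusion
            node = node.except_
        else:
            node = node.else_
    return result


# Initial rule base, written for a first (hypothetical) data source.
root = Rule(lambda t: True, "UNKNOWN")  # default rule, always fires
root.except_ = Rule(str.isdigit, "HOUSE_NUMBER")
root.except_.else_ = Rule(
    lambda t: t.lower() in {"st", "street", "rd", "road"}, "STREET_TYPE"
)

before = classify(root, "560001")  # a digit string, so the house-number rule fires

# A second source turns out to contain six-digit postcodes (an assumption).
# Patch the rule base by attaching an exception to the house-number rule only;
# no existing rule is edited.
root.except_.except_ = Rule(lambda t: len(t) == 6, "POSTCODE")

after = classify(root, "560001")
```

Because the new rule is attached as an exception to one node, it is only ever consulted for tokens that already reached that node, so tokens such as `"42"` or `"Road"` keep their earlier labels.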