Data warehouses, rich repositories of company data, are run by database management systems that need to see homogeneous data so that the data can flow smoothly and be processed into statistical reports on company trends and patterns.
But a problem arises because data warehouses gather, extract and transform data from a variety of sources. Data may come from a server with a completely different hardware architecture, whose software formats data differently, and when that data arrives at the warehouse it is mixed with data from other, equally disparate systems.
This is where data cleansing comes in. Data cleansing, also referred to as data scrubbing, is the act of detecting and then either removing or correcting a database's dirty data. Dirty data is data that is out of date, incorrect, incomplete, redundant or inconsistently formatted. The goal of the data cleansing process is not only to clean up the data within a single database but also to resolve inconsistencies between sets of data that have been merged from separate databases.
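The kinds of dirty data listed above can be illustrated with a minimal sketch. The record layout, field names and cutoff date here are hypothetical, chosen only to show how redundant, incomplete and out-of-date entries might be flagged and dropped:

```python
from datetime import date

# Hypothetical customer records; field names are illustrative only.
records = [
    {"name": "Ann Lee", "email": "ann@example.com", "updated": date(2023, 5, 1)},
    {"name": "ann lee", "email": "ann@example.com", "updated": date(2023, 5, 1)},  # redundant duplicate
    {"name": "Bob Tan", "email": "",                "updated": date(2023, 6, 2)},  # incomplete
    {"name": "Cy Wong", "email": "cy@example.com",  "updated": date(2015, 1, 9)},  # out of date
]

def is_dirty(rec, cutoff=date(2020, 1, 1)):
    """Flag records that are incomplete or older than the cutoff."""
    return not rec["email"] or rec["updated"] < cutoff

# Normalize formatting for comparison, then drop duplicates and dirty records.
seen, clean = set(), []
for rec in records:
    key = (rec["name"].strip().lower(), rec["email"])
    if key in seen or is_dirty(rec):
        continue
    seen.add(key)
    clean.append(rec)

print([r["name"] for r in clean])  # only the first, clean record survives
```

A real cleansing pass would of course apply many more rules, but the shape is the same: normalize, compare against what has been seen, and reject what fails the checks.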
Although many people treat the terms as interchangeable, data scrubbing differs from data validation: validation almost invariably means that bad data is rejected from the system right at entry time, whereas data scrubbing is performed later, on batches of data.
After the data has been cleansed, the data set is consistent with similar data in the system, so the database can apply a standard process to it. It is a common experience in data warehouse implementations to detect and remove inconsistent data that was supplied from different data dictionary definitions of similar entities in different stores. Other data problems may stem from errors during end-user data entry, or from corruption during transmission or during storage after receipt from the source.
A good way to guarantee that data is correct and consistent is to add a pre-processing stage in conjunction with data cleansing. This helps ensure that data is not ambiguous, incorrect or incomplete.
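Such a pre-processing stage might look like the following sketch, which rejects incomplete or incorrect records before they reach the warehouse. The record fields and rules are assumptions for illustration, not a prescribed schema:

```python
def preprocess(record):
    """Return (ok, reason); reject incomplete or incorrect records on entry."""
    # Incomplete: every required field must be present and non-empty.
    required = ("order_id", "quantity", "country")
    for field in required:
        if record.get(field) in (None, ""):
            return False, f"incomplete: missing {field}"
    # Incorrect: quantity must be a positive integer.
    if not isinstance(record["quantity"], int) or record["quantity"] <= 0:
        return False, "incorrect: quantity must be a positive integer"
    return True, "ok"

print(preprocess({"order_id": 1, "quantity": 0, "country": "SG"}))
print(preprocess({"order_id": 2, "quantity": 3, "country": "SG"}))
```

Records that fail these checks never enter the warehouse, so the later batch-oriented scrubbing stage has less dirty data to contend with.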
In practice, the data cleansing process involves removing typographical errors and validating and correcting values by comparing data against a known list of entities. The validation may be strict, as when an address is rejected because its Zip or Postal Code does not appear in the list. It may also be fuzzy, as when a record is corrected because it partially matches an existing known record.
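The strict and fuzzy cases can be sketched side by side. The reference lists below are hypothetical; the fuzzy step uses Python's standard-library difflib to find a close partial match:

```python
import difflib

# Hypothetical reference lists of valid postal codes and known city names.
VALID_CODES = {"10001", "94105", "60601"}
KNOWN_CITIES = ["New York", "San Francisco", "Chicago"]

def validate_code(code):
    """Strict validation: reject the value outright if it is not in the list."""
    return code in VALID_CODES

def correct_city(city):
    """Fuzzy validation: replace a near-miss with its closest known record."""
    matches = difflib.get_close_matches(city, KNOWN_CITIES, n=1, cutoff=0.8)
    return matches[0] if matches else None

print(validate_code("99999"))         # strict: rejected, no correction attempted
print(correct_city("San Fransisco"))  # fuzzy: corrected to the known spelling
```

The choice between the two is a policy decision: strict rules guarantee that only exact matches pass, while fuzzy rules salvage more records at the cost of occasionally guessing wrong.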
Data cleansing is an important aspect of achieving quality data in data warehouses. Data is said to be of high quality if it fits its purposes: supporting correct operations, decision making and planning in the organization implementing the data warehouse. The quality of data is therefore gauged by how faithfully it represents the real-world constructs it is designed to refer to.