Data warehouses, which hold a rich repository of company data, are run by database management systems that need homogeneous data in order to process it smoothly and produce statistical reports on company trends and patterns.
This is where data cleansing comes in. Data cleansing, also referred to as data scrubbing, is the act of detecting and then either removing or correcting a database's dirty data. Dirty data is data that is out of date, incorrect, incomplete, redundant, or formatted inconsistently. The goal of data cleansing is not just to clean up the data within a single database but also to bring consistency to different sets of data that have been merged from separate databases.
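As a minimal sketch of this kind of batch cleansing, the following Python function handles three of the dirty-data categories above: inconsistent formatting, incomplete rows, and redundant duplicates. The field names (`email`, `name`) and the sample records are hypothetical illustrations, not from the original text.

```python
def cleanse(records):
    """Remove incomplete and duplicate records; normalize formatting."""
    seen = set()      # tracks emails already kept, to drop redundant rows
    cleaned = []
    for rec in records:
        # Normalize formatting: trim whitespace, lowercase the email.
        email = rec.get("email", "").strip().lower()
        if not email:          # incomplete record: no usable key, skip it
            continue
        if email in seen:      # redundant record: already present, skip it
            continue
        seen.add(email)
        cleaned.append({
            "email": email,
            "name": rec.get("name", "").strip().title(),
        })
    return cleaned

dirty = [
    {"email": " Alice@Example.COM ", "name": "alice smith"},
    {"email": "alice@example.com",   "name": "Alice Smith"},  # duplicate
    {"email": "",                    "name": "No Email"},     # incomplete
]
print(cleanse(dirty))
```

Running this on the three sample rows leaves a single clean, normalized record, since the second row duplicates the first and the third lacks a key field.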
Although many people treat the terms as interchangeable, data cleansing differs from data validation: validation almost invariably means that dirty data is rejected from the system right at entry time, whereas data cleansing, or scrubbing, is performed on batches of data that are already in the system.
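The contrast can be sketched with two small functions, one per approach. The age-range rule here is a hypothetical example constraint, not one mentioned in the text.

```python
def validate_at_entry(age):
    """Validation: reject a dirty value immediately, at entry time."""
    if not (0 <= age <= 130):
        raise ValueError(f"rejected at entry: {age}")
    return age

def scrub_batch(ages):
    """Scrubbing: sweep an existing batch and drop the dirty values."""
    return [a for a in ages if 0 <= a <= 130]

# Validation stops the bad value before it is ever stored ...
try:
    validate_at_entry(999)
except ValueError as e:
    print(e)

# ... while scrubbing cleans up a batch that already contains bad values.
print(scrub_batch([25, -1, 40, 999]))  # [25, 40]
```

The design difference is where the check runs: validation sits on the write path of each record, while scrubbing runs as a separate pass over data already accepted.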
After the data has been cleansed, the data set is consistent and can be used alongside similar data in the system, so the database can apply a standard process to it. It is common during data warehouse implementation to detect and remove inconsistent data that arose from different data-dictionary definitions of similar entities in different stores. Other data problems may be due to errors during end-user entry, or to corruption during transmission or during storage after receipt from the source.
A good way to ensure that data is correct and consistent is to run a pre-processing stage in conjunction with data cleansing. This helps ensure that data is not ambiguous, incorrect, or incomplete.
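A pre-processing pass of this kind can be sketched as a set of checks that flag problem rows before cleansing proper. The required field names and the sample record below are hypothetical assumptions for illustration.

```python
# Hypothetical schema: every record must carry these fields.
REQUIRED = {"id", "amount"}

def preprocess(record):
    """Return a list of problems found in one record (empty list = clean)."""
    problems = []
    missing = REQUIRED - record.keys()
    if missing:
        problems.append(f"incomplete: missing {sorted(missing)}")
    amt = record.get("amount")
    if amt is not None and not isinstance(amt, (int, float)):
        problems.append("incorrect: amount is not numeric")
    return problems

print(preprocess({"id": 1, "amount": "ten"}))   # flags the non-numeric amount
print(preprocess({"id": 2, "amount": 19.99}))   # clean record: []
```

Flagging rows this early keeps ambiguous or incomplete data from reaching the warehouse, where it would be far more expensive to untangle.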
Data cleansing is an important part of achieving quality data in data warehouses. Data is said to be of high quality if it fits its purposes: supporting correct operations, decision making, and planning for the organization implementing the data warehouse. In other words, the quality of data is gauged by how faithfully it represents the real-world constructs to which it is meant to refer.