What counts as a duplicate is defined in somewhat different ways. For instance, some define a duplicate as a record that has exactly the same syntactic terms in the same sequence as another, whether or not the two differ in formatting. In effect, the records have either no differences or only formatting differences, and the contents of the data are exactly the same.
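As a minimal sketch of this definition, the snippet below treats two records as duplicates when they match after formatting is stripped away. What counts as "formatting" is an assumption made here for illustration (whitespace and letter case); the function names `normalize` and `is_duplicate` are hypothetical.

```python
import re


def normalize(record: str) -> str:
    """Remove formatting-only differences (assumed: whitespace runs and case)."""
    collapsed = re.sub(r"\s+", " ", record)  # any run of whitespace becomes one space
    return collapsed.strip().lower()


def is_duplicate(a: str, b: str) -> bool:
    """Two records are duplicates if their normalized contents are identical."""
    return normalize(a) == normalize(b)


if __name__ == "__main__":
    # Same terms in the same sequence, different spacing and case.
    print(is_duplicate("John  Smith,\tNew York", "john smith, new york"))
    # Different content (a spelling difference is not a formatting difference).
    print(is_duplicate("John Smith", "Jon Smith"))
```

A stricter or looser definition of a duplicate would change only `normalize`; the comparison itself stays an exact match.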
In any case, data duplication happens all the time. In large data warehouses it is practically inevitable, as millions of records are gathered at very short intervals.
Data warehousing involves a process called ETL, which stands for extract, transform and load. During the extraction phase, large volumes of data arrive at the data warehouse from several sources, and the system behind the warehouse consolidates the data so that each source system's format can be read consistently by the warehouse's data consumers.
A data warehouse is basically a database, and unintentional duplication of records among the millions gathered from other sources can hardly be avoided. In the data warehousing community, finding duplicated records within large databases has long been a persistent problem and has become an area of active research. Many research efforts have addressed the problems caused by duplicate contamination of data.
Several approaches have been implemented to counter the problem of data duplication. One approach is to hand-code rules that filter incoming data so that duplicates are not stored. Other approaches apply the latest machine learning techniques or more advanced business intelligence applications. The accuracy of these methods varies. For very large data collections, some of the methods may be too complex and too expensive to deploy at full capacity.
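The hand-coded rule approach can be sketched roughly as follows. The rule used here is an assumption chosen for illustration: two records are the same entity when a chosen set of key fields match after trimming and lowercasing. The function name `dedupe_by_rule` and the field names `name` and `email` are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence, Set, Tuple


def dedupe_by_rule(
    records: Iterable[Dict[str, str]],
    key_fields: Sequence[str] = ("name", "email"),
) -> List[Dict[str, str]]:
    """Keep the first record seen for each rule key and drop later matches."""
    seen: Set[Tuple[str, ...]] = set()
    unique: List[Dict[str, str]] = []
    for rec in records:
        # The hand-coded rule: compare key fields, ignoring case and outer spaces.
        key = tuple(str(rec.get(field, "")).strip().lower() for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


if __name__ == "__main__":
    incoming = [
        {"name": "Ann Lee", "email": "ann@example.com", "source": "crm"},
        {"name": "ann lee ", "email": "ANN@example.com", "source": "billing"},
        {"name": "Bob Ray", "email": "bob@example.com", "source": "crm"},
    ]
    for row in dedupe_by_rule(incoming):
        print(row)
```

Rules like this are cheap to run but only as good as the fields they compare, which is one reason the accuracy of the different methods varies.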
Despite all these countermeasures and the best efforts to clean data, the reality remains that data duplication will never be totally eliminated. It is therefore extremely important to understand its impact on the quality of a data warehouse implementation. In particular, the presence of duplicates may skew content distribution.
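To make the skew concrete, here is a small illustrative calculation. The data is invented for the example, and the helper name `distribution` is hypothetical: it simply computes each value's share of the total.

```python
from collections import Counter
from typing import Dict, Iterable


def distribution(values: Iterable[str]) -> Dict[str, float]:
    """Return each distinct value's share of all values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


if __name__ == "__main__":
    clean = ["north", "south"]                       # one order per region
    with_dupes = ["north", "north", "north", "south"]  # the north order loaded three times
    print(distribution(clean))
    print(distribution(with_dupes))
```

In the clean data each region holds half of the orders; with the duplicated record, the same underlying facts appear as a three-to-one split.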
Some application systems have duplicate detection functions. These functions work by calculating a hash value for a given piece of data or group of data, such as a document. Each document, for instance, is checked for duplication by looking up its hash value in either an in-memory hash table or a persistent lookup system. Some of the most commonly used hash functions include MD2, MD5 and the SHA family. These are preferred for their desirable properties: they are easily calculated over data or documents of arbitrary length, and they have a low probability of accidental collision.
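The hash lookup described above might look roughly like this. The sketch assumes SHA-256 (one member of the SHA family, available in Python's standard `hashlib`) and an in-memory dictionary as the lookup structure; the class name `DuplicateDetector` and its `check` method are hypothetical, and a persistent store could stand in for the dictionary.

```python
import hashlib
from typing import Dict, Optional


class DuplicateDetector:
    """Remember a digest per document and report earlier documents with the same one."""

    def __init__(self) -> None:
        self._seen: Dict[str, str] = {}  # digest -> id of the first document seen

    def check(self, doc_id: str, content: bytes) -> Optional[str]:
        """Return the id of an earlier identical document, or None if there is none."""
        digest = hashlib.sha256(content).hexdigest()
        first_id = self._seen.setdefault(digest, doc_id)
        return None if first_id == doc_id else first_id


if __name__ == "__main__":
    detector = DuplicateDetector()
    print(detector.check("doc-1", b"quarterly sales report"))  # first time seen
    print(detector.check("doc-2", b"quarterly sales report"))  # same bytes as doc-1
    print(detector.check("doc-3", b"annual sales report"))     # different bytes
```

Because the digest is computed over raw bytes, this catches only exact copies; any formatting normalization would need to happen before hashing.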
Data duplication is also related to problems such as plagiarism detection and clustering. Plagiarism, however, may involve either exact duplication or mere similarity to a certain document. A document considered plagiarized may reproduce the abstract idea rather than the word-for-word content. Clustering, on the other hand, is a method for grouping data that share similar characteristics. It is used for fast retrieval of relevant information from a database.
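Similarity, as opposed to exact duplication, needs a score rather than an equality test. One common way to compute such a score is Jaccard similarity over word shingles, sketched below; this technique is chosen here for illustration and is not named in the text above. The function names and the shingle size `k=3` are assumptions. Note that it measures word overlap only and would not detect a document that reproduces just the abstract idea.

```python
from typing import Set, Tuple


def shingles(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of k-word sequences (shingles) in the text."""
    words = text.lower().split()
    if len(words) < k:
        # Too short for a full shingle: treat the whole text as one unit.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Shared shingles divided by all distinct shingles, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    print(jaccard("the quick brown fox jumps", "the quick brown fox sleeps"))
    print(jaccard("the quick brown fox", "an entirely different sentence here"))
```

A score of 1.0 corresponds to the exact-duplicate case, while intermediate scores are the kind of pairwise measure that a similarity-based grouping could build on.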
Careful planning in the implementation of a data warehouse, including a clear definition of the data architecture and investment in robust IT hardware and software infrastructure, can help minimize the problems caused by incidents of data duplication.