Data Derivation refers to the process of creating a data value from one or more contributing data values through a data derivation algorithm.
Almost all business organizations today are becoming more and more dependent on the data produced by data warehouses and information systems to support their operations. Since data accuracy is important, knowing how data is derived is vital.
As systems evolve, the volume of data increases too, especially as more people and businesses move to the internet for what used to be offline transactions. As information systems evolve, their functionality also grows more complex, so the need for associated documentation of data derivation becomes more pressing.
Data derivation applies to all real-life activities that are represented in the data model and aggregated within an information system or data warehouse. For instance, in a database that keeps records of wild migratory birds, there are records pertaining to a variable called "Population Size". The basic question would be "How was the population size of migratory birds derived?" The answer may be that the value was derived from recorded observations, estimation, inference, or a combination of these, and then computed as some sort of average or by another formula.
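As a minimal sketch of such a derivation, the Python fragment below averages a set of contributing counts into a single "Population Size" value. The function name derive_population_size, the split into observed and estimated counts, and the sample numbers are all hypothetical; the article does not say which formula a real bird database would use.

```python
from statistics import mean


def derive_population_size(observed_counts, estimated_counts=()):
    """Derive one population size from several contributing counts.

    Assumption: the derivation is a plain average of every available
    count, rounded to a whole number of birds.
    """
    values = list(observed_counts) + list(estimated_counts)
    if not values:
        raise ValueError("no contributing values to derive from")
    return round(mean(values))


# Made-up field counts, used only to illustrate the call.
population_size = derive_population_size([1200, 1350, 1275])
```

Whatever formula is chosen, recording it next to the variable is what allows someone to answer the question of how the value was derived.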
Proper data derivation is key to an accurate understanding of the core content of any output, because it is the process of making new and more meaningful data by aggregating or otherwise building on the raw data collected in the database.
Any variable can be derived data. For example, in a database that stores only a person’s date of birth, the person’s age is derived from the date of birth using a formula.
In any data warehouse implementation, it is important to have a data dictionary that details the specifications for all derived data. In the case of the person’s age above, even though the derivation is simple, pitfalls can exist, because the source data from which the age is calculated must be used consistently. Even with the same set of source variables, there are several possible choices for the age calculation algorithm, as shown below:
Person_age = floor((randdate - dob) / 365.25)
The algorithm above yields an approximate age that accounts for leap years, while the slightly modified algorithm below also takes the century effect into account:
Person_age = floor((randdate - dob) / 365.23)
As can be seen, the same variable can be derived in several ways. It is therefore extremely important to have a data dictionary, so that users know which derivation is in use and can stick to one algorithm if they want consistent results.
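The effect of choosing one divisor over the other can be sketched in Python. The divisors 365.25 and 365.23 and the variable names dob and randdate are taken from the formulas above as given; the function names, the DATA_DICTIONARY structure and the derive helper are hypothetical and only illustrate how a data dictionary entry might pin a variable to a single algorithm. A calendar-based calculation (age_exact) is included purely for comparison and is not one of the article's formulas.

```python
from datetime import date
from math import floor


def age_approx(dob: date, randdate: date, days_per_year: float = 365.25) -> int:
    """Approximate age: elapsed days divided by an average year length."""
    return floor((randdate - dob).days / days_per_year)


def age_exact(dob: date, randdate: date) -> int:
    """Calendar age: full years, minus one if the birthday has not yet
    occurred in the reference year (comparison only)."""
    before_birthday = (randdate.month, randdate.day) < (dob.month, dob.day)
    return randdate.year - dob.year - before_birthday


# Hypothetical data dictionary entry: it names the source variables and
# fixes the single algorithm every user of Person_age should apply.
DATA_DICTIONARY = {
    "Person_age": {
        "sources": ["dob", "randdate"],
        "formula": "floor((randdate - dob) / 365.25)",
        "function": age_approx,
    }
}


def derive(name: str, **sources):
    """Derive a variable strictly as the data dictionary specifies."""
    entry = DATA_DICTIONARY[name]
    return entry["function"](*(sources[s] for s in entry["sources"]))


dob, randdate = date(1901, 1, 1), date(1951, 1, 1)  # a 50th birthday
age_a = age_approx(dob, randdate)                   # divisor 365.25
age_b = age_approx(dob, randdate, 365.23)           # divisor 365.23
age_c = age_exact(dob, randdate)                    # calendar age
age_d = derive("Person_age", dob=dob, randdate=randdate)
```

For this pair of dates the two divisors give ages that differ by one year, which is exactly the kind of inconsistency a documented derivation is meant to prevent.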
A data derivation mechanism in a warehouse can also improve performance. Warehouses often contain billions of raw records that are regularly updated, so keeping a snapshot of, say, inventory information at full detail can slow performance; the information is so bulky that many data warehouses do not include it at all. This problem can be overcome by keeping the inventory information at a weekly or even monthly level and using data derivation formulas to keep the data sets small.
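One way to picture this reduction is the Python sketch below, which collapses daily inventory snapshots into one derived row per ISO week and item. The row layout (day, item, quantity), the function name weekly_closing_stock and the rule of keeping the latest quantity in each week are assumptions made for illustration; an average or a month-end figure would be an equally valid derivation as long as it is documented.

```python
from datetime import date


def weekly_closing_stock(daily_rows):
    """Reduce (day, item, quantity) snapshots to one value per ISO week
    and item, keeping the quantity from the latest day in that week."""
    latest = {}
    for day, item, quantity in daily_rows:
        iso = day.isocalendar()          # (ISO year, ISO week, weekday)
        key = (iso[0], iso[1], item)
        if key not in latest or day > latest[key][0]:
            latest[key] = (day, quantity)
    return {key: quantity for key, (_, quantity) in latest.items()}


# Made-up snapshots, used only to illustrate the call.
rows = [
    (date(2024, 1, 1), "widget", 120),
    (date(2024, 1, 3), "widget", 95),
    (date(2024, 1, 8), "widget", 140),
]
weekly = weekly_closing_stock(rows)
```

Here three daily rows shrink to two weekly rows; over billions of raw records the same derivation is what keeps the stored data set manageable.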
Problems arising from data derivation can be hard to find. Data derivation formulas should therefore always be carefully planned and documented so that day-to-day operations run smoothly and efficiently.