Data extraction is the first step of ETL (extract, transform, load), a popular and integral part of any data warehousing implementation. This set of processes is the mechanism by which a data warehouse consolidates data coming from different databases on disparate systems.
Each of the disparate databases powering the data sources may employ its own data organization and formatting. Even within a single business enterprise, different departments may use different databases: one may be relational, another non-relational.
Even with today's ubiquity of relational database systems, problems can still arise within the data warehouse because there are many relational database management system vendors, each with some degree of disparity in data formats. On top of all this, some departments may still rely on flat files such as documents and spreadsheets.
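As a rough sketch of how such disparate sources might be pulled into one place, the snippet below (with hypothetical table and field names) extracts rows from a relational table and from a CSV flat file into a single common list-of-dicts structure:

```python
import csv
import io
import sqlite3

def extract_from_relational(conn, table):
    """Pull every row from a relational table as a list of dicts."""
    conn.row_factory = sqlite3.Row
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()
    return [dict(r) for r in rows]

def extract_from_flat_file(text):
    """Pull records from CSV text (a stand-in for a spreadsheet export)."""
    return list(csv.DictReader(io.StringIO(text)))

# Two departments, two formats: one relational source, one flat file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
conn.execute("INSERT INTO sales VALUES (1, 9.99), (2, 19.50)")

csv_text = "id,amount\n3,5.25\n4,12.00\n"

# Consolidate both sources into one uniform structure.
records = extract_from_relational(conn, "sales") + extract_from_flat_file(csv_text)
print(len(records))
```

Note that the flat-file values arrive as strings while the relational ones are typed; reconciling that disparity is exactly the job of the later transform step.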
The process of data extraction is the act of retrieving the raw data (the actual values of interest) from data sources that may hold all sorts of content, whether unstructured or poorly structured. This data is then readied for further processing, storage, or migration.
But data extraction is not as simple as it seems. Because data warehouses, despite their prohibitive costs, must handle high volumes of data used by enterprise business systems, the actual extraction needs to be carefully scheduled in batches so that it does not block system traffic or consume a huge share of valuable and expensive hardware resources.
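One common way to keep extraction from monopolizing a source system, sketched here against a hypothetical `events` table, is to pull rows in bounded batches rather than in one large query:

```python
import sqlite3

BATCH_SIZE = 2  # in practice tuned to the source system's load window

def extract_in_batches(conn, table, batch_size=BATCH_SIZE):
    """Yield rows batch by batch so each query touches only a bounded slice."""
    offset = 0
    while True:
        batch = conn.execute(
            f"SELECT * FROM {table} LIMIT ? OFFSET ?", (batch_size, offset)
        ).fetchall()
        if not batch:
            break
        yield batch
        offset += len(batch)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER)")
conn.executemany("INSERT INTO events VALUES (?)", [(i,) for i in range(5)])

batches = list(extract_in_batches(conn, "events"))
print([len(b) for b in batches])
```

Between batches a real scheduler could sleep, yield to other traffic, or stop at the end of a maintenance window, which is what keeps the extract from hogging the source system's resources.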
The whole extract, transform, load (ETL) system is a synchronized process. Extraction cannot simply be done at random, or as data happens to come and go from one source to another within the warehouse, because in a data-driven business environment large volumes of data traveling from one source to the next every second would impose a severe processing load.
Establishing an ETL system that can be administered effectively and efficiently, from the start of data extraction to the final data presented by the load step, depends on the size of the data warehouse and ETL system and on the amount and complexity of the data being processed within it.
The frequency at which data is extracted before staging is a vital determinant of how much administration can be performed during the ETL process, and it has a ripple effect on the subsequent transformation and loading steps.
The frequency has to be managed well because in the ETL process, where extraction comes first, data passes through other checkpoints before being loaded into the warehouse; these checkpoints include data cleansing and data conforming.
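The cleansing and conforming checkpoints could look roughly like this (the field names and rules are illustrative, not a standard): cleansing rejects invalid records, and conforming coerces every source's values into one shared format before loading.

```python
def cleanse(record):
    """Cleansing checkpoint: keep only records that pass basic validity rules."""
    return record.get("amount") is not None and float(record["amount"]) >= 0

def conform(record):
    """Conforming checkpoint: coerce each source's values to one shared format."""
    return {"id": int(record["id"]), "amount": round(float(record["amount"]), 2)}

# Extracted records arriving in mixed formats from different sources.
extracted = [
    {"id": "1", "amount": "19.5"},  # flat-file strings
    {"id": 2, "amount": None},      # broken record, dropped by cleansing
    {"id": 3, "amount": 5.256},     # relational floats
]

loaded = [conform(r) for r in extracted if cleanse(r)]
print(loaded)
```

If extraction runs faster than these checkpoints can process records, the staging area backs up, which is why the extract frequency ripples through the whole pipeline.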
With a managed extract frequency, the processes can be fully synchronized with careful consideration of the system's resources: the capacity and capability of the staging areas, the capacity of the network, the capacity of the storage medium, and the demands of the data consumers. Synchronized data extraction also means that data quality can be strictly monitored.