What is Data Extraction?

A data extract is the output of the data extraction process, an important part of data warehouse implementation.

A data warehouse gathers data from several sources and turns it into information the company can act on. This data is used to spot patterns and trends both in the company's own business operations and across its industry.

Because the data arriving at the data warehouse may come from different sources, which are commonly disparate systems with different data formats, a data warehouse uses three processes to make the data usable. These processes are extraction, transformation and loading (ETL).
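
As a rough illustration of the three ETL steps, here is a minimal Python sketch. The two sources, their field names and the in-memory "warehouse" list are all hypothetical, chosen only to show two different formats being brought into one common layout. A real pipeline would read from and write to actual systems.

```python
import csv
import io
import json

# Two hypothetical sources that describe customers in different formats.
CSV_SOURCE = "id,name,revenue\n1,Acme,1200.50\n2,Globex,830\n"
JSON_SOURCE = '[{"customer_id": 3, "customer_name": "Initech", "rev": "415.25"}]'


def extract():
    """Extract: read raw records from each source without changing them."""
    csv_rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    json_rows = json.loads(JSON_SOURCE)
    return csv_rows, json_rows


def transform(csv_rows, json_rows):
    """Transform: map both source layouts onto one common schema."""
    unified = []
    for row in csv_rows:
        unified.append({"id": int(row["id"]),
                        "name": row["name"],
                        "revenue": float(row["revenue"])})
    for row in json_rows:
        unified.append({"id": int(row["customer_id"]),
                        "name": row["customer_name"],
                        "revenue": float(row["rev"])})
    return unified


def load(records, warehouse):
    """Load: append the unified records to the target store."""
    warehouse.extend(records)
    return warehouse


# The list stands in for a warehouse table.
warehouse = load(transform(*extract()), [])
```

Keeping the three steps as separate functions mirrors the way the steps are described above: extraction only reads, transformation only reshapes, and loading only writes.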

Data extraction is the process of retrieving data of all formats and types from unstructured or poorly structured data sources. The extracted data is then used for further processing or data migration. Raw data is usually imported into an intermediate extraction system before it is transformed, where it may be enriched with metadata before being exported to the next stage of the data warehouse workflow. The term data extraction is also often applied when experimental data is first imported into a computer from primary sources such as recording or measuring devices.

During data extraction in a data warehouse, data may be removed from the source system, or a copy may be made while the original data is retained in the source system. Some data extraction implementations also move historical data that accumulates in the operational system into the data warehouse in order to maintain the operational system's performance and efficiency.
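
The difference between copying data and moving it out of the source can be sketched with Python's built-in sqlite3 module. The orders table, its columns and the cutoff date below are hypothetical, and an in-memory database stands in for the operational system.

```python
import sqlite3


def make_source():
    """Build a small in-memory table that plays the role of the source system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, placed_on TEXT, total REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "2019-03-01", 50.0), (2, "2020-07-15", 75.0), (3, "2024-01-10", 20.0)],
    )
    conn.commit()
    return conn


def extract_copy(source):
    """Copy: read the rows and leave the source untouched."""
    return source.execute(
        "SELECT id, placed_on, total FROM orders ORDER BY id"
    ).fetchall()


def extract_move(source, cutoff):
    """Move: read rows older than the cutoff, then delete them from the source."""
    with source:  # commits on success, rolls back if an exception is raised
        rows = source.execute(
            "SELECT id, placed_on, total FROM orders WHERE placed_on < ? ORDER BY id",
            (cutoff,),
        ).fetchall()
        source.execute("DELETE FROM orders WHERE placed_on < ?", (cutoff,))
    return rows


source = make_source()
copied = extract_copy(source)                    # all rows, source unchanged
archived = extract_move(source, "2021-01-01")    # historical rows leave the source
remaining = source.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The dates are stored as ISO-formatted text so that a plain string comparison orders them correctly. In practice the moved rows would be written to the warehouse before the delete is committed.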

Data extracts are loaded into the staging area of a relational database for later manipulation in the ETL methodology.

In general, the data extraction process is performed within the source system itself. This can be most appropriate if the extraction is added to a relational database. Some database professionals instead implement the extraction logic in the data warehouse staging area and query the source system for data through an application programming interface (API).
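
The second approach, where the extraction logic lives on the staging side and pulls from the source through an API, might look roughly like the sketch below. The function fake_source_api is a made-up stand-in for whatever interface a real source system exposes, and the offset and limit paging scheme is an assumption for the example.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]

# Hypothetical data held by the source system.
_SOURCE_ROWS: List[Record] = [{"id": i, "value": i * 10} for i in range(1, 8)]


def fake_source_api(offset: int, limit: int) -> List[Record]:
    """Stand-in for a source system API that returns one page of records."""
    return _SOURCE_ROWS[offset:offset + limit]


def pull_into_staging(fetch_page: Callable[[int, int], List[Record]],
                      page_size: int = 3) -> List[Record]:
    """Extraction logic that runs on the staging side and pages through the API."""
    staging: List[Record] = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:          # an empty page means the source has no more rows
            break
        staging.extend(page)
        offset += len(page)
    return staging


staging_rows = pull_into_staging(fake_source_api)
```

Passing the fetch function in as a parameter keeps the staging logic independent of any one source, so the same loop could be pointed at a different API.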

Data extraction is a complex process but there are various software applications that have been developed to handle this process.

Some generic extraction applications can be found free on the internet. CD extraction software can create digital copies of audio CDs on the hard drive. There are also email extraction tools which can extract email addresses from different websites, including results from Google searches. These email addresses can be exported to text, HTML or XML formats.

Another data extraction tool is a web data or link extractor, which can extract URLs, meta tags (such as keywords, title and description), body text, email addresses, phone and fax numbers and many other kinds of data from a website.
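
To give a feel for what such a link extractor does internally, here is a small sketch built on Python's standard html.parser module. The sample page and the PageExtractor class are invented for illustration, and the email pattern is deliberately simple, so it will not match every valid address.

```python
import re
from html.parser import HTMLParser

# A simple pattern for illustration; real email syntax is more permissive.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class PageExtractor(HTMLParser):
    """Collects link URLs, named meta tags, the title and email addresses."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self.meta = {}
        self.title = ""
        self.emails = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.urls.append(attrs["href"])
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        # Look for email addresses in any text between tags.
        self.emails.extend(EMAIL_RE.findall(data))


SAMPLE_HTML = """<html><head><title>Example Shop</title>
<meta name="keywords" content="shoes, boots">
<meta name="description" content="A hypothetical shop page.">
</head><body>
<p>Contact sales@example.com or visit <a href="https://example.com/catalog">the catalog</a>.</p>
<a href="/about">About</a>
</body></html>"""

extractor = PageExtractor()
extractor.feed(SAMPLE_HTML)
extractor.close()
```

After the page is fed in, the results are available as extractor.urls, extractor.meta, extractor.title and extractor.emails. A real tool would also download the pages and resolve relative links such as /about against the page address.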

There is a wide array of data extracting tools. Some are used for individual purposes such as extracting data for entertainment while some are used for big projects like data warehousing.

Since a data warehouse has to run other processes besides extraction, database managers or programmers usually write programs that repeatedly check many different sites for new data updates. This way, the code sits in one area of the data warehouse and watches for new updates from the data sources. Whenever new data is detected, the program automatically picks it up and transfers it to the ETL process.
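
One common way to write such a checking program is to remember the last record already seen and ask the source only for newer ones. The sketch below assumes a hypothetical events table with an ever-increasing id column, and uses an in-memory SQLite database in place of a real source.

```python
import sqlite3
import time


def make_source():
    """In-memory table standing in for a source system that keeps growing."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
    return conn


def fetch_new_rows(source, last_seen_id):
    """Return only the rows added since the last check."""
    return source.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()


def poll_once(source, last_seen_id, handoff):
    """Run one check: pass any new rows to the ETL step and advance the marker."""
    rows = fetch_new_rows(source, last_seen_id)
    if rows:
        handoff(rows)
        last_seen_id = rows[-1][0]
    return last_seen_id


def run_poller(source, handoff, interval_seconds=60.0, max_cycles=None):
    """Keep checking at a fixed interval; max_cycles=None means run forever."""
    last_seen_id = 0
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        last_seen_id = poll_once(source, last_seen_id, handoff)
        cycles += 1
        time.sleep(interval_seconds)
    return last_seen_id


# Demonstration with two manual checks instead of the endless loop.
source = make_source()
received = []  # stands in for the input of the ETL process

source.executemany("INSERT INTO events (payload) VALUES (?)", [("a",), ("b",)])
marker = poll_once(source, 0, received.extend)

source.execute("INSERT INTO events (payload) VALUES (?)", ("c",))
marker = poll_once(source, marker, received.extend)
```

Because the marker only moves forward after rows have been handed off, a check that finds nothing new leaves it unchanged and transfers no data.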

Editorial Team at Geekinterview is a team of HR and Career Advice members led by Chandra Vennapoosa.


