Legacy data comes from virtually everywhere within an information system and its supporting legacy systems. The many sources of legacy data include databases, most often relational, but also hierarchical, network, object, XML, and object/relational databases. Legacy data is another term for disparate data.
Files such as XML documents and “flat files” such as configuration files and comma-delimited text files may also be sources of legacy data. But the biggest sources of legacy data are old, outdated, and antiquated legacy systems.
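As a simple illustration, a comma-delimited flat file exported from a legacy system can be read with nothing more than a standard CSV parser. The file contents and field names below are hypothetical:

```python
import csv
import io

# Hypothetical comma-delimited export from a legacy order system.
legacy_export = io.StringIO(
    "order_id,customer,amount\n"
    "1001,ACME Corp,250.00\n"
    "1002,Globex,99.50\n"
)

# csv.DictReader maps each data row onto the header names.
orders = list(csv.DictReader(legacy_export))
print(orders[0]["customer"])  # ACME Corp
```

Note that every value arrives as a string; type information is lost in flat files, which is one reason such data counts as disparate.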
A legacy system refers to an existing group of computers or application programs that has become old and outdated, but companies still refuse to give these systems up because they still serve the business well.
These systems are usually large, and companies have invested so much money in implementing them that, despite potential problems identified by IT professionals, many still want to keep them for several reasons.
One of the main problems with legacy systems is that they often run on very slow and obsolete hardware for which replacement parts are very difficult to find when something breaks. Because of the general lack of understanding of these old technologies, they are often very hard to maintain, improve, and expand. And because they are old and obsolete, chances are the operations manual and other documentation have been lost over the years.
Despite the emergence of newer technologies with relatively cheaper individual parts, many companies still have compelling reasons to keep such old and antiquated systems, whose data adds to the disparity in data warehouse systems.
One of the biggest reasons is that legacy systems were implemented to be large and monolithic in nature, so a one-time redesign and reimplementation would be very costly and complicated. If a legacy system were taken out at a single moment, the whole business process would be halted for some time because of the monolithic and centralized nature of these systems.
Most companies cannot afford any business stoppage, especially in today’s fast-paced, data-driven business environment. What worsens the situation even more is that legacy systems are not well understood by younger IT professionals, so redesigning them to adopt newer technologies would require a long time and intensive planning.
That is why it is very common to see data warehouses today that combine new and legacy systems. The effect is legacy data that is highly incompatible with data coming from sources that use newer technologies.
In fact, technology vendors encounter a range of data disparity problems when working with legacy systems. IBM alone has enumerated some typical legacy data problems, which include, among others:
- Incorrect data values
- Inconsistent/incorrect data formatting
- Missing data
- Missing columns
- Additional columns
- Multiple sources for the same data
- A single column being used for several purposes
- The purpose of a column is determined by the value of one or more other columns
- Important entities, attributes, and relationships hidden and floating in text fields
- Data values that stray from their field descriptions and business rules
- Various key strategies for the same type of entity
- Unrealized relationships between data records
- One attribute is stored in several fields
- Inconsistent use of special characters
- Different data types for similar columns
- Different levels of detail
- Different modes of operation
- Varying timeliness of data
- Varying default values
- Various other representations
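Several of the problems above, such as inconsistent data formatting and missing data, show up concretely when the same logical value is recorded in different ways. The sketch below normalizes date strings with inconsistent formats; the sample values and the list of candidate formats are assumptions for illustration:

```python
from datetime import datetime

# Hypothetical legacy date values: inconsistent formats plus a missing entry.
raw_dates = ["2023-01-15", "15/01/2023", "Jan 15 2023", ""]

# Candidate formats assumed to occur in the legacy data.
FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%b %d %Y")

def normalize_date(value):
    """Return an ISO date string, or None when the value is missing or unparseable."""
    value = value.strip()
    if not value:
        return None  # missing data: flag it rather than guess a default
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # stray value that matches no known business rule

print([normalize_date(d) for d in raw_dates])
# ['2023-01-15', '2023-01-15', '2023-01-15', None]
```

Returning `None` instead of a silent default keeps the “varying default values” problem from being reintroduced downstream.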
Legacy data, and the data disparity problems it brings to a data warehouse, can be addressed by the process of ETL (extract, transform, load). ETL is a mechanism for converting disparate data, not just from legacy systems but from all other disparate data sources as well, before it is loaded into the data warehouse.
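The three ETL stages can be sketched end to end. This is a minimal illustration, not a production pipeline; the source data, field names, and the in-memory “warehouse” are all hypothetical:

```python
import csv
import io

# Hypothetical legacy export with messy whitespace and uppercase headers.
LEGACY_CSV = "ID, NAME ,SALARY\n1, Alice ,50000\n2, Bob ,60000\n"

def extract(source):
    # Extract: pull raw rows out of the disparate source as-is.
    return list(csv.reader(io.StringIO(source)))

def transform(rows):
    # Transform: normalize headers and trim stray whitespace from values.
    header = [h.strip().lower() for h in rows[0]]
    return [dict(zip(header, (cell.strip() for cell in row))) for row in rows[1:]]

def load(records, warehouse):
    # Load: append the cleaned records to the warehouse table.
    warehouse.extend(records)

warehouse = []
load(transform(extract(LEGACY_CSV)), warehouse)
print(warehouse[0])  # {'id': '1', 'name': 'Alice', 'salary': '50000'}
```

Keeping the three stages as separate functions mirrors the ETL separation of concerns: the transform step is where legacy data disparities are resolved before anything reaches the warehouse.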