Unlocking insights with data integration: What is data integration?

Are you struggling to manage your data because it is spread across multiple sources? Is your business growing while becoming ever more dependent on manual systems and processes?

The world is becoming more data-driven

The volume and complexity of data available to support decision making continues to increase, and many organisations encounter a similar set of issues with the quality and efficiency of their data integration processes. As a leading data integration company, we find that these problems often stem from the same root causes:

  • Multiple disparate data sources
  • Legacy processes to access this data
  • Duplicated manipulation, aggregation and calculation logic in extracting and loading data
  • Inefficient data storage techniques
  • Lack of process monitoring and controls to manage changes to data

These issues can be mitigated by following a set of best-practice techniques and principles.

This post is the first in a series focusing on the key aspects of data integration services. We’ll start with a bit of jargon busting and the principles of data extraction.

[km-cta-block block-classes=”has-dark-teal-background-colour has-white-colour” label=”Contact us to discuss your requirements”]

Want to know more?

Our data experts would love to hear from you

[km_button link=”https://www.dufrain.co.uk/contact/” classes=”cta-2″]Contact us[/km_button] or [km_button link=”tel:08001303656″ classes=”cta-2″]Call us on 0800 130 3656[/km_button][/km-cta-block]

What Do We Mean by Data Integration?

Simply put, Data Integration is the combining of data from multiple data sources into a single, unified data store. In the traditional sense, this is most likely to be an Enterprise Data Warehouse, or EDW if you’re into your TLAs! There are a variety of vendors and products that facilitate data integration, including those supporting legacy system migrations, but the principles and techniques generally remain consistent.

ETL is another acronym you’ve likely heard in passing but may not totally understand. It describes the logical steps required to process data for traditional data integration:

  • Extract  – Extract data from source as efficiently as possible
  • Transform  – Perform manipulation, aggregation and calculations on the data
  • Load  – Load the data into the target storage platform and data structure
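As a rough illustration, the three steps can be sketched in a few lines of Python. The source rows, transformation rules, and in-memory “warehouse” below are all assumptions for the example, not a real implementation:

```python
# Minimal ETL sketch: extract rows, transform them, load into a target store.
# The source data, transformation rules, and in-memory target are illustrative.

def extract():
    # Extract: pull raw records from a source (hard-coded here for illustration)
    return [
        {"company_id": "001", "name": "ABC Limited", "revenue": "1200"},
        {"company_id": "002", "name": "XYZ Plc", "revenue": "3400"},
    ]

def transform(rows):
    # Transform: cast types and derive values, keeping the logic in one place
    return [
        {**row, "revenue": int(row["revenue"]), "name": row["name"].upper()}
        for row in rows
    ]

def load(rows, target):
    # Load: write transformed rows into the target store, keyed by company_id
    for row in rows:
        target[row["company_id"]] = row
    return target

warehouse = {}
load(transform(extract()), warehouse)
print(warehouse["001"]["name"])  # ABC LIMITED
```

Real pipelines replace each function with database reads, staging tables and warehouse loads, but the logical separation of the three steps stays the same.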



Historically, data integration and ETL have gone hand in hand as the preferred method of loading an EDW; however, these processes are only effective if they are monitored regularly and supported by transparent, accurate metadata (data about data).

We’ll cover all these areas in the coming posts. Let’s start at the beginning – extracting data from source systems.


1 – Extract

The Extract stage of data integration focuses on extracting data from source systems and making it accessible for further processing. Efficiency and operational continuity when extracting data from source systems are key.
The main objective of the extract step is to retrieve all the required data from the source system with as little resource overhead as possible, which underlines the importance of a proper data strategy framework. A clear set of data requirements and an understanding of the source data are essential for effective data extraction.

The extract step should be designed so that it does not negatively affect the source system in terms of performance, response time or locking. It is usually possible to run extract routines in parallel, reducing processing time.


Extract Methods

There are several ways to perform data extraction; the three most popular methods are detailed below:

  • Update notification
    If the source system can provide a notification that a record has changed, and describe that change, the affected records can be extracted and deletions and modifications tracked easily
  • Incremental extract (delta processing)
    Some systems cannot provide change notifications, but can identify which records have been modified and provide an extract of just those records
  • Full extract
    Some systems cannot identify which data has changed at all. In this case, a full extract is the only way of ensuring that all changes from the source system are captured
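An incremental (delta) extract is often implemented with a watermark: each run pulls only the records modified since the last successful run, then advances the watermark. A minimal sketch, assuming timestamped source records (the data and field names are illustrative):

```python
from datetime import datetime

# Source records with a last-modified timestamp (illustrative data)
source = [
    {"id": 1, "value": "a", "modified": datetime(2024, 1, 10)},
    {"id": 2, "value": "b", "modified": datetime(2024, 2, 5)},
    {"id": 3, "value": "c", "modified": datetime(2024, 3, 1)},
]

def incremental_extract(records, watermark):
    # Delta processing: keep only records changed since the last successful run
    delta = [r for r in records if r["modified"] > watermark]
    # The new watermark is the latest timestamp seen, carried into the next run
    new_watermark = max((r["modified"] for r in delta), default=watermark)
    return delta, new_watermark

delta, watermark = incremental_extract(source, datetime(2024, 2, 1))
print([r["id"] for r in delta])  # [2, 3]
```

In practice the watermark would be persisted between runs (for example in a control table), and the filter pushed down to the source query rather than applied in memory.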

When using Incremental or Full extracts, the extract frequency is extremely important in order to manage resource overhead and storage capacity.  It is particularly inefficient and unsustainable to adopt a principle of extracting all data items from source along with full historic data extracts.

The most appropriate extract method should be chosen to maximise speed, minimise repetition of work and storage usage. Key considerations when choosing an extract method are:

  • Extensibility
    • Rather than bringing everything from a given source, it should be possible to include only the variables identified as valuable to the business
    • Over time, additional data elements can be added to the extract routines, with potentially new target tables
  • Deltas vs full incremental refreshes
    • High volume, frequent updates – deltas should be considered
    • Low volume, infrequent updates – full incremental loads are possible

Extract Checks

A series of checks should be applied to extracted data in order to initially verify that the data is as expected and can be processed by the downstream solution.  Applying a combination of the following checks can reduce downstream processing failures and ultimately provide a more effective service:

  • File nomenclature
    • Conforms to the requirements
    • Date stamp of file as expected (if applicable)
    • File extension conforms to the requirements
  • File format
    • Conforms to the requirements
    • Possible header record
    • Delimited or fixed width as appropriate (comma, pipe, semicolon etc)
    • If files contain text it may be worth considering text enclosures
  • Row Counts
    • Within an expected range / deviation
    • Footer record with number of main records
  • Column Order
    • As per requirements
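As a sketch of how such checks might be wired together before downstream processing starts (the file-name pattern, expected header, and row-count thresholds below are illustrative assumptions, not fixed rules):

```python
import re

def validate_extract(filename, rows, expected_header, min_rows=1, max_rows=10_000):
    """Run basic pre-processing checks on an extracted file (illustrative thresholds)."""
    errors = []
    # File nomenclature: agreed name pattern with a date stamp and .csv extension
    if not re.fullmatch(r"companies_\d{8}\.csv", filename):
        errors.append(f"unexpected file name: {filename}")
    # Row counts: total within an expected range / deviation
    if not (min_rows <= len(rows) <= max_rows):
        errors.append(f"row count {len(rows)} outside range {min_rows}-{max_rows}")
    # Column order: header record matches the requirements exactly
    if rows and rows[0] != expected_header:
        errors.append(f"unexpected header: {rows[0]}")
    return errors

rows = [["Company_ID", "Company_Name"], ["001", "ABC Limited"]]
print(validate_extract("companies_20240214.csv", rows, ["Company_ID", "Company_Name"]))
# []
```

An empty list means the file can proceed to the transform stage; any accumulated errors can be logged and the file quarantined instead of failing midway through processing.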

By considering these extract techniques when developing or reviewing data integration processes, it is possible to hone the extract phase and gain maximum efficiency during processing. Reducing overhead allows data to be extracted more efficiently from disparate sources and lays the groundwork for the transformation phase that follows.

With the Extract phase now covered in detail, in the next post we will explore the key principles and techniques used within the Transformation phase of a Data Integration process.

2 – Transform

The data transformation stage of a data integration process is generally the most complex and, in its simplest form, applies a series of rules or functions to data in preparation for loading into the end target.

Further to this, data cleansing and data quality controls should be carried out in the transform stage to ensure accurate data which is:

  • Correct
  • Complete
  • Consistent
  • Unambiguous
  • Ready for loading into the end state data warehouse from a staging layer


Staging Layer

A staging layer is an intermediate storage area used for data processing during data integration; it helps get your source data into a structured equivalent before loading it into the end target.

A properly implemented staging layer should not be accessible by end users and tables will only persist for a defined (usually short) period of time for the following purposes:

  • Debugging
    You need not go back to the source data to troubleshoot; instead, the staging layer can be referred to
  • Failure recovery
    If a failure occurs further along the process, you may not be in a position to re-extract your data, as the source has since changed

Functions, Aggregations, and Calculations

When creating new data items, applying functions, or aggregating data, legibility, ease of understanding and helpful comments should take priority over execution efficiency and coding time. Consider the following items for a good balance between legibility and efficiency:

  • Comments
  • Segmentation
  • Be Selective
  • Don’t Overwrite
  • Use Metadata

Data Cleansing

It may be necessary to carry out the manual correction of data to ensure Data Quality rather than simply reporting exceptions. This can take the form of adjusting values so that they fall within certain bounds, changing the data typing of data items, or adjusting the formatting of data items. Depending on the extent of the corrections required and long-term objectives, it will be prudent to consider strategic data quality initiatives to reduce/correct errors at source.

As we’ve already discussed, data shouldn’t be overwritten before loading into the staging layer; instead, store both the original and the transformed value to allow for effective debugging.
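This principle can be sketched as keeping the raw value alongside the cleansed one in each staging record (the field names and cleansing rule here are illustrative):

```python
def cleanse_location(raw):
    # Illustrative cleansing rule: trim whitespace and normalise case
    return raw.strip().title()

def stage_record(source_row):
    # Keep the original value next to the cleansed one so failures can be
    # debugged from the staging layer without re-extracting from source
    raw = source_row["location"]
    return {
        "company_id": source_row["company_id"],
        "location_raw": raw,                 # untouched source value
        "location": cleanse_location(raw),   # cleansed value used downstream
    }

staged = stage_record({"company_id": "001", "location": "  london "})
print(staged["location_raw"], "->", staged["location"])  # "  london " -> London
```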


3 – Load

The Load stage takes the transformed data and moves it into an end target where it can be accessed by users within the business. Some data warehousing processes may overwrite existing information with cumulative information, others maintain a history and track changes to the data loaded into the data warehouse. It is recommended to load end target tables outside of business hours, in an effort to avoid downstream performance issues and avoid any potential table locking.

Slowly Changing Dimension methodologies should be considered when loading data. Dimensions in data management solutions and data warehousing are variables which contain relatively static data, such as geographical locations, customers, or products. Data captured by Slowly Changing Dimensions (SCDs) change slowly but unpredictably, rather than according to a regular schedule. Tracking the history of dimension records should be considered as one of the most critical tasks in an ETL process.


There are several types of SCD to consider; the most popular methods are:

Type 1 SCDs – Overwriting

In a Type 1 SCD the new data overwrites the existing data. Therefore the existing data is lost as it is not stored anywhere else. This allows for a full data refresh to take place and might be appropriate for loading a lookup table.

Example:

Company_ID   Company_Name   Company_Location
001          ABC Limited    London

This is the state of the Company table before loading. If ABC Limited relocates to Edinburgh, the record would be overwritten:

Company_ID   Company_Name   Company_Location
001          ABC Limited    Edinburgh
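In code, a Type 1 update is a straight overwrite of the dimension record. A minimal sketch using an in-memory table (the table structure is illustrative):

```python
# Type 1 SCD: the new value overwrites the old one and no history is kept
companies = {"001": {"Company_Name": "ABC Limited", "Company_Location": "London"}}

def scd_type1_update(table, key, **changes):
    table[key].update(changes)  # previous values are lost

scd_type1_update(companies, "001", Company_Location="Edinburgh")
print(companies["001"]["Company_Location"])  # Edinburgh
```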

Type 2 SCDs – Creating another dimension record

A Type 2 SCD retains the full history of values. When the value of a chosen variable changes, the current record is closed. A new record is created with the changed data values; this new record becomes the current record and is marked as active. Additionally, each record contains an effective-from date and an effective-to date to identify the time period for which the record was active.

If we consider the same example as above, but using Type 2 SCD logic, before we would have:

Company_ID   Company_Name   Company_Location   Effective_From   Effective_To   Active_Flag
001          ABC Limited    London             2012-04-10       NULL           1

And then following the relocation of the business to Edinburgh on the 14th of October 2016:

Company_ID   Company_Name   Company_Location   Effective_From   Effective_To   Active_Flag
001          ABC Limited    London             2012-04-10       2016-10-13     0
001          ABC Limited    Edinburgh          2016-10-14       NULL           1
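A Type 2 update can be sketched as “close the current record, insert a new active one”. The in-memory history table below mirrors the example; a real implementation would run against warehouse tables:

```python
from datetime import date, timedelta

# Type 2 SCD: close the current record and insert a new active one
history = [
    {"Company_ID": "001", "Company_Location": "London",
     "Effective_From": date(2012, 4, 10), "Effective_To": None, "Active_Flag": 1},
]

def scd_type2_update(rows, company_id, new_location, change_date):
    for row in rows:
        if row["Company_ID"] == company_id and row["Active_Flag"] == 1:
            # Close the current record the day before the change takes effect
            row["Effective_To"] = change_date - timedelta(days=1)
            row["Active_Flag"] = 0
    # The new record becomes the current, active one
    rows.append({"Company_ID": company_id, "Company_Location": new_location,
                 "Effective_From": change_date, "Effective_To": None,
                 "Active_Flag": 1})

scd_type2_update(history, "001", "Edinburgh", date(2016, 10, 14))
print(len(history), history[0]["Effective_To"], history[1]["Active_Flag"])
# 2 2016-10-13 1
```

Querying the current state then reduces to filtering on `Active_Flag = 1`, while point-in-time queries filter on the effective date range.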

Type 3 SCDs – Creating a current value field

A Type 3 SCD tracks changes using separate columns and preserves limited history. It stores two versions of the values for certain selected variables: each record holds both the previous value and the current value of the selected variable. When the value of a selected variable changes, the current value is stored as the old value and the new value becomes the current value.

Using the same example with Type 3 SCD logic, before:

Company_ID   Company_Name   Current_Company_Location   Effective_From   Previous_Company_Location
001          ABC Limited    London                     NULL             NULL

And after the move:

Company_ID   Company_Name   Current_Company_Location   Effective_From   Previous_Company_Location
001          ABC Limited    Edinburgh                  2016-10-14       London

One variation of this is to create an Original_Company_Location instead of Previous_Company_Location, which would track only the most recent value for the variable and the original.
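A Type 3 update, by contrast, simply shuffles values between columns on the same record (again an in-memory sketch mirroring the example):

```python
from datetime import date

# Type 3 SCD: one row per entity, with current and previous value columns
company = {"Company_ID": "001", "Current_Company_Location": "London",
           "Effective_From": None, "Previous_Company_Location": None}

def scd_type3_update(row, new_location, change_date):
    # The current value is demoted to the "previous" column; older history is lost
    row["Previous_Company_Location"] = row["Current_Company_Location"]
    row["Current_Company_Location"] = new_location
    row["Effective_From"] = change_date

scd_type3_update(company, "Edinburgh", date(2016, 10, 14))
print(company["Previous_Company_Location"], "->",
      company["Current_Company_Location"])  # London -> Edinburgh
```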

The above three SCD approaches to handling changes in dimensions over time should be considered when loading a data warehouse. Type 2, in particular, allows the full data history to be preserved explicitly and presents the data in a format which is easy for the business to use.


4 – Monitoring

Monitoring the process enables verification of the data moved through an ETL job. It is a non-functional requirement, needed by the business for reconciliation, debugging and incident resolution, and for calculating performance statistics. It is important to have a framework that can capture all the operational metadata you need without adding too much time to your development.


Operational Metadata

Operational metadata is a set of data values that complements the execution of an ETL process. Examples of operational metadata that should be considered for capture:

  • The project the job was in
  • Whether the job run was successful, errored, or had warnings
  • If there were any errors or warnings, how many there were, and what they were
  • Which database tables or files were read from, written to, or referenced
  • When the job started and finished
  • The computer/server that the job ran on
  • Any runtime parameters that were used by the job
  • The number of records read and written
  • The unique number of records at each stage by key
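A lightweight way to capture several of these items is to wrap every job in a common runner. The sketch below is illustrative, assuming each job is a callable that returns its record count:

```python
import socket
from datetime import datetime

def run_with_metadata(job_name, job, **params):
    """Run an ETL job and capture operational metadata about the run (a sketch)."""
    meta = {
        "job": job_name,                  # which job ran
        "server": socket.gethostname(),   # the computer/server it ran on
        "parameters": params,             # runtime parameters used by the job
        "started": datetime.now(),        # when the job started
        "status": "success",
        "errors": [],
        "records_written": None,
    }
    try:
        meta["records_written"] = job(**params)  # assumes a record count is returned
    except Exception as exc:
        meta["status"] = "errored"
        meta["errors"].append(str(exc))
    meta["finished"] = datetime.now()     # when the job finished
    return meta

meta = run_with_metadata("load_companies",
                         lambda batch_size: batch_size * 10, batch_size=5)
print(meta["status"], meta["records_written"])  # success 50
```

The resulting dictionary can be written to a control table or emailed as a run summary, so failures in overnight batches are visible the next morning.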

It is especially important to capture Operational Metadata when you have jobs which are automated or operate outside of regular business hours. Also, it is worth considering automating the delivery of an operational metadata summary. This could simply take the form of an email being sent upon completion (or failure) of a batch process to a central inbox.


In summary, capturing operational metadata gives detailed row-level information about the ETL process. Including these attributes in the data warehouse means that ETL processes can reconcile data quality issues, offer increased data integrity, and highlight areas where performance improvements may be made. Data consumers can be more confident in the data they use, and equally data stewards can be more confident in the data they’re providing.

5 – Metadata Management

In simple terms, metadata is “data about data”; if managed properly, it is generated whenever data is created, acquired, added to, deleted from, or updated in any data store or data system within the scope of the enterprise data architecture. The two main forms of metadata to consider are Technical Metadata and Business Metadata. It is important that technical and business metadata items are created and updated promptly to provide the full data lineage and data integrity that a complete data audit depends on.


Technical Metadata

Technical metadata describes the technology and the ETL process, and is created and used by the teams that build, manage, and maintain the data pipelines. Many ETL applications provide a metadata repository with an integrated metadata management system to manage technical metadata definitions. Technical metadata is often defined in the following way:

  • Source database definition – can be a data warehouse, system, or file definition
  • Target database definition – often a data warehouse
  • Source tables and columns
  • Target tables and columns

Technical metadata should be kept aligned in Development, Test, and Production environments where possible to reduce the risk of defects occurring when jobs are promoted through environments.


Business Metadata

Business metadata is documentation that adds value to data: data layers which define functionality and how the data is used in the business. It contains high-level definitions of all tables and columns present within the data warehouse. Examples of business metadata are data dictionaries, process flows, and database relationship diagrams.

Business metadata is mainly created and used by the data warehouse users, report authors (for ad-hoc querying), data managers, testers, and analysts.

The following information needs to be provided to fully describe business metadata:

  • Physical characteristics:
    • For a database – name, read-write permissions, engine
    • For a flat file – header, footer, delimiter, end of line, text enclosures
  • Table name
  • Column name
  • Business name – descriptive information on business rules for the field
  • Field type – information on the format and data type
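A business metadata entry of this shape can be modelled very simply; the dataclass below is an illustrative sketch of one data dictionary record, not a prescribed schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class ColumnMetadata:
    """One data dictionary entry describing a warehouse column (illustrative fields)."""
    table_name: str
    column_name: str
    business_name: str  # descriptive information on business rules for the field
    field_type: str     # information on the format and data type

entry = ColumnMetadata(
    table_name="dim_company",
    column_name="Company_Location",
    business_name="City in which the company is currently registered",
    field_type="VARCHAR(50)",
)
print(asdict(entry)["business_name"])
```

A collection of such records, kept alongside the warehouse and updated as part of each release, is the beginning of a data dictionary.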

Accurate and up-to-date metadata increases programmer productivity, reduces data redundancy and improves the security of information.

If you’d like to know more about how Dufrain can help with any of the topics covered in this series of articles on best-practice data integration, then please feel free to contact us by completing the form below or via LinkedIn.