SQL Business Intelligence Professional

ETL

Extract Manifests

by Barry on Jul.06, 2009, under ETL

 Almost universally, a BI consultant will eventually have a need to incorporate data that is for one reason or another, best transmitted to the BI ETL host in a text format.  Over the years I have found a few tips and tricks that have enabled me and my team to keep our ETL processes flowing.  Among the most used is the idea of an Extract Manifest included as part of every extract file specification.  This is used to guarantee the ETL process is reading the correct data, at the correct time, and process the Whole Set at the bizarre hours these processes normally are executed.  That is to say, the hours between 24:00 and 6:00am local to the process support team.

The Right Data, Right Time, Whole Set

First in the extract specification is a record that indicates what we are processing.  This is a pipe-delimited set of three elements that can drive the ETL process.  An example below:

CUSTOMER|20090701|12345

This tells the ETL process the file currently being read (regardless of filename) is a CUSTOMER extract, for the time period of July 1, 2009 and there are 12,345 rows in the set.  This is information that is of tremendous use to a dimensional model ETL process.  From this, we know we are directing the information to the correct dimensional refresh process – CUSTOMER.  Additionally, we know we are refreshing data for July 1, 2009.  Since a dimensional model is a stack, the need to ensure July 1 is processed ONLY after June 30, 2009 and BEFORE July 3, 2009 is paramount.  Else Slowly Changing Dimension (SCD) can get as a colleague put it, sideways.  Finally, we have a check number for auditing to know that we have:

  1. Processed the complete file, the process did not process a fragment, updating the dimensional model partially
  2. This also allows auditing of the entire dimension update process, we received 12,345 rows of which 12,000 were updates, 300 were new and 45 were discarded bringing no new information to the model as they were duplicates of existing attributes for their business (aka Natural) keys.

The Right Format

I have already mentioned that with the exception of development and continuous integration testing, ETL processing generally occurs in the dreadful, single-digit pre-dawn hours.  Therefore it is incumbent upon the BI Architect to make the life of the support team as easy as possible.  Therefore, I have advocated for some time the practice of including the column names of the extract file as the second row of the file.  So to further the example above:

CUSTOMER|20090701|12345

CUSTOMER_ID|LAST_NAME|FIRST_NAME|DOB|CUST_SCORE

With these rows in place, the support team need not even open the ETL code to discover an issue with the extract.  For example, if PUBLIC had been placed in the Date of Birth (DOB) column, it would be a reasonable conclusion as to the source of type conversion errors.  Additionally the support team could, at a glance, ensure the columns are in the correct order, type, etc without ever having to open up the ETL code.  Finally, they could easily open the file in their favorite editor, use a mono-space font and nail the data issue (or for small files use Excel).

Cherry On Top

There are enhancements to the basics above that I have employed from time to time, depending on the exact circumstances around the project.  For example, the addition of a hash total to the first line of the manifest.  This is valuable when knowing 12,345 rows were processed is not nearly as important as knowing the total rows PLUS a value that all but guarantees the data was not tampered with between the Extract and the Load (or Transform and Load).   In the event of extracts that need a documented start and end date, the execution date could be augmented with another to indicate the range for which data was extracted.

Leave a Comment more...

Looking for something?

Use the form below to search the site:

Still not finding what you're looking for? Drop a comment on a post or contact us so we can take care of it!

Blogroll

A few highly recommended websites...