Slowly changing dimensions are a common occurrence in data warehousing. The term applies to any case where an attribute of a dimension record varies over time. There are three typical solutions. In the first, type 1, the new record replaces the old record and history is lost. In the second, type 2, a new record is added to the dimension table (here, the customer dimension), and the customer is essentially treated as two people. In the third, type 3, the original record is updated to reflect the change, with the previous value kept in a separate column. Handling slowly changing dimensions is considered one of the most critical ETL tasks for tracking the history of dimension data. The advantage of a type 2 solution is its ability to accurately retain all historical information in the data warehouse. This blog will focus on how to create a basic type 2 slowly changing dimension with an effective date range in Informatica. First we will take a look at the table structures.
Target Dimension Table
CREATE TABLE CUSTOMERS
(
    CUSTOMER_KEY NUMBER,
    CUSTOMER_ID  NUMBER,
    NAME         VARCHAR2(40),
    STATE        VARCHAR2(2),
    BEGIN_DATE   DATE,
    END_DATE     DATE
);
Here BEGIN_DATE and END_DATE define the period during which each version of a record was in effect; the current version has a null END_DATE. CUSTOMER_ID identifies the customer, and CUSTOMER_KEY is the primary key that uniquely identifies each row in the target table, so every inserted version gets a new key. A record is considered new if the target has no row for its CUSTOMER_ID with a null END_DATE. A record is considered changed if the CUSTOMER_ID exists in the target table and the STATE from the source differs from the STATE of the matching current row in the target.
The source for this demonstration will be similar to the target structure, only without the dates or CUSTOMER_KEY.
Source Table
CREATE TABLE CUSTOMERS_SRC
(
    CUSTOMER_ID NUMBER,
    NAME        VARCHAR2(40),
    STATE       VARCHAR2(2)
);
To determine whether each source record is new or changed, first create a lookup transformation, lkp_CUSTOMERS, on the target table CUSTOMERS_tgt. Add two input ports, in_CUSTOMER_ID and in_NULL_DATE. On the Condition tab, set the conditions CUSTOMER_ID = in_CUSTOMER_ID and END_DATE = in_NULL_DATE, so that only the current row for each customer is matched. This transformation will output the CUSTOMER_KEY, NAME and STATE.
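The lookup condition can be pictured as a plain filter over the target rows. The following is only a minimal Python sketch of that idea, not Informatica code: the function name lkp_customers mirrors the transformation name, and the sample rows are made up for illustration.

```python
from datetime import date

# Made-up sample rows standing in for the CUSTOMERS target table.
target_rows = [
    {"CUSTOMER_KEY": 1, "CUSTOMER_ID": 101, "NAME": "Roe, Ann", "STATE": "OH",
     "BEGIN_DATE": date(2013, 1, 1), "END_DATE": date(2014, 6, 1)},
    {"CUSTOMER_KEY": 2, "CUSTOMER_ID": 101, "NAME": "Roe, Ann", "STATE": "IN",
     "BEGIN_DATE": date(2014, 6, 1), "END_DATE": None},
]


def lkp_customers(rows, in_customer_id, in_null_date=None):
    """Return CUSTOMER_KEY, NAME and STATE of the row matching both conditions.

    Conditions: CUSTOMER_ID = in_CUSTOMER_ID and END_DATE = in_NULL_DATE.
    When nothing matches, every output is None, like a lookup miss.
    """
    for row in rows:
        if row["CUSTOMER_ID"] == in_customer_id and row["END_DATE"] == in_null_date:
            return {key: row[key] for key in ("CUSTOMER_KEY", "NAME", "STATE")}
    return {"CUSTOMER_KEY": None, "NAME": None, "STATE": None}


current = lkp_customers(target_rows, 101)  # only the open row (key 2) matches
missing = lkp_customers(target_rows, 999)  # unknown customer: all outputs None
```

Because the closed historical row has a non-null END_DATE, it never matches, so the lookup returns at most the current version of each customer.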
Next to the lookup transformation, create an expression transformation, exp_VALIDATE_CHANGE, to determine whether the incoming source record is new or changed. Connect the lookup ports CUSTOMER_KEY, NAME and STATE, and the ports NAME and STATE from the source qualifier transformation, to exp_VALIDATE_CHANGE. Create two new output ports in the expression transformation named new_FLAG and chg_FLAG. The logic for new_FLAG should check whether the CUSTOMER_KEY returned from the lookup is null. If it is null, the lookup found no current row for the incoming CUSTOMER_ID in the target table, so the record is new and should be inserted. The logic for chg_FLAG should check whether the CUSTOMER_KEY returned from the lookup is not null. If so, it must compare the incoming columns with the values returned by the lookup; in this example STATE is the tracked attribute, so if the STATE is different, the record has changed. The incoming record must then be inserted as it is the most current, and the previous record should be updated with an END_DATE. Assign a value of ‘Y’ or ‘N’ to each flag: new_FLAG is ‘Y’ if the record is new and ‘N’ otherwise, and chg_FLAG is ‘Y’ if the record has changed and ‘N’ if it has not. In addition to the flags, add another output port called out_SYSDATE to hold the current date. So far, the mapping should look like this.
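The flag logic is small enough to write out. Below is a hedged Python sketch of it, not the actual port expressions: the function name and parameter names are illustrative, and it assumes STATE is the only attribute that triggers a change, as described above.

```python
def exp_validate_change(lkp_customer_key, lkp_state, src_state):
    """Return (new_FLAG, chg_FLAG) as 'Y'/'N' strings.

    lkp_customer_key and lkp_state come from the lookup (None on a miss);
    src_state comes from the source record.
    """
    # New: the lookup found no current row for this CUSTOMER_ID.
    new_flag = "Y" if lkp_customer_key is None else "N"
    # Changed: a current row exists and its STATE differs from the source.
    changed = lkp_customer_key is not None and lkp_state != src_state
    chg_flag = "Y" if changed else "N"
    return new_flag, chg_flag


flags_new = exp_validate_change(None, None, "TX")     # customer not in target
flags_changed = exp_validate_change(5, "MI", "TX")    # same customer, new state
flags_same = exp_validate_change(5, "MI", "MI")       # nothing to do
```

The two flags are never both ‘Y’ for the same record, since one requires a null key and the other a non-null key.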
Next create a router transformation, rtr_UPD_INS_RECORDS. Inside this transformation create three new groups, UPD_CHG, INS_CHG and INS_NEW. These groups will filter the records based on the incoming flags chg_FLAG and new_FLAG. Connect the ports CUSTOMER_KEY, NAME, STATE, CUSTOMER_ID, new_FLAG, chg_FLAG and out_SYSDATE to the router.
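As a rough illustration of how the router distributes records, the Python sketch below returns the groups a record would pass into, given its flags. The function name is illustrative; the group names are the ones defined above. Unlike a filter, a router can send the same record to more than one group, which is what allows a changed record to drive both an update and an insert.

```python
def rtr_upd_ins_records(new_flag, chg_flag):
    """Return the router groups whose filter condition the record satisfies."""
    groups = []
    if chg_flag == "Y":
        groups.append("UPD_CHG")  # close out the existing row
        groups.append("INS_CHG")  # insert the new version of the row
    if new_flag == "Y":
        groups.append("INS_NEW")  # insert a brand-new customer
    return groups  # empty list: unchanged record, nothing is written


changed_groups = rtr_upd_ins_records("N", "Y")
new_groups = rtr_upd_ins_records("Y", "N")
unchanged_groups = rtr_upd_ins_records("N", "N")
```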
For the first group, UPD_CHG, filter the records by checking whether chg_FLAG = ‘Y’. If it is, allow the record to pass; otherwise keep it from moving forward. Now only records where the STATE has changed for the CUSTOMER_ID will be passed. Next to this group, create an update strategy transformation, upd_UPD_CHG. Pass the CUSTOMER_KEY and out_SYSDATE ports from the UPD_CHG group. In the update strategy, rename the out_SYSDATE port to END_DATE. The update strategy expression should be DD_UPDATE. Connect the appropriate ports to the target instance. The mapping should look like the picture below.
For the next group, INS_CHG, we will need the CUSTOMER_ID, NAME and STATE from exp_VALIDATE_CHANGE and the chg_FLAG. Filter the records so that they pass if chg_FLAG is ‘Y’; if chg_FLAG = ‘N’, do not pass them forward. This group will be used to insert a new row into the target table with the changed information while keeping the history. Create a new update strategy transformation, upd_INS_CHG, and connect the NAME, STATE and CUSTOMER_ID ports from the group INS_CHG. Connect the out_SYSDATE port from the same group and rename it to BEGIN_DATE; this will be the new start date for the record. Next create a sequence generator transformation. This will generate the primary key for the dimension, CUSTOMER_KEY. Connect the NEXTVAL port to the upd_INS_CHG transformation and rename it to CUSTOMER_KEY. The update strategy expression should be DD_INSERT. Connect all appropriate ports to the target instance. Now the mapping should look like the picture below.
The last group, INS_NEW, will be used to insert new records into the target table. Filter the incoming records so that they are allowed forward when new_FLAG = ‘Y’ and rejected when new_FLAG = ‘N’. This passes only records whose CUSTOMER_ID is not already present in the target table. Next to the group, create an update strategy transformation, upd_INS_NEW. Connect the NAME, STATE and CUSTOMER_ID ports from the group INS_NEW. Connect the out_SYSDATE port from the same group and rename it to BEGIN_DATE; this will be the start date for the record. The update strategy expression should be DD_INSERT. Next connect the NEXTVAL port from the SEQTRANS, and rename it to CUSTOMER_KEY. Connect all appropriate ports to the target instance.
For an example, consider the following scenario. We have the following record on the target table.
CUSTOMER_KEY | NAME | STATE | CUSTOMER_ID | BEGIN_DATE | END_DATE |
5 | Doe, Jon | MI | 5 | 01-01-2014 | (null) |
Jon Doe moved to Texas, so the dimension table needs a new entry that holds the updated record while the existing entry preserves the history. The source record is below.
CUSTOMER_ID | NAME | STATE |
5 | Doe, Jon | TX |
When this record is processed through the mapping, the lookup will find the existing CUSTOMER_ID and return a valid CUSTOMER_KEY = 5. The change flag will be set to ‘Y’ since the state is different in the new record. The record therefore passes the filter conditions of the router groups INS_CHG and UPD_CHG, so the mapping inserts a new record with a new CUSTOMER_KEY and also updates the previously existing record in the target table with an END_DATE. The result is below.
CUSTOMER_KEY | NAME | STATE | CUSTOMER_ID | BEGIN_DATE | END_DATE |
5 | Doe, Jon | MI | 5 | 01-01-2014 | 12-15-2014 |
20 | Doe, Jon | TX | 5 | 12-15-2014 | (null) |
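The scenario above can be replayed with a short Python sketch of the whole flow. This is only an in-memory approximation of the mapping, not how Informatica executes it: the function name load_scd2 is illustrative, the load date of 2014-12-15 stands in for out_SYSDATE, and the sequence is assumed to start at 20 so that it matches the result table. Unlike a cached lookup, this sketch sees rows inserted earlier in the same run.

```python
from datetime import date
from itertools import count


def load_scd2(target, source, load_date, next_key):
    """Apply source records to the target rows in place (type 2, date range)."""
    for src in source:
        # Lookup: current target row for this CUSTOMER_ID (END_DATE is null).
        current = next(
            (row for row in target
             if row["CUSTOMER_ID"] == src["CUSTOMER_ID"] and row["END_DATE"] is None),
            None,
        )
        is_new = current is None
        is_changed = current is not None and current["STATE"] != src["STATE"]
        if is_changed:
            # UPD_CHG branch: end-date the existing row (DD_UPDATE).
            current["END_DATE"] = load_date
        if is_changed or is_new:
            # INS_CHG / INS_NEW branches: insert a row with a new key (DD_INSERT).
            target.append({
                "CUSTOMER_KEY": next(next_key),
                "CUSTOMER_ID": src["CUSTOMER_ID"],
                "NAME": src["NAME"],
                "STATE": src["STATE"],
                "BEGIN_DATE": load_date,
                "END_DATE": None,
            })
    return target


target = [{"CUSTOMER_KEY": 5, "CUSTOMER_ID": 5, "NAME": "Doe, Jon", "STATE": "MI",
           "BEGIN_DATE": date(2014, 1, 1), "END_DATE": None}]
source = [{"CUSTOMER_ID": 5, "NAME": "Doe, Jon", "STATE": "TX"}]

# Assumed values: sequence starting at 20, load date 12-15-2014.
load_scd2(target, source, date(2014, 12, 15), count(20))
for row in target:
    print(row)
```

After the run, the original MI row carries the load date as its END_DATE, and the new TX row has CUSTOMER_KEY 20 and a null END_DATE. Running the same source record a second time writes nothing, because the lookup then finds a current row with the same STATE.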
In conclusion, a type 2 slowly changing dimension should be used when the data warehouse needs to track historical changes. Although this can cause the table to grow very quickly, since every change adds a row, it is used roughly 50% of the time in cases where an attribute of a record varies over time.