Performing ELT Lookups with Data Sync

For other A-Team articles about BICS and Data Sync, click here

Introduction

The Data Sync tool provides the ability to extract from both on-premise, and cloud data sources, and to load that data into BI Cloud Service (BICS), Oracle Analytics Cloud Service (OACS), and other relational databases.

In the 2.3 release of Data Sync, new functionality was added that provided the ability to transform the data as part of the load process.  This new functionality is covered in more details in this article

The process uses the target Oracle database as the transformation engine

As part of this new functionality, Data Sync also added ‘Lookup’ functionality, which provides the ability to lookup data from the target database, based on data in the source, and use that as part of the load.  This is very useful for loading data into a warehouse, or denormalized data source.

Take an example where there is a PRODUCT table in the target warehouse that has a PRODUCT_ID as the primary key.

While loading in new Order data into an ORDERS table, we may want to denormalize the data and include a ‘PRODUCT_ID’ in the ORDERS data row.  If the PRODUCT_ID field is not available in the source data, then Data Sync can ‘lookup’ the PRODUCT_ID from the PRODUCT table as part of the load, and then load that into the ORDERS table.

Should it be required, it can also create the sequence in the database, and use that to populate the PRODUCT_ID field in the PRODUCT table.

There are some caveats:

– The Target database must be a DBaaS or other Oracle Database target as it will not work with a BICS target loading through the API, and it will not support non-Oracle database targets

– DVCS datasets are also not supported

Limitations aside, if the target is DBaaS or an Oracle database, then this new functionality adds a lot of options for transforming data based on existing data from the Target database.

This article will walk you through the steps to do this.

 

Main Article

Download The Latest Version of Data Sync Tool

Be sure to download and install the latest version of the Data Sync Tool from OTN through this link.

For further instructions on configuring Data Sync, see this article.  If a previous version of Data Sync is being upgraded, use the documentation on OTN.

TIP :  

Data Sync added 3 new variables that can be included as part of a load to add metadata or new functionality to loads.  These 3 variables are:

%%DML_CODE – which provides the DML used for each row, being it INSERT or UPDATE.  It gets a value of ‘I’ or ‘U’ depending on if the record got inserted/updated.  The target column to take this value should be CHAR(1) data type.  This value can be loaded into the target table in the database and can aid trouble-shooting.

%%UPSERT_TIME – which captures the precise time the target row was updated or inserted.  The target column should be of ‘DATE’ format.  Again, this can be loaded into the target table.

%%SURROGATE_KEY – which can be used as a lookup to get a surrogate key, particularly useful in joins.  It should be a NUMBER field.

They are available in the expression editor in the dropdown under ‘Default’:

Windows7_x64

The last of these, %%SURROGATE_KEY, will be used in this article to create a unique key for a target table

 

Process Overview

This article will expand on the example from the Introduction, where data in the Source is used to lookup to a table in the Target to get a surrogate key, which is then populated as part of the load.

In the Target warehouse there is an ORDERS table.  Note the ‘PRODUCT_ID‘ field.

Windows7_x64

And a PRODUCT table

Windows7_x64

PRODUCT_ID is the Primary Key for the PRODUCT table. It is also a Foreign Key in the ORDERS table.

The SOURCE data that is being loaded by Data Sync does not have the PRODUCT_ID, instead it has the PRODUCT_NAME (PROD_NAME) as shown below.

Notice that there are duplicate values in the PROD_NAME field.  This makes sense in the context of the data, but it is something that we need to let the tool know about so it can handle the loading correctly.

SAMP_REVENUE_DENORM

For simplicities sake, we will assume the PRODUCT_NAME is good unique joining key in the PRODUCT table, and that it matches 1:1 with a PRODUCT_ID.

When we load the ORDERS table in the TARGET we want to do a lookup to the PRODUCT table on the PROD_NAME field to return the PRODUCT_ID.  We will then populate PRODUCT_ID in the ORDERS table as a foreign key.

Data Sync will take care of this process for you by firstly loading the PRODUCT table to make sure all unique Products in the load exist in the PRODUCT table.

While not required, Data Sync can also create a Sequence on the target database which will create a unique PRODUCT_ID in the PRODUCTS table.   This will be covered in this example as well.  If the table being ‘looked up’ already exists with a unique ID, then this step can be skipped.

 

Process Steps

The following steps are provided in details below:

1. Set up the PRODUCT Source

2. Setup the ORDERS source

3. Allow for Duplicate Product Records

4. Add PRODUCT_ID to the ORDERS target table

5. Add PRODUCT_ID to the PRODUCT target table

6. Create Surrogate Key for PRODUCT_ID field

7. Add the Product ID to the ORDERS source table

8. Create Lookup / Join

9. Populate the PRODUCT_ID field in the ORDERS table

10. Run the Job and Examine the Results

 

In this example we are going to use a SAMPLE_FILE.csv file.  The file used is available for download here for testing (Sample_File.csv).  This has both the ORDERS and PRODUCTS data in it.

 

1. Set up the PRODUCT Source

a. Under Project / File Data, select ‘New’ and browse to the file and then click ‘next’

1

b. Take the default values in the ‘Import Options’ and then ‘next’.

c. Select ‘Create new:’ and name the table PRODUCT.  Also check the ‘Remove duplicates’.

Cursor_and_Windows7_x64

d. In the following screen, select the Product related columns from the input file as shown below, then ‘OK’ to finish.

Cursor_and_Windows7_x64

 

2. Setup the ORDERS source

For this example we will use the same source file, just select different fields.

a. Under ‘Project’ / ‘File Data’ select the ‘SAMPLE_FILE’ used in the previous step, but this time select ‘New Target Table’

Cursor_and_Windows7_x64

b. Name the table ‘ORDERS’

Cursor_and_Windows7_x64

c. In the New Source File definition, select the Order related fields as shown below and the PROD_NAME field.  This is the Product Name and will be used in the join.

Cursor_and_Windows7_x64

3. Confirm we are Removing Duplicate Product Records

Because the same Product name is listed in multiple orders in this source file, we need to confirm that Data Sync knows that it should remove duplicated values.

a. Under ‘File Data’, with ‘SAMPLE_FILE’ selected, go to the ‘File Targets’ menu item, and then select the ‘PRODUCT’ table.

b. Make sure ‘Remove Duplicates’ option is checked.  It should be already (from Step 1c), but this is where to change that setting if you subsequently discover the source has duplicate values.

Windows7_x64

 

4. Add PRODUCT_ID to the ORDERS target table

We need to add the PRODUCT_ID to the ORDERS target table so this can be populated at load time.

a. Go to ‘Target Tables / DataSets’, select the ‘ORDERS’ table, and then in the bottom part of the screen select ‘Table Columns’, then ‘New’ and add a field called PRODUCT_ID.  Make it a NUMBER (38,0)

b. We are going to join the ORDERS and PRODUCT table on the PROD_NAME field while joining, but we will be pulling back the PRODUCT_ID and loading that in the target table.  The PROD_NAME field is only needed in the Staging process, so we can check it as ‘Stage Only’.  That way it is not populated in the final target table.

Windows7_x64

 

5. Add PRODUCT_ID to the PRODUCT target table

In this scenario, PRODUCT_ID does not exist in the source and we are going to assume we do not have a trigger / sequence on the target table in the Target that maintains this ID.  We will therefore have Data Sync create the sequence for us.

a. Go the the PRODUCT target table and add a new column ‘PRODUCT_ID’ as a Number (38,0)

Windows7_x64

6. Create Surrogate Key for PRODUCT_ID field

This step is only required if we are having Data Sync create and maintain the surrogate key.  If the field already exists in the target, then this step can be skipped.

a. Under ‘File Data’, ‘File Targets’ select the ‘PRODUCT’ table and then the ‘Column Mapping’ option.

Windows7_x64

b. Select ‘Unmapped Columns’ and bring the new ‘PRODUCT_ID’ field over and then select ‘OK’

Windows7_x64

c. Select the PRODUCT_ID and open the Expression Editor, and then select the ‘%%SURROGATE_KEY’ option in the drop down as shown below:

Windows7_x64

d. Click ‘OK’ and then ‘Save’

Cursor_and_Windows7_x64

7. Add the PRODUCT_ID field to the ORDERS source table

a. Select the ‘ORDERS’ table and then ‘Column Mappings’ to bring up the Column Mapping Editor:

Windows7_x64

b. Select ‘Unmapped Columns’, and bring ‘PRODUCT_ID’ over and select ‘OK’

Cursor_and_Windows7_x64

8. Create Lookup / Join

We want to add the PRODUCT_ID into the ORDERS table, so we need to create the join from the ORDERS table

a. Under ‘Project’ / ‘File Data’  / ‘File Target’, select the ‘ORDERS’ table and ‘Column Mapping’ to open the editor

b. In the Column Mapping editor, select ‘Joins’

Cursor_and_Windows7_x64

c. In the Joins editor, give the Join a meaningful name

d. In the Table Names column, enter the table name that will be joined to – in this case PRODUCT

e. To guard again null product names in the source, we are going to create a left outer join from the ORDERS table to the PRODUCT table.  That way, if the PROD_NAME field is null, we would still retain the ORDER data.

Using the editor, type

left outer join “PRODUCT” on “PRODUCT”.”PROD_NAME” = “%%ORDERS”.”PROD_NAME”

Cursor_and_Windows7_x64

f. Select ‘OK’ to return to the Column Mapping editor.

TIP :  

While in this case we are using a simple join using a single field (‘PROD_NAME’), a more complex join using multiple values in the where clause could be used.

 

9. Populate the PRODUCT_ID field in the ORDERS table

a. In the mapping editor for the ORDERS table, select ‘PRODUCT_ID’ from the PRODUCT table.

Data Sync already knows how to join these 2 tables (on PROD_NAME), so will populate the PRODUCT_ID from the PRODUCT table as the ORDERS table is loaded.

b. Click ‘OK’ and save the changes.

Cursor_and_Windows7_x64

10. Validate the Transformations

Data Sync can validate the expressions to confirm that the syntax is correct.

a. Highlight the table in ‘File Targets’ and right click to bring up the available options.

b. Select ‘Validate Transformations’

Cursor_and_Windows7_x64

If the expressions are valid, a message similar to the one below will be shown.

If the expressions are not valid, then details will be provided in the log file referenced.  In that case, examine the log, then correct the expressions and re-validate.

TIP :  

This log file also contains the SQL that Data Sync will uses to perform the ELT, so can be a useful reference

98

 

11. Run the Job and Examine the Results

a. If the job fails or does not work as expected, confirm the previous steps were followed.

b. Once it completes successfully, take a look at the job details.

In this example, notice how the PRODUCT table mapping was run first, finishing at 3:37:14PM.  This was to  make sure that any new PRODUCT records in the source data were loaded into the PRODUCT table before the ORDERS mapping was run, and so would be available in the Lookup.

Then the ORDERS table mapping was run.

Cursor_and_Windows7_x64

If you look at the PRODUCT table, you’ll see the new PRODUCT_ID field, populated with a unique key.

Cursor_and_Windows7_x64

And then the ORDERS table has the appropriate PRODUCT_ID populated as a foreign key reference.

Cursor_and_Windows7_x64

Look at the database, and you will see a new Sequence called SQ_PRODUCT against the PRODUCT table.

In the event you drop and recreate the PRODUCT table in Data Sync to do a full new load, the tool will also drop and recreate the Sequence so it will start from 1 again.

Cursor_and_Windows7_x64

You have used Data Sync to perform a lookup to data in the target database, and to use that as part of the load.

 

Summary

This article walked through an example of creating a Lookup in Data Sync to lookup data in the target database, and to use the data as part of the loading process.

For other A-Team articles about BICS and Data Sync, click here.

Add Your Comment