The Data Sync tool provides the ability to extract from both on-premise and cloud data sources, and to load that data into BI Cloud Service (BICS), Oracle Analytics Cloud Service (OAC), and other relational databases.
I have previously written about the tool's ability to extract data from OTBI, both cloud and on-premise, and to load into BICS or OAC (that article is here).
New functionality has been added which allows automatic date based partitioning of the data being extracted from the OTBI source. This can help improve performance and reduce the load on the server and database underlying the OTBI environment.
This article will walk through the simple process of setting this up, giving two examples - first for an extract based on an OTBI subject area, and second for an extract based on an OTBI SQL query.
This is a continuation of this previous article, so read that for additional information.
Please also note that both the BICS and OAC versions of Data Sync include this functionality. The screenshots for this article were taken from the OAC version. The look may be slightly different in the BICS version, but the two are similar enough that the steps in this article can be easily replicated in either.
In this context, date based partitioning is the method by which a large data set is broken into smaller 'chunks' by filtering on date ranges.
As an example, let us assume a table in OTBI has 10 million rows of data from customer orders placed in the year 2016. If you were to write a traditional query to extract all the data from that table, with no filters, the query would try to extract and download the 10 million rows of data in one shot. This could cause performance issues within the OTBI server.
To reduce the data set, you could write multiple reports to break the extract into smaller pieces, each with a filter for a month or a week at a time - but who has time for that?!
Data Sync can now do this automatically for you. As long as the OTBI table or query contains a date field that can be used to break the data into smaller sets - for instance, ORDER_DATE or CREATE_DATE - Data Sync will use that date field to create the sub-queries for you, based on whatever 'date chunk' size you specify, down to as small a period as a single day. It is preferable to choose an immutable date field - one whose value does not change after the row is created.
Based on our example of a table with 10 million rows spread out through the year 2016, once set up and run, Data Sync will first query the OTBI server to find the minimum date (1/1/2016 in this case) and maximum (12/31/2016) from the data set.
Then, based on the 'date chunk' size chosen - we'll say 7 days - Data Sync will start by pulling back only data from the 7 day period 12/25/2016 to 12/31/2016. Once that data has been returned, it will run again, pulling back data from 12/18/2016 to 12/24/2016, and so on. Assuming an even distribution of data, the 10 million rows will be broken down into chunks of under 200,000 rows each, which should lessen the impact on the underlying OTBI database for each query.
The process starts at the most recent date in the data set because that is the data most likely to be changing in an active transactional database; starting there better captures the state of the data at the time the load commenced.
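Conceptually, the chunking described above can be sketched in a few lines. This is not Data Sync's actual implementation (that is internal to the tool); it is a minimal illustration of how date ranges are generated, walking backwards from the maximum date to the minimum:

```python
from datetime import date, timedelta

def partition_ranges(min_date, max_date, chunk_days):
    """Yield (start, end) date ranges of chunk_days length each,
    newest chunk first, never extending before min_date."""
    end = max_date
    while end >= min_date:
        start = max(end - timedelta(days=chunk_days - 1), min_date)
        yield (start, end)
        end = start - timedelta(days=1)

# The article's example: one year of data in 7-day chunks, newest first
ranges = list(partition_ranges(date(2016, 1, 1), date(2016, 12, 31), 7))
print(ranges[0])  # (date(2016, 12, 25), date(2016, 12, 31))
print(ranges[1])  # (date(2016, 12, 18), date(2016, 12, 24))
```

Each yielded range would become the date filter for one sub-query against the OTBI source.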
Be sure to download and install the latest version of the Data Sync Tool from OTN through this link.
Data Sync requires JDK8. You can download that through this link.
(Note - customers still using BI Cloud Service (BICS) and loading into the schema service database cannot use this version of Data Sync. The version for BICS is available to download here, but does not have all the functionality of the OAC / OAAC version.)
For further instructions on configuring Data Sync, see this article. If a previous version of Data Sync is being upgraded, use the documentation on OTN.
1. Within the Project / Source Data, create a new 'Manual Entry' for data, and in the 'Source' select your OTBI connection:
2. In the next screen, at the bottom, select that you want to pull in data based on 'Day Based Partitioning Read from Subject Area Table'
3. Give the Extract a logical name, and either create a new data target table (as in this example), or select an existing target table:
4. Following from the example given in the previous article, enter the Subject Area 'table'; the 'Periodicity Column' - the date column that will be used to automatically partition the data being retrieved; and the 'Partition Read' size - the number of days that will make up each chunk.
In this example the Subject Area Table to be used is "Marketing - CRM Leads"."Contact" (meaning the 'Contact' table from the 'Marketing - CRM Leads' Subject Area).
The date column used will be "Marketing - CRM Leads"."Contact"."Creation Date" - so the 'Creation Date' field in the 'Contact' table.
We entered 30 days for the read partition, so each extract will break the returned data into 30 day chunks, using the 'Creation Date' field.
The other two fields, 'READ_TYPE' and 'VERSION', are auto-populated, so no further editing is required.
5. Hit 'OK'. As long as the syntax is correct, you will receive a notification similar to the following; if you receive an error, check the syntax and try again.
6. If necessary, go to the target table that was just created, and edit the data type and length, then run the job. We will look at the results after the next section.
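The exact logical SQL that Data Sync generates for each chunk is not shown in the UI. As a purely hypothetical illustration of what a per-chunk date filter on the 'Periodicity Column' from the example above might look like, a filter could be built along these lines (the predicate syntax here is an assumption, not Data Sync's actual output):

```python
def chunk_filter(column, start, end):
    """Build an illustrative WHERE clause restricting a date
    column to one partition's range (dates as YYYY-MM-DD)."""
    return (f"WHERE {column} >= date '{start}' "
            f"AND {column} <= date '{end}'")

# Periodicity Column from the example above
col = '"Marketing - CRM Leads"."Contact"."Creation Date"'
print(chunk_filter(col, "2017-10-01", "2017-10-30"))
```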
1. As in the previous example, create a new Manual Entry data source, and select the OTBI connection:
2. This time, select the 'Day Based Partitioning Read from SQL' option:
3. Once again, give an appropriate logical name for the extract, and either create a new target table, or select an existing one:
4. In the editing window, first enter the Initial SQL used for the OTBI query. This previous article explained how to easily get this SQL from the OTBI server.
In this example the SQL used is simply selecting 6 fields from the Contact table.
SELECT
"Marketing - CRM Leads"."Contact"."City" s_1,
"Marketing - CRM Leads"."Contact"."Contact Row ID" s_2,
"Marketing - CRM Leads"."Contact"."Country" s_3,
"Marketing - CRM Leads"."Contact"."Creation Date" s_4,
"Marketing - CRM Leads"."Contact"."First Name" s_5,
"Marketing - CRM Leads"."Contact"."Last Name" s_6
FROM "Marketing - CRM Leads"
5. Enter the SQL query that can be used to calculate the minimum date to be used in the partition:
In this example, the SQL used is:
select min("Marketing - CRM Leads"."Contact"."Creation Date") from "Marketing - CRM Leads"
6. Enter the SQL query that can be used to calculate the maximum date to be used in the partition:
In this example, the SQL used is:
select max("Marketing - CRM Leads"."Contact"."Creation Date") from "Marketing - CRM Leads"
7. Finally, enter the 'Partition Read' value - the number of days to be used in each date partition 'chunk'. In this case 10 days was used.
Note - you also have the ability to use Incremental SQL statements - see the previous article for more information on using that functionality.
8. Once complete, hit 'OK'. As before, if you receive errors, double-check the syntax. Once accepted, change the data types and field lengths of the target table as needed, then run the job.
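Putting the three inputs together - the initial SQL, the min/max date queries, and the 'Partition Read' size - the per-chunk queries can be sketched as follows. This is an assumption-laden illustration, not Data Sync's internal code: the min and max dates are hard-coded here rather than fetched from OTBI, and the appended WHERE-clause syntax is hypothetical.

```python
from datetime import date, timedelta

def chunked_queries(initial_sql, date_column, min_date, max_date, chunk_days):
    """Append a date-range predicate to the initial SQL for each
    partition, newest chunk first."""
    end = max_date
    while end >= min_date:
        start = max(end - timedelta(days=chunk_days - 1), min_date)
        yield (f"{initial_sql} WHERE {date_column} >= date '{start}' "
               f"AND {date_column} <= date '{end}'")
        end = start - timedelta(days=1)

# Stand-ins for the initial SQL (step 4) and the periodicity column
sql = 'SELECT ... FROM "Marketing - CRM Leads"'
col = '"Marketing - CRM Leads"."Contact"."Creation Date"'

# 10-day chunks over one month, matching the example's partition size
queries = list(chunked_queries(sql, col, date(2017, 10, 1), date(2017, 10, 31), 10))
print(queries[0])  # newest chunk first: 2017-10-22 .. 2017-10-31
```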
If you look at the output from the Jobs panel, you will see the normal high level summary of the extract job. In this case, both the Subject Area example and the SQL example succeeded. There is nothing here to tell you that date based partitioning was used.
To get a better understanding, you need to first set the log level ('View' / 'System Properties' / 'Server Log Level') temporarily to 'FINEST' level, then re-run the job. Once complete, examine the log files for the job.
The log below is from the SQL query, where we used a 10 day 'chunk' size. You will notice that the first query ran from 2017-10-21 to 2017-10-31, the second from 2017-10-11 to 2017-10-21, and so on. (Because this date field includes a time-stamp, the time-stamp was used as part of each 10 day chunk as well, for more granularity.)
This article walked through the steps to set up Date Based Partitioning within Data Sync for an OTBI source. This method can be used to reduce the data extracted by each query call to the OTBI server, and therefore reduce the load on that server and database.
For other A-Team articles about BICS, OAC and Data Sync, click here.