* This blog was last tested on Oracle Analytics Cloud Data Sync (2.6.1) by Jay Pearson (Oracle A-Team) *
The Data Sync Tool synchronizes data from a variety of sources to a target. It can source from both relational databases and flat files, and can be used for both one-off and incremental data loads.
This article will cover installation and configuration steps, then walk through the creation of data loads, including load strategies, scheduling, and the monitoring of jobs.
This article can be read in its entirety, or used as a reference guide for a specific function. The following topics will be covered, and the links below can be used to jump directly to a specific topic.
Be sure to download and install the latest version of the Data Sync Tool from Oracle Analytics Downloads.
Further information on the Data Sync Tool, including upgrade instructions, the OAC Data Sync readme (Data Sync release notes and known issues), and the OAC Data Sync documentation (Data Sync install and user guide), can be found here.
It is a Java application and requires a Java 1.8.x JDK and JDBC drivers. A Java Runtime Environment (JRE) alone is not sufficient.
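A quick way to confirm that a full JDK, rather than just a JRE, is available is to check for the javac compiler, which ships only with the JDK. This is a minimal sketch and is not part of the tool itself:

```shell
# A JDK ships the javac compiler; a JRE does not.
if command -v javac >/dev/null 2>&1; then
  JDK_STATUS="present"
else
  JDK_STATUS="missing"
fi
echo "JDK: $JDK_STATUS"
```

If this reports "missing", install a 1.8.x JDK before continuing.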
Once downloaded and unzipped, locate the 'config.bat' or 'config.sh' depending on the operating system. Open this in a text editor, and set the path for the Java home for the machine where the application will be run.
set JAVA_HOME="C:\Program Files\Java\jdk1.8.0_271"
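On Linux, the equivalent edit in 'config.sh' uses export syntax. The JDK path below is illustrative; substitute the actual install location on the machine:

```shell
# Illustrative Linux JDK path; substitute your own jdk1.8.x location.
export JAVA_HOME=/usr/java/jdk1.8.0_271
export PATH="$JAVA_HOME/bin:$PATH"
```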
Additionally, follow instructions in: BICS|OAC: Unable to Start Data Sync using JDK 1.8.0_181+ (Doc ID 2439197.1) to copy and rename the db-derby-<release>-bin directory to JAVA_HOME.
The Data Sync tool uses JDBC to connect to source databases. If you intend to source from a relational database, make sure the JDBC drivers match the version of that database. The tool ships with an Oracle JDBC driver. If the source database is different, obtain the correct JDBC drivers for that database version, copy them into the tool's \lib directory, and overwrite the versions that are there. The table below shows the driver files that need to be obtained and copied for each database vendor.
|Database Vendor|JDBC Driver Name|
|Microsoft SQL Server|sqljdbc.jar|
|TimesTen|ttjdbc6.jar, orai18n.jar, timestenjmsxla.jar, jms.jar, javax.jms.jar|
|Teradata|terajdbc4.jar, log4j.jar, teradata.jar, tdgssjava.jar, tdgssconfig.jar|
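As a sketch, the copy step from a shell might look like this. DS_HOME and the jar location are assumptions; substitute the actual paths on your machine:

```shell
# Assumed locations: point DS_HOME at the unzipped Data Sync directory
# and JAR at the vendor driver downloaded for your database version.
DS_HOME="${DS_HOME:-$HOME/BICSDataSync}"
JAR="${JAR:-/path/to/sqljdbc.jar}"
mkdir -p "$DS_HOME/lib"
if [ -f "$JAR" ]; then
  cp "$JAR" "$DS_HOME/lib/"   # overwrites any older copy of the driver
fi
```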
To start the Data Sync tool, run the 'datasync.bat' command on Windows, or 'datasync.sh' on Linux. These scripts can be found in the root of the unzipped directory.
The first time the tool is run a configuration wizard will take you through the setup. In subsequent sessions of the application the tool will recognize that the wizard has been run before and open the existing configuration.
A System Properties panel can be opened from under the 'Views' menu.
Within the properties are a number of options, including the default folder where flat files will be searched for, and the number of log files that will be kept before they are purged.
An explanation for each option is available in the 'Comments' column.
The Data Sync tool can be set up to send e-mail notifications about jobs. This is done under the Tools / E-Mail menu.
1. Select the 'Recipients' sub-menu and add one or more e-mail addresses.
2. Under the Setup menu, add the details of the e-mail server and connection details.
3. Hit the 'Send Test Email' button to confirm the settings. The test e-mail will be sent to the recipients defined in the first step.
The tool has 3 main sections.
Connections - this is where the source and target connections are defined.
Project - a workspace where the source files and targets are defined, as well as their attributes. This is where the load strategy for each table is also defined.
Jobs - a unit of work that can be used to upload the data from one or more sources defined in the project. A project can have one or more jobs. This is also where the status of data loads can be found and examined.
Each section can be reached by selecting the corresponding button in the main menu bar:
There are two default connections: 'TARGET', which is used for non-file sources, and 'File Source', for file sources.
Do not delete or rename either entry; they are referenced internally by the tool.
The 'TARGET' connection must be edited to point to the OAC Cloud instance. Under Connections, select 'TARGET' and edit the details. The Connection Type should be left as 'Oracle (OAC)'. A username and password for a user with the 'Data Loader' application role should be entered. The URL will be the same as the one used to reach the OAC analytics portal, but without the "/analytics" suffix. For instance:
If this URL is not known, it can be located on the Oracle Business Intelligence Cloud Service page within My Services, as shown below. Copy that URL, and remove the "/analytics" suffix.
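As an illustration, the suffix can be stripped with shell parameter expansion. The hostname below is hypothetical:

```shell
# Hypothetical service URL copied from My Services.
FULL_URL="https://myservice-mytenant.analytics.ocp.oraclecloud.com/analytics"
# The TARGET connection wants the URL without the /analytics suffix.
TARGET_URL="${FULL_URL%/analytics}"
echo "$TARGET_URL"
```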
Select 'Test Connection' to confirm connectivity to the OAC target.
To add a new RDBMS as a source, select 'New', then name it. Select the correct connection type for the database vendor, and enter the connection details. This is an example of an Oracle DB source:
Test the connection to confirm the settings.
A DBaaS Instance can be used as the target for the Data Sync tool. This can be done in 2 ways.
1. The Connection can be changed in the BI Cloud Service Console, by selecting the option to 'Use a different cloud database' for the connection, as shown below. If this is done, then the standard 'TARGET' connection outlined above will automatically load all data into the DBaaS Instance.
2. In some circumstances it may be preferable to keep the OAC connection unchanged, but to have the Data Sync tool upload selected data to the DBaaS Instance. To do that, create a new 'Source / Target' under the Connections section. The connection should be an 'Oracle (Thin)' type, using the Service Name, Host, Port and an appropriate user with rights to create and load data into the DBaaS Instance, as shown below.
Once the source and target connections have been set up, the next step is to define the actual tables, SQL, or files to be used as the sources, and the table(s) that will be loaded. This is done under the 'Project' section.
To source from a table in the on-premise RDBMS created in the previous section, select the 'Data From Table' option under the Project / Relational Data section.
Firstly select the Data Source where the table is located (1), then 'Search Tables' to bring up a list of the tables available (2). Next select the Table or Tables to be imported by checking the 'Import' checkbox (3), and finally the 'Import Tables' button (4).
Instead of using a table, it may be preferable to use a SQL statement against the source database. It could be that the data is coming from multiple tables, or that the data in the source tables is at a more detailed level so aggregate functions need to be used to reduce the data set and increase performance.
Select the 'Data from SQL' option. The query requires a logical name, and the results can be persisted either in a new table or in an existing one. If 'Create a table' is selected, the column definitions are inferred from the SQL structure. If an existing table is selected, any new columns from the SQL will be added to its list of columns.
A similar process is used for setting up a file source. Select the 'New' button under the Project / File Data section.
Select the details of the source file location, and then choose the Target Table option. Again, there is an option to load an existing table, or to create a new one. Select the delimiter for the file, the timestamp format, and the number of lines the tool will sample to estimate field sizes and types. If the first row contains headers, select that here. If not - the column names will need to be entered in the following screen. If the file contains duplicates, there is an option to remove these if required.
The next screen provides the options to choose the fields to be imported from the flat file, as well as some basic Data Transformations for changing the case of a field, converting it to a number, or trimming spaces. The tool makes its best 'guess' at each data-type based on sampling up to the first 10,000 rows of the file, but the user should go through and make sure each has been classified correctly. Special characters will be removed from the column headers, which are converted to upper case. If a column name is longer than 30 characters, it will be truncated.
If the 'Remove Duplicates' option was selected in the previous screen, then the 'Update Rows Match' column would be used to select what makes the records unique.
If everything is set correctly and there are no data errors, a 'Success' notification will be displayed. If there is a problem, the error message will provide details of the problem.
When errors are encountered, a 'bad' file is created in the log directory with the convention of <logical name>.bad.
The bad file identifies the line number of each bad record, the record itself, and the problems encountered parsing it.
Once the bad records have been identified, these must be fixed in the original file and the process re-run. The file will only be loaded once there are no errors.
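A quick way to review a bad file from a shell might look like the following. The logical load name COUNTRIES and the log location are illustrative:

```shell
# Illustrative paths: the log directory sits under the Data Sync install,
# and COUNTRIES is a hypothetical logical load name.
LOG_DIR="${LOG_DIR:-$HOME/BICSDataSync/log}"
BAD_FILE="$LOG_DIR/COUNTRIES.bad"
if [ -f "$BAD_FILE" ]; then
  echo "Rejected records: $(wc -l < "$BAD_FILE")"
  head "$BAD_FILE"   # line number, record, and parse error for each failure
fi
```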
There are multiple load strategies that can be used with the Data Sync Tool.
For Data from Table, or Data from SQL the load strategy can be found and changed within the 'Edit' properties of the Connection under the Project Section of the tool:
For a File source it can be found under the 'File Targets' tab by selecting the 'Load Strategy' option:
The load strategies available are:
Replace Data in Table: this will truncate the target table each run and then reload the table. If the target table has indexes, these will be dropped prior to the load and are then recreated afterwards.
Append Data in Table: data will be appended to the target table. No checks will be done to see if the data already exists in the target.
Update Table: the first time this runs, the target table (if it exists) will be truncated, with indexes removed before the load and recreated afterwards. There are 2 options, which can be used independently or together.
1. Add new records - the target will be checked to see if the key exists - if it does not, then the row will be added
2. Update Existing Records - the target will be checked to see if the key exists and an update date in the source system is more recent than the date the job was last run. If that is the case, the record in the target will be updated.
For the Update Table strategy the tool requires 2 fields: a unique key and an update date.
In this example, the COUNTRY_ID is the unique key, and the M_LAST_UPDATE_DTTM is the field that the source system uses to specify when that record was last updated.
There is also an option to 'Never Delete Data' - this prevents the 'Truncate' and removal of indexes from running for the 'Replace Data in Table' option, and also the first run of the 'Update table' load strategies.
Check the 'Target Tables' section. This will have a row for each file or RDBMS table source. Select a table; the options below it allow the table name, columns, properties, and indexes to be edited if necessary.
If the target tables were created by another process - perhaps from a manual load through the File Data Loader, or through the Apex tool - the Data Sync tool may hit issues when it tries to truncate the table or drop and recreate indexes, or when existing constraints on the table do not match the constraints set up in the data loader tool.
To fix this, have the tool drop and recreate the target table itself first. To do this, under the 'Target Tables' section, right click on the table name, and then select the 'Drop/Create/Alter Tables' option:
Several options will be presented. Select 'Create New' and also 'Drop Tables'. This will drop the table created previously by the other process and recreate it with the tool.
If the data set is large, performance can be improved by purging data that has reached a certain age. This age is calculated based on the date used in the filter for the Incremental Update. In this example any rows updated in the source table more than 30 days ago will be deleted.
Once the source tables and files, and the load strategies have been defined, the job can be run by selecting the 'Run Job' button in the toolbar.
Jobs can also be scheduled to run from the 'Job Schedules' sub tab.
Within the properties of the job the number of retries can be set up. By default this is set to null, or 0.
To monitor jobs, select the 'Jobs' tab.
While the job is running, it can be viewed on the 'Current Jobs' tab. The 'Auto Refresh' can be set to automatically refresh the status.
Once a job has finished - either successfully, or with an error - it is moved to the 'History' tab.
The record at the top shows the summary of the job. The tasks tab below lists one line item per data flow as defined by the user, and the details tab elaborates on those line items. In the example below, a job is highlighted in the top section, and the tasks for that job are listed in the bottom section with details of the rows successfully loaded.
Logs are also saved to the file system under the 'log' sub-directory of the OACDataSync folder.
If a task fails, it may be necessary to re-run the whole job. Before running it again, right click on the failed Job name, and select 'Mark As Completed'.
Should it be necessary to do a full data load, this can be achieved in 2 ways.
To reload all tables, go to 'Tools / Re-extract and load all data'. A confirmation box will open.
To reload a single table, go to Connections, then the 'Refresh Dates' tab. Select the table to be refreshed, and select the 'Re-Extract Data' option.
A further option box will be presented, where 'all data' can be reloaded, or all 'data from a specified date'. Make the appropriate choice and then select 'ok'. The subsequent run will extract all the data from the source table, and reload the table.
This article walked through the steps to download, configure, and set up the Data Sync Tool for Oracle Analytics Cloud (OAC), and to create and monitor data loads with it.