Using an External Database for your XML Schema, Part I: Understanding the ODI XML JDBC Driver

September 20, 2013 | 6 minute read
Christophe Dupupet
Sr Director, A-Team - Cloud Solutions Architects
Text Size 100%:

Introduction

Whether you are processing XML files or XML messages on JMS, ODI provides a JDBC driver that will translate the XML structure into a relational schema. By default, this schema is created in an in-memory database. While this approach will work extremely well for small XML files or messages, it can become challenging for larger volumes. In this two part article we are first reviewing how the driver works, then we are detailing the benefits of using an actual database instead of the default in-memory storage for the creation of the relational schema.

Using an External Database for your XML Schema, Part I: Understanding the ODI XML JDBC Driver

It is important to understand how the driver works before we look into the specifics of particular configurations. This is our immediate focus.

How the driver works

If you want to know everything there is to know about the XML JDBC driver, the Oracle® Fusion Middleware Connectivity and Knowledge Modules Guide for Oracle Data Integrator has a very extensive appendix that lists all commands, parameters and options supported by the driver: http://docs.oracle.com/cd/E28280_01/integrate.1111/e12644/appendix_xml_driver.htm#CHDICHDB

To allow us to cover both JMS XML messages and XML files, we will talk generically about XML structures when the explanation applies to both files and messages. In the case where the behavior differs, we will explicitly mention XML files and JMS XML messages. Note that when we define the JDBC URL to connect to a JMS XML message, we can directly use the parameters of the XML JDBC driver in the URL: ODI will know which parameters to use specifically for JMS or for XML.

For now, all we need is to understand the mechanisms behind the driver. If you read the ODI documentation, you will see that elements get converted to tables and attributes get converted to columns. By default, the database schema used to host these tables resides in the memory space of the agent processing the XML structure.

In addition, the driver will automatically add new columns to handle the relationship between elements: primary keys and foreign keys are added to retain the XML hierarchy in the form of a parent-child relationship. For instance, consider the following basic structure in XML:

<GEOGRAPHY_DIM> <country COUNTRY_ID="6" COUNTRY_NAME="Autralia"> <region REGION_ID="72" REGION_NAME="Queensland"> <city CITY_ID="63" CITY_NAME="Brisbane" POPULATION="505179"> </city> </region> </country> </GEOGRAPHY_DIM>

This will be translated to 4 tables: GEOGRAPHY_DIM, COUNTRY, REGION, CITY.

Note: we should actually look at the DTD or XSD for an accurate definition of the XML structures. The file is provided here for illustration purposes only.

The COUNTRY table derived from the original XML file country elements has two columns: COUNTRY_ID, COUNTRY_NAME

The JDBC driver adds columns to link the countries to their parent element (GEOGRAPHY_DIM) and to allow for the regions to be associated with the proper country.  The new columns that are created automatically by the driver are: GEOGRAPHY_DIMFK, COUNTRYPK

The driver has also added “orders” columns to keep track of the relative position of the elements in the original file, should that information be needed later on.

XML Schema in database

Figure 1: relational representation of an XML file.

Loading the data

The default behavior is that when you first connect to the data, the driver will immediately load the data in the database schema. If you are using the agent’s out-of-the-box configuration, this means that the data is loaded in the in-memory database. Subsequent accesses to the data read directly from the database schema and do not need to connect back to the original XML structure. The connection to XML is made when:

  • You test an XML connection from Topology

  • You reverse-engineer an XML model

  • You view the data from the Studio

  • A task needs to connect to XML (when executing a select statement for instance)

One direct benefit is that generic Knowledge Modules can be used to read data from XML files (LKM SQL to Oracle for instance): all select statements are executed against the database schema where the driver has loaded the data. There is nothing specific to the driver itself anymore.

There are several ways you can control and alter the default behavior. The first one is by adding parameters to the XML JDBC URL used in the definition of the server in Topology. The second is by issuing specific commands to the XML JDBC driver.

XML JDBC Driver Properties

All properties for the XML JDBC driver are available in the Oracle® Fusion Middleware Connectivity and Knowledge Modules Guide for Oracle Data Integrator available here: http://docs.oracle.com/cd/E28280_01/integrate.1111/e12644/appendix_xml_driver.htm#CHDECBHH. In particular, the properties “load_data_on_connect” and “drop_on_disc” can impact when the data is loaded and removed from the database schema. The following table is an extract from the documentation:

Property Mandatory? Values Default Behavior
load_data_on_connect or ldoc No boolean (true | false) true Load automatically the data in the schema when performing the JDBC connection. If set to false, a SYNCHRONIZE statement is required after the connection to load the data.This option is useful to test the connection or browse metadata without loading all the data.
drop_on_disc or dod No boolean (true | false) false Drop automatically the schema when closing the JDBC connection.If true, the schema is stored in the built-in engine, it is always dropped.If true and the data is on an external database, only the current reference to the schema in memory will be dropped, but the tables will remain in the external database. This means that if you try to connect to this schema again, it will reuse the tables in the external database rather than starting from scratch (as it would when the data is loaded in memory).

 

 

XML driver commands

All commands for the XML JDBC driver are available in the Oracle® Fusion Middleware Connectivity and Knowledge Modules Guide for Oracle Data Integrator available here: http://docs.oracle.com/cd/E28280_01/integrate.1111/e12644/appendix_xml_driver.htm#CHDFDFJF

As long as the technology in your Knowledge Modules or ODI procedures is set to “XML” you can use these commands directly. In particular, the SYNCHRONIZE command will allow you to update data in the database from the file, or to overwrite the file with the content of the tables in the schema. For instance, to load all data from the file and overwrite the content of the database schema you would use the command:

SYNCHRONIZE [SCHEMA <schema_name>] FROM FILE

Conversely, to overwrite the file with the content of the database schema you would use the command:

SYNCHRONIZE [SCHEMA <schema_name>] FROM DATABASE

 

The consequences for data movement

There are many benefits to the processing of XML files in an in-memory engine: separate agents can connect to the same JMS queue and work in parallel to process more messages; loading small messages is extremely efficient; no additional footprint is required for the transformation of the canonical data into a relational format.

Figure 2 below illustrates the data movement when using the memory engine with the XML JDBC driver.

in-memory

Figure 2: data movement with the in-memory engine

When large volumes of data have to be processed for each message, it can be more efficient to use an actual database for the tables created by the XML JDBC driver. This will be true for three main reasons:

  • Memory limitations: if all the processing of the XML file happens in memory, all of a sudden large volumes of data are clogging the memory space of the agent and can potentially require memory swaps that are costly in terms of performance

  • Insert-select: when the data is in an in-memory database, you still have to move the data to the target database. If you can stage the data directly in the target server, then you will be able to leverage set-based operations – one of the main reasons for why ELT is so much more efficient than ETL

  • Staging tables: if you ever have to perform heterogeneous join operations (i.e. joining with data that resides outside of the XML structure) ODI will create staging table in the target server. This amounts to staging twice: once in memory, once in the target server… If the tables are created directly in the target database, there is no need for additional staging.

Figure 3 illustrates the data movement when using an external database to store XML data for the XML JDBC driver:

External Database Schema

Figure 3: data movement with external database

Part II of this blog focuses on how to best take advantage of this external storage.

For more ODI best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit Oracle A-Team Chronicles for ODI.

Christophe Dupupet

Sr Director, A-Team - Cloud Solutions Architects


Previous Post

Part 1: How To Load Test OAM11g using Apache JMeter

Tim Melander | 16 min read

Next Post


Using an External Database for your XML Schema, Part II: Optimizing the Use of the External Schema

Christophe Dupupet | 8 min read