Importing Data from SQL databases into Hadoop with Sqoop and Oracle Data Integrator (ODI)

Introduction

 

This article illustrates how to import data from relational databases into the Hadoop Distributed File System (HDFS) with Sqoop and Oracle Data Integrator (ODI).  It features two ODI knowledge modules designed to work with Sqoop: IKM SQL to HDFS File (Sqoop) and IKM SQL to HDFS Hive (Sqoop).

If your source data store is a relational database such as Oracle, Teradata, and MySQL, the Sqoop tool provides a fast and efficient way to extract and load data into your Hadoop Distributed File System.

If you would like to download a copy of these knowledge modules, go to “Oracle Data Integrator Knowledge Modules – Downloads.” Search for “sqoop”.

The examples discussed in this article have been developed with ODI 11.1.1 and ODI 12.1.3.  A copy of these ODI repositories with examples can be found at “ODI Repository Sample – SQL to HDFS Using Sqoop.”

 

Importing Data from SQL databases into Hadoop with Sqoop and Oracle Data Integrator (ODI)

 

Sqoop is a tool designed to transfer data between relational databases and Hadoop clusters.

Using Sqoop, data can be imported from a relational database into a Hadoop Distributed File System (HDFS).  In Hadoop, data can be transformed with Hadoop MapReduce – a software framework capable of processing and transforming enormous amounts of data in parallel.  Once the data is transformed in the Hadoop cluster, it can then be exported with Sqoop into another relational database.
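For context, a minimal pair of Sqoop commands for this round trip might look like the following.  The JDBC URL, credentials, table names, and HDFS paths are placeholders for illustration only; they are not values used later in this article.

    # Import a relational table into an HDFS directory
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
      --username movie_user -P \
      --table MOVIE \
      --target-dir /user/odi/movie_files/MOVIE

    # Export transformed results from HDFS back into another relational table
    sqoop export \
      --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
      --username movie_user -P \
      --table MOVIE_SUMMARY \
      --export-dir /user/odi/movie_results/MOVIE_SUMMARY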

This article illustrates how to import data from a relational database into a Hadoop cluster using Sqoop and Oracle Data Integrator (ODI).

Sqoop supports three types of import operations:  import into HDFS files, import into Hive tables, and import into HBase tables.

This article focuses on how to import data from relational databases into HDFS files and Hive tables.

 

ODI 12.1.3 Application Adapters for Hadoop


ODI 12.1.3 offers a new set of knowledge modules called the Oracle Data Integrator Application Adapters for Hadoop.  These application adapters include the following main capabilities:

 

  • Loading data from files and SQL databases into HDFS, HBase, and Hive tables using Sqoop.

  • Performing additional validation and transformation of data within the Hadoop cluster using Apache Hive and the Hive Query Language (HiveQL).

  • Loading processed data from Hadoop into SQL databases or HBase using Sqoop.

Figure 1 illustrates the ODI 12.1.3 Application Adapters for Hadoop.  It is recommended that you become familiar with this new set of knowledge modules; they offer additional capabilities and features that are beyond the scope of this article.

 


Figure 1: Application Adapters for Hadoop

 

To learn more about the ODI 12.1.3 Application Adapters for Hadoop, go to “Application Adapters Guide for Oracle Data Integrator 12.1.3”.

 

About the Knowledge Modules in this Article

 

The knowledge modules discussed in this article have been developed to help ODI users get started using the Sqoop tool with Oracle Data Integrator.

These knowledge modules offer basic capabilities and options with the Sqoop tool.  They are a great resource for learning how to integrate data from relational databases into Hadoop using Sqoop and Oracle Data Integrator.

Two ODI knowledge modules are featured in this article:

  • IKM SQL to HDFS File (Sqoop) – imports data from a relational database into an HDFS directory.

  • IKM SQL to HDFS Hive (Sqoop) – imports data from a relational database into a Hive table.

Figure 2 shows these two knowledge modules in an ODI project called Movie.   Both knowledge modules are of type Integration.

The Movie project contains ODI objects – mappings, packages, scenarios, and variables – to import data from an Oracle database into an HDFS cluster.  The Sqoop knowledge modules are used to perform this import operation.  The Oracle database contains data about movies such as movie titles, movie genres, and movie casts.

This ODI project is discussed throughout this article, and it is used to demonstrate how to import data from relational databases into Hadoop with Sqoop and Oracle Data Integrator.

 


Figure 2: Sqoop Knowledge Modules

 

These two knowledge modules are compatible with Oracle Data Integrator 11g and Oracle Data Integrator 12c.  Both knowledge modules were developed using Sqoop version 1.4.3-cdh4.6.0.

For additional documentation on Sqoop commands and features, go to “Sqoop User Guide”.

 

IKM SQL to HDFS File (Sqoop)

 

The IKM SQL to HDFS File (Sqoop) is designed to import data from a relational database into an HDFS directory of a Hadoop cluster.

This knowledge module (KM) is recommended for users who want to import raw data into an HDFS directory.  The raw data can then be used in MapReduce programs or as a source to create Hive tables for further analysis.

Also, this knowledge module is recommended for users who want to import data into an HDFS directory where the HDFS files are the source of a Hive external table (a sketch of this pattern appears after the list below).

Importing raw data into an HDFS directory offers the following benefits:

  • Raw data files can be used by both MapReduce programs and Hive external tables.

  • If the Hive external table is deleted, the raw data is retained, and it can still be used by other Hive tables or MapReduce programs.
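As a minimal sketch of this pattern, the raw files landed by Sqoop can be exposed to Hive as an external table.  The connection details, directory, table, and column names below are hypothetical:

    # Land raw rows in an HDFS directory with Sqoop
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
      --username movie_user -P \
      --table MOVIE \
      --target-dir /user/odi/movie_files/MOVIE \
      --fields-terminated-by ','

    # Expose the same files as a Hive external table; dropping the table
    # later leaves the raw HDFS files in place
    hive -e "CREATE EXTERNAL TABLE movie_ext (movie_id INT, title STRING, movie_year INT)
             ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
             LOCATION '/user/odi/movie_files/MOVIE';"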

If users would like to import data from a relational database directly into a Hive table, the IKM SQL to HDFS Hive (Sqoop) should be used instead.

 

IKM SQL to HDFS Hive (Sqoop)

 

The main function of the Sqoop tool is to upload data into files in the HDFS cluster.  However, Sqoop can also import data from a relational database into a Hive data warehouse.

The IKM SQL to HDFS Hive (Sqoop) is designed to import data from a relational database into a Hive table.
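As a rough sketch, the kind of Sqoop command behind such an import could look like the following.  The connection details and the moviedemo.movie table name are assumptions; the --create-hive-table argument would be omitted when loading into an existing Hive table:

    # Import a relational table directly into a Hive table
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
      --username movie_user -P \
      --table MOVIE \
      --hive-import \
      --hive-table moviedemo.movie \
      --create-hive-table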

Importing data into a Hive data warehouse offers the following benefits:

  • Large datasets can be manipulated with HiveQL, the Hive query language, which allows users familiar with SQL to manipulate and retrieve data in a structured format.

  • HiveQL supports custom scalar functions, aggregations, and table functions.

  • Hive can scale out by dynamically adding more hardware to the Hadoop cluster; thus, massive amounts of data can be stored in a Hive data warehouse.

  • Hive offers extensibility with the MapReduce framework.  MapReduce programmers can plug in their custom mappers and reducers with HiveQL to perform complex transformations that may not be supported by the built-in features of the HiveQL language.

Hive is an open source technology under the Apache Software Foundation.  To learn more about Hive, go to “The Apache Hive Project”.

 

Knowledge Modules Tasks

 

Figure 3 shows a list of tasks for both knowledge modules:  IKM SQL to HDFS File (Sqoop) and IKM SQL to HDFS Hive (Sqoop).  Both knowledge modules have the same number of tasks and task names, but the code generated by each task is different.  There are five knowledge module tasks:

  • Generate Sqoop script – This step creates a shell script, the Sqoop script, in the Unix/Linux file system. The shell script contains the actual Sqoop command with all the parameters needed to perform the Sqoop import operation. The output generated by the Sqoop import operation is redirected to a log file. (A rough sketch of the full five-task flow appears after Figure 3.)

  • Add execute to Sqoop script – This step changes the mode of the shell script to Execute (x), so it can be executed in the Unix/Linux environment. The file-mode change is applied for all users (a+x).

  • Execute Sqoop script – This step executes the Sqoop script and performs the actual Sqoop import operation. If the return code of the Sqoop import operation is not equal to 0, an error is raised and the step fails. If the return code is equal to 0, the import operation is successful.

  • Remove Sqoop script – This step removes the Sqoop script created by the knowledge module.

  • Remove Log file – This step removes the log file created by the knowledge module.

 


Figure 3: Sqoop Knowledge Modules Tasks
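The following is a hand-written sketch of what these five tasks amount to when run in the Unix/Linux environment.  The file names, paths, connection details, and exact command text generated by the knowledge modules are assumptions for illustration only:

    # Task 1: generate the Sqoop script in the work directory (for example, /tmp)
    echo '#!/bin/sh' > /tmp/sqoop_tv_shows.sh
    echo 'sqoop import --connect jdbc:oracle:thin:@//dbhost:1521/orcl --username movie_user --password "<password>" --table MOVIE --target-dir /user/odi/movie_files/TV_SHOWS' >> /tmp/sqoop_tv_shows.sh

    # Task 2: make the script executable for all users
    chmod a+x /tmp/sqoop_tv_shows.sh

    # Task 3: run the script, redirect its output to a log file,
    # and raise an error on a non-zero return code
    /tmp/sqoop_tv_shows.sh > /tmp/sqoop_tv_shows.log 2>&1
    if [ $? -ne 0 ]; then
      echo "Sqoop import failed; see /tmp/sqoop_tv_shows.log"
      exit 1
    fi

    # Tasks 4 and 5: remove the temporary script and log file
    rm -f /tmp/sqoop_tv_shows.sh /tmp/sqoop_tv_shows.log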

 

Knowledge Modules Options

 

Figures 4 and 5 illustrate a list of knowledge module options for the IKM SQL to HDFS File (Sqoop) and the IKM SQL to HDFS Hive (Sqoop), respectively.  By default, the knowledge module options are configured as follows:

 

  • Both knowledge modules perform the Sqoop import operation in APPEND mode; existing data in the HDFS directory is preserved.

  • Both knowledge modules perform the import operation with one Sqoop mapper (parallel process).  This value can be increased, but a splitting column is then required.

  • Temporary objects will be deleted once the Sqoop operation is completed successfully.

  • The Hive knowledge module, IKM SQL to HDFS Hive (Sqoop), includes additional options to control the target Hive table.  By default, the target Hive table will not be created, and its structure and content will not be replaced.

 


Figure 4: IKM SQL to HDFS File (Sqoop) – Options

 


Figure 5: IKM SQL to HDFS Hive (Sqoop) – Options

 

  • Table 1 below shows a list of options for both knowledge modules; each option is fully documented in this table.

  • Most of these options accept ODI variables, thus increasing the power and flexibility of the knowledge modules.  Examples of how to use ODI variables with these options are illustrated in the following sections of this article.

  • Use the Knowledge Module column in Table 1 to determine which knowledge module uses each option.

 


Table 1: Knowledge Modules Options

 

 

Benefits of Using Sqoop with Oracle Data Integrator (ODI)

 

Using Sqoop with Oracle Data Integrator offers great benefits.  Sqoop imports can be designed with both ODI 11g and ODI 12c.

The following sections outline some of the benefits of using ODI with Sqoop.  Most of the discussion in the following sections focuses on the benefits of using the new features of ODI 12c.

 

Using ODI 12c Flow-Based Editor to Design Sqoop Imports

The new flow-based editor in ODI 12c offers the following benefits when designing Sqoop imports:

  • ODI 12c components such as filters, joins, and splitters can be used when designing Sqoop imports with ODI mappings.

  • ODI 12c reusable mappings can be used when designing Sqoop imports that make use of the same set of relational tables. Reusable mappings reduce the effort it takes to design a mapping because reusable units of work can be incorporated into other mappings.

  • ODI 12c multi-target datastores can be designed with Sqoop imports. Multiple targets such as HDFS directories and Hive tables can all be designed in the same ODI mapping, increasing efficiency and enabling parallelism when loading data with the Sqoop tool.

  • Figure 6 shows an ODI 12c mapping, SQL to HDFS File (Sqoop), with two instances of the same reusable mapping, MY_TV_SHOWS and MY_DRAMAS, used to import data from a set of relational tables into two separate target HDFS directories: TV_SHOWS and DRAMAS.

 


Figure 6: ODI 12c Flow-Based Editor with Sqoop

 

  • In this example, a filter component is used to filter data by movie genre.  Figure 7 shows the filter condition for the target HDFS directory called TV_SHOWS.

 


Figure 7: ODI 12c Filter Component

 

  • Figure 8 shows the ODI reusable mapping, MY_MOVIES, used by the ODI 12c mapping, SQL to HDFS File (Sqoop).  This reusable mapping contains the source relational tables used by Sqoop to load data into the HDFS directories.

  • This reusable mapping uses other ODI components such as datasets, joins, filters, and distinct-sets.  The filter component filters data by movie year.

 


Figure 8: ODI 12c Reusable Mapping for Sqoop



Creating ODI 12c Deployment Specifications for Sqoop Imports

 

Oracle Data Integrator 12c offers the capability of designing multiple physical implementations of the same ODI mapping.  This is known as ODI Deployment Specifications.  For additional information on how to create deployment specifications with ODI 12c, go to “Creating Deployment Specifications with Oracle Data Integrator 12c”.

 

  • Figure 9, section A, illustrates an ODI mapping with two deployment specifications: the Initial Import and the Incremental Import.

  • These two deployment specifications have been configured to use the IKM SQL to HDFS File (Sqoop) in both target datastores: TV_SHOWS and DRAMAS. Section B highlights these two target datastores.

 


Figure 9: ODI 12c Deployment Specifications with Sqoop

 

  • Figure 10 shows the Initial Import deployment specification for the target datastore called TV_SHOWS. This deployment specification imports data in OVERWRITE mode; thus, the existing data in the target HDFS directory is purged prior to adding the new dataset (a hedged command sketch follows Figure 10).

  • Two Sqoop mappers (parallel processes) are used with this deployment specification, since a significant amount of data is imported during the initial load.

  • A splitting column called MOVIE_YEAR is used to split the workload among the Sqoop mappers. The splitting column comes from the ODI distinct component called OUT_TV_SHOWS. Figure 9 above shows this component. OUT_TV_SHOWS is the same distinct component called MOVIES in the reusable mapping.

  • An ODI variable called VAR_MOVIE_YEAR is used as the suffix name for the temporary object names and the HDFS directory name.


Figure 10: Deployment Specification for Sqoop Initial Imports
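A hedged sketch of the Sqoop command that such an initial-import specification could generate is shown below.  The free-form query, the connection details, the paths, and the explicit purge step are illustrative assumptions about how OVERWRITE mode might be implemented; they are not taken from the knowledge module itself:

    # OVERWRITE mode: purge any existing data in the target HDFS directory first
    hadoop fs -rm -r -skipTrash /user/odi/movie_files/TV_SHOWS_1997

    # Initial load with two mappers, splitting the workload on MOVIE_YEAR
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
      --username movie_user --password '<password>' \
      --query "SELECT m.* FROM MOVIE m WHERE m.GENRE = 'TV SHOW' AND \$CONDITIONS" \
      --split-by MOVIE_YEAR \
      --num-mappers 2 \
      --target-dir /user/odi/movie_files/TV_SHOWS_1997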

 

  • Figure 11 shows the Incremental Import deployment specification for the same target datastore, TV_SHOWS. In this example, the import operation is performed in APPEND mode; thus, the existing data in the target HDFS directory is preserved (a matching command sketch follows Figure 11).

  • Since small datasets are imported during incremental loads, one Sqoop mapper is used with this deployment specification. As a result, a splitting column is not required.

  • The ODI variable called VAR_MOVIE_YEAR is also used as the suffix name for the temporary object names and the HDFS directory name.

 


Figure 11: Deployment Specification for Sqoop Incremental Imports
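By contrast, a sketch of the incremental (APPEND) case could be as simple as the following; again, the query, connection details, and paths are hypothetical:

    # APPEND mode: keep the existing files and add the new dataset with one
    # mapper, so no splitting column is needed
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
      --username movie_user --password '<password>' \
      --query "SELECT m.* FROM MOVIE m WHERE m.GENRE = 'TV SHOW' AND m.MOVIE_YEAR = 2014 AND \$CONDITIONS" \
      --num-mappers 1 \
      --append \
      --target-dir /user/odi/movie_files/TV_SHOWS_2014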

 

 

Using ODI 12c In-Session Parallelism with Sqoop

Oracle Data Integrator 12c introduces a new feature that allows parts of an ODI mapping to execute in parallel.  This feature is configured in the physical deployment specification of the ODI mapping.  Using this feature with Sqoop adds an additional level of parallelism when importing data from relational databases into Hadoop clusters.

  • Figure 12 shows two execution strategies to import data into two HDFS directories: the Serial Execution and the Parallel Execution.

  • The Serial Execution strategy contains one execution unit, Movie Files, for both HDFS directories: TV_SHOWS and DRAMAS. The import operation of these two HDFS directories is done in serial mode because there is only one execution unit for both HDFS directories.

  • The Parallel Execution strategy contains two execution units, one for each HDFS directory. Multiple execution units within the same execution group run in parallel; thus, the import operation of these two HDFS directories is done in parallel.

  • The Parallel Execution strategy can be designed by selecting a datastore, such as TV_SHOWS, and dragging the datastore outside of its execution unit. This action automatically creates another execution unit for the selected datastore in the same execution group, as shown in Figure 12.

 


Figure 12: ODI 12c In-Session Parallelism with Sqoop

 

 

  • Figure 13 shows the Initial Import deployment specification for both HDFS directories, TV_SHOWS and DRAMAS. In this example, two sets of Sqoop imports run in parallel, reducing the amount of time it takes to load data from the relational database tables into the HDFS directories.

  • Notice that in Figure 13, the reusable mapping is being used in two separate execution units in the same execution group called SQL to HDFS Files – Initial Import.

 

 


Figure 13: Two Parallel Imports with Sqoop and ODI

 

 

  • Figure 14 shows the session log for the Initial Import deployment specification.  Both execution units, Dramas and TV Shows, were executed in parallel.

 


Figure 14: ODI 12c Session Log Parallel Execution with Sqoop

 

 

Using ODI Packages with Sqoop Mappings

 

ODI packages can be used to implement additional parallelism when loading data from a relational database into an HDFS cluster.  For instance, an ODI package can be designed to launch multiple executions of the same ODI mapping in parallel.

 

  • Figure 15 illustrates the design of an ODI package called PKG_SQL_TO_HDFS_FILES – INITIAL. The package flow loops 3 times and launches 3 executions of the same ODI scenario, SQL to HDFS Files, in parallel. This scenario is the compiled object of the ODI mapping that imports data from the Oracle relational database into the HDFS directories.

  • In this ODI package, the SQL to HDFS Files scenario has been configured to run in Asynchronous (parallel) mode. In ODI, packages can execute scenarios in asynchronous mode, which means the package does not wait for the execution of the scenario. Instead, the package launches the scenario and the next step in the package is executed. The scenario then runs in parallel with other steps in the package.

  • The ODI package shown in Figure 15 uses a variable called VAR_MOVIE_YEAR to filter and load data by movie year. This variable is also used by the IKM SQL to HDFS File (Sqoop) to create unique temporary object names, so multiple instances of the same scenario can run in parallel. The variable has been added into the knowledge module option called SUFFIX_NAME (an illustrative naming sketch follows Figure 15).

Figure 15: ODI Package with Sqoop Scenario
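To make that concrete, assuming a SUFFIX_NAME value such as #MOVIE.VAR_MOVIE_YEAR (the project code, variable syntax, and every name below are hypothetical), three parallel sessions would each work with their own temporary objects and target directory, for example:

    # Session with VAR_MOVIE_YEAR = 1997
    /tmp/sqoop_tv_shows_1997.sh   /tmp/sqoop_tv_shows_1997.log   /user/odi/movie_files/TV_SHOWS_1997
    # Session with VAR_MOVIE_YEAR = 1998
    /tmp/sqoop_tv_shows_1998.sh   /tmp/sqoop_tv_shows_1998.log   /user/odi/movie_files/TV_SHOWS_1998
    # Session with VAR_MOVIE_YEAR = 1999
    /tmp/sqoop_tv_shows_1999.sh   /tmp/sqoop_tv_shows_1999.log   /user/odi/movie_files/TV_SHOWS_1999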

 

  • Figure 16 illustrates the ODI Operator with 3 scenarios running in parallel.  These scenarios are child sessions of the ODI package PKG_SQL_TO_HDFS_FILES – INITIAL.  Notice that the actual package has completed its execution, but all scenarios are in Running  status.  The session name of each scenario includes the actual value of the VAR_MOVIE_YEAR variable, the movie year.

 


Figure 16: ODI Operator with Scenarios Running in Parallel

 

In this section, three levels of parallelism have been discussed:

  • Level 1 – An ODI package can be used to launch multiple executions of the same ODI mapping in parallel.

  • Level 2 – ODI 12c In-Session Parallelism can be used when designing a Sqoop mapping with more than one HDFS directory or Hive table.

  • Level 3 – Sqoop mappers can be used to import data in parallel.

Figure 17 shows these three levels of parallelism.  In this example, one ODI package launches three executions of the same ODI scenario in parallel.  Each ODI scenario loads data into each HDFS directory in parallel.  For each HDFS directory, two Sqoop mappers are used to load data in parallel.
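With the numbers in this example, that works out to as many as 3 scenarios × 2 HDFS directories per scenario × 2 Sqoop mappers per directory = 12 parallel extract streams reading from the source database at the same time.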

 


Figure 17: Levels of Parallelism with Sqoop and ODI

 

 

Configuring Oracle Data Integrator for Sqoop Imports

 

ODI Topology Configuration

 

The Sqoop knowledge modules use two ODI technologies:  File and Hive.  The IKM SQL to HDFS File (Sqoop) uses the File technology, and the IKM SQL to HDFS Hive (Sqoop) uses the Hive technology.

These two technologies must be configured in the ODI Topology Navigator before using the Sqoop knowledge modules.

The following steps are a guideline on how to configure the ODI Topology for the File and Hive technologies.  For additional information on how to configure the ODI Topology, go to “Setting up the Oracle Data Integrator (ODI) Topology”.

 

The steps to configure Sqoop with the ODI Topology are as follows:

 

  • Create the necessary contexts based on your environment.

  • Using the File and Hive technologies, create the physical data servers.

  •  For each physical data server, File and Hive, create the physical schemas corresponding to the schemas containing the data to be integrated with ODI.

  • Using the File and Hive technologies, create the logical schemas and associate them with the corresponding File and Hive physical schemas and contexts.

  • Create the physical and logical agents and associate them with the contexts.  The ODI agent must be located in the big data appliance where the Sqoop tool is installed, so it can execute the Sqoop scripts.  The physical agent must have access to both the Sqoop tool and the source relational database.

  • Figure 18 shows an example of how to configure the physical schema for the File technology.  The Directory Schema in Figure 18, movie_files, represents the HDFS directory used by the Sqoop knowledge module to store the HDFS data files.

  • The Directory Work Schema in Figure 18, /tmp, is the UNIX directory used by the Sqoop knowledge module to create temporary objects at runtime.  This UNIX directory is typically located in the same machine where the ODI agent is installed, so the agent can create the temporary objects at runtime.

 


Figure 18: ODI Physical Schema for HDFS Files

 

  • Figure 19 shows an example of how to configure the physical schema for the Hive technology.  The Directory Schema in Figure 19, moviedemo, is the name of the Hive database where the target Hive table is located.

  • The Directory Work Schema in Figure 19, moviework, is a HDFS directory used by the Sqoop knowledge module to import the source data files into a temporary location.  The source data files are then moved from the HDFS directory into the Hive warehouse directory.

 


Figure 19: ODI Physical Schema for Hive Tables

 

  • Use the ODI Models section of the ODI Designer Navigator to create and configure your ODI models and datastores.  For more information on how to create Data Models and Datastores in ODI, see “Creating and Using Data Models and Datastores in Oracle Data Integrator”.

  • The following section describes how to configure your ODI mappings and interfaces with the Sqoop Knowledge modules.

 

ODI 12c Mapping Configuration

 

There are four basic configuration steps to enable the Sqoop knowledge modules in an ODI 12c mapping:

 

  • Create the desired number of physical deployment specifications for your ODI mapping.  Two deployment specifications are recommended:  the initial import, and the incremental import. 

  • Select a deployment specification and identify the ODI access point of the target execution unit.  Figure 20 shows an example.  In this example, the ODI access point of execution unit TV_SHOWS is a filter called FILTER_AP.

 


Figure 20: ODI Access Point Configuration

 

 

  • Using the Properties window of the ODI access point, expand the loading knowledge module option and select LKM SQL Multi-Connect.GLOBAL.  Figure 21 shows the loading knowledge module for the ODI access point called FILTER_AP.

 


Figure 21: Load Knowledge Module Configuration

 

  • On the target execution unit, select the target datastore.  Using the Properties window of the target datastore, expand the integration knowledge module option and select the Sqoop integration knowledge module.

  • Figure 22 shows the Properties window of the target datastore called TV_SHOWS.   The integration knowledge module for this datastore is IKM SQL to HDFS File (Sqoop).

 


Figure 22: Integration Knowledge Module Configuration

 

ODI 11g Interface Configuration

 

There are six basic configuration steps to enable the Sqoop knowledge modules in an ODI 11g interface:

 

  • Create your ODI interface and ensure that a target table has already been added to the interface.

  • Select the Overview tab of the ODI interface, and locate the Definition menu option.  Under the Definition menu option, check the option called “Staging Area Different from Target”.  Once this option is checked, a list of SQL technology servers will be available for selection.

  • Select the desired SQL technology server.  Figure 23 shows an example.  In this example, the SQL technology is Oracle, and the logical schema is called Movie Demo.  This is the logical schema of the source tables (the relational database) in the ODI interface.

 


Figure 23: ODI 11g Interface – Definition

 

  • Select the Flow tab of the ODI interface to see the flow diagram.  Figure 24 shows an example.  Select the Staging Area box, and ensure that no loading knowledge module can be selected in this area.

  • Select the Target Area box as shown in Figure 24.

 


Figure 24: ODI 11g Interface – Flow Tab

 

  • Using the Target Area Property Inspector, select the Sqoop integration knowledge module as shown in Figure 25.  Proceed to customize the knowledge module options based on your Sqoop import strategy.

 


Figure 25: ODI 11g Interface – Target Area Property Inspector

 

Conclusion

 

If your source data store is a relational database such as Oracle, Teradata, and MySQL, the Sqoop tool provides a fast and efficient way to extract and load data into your Hadoop Distributed File System.

If you would like to download a copy of these knowledge modules, go to “Oracle Data Integrator Knowledge Modules – Downloads.” Search for “sqoop”.

The examples discussed in this article have been developed with ODI 11.1.1 and ODI 12.1.3.  A copy of these ODI repositories with examples can be found at “ODI Repository Sample – SQL to HDFS Using Sqoop.”

 

For more ODI best practices, tips, tricks, and guidance that the A-Team members gain from real-world experiences working with customers and partners, visit Oracle A-Team Chronicles for ODI.

 
