Best Practices – Data movement between Oracle Storage Cloud Service and HDFS

Introduction

Oracle Storage Cloud Service should be the central place for persisting raw data produced by other PaaS services, and it is also the entry point for data uploaded from the customer's data center. Big Data Cloud Service (BDCS) supports data transfers between Oracle Storage Cloud Service and HDFS. Both Hadoop and Oracle provide a variety of tools and engineered solutions for this data movement. This document outlines those tools and describes best practices for moving data between Oracle Storage Cloud Service and HDFS.

Main Article

Architectural Overview

 

[Figure: Architectural overview of data movement between Oracle Storage Cloud Service and HDFS]

Interfaces to Oracle Storage Cloud Service

 

Interface – Resource

  • odcp – Accessing Oracle Storage Cloud Service Using Oracle Distributed Copy
  • Distcp – Accessing Oracle Storage Cloud Service Using Hadoop Distcp
  • Upload CLI – Accessing Oracle Storage Cloud Service Using the Upload CLI Tool
  • Hadoop fs -cp – Accessing Oracle Storage Cloud Service Using the Hadoop File System Shell Copy
  • Oracle Storage Cloud Software Appliance – Accessing Oracle Storage Cloud Service Using Oracle Storage Cloud Software Appliance
  • Application Programming Platform:
      Java Library – Accessing Oracle Storage Cloud Service Using Java Library
      File Transfer Manager API – Accessing Oracle Storage Cloud Service Using File Transfer Manager API
      REST API – Accessing Oracle Storage Cloud Service Using REST API

 

Oracle Distributed Copy (odcp)

Oracle Distributed Copy (odcp) is a tool for copying very large data files between HDFS and Oracle Storage Cloud Service in a distributed fashion.

  • How does it work?

The odcp tool has two main components:

(a) the odcp launcher script

(b) the Conductor application

The odcp launcher script is a bash script that serves as a launcher for the Spark application and provides fully parallel transfer of files.

The Conductor application is an Apache Spark application that copies large files between HDFS and Oracle Storage Cloud Service.

End users should normally use the odcp launcher script. It simplifies use of the Conductor application by setting up the Hadoop/Java environment variables and the spark-submit parameters and by invoking the Spark application. Using the Conductor application directly is the right approach when the data movement is performed from within a Spark application.


odcp splits the given input (source) file into smaller chunks. Each chunk is then transferred by one executor over the network to the destination store.


When all chunks are successfully transferred, executors take output chunks and merge them into final output files.


  • Examples

Oracle Storage Cloud Service is based on Swift, the open-source OpenStack Object Store. Data stored in Swift can be used as direct input to a MapReduce job simply by using a "swift://" URL to declare the source of the data. In a Swift file system URL, the hostname part identifies the container and the provider (service) to work with; the path identifies the name of the object.

Swift syntax:

swift://<MyContainer>.<MyProvider>/<filename>
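
For example, assuming the Hadoop Swift driver and the service credentials are already configured in core-site.xml (the container and provider names below are placeholders), the container contents can be listed directly with the Hadoop file system shell:

hadoop fs -ls swift://myContainer.myProvider/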

odcp launcher script

Copy file from HDFS to Oracle Storage Cloud Service

odcp hdfs:///user/oracle/data.raw swift://myContainer.myProvider/data.raw

Copy file from Oracle Storage Cloud Service to HDFS:

odcp swift://myContainer.myProvider/data.raw hdfs:///user/oracle/odcp-data.raw

Copy directory from HDFS to Oracle Storage Cloud Service:

odcp hdfs:///user/data/ swift://myContainer.myProvider/backup

If the cluster has more than three nodes, the transfer speed can be increased by specifying a higher number of executors. For six nodes, use the following command:

odcp --num-executors=6 hdfs:///user/oracle/data.raw swift://myContainer.myProvider/data.raw

 

Highlights of odcp launcher script options
--executor-cores: The number of executor cores. This sets the number of threads per executor (bounded by the available vCPUs), which lets the transfer run in parallel. The default value is 30.
--num-executors: The number of executors. This is normally the same as the number of physical nodes/VMs. The default value is 3.
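
For example, the two options can be combined to use six executors with ten cores each; the values are illustrative only and should be tuned to the cluster:

odcp --num-executors=6 --executor-cores=10 hdfs:///user/oracle/data.raw swift://myContainer.myProvider/data.raw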

 

Conductor application

Usage: Conductor [options] <source URI...> <destination URI>
<source URI...> <destination URI>
source/destination file(s) URI, examples:
hdfs://[HOST[:PORT]]/<path>
swift://<container>.<provider>/<path>
file:///<path>
-i <value> | --fsSwiftImpl <value>
swift file system configuration. Default taken from etc/hadoop/core-site.xml (fs.swift.impl)
-u <value> | --swiftUsername <value>
swift username. Default taken from etc/hadoop/core-site.xml (fs.swift.service.<PROVIDER>.username)
-p <value> | --swiftPassword <value>
swift password. Default taken from etc/hadoop/core-site.xml (fs.swift.service.<PROVIDER>.password)
-i <value> | --swiftIdentityDomain <value>
swift identity domain. Default taken from etc/hadoop/core-site.xml (fs.swift.service.<PROVIDER>.tenant)
-a <value> | --swiftAuthUrl <value>
swift auth URL. Default taken from etc/hadoop/core-site.xml (fs.swift.service.<PROVIDER>.auth.url)
-P <value> | --swiftPublic <value>
indicates if all URLs are public - yes/no (default yes). Default taken from etc/hadoop/core-site.xml (fs.swift.service.<PROVIDER>.public)
-r <value> | --swiftRegion <value>
swift Keystone region
-b <value> | --blockSize <value>
destination file block size (default 268435456 B), NOTE: remainder after division of partSize by blockSize must be equal to zero
-s <value> | --partSize <value>
destination file part size (default 1073741824 B), NOTE: remainder after division of partSize by blockSize must be equal to zero
-e <value> | --srcPattern <value>
copies files when their names match the given regular expression pattern, NOTE: ignored when used with --groupBy
-g <value> | --groupBy <value>
concatenate files when their names match given regular expression pattern
-n <value> | --groupName <value>
group name (use only with --groupBy), NOTE: slashes are not allowed
--help
display this help and exit

 

The Conductor application can be submitted directly to a Spark deployment environment for execution. Below is an example of how to submit the Conductor application with spark-submit.

spark-submit \
  --conf spark.yarn.executor.memoryOverhead=600 \
  --jars hadoop-openstack-spoc-2.7.2.jar,scopt_2.10-3.4.0.jar \
  --class oracle.paas.bdcs.conductor.Conductor \
  --master yarn \
  --deploy-mode client \
  --executor-cores <number of executor cores e.g. 5> \
  --executor-memory <memory size e.g. 40G> \
  --driver-memory <driver memory size e.g. 10G> \
  original-conductor-1.0-SNAPSHOT.jar \
  --swiftUsername <oracle username@oracle.com> \
  --swiftPassword <password> \
  --swiftIdentityDomain <storage ID assigned to this user> \
  --swiftAuthUrl https://<Storage cloud domain name e.g. storage.us2.oraclecloud.com:443>/auth/v2.0/tokens \
  --swiftPublic true \
  --fsSwiftImpl org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem \
  --blockSize <block size e.g. 536870912> \
  swift://<container.provider e.g. rstrejc.a424392>/someDirectory \
  swift://<container.provider e.g. rstrejc.a424392>/someFile \
  hdfs:///user/oracle/

  • Limitations

odcp consumes a lot of cluster resources. When other Spark or MapReduce jobs run in parallel with odcp, adjust the number of executors, the memory available to each executor, or the number of executor cores using the --num-executors, --executor-memory and --executor-cores options to balance performance.
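
For example, a minimal sketch that shrinks the odcp footprint while other jobs share the cluster; the values are illustrative only, and the --executor-memory value format is assumed to follow the spark-submit convention:

odcp --num-executors=2 --executor-cores=5 --executor-memory=16G hdfs:///user/oracle/data.raw swift://myContainer.myProvider/data.raw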

 

Distcp

Distcp (distributed copy) is a Hadoop utility for inter- and intra-cluster copying of large amounts of data in parallel. The Distcp command submits a regular MapReduce job that performs a file-by-file copy.

  • How does it work?

Distcp involves two steps:

(a) Building the list of files to copy (known as the copy list)

(b) Running a MapReduce job to copy files, with the copy list as input


The MapReduce job that does the copying has only mappers; each mapper copies a subset of the files in the copy list. By default, the copy list contains all files under the source directories passed to Distcp.
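
Because each mapper copies only part of the copy list, the degree of parallelism can be capped with Distcp's standard -m option (the maximum number of simultaneous map tasks). A minimal sketch with placeholder paths:

hadoop distcp -m 20 hdfs:///user/data/ swift://myContainer.myProvider/backup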

 

  • Examples

 

Syntax for copying data from HDFS to Oracle Storage Cloud Service:

hadoop distcp hdfs://<hadoop namenode>/<source filename> swift://<MyContainer.MyProvider>/<destination filename>

Allocation of JVM heap size:

export HADOOP_CLIENT_OPTS="-Xms<start heap memory size> -Xmx<max heap memory size>"

Setting timeout syntax:

hadoop distcp -Dmapred.task.timeout=<time in milliseconds> hdfs://<hadoop namenode>/<source filename> swift://<MyContainer.MyProvider>/<destination filename>

Hadoop getmerge syntax:

bin/hadoop fs -getmerge [-nl] <source directory> <destination directory>/<output filename>

The hadoop fs -getmerge command takes a source directory and a destination file as input and concatenates the source files into the destination file on the local file system. The -nl option adds a newline character at the end of each file.
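
For example, to merge all part files under an HDFS directory into a single local file, adding a newline after each part (the paths are placeholders):

bin/hadoop fs -getmerge -nl /user/oracle/data-parts /tmp/data-merged.raw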

 

  • Limitations

For a large file copy, make sure the task has a termination strategy in case it stops reading input, writing output, or updating its status string. The -Dmapred.task.timeout=<time in milliseconds> option sets the maximum task timeout. For a 1 TB file, use -Dmapred.task.timeout=60000000 (approximately 16 hours) with the Distcp command.

Distcp might run out of memory while copying very large files. To work around this, increase the -Xmx JVM heap-size parameter before executing the hadoop distcp command. The value must be a multiple of 1024.
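
A minimal sketch that applies both settings before copying a roughly 1 TB file (heap sizes and paths are placeholders):

export HADOOP_CLIENT_OPTS="-Xms1024m -Xmx8192m"
hadoop distcp -Dmapred.task.timeout=60000000 hdfs://<hadoop namenode>/user/oracle/data.raw swift://myContainer.myProvider/data.raw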

To improve the transfer speed of a very large file, split the file at the source and copy the resulting pieces to the destination. Once the pieces are successfully transferred, merge them at the destination, for example with hadoop fs -getmerge as shown above.

Upload CLI

 

  • How does it work?

The Upload CLI tool is a cross-platform, Java-based command-line tool for efficiently uploading files to Oracle Storage Cloud Service. The tool optimizes uploads through segmentation and parallelization to maximize network efficiency and reduce overall upload time. If a large transfer is interrupted, the Upload CLI tool maintains state and resumes from the point of interruption. It also retries automatically on failures.

  • Example:

Syntax of upload CLI:

java -jar uploadcli.jar -url REST_Endpoint_URL -user userName -container containerName file-or-files-or-directory

To upload a file named file.txt to a standard container myContainer in the domain myIdentityDomain as the user abc.xyz@oracle.com, run the following command:

java -jar uploadcli.jar -url https://foo.storage.oraclecloud.com/myIdentityDomain-myServiceName -user abc.xyz@oracle.com -container myContainer file.txt

When running the Upload CLI tool on a host that’s behind a proxy server, specify the host name and port of the proxy server by using the https.proxyHost and https.proxyPort Java parameters.

 

Syntax of upload CLI behind proxy server:

java -Dhttps.proxyHost=host -Dhttps.proxyPort=port -jar uploadcli.jar -url REST_Endpoint_URL -user userName -container containerName file-or-files-or-directory

  • Limitations

Upload CLI is a Java tool and runs only on hosts that satisfy the prerequisites of the uploadcli tool (for example, a suitable JRE version).

 

Hadoop fs -cp

 

  • How does it work?

hadoop fs -cp belongs to the family of Hadoop file system shell commands that run from the operating system's command-line interface. It is not distributed across the cluster; the command transfers data byte by byte through the machine where it is issued.

  • Example

hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2
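
Assuming the Swift driver and credentials are configured in core-site.xml, the same command can also copy from HDFS to an Oracle Storage Cloud Service container (the container and provider names are placeholders):

hadoop fs -cp hdfs:///user/oracle/data.raw swift://myContainer.myProvider/data.raw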

 

  • Limitations

The byte-by-byte transfer takes a very long time when copying large files from HDFS to Oracle Storage Cloud Service.

 

Oracle Storage Cloud Software Appliance

 

  • How does it work?

Oracle Storage Cloud Software Appliance is a product that facilitates easy, secure, reliable data storage and retrieval from Oracle Storage Cloud Service. Businesses can use Oracle Cloud Storage without changing their data center applications and workflows. Applications that use a standard file-based network protocol such as NFS to store and retrieve data can use Oracle Storage Cloud Software Appliance as a bridge between standard file storage and the object storage used by Oracle Storage Cloud Service. The appliance caches frequently retrieved data on the local host, minimizing the number of REST API calls to Oracle Storage Cloud Service and enabling low-latency, high-throughput file I/O.

The application host can mount a directory exported by the Oracle Storage Cloud Software Appliance, which acts as a cloud storage gateway. This lets the application host access an Oracle Cloud Storage container as a standard NFS file system.
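
As an illustration only, a hypothetical mount from the application host looks like a standard NFS mount; the appliance host name, exported filesystem name, and mount options below are placeholders and depend on how the appliance is configured:

sudo mkdir -p /mnt/oscsa
sudo mount -t nfs -o vers=4 <appliance-host>:/<appliance-filesystem-name> /mnt/oscsa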

 

Architecture

[Figure: Oracle Storage Cloud Software Appliance architecture]

 

  • Limitations

The appliance is ideal for backup and archive use cases that replicate infrequently accessed data to cloud containers; read-only and read-dominated content repositories are ideal targets. Once an Oracle Storage Cloud Service container is mapped to a filesystem in Oracle Storage Cloud Software Appliance, other data movement tools such as the REST API, odcp, Distcp, or the Java library must not be used against that container. Doing so would cause the data in the appliance to become inconsistent with the data in Oracle Storage Cloud Service.

 

Application Programming Platform

Oracle provides several programmatic interfaces for accessing Oracle Storage Cloud Service. The following summarizes the APIs that can be used.

Interface – Description

  • Java Library – Accessing Oracle Storage Cloud Service Using Java Library
  • File Transfer Manager API – Accessing Oracle Storage Cloud Service Using File Transfer Manager API
  • REST API – Accessing Oracle Storage Cloud Service Using REST API


Java Library  

 

  • How does it work?

The Java library is useful for Java applications that prefer the Oracle Cloud Java API for Oracle Storage Cloud Service over the tools provided by Oracle and Hadoop. The library wraps the RESTful web service API, and most of the major REST API features of Oracle Storage Cloud Service are available through it. The Java library is distributed as part of the separate Oracle Cloud Service Java SDK.

 


  • Example

Sample Code snippet

package storageupload;

import oracle.cloud.storage.*;
import oracle.cloud.storage.model.*;
import oracle.cloud.storage.exception.*;
import java.io.*;
import java.util.*;
import java.net.*;

public class UploadingSegmentedObjects {
    public static void main(String[] args) {
        try {
            // Configure the connection to Oracle Storage Cloud Service
            CloudStorageConfig myConfig = new CloudStorageConfig();
            myConfig.setServiceName("Storage-usoracleXXXXX")
                    .setUsername("xxxxxxxxx@yyyyyyyyy.com")
                    .setPassword("xxxxxxxxxxxxxxxxx".toCharArray())
                    .setServiceUrl("https://xxxxxx.yyyy.oraclecloud.com");
            CloudStorage myConnection = CloudStorageFactory.getStorage(myConfig);
            System.out.println("\nConnected!!\n");

            // Create a container if the account has none yet
            if (myConnection.listContainers().isEmpty()) {
                myConnection.createContainer("myContainer");
            }

            // Upload the same local file three times under different object names
            FileInputStream fis = new FileInputStream("C:\\temp\\hello.txt");
            myConnection.storeObject("myContainer", "C:\\temp\\hello.txt", "text/plain", fis);
            fis = new FileInputStream("C:\\temp\\hello.txt");
            myConnection.storeObject("myContainer", "C:\\temp\\hello1.txt", "text/plain", fis);
            fis = new FileInputStream("C:\\temp\\hello.txt");
            myConnection.storeObject("myContainer", "C:\\temp\\hello2.txt", "text/plain", fis);

            // List the objects now present in the container
            List<Key> myList = myConnection.listObjects("myContainer", null);
            Iterator<Key> it = myList.iterator();
            while (it.hasNext()) {
                System.out.println(it.next().getKey());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

 

  • Limitations

The Java API cannot create an Oracle Storage Cloud Service archive container. An appropriate JRE version is required to use the Java library.

 

File Transfer Manager API

 

  • How does it Work?

The File Transfer Manager (FTM) API is a Java library that simplifies uploading to and downloading from Oracle Storage Cloud Service. It provides both synchronous and asynchronous APIs for transferring files, and the asynchronous version allows operations to be tracked. The library is distributed as part of the separate Oracle Cloud Service Java SDK.

 

  • Example

Sample code snippet for uploading a single file:

FileTransferAuth auth = new FileTransferAuth(
    "email@oracle.com",                    // user name
    "xxxxxx",                              // password
    "yyyyyy",                              // service name
    "https://xxxxx.yyyyy.oraclecloud.com", // service URL
    "xxxxxx"                               // identity domain
);
FileTransferManager manager = null;
try {
    manager = FileTransferManager.getDefaultFileTransferManager(auth);
    String containerName = "mycontainer";
    String objectName = "foo.txt";
    File file = new File("/tmp/foo.txt");
    UploadConfig uploadConfig = new UploadConfig();
    uploadConfig.setOverwrite(true);
    uploadConfig.setStorageClass(CloudStorageClass.Standard);
    System.out.println("Uploading file " + file.getName() + " to container " + containerName);
    TransferResult uploadResult = manager.upload(uploadConfig, containerName, objectName, file);
    System.out.println("Upload completed successfully.");
    System.out.println("Upload result:" + uploadResult.toString());
} catch (ClientException ce) {
    System.out.println("Upload failed. " + ce.getMessage());
} finally {
    if (manager != null) {
        manager.shutdown();
    }
}

 

REST API

 

  • How does it work?

The REST API can be accessed from any application or programming platform that correctly and completely understands the Hypertext Transfer Protocol (HTTP). The REST API uses advanced facets of HTTP such as secure communication over HTTPS, HTTP headers, and specialized HTTP verbs (PUT, DELETE). cURL is one of the many applications that meet these requirements.

 

  • Example

cURL syntax:

curl -v -s -X PUT -H "X-Auth-Token: <Authorization Token ID>" "https://<Oracle Cloud Storage domain name>/v1/<storage ID associated with the user account>/<container name>"
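
As an illustration, the sketch below first requests an authentication token and then uploads a local file as an object. The host name, identity domain, user, container, and file names are placeholders, and the exact REST endpoint depends on the account:

curl -v -X GET -H "X-Storage-User: Storage-myIdentityDomain:abc.xyz@oracle.com" -H "X-Storage-Pass: <password>" https://myIdentityDomain.storage.oraclecloud.com/auth/v1.0

The response headers include X-Auth-Token (and X-Storage-Url), which can then be used to upload an object:

curl -v -X PUT -H "X-Auth-Token: <Authorization Token ID>" -T data.raw "https://myIdentityDomain.storage.oraclecloud.com/v1/Storage-myIdentityDomain/myContainer/data.raw"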

 

Data Transfer Test Results

The configuration used to measure performance and data transfer rates is as follows:

Test environment configuration:

- BDCS 16.2.5
- Hadoop Swift driver 2.7.2
- US2 production data center
- 3-node cluster running on BDA
- Each node has 256 GB memory and 30 vCPUs
- File size: 1 TB (terabyte)
- The file contains all zeros

Each row lists the interface, the transfer direction (source to destination), the elapsed time, and comments:

  1. odcp: HDFS to Oracle Storage Cloud Service, 54 minutes. Transfer rate: 2.47 Gb/sec (1.11 TB/hour).
  2. Hadoop Distcp: Oracle Storage Cloud Service to HDFS, failed. Not enough memory (after 1 hour).
  3. Hadoop Distcp: HDFS to Oracle Storage Cloud Service, failed.
  4. Hadoop Distcp: HDFS to Oracle Storage Cloud Service, 3 hours. Based on splitting the 1 TB file into 50 files of 10 GB each; each 10 GB file took 18 minutes (with a 256 MB partition size).
  5. Upload CLI: HDFS to Oracle Storage Cloud Service, 5 hours 55 minutes. Data was read from Big Data Cloud Service HDFS mounted using fuse_dfs.
  6. hadoop fs -cp: HDFS to Oracle Storage Cloud Service, 11 hours 50 minutes 50 seconds. Parallelism 1; transfer rate: 250 Mb/sec.

 

Summary

One can draw the following conclusions from the above analysis.

Data file size and acceptable transfer time are the two main factors in choosing the appropriate interface for data movement between HDFS and Oracle Storage Cloud Service.

The odcp interface adds no additional data manipulation or processing overhead.
