Back in 2012, while researching how to make a Fusion Applications environment portable, the decisive moment came when I confirmed that I could copy an entire FA environment from one server to another and run it there, as long as I added every hostname present in the FA configuration data to the operating system's /etc/hosts file. By tweaking the hostname-to-IP-address mappings in the hosts file so that network requests were routed to the new machines instead of the old ones, I was able to keep the duplicate FA environment almost intact, creating a perfect clone while preserving the ability to patch and upgrade it. This technique later led to the Fusion Applications Cloning Tool, which the A-Team developed in 2013 under my lead.
While defining the architecture for that tool, it became clear that if we were to add all hostnames present in the FA configuration to the /etc/hosts file, we would first need to know what those hostnames were. In a perfect world we could simply look up the server name using ping or the hostname command, but reality is much more complex, since many customers have more than one hostname pointing at a single server. For example, one of our customers had two different hostnames for every host in their data center, one local and one in DNS; they also had Virtual IPs configured on the server, each with its own hostname, plus remote hostnames defined locally in the /etc/hosts file. An inspection of their FA configuration revealed that all of these hostnames were being used to point at the same server; some configuration files even used the raw IP address to reference a particular server in the topology.
So before we could start the clone of the Fusion Applications environment on another host, we knew we would have to look up hostname references in the Fusion Applications configuration data, which spans files, database tables, credential stores, etc. With no tool available to do that for us at the time, we defined a procedure for finding not just hostnames but also other details such as usernames, database connections, identity management information and topology, and documented it as the Discovery process in the Fusion Applications Cloning and Content Movement Administrator's Guide.
We are currently working on a tool that automates that process. The next sections share some of the thinking that went into the Discovery process and its automated version, the Discover tool. I hope it will be of interest to anyone who, like us, needs to harvest information from IT systems.
The need to obtain information from IT systems is nothing new. Regular system maintenance, including patching, upgrading and configuration changes, often requires current information such as database connection details, HTTP URLs, file locations and, of course, passwords. IT teams have traditionally relied on spreadsheets, wiki pages and cookbooks to keep track of and share that kind of information, but aside from the questionable security and reliability of these methods, how can one guarantee that the information there is current? Most of the time, one simply can't.
Going back to installation response files and other files generated at installation time sometimes works but can be misleading since they will most likely not include manual configuration changes made to the environment after the installation.
Looking at current architectural trends, modern systems rely more and more on service-oriented architectures, with information coming from other, often unrelated, systems through HTTP-based APIs. The decentralized nature of these architectures makes it even more difficult to maintain a centralized, up-to-date and reliable repository of configuration information, and makes traditional methods either too costly or quickly obsolete.
So to be sure about the information coming from an IT system, the ideal approach is to inspect the system itself, as needed, to gather information that is current and reliable. There are many different ways to go about doing that, but we have found that the following 5 points take care of all our information harvesting needs:

1. Knowing the what, where and how of the information you need
2. Gathering the information
3. Analyzing the information
4. Verifying the information
5. Presenting the information
In this part 1 we will be discussing the first one: Knowing the what, where and how of the information you need.
Part 2 will discuss the remaining points: Gathering, Analyzing, Verifying and Presenting the information.
In our example above, we know we need all hostnames. However, hostnames can appear in various shapes (URLs, database connect strings, property values, etc.), so when looking for hostnames we are actually looking for all possible hostname shapes.
This may be true for other types of information as well, such as usernames, port numbers, file names, etc. So before you go on your quest to find the information, make sure you know all the shapes and forms in which it may appear and that you have included all of them in your list. With this in mind, our search for hostnames now includes the following:
| Information shape | Example | Found for example in |
| --- | --- | --- |
| HTTP URLs | http://fusionapps.mycompany.com:10614 | Web Service connections |
| JDBC URLs | jdbc:oracle:thin:@fusiondb.mycompany.com:1521/fusiondb | Database connections |
| LDAP URLs | ldap://idm.mycompany.com:3060 | Identity Store connections |
| Database connect strings | (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=fusiondb.mycompany.com)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=fusiondb))) | Database connections |
| Hostname:Port pairs | fusionapps.mycompany.com:10613 | Oracle HTTP Server configuration files |
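As an illustration, each shape in the table above can be matched with a regular expression that captures the hostname portion. The Discover tool itself is written in Java, but the idea can be sketched in Python; the patterns below are simplified for the example:

```python
import re

# Sample configuration snippets, one per shape from the table above
# (the values are the table's illustrative examples, not a real environment).
samples = [
    "http://fusionapps.mycompany.com:10614",
    "jdbc:oracle:thin:@fusiondb.mycompany.com:1521/fusiondb",
    "ldap://idm.mycompany.com:3060",
    "(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)"
    "(HOST=fusiondb.mycompany.com)(PORT=1521)))"
    "(CONNECT_DATA=(SERVICE_NAME=fusiondb)))",
    "fusionapps.mycompany.com:10613",
]

# One pattern per shape; each capture group is the hostname.
patterns = [
    re.compile(r"https?://([^:/]+)"),           # HTTP URLs
    re.compile(r"jdbc:oracle:thin:@([^:/]+)"),  # thin-driver JDBC URLs
    re.compile(r"ldaps?://([^:/]+)"),           # LDAP URLs
    re.compile(r"\(HOST=([^)]+)\)"),            # database connect strings
    re.compile(r"^([A-Za-z0-9.-]+):\d+$"),      # bare hostname:port pairs
]

def extract_hostnames(text):
    """Return every hostname found in `text`, in any known shape."""
    found = set()
    for pattern in patterns:
        found.update(pattern.findall(text))
    return found

hostnames = set()
for sample in samples:
    hostnames.update(extract_hostnames(sample))
print(sorted(hostnames))
# → ['fusionapps.mycompany.com', 'fusiondb.mycompany.com', 'idm.mycompany.com']
```

A real implementation would need more robust patterns (IPv6 literals, RAC connect strings with multiple ADDRESS entries, and so on), but the principle of one recognizer per shape stays the same.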
In a perfect world, all configuration information for a given system would be managed through a single web-based console, but we all know the IT world is far from perfect. In reality, modern systems span multiple products, often built on different technologies, with each system maintaining its own configuration, including database connections, identity system connections, API endpoint connections, etc.
In our hostname example, in order to find all hostnames (and other shapes) used in Fusion Applications, we first have to list all the possible sources that can contain hostname references. They include, among others:
| Source | Examples |
| --- | --- |
| Identity management connections | E.g. SSO provider, identity store (LDAP), policy store/credential store (LDAP) |
| HTTP connections | E.g. web service URLs, map URLs, generic HTTP URLs |
| Frontend HTTP endpoint configuration | The hostnames defined at the HTTP server level for incoming requests |
| Search crawler connections | External connections used to retrieve search indexing data |
The list above gives an idea of where to look, but to produce a detailed list of all the specific places that can contain hostname references, a combination of the methods below is needed at least once:
Now that we know all the information we need (in its various shapes and forms) and where to get it, we need a way to get to it. Once again, having a web-based console really helps when performing manual discovery. A tool like Oracle Cloud Control can aggregate information from multiple systems/consoles and make this process a lot easier.
If the information is not readily available in a central location, one can always obtain each piece manually by:
However, the information we are looking for is so vast and detailed that (1) gathering it manually would take far too much time, and (2) only going directly to the source (APIs, command-line tools, LDAP directories and, sometimes, all the way down to database tables and configuration files) guarantees we will not miss anything. Since the information lives in a variety of sources, we need a variety of access methods, e.g.:
| Access method | Used to access |
| --- | --- |
| SQL / JDBC | Database resources, e.g. transactional data, some configuration data |
| LDAP | LDAP directories, for identity (user and group) information |
| HTTP | Web-based APIs (web services, REST, etc.) |
| JMX / T3 | MBeans (for example, to access WebLogic Server configuration) |
| Process invocation | Output from command-line tools |
Most programming languages, including Java, have APIs for all of these, which makes it possible to automate them.
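As a sketch of the "process invocation" access method (in Python here, though the same can be done with Java's ProcessBuilder), the snippet below runs a command and captures its output. In a real discovery run the command would be a product CLI such as a WLST or opmnctl invocation; to keep the example portable, the Python interpreter itself stands in for the tool:

```python
import subprocess
import sys

def run_tool(command):
    """Invoke a command-line tool and return its stdout as text.

    Raises CalledProcessError if the tool exits with a non-zero status,
    so failures are not silently treated as empty output.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.strip()

# Stand-in for a real discovery command; the hostname:port value is made up.
output = run_tool([sys.executable, "-c", "print('fusiondb.mycompany.com:1521')"])
print(output)
# → fusiondb.mycompany.com:1521
```

The captured text would then be fed into the processing step described next, just like data obtained over SQL, LDAP, HTTP or JMX.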
Once the information has been obtained, it may have to go through some basic processing to extract the important elements from it, such as, in our example, extracting the hostname from an HTTP URL or from a configuration (properties) file. When performing discovery manually, this becomes part of the process naturally; when automating, it must be invoked as a separate task. Some of the ways this can be done are listed below:
| Text format | Processing method | Description |
| --- | --- | --- |
| Any text | Regular expressions | RegEx can be used to extract information from literally any text; for structured text, however, there may be other, better-optimized ways to do it (see the rows below) |
| XML | XPath | XML is widely used as a format for configuration files, and XPath is certainly the best way to obtain specific information from XML data |
| CSV | ODBC, etc. | Comma-separated values is a widely used format for storing tabular data and is compatible with popular tools like Excel |
| Properties files | Regular expressions / Java Properties class | Properties files are widely used for storing system configuration |
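To make the XPath and regular-expression rows concrete, here is a small Python sketch that pulls a hostname out of an XML fragment and out of a properties file. The XML element names and property keys are made up for illustration (loosely styled after a WebLogic JDBC module), not taken from a real product:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical data-source fragment; element names simplified for the example.
xml_config = """
<jdbc-data-source>
  <name>FusionDS</name>
  <jdbc-driver-params>
    <url>jdbc:oracle:thin:@fusiondb.mycompany.com:1521/fusiondb</url>
  </jdbc-driver-params>
</jdbc-data-source>
"""

# XPath: navigate straight to the <url> element.
root = ET.fromstring(xml_config)
url = root.find("./jdbc-driver-params/url").text

# Regular expression: pull the hostname out of the JDBC URL.
hostname = re.search(r"@([^:/]+)", url).group(1)
print(hostname)
# → fusiondb.mycompany.com

# Properties files can be handled with a line-oriented regular expression
# (in Java, the Properties class does the parsing for you).
props = "db.host=fusiondb.mycompany.com\ndb.port=1521\n"
values = dict(re.findall(r"^([^=#\s]+)=(.*)$", props, re.MULTILINE))
print(values["db.host"])
# → fusiondb.mycompany.com
```

Note how the two steps compose: a structure-aware method (XPath, properties parsing) isolates the field, and a shape-aware regular expression then extracts the hostname from it.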
Once you define the access method and processing method for each source, it is very important to encapsulate the access and processing of the information into a repeatable step. We wouldn't want the thought that went into getting that information once to go to waste. This can be done through documentation and/or the creation of code units that can be run independently, normally through scripts.
In the Discover tool, we call these units "operations" and we implement them as Java code. But an operation is, in reality, an abstract concept that defines a contract (inputs and outputs, plus the action it performs) and can be implemented in a variety of ways. We will discuss more about how we are coding operations and their features in another blog post. For now, all you need to know is that they are units of documentation or code that allow you to obtain a specific set of information from a given source.
An operation is defined by its inputs, its outputs, and the procedure it performs.
Here are a couple of examples of operations:
| Operation 1: Get a list of all JDBC URLs used by data sources in a WebLogic Domain |
| --- |
| Inputs: WebLogic Console URL, username and password for a given domain |
| Outputs: list of JDBC URLs |
| Procedure: manual steps performed in the WebLogic Console to record each data source's JDBC URL |

This is a manual operation; note that its procedure describes the steps a person must go through to perform it.
| Operation 2: Get a list of all database links in a given database |
| --- |
| Inputs: Hostname, port, service name, DBA username and password for a given database |
| Outputs: list of database links |
| Procedure: getDbLinks.sql script |

This is an automated operation, which uses a SQL script to obtain the information needed.
Note that the contract defined in both (inputs, outputs, what it does) has the same format, even though operation 1 is manual and operation 2 is automated. This is a key aspect: (1) the process of harvesting information may mix automated and manual steps, and (2) it facilitates process design and the transition from a manual process to an automated one, so that complex steps can be performed manually at first and automated over time.
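As a hypothetical sketch of that shared contract (class and method names are ours for illustration, not the Discover tool's actual API, which is Java), an operation can be modeled as an abstract type whose implementations are either instructions for a person or executable code:

```python
from abc import ABC, abstractmethod
import re

class Operation(ABC):
    """Contract: named inputs in, a result out.

    Whether the result is produced by code or by a person following
    written steps is an implementation detail.
    """
    @abstractmethod
    def run(self, **inputs):
        ...

class GetJdbcUrlsManual(Operation):
    """Manual operation: its 'procedure' is instructions for a person."""
    def run(self, console_url, username, password):
        return (f"Log in to {console_url} as {username}, open each data "
                "source and record its JDBC URL.")

class GetJdbcUrlsFromConfig(Operation):
    """Automated operation: same contract, fulfilled by code."""
    def run(self, config_text):
        # Find anything that looks like a JDBC URL in the given text.
        return re.findall(r"jdbc:[^\s<\"']+", config_text)

op = GetJdbcUrlsFromConfig()
urls = op.run(config_text='<url>jdbc:oracle:thin:@fusiondb.mycompany.com:1521/fusiondb</url>')
print(urls)
# → ['jdbc:oracle:thin:@fusiondb.mycompany.com:1521/fusiondb']
```

Because both implementations honor the same contract, a discovery process can call either interchangeably, and a manual step can later be swapped for an automated one without redesigning the process around it.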
Stay tuned for part 2, where we will discuss how to collect the information, analyze it, verify it and present it so that it can be used by other processes.