Harvesting Information from IT Systems – Part 1

 

Introduction

Back in 2012, while I was researching how to make a Fusion Applications environment portable, the decisive moment came when I confirmed that I could copy an entire FA environment from one server to another and run it there, as long as I added all hostnames present in the FA configuration data to the operating system's /etc/hosts file. By tweaking the hostname-to-IP-address mappings in the hosts file so that network requests would be routed to the new machines instead of the old ones, I was able to keep the duplicate FA environment almost intact, creating a perfect clone while preserving the ability to patch and upgrade. This technique later led to the creation of the Fusion Applications Cloning Tool, which the A-Team developed in 2013 under my lead.
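
As an illustration, the /etc/hosts entries on the new server might have looked like the fragment below, with every hostname found in the FA configuration mapped to the new machines (the hostnames and IP addresses here are made up):

    # FA hostnames remapped so requests route to the new servers
    10.0.0.21   fusionapps.mycompany.com
    10.0.0.22   fusiondb.mycompany.com
    10.0.0.23   idm.mycompany.com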

While defining the architecture for that tool, it became clear that if we were to add all hostnames present in the FA configuration to the /etc/hosts file, we would first need to know what those hostnames were. In a perfect world we could simply look up the server name using ping or the hostname command, but reality is much more complex, since many customers have more than one hostname pointing at a single server. For example, one of our customers had two different hostnames – one local and one in DNS – for every host in their data center; they also had virtual IPs configured on the servers, each with its own hostname, plus remote hostnames defined locally in the /etc/hosts file. An inspection of their FA configuration revealed that all of these hostnames were being used to point at the same server; some configuration files even used the IP address itself to reference a particular server in the topology.

So before we could start the clone of the Fusion Applications environment on another host, we knew we would have to look up hostname references in the Fusion Applications configuration data, which includes files, database tables, credential stores, etc. With no tool available to do that for us at the time, we defined a procedure for finding not just hostnames but also other details such as usernames, database connections, identity management information, topology, etc., and documented it as the Discovery process in the Fusion Applications Cloning and Content Movement Administrator’s Guide.

We are currently working on a tool that automates that process, and the next sections share some of the thought process that went into the Discovery process and its automated version – the Discover tool. I hope it will be of interest to anyone who, like us, needs to harvest information from IT systems.

 

Know Before You Go

The need to obtain information from IT systems is nothing new. Regular system maintenance, including patching, upgrading and making configuration changes, often requires knowing current information such as database connection details, HTTP URLs, file locations and, of course, passwords. IT teams have traditionally relied on spreadsheets, wiki pages and cookbooks to keep track of and share that kind of information, but aside from the questionable security and reliability of these methods, how can one guarantee that the information there is current? Most of the time, one simply can’t.

Going back to installation response files and other files generated at installation time sometimes works, but can be misleading, since they will most likely not include manual configuration changes made to the environment after the installation.

Looking at current architectural trends, modern systems rely more and more on service-oriented architectures, with information coming from other, often unrelated, systems through HTTP-based APIs. The decentralized nature of these architectures makes it even more difficult to maintain a centralized, up-to-date and reliable repository of configuration information, and makes traditional methods either too costly or quickly obsolete.

So in order to be sure about the information from an IT system, the ideal approach is to inspect that system as needed, gathering information that is current and reliable. There are many different ways to go about doing that, but we have found that the following five points can take care of all our information harvesting needs:

 

  • Knowing the what, where and how of the information you need
  • Gathering the information
  • Analyzing it
  • Verifying it
  • Summarizing and presenting it

[Figure: Harvesting]

 

In this part 1, we discuss the first point: knowing the what, where and how of the information you need.

Part 2 will discuss the remaining points: gathering, analyzing, verifying and presenting the information.

 

The What, Where and How of IT Information

What information?

In our example above, we know we need all hostnames. However, hostnames can appear in various shapes (URLs, database connect strings, property values, etc.), so when looking for hostnames we are actually looking for all possible hostname shapes.

The same applies to other types of information, such as usernames, port numbers and file names. So before you go on your quest to find the information, make sure you know all the shapes and forms in which it may appear and that you have included all of them in your list. With this in mind, our search for hostnames now includes the following:

 

Information shape | Example | Found for example in
Hostnames | fusionapps.mycompany.com | Properties files
HTTP URLs | http://fusionapps.mycompany.com:10614 | Web service connections
JDBC URLs | jdbc:oracle:thin:@fusiondb.mycompany.com:1521/fusiondb | Database connections
LDAP URLs | ldap://idm.mycompany.com:3060 | Identity store connections
Database connect strings | (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=fusiondb.mycompany.com)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=fusiondb))) | Database connections
Hostname:port pairs | fusionapps.mycompany.com:10613 | Oracle HTTP Server configuration files
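
As a minimal sketch of how these shapes can be matched in code, the Java snippet below uses regular expressions to pull hostname candidates out of some of the shapes listed above (the patterns and class name are our own illustration and are not exhaustive):

    import java.util.Arrays;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class HostnameShapes {

        // Host portion of http(s)/ldap(s) URLs, e.g. http://fusionapps.mycompany.com:10614
        private static final Pattern URL_HOST =
                Pattern.compile("(?:https?|ldaps?)://([^:/\\s]+)");

        // Host in thin JDBC URLs, e.g. jdbc:oracle:thin:@fusiondb.mycompany.com:1521/fusiondb
        private static final Pattern JDBC_HOST =
                Pattern.compile("jdbc:oracle:thin:@(?://)?([^:/\\s]+)");

        // HOST=... entries inside database connect strings
        private static final Pattern TNS_HOST =
                Pattern.compile("\\(\\s*HOST\\s*=\\s*([^)\\s]+)\\s*\\)");

        public static void printHostnames(String text) {
            for (Pattern shape : Arrays.asList(URL_HOST, JDBC_HOST, TNS_HOST)) {
                Matcher m = shape.matcher(text);
                while (m.find()) {
                    System.out.println(m.group(1));
                }
            }
        }

        public static void main(String[] args) {
            printHostnames("http://fusionapps.mycompany.com:10614");
            printHostnames("jdbc:oracle:thin:@fusiondb.mycompany.com:1521/fusiondb");
            printHostnames("(ADDRESS=(PROTOCOL=TCP)(HOST=fusiondb.mycompany.com)(PORT=1521))");
        }
    }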

 

Where to find it?

In a perfect world, all configuration information for a given system would be managed through a single web-based console, but we all know the IT world is far from perfect. In reality, modern systems span multiple products, often built on different technologies, with each product maintaining its own configuration, including database connections, identity system connections, API endpoint connections, etc.

In our hostname example, in order to find all hostnames (and other shapes) used in Fusion Applications, we first have to list all the possible sources that can contain hostname references. Among others, they include:

 

Source type | Notes
Database connections |
Identity management connections | E.g. SSO provider, identity store (LDAP), policy store/credential store (LDAP)
HTTP connections | E.g. web service URLs, map URLs, generic HTTP URLs
Frontend HTTP endpoint configuration | The hostnames defined at the HTTP server level for incoming requests
Search crawler connections | External connections used to retrieve search indexing data

 

The list above gives an idea of where to look, but in order to come up with the detailed list of all the specific places that can contain hostname references, a combination of the methods below is needed, at least once, when building the list of sources:

 

  • Product documentation detailing configuration points: this is often available for parts of the system, but a comprehensive list is rarely available. Even when one is provided by the software vendor, configuration changes made by the customer (e.g. extensions, integrations, customizations) will often add more configuration points and, consequently, hostname references.
  • Information from domain experts: even when documentation is not available from the software vendor, engineers often create their own lists and publish them in blogs or whitepapers. While unofficial, this source of information is valuable as a way to validate the list of sources.
  • Scanning the environment: before inspecting the IT system, it may be a good idea to perform system scans to obtain a list of all places where a piece of information is found (see the sketch after this list). I’ll provide details about this technique in another blog post. This activity does not have to be performed every time you want to perform discovery, but should be repeated whenever there is a significant change in the environment being discovered, such as an upgrade or an extension.
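
As a rough illustration of such a scan, assuming a plain file-system sweep (the directory and hostname below are made up), a simple Java walk over a configuration tree can list every file that mentions a given hostname:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class ConfigScan {
        public static void main(String[] args) throws IOException {
            Path configRoot = Paths.get("/u01/app/fa/config");  // hypothetical root
            String needle = "fusiondb.mycompany.com";           // hostname to look for
            try (Stream<Path> files = Files.walk(configRoot)) {
                files.filter(Files::isRegularFile)
                     .filter(f -> {
                         try {
                             // Reads each file fully with the default charset;
                             // fine for a sketch, not for huge or binary files.
                             return new String(Files.readAllBytes(f)).contains(needle);
                         } catch (IOException e) {
                             return false; // skip unreadable files
                         }
                     })
                     .forEach(System.out::println); // each file referencing the host
            }
        }
    }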

 

How to get to it?

Now that we know all the information we need (in its various shapes and forms) and where to get it from, we need to find a way to get to it. Once again, having a web-based console to get to the information really helps when performing manual discovery. A tool like Oracle Cloud Control can aggregate information from multiple systems/consoles and make this process a lot easier.

If the information is not readily available in a central location, one can always obtain each piece manually by:

  • Invoking command-line tools, e.g. SQL*Plus
  • Getting it from a web-based console or web page
  • Inspecting system files

 

However, the information we are looking for is so vast and detailed that 1. doing it manually would take way too much time, and 2. only going directly to the source (APIs, command-line tools, LDAP directories and sometimes all the way down to database tables and configuration files) guarantees we will not miss anything. Since the information we are looking for lives in a variety of sources, we will need a variety of access methods, e.g.:

 

Access method | Used to access
SQL / JDBC | Database resources, e.g. transactional data, some configuration data
LDAP | LDAP directories, for identity (user and group) information
HTTP | Web-based APIs (web services, REST, etc.)
JMX / T3 | MBeans (to access WebLogic Server configuration, for example)
File system | Files
Process invocation | Output from command-line tools

Most programming languages, including Java, have APIs for all of these, which makes it possible to automate them.
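
For example, a minimal sketch of the LDAP access method using Java's standard JNDI API might look like this (the directory URL, search base and filter are hypothetical):

    import java.util.Hashtable;
    import javax.naming.Context;
    import javax.naming.NamingEnumeration;
    import javax.naming.directory.InitialDirContext;
    import javax.naming.directory.SearchControls;
    import javax.naming.directory.SearchResult;

    public class LdapAccess {
        public static void main(String[] args) throws Exception {
            Hashtable<String, String> env = new Hashtable<>();
            env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
            env.put(Context.PROVIDER_URL, "ldap://idm.mycompany.com:3060"); // hypothetical
            // Credentials omitted for brevity; a real directory will require them.

            InitialDirContext ctx = new InitialDirContext(env);
            SearchControls controls = new SearchControls();
            controls.setSearchScope(SearchControls.SUBTREE_SCOPE);

            // Hypothetical search base and filter: list all users under the base DN
            NamingEnumeration<SearchResult> results =
                    ctx.search("cn=Users,dc=mycompany,dc=com", "(objectClass=person)", controls);
            while (results.hasMore()) {
                System.out.println(results.next().getNameInNamespace());
            }
            ctx.close();
        }
    }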

Once the information has been obtained, it may have to go through some basic processing to extract the important elements from it – in our example, extracting the hostname from an HTTP URL or from a configuration (properties) file. When performing discovery manually, this naturally becomes part of the process; when automating, however, it must be invoked as a separate task. Some of the ways this can be done are listed below:

 

Text format | Processing method | Description
Any text | Regular expressions | Regex can extract information from literally any text; however, for structured text there may be other, optimized ways to do it (see examples below)
XML | XPath | XML is widely used as a format for configuration files, and XPath is certainly the best way to obtain specific information from XML data
JSON | JavaScript / JSONPath | JSON data can be processed directly in JavaScript or with alternatives such as JSONPath
CSV | ODBC, etc. | Comma-separated values is a widely used format for storing tabular data and is compatible with popular tools like Excel
Properties files | Regular expressions / Java Properties class | Properties files are widely used for storing system configuration
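
As an example of the XPath case, a minimal Java sketch could look like the following (the file name and element name are made up, loosely modeled on a WebLogic configuration file):

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class XmlHostnames {
        public static void main(String[] args) throws Exception {
            // Hypothetical configuration file containing <listen-address> elements
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse("config.xml");

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Select the text of every listen-address element, wherever it appears
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//listen-address/text()", doc, XPathConstants.NODESET);

            for (int i = 0; i < nodes.getLength(); i++) {
                System.out.println(nodes.item(i).getNodeValue());
            }
        }
    }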

 

 

Make it repeatable

Once you define the access method and processing method for each source, it is very important to ensure that the access and processing of the information can be encapsulated into a repeatable step. We wouldn’t want all the thought that went into getting that information once to go to waste. This can be done through documentation and/or the creation of code units that can be run independently, normally through scripts.

In the Discover tool, we call these units “operations” and we implement them as Java code. But an operation is, in reality, an abstract concept that defines a contract (inputs and outputs, plus the action it performs) and can be implemented in a variety of ways. We will discuss how we are coding operations and their features in more detail in another blog post. For now, all you need to know is that they are units of documentation or code that allow you to obtain a specific set of information from a given source.
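
To make the contract idea concrete, here is a minimal sketch of what such a contract could look like in Java; this is our own illustration, not the Discover tool's actual interface:

    import java.util.List;
    import java.util.Map;

    // An operation takes named inputs and produces named outputs; how it does
    // so (manual steps, a script, Java code) is an implementation detail.
    public interface Operation {
        String name();                 // e.g. "Get JDBC URLs from a WebLogic Domain"
        List<String> inputNames();     // e.g. console URL, username, password
        Map<String, Object> execute(Map<String, Object> inputs) throws Exception;
    }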

 

Definition of an operation:

[Figure: definition of an operation – its inputs, outputs and the action it performs]

Here are a couple of examples of operations:

 

Operation 1: Get a list of all JDBC URLs used by data sources in a WebLogic Domain
Inputs: WebLogic Console URL, username and password for a given domain
Outputs: list of JDBC URLs
Procedure:

  • Go to WebLogic Console using the given URL, username and password
  • Navigate to the data sources page and click on one of the data sources
  • Click on the Connection Pool tab
  • Write down the JDBC URL used
  • Repeat for each data source

 

This is a manual operation; note that the procedure describes the steps a person must go through to perform it.

 

Operation 2: Get a list of all database links in a given database
Inputs: Hostname, port, service name, dba username and password for a given database
Outputs: list of database links
Procedure: getDbLinks.sql script 

This is an automated operation, which uses a SQL script to obtain the information needed.
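
The contents of getDbLinks.sql are not shown here, but an equivalent automated operation could be sketched in Java/JDBC as follows (the connection details are the hypothetical ones used throughout; the query reads the standard DBA_DB_LINKS dictionary view):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class GetDbLinks {
        public static void main(String[] args) throws Exception {
            // Hypothetical connection details: host, port, service name, dba user
            String url = "jdbc:oracle:thin:@fusiondb.mycompany.com:1521/fusiondb";
            try (Connection conn = DriverManager.getConnection(url, "system", "password");
                 Statement stmt = conn.createStatement();
                 // DBA_DB_LINKS lists every database link; the HOST column holds
                 // the connect string, which often contains a hostname
                 ResultSet rs = stmt.executeQuery(
                         "SELECT owner, db_link, host FROM dba_db_links")) {
                while (rs.next()) {
                    System.out.println(rs.getString("db_link") + " -> " + rs.getString("host"));
                }
            }
        }
    }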

Note that the contract defined in both (inputs, outputs, what it does) has the same format, even though operation 1 is manual and operation 2 is automated. This is a key aspect because 1. the process of harvesting information may mix automated and manual steps, and 2. it facilitates process design and the transition from a manual process to an automated one, so that complex steps can initially be performed manually and be automated over time.

 

Stay tuned for part 2, where we will discuss how to collect the information, analyze it, verify it and present it so that it can be used by other processes.

 

 
