The on-premises connectivity agent enables you to create integrations between on-premises applications and Oracle Integration Cloud Service (ICS).
For details on agents, see the documentation linked below:
Integration Cloud Service (ICS) Connectivity Agent
In this blog, I explain some problems that may be encountered when running on-premise integrations that use the connectivity agent. We will look at some important agent-related concepts, how to troubleshoot issues and, more importantly, how to avoid them.
In the remainder of this blog, I use a database as the example of an on-premise end system. In reality, the end system could be any on-premise system that needs a Connectivity Agent to work with Integration Cloud. Refer to the Agent documentation for the list of adapters that can run on-premise with an agent.
Note that this blog is mainly about the ICS Connectivity Agent, although some of the discussion also holds for the new OIC lightweight Connectivity Agent.
I will point out significant changes and improvements available in the OIC Connectivity Agent in the relevant sections as an "OIC Note".
Refer here if you are looking for Oracle Integration Cloud (OIC) documentation
To design optimal and performant on-premise integrations, it is important to understand the high-level architecture of the agent and how it interacts with ICS and the end systems. See the figure below for a high-level architecture.
As can be seen in the diagram above, the Connectivity Agent acts as a gateway from on-premise to ICS. All communication is initiated by the agent towards Integration Cloud and not vice versa. ICS does not initiate any outbound connection to the agent.
The agent posts a regular heartbeat to ICS to signal that it is alive, and this is reflected as a "green" agent health status in the ICS monitoring console.
In addition, the agent continuously polls ICS for any design-time and runtime work that needs to be processed on-premise.
The design-time work includes 'Test Connection', 'Activation' and 'Deactivation' requests. The runtime work comprises processing invoke messages that need to be sent to on-premise systems such as a database, E-Business Suite, or private SOAP or REST endpoints.
The runtime work also includes trigger messages that originate on-premise (for adapters configured as a trigger in flows).
The figures below show the sequence of message flow for the trigger and invoke scenarios.
Note the 240s timeout on ICS, measured from the time a request becomes available for the agent to the time the on-premise response returns to ICS. We will discuss this further in the next section.
The agent pulls down invoke messages from Integration Cloud that need to be processed on-premise. Similarly, it posts trigger messages originating on-premise to Integration Cloud.
The agent makes HTTPS REST calls to Integration Cloud for all its communication.
For invoke messages, Integration Cloud waits for a maximum of 240s after it has posted the request for the agent. The agent needs to pull this work, execute it against the database and return the response within this timeout period. Refer to the 'Invoke Flow diagram' above. If no response or fault is received within this time, the flow instance times out and fails with the error below (seen in the ICS diagnostic logs).
Invoke JCA outbound service failed with application error, exception: com.bea.wli.sb.transports.jca.JCATransportException:
oracle.cloud.cpi.agent.transport.aq.CpiAQException: Message not received within 240 seconds of wait interval.
It is important to note that the agent or database could continue to process the message on-premise even after ICS has timed out at 240s.
If and when the DB eventually returns a response (for example, after 9 minutes), it is lost because there is no consumer on Integration Cloud waiting to pick it up.
Responses after 240s do not make it to ICS.
It is important to understand the consequence here. Any response that reaches ICS later than 240s causes that specific flow instance to fail with the 240s timeout fault.
It also means that agent and database resources are spent computing a response that never makes it to ICS.
Also, note that whenever ICS monitoring shows the 240s timeout error, it is an outcome and not the cause. You will almost always need to check the agent logs to determine the cause of the issue.
Note: The 240s timeout error is an outcome, not the cause.
Also, note that this 240s timeout is not configurable.
With this background information covered, we are now ready to look at some real-world agent issues that can arise because of incorrect design, message payload sizes, endpoint performance, network topology, etc.
As we saw in the 240s timeout discussion above, any on-premise operation invoked through the agent should complete and return a response within 240s.
Hence, database CRUD operations or DB stored procedures (storedProc) invoked in ICS flows should be tuned for performance.
Database table indexes and AWR reports are some of the DB tools that can improve the performance of database tasks considerably.
If the execution does indeed take longer than 240s, then the flow should be designed using an async pattern instead.
One way to employ an async design is using a temp results table on the database for storedProc integrations.
The DB storedProc can be invoked from ICS as a one-way operation. The storedProc then stores the results in a temp table on completion.
A DB trigger Flow can be used to poll the temp table for results and process them when they become available.
This ensures that the integration flow is now decoupled from the database execution times.
Other patterns like the Parking Lot Pattern can be employed to design async DB integrations.
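The results-table pattern described above can be sketched in a few lines. This is only an illustration: it uses an in-memory SQLite database as a stand-in for the on-premise database, and the table name `job_results`, the `processed` flag and both function names are hypothetical. In a real integration the first function would be the stored procedure invoked one-way from ICS, and the second would be the polling done by the DB trigger flow.

```python
import sqlite3


def create_results_table(conn):
    # Hypothetical results table; 'processed' marks rows already picked up.
    conn.execute(
        "CREATE TABLE job_results ("
        " job_id INTEGER PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " processed INTEGER NOT NULL DEFAULT 0)"
    )


def long_running_job(conn, job_id, payload):
    """Stand-in for the stored procedure: it stores its result on completion
    instead of returning it to the caller."""
    conn.execute(
        "INSERT INTO job_results (job_id, payload) VALUES (?, ?)",
        (job_id, payload),
    )
    conn.commit()


def poll_results(conn, batch_size=10):
    """Stand-in for the trigger flow: fetch unprocessed rows, then mark them
    so that the next poll does not return them again."""
    rows = conn.execute(
        "SELECT job_id, payload FROM job_results"
        " WHERE processed = 0 ORDER BY job_id LIMIT ?",
        (batch_size,),
    ).fetchall()
    conn.executemany(
        "UPDATE job_results SET processed = 1 WHERE job_id = ?",
        [(job_id,) for job_id, _ in rows],
    )
    conn.commit()
    return rows
```

Because the caller never waits for the job to finish, the time the job takes no longer counts against the 240s window; only the short polling query does.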
In rare cases, serious system errors such as OutOfMemoryError and StackOverflowError during runtime processing can cause the agent's runtime consumer to terminate.
Such a condition may cause all future processing of runtime messages to fail with 240s timeout errors.
When this happens, the agent needs a restart in order to resume runtime message processing. Note that during such incidents Test Connection could still work fine, and the agent status on the Integration Cloud monitoring console could still show up as green. This is because the agent's heartbeat and design-time consumers are still working.
OutOfMemoryError and StackOverflowError are examples of system errors that indicate a serious error condition and can cause the agent's runtime consumers to fail.
When this happens, the cause of the issue is usually external to the agent and needs to be corrected.
Huge payloads exceeding 5 MB returned from the DB are one of the causes that can trigger such system exceptions.
Care should be taken during design, development and testing of integrations to ensure that certain inputs or external conditions do not send massive payloads into the agent.
It is prudent to handle data boundary conditions in order to protect not just the connectivity agent but also the database and ICS from receiving unexpectedly large payloads.
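One way to handle such boundary conditions is to cap the size of each batch before it is handed over. The sketch below is an assumption-laden illustration, not part of the agent: it measures size as JSON-serialized bytes, and the function names are hypothetical. The 5 MB ceiling is taken from the figure mentioned above.

```python
import json

MAX_PAYLOAD_BYTES = 5 * 1024 * 1024  # ceiling assumed from the 5 MB figure above


def row_size(row):
    """Size in bytes of one row when serialized as JSON."""
    return len(json.dumps(row).encode("utf-8"))


def chunk_rows(rows, max_bytes=MAX_PAYLOAD_BYTES):
    """Split rows into consecutive batches whose JSON array stays within
    max_bytes. A single row larger than max_bytes still gets a batch of its
    own, so callers should check for and reject such rows separately."""
    batches, batch, size = [], [], 2  # 2 bytes for the enclosing "[" and "]"
    for row in rows:
        extra = row_size(row) + (2 if batch else 0)  # ", " between items
        if batch and size + extra > max_bytes:
            batches.append(batch)
            batch, size = [], 2
            extra = row_size(row)
        batch.append(row)
        size += extra
    if batch:
        batches.append(batch)
    return batches
```

The same idea can be applied on the database side, for example by paging a query so that a single invoke never returns an unbounded result set.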
Monitor the agent logs for errors like the one below, and restart the agent when they occur.
[2018-10-08T02:00:12.321-06:00] [AdminServer] [ERROR]  [oracle.cloud.cpi.agent.transport.AQRuntimeConsumer] [tid: pool-20-thread-1] [userId: <anonymous>] [ecid:
4393c8c0-8981-7ba12f4cb97e-0000000a,0] [APP: agent-webapp] Throwable encountered in AQRuntimeConsumer. Please contact Oracle Support to resolve this issue by providing these logs.
This issue would kill this thread and agent would need to be restarted to resume normal functionality.[[
This log monitoring and restart could be automated using scripts if required.
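A minimal sketch of such a log check is shown here. It only detects the condition; the search string is taken from the sample log message above, while the function names and the idea of passing the log path as an argument are assumptions. The actual restart command depends on how the agent was installed and is deliberately left out.

```python
import re

# Message text taken from the sample agent log entry above.
FATAL_PATTERN = re.compile(r"Throwable encountered in AQRuntimeConsumer")


def find_fatal_consumer_errors(log_lines):
    """Return the log lines that report a terminated runtime consumer."""
    return [line.rstrip("\n") for line in log_lines if FATAL_PATTERN.search(line)]


def needs_restart(log_path):
    """True if the agent log at log_path contains the fatal consumer error."""
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        return bool(find_fatal_consumer_errors(handle))
```

A scheduler such as cron could call `needs_restart` periodically and trigger the site-specific restart procedure when it returns True. A production script would also need to remember which entries it has already acted on, so that one old error does not cause repeated restarts.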
OIC Note: With the new OIC agents, message payloads of up to 10 MB are supported through the use of compression. In addition, with OIC agents, an overly large payload is rejected and does not affect the runtime consumer. The agent continues to run and process further messages in the event of such system errors.
Exhaustion of agent worker threads is another possible cause of 240s timeouts on integration flow instances.
The connectivity agent uses worker threads, which are responsible for processing on-premise executions.
Resource contention among these worker threads can cause invoke requests to queue up, making them more likely to exceed the 240s timeout.
This condition usually occurs when an end system such as the database is experiencing periods of slow performance. It can be further aggravated by peak loads and a large number of concurrent requests.
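A back-of-the-envelope model shows why queuing pushes requests past the timeout. It is a deliberate simplification under stated assumptions: all requests arrive at the same moment, every request takes the same time on the end system, and the worker count passed in is hypothetical, since the agent's actual thread pool size is not stated here.

```python
ICS_TIMEOUT_S = 240  # fixed ICS wait time for an invoke response


def completion_time(position, workers, service_time_s):
    """Seconds until the request at 0-based queue position finishes, when
    requests are served in waves of 'workers' at a time."""
    wave = position // workers
    return (wave + 1) * service_time_s


def timed_out_requests(concurrent, workers, service_time_s, timeout_s=ICS_TIMEOUT_S):
    """Count the requests that would finish after the timeout."""
    return sum(
        1
        for position in range(concurrent)
        if completion_time(position, workers, service_time_s) > timeout_s
    )
```

For example, with 40 concurrent requests, 8 workers and a database call that takes 60s, `timed_out_requests(40, 8, 60)` reports 8 timeouts, because the fifth wave finishes at 300s. If the database call takes 20s, the same load finishes within 100s and nothing times out.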
To work around this problem on ICS, multiple agents can be deployed, each catering to a different set of integrations. This helps balance the load by providing more agent resources.
Note that ICS does not support running multiple agents in a HA topology. ICS allows one agent per agent group. Multiple agents should register to different agent groups and hence will execute different integrations at runtime.
OIC Note: A new feature in OIC is Agent High Availability (HA). Agent HA provides an option to run multiple OIC agents in Active-Active mode.
The HA agents subscribe to the same agent group and help with load balancing and scaling for performance.
Agent HA also helps mitigate the risk of a single point of failure in the integration architecture.
Note: The OIC agent HA feature is in controlled availability (as of 18.4.1). Raise a Support SR to verify and enable the Agent HA feature for your OIC instance.
In on-premise networks that use a network proxy, the agent is capable of routing ICS-bound requests and on-premise requests through a proxy server.
Scheduled maintenance activities on the proxy can lead to dropped connections between the agent and ICS.
Connectivity from the agent host should be tested after scheduled proxy server maintenance, and the agent should be restarted if required.
Check the agent status on the Integration Cloud monitoring console.
Also, refer to the following Support Doc Note for an agent patch that can help with an SSL issue when using a proxy server.
* Constant Disconnection between Agent Server and Other Integration Services (Doc ID 2387828.1)
From a network perspective, the agent needs to be installed on-premise on the same network as the end system (say, the database).
Ideally, there should be no firewall in the network route between the agent and the database.
If the on-premise network is such that agent-to-database requests need to go through a network firewall, then firewall timeouts could cause disruptions between the agent and the database.
Firewall timeouts could close the agent's pooled database connections. This causes the agent to retry and renegotiate connections with the database, leading to frequent 240s timeouts.
The diagram below shows a case where a single agent is shared to communicate with two databases, one of which is on a different network behind a firewall.
Such network topologies should be avoided; a separate agent should be installed on each network.
Note that this is not a recommended topology. The agent should be co-located in the same network as the end system for best performance.
Ideally, in the above topology, network #2 should host a separate agent to communicate with database #2, as shown below.
Agent Logs and Integration Cloud diagnostic logs are useful in troubleshooting the cause of errors.
In many cases, there is a need to correlate between Integration cloud logs and Agent logs to track the processing of a flow instance.
Note that in some cases these logs could be written in different time zones, and you will need to factor in the time difference.
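When correlating the two sets of logs, it helps to normalize both timestamps to UTC first. The sketch below assumes that agent log timestamps carry an explicit offset (as in the `-06:00` samples in this post) and that the ICS access log timestamps are in UTC. The second assumption is a guess that you should verify for your own instance; the function names are hypothetical.

```python
from datetime import datetime, timezone


def agent_log_time_utc(stamp):
    """Parse an agent log timestamp such as 2018-10-08T00:21:56.350-06:00
    and convert it to UTC."""
    return datetime.fromisoformat(stamp).astimezone(timezone.utc)


def access_log_time_utc(stamp):
    """Parse an access log timestamp such as '2018-10-08 06:22:00'.
    ASSUMPTION: the access log is written in UTC."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
```

With both values in UTC, subtracting one from the other gives the real elapsed time between an agent log entry and an ICS log entry, regardless of the offsets the two logs were written with.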
The ICS AdminServer access log (part of the diagnostic zip file downloaded from the Monitoring dashboard)
shows the heartbeat REST calls from the agent to ICS.
A sample is shown below.
2018-10-07 05:49:49 0.021 39 POST /icsapis/agent/1.0.0/monitor 200 "rT8DyY0003eo0000Cf" "1.005TtoI_tH44yk4_rT8DyY0003eo0000Cf" - "XX.XX.XX.XX"
2018-10-07 05:50:04 0.037 39 POST /icsapis/agent/1.0.0/monitor 200 "rT8DyY0003eo0000EL" "1.005TtoJV7Tg4yk4_rT8DyY0003eo0000EL" - "XX.XX.XX.XX"
2018-10-07 05:50:19 0.021 39 POST /icsapis/agent/1.0.0/monitor 200 "rTl3iY0005OR0000PX" "1.005TtoKOTHk4qm4_rTl3iY0005OR0000PX" - "XX.XX.XX.XX"
If these are missing for the current period, then it indicates loss of connectivity between agent and ICS.
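Checking for missing heartbeats by eye is tedious, so here is a rough sketch of how the access log could be scanned for gaps. It assumes the field layout of the sample lines above (date, time, two numeric fields, HTTP method, path, status). The 60-second threshold is an arbitrary choice made for this example, picked because the sample shows heartbeats roughly 15 seconds apart.

```python
from datetime import datetime, timedelta


def heartbeat_times(access_log_lines):
    """Extract timestamps of agent heartbeat calls (POSTs to the monitor resource)."""
    times = []
    for line in access_log_lines:
        fields = line.split()
        if len(fields) >= 7 and fields[4] == "POST" and fields[5].endswith("monitor"):
            times.append(
                datetime.strptime(f"{fields[0]} {fields[1]}", "%Y-%m-%d %H:%M:%S")
            )
    return times


def heartbeat_gaps(access_log_lines, max_gap=timedelta(seconds=60)):
    """Return (previous, next) timestamp pairs that are further apart than max_gap."""
    times = sorted(heartbeat_times(access_log_lines))
    return [(a, b) for a, b in zip(times, times[1:]) if b - a > max_gap]
```

Each returned pair marks a window in which ICS received no heartbeat, which is where to start looking in the agent logs and the on-premise network.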
Log on to the agent host and call the ICS 'status' REST API using curl or wget to check for any disruptions on the on-premise network.
Sample wget and curl commands:
wget --user ICS_USER --password PASSWORD -e use_proxy=yes -e https_proxy=PROXY_HOST:PROXY_PORT https://ICS_URL/icsapis/v2/status [--no-check-certificate]
curl -u ICS_USER:PASSWORD --proxy PROXY_HOST:PROXY_PORT https://ICS_URL/icsapis/v2/status
If the curl or wget commands fail when routed via the proxy, it is an indication of a network proxy issue.
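If you would rather run this check from a script, for example after each proxy maintenance window, the same request can be made with the Python standard library. This is a sketch under assumptions: the status path is taken to be `/icsapis/v2/status`, the host, credentials and proxy values are placeholders, and the function names are hypothetical.

```python
import base64
import urllib.request


def build_status_request(ics_host, user, password):
    """Build a Basic-auth GET request for the ICS status resource."""
    url = f"https://{ics_host}/icsapis/v2/status"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def build_opener(proxy_host=None, proxy_port=None):
    """Build an opener that routes through the given proxy, if one is supplied."""
    handlers = []
    if proxy_host:
        proxy = f"http://{proxy_host}:{proxy_port}"
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    return urllib.request.build_opener(*handlers)


def check_status(ics_host, user, password, proxy_host=None, proxy_port=None, timeout_s=30):
    """Return the HTTP status code; raises urllib.error.URLError (or its
    subclass HTTPError) when the proxy or ICS cannot be reached or rejects the call."""
    opener = build_opener(proxy_host, proxy_port)
    request = build_status_request(ics_host, user, password)
    with opener.open(request, timeout=timeout_s) as response:
        return response.status
```

Running `check_status` once with the proxy arguments and once without them helps separate a proxy problem from a problem with ICS itself.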
Look for log messages such as these in the agent diagnostic logs.
Verify that proxy authentication is provided correctly
Refer to Proxy Authentication section in the Agent Install A-team Blog
If the issue is not with proxy authentication, contact the on-premise network admin for resolution
Sample log statement below shows failure when agent connects using proxy server.
Failed to communicate with proxy: proxy.example.com/80.
Will try connection icsurl.integration.us2.oraclecloud.com/443
If the agent is configured with a network proxy, then the proxy is used to route both 'agent to ICS' and 'agent to on-premise' HTTP communications. Remember to add your on-premise system host names to the non-proxy hosts (-nphosts) when installing the agent to avoid intranet connectivity issues.
The ICS managedServer access log (part of the diagnostic zip file downloaded from the Monitoring dashboard)
shows the dequeue requests from the agent hitting ICS. These are GET requests from the agent checking for design-time and runtime work to be performed.
Recall the polling shown in the message flow diagrams above.
2018-10-08 00:27:04 4.018 5 GET /integration/flowsvc/agent/v1/AQResource/dequeue 200 "4393c8c0-8981-0000000a" "j1RP_UOTj6TC" - "XX.XX.XX.XX"
2018-10-08 00:27:09 4.063 5 GET /integration/flowsvc/agent/v1/AQResource/dequeue 200 "4393c8c0-8981-0000000a" "RSjLPSpLPT_G" - "XX.XX.XX.XX"
2018-10-08 00:27:13 4.026 5 GET /integration/flowsvc/agent/v1/AQResource/dequeue 200 "4393c8c0-8981-0000000a" "SZLRLPpLPT_G" - "XX.XX.XX.XX"
If these are missing from the access logs, it shows that the design-time and runtime consumers are not running.
Additionally, the sample log statements below, taken from the agent diagnostic logs, indicate whether the agent is actively receiving ICS requests.
Agent receiving design-time requests
[2018-09-30T23:57:02.957-06:00] [AdminServer] [NOTIFICATION]  [oracle.cloud.cpi.agent.transport.AQConsumer] [tid: Thread-54] [userId: ] [ecid: 476d0078-0000000a,0] [APP: agent-webapp] AQ Message received with ID 7725796220aa2a3
Agent receiving runtime requests
[2018-10-08T00:21:56.350-06:00] [AdminServer] [NOTIFICATION]  [oracle.cloud.cpi.agent.transport.AQRuntimeConsumer] [tid: pool-20-thread-1] [userId: ] [ecid: 4393c8c0--0000000a,0] [APP: agent-webapp] AQ message received with ID 77b220a004a
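To see at a glance whether both consumers are active, entries like the two above can be counted per consumer class. The sketch below assumes the bracketed field layout of the sample lines (timestamp, server, level, logger) and tolerates an optional empty `[]` field before the logger. The function name and the decision to key on the text "message received" are choices made for this example.

```python
import re
from collections import Counter

# Matches the leading bracketed fields of an agent diagnostic log entry.
LOG_PREFIX = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] \[(?P<server>[^\]]+)\] \[(?P<level>[^\]]+)\]\s+"
    r"(?:\[\]\s+)?\[(?P<logger>[^\]]+)\]"
)


def consumer_activity(log_lines):
    """Count 'message received' entries per consumer class name."""
    counts = Counter()
    for line in log_lines:
        match = LOG_PREFIX.match(line)
        if not match or "message received" not in line.lower():
            continue
        # Keep only the class name, e.g. AQConsumer or AQRuntimeConsumer.
        counts[match.group("logger").rsplit(".", 1)[-1]] += 1
    return counts
```

A count of zero for AQRuntimeConsumer over a period in which AQConsumer is still receiving messages matches the situation described earlier, where design-time requests work but runtime processing has stopped.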
Follow up reading