OAM 11g Webgate Tuning

INTRODUCTION

This post is part of a larger series on Oracle Access Manager 11g called Oracle Access Manager Academy. An index to the entire series with links to each of the separate posts is available.

People are typically introduced to Webgate tuning in one of two ways: either they are forced into it by a crisis, or they are actively preparing an environment for aggressive load testing.  Hopefully you are in the latter group.  Unfortunately, there is still a lot of mystery behind tuning some of the Webgate parameters.  This article focuses on what I feel are the most important tuning parameters: 1) Max Connections, including the relationship between Max Connections and Max Number of Connections, 2) the Failover Threshold, and 3) the AAA Timeout Threshold.  If you can grasp the concepts behind these few key parameters, your chances of getting better performance and stability out of the Webgates and Access Servers will greatly increase.

Quick Overview

Knowledge in this article is based on extensive experience in the field, discussions with Oracle Webgate developers, and of course invaluable peers.  As mentioned in the introduction, I will break Webgate tuning into three areas to make it a little easier to digest.  The three parameters are not necessarily related to or dependent on each other, so you are free to jump to the section you are interested in.  However, I highly advise reading the entire article before making any major tuning changes.  Below is a screenshot of an 11gR2PS3 (OAM 11.1.2.3.0) Webgate definition that highlights the parameters I will cover plus any associated fields; all settings are R2PS3 default values.

 

img1_webgate_def

 

Worker or Pre-Fork Mode

OHS and Apache run in one of two modes, “Worker” or “Pre-Fork”. The default for OHS is Worker mode. With Apache it depends on how it was compiled, but the typical implementation uses Worker mode in newer 2.2.x+ versions.  Be sure to verify which mode you are running in, as this has a big impact on how you tune the Webgate.  I will focus only on Worker mode in this article, as very few customers still use Pre-Fork these days.

Worker mode uses multiple child processes with several threads for each process. Each thread handles one connection at a time.  Take the following example configuration from a typical httpd.conf.  There are four key directives that I will discuss: StartServers, MaxClients, ServerLimit, and ThreadsPerChild.

 

<IfModule mpm_worker_module>
     StartServers         2
     MaxClients         150
     ServerLimit          6
     ThreadsPerChild     25
     MinSpareThreads     25
     MaxSpareThreads    756
     MaxRequestsPerChild  0
     AcceptMutex fcntl
     LockFile
</IfModule>

 

As a quick primer, I want to explain each of the following directives so that you can relate them to the comments I make about how they correlate to Webgate tuning.

 

Directive        | Description
StartServers     | Start X number of child httpd processes on start up.
MaxClients       | The maximum number of connections that will be processed simultaneously.
ServerLimit      | The maximum number of httpd child processes the server will allow.
ThreadsPerChild  | The number of threads per child process that will do work.

 

I want to point out that ServerLimit is an important directive because it limits the number of httpd processes that Apache/OHS will run.  In worker mode the default limit is 16 if the directive is not set, but I have seen the process count grow to even more, so you will want to monitor it.  Determining the number of processes is simple; run the following command.

ps -ef | grep httpd

oracle 4891 4873 0 09:09 ? 00:00:00 /scratch/oracle/middleware/ohs_home/ohs/bin/httpd.worker -DSSL
oracle 4899 4891 0 09:10 ? 00:00:00 /scratch/oracle/middleware/ohs_home/ohs/bin/httpd.worker -DSSL
oracle 4901 4891 0 09:10 ? 00:00:00 /scratch/oracle/middleware/ohs_home/ohs/bin/httpd.worker -DSSL
oracle 5064 4891 0 09:10 ? 00:00:00 /scratch/oracle/middleware/ohs_home/ohs/bin/httpd.worker -DSSL

You will notice a parent PID, which is typically the first row or the lowest value in the PID column among the httpd processes.  Above, the parent is 4891, and there will only ever be one parent.  The child httpd processes each have their own PID, but their PPID value is the parent PID.  The parent does not serve requests itself; it mostly monitors and manages new and existing child httpd processes. It is also very important to understand that each httpd child process that is spawned opens its own set of OAP connections, as many as the Webgate Max Connections setting specifies.  For example, if the Webgate Max Connections is set to 8, then one httpd child process will immediately get 8 OAP connections, and if 4 child httpd processes are spawned, then each of the 4 will get 8 OAP connections.  That gives a total of 4 x 8 = 32 OAP connections from that Apache/OHS server with the Webgate plugin…more on this in the next section.
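To make the parent/child relationship concrete, here is a minimal Python sketch that takes the sample `ps -ef` output shown above, separates the parent from its children by PPID, and multiplies the child count by Max Connections. The function name and the Max Connections value of 8 are assumptions for illustration only. Note that the sample output has three children, so the total here is 3 x 8 rather than the 4 x 8 used in the text.

```python
# Sample `ps -ef` rows copied from the output above.
PS_OUTPUT = """\
oracle 4891 4873 0 09:09 ? 00:00:00 /scratch/oracle/middleware/ohs_home/ohs/bin/httpd.worker -DSSL
oracle 4899 4891 0 09:10 ? 00:00:00 /scratch/oracle/middleware/ohs_home/ohs/bin/httpd.worker -DSSL
oracle 4901 4891 0 09:10 ? 00:00:00 /scratch/oracle/middleware/ohs_home/ohs/bin/httpd.worker -DSSL
oracle 5064 4891 0 09:10 ? 00:00:00 /scratch/oracle/middleware/ohs_home/ohs/bin/httpd.worker -DSSL
"""


def split_parent_and_children(ps_text):
    """Return (parent_pid, [child_pids]) for the httpd rows of `ps -ef` output."""
    rows = []
    for line in ps_text.splitlines():
        fields = line.split()
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD
        if len(fields) >= 8 and "httpd" in fields[7]:
            rows.append((int(fields[1]), int(fields[2])))
    httpd_pids = {pid for pid, _ in rows}
    # The parent is the httpd row whose own parent is not another httpd process.
    parents = [pid for pid, ppid in rows if ppid not in httpd_pids]
    if len(parents) != 1:
        raise ValueError("expected exactly one parent httpd process")
    parent = parents[0]
    children = [pid for pid, ppid in rows if ppid == parent]
    return parent, children


MAX_CONNECTIONS = 8  # hypothetical Webgate Max Connections value

parent_pid, child_pids = split_parent_and_children(PS_OUTPUT)
total_oap = len(child_pids) * MAX_CONNECTIONS
print(f"parent={parent_pid} children={child_pids} OAP connections={total_oap}")
```

On a live server you could feed the function real `ps -ef` output instead of the pasted sample; the `grep httpd` row itself is skipped because its command column is `grep`, not an httpd binary.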

 

Max Connections — Not so Literal

I want to cover more detail on the Max Connection parameter, but first things first, we need to understand how connections work with web servers and how it relates to the Webgate module. Let’s focus on OHS which is basically Apache.  So if you have Apache assume everything I say will also apply.  However, if you use another supported 11g Webgate on a web server other than OHS or Apache, how connections work could be different so please extrapolate this information and apply it to your web server.

The Max Connections parameter can yield some big improvements in performance, but beware: increasing the value does not necessarily equate to increased performance and can in fact have a negative impact. The official Oracle OAM 11g 11.1.2 Administration Guide says, “Max Connections is the maximum number of connections that a Webgate can establish with the Access Server.” This statement is somewhat true, but it is not literal.  It may lead you to think that setting Max Connections to X will send only X connections to the Access Server, but that is an incorrect assumption. Now let’s cover some of the OHS/Apache directives I mentioned earlier to make sense of Max Connections and what their impacts are.

StartServers Directive

Remember earlier I mentioned that each child httpd process gets X number of Max Connections?  I also used the example that if Max Connections is set to 8, then each httpd child process will get 8 OAP connections.  So it makes sense that if Max Connections is changed to 24, then each httpd child process gets 24 connections.  Now let’s look at the StartServers directive. If the value is set to 2, then 2 httpd child processes will be spawned when OHS starts.  Expanding on that, if Max Connections is 24 and 2 httpd processes start up initially, then we will see 24 x 2 = 48 OAP connections coming from the Webgate of that OHS server; the math is that simple.

MaxClients Directive

This directive is fairly straightforward and does pretty much what the description in the table says.  MaxClients limits the number of simultaneous client connections OHS/Apache will handle. One important thing to know is how MaxClients is calculated, which is simply MaxClients = ServerLimit x ThreadsPerChild.  So if ServerLimit is 4 and ThreadsPerChild is 25 you would set MaxClients to 100, which is not a typical value and is quite small, but you get the idea.  It also means that if you want to increase MaxClients to, say, 512, you will have to adjust the other terms of the formula.  For example, if MaxClients is 512 and ThreadsPerChild is 64, then you will have to set ServerLimit to 8; i.e. MaxClients 512 / ThreadsPerChild 64 = ServerLimit 8.  Increasing or decreasing MaxClients only changes how much demand the web server will accept and does not by itself require adjusting any Webgate configuration, but it can affect other tuning directives and how the Webgate uses them; more on this in the next couple of sections.
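The formula above can be turned around to find the ServerLimit needed for a target MaxClients. The following Python sketch does that with a ceiling division; the helper name is made up for illustration, and rounding up when the numbers do not divide evenly is my assumption rather than something Apache does for you.

```python
def server_limit_for(max_clients, threads_per_child):
    """Return the smallest ServerLimit such that
    ServerLimit x ThreadsPerChild >= MaxClients."""
    if max_clients <= 0 or threads_per_child <= 0:
        raise ValueError("both values must be positive")
    # Ceiling division without importing math.
    return -(-max_clients // threads_per_child)


# The examples from the text, plus the sample httpd.conf (150 / 25).
print(server_limit_for(512, 64))  # 8
print(server_limit_for(100, 25))  # 4
print(server_limit_for(150, 25))  # 6
```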

ServerLimit Directive

Earlier I mentioned ServerLimit and how it sets a ceiling on the number of httpd child processes.  This directive is important to administrators who want to limit the resources the web server will use, because the more httpd child processes that are spawned, the more CPU, memory, file descriptors, and other resources are used.  For example, if ServerLimit is set to 8, then at most 8 httpd child processes will ever be created.  These httpd child processes are spawned only as demand on the web server requires.  Remember when I pointed out that StartServers kicks off the initial httpd children?  So if StartServers is 2 and ServerLimit is 8, then 2 of the 8 possible httpd child processes will start up when the web server starts.  At peak demand all 8 httpd child processes will be running, and each of them will open up its own Max Connections worth of OAP connections.  So if Max Connections is 8 we can calculate: Max Connections 8 x ServerLimit 8 = 64 OAP connections.
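As a rough illustration of the range you should expect to see on the Access Server side, this short Python sketch computes the OAP connection count at startup and at peak for the values used above. The function and constant names are just illustrative.

```python
def oap_connections(child_processes, max_connections):
    """Each httpd child opens its own Max Connections worth of OAP connections."""
    return child_processes * max_connections


START_SERVERS = 2    # httpd children spawned at startup
SERVER_LIMIT = 8     # ceiling on httpd children
MAX_CONNECTIONS = 8  # Webgate Max Connections

at_startup = oap_connections(START_SERVERS, MAX_CONNECTIONS)
at_peak = oap_connections(SERVER_LIMIT, MAX_CONNECTIONS)
print(f"startup: {at_startup} OAP connections, peak: {at_peak} OAP connections")
```

Multiply the peak figure by the number of web servers running the Webgate to estimate what the Access Servers must be able to accept in total.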

ThreadsPerChild Directive

Now it gets more interesting when we talk about ThreadsPerChild.  ThreadsPerChild is a directive that applies to worker mode, and it simply says how many threads will be started for each httpd child process.  In our example configuration ThreadsPerChild is 25, so each httpd child process that is started gets 25 threads.  The threads are what actually do the work.  Remember when I mentioned that Max Connections defines how many OAP connections are opened for each httpd child process?  Keep this in your head, because now that we have 25 threads for each httpd child process, those threads have to share the OAP connections.  A thread will grab a connection to do things like authentication requests, authorization requests, etc.  While a thread is using a connection, no other thread can use it.  Once a thread is finished with a connection, it tosses it back in the pool for other threads to use.  This is a very important concept to understand, because if you increase ThreadsPerChild without increasing the number of OAP connections, the threads can be starved under heavy http traffic, which ultimately creates congestion.  So although you may want to minimize memory usage on the OHS server by reducing the number of httpd child processes and increasing ThreadsPerChild to get a lot of work out of each process, you can cause so much congestion that bottlenecks form and the Webgate complains it cannot contact any Access Server.
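The sharing behavior described above can be modeled with a small Python simulation of one httpd child process: 25 worker threads contend for a pool of 8 connections, represented by a semaphore. This is only a toy model of the concept, not the Webgate's actual pooling code; the 10 ms sleep is an arbitrary stand-in for an OAP round trip.

```python
import threading
import time

THREADS_PER_CHILD = 25  # worker threads in one httpd child
MAX_CONNECTIONS = 8     # OAP connections available to that child

pool = threading.BoundedSemaphore(MAX_CONNECTIONS)
counter_lock = threading.Lock()
in_use = 0
peak_in_use = 0


def handle_request():
    """Borrow a connection, hold it for one simulated round trip, return it."""
    global in_use, peak_in_use
    with pool:  # blocks while every connection is checked out
        with counter_lock:
            in_use += 1
            peak_in_use = max(peak_in_use, in_use)
        time.sleep(0.01)  # stand-in for an authentication/authorization call
        with counter_lock:
            in_use -= 1


threads = [threading.Thread(target=handle_request) for _ in range(THREADS_PER_CHILD)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"peak connections in use: {peak_in_use} of {MAX_CONNECTIONS}")
```

However many threads you start, the peak never exceeds the pool size; the remaining threads simply wait, which is the congestion described above.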

 

img2_http_threads

 

 

Balancing Max Connections with Apache Directives

I am sure you are asking: so what is a good value for Max Connections?  Unfortunately there is no magical sweet spot, beyond calculating the sum of the Max Number of Connections from each primary Access Server (more on that in the next section).  The value needs to be determined by experimenting with load tests and recording results that can be compared to see which values give the best performance.  No two implementations are alike, and I have seen nearly as many different values as I have seen deployments.  That said, one consideration is that if ThreadsPerChild is high, the threads will need plenty of connections to satisfy the demand on a busy web server, so increasing Max Connections can help.  However, in my own tests I have found that if ThreadsPerChild is too high, even opening up a lot of Max Connections does not produce the performance you might expect.  Personally I have found a good balance to be a ServerLimit of around 16 and ThreadsPerChild of around 64, which puts MaxClients at 1,024 (MaxClients = ServerLimit x ThreadsPerChild).  These values seem to get good throughput, but again you have to load test and adjust the settings while recording the results to see what gives the best performance and throughput for your environment.  Before you decide on the Max Connections value, you need to read the next section.

 

Making the Connection to Max Connections

There is no pun in the connection between Max Connections and Max Number of Connections. In a nutshell, the value for the Max Connections parameter should be the sum of all the Max Number of Connections from each Primary Server. Take the following diagram as an example.

maxconn_01

The value for Max Connections in the diagram is 12. If you add up the Max Number of Connections from each of the three Primary Servers it totals 12 (4+4+4=12).

Let’s take another example, but this time change OAM 3 primary Access Server to a secondary server, and also update the Max Number of Connections value for each OAM Server from 4 to 6.

maxconn_02

The first thing I want to point out is that the secondary Access Server will not get requests from the Webgate until connections to any primary Access Server fall below the Failover Threshold; more on that later. Since we have two primary OAM servers with Max Number of Connections values of 6 each, the total Max Connections value for the Webgate would be 12 (6+6=12); it is pretty simple. Now that we understand how to get the value for the Max Connections parameter, you may be wondering what value to use for Max Number of Connections: 4, 6, 20, 100? Good question, and fortunately Chris Johnson wrote a great article on this very subject, “How many connections do I need from the WebGate to the OAM Server?”. Again, it must be called out that the number you define in the Webgate profile will be multiplied by the number of web server child processes to determine the actual number of connections, so a little can often go a long way!
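The rule of thumb from the two diagrams can be written down as a short Python sketch. The dictionary layout and function name are made up for illustration; the values come from the second example (two primaries and one secondary at 6 connections each), and the ServerLimit of 8 is borrowed from the earlier section.

```python
servers = [
    {"name": "OAM 1", "role": "primary", "max_number_of_connections": 6},
    {"name": "OAM 2", "role": "primary", "max_number_of_connections": 6},
    {"name": "OAM 3", "role": "secondary", "max_number_of_connections": 6},
]


def webgate_max_connections(server_list):
    """Max Connections = sum of Max Number of Connections over primary servers only."""
    return sum(
        s["max_number_of_connections"]
        for s in server_list
        if s["role"] == "primary"
    )


max_connections = webgate_max_connections(servers)

SERVER_LIMIT = 8  # assumed ceiling on httpd child processes
peak_oap = max_connections * SERVER_LIMIT

print(f"Max Connections: {max_connections}")
print(f"Peak OAP connections from one web server: {peak_oap}")
```

The second figure is why a small per-server value goes a long way: each httpd child opens the full Max Connections count.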

 

Does each Max Number of Connections need to be Symmetrical?

So far in my examples I have made each OAM server Max Number of Connections the same or symmetrical, but you don’t necessarily have to do that. You can optionally add more connections to different primary servers if you want more requests to go to any specific server. This strategy is basically a type of load balancing using the Webgate Max Number of Connections configuration value instead of using an actual physical load balancer appliance; take the following diagram as an example.

maxconn_03

Notice that OAM 1 primary server has 8 Max Number of Connections while OAM 2 and OAM 3 primary servers have 4 each. So the total Max Connections value would be 16 (8+4+4=16). In this particular configuration OAM 1 server would get double the number of connections from the Webgates as the other two primary OAM servers. One reason to do this would be that OAM 1 is a much larger server, more memory, etc. and can handle more traffic, or maybe OAM 1 is physically closer to the Webgate so it can process requests much faster. In reality even though this is an option, I have never really seen this in practice because normally all the servers have the equivalent sized hardware, are in the same network, and therefore there is no need to distribute more requests to any one server. That said, I did want to at least bring this up so you understand that there are options for various reasons if you so decide it makes sense.
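To see the weighting effect in numbers, this small Python sketch computes each primary server's share of the Webgate's connections for the asymmetric example. It assumes connections are spread in proportion to each server's Max Number of Connections, which is how the example above describes it; the variable names are illustrative.

```python
from fractions import Fraction

# Max Number of Connections per primary server, from the diagram above.
primaries = {"OAM 1": 8, "OAM 2": 4, "OAM 3": 4}

max_connections = sum(primaries.values())
shares = {name: Fraction(count, max_connections) for name, count in primaries.items()}

print(f"Max Connections: {max_connections}")
for name, share in shares.items():
    print(f"{name}: {float(share):.0%} of connections")
```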

 

The Skinny on Failover Threshold

The latest (at the time of this post) official 11g Access Manager documentation, in Table 16-3, “Elements on Expanded 11g and 10g WebGate/Access Client Registration Pages”, says the Failover Threshold parameter is a “Number representing the point when this Webgate opens connections to a Secondary OAM Server.” It also gives an example: if 30 were used as the value and the number of connections to primary servers drops to 29, connections begin to open to the secondary Access Server; the default value is 1. This description gives some idea of what is happening, but it offers no recommendations and some find it confusing. So I wanted to add some of my experience along with recommendations.

 

1. First, the word “Failover” in the parameter name means exactly what it says. As connections are lost from a primary OAM server, the Webgate will try to make up for them by connecting to a secondary OAM server; hence the word “Failover”.  So a big note here: this setting only works if there is at least one secondary OAM server defined in the Webgate profile. The Failover Threshold parameter does nothing if there is no secondary OAM server defined.

failover_02

2. Second, the word “Threshold” in the parameter name refers to the point at which connections begin to go over to the secondary OAM server(s).  Based on the official documentation, which is correct, if the Failover Threshold is set to 6 and the Max Number of Connections is also set to 6, then as soon as the number of connections going from the Webgate to the OAM server drops below the Failover Threshold of 6, connections will start to be sent to the secondary OAM server(s).  If there are two secondary OAM servers, the first in the list will be the one getting all the connections. As soon as the first secondary OAM server fills up its Max Number of Connections, the second secondary OAM server will start getting connections. Are you following?

So the big question is what is the best setting? My recommendation is twofold.

1. If you DO have Secondary OAM Servers configured:
Set the Failover Threshold value equal to the Max Number of Connections only if you have at least one secondary OAM server. Take my examples above, if the OAM server Max Number of Connections is 4, then set the Failover Threshold to 4. The reason for this is that you engage all the processing power needed as connections drop from any one primary OAM server since the secondary OAM server will start picking up the slack. As soon as the primary server having connection problems corrects itself, the Webgate will start failing back to the primary OAM server and slowly drop the connections from the secondary server until all the Max Number of Connections are met.

2. If you DO NOT have Secondary OAM Servers configured:
If you decide not to configure any secondary OAM server, you can leave the Failover Threshold at its default of 1 because it will never be used. Remember, Failover Threshold requires a secondary OAM server to be configured. In practice, most clients like to see all their hardware provide some value, which means keeping all of it working to get their money’s worth. So I will typically see all OAM servers configured as primary servers; there is nothing wrong with this. That said, I have also seen various configurations with a mix of primary and secondary servers in a crisscross fashion, which is a bit more complicated but certainly has merits too, depending on the situation.

If you follow either of the points above you should have a solid configuration.
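The failover behavior described in this section can be sketched as a toy model in Python. This is my own simplification of the description above, not the Webgate's actual algorithm: it assumes the Webgate tries to make up the shortfall between the Failover Threshold and the live primary connections, filling secondary servers in list order up to each one's Max Number of Connections. The function name and inputs are hypothetical.

```python
def secondary_connections(live_primary, failover_threshold, secondary_caps):
    """Toy model: how many connections open to each secondary, in list order.

    live_primary       -- connections currently established to the primary
    failover_threshold -- the Webgate Failover Threshold
    secondary_caps     -- Max Number of Connections of each secondary, in order
    """
    shortfall = max(0, failover_threshold - live_primary)
    opened = []
    for cap in secondary_caps:
        take = min(cap, shortfall)  # the first secondary fills up before the next
        opened.append(take)
        shortfall -= take
    return opened


# Threshold equals Max Number of Connections (6): any lost connection fails over.
print(secondary_connections(6, 6, [4, 4]))  # healthy primary -> [0, 0]
print(secondary_connections(1, 6, [4, 4]))  # 5 lost -> [4, 1]

# Default threshold of 1: secondaries are used only once every connection is gone.
print(secondary_connections(3, 1, [4, 4]))  # [0, 0]
print(secondary_connections(0, 1, [4, 4]))  # [1, 0]
```

Under this model, the default threshold of 1 leaves the secondary idle until the primary is completely unreachable, which is why the recommendation above raises the threshold to match Max Number of Connections.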

 

AAA Timeout Threshold

The AAA Timeout Threshold parameter setting determines how long the Webgate will wait on a connection response before it gives up and attempts to request a new connection. For example let’s say the Webgate has a connection opened, and a request comes through to validate some credentials. This process normally should take a fraction of a second, but there could be all sorts of variables to make this request take much longer. If the wait for the response is longer than the AAA Timeout Threshold, it will abandon the connection for that request, toss it back in the pool, and open a new connection to try again.

For most of OAM’s life (prior to R2 PS3), the default value for AAA Timeout Threshold was “-1” (minus one). The -1 is a special value that tells the Webgate to use the operating system’s TCP timeout, which could easily be 2 minutes or even more! I have seen actual cases in practice where something goes awry with an Access Server, and while the Webgate tries to connect to it or get some response from it, the Webgate keeps trying for a long time because the AAA Timeout Threshold was set to the default -1. As each connection waits for a very long time, the Webgate gets into a state that gives the impression it is down, when in reality it is doing what it was told, which was to wait a long time before retrying. When all the connections start doing this we have an OAM zombie apocalypse problem. Zombies are bad, but we can try to avoid this behavior by shortening that wait time.

The recommended value is anywhere from 5 to 10; this is in seconds. For example, if you set the AAA Timeout Threshold to 5, the Webgate will open its connection, send its request, and expect a response within 5 seconds. If none arrives, it opens a new connection and tries again, while the old connection is freed up and tossed back into the pool. If the value is set shorter, say 1 second, an authentication or authorization request could legitimately take longer than that because the Access Server is waiting for a long LDAP search to return. That would send us into a tailspin, because the request would never complete; there is simply not enough time allotted for such an LDAP search. So we have found that a value of 5 to 10 seconds is a fair and balanced approach.  In R2 PS3 the default is now 5 seconds, which is reasonable.

 

User-Defined Webgate Parameters

One parameter worth mentioning that many may not know about is “client_request_retry_attempts”. A description of this parameter can be found in the latest (at the time of this article) official Oracle online documentation at https://docs.oracle.com/cd/E40329_01/admin.1112/e27239/register.htm#AIAAG5856. The official description says: “WebGate-to-OAM Server timeout threshold specifies how long (in seconds) the WebGate waits for the OAM Server before it considers it unreachable and attempts the request on a new connection.” At first this seems similar to the AAA Timeout Threshold, but the difference is that this parameter is about how many times the WebGate will retry its request before attempting the secondary server.

So if the AAA Timeout Threshold is set to 5 seconds, the Webgate will time out a connection after 5 seconds without a response, while client_request_retry_attempts tells the Webgate how many times it will attempt the request. If the value is set to 2, then the Webgate will wait 5 seconds (assuming the AAA Timeout Threshold is set to 5), and if it times out it will try up to 2 times before timing out the connection. This configuration may be useful if you think the network connectivity between the Webgates and the Access Servers is not stable and you want the Webgate to try more than once before closing its connection.
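Putting the two settings together, the worst-case time a request can spend waiting on one unresponsive Access Server is roughly the timeout multiplied by the number of attempts. The Python sketch below shows that arithmetic; it assumes client_request_retry_attempts counts total attempts (so 2 means two tries in all), and the function name is hypothetical.

```python
def worst_case_wait_seconds(aaa_timeout_threshold, retry_attempts):
    """Upper bound on the wait for one request against an unresponsive server.

    Assumes retry_attempts is the total number of attempts, each of which
    may wait the full AAA Timeout Threshold.
    """
    if aaa_timeout_threshold < 0:
        # -1 defers to the operating system's TCP timeout, so there is
        # no fixed bound to compute from these two settings alone.
        raise ValueError("no fixed bound when AAA Timeout Threshold is -1")
    if retry_attempts < 1:
        raise ValueError("at least one attempt is required")
    return aaa_timeout_threshold * retry_attempts


print(worst_case_wait_seconds(5, 1))   # 5
print(worst_case_wait_seconds(5, 2))   # 10
print(worst_case_wait_seconds(10, 2))  # 20
```

This is worth keeping in mind when raising either value: the user-facing delay during an outage grows with their product.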

 

Summary

I realize there are a lot of details in this blog, but it is all very useful and you may need to read each section carefully to absorb the data.  I can say that tuning the Webgate profile is a very important part of an OAM deployment and can save you lots of late nights worrying about performance or outages.  Good luck and be sure to load test your configurations before going live.

Comments

  1. Hi Tim,

    Thanks much for this information.

    I understand the formula for configuring ‘Max Connections’.

    Do we need to stick to a different approach if we are using an OTD Webgate ?

    Cheers,
    Padma

    • Tim Melander says:

      Unfortunately I don’t have an answer, this blog is specific to OAM Webgates, I have no experience with the OTD Webgate. I do know comparing the OTD Webgate and OAM Webgate is most likely comparing apples to oranges. I would suggest opening up a SR and ask any tuning questions.

  2. sekhar. V says:

    Hi Tim,

    Max Connections 8 * ServerLimit 8 = 64 OAP connections.

    If each Serverlimit has 25 threads that would be 25 * 8 = 200 threads, are the 64 OAP connections shared by 200 threads under peak or full load?

  3. sekhar. V says:

    Hi Tim,

    I see there x number of active connections from the monitoring page of a webgate in oamconsole.
    Will these active connections give us any idea on how the web gate tuning parameters can be set. For example i have set the max connections as 5 and threshold as 5, but i do see OAM communication errors encountered by the webgate, This is an Webgate on OHS .

    Thanks
    -Sekhar

  4. sekhar. V says:

    Hi Tim,

    How do these max connection values differ when we mention a load balancing url instead of individual host names?
    I mean what is recommended to use, is it the load balancer url or individual host names as primary servers?

    Is setting primary servers as LB URL (under which there are servers OAM1 and OAM2) be same as setting primary servers as OAM1 and OAM2

    Thanks
    Sekhar

    • Tim Melander says:

      Hi Sekhar,

      I am assuming you mean when you use a load balancer virtual IP (VIP) to point to multiple Access Servers over TCP for the OAP traffic. So you would have 1 primary hostname defined in the Webgate definition, which is a VIP, and the load balancer distributes the requests to oam_server1 and oam_server2. So first of all, a very good question. The Webgate in this case thinks there is only 1 primary Access Server and therefore sends all Max Connections to that hostname, which in this case is a VIP. The load balancer is not going to split the OAP connections between oam_server1 and oam_server2, which means the Max Connections value defined in the Webgate is what is going to be sent to each Access Server. So if Max Connections is 8, then 8 OAP connections will be opened to both oam_server1 and oam_server2. Therefore, you will need to decide what to use for Max Connections when using a load balanced VIP versus a real hostname for each primary Access Server.

      • sekhar. V says:

        Hi Tim ,

        Thanks for the quick response.
        Yes VIP defined under primary server distributes the requests to oam_server1 and oam_server2.
        So how does the threshold work in this scenario, the web gate sees only one VIP which has say 8 max connections.

        But Behind the VIP each oam_server has 8 connections each, so how will the webgate know that the threshold value has reached?

        Another question i have, is there an advantage of using vip over individual host names in the primary server list?

  5. Tim,

    I hope all is well.

    Excellent article explaining the various options with WebGate HA options! At my client here, we have a two node OAM server cluster where based on the behavior I saw yesterday, node 1 must be primary and node 2 is secondary since our PROD access stopped until I shutdown the OAM WebLogic server on node 1 that had numerous stuck threads. Once the node 1 WLS server was shutting down, I noticed that access was restored where all the authorization requests failed-over to node 2 (also confirmed by looking at OEM’s OAM authorization metrics). Once I brought up the node 1 OAM server, all these requests failed-back to node 1.

    I also believe we have not set the AAA Timeout Threshold or this threshold value is too long, which hopefully explains why we hung our environment. I plan to go with primary, primary and set the timeout to no more than 10 seconds as you recommended.

    Thanks.

    Aubrey

    • Tim,

      I checked our OAM access manager configuration and confirmed AAA Timeout Threshold is set to -1 (the default) but we have two primary servers with each Max Number set 1 BUT our Max Connections is also set to 1 with no secondary server configured. I assume that having Max Connections set to 1 will force all OAP request to the first server unless the first server is down.

      Is this correct?

      According to your note you said “Max Connections parameter should be the sum of all the Max Number of Connections from each Primary Server” where “should be” implies that not doing so should/can work.

      Thanks.

      Aubrey

      • Tim Melander says:

        First of all, the AAA Timeout Threshold set to -1 means the Webgate will never give up on the connection to the Access Server, and that is not a good idea. For example, the Webgate will send a request to an Access Server and wait for a response; with the AAA Timeout Threshold set to -1 the Webgate will never give up, and there can be times when an Access Server is unresponsive, which will cause all the threads to start queueing up and therefore cause login and authorization problems. A better value for AAA Timeout Threshold is 10 or 20; a value of 10 means the Webgate will wait up to 10 seconds before it gives up and closes that connection to open another one. However, there is another setting I mention in my blog, a User-Defined Webgate Parameter named client_request_retry_attempts, and its default is 1. With AAA Timeout Threshold equal to 10 and client_request_retry_attempts equal to, say, 2, the Webgate will wait up to 10 seconds for a response from the Access Server, and if it exceeds that time it will try 1 more time, for a total of 2 attempts on the same connection. If after two 10-second attempts the Webgate still does not get a response, then the connection is closed and a new connection is tried.

        Max Connections should be the sum of the Max Number of Connections values of all primary Access Servers. For example, if there are two Access Servers, each set as primary, and the Max Number of Connections for each is 1, then Max Connections should be 2. The Webgate does not load balance between the primary Access Servers. The way the algorithm works is that the Webgate will open OAP connections to the first primary Access Server until it reaches the Max Number of Connections, and then start sending requests to the next primary Access Server; if there are three primary Access Servers, the same applies, and so on.

  6. Hello Tim

    Great blog post! This was an interesting read and I have more clarity on Webgate to OAM connectivity and failover.

    I had a question regarding the failover scenario. Basically, we have 2 datacenters (each with a dual node OAM cluster). Applications like EBS, ADF, etc. may be hosted in either datacenter and some applications are hosted in both datacenters. In order to protect these applications with webgate, we’ve configured the webgate to include the datacenter #1 OAM servers in the primary list and datacenter #2 nodes on the secondary list

    Assuming Total Max connections of 6, with each node at 3 connections and a failover threshold of 3, will the requests be routed to datacenter #2 OAM servers, after the failover threshold of 3 is reached? We’ve been trying to test this and failover does not work as expected and there is often a delay till these connections stabilize.

    Also, i wanted to check with you if it would help us to put the webgate behind the F5 and have a OAP pool for each datacenter

    Appreciate your feedback

    Thanks

    Shiva

  7. Sudhir Kulkarni says:

    Tim,
    Great article. Definitely very useful for everyone in the Access space to understand the impact of these settings on Webgate performance.

    Just would like to point out the small typo:

    Instead of “StartServers says open up 2 threads at start up” , I believe you meant

    “StartServers says open up 2 child processes at start up”

    Thanks,
    Sudhir
