This post is part of a larger series on Oracle Access Manager 11g called Oracle Access Manager Academy. An index to the entire series with links to each of the separate posts is available.
People are typically introduced to Webgate tuning in one of two ways: either they are forced into it by a crisis, or they are actively preparing an environment for some aggressive load testing. Hopefully you are in the latter group. Unfortunately, there is still a lot of mystery behind tuning some of the Webgate parameters. This article focuses on what I feel are the most important tuning parameters: 1) Max Connections, including the relationship between Max Connections and Max Number of Connections, 2) the Failover Threshold, and 3) the AAA Timeout Threshold. If you can grasp the concepts behind these few key parameters, your chances of getting better performance and stability out of the Webgates and Access Servers will greatly increase.
Knowledge in this article is based on extensive experience in the field, discussions with Oracle Webgate developers, and of course invaluable peers. As I mentioned in the introduction, I will break the Webgate tuning into three areas to make it a little easier to digest. The three parameters are not necessarily related to or dependent on each other, so you are free to jump to the section you are interested in. However, I highly advise that you read the entire article before making any major tuning changes. Below is a screenshot of an 11gR2PS3 (OAM 11.1.2.3.0) Webgate definition that highlights the parameters I will cover plus any associated fields; all settings are R2PS3 default values.
OHS and Apache run in one of two modes, “Worker” or “Pre-Fork”. The default for OHS is Worker mode. With Apache it depends on how it was compiled, but the typical implementation uses Worker mode in newer 2.2.x+ versions. Be sure to verify which mode you are running in, as this has a big impact on how you tune the Webgate. I will only focus on Worker mode in this article, as these days very few customers still use pre-fork.
Worker mode uses multiple child processes with several threads for each process. Each thread handles one connection at a time. Take the following example configuration from a typical httpd.conf. There are four key directives in bold that I will mention: StartServers, MaxClients, ServerLimit, and ThreadsPerChild.
As a quick primer, I want to explain each of these directives so that you can relate them to the comments I make about how they correlate to Webgate tuning.
| Directive | Description |
|---|---|
| StartServers | Starts X child httpd processes at startup. |
| MaxClients | The maximum number of connections that will be processed simultaneously. |
| ServerLimit | The maximum number of httpd child processes the server will allow. |
| ThreadsPerChild | Sets the number of threads per child process that will do work. |
I want to point out that ServerLimit is an important directive because it limits the number of httpd processes that Apache/OHS will run. In worker mode, if that directive is not set, the default limit is 16, but I have seen the number of processes grow even higher, so you want to monitor it. Determining the number of processes is pretty simple; run the following command.
ps -ef | grep httpd
oracle 4891 4873 0 09:09 ? 00:00:00 /scratch/oracle/middleware/ohs_home/ohs/bin/httpd.worker -DSSL
oracle 4899 4891 0 09:10 ? 00:00:00 /scratch/oracle/middleware/ohs_home/ohs/bin/httpd.worker -DSSL
oracle 4901 4891 0 09:10 ? 00:00:00 /scratch/oracle/middleware/ohs_home/ohs/bin/httpd.worker -DSSL
oracle 5064 4891 0 09:10 ? 00:00:00 /scratch/oracle/middleware/ohs_home/ohs/bin/httpd.worker -DSSL
You will notice there is a parent PID, which is typically the first row or the lowest PID value among the httpd processes. Above, the parent is 4891, and there will only ever be one parent. The child httpd processes each have their own PID, but their PPID values will be the parent PID. The parent is not an httpd process that serves requests; it mostly monitors and manages new and existing child httpd processes. It is also very important to understand that each httpd child process that is spawned will grab its own set of OAP connections, based on the Webgate Max Connections setting. For example, if the Webgate Max Connections is set to 8, then one httpd child process will immediately get 8 OAP connections, and if 4 child httpd processes are spawned, then each of the 4 will get 8 OAP connections. That gives a total of 4 x 8 = 32 OAP connections from that Apache/OHS server with the Webgate plugin. More on this in the next section.
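To make the parent/child relationship concrete, here is a minimal Python sketch that parses `ps -ef` text like the output above, finds the parent httpd process, counts its children, and multiplies by Max Connections. The function names and the sample value of 8 for Max Connections are my own illustrative assumptions, not anything shipped with OHS or the Webgate.

```python
from typing import Tuple

SAMPLE_PS_OUTPUT = """\
oracle 4891 4873 0 09:09 ? 00:00:00 /scratch/oracle/middleware/ohs_home/ohs/bin/httpd.worker -DSSL
oracle 4899 4891 0 09:10 ? 00:00:00 /scratch/oracle/middleware/ohs_home/ohs/bin/httpd.worker -DSSL
oracle 4901 4891 0 09:10 ? 00:00:00 /scratch/oracle/middleware/ohs_home/ohs/bin/httpd.worker -DSSL
oracle 5064 4891 0 09:10 ? 00:00:00 /scratch/oracle/middleware/ohs_home/ohs/bin/httpd.worker -DSSL
"""


def count_httpd_children(ps_output: str) -> Tuple[int, int]:
    """Return (parent_pid, number_of_children) from `ps -ef` style text."""
    rows = []
    for line in ps_output.strip().splitlines():
        fields = line.split()
        # `ps -ef` columns: UID PID PPID C STIME TTY TIME CMD...
        # Skip the header, the grep process itself, and anything else
        # whose command is not an httpd binary.
        if len(fields) < 8 or "httpd" not in fields[7]:
            continue
        rows.append((int(fields[1]), int(fields[2])))  # (PID, PPID)

    httpd_pids = {pid for pid, _ in rows}
    # The parent is the httpd process whose own parent is NOT an httpd process.
    parents = [pid for pid, ppid in rows if ppid not in httpd_pids]
    if not parents:
        raise ValueError("no httpd parent process found")
    parent = parents[0]
    children = sum(1 for _, ppid in rows if ppid == parent)
    return parent, children


def total_oap_connections(max_connections: int, children: int) -> int:
    """Each child httpd process opens its own Max Connections."""
    return max_connections * children


parent_pid, child_count = count_httpd_children(SAMPLE_PS_OUTPUT)
print(parent_pid, child_count, total_oap_connections(8, child_count))
```

With the sample output above there is one parent (4891) and three children, so a Max Connections of 8 would mean 24 OAP connections from this server at that moment.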
I want to cover more detail on the Max Connection parameter, but first things first, we need to understand how connections work with web servers and how it relates to the Webgate module. Let's focus on OHS which is basically Apache. So if you have Apache assume everything I say will also apply. However, if you use another supported 11g Webgate on a web server other than OHS or Apache, how connections work could be different so please extrapolate this information and apply it to your web server.
Tuning the Max Connections parameter can yield big improvements in performance, but beware: increasing the value does not necessarily equate to increased performance and can in fact have a negative impact. The official Oracle OAM 11g 11.1.2 Administration Guide says, “Max Connections is the maximum number of connections that a Webgate can establish with the Access Server.” This statement is somewhat true, but it should not be taken literally. It may lead you to think that setting Max Connections to X means only X connections will be opened to the Access Server, but that is an incorrect assumption. Now let's cover some of the OHS/Apache directives I mentioned earlier to make sense of Max Connections and what their impacts are.
Remember earlier I mentioned that each child httpd process gets its own set of Max Connections? I also used the example that if Max Connections is set to 8, then each httpd child process will get 8 OAP connections. So it follows that if Max Connections is changed to 24, then each httpd child process gets 24 connections. Now let's look at the StartServers directive. If the value is set to 2, then 2 httpd child processes will be spawned when OHS starts. So if Max Connections is 24 and 2 httpd processes start up initially, we will see 24 x 2 = 48 OAP connections coming from the Webgate of that OHS server; the math is that simple.
The MaxClients directive is fairly straightforward and is pretty much what the description in the table says. MaxClients limits the number of simultaneous client connections OHS/Apache will handle. One important thing to know is how MaxClients is calculated, which is simply MaxClients = ServerLimit x ThreadsPerChild. So if ServerLimit is 4 and ThreadsPerChild is 25, you would set MaxClients to 100, which is not a typical value and is quite small. Still, you get the idea: if you want to increase MaxClients to, say, 512, you have to adjust the other terms of the formula. For example, if MaxClients is 512 and ThreadsPerChild is 64, then you have to set ServerLimit to 8; i.e. MaxClients 512 / ThreadsPerChild 64 = ServerLimit 8. Increasing or decreasing MaxClients only raises or lowers the demand the web server will allow and does not by itself require adjusting any Webgate configuration, but it can affect other tuning directives and how the Webgate uses them; more on this in the next couple of sections.
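The MaxClients arithmetic can be written down as a small Python sketch. The helper names are hypothetical; the numbers are the ones used in this article.

```python
def max_clients(server_limit: int, threads_per_child: int) -> int:
    """MaxClients = ServerLimit x ThreadsPerChild (worker mode)."""
    return server_limit * threads_per_child


def server_limit_for(target_max_clients: int, threads_per_child: int) -> int:
    """Smallest ServerLimit that can satisfy the target MaxClients."""
    # Ceiling division, so a target that is not an exact multiple
    # of ThreadsPerChild still gets enough child processes.
    return -(-target_max_clients // threads_per_child)


print(max_clients(4, 25))         # the small example above
print(server_limit_for(512, 64))  # ServerLimit needed for MaxClients 512
```

Rounding up is a design choice on my part: it is safer to end up with one extra child process than with fewer threads than MaxClients asks for.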
Earlier I mentioned ServerLimit and how it sets a ceiling on the number of httpd child processes. This directive is important to administrators who want to limit the resources the web server will use, because the more httpd child processes that are spawned, the more CPU, memory, file descriptors, and other resources are used. For example, if ServerLimit is set to 8, then you are limiting the total number of httpd child processes to 8. These httpd child processes are only spawned based on the demand on the web server. Remember when I pointed out that StartServers kicks off the initial httpd children? If StartServers is 2 and ServerLimit is 8, then 2 of the 8 possible httpd child processes will start when the web server starts. At peak demand all 8 httpd child processes will be running, and each of those children will open up its Max Connections. So if Max Connections is 8, we can calculate: Max Connections 8 x ServerLimit 8 = 64 OAP connections.
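As a rough planning aid, the range of OAP connections one web server can open follows directly from StartServers, ServerLimit, and Max Connections. This Python sketch uses hypothetical names and simply restates the arithmetic above; it assumes every child opens its full Max Connections.

```python
from typing import NamedTuple


class OapConnectionRange(NamedTuple):
    at_startup: int  # StartServers x Max Connections
    at_peak: int     # ServerLimit x Max Connections


def oap_connection_range(start_servers: int, server_limit: int,
                         max_connections: int) -> OapConnectionRange:
    """OAP connections from one web server at startup and at peak demand."""
    if start_servers > server_limit:
        raise ValueError("StartServers cannot exceed ServerLimit")
    return OapConnectionRange(
        at_startup=start_servers * max_connections,
        at_peak=server_limit * max_connections,
    )


print(oap_connection_range(start_servers=2, server_limit=8, max_connections=8))
```

Remember to multiply the peak figure by the number of web servers in the farm when sizing the Access Servers.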
Now it gets more interesting when we talk about ThreadsPerChild. ThreadsPerChild is a directive that applies to worker mode, and it simply says how many threads will be started for each httpd child process. In our example configuration ThreadsPerChild is 25, so each httpd child process that is started gets 25 threads. The threads are what actually do the work. Remember when I mentioned that Max Connections defines how many OAP connections are opened for each httpd child process? Keep this in your head, because now that we have 25 threads for each httpd child process, those threads have to share the OAP connections. A thread will grab a connection to do things like authentication requests, authorization requests, etc. While a thread is using a connection, no other thread can use it. Once a thread is finished with a connection, it tosses it back in the pool for other threads to use. This is a very important concept to understand, because if you increase ThreadsPerChild and don't increase the number of OAP connections, the threads can be starved under heavy http traffic, which ultimately creates congestion. So though you may want to minimize memory usage on the OHS server by reducing the number of httpd child processes and increasing ThreadsPerChild to get a lot of work out of each process, you can cause so much congestion that bottlenecks form and the Webgate complains that it cannot contact any Access Server.
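The thread-versus-connection contention can be illustrated with a toy Python simulation. This is not how the Webgate is implemented internally; it is only a sketch, using a semaphore as a stand-in for the OAP connection pool of one httpd child process, with made-up names and a made-up hold time.

```python
import threading
import time
from typing import Tuple


def simulate_child(threads_per_child: int, max_connections: int,
                   hold_seconds: float = 0.01) -> Tuple[int, int]:
    """Return (peak connections in use, requests served) for one child."""
    pool = threading.BoundedSemaphore(max_connections)  # the OAP pool
    lock = threading.Lock()
    state = {"in_use": 0, "peak": 0, "served": 0}

    def handle_request() -> None:
        with pool:  # blocks until some other thread returns a connection
            with lock:
                state["in_use"] += 1
                state["peak"] = max(state["peak"], state["in_use"])
            time.sleep(hold_seconds)  # pretend to wait on the Access Server
            with lock:
                state["in_use"] -= 1
                state["served"] += 1

    workers = [threading.Thread(target=handle_request)
               for _ in range(threads_per_child)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return state["peak"], state["served"]


peak, served = simulate_child(threads_per_child=25, max_connections=8)
print(f"served={served}, peak connections in use={peak}")
```

All 25 requests are eventually served, but never more than 8 at a time; the remaining threads sit waiting. The slower the Access Server responds (a larger `hold_seconds`), the longer that queue takes to drain.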
I am sure you are asking, so what is a good value for Max Connections? Beyond calculating the total as the sum of the Max Number of Connections from each primary Access Server (more on that in the next section), unfortunately there is no magical recommended number and no sweet spot. The value needs to be determined by experimenting with load tests and recording the results, so they can be compared to see which values yield the best performance. No two implementations are alike, and I have seen nearly as many different values as I have seen deployments. That said, one consideration is that if ThreadsPerChild is high, the threads will need plenty of connections to satisfy the demand on a busy web server, so increasing Max Connections can help. However, in my own tests I have found that if ThreadsPerChild is too high, even opening up a lot of Max Connections does not produce the performance you might expect. Personally I have found a good balance to be ServerLimit around 16 and ThreadsPerChild around 64, so that MaxClients works out to 1,024 (MaxClients = ServerLimit x ThreadsPerChild). These values seem to get good throughput, but again you have to load test and adjust the settings while recording the results to see what gets the best performance and throughput for your environment. Before you decide on the Max Connections value, you need to read the next section.
There is no pun in the connection between Max Connections and Max Number of Connections. In a nutshell, the value for the Max Connections parameter should be the sum of all the Max Number of Connections from each Primary Server. Take the following diagram as an example.
The value for Max Connections in the diagram is 12. If you add up the Max Number of Connections from each of the three Primary Servers it totals 12 (4+4+4=12).
Let’s take another example, but this time change OAM 3 primary Access Server to a secondary server, and also update the Max Number of Connections value for each OAM Server from 4 to 6.
The first thing I want to point out is that the secondary Access Server will not get requests from the Webgate until connections to any primary Access Server fall below the Failover Threshold; more on that later. Since we have two primary OAM servers with Max Number of Connections values of 6 each, the total Max Connections value for the Webgate would be 12 (6+6=12); it is pretty simple. Now that we understand how to get the value for the Max Connections parameter, you may be wondering what value to use for Max Number of Connections: 4, 6, 20, 100? Good question, and fortunately Chris Johnson wrote a great article on this very subject, “How many connections do I need from the WebGate to the OAM Server?”. Again, it must be called out that the number you define in the Webgate profile will be multiplied by the number of web server child processes to determine the actual number of connections, so a little can often go a long way!
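The rule for deriving Max Connections from the server list can be sketched in Python. The dictionary layout and function name are my own; only primary servers count toward the sum.

```python
from typing import Dict, Tuple

# server name -> (role, Max Number of Connections)
ServerList = Dict[str, Tuple[str, int]]


def webgate_max_connections(servers: ServerList) -> int:
    """Max Connections = sum of Max Number of Connections over primaries."""
    return sum(count for role, count in servers.values() if role == "primary")


three_primaries: ServerList = {
    "OAM 1": ("primary", 4),
    "OAM 2": ("primary", 4),
    "OAM 3": ("primary", 4),
}
two_primaries_one_secondary: ServerList = {
    "OAM 1": ("primary", 6),
    "OAM 2": ("primary", 6),
    "OAM 3": ("secondary", 6),
}

print(webgate_max_connections(three_primaries))
print(webgate_max_connections(two_primaries_one_secondary))
```

Both examples come out to 12, matching the two scenarios described above.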
So far in my examples I have made each OAM server Max Number of Connections the same or symmetrical, but you don’t necessarily have to do that. You can optionally add more connections to different primary servers if you want more requests to go to any specific server. This strategy is basically a type of load balancing using the Webgate Max Number of Connections configuration value instead of using an actual physical load balancer appliance; take the following diagram as an example.
Notice that OAM 1 primary server has 8 Max Number of Connections while OAM 2 and OAM 3 primary servers have 4 each. So the total Max Connections value would be 16 (8+4+4=16). In this particular configuration OAM 1 server would get double the number of connections from the Webgates as the other two primary OAM servers. One reason to do this would be that OAM 1 is a much larger server, more memory, etc. and can handle more traffic, or maybe OAM 1 is physically closer to the Webgate so it can process requests much faster. In reality even though this is an option, I have never really seen this in practice because normally all the servers have the equivalent sized hardware, are in the same network, and therefore there is no need to distribute more requests to any one server. That said, I did want to at least bring this up so you understand that there are options for various reasons if you so decide it makes sense.
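To see how an asymmetric configuration distributes connections, this small Python sketch computes each primary server's share of the total. The names are hypothetical, and it assumes connections are spread strictly in proportion to Max Number of Connections.

```python
from typing import Dict


def connection_shares(primaries: Dict[str, int]) -> Dict[str, float]:
    """Fraction of the Webgate's connections that each primary receives."""
    total = sum(primaries.values())
    if total == 0:
        raise ValueError("at least one primary needs a connection")
    return {name: count / total for name, count in primaries.items()}


weighted = {"OAM 1": 8, "OAM 2": 4, "OAM 3": 4}
print(sum(weighted.values()))       # Max Connections for the Webgate
print(connection_shares(weighted))  # share per primary server
```

Here OAM 1 ends up with half of the connections and the other two servers with a quarter each.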
The latest (at the time of this post) official 11g Access Manager documentation, in Table 16-3 Elements on Expanded 11g and 10g WebGate/Access Client Registration Pages, says the Failover Threshold parameter is a “Number representing the point when this Webgate opens connections to a Secondary OAM Server.” It also gives an example: if 30 were used as the value and the number of connections to primary servers dropped to 29, connections would begin to open to the secondary Access Server; the default value is 1. This description gives some idea of what is happening, but it offers no recommendations and some find it confusing. So I want to add some of my own experience and recommendations.
1. First, the word “Failover” in the parameter name means exactly what it says. As connections to a primary OAM server are lost, the Webgate will try to make up for them by connecting to a secondary OAM server; hence the word “Failover”. So a big note here: this setting only works if at least one secondary OAM server is defined in the Webgate profile. The Failover Threshold parameter will do nothing if there is no secondary OAM server defined.
2. Second, the word “Threshold” in the parameter name refers to the point at which connections begin to go over to the secondary OAM server(s). Based on the official documentation, which is correct, if the Failover Threshold is set to 6 and the Max Number of Connections is also set to 6, then as soon as the number of connections from the Webgate to the OAM server drops below the Failover Threshold of 6, connections will start to be sent to the secondary OAM server(s). If there are two secondary OAM servers, the first in the list will be the one getting all the connections. As soon as the first secondary OAM server fills up its Max Number of Connections, the second secondary OAM server will start getting connections. Are you following?
So the big question is, what is the best setting? My recommendation is twofold.
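The two points above can be modeled with a short Python sketch. This is my own simplified reading of the behavior, not Webgate source code: it assumes the Webgate tries to make up exactly the shortfall in primary connections, and that secondaries are filled in list order. All names are hypothetical.

```python
from typing import List


def secondary_connections(live_primary: int, failover_threshold: int,
                          max_number_of_connections: int,
                          secondary_caps: List[int]) -> List[int]:
    """Connections to open on each secondary server, in list order.

    live_primary:              connections currently up to the primary
    failover_threshold:        the Webgate's Failover Threshold
    max_number_of_connections: configured connections for that primary
    secondary_caps:            Max Number of Connections of each secondary
    """
    # Nothing happens while we are at or above the threshold,
    # or when no secondary server is defined at all.
    if live_primary >= failover_threshold or not secondary_caps:
        return [0] * len(secondary_caps)

    shortfall = max_number_of_connections - live_primary  # assumed behavior
    plan = []
    for cap in secondary_caps:
        take = min(cap, shortfall)  # fill the first secondary before the next
        plan.append(take)
        shortfall -= take
    return plan


print(secondary_connections(5, 6, 6, [6, 6]))  # one lost connection
print(secondary_connections(0, 6, 6, [4, 4]))  # primary completely down
print(secondary_connections(3, 1, 6, [6]))     # default threshold of 1
```

The last call shows why the default threshold of 1 is so conservative: the primary can lose half of its connections and the secondary still sits idle.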
1. If you DO have Secondary OAM Servers configured:
Set the Failover Threshold value equal to the Max Number of Connections only if you have at least one secondary OAM server. Take my examples above, if the OAM server Max Number of Connections is 4, then set the Failover Threshold to 4. The reason for this is that you engage all the processing power needed as connections drop from any one primary OAM server since the secondary OAM server will start picking up the slack. As soon as the primary server having connection problems corrects itself, the Webgate will start failing back to the primary OAM server and slowly drop the connections from the secondary server until all the Max Number of Connections are met.
2. If you DO NOT have Secondary OAM Servers configured:
If you decide not to configure any secondary OAM server, you can leave the Failover Threshold value at the default of 1 because it will never be used. Remember, Failover Threshold requires a secondary OAM server to be configured. In practice, most clients like to see all their hardware provide some value, which means keeping all the servers working to get their money's worth. So I will typically see all OAM servers configured as primary servers; there is nothing wrong with this. That said, I have also seen various configurations with a mix of primary and secondary servers in a crisscross fashion that is a bit more complicated, but certainly has merits too depending on the situation.
If you follow either of the points above you should have a solid configuration.
The AAA Timeout Threshold parameter setting determines how long the Webgate will wait on a connection response before it gives up and attempts to request a new connection. For example let’s say the Webgate has a connection opened, and a request comes through to validate some credentials. This process normally should take a fraction of a second, but there could be all sorts of variables to make this request take much longer. If the wait for the response is longer than the AAA Timeout Threshold, it will abandon the connection for that request, toss it back in the pool, and open a new connection to try again.
For most of OAM's life (prior to R2 PS3), the default value for AAA Timeout Threshold was “-1” (minus one). The -1 is a special value that tells the Webgate to use the operating system's TCP timeout, which could easily be 2 minutes or even more! I have seen actual cases where something goes awry with an Access Server, and while the Webgate tries to connect to it or get a response from it, the Webgate keeps trying for a long time because the AAA Timeout Threshold was left at the default of -1. As each connection waits for a very long time, the Webgate gets into a state that gives the impression it is down, when in reality the Webgate is doing what it was told, which was to wait a long time before retrying. When all the connections start doing this, we have an OAM zombie apocalypse problem. Zombies are bad, but we can try to avoid this behavior by shortening that wait time.
The recommended value is anywhere from 5 to 10; this is in seconds. For example, if you set the AAA Timeout Threshold to 5, the Webgate will open its connection, send its request, and expect to get a response back within 5 seconds. If not, it opens a new connection and tries again, while the old connection is freed up and tossed back into the pool. If the value is set shorter, say 1 second, an authentication or authorization request could legitimately take longer than that because the Access Server is waiting for a long LDAP search to return. That would send you into a tailspin, because the request would never complete: there is not enough time allotted for such an LDAP search. So we have found that a value of 5 – 10 seconds seems to be a fair and balanced approach. In R2 PS3 the default is now 5 seconds, which is reasonable.
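The meaning of the special value -1 can be expressed as a tiny Python sketch. The function names are hypothetical, and the 120-second figure is only an assumed stand-in for "the operating system's TCP timeout", which varies by platform and kernel settings.

```python
ASSUMED_OS_TCP_TIMEOUT_SECONDS = 120  # assumption; check your own OS


def effective_timeout(aaa_timeout_threshold: int,
                      os_tcp_timeout: int = ASSUMED_OS_TCP_TIMEOUT_SECONDS) -> int:
    """Seconds the Webgate would wait for a response before retrying."""
    if aaa_timeout_threshold == -1:
        return os_tcp_timeout  # -1 defers to the OS TCP timeout
    if aaa_timeout_threshold <= 0:
        raise ValueError("expected -1 or a positive number of seconds")
    return aaa_timeout_threshold


def in_recommended_range(aaa_timeout_threshold: int) -> bool:
    """True for the 5 to 10 second range suggested in this article."""
    return 5 <= aaa_timeout_threshold <= 10


for value in (-1, 1, 5, 10):
    print(value, effective_timeout(value), in_recommended_range(value))
```

A check like this could be run against exported Webgate profiles to flag any that still carry the old -1 default.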
One parameter worth mentioning that many may not know about is “client_request_retry_attempts”. A description of this parameter can be found in the latest (at the time of this article) official Oracle online documentation: https://docs.oracle.com/cd/E40329_01/admin.1112/e27239/register.htm#AIAAG5856. The official description says: “WebGate-to-OAM Server timeout threshold specifies how long (in seconds) the WebGate waits for the OAM Server before it considers it unreachable and attempts the request on a new connection.” At first this seems similar to the AAA Timeout Threshold, but the difference is that this parameter controls how many times the WebGate will retry its request before attempting the secondary server.
So if the AAA Timeout Threshold is set to 5 seconds, the Webgate will time out that connection after 5 seconds if there is no response, and client_request_retry_attempts tells the Webgate how many times it will attempt to retry the request. If the value is set to 2, then the Webgate will wait 5 seconds (assuming the AAA Timeout Threshold is set to 5), and if it times out it will try up to 2 times before timing out the connection. This configuration may be useful if you think the network connectivity between the Webgates and the Access Servers is not stable and you want the Webgate to try more than once before closing its connection.
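Combining the two settings gives a rough upper bound on how long a single request can hang before the Webgate gives up on a server. This Python sketch assumes that client_request_retry_attempts counts the total number of tries and that each try waits the full AAA Timeout Threshold; both are my assumptions for illustration, so treat the result as an estimate only.

```python
def worst_case_wait(aaa_timeout_seconds: int, retry_attempts: int) -> int:
    """Upper bound, in seconds, on the wait before a request is abandoned."""
    if aaa_timeout_seconds <= 0 or retry_attempts <= 0:
        raise ValueError("both values must be positive")
    # Assumption: every attempt runs into the full timeout.
    return aaa_timeout_seconds * retry_attempts


print(worst_case_wait(aaa_timeout_seconds=5, retry_attempts=2))
```

Keep this product well below any timeout in front of the web server (load balancer, browser, or application), or the user will see an error before the Webgate has finished retrying.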
I realize there are a lot of details in this blog, but it is all very useful and you may need to read each section carefully to absorb the data. I can say that tuning the Webgate profile is a very important part of an OAM deployment and can save you lots of late nights worrying about performance or outages. Good luck and be sure to load test your configurations before going live.