Re: Major Application client frustrations

From: <glassfish_at_javadesktop.org>
Date: Sat, 05 May 2007 01:32:15 PDT

Now, about the retries. We've known for some time that there were some
issues with ORB retries. Recently someone filed a high-priority bug
(6546045, which was a duplicate of 6394769) complaining about too many
retries. I wrote a simple test to reproduce this, and found that the ORB
was retrying 56000 times per minute on my test machine (JDK 1.5_10,
Ubuntu 6.06, Athlon 64 3500+ with 2 GB RAM).

The problem is that the ORB needs to retry the entire list of endpoints
in a cluster, and wrap from the end of the list to the beginning. The
old code was meant to use a simple exponential backoff in this case,
but it ONLY applied the wait with backoff when the error was either a
TRANSIENT or a COMM_FAILURE connectionRebind (both COMPLETED_NO,
of course). Any other COMM_FAILURE COMPLETED_NO was treated as
a "try the next endpoint" kind of retry, with no wait at all. This is good in
the cluster case, because we want to try the next endpoint quickly. But we
also don't want to try all of the endpoints over and over again if they all fail.

The fix is to introduce another use of TcpTimeouts here, and to record
the failed endpoints in a list, until we find that the next endpoint to try
is already in the list. In that case, we clear the list, sleep for the next
timeout, and then start retrying again. That way, we try every endpoint once,
and then sleep until we try again, or time out.
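
To make the idea concrete, here is a minimal sketch of that loop. It is NOT the actual ORB code: the class name, the Connector interface, and the connect() parameters are all made up for illustration, and the backoff arithmetic is a simplified version of what TcpTimeouts does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; none of these names exist in the ORB.
final class EndpointRetrySketch {
    /** Hypothetical stand-in for one connection attempt; true means connected. */
    interface Connector {
        boolean tryConnect(String endpoint);
    }

    /**
     * Tries each endpoint in turn and sleeps only when the next endpoint to
     * try has already failed in the current pass. Returns the endpoint that
     * connected, or null once the total time slept has reached maxTotalMs.
     * Assumes endpoints is non-empty.
     */
    static String connect(List<String> endpoints, Connector connector,
                          long initialMs, long maxTotalMs, int backoff) {
        List<String> failed = new ArrayList<>();
        long wait = initialMs;
        long waited = 0;
        int next = 0;
        while (true) {
            String endpoint = endpoints.get(next);
            if (failed.contains(endpoint)) {
                // We have wrapped around: every endpoint failed once this pass.
                if (waited >= maxTotalMs) {
                    return null; // overall timeout reached, give up
                }
                failed.clear();
                try {
                    Thread.sleep(wait);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return null;
                }
                waited += wait;
                wait = wait * (backoff + 100) / 100; // integer-only backoff
            }
            if (connector.tryConnect(endpoint)) {
                return endpoint;
            }
            failed.add(endpoint); // no sleep here: move on to the next endpoint
            next = (next + 1) % endpoints.size();
        }
    }

    public static void main(String[] args) {
        // Made-up endpoints and a connector that always fails.
        List<String> endpoints = Arrays.asList("host1:3700", "host2:3700", "host3:3700");
        String result = connect(endpoints, ep -> {
            System.out.println("trying " + ep);
            return false;
        }, 10, 50, 100);
        System.out.println("result: " + result);
    }
}
```

The point of the sketch is that a failure on one endpoint costs nothing extra (we move straight to the next one), while a full pass of failures costs one sleep.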

This brings us to the topic of the ORB config parameters in
com.sun.corba.ee.impl.orbutil.ORBConstants. (By the way, the ORBs
in the JDK and the App server are really the same code base, snapshotted
at very different times. The names are changed automatically at build time.)
The relevant constants are:

TRANSPORT_TCP_TIMEOUTS_PROPERTY
    This one controls the retry behavior when the ORB is reading data
    and does not get all of the data at once. It is a TcpTimeouts.
    The defaults are 2000:6000:20
TRANSPORT_TCP_CONNECT_TIMEOUTS_PROPERTY
    This is the one relevant to this discussion. It controls
    how the ORB behaves on the client side when attempting to
    connect to an IOR (the wire representation of an EJB reference).
    This is also a TcpTimeouts.
    The defaults are 250:60000:100:5000
WAIT_FOR_RESPONSE_TIMEOUT
    This controls how long the client waits for a response AFTER successfully
    sending a request. The default is 30 minutes.

Both TcpTimeouts use the same scheme:

initial:max:backoff:maxsingle, where

initial is the first timeout in milliseconds
max is the maximum total wait time in milliseconds (this is checked
    before the last wait, so the last wait can go over this time)
backoff is the backoff factor by which the timeout is increased each time
    (the multiplication is actually by (backoff+100)/100, so 20 is 1.2 and
    100 is 2, but we avoid any use of floating point here)
maxsingle is the maximum single wait time in milliseconds
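
As a rough illustration of how such a string turns into a sequence of waits, here is a small sketch. It is not the real TcpTimeouts class: the class name and the waits() method are invented for this example, and I am assuming that a three-field value (like 2000:6000:20) simply means there is no cap on a single wait.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; this is not the ORB's TcpTimeouts implementation.
final class TcpTimeoutsSketch {
    final int initial;   // first wait, in milliseconds
    final int max;       // total wait budget, checked before each wait
    final int backoff;   // each wait is multiplied by (backoff+100)/100
    final int maxSingle; // cap on any single wait (assumed uncapped if omitted)

    TcpTimeoutsSketch(String spec) {
        String[] parts = spec.split(":");
        if (parts.length < 3 || parts.length > 4) {
            throw new IllegalArgumentException("expected initial:max:backoff[:maxsingle]");
        }
        initial = Integer.parseInt(parts[0]);
        max = Integer.parseInt(parts[1]);
        backoff = Integer.parseInt(parts[2]);
        maxSingle = parts.length == 4 ? Integer.parseInt(parts[3]) : Integer.MAX_VALUE;
        if (initial <= 0 || backoff < 0 || maxSingle <= 0) {
            throw new IllegalArgumentException("initial and maxsingle must be positive, backoff non-negative");
        }
    }

    /** Lists the successive waits until the total reaches or passes max. */
    List<Integer> waits() {
        List<Integer> result = new ArrayList<>();
        int current = Math.min(initial, maxSingle);
        long total = 0;
        while (total < max) {
            result.add(current);
            total += current;
            // Integer arithmetic only, as in the description above.
            long next = (long) current * (backoff + 100) / 100;
            current = (int) Math.min(next, maxSingle);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(new TcpTimeoutsSketch("2000:6000:20").waits());
        // prints [2000, 2400, 2880]
        System.out.println(new TcpTimeoutsSketch("250:60000:100:5000").waits());
    }
}
```

With 250:60000:100:5000 the waits in this sketch double from 250 up to 4000, then stay at 5000 until the total passes 60000, so the last wait overshoots max as described above.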

I've removed the old COMMUNICATIONS_RETRY_TIMEOUT.

You can certainly configure these any way you would like.

Please let me know how this works for you (it will probably be another 2 weeks
before you can get a build with this ORB in it). Also, I'd like to know
whether you need fine-grained control over timeouts. Right now the
granularity is the entire ORB, so you cannot set a timeout that applies
to only a single request, or a single EJB reference.
[Message sent by forum member 'kcavanaugh' (kcavanaugh)]

http://forums.java.net/jive/thread.jspa?messageID=215663