admin@glassfish.java.net

Re: Command Replication in 3.1 - details

From: Vijay Ramachandran <vijay.ramachandran_at_oracle.com>
Date: Fri, 30 Apr 2010 12:11:45 -0700

Bill - Thanks for your comments.

For this upcoming release, we are planning to implement the following
simple command replication model :

- Try to send the command to all applicable instances once
- Collect the responses from all instances
- Give a detailed response/result to the user (where the command
succeeded, where it failed, and whether any instance needs a restart)
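
A minimal sketch of this model, to make the three steps concrete. It is
only an illustration: the class names, the sender callback and the error
strings are all hypothetical and are not the GlassFish 3.1 API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical names throughout; this is not the GlassFish 3.1 API.
final class ReplicationReport {
    final List<String> succeeded = new ArrayList<>();
    // instance name -> error text, in the order the instances were tried
    final Map<String, String> failed = new LinkedHashMap<>();

    // Per the plan, every instance where replication failed needs a restart.
    List<String> needRestart() {
        return new ArrayList<>(failed.keySet());
    }
}

final class CommandReplicator {
    // 'sender' stands in for whatever transport is used to reach an instance.
    // Assumed contract: returns null on success, an error message if the
    // command ran but failed, and throws if the command could not be sent.
    static ReplicationReport replicate(String command,
                                       List<String> instances,
                                       BiFunction<String, String, String> sender) {
        ReplicationReport report = new ReplicationReport();
        for (String instance : instances) {
            String error;
            try {
                error = sender.apply(instance, command); // one attempt, no retry
            } catch (RuntimeException e) {
                error = "could not send: " + e.getMessage();
            }
            if (error == null) {
                report.succeeded.add(instance);
            } else {
                report.failed.put(instance, error);
            }
        }
        return report;
    }
}
```

The caller (CLI or GUI) would then print the succeeded list, the failed
map and the result of needRestart() to the user.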

Once this is in place, we will build on it further to make it more
powerful and user friendly, depending on availability of time, resources
etc.

Given the above plan, answers to your specific questions are inline below :

> Vijay Ramachandran wrote on 04/14/2010 10:54 AM:
>> I have put together the details of how command replication feature
>> will work in GlassFish 3.1 in this wiki page
>> <http://wiki.glassfish.java.net/Wiki.jsp?page=ClusterDynamicReconfig>. Your
>> comments / feedback will be deeply appreciated. We can probably use a
>> few minutes of the next team meeting to give your feedback.
>
> This looks good. I had a few comments...
>
> In the table "Command replication results and action taken", the
> first entry for "Failure on one or more instances", the action taken
> includes "set server-restart". What exactly does this mean and how do
> you plan to do this? Are you assuming the server is up and you have
> reliable communication with the server?

We are not assuming that the server is going to be up or that the
communication is reliable. We will try to send the command to the
instance and get back results. If sending the command itself fails, we
will flag it as a type of error that indicates something is wrong with
the network or the server. If the command goes through to the server but
its execution fails, we will get a detailed error (just as happens now
between the CLI and the DAS), and we will display this error. In either
failure case, we indicate to the caller (the CLI or GUI) that the
server(s) where the replication failed need to be restarted.

> In general, the failure cases don't seem to distinguish "I sent the
> command to the instance and it returned a failure response" from
> "I wasn't able to send the command to the instance, e.g., because
> it was down" or "I sent the command to the instance but I never got
> a response". How do you plan to detect and handle these different
> cases?

As per the current plan, we probably will be able to distinguish between
the following types of failure :

1. Network failure
2. Connection timed out
3. Command failure (Command was sent but command execution on the server
failed for some other reason)

"3" will return enough failure information to be self-explanatory, and
the user will have to take corrective action. For "1" and "2", we could
do more by saying whether the error is due to the server not being up,
a network failure, or some other communication failure - but that is not
planned for the first phase of implementation.
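
As a rough sketch of how the three failure types might be told apart on
the sending side, assuming the transport surfaces standard java.net /
java.io exceptions. FailureType, CommandExecutionException and
FailureClassifier are hypothetical names used only for illustration, and
the mapping shown is an assumption, not the planned implementation.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

// Hypothetical types for illustration; not part of GlassFish.
enum FailureType { NETWORK_FAILURE, CONNECTION_TIMED_OUT, COMMAND_FAILURE }

// In this sketch, thrown when the instance received the command but
// reported that executing it failed.
final class CommandExecutionException extends Exception {
    CommandExecutionException(String message) {
        super(message);
    }
}

final class FailureClassifier {
    static FailureType classify(Exception e) {
        // SocketTimeoutException is a subclass of IOException, so it must
        // be checked first.
        if (e instanceof SocketTimeoutException) {
            return FailureType.CONNECTION_TIMED_OUT;
        }
        // Any other I/O problem (connection refused, unknown host, ...)
        // is treated as a network failure.
        if (e instanceof IOException) {
            return FailureType.NETWORK_FAILURE;
        }
        // Assumption: everything else means the command reached the
        // instance and failed there.
        return FailureType.COMMAND_FAILURE;
    }
}
```

Note that this does not say whether a network failure is caused by the
instance being down or by the network itself; as mentioned above, that
finer distinction is not part of the first phase.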

> Also, how do you plan to handle intermittent network failures?
> When you're next able to talk to the server instance, will you be
> able to detect that it is out of date? Will you depend on GMS to
> detect such cases? What if GMS says the instance is up but you can't
> talk to it?

Again, for the first phase of implementation, we are planning to keep it
simple. We will add a P3 task named "Make command replication error
reporting more user friendly by using GMS infrastructure" to the list of
dynamic reconfig tasks and address it as time/resources permit.

I will also update the wiki
<http://wiki.glassfish.java.net/Wiki.jsp?page=ClusterDynamicReconfig>
to reflect the contents of this mail so that it is clear to all.

Thanks a lot

Vijay