admin@glassfish.java.net

Re: rsync algorithm

From: Tim Quinn <tim.quinn_at_oracle.com>
Date: Wed, 10 Mar 2010 10:38:50 -0600

I don't have data to refute Bill's thought about the two most frequent
use cases (fetching nothing vs. fetching everything), and in fact it
makes intuitive sense to me - at least in a true production environment.
I don't have a good sense for the relative frequency of adding a new
cluster node as opposed to deploying a new app, redeploying an existing
one, or changing some configuration data (any of which would involve a
partial refresh rather than an all-or-nothing refresh) in a production
GlassFish environment.

Does anyone have a sense for how many GlassFish environments are
"cluster development" or "cluster testing" environments, in which all of
these activities - adding instances, deploying and redeploying apps -
are more frequent than in true production settings? To the extent that
we want to offer a compelling (that is, fast!) development experience
with clusters as well, we'll want to keep those environments in mind.

I have no numbers on the relative counts of "production" vs.
"development" cluster environments, or on how often instance creations,
app deployments, redeployments, etc. happen in each type of environment.
Clearly we don't want to suffer poor performance in production in
exchange for fast developer performance; I'm just saying we don't want
to do the opposite either.

But, as Bill said, without data for any of this we need to go with our
guts. We could probably gather data to measure the performance of the
technology choices we are considering; I doubt we have the time to
collect data on the real-world split between true production and
"development" GlassFish cluster environments and how often users do
various tasks in each.

- Tim

On 3/9/10 10:48 PM, Bill Shannon wrote:
> Byron Nevins wrote on 03/09/2010 12:59 PM:
>> http://rsync.samba.org/tech_report/
>
> It's been a while since I looked at this, so correct me if I'm
> wrong, but...
>
> My understanding of the big advantage of rsync over (e.g.) rdist was that
> it was much more clever about figuring out the smallest amount of data it
> needed to send in the case where a file was changed.
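
For anyone who hasn't read the tech report: the per-file cleverness is a
weak rolling checksum that slides through the changed file one byte at a
time, so match-finding is cheap. A minimal sketch in Java - my own
illustrative code, not anything from rsync or from our tree:

    // rsync's weak checksum s = a + 2^16 * b over a window of bytes.
    // "Rolling" means sliding the window one byte costs O(1), not O(len).
    public final class RollingChecksum {
        private static final int M = 1 << 16;
        private int a, b;       // the two 16-bit halves of the checksum
        private int blockLen;   // current window length

        /** Compute the checksum of buf[off .. off+len-1] from scratch. */
        public void reset(byte[] buf, int off, int len) {
            a = 0; b = 0; blockLen = len;
            for (int i = 0; i < len; i++) {
                int x = buf[off + i] & 0xFF;
                a = (a + x) % M;
                b = (b + (len - i) * x) % M;
            }
        }

        /** Slide one byte: drop 'out' on the left, add 'in' on the right. */
        public void roll(byte out, byte in) {
            int o = out & 0xFF, n = in & 0xFF;
            a = (a - o + n) % M;
            if (a < 0) a += M;
            b = (b - blockLen * o + a) % M;
            if (b < 0) b += M;
        }

        /** The 32-bit weak checksum; matches get confirmed by a strong hash. */
        public int value() {
            return (b << 16) | a;
        }
    }

The receiver checksums each block of its copy; the sender rolls this
window over its copy looking for matches. That's the part that shrinks
the bytes sent for a changed file, which - as Bill says next - is
probably not our problem.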
>
> I don't think that's our problem.
>
> (Yes, doing that would clearly make some cases *better*, but I don't
> think it's what makes the performance of large clusters unacceptable.)
>
> I think the issue is efficiently deciding whether *any* data needs to be
> sent when a cluster instance starts up. The most common case has to be
> that nothing needs to be sent. The second most common case is likely
> that everything needs to be sent, because you're adding a new server
> to the cluster.
>
> Anyone have any reason to doubt that?
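
No argument here. If that's the shape of the problem, the common
"nothing to send" case could be answered in one round trip: the DAS
computes a single digest over the whole synchronized tree, hands it to
the instance at the end of every sync, and at startup the instance just
echoes it back for comparison. A hypothetical sketch - none of these
names are existing GlassFish APIs, and the digest has to be computed on
the DAS side, since sizes and especially mtimes won't match across
machines:

    // Hypothetical: one digest over (relative path, size, mtime) of every
    // file under the DAS's copy of the tree, visited in a fixed order.
    // Equal digests at instance startup => nothing to send.
    import java.io.File;
    import java.security.MessageDigest;
    import java.util.Arrays;

    public final class TreeDigest {

        public static byte[] compute(File root) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            walk(root, "", md);
            return md.digest();
        }

        private static void walk(File dir, String prefix, MessageDigest md)
                throws Exception {
            File[] children = dir.listFiles();
            if (children == null) return;
            Arrays.sort(children);   // deterministic order from run to run
            for (File f : children) {
                String rel = prefix + f.getName();
                if (f.isDirectory()) {
                    walk(f, rel + "/", md);
                } else {
                    md.update((rel + "|" + f.length() + "|"
                            + f.lastModified()).getBytes("UTF-8"));
                }
            }
        }
    }

The new-server case falls out of the same check: no cached digest on
the instance means send everything.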
>
> If that's true, we only need to consider something rsync-like, or
> rather more rdist-like, that figures out what files are out of date
> and sends them.
> Again, my understanding of rsync/rdist is that doing this over a large
> directory involves many round trips. It's a *very* simple approach,
> but if the performance is sufficient for what we're doing we should
> definitely consider it.
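
The round trips could be avoided by shipping the whole comparison in one
exchange: the instance sends its manifest in a single request, and the
DAS diffs it against its own and replies with just the stale paths.
Another hypothetical sketch with illustrative names - the fingerprint
would need to be something comparable across machines, such as a content
hash or the value the DAS recorded at the last sync:

    // Hypothetical: manifests map relative path -> fingerprint.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public final class ManifestDiff {

        /** Paths missing or stale on the instance, per the DAS manifest. */
        public static List<String> filesToSend(Map<String, String> das,
                                               Map<String, String> instance) {
            List<String> toSend = new ArrayList<String>();
            for (Map.Entry<String, String> e : das.entrySet()) {
                String have = instance.get(e.getKey());
                if (have == null || !have.equals(e.getValue())) {
                    toSend.add(e.getKey());   // missing or out of date
                }
            }
            // Paths present only on the instance would go in a separate
            // delete list; omitted here.
            return toSend;
        }
    }

Whether building and shipping the manifest is fast enough for a big
domain is exactly the kind of thing we could measure.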
>
> My understanding of our experience with v2 is that we don't expect such a
> simple approach to have sufficient performance.
>
> But I'd love to have some data to prove this one way or another.
>