
Re: rsync algorithm

From: Bill Shannon <bill.shannon_at_sun.com>
Date: Tue, 09 Mar 2010 20:48:00 -0800

Byron Nevins wrote on 03/09/2010 12:59 PM:
> http://rsync.samba.org/tech_report/

It's been a while since I've looked at this, so correct me if I'm wrong, but...

My understanding is that the big advantage of rsync over (e.g.) rdist is
that it's much more clever about figuring out the smallest amount of data
it needs to send when a file has changed.
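
For what it's worth, and purely from memory so the details may be off:
the trick is a cheap rolling checksum that lets one side find blocks the
other side already has without resending them. A rough sketch of the
idea (the class and names below are mine, not rsync's):

    // Sketch of the weak rolling checksum idea rsync uses to find
    // blocks the receiver already has.  Illustrative only; the real
    // thing differs in details (modulus, strong checksum, matching).
    public class RollingChecksum {
        private int a, b;               // running sums
        private final int blockLen;

        public RollingChecksum(byte[] block) {
            blockLen = block.length;
            for (int i = 0; i < blockLen; i++) {
                a += block[i] & 0xff;
                b += (blockLen - i) * (block[i] & 0xff);
            }
        }

        // Slide the window one byte: drop 'out', add 'in'.
        // O(1) per byte, which is what makes scanning a file cheap.
        public void roll(byte out, byte in) {
            a = a - (out & 0xff) + (in & 0xff);
            b = b - blockLen * (out & 0xff) + a;
        }

        public int value() {
            return (a & 0xffff) | ((b & 0xffff) << 16);
        }
    }

One side sends per-block checksums of the copy it already has; the other
rolls something like this over its copy and only transmits the bytes that
don't match any block.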

I don't think that's our problem.

(Yes, doing that would clearly make some cases *better*, but I don't think
it's what makes the performance of large clusters unacceptable.)

I think the issue is efficiently deciding whether *any* data needs to be
sent when a cluster instance starts up. The most common case has to be
that nothing needs to be sent. The second most common case is likely that
everything needs to be sent, because you're adding a new server to the
cluster.

Anyone have any reason to doubt that?
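
Assuming that's right, the check that matters most is the cheap "nothing
has changed" test. A rough sketch of making that a single exchange, by
summarizing the whole config tree in one digest (the class and details
are made up for illustration; real code would also have to handle
deletes, permissions, and so on):

    import java.io.File;
    import java.security.MessageDigest;
    import java.util.Arrays;

    // Sketch: one digest summarizing a directory tree, so an instance
    // can ask "am I up to date?" with a single request/response.
    public class TreeDigest {

        public static byte[] digest(File dir) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            update(md, dir, "");
            return md.digest();
        }

        private static void update(MessageDigest md, File dir, String prefix)
                throws Exception {
            File[] children = dir.listFiles();
            if (children == null)
                return;
            Arrays.sort(children);              // stable order => stable digest
            for (File f : children) {
                String name = prefix + f.getName();
                md.update(name.getBytes("UTF-8"));
                if (f.isDirectory()) {
                    update(md, f, name + "/");
                } else {
                    // Cheap version: hash size + mtime.  Hashing contents
                    // is safer but costs a full read of every file.
                    md.update(Long.toString(f.length()).getBytes("UTF-8"));
                    md.update(Long.toString(f.lastModified()).getBytes("UTF-8"));
                }
            }
        }
    }

The instance sends its digest to the DAS (or vice versa); if the digests
match, nothing else has to go over the wire.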

If that's true, we only need to consider something rsync-like, or rather
more rdist-like, that figures out what files are out of date and sends them.
Again, my understanding of rsync/rdist is that doing this over a large
directory involves many round trips. It's a *very* simple approach, but
if the performance is sufficient for what we're doing, we should definitely
consider it.
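
This is roughly what I mean by rdist-like, and why the round trips add
up: one query per file. Everything here is made up for illustration;
there is no such RemoteNode API in our code.

    import java.io.File;

    // Sketch of the rdist-style approach: ask about every file
    // individually.  The point is only that the number of round trips
    // grows with the number of files, not that this is how we'd code it.
    public class SimpleSync {

        interface RemoteNode {                      // hypothetical remote API
            Stat stat(String relativePath);         // one round trip per call
            void send(String relativePath, File contents);
        }

        static class Stat {
            long length;
            long lastModified;
        }

        public static void sync(File local, String path, RemoteNode node) {
            if (local.isDirectory()) {
                File[] children = local.listFiles();
                if (children == null)
                    return;
                for (File child : children)
                    sync(child, path + "/" + child.getName(), node);
                return;
            }
            Stat remote = node.stat(path);          // round trip, per file
            if (remote == null
                    || remote.length != local.length()
                    || remote.lastModified < local.lastModified()) {
                node.send(path, local);             // second trip, only if stale
            }
        }
    }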

My understanding of our experience with v2 is that we don't expect such a
simple approach to have sufficient performance.

But I'd love to have some data to prove this one way or another.