How big is a big cluster?

From: Tom Mueller <tom.mueller_at_oracle.com>
Date: Fri, 14 May 2010 14:55:15 -0500

I'm taking a look at issue 4357
<https://glassfish.dev.java.net/issues/show_bug.cgi?id=4357> which
references an original v3 PRD requirement
<http://wiki.glassfish.java.net/Wiki.jsp?page=V3CoreInfrastructureImprovements>
(CoreInfra-000) about the cluster size that needs to be supported. The
original request was for a cluster with 100 instances, but the
engineering response says "Yes for 50. Stretch goal 75. Need to define #
of clusters in the domain and the perf criteria for startup, etc. Also,
what is the expectation from HA side? We should have these two numbers
in synch."

A quick scan through the current sub-project documents and the
clustering design didn't reveal any numbers. (If I missed them, please
let me know.)

Clearly these numbers will depend on the hardware being used and the
performance criteria. However, I would like to pin down some criteria
that we can use to evaluate whether our cluster infrastructure
implementation is good enough to release in 3.1. Issue 4357 looks like a
good place to record the criteria.

How about this for a starting point?

   1. support at least 50 instances per domain in any number of clusters
      (1-50) and any number of cluster and instance configs (1-50) with
      at most TBD volume of applications and resources deployed to
      instances.
   2. support DAS start-up time that is no worse than twice the start-up
      time for the "development" profile (no instances) no matter how
      many instances, clusters, or configs are configured, up to the
      supported maximum
   3. support complete domain startup (all instances) requiring no
      synchronization in less than 2 minutes
   4. support dynamic reconfiguration command execution requiring no
      more than 5 times the time required for the command to execute on
      the DAS only. (Note: to support 50 instances, this means that the
      commands have to be executed in parallel across instances.)
   5. support execution time of all cluster infrastructure commands of
      no more than 1 minute.
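
To make the note on #4 concrete, here is a rough back-of-the-envelope
sketch. It is only a model under assumptions that are not part of the
criteria above: the command runs on the DAS first, each instance takes
about as long as the DAS does, and instances are handled in equal-sized
"waves". The function names and the workers parameter are hypothetical.

```python
import math

def replication_time(n_instances, workers, t_das, t_instance=None):
    """Estimated wall-clock time for one command: DAS execution followed
    by ceil(n_instances / workers) waves of instance executions.
    Assumption: each instance takes t_instance (defaults to t_das)."""
    if t_instance is None:
        t_instance = t_das
    waves = math.ceil(n_instances / workers)
    return t_das + waves * t_instance

def min_workers(n_instances, t_das, factor=5.0, t_instance=None):
    """Smallest degree of parallelism that keeps the estimate within
    factor * t_das, or None if even full parallelism is too slow."""
    budget = factor * t_das
    for w in range(1, n_instances + 1):
        if replication_time(n_instances, w, t_das, t_instance) <= budget:
            return w
    return None

# Sequential execution (workers=1) versus the 5x budget for 50 instances.
sequential = replication_time(50, 1, 1.0)
needed = min_workers(50, 1.0)
```

Under these assumptions, running the command on one instance at a time
costs about 51 times the DAS-only time, and at least 13 instances have
to be handled concurrently to stay within the 5x limit. Different
assumptions about per-instance cost would shift that number.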

Note the "TBD": how do we measure the volume of "stuff" that is
deployed to a cluster? This will have an effect on sync and startup
times. For example, if an instance has enough applications and other
resources deployed, could starting even a single instance take longer
than 1 minute, making #5 unachievable? Is there some standard benchmark
application load that could be used to define this?
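
Once a benchmark load is agreed on, checking a test run against the
fixed thresholds in #2, #3, and #5 is mechanical. The following is a
minimal sketch; the Measurements fields and the evaluate function are
hypothetical names, and the sample timings are made up purely to show
the shape of the check, not taken from any real run.

```python
from dataclasses import dataclass

@dataclass
class Measurements:
    """Hypothetical timings, in seconds, from one test run."""
    das_start_dev_profile: float   # DAS start, "development" profile, no instances
    das_start_full_config: float   # DAS start with all instances/clusters/configs
    domain_start_no_sync: float    # all instances started, no synchronization needed
    slowest_infra_command: float   # slowest cluster infrastructure command

def evaluate(m):
    """Map criterion number to True (met) or False (not met)."""
    return {
        2: m.das_start_full_config <= 2 * m.das_start_dev_profile,
        3: m.domain_start_no_sync < 120,
        5: m.slowest_infra_command <= 60,
    }

# Made-up example: a single slow instance start (75 s) fails #5 alone.
sample = Measurements(10.0, 18.0, 95.0, 75.0)
results = evaluate(sample)
```

In this made-up run, #2 and #3 pass while #5 fails, which is the
situation described above: a heavily loaded instance can break #5
regardless of how well the cluster infrastructure itself performs.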

To achieve these times, you can use any hardware available.

Clearly, I pulled many of these numbers out of thin air. Other
suggestions are welcome.

Thanks.
Tom

-- 
Tom Mueller | Principal Member of Technical Staff
Phone: +1 4029169943 | Fax: +1 4029169943 | Mobile: +1 4027206872
21915 Hillandale Dr | Elkhorn, NE 68022