I've written a Glassfish application around JPA in the interest of scaling
relatively expensive jobs to a compute cluster (~60 nodes and growing). The
idea is that Glassfish will sit on the master (or possibly the db) node,
dispatch work in discrete (large) bundles and receive the processed results
in many small packets per work unit.
For an average work-unit, the process works something like the following:
1. Client retrieves a 1.5 million row data set, containing some
pre-processed analytical data that needs further processing.
2. Client fires up a compute Thread and a send-results Thread. The result of
the computation is serialized and sent across the wire as 100,000 - 300,000
Objects, which are subsequently persisted to the database. The client is
designed to reduce its memory footprint and avoid network contention by
batching submissions into small chunks (64 Objects per submission) as they
are generated.
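For concreteness, the client side of step 2 could look roughly like the sketch below. This is not the actual client code: the class name BatchingSender, the queue capacity of 256 and the transport callback (which stands in for whatever remote call delivers a batch to GF) are all assumptions made for illustration. Only the two-thread layout and the batch size of 64 come from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Hypothetical sketch: a bounded hand-off between the compute Thread and the
// send-results Thread, shipping results in fixed-size batches.
final class BatchingSender<T> {
    // Private sentinel that tells the sender no more results will arrive.
    private static final Object DONE = new Object();

    private final BlockingQueue<Object> queue;
    private final int batchSize;
    // Assumption: stands in for the real remote call that submits a batch to GF.
    private final Consumer<List<T>> transport;

    BatchingSender(int capacity, int batchSize, Consumer<List<T>> transport) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.batchSize = batchSize;
        this.transport = transport;
    }

    // Called by the compute Thread. Blocks when the queue is full, which
    // bounds client memory and slows the producer if the server falls behind.
    void submit(T result) throws InterruptedException {
        queue.put(result);
    }

    // Called by the compute Thread once the work unit is finished.
    void finish() throws InterruptedException {
        queue.put(DONE);
    }

    // Body of the send-results Thread: drain the queue and ship full batches,
    // then ship whatever partial batch is left at the end.
    void runSender() throws InterruptedException {
        List<T> batch = new ArrayList<>(batchSize);
        while (true) {
            Object item = queue.take();
            if (item == DONE) {
                break;
            }
            @SuppressWarnings("unchecked")
            T result = (T) item;
            batch.add(result);
            if (batch.size() == batchSize) {
                transport.accept(batch);
                batch = new ArrayList<>(batchSize);
            }
        }
        if (!batch.isEmpty()) {
            transport.accept(batch);
        }
    }

    // Small driver: produces totalResults integers and records the size of
    // every batch the sender hands to the (fake) transport.
    static List<Integer> demo(int totalResults) {
        // Safe as a plain ArrayList: only the sender thread writes to it, and
        // the caller reads it after join().
        List<Integer> batchSizes = new ArrayList<>();
        BatchingSender<Integer> sender =
                new BatchingSender<>(256, 64, batch -> batchSizes.add(batch.size()));
        Thread sendThread = new Thread(() -> {
            try {
                sender.runSender();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "send-results");
        sendThread.start();
        try {
            for (int i = 0; i < totalResults; i++) {
                sender.submit(i); // the compute step would go here
            }
            sender.finish();
            sendThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return batchSizes;
    }
}
```

Calling BatchingSender.demo(200) hands the transport three batches of 64 followed by one of 8. The bounded queue is the relevant design choice here: when the server cannot keep up, put() blocks and the compute Thread goes to sleep rather than piling up results in memory.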
So far, so good. My test setup is to run GF on the same (DEVL) computer as
the client programs are running on. This works well enough for development:
Give GF a gig of RAM and 1 CPU, and then run 2 clients with 1/2 gig and 25%
of CPU 2 each. The database (pgsql) sits on another, heftier machine
serving the connection pool. Throughput is slow, but sufficient insofar as I
can get the DEVL changes through quickly enough. GF appears to be the
bottleneck, since it's topped out at 99% CPU most of the time while the
compute Threads bumble around at ~15% and go in and out of sleep.
So, the question is: How do I approach optimizing the process so that the
compute threads (on the cluster) stay as busy as possible and the client
results->GF->JDBC->database pipeline does not become a bottleneck relative
to the available work?
One possibility is to put GF on the DB server (which ultimately does not
need that many resources for writing a bunch of records to the end of a file
somewhere). Then I just twiddle with the size of my connection pool and hope
for the best.
Alternatively, I put GF on the master (which does have some background
responsibilities related to the running of the cluster). Or run two GFs and
load-balance, using two connection pools along the way?
--
========================================================
Jason Nerothin
Programmer/Analyst IV - Database Administration
UCLA-DOE Institute for Genomics & Proteomics
Howard Hughes Medical Institute
========================================================
611 C.E. Young Drive East | Tel: (310) 206-3907
105 Boyer Hall, Box 951570 | Fax: (310) 206-3914
Los Angeles, CA 90095. USA | Mail: jason_at_mbi.ucla.edu
========================================================
http://www.mbi.ucla.edu/~jason
========================================================