Turkcell Performance Report
Author: Albert Tam albert.tam@oracle.com
Over the course of one and a
half days we successfully managed to increase overall throughput in the Service
Broker system from 10.91 tps (transactions per second) to 29.27 tps; this
increase was realized by applying a series of engine code patches and
fine-tuning the timer values (3-tuple of polling timer, submit timer and
poolsize) of the fetcher processes.
The type of code patches
applied can be categorized in the following manner:
Although we did not have the
opportunity to break down the detailed effects of the patches by category, we
postulate that the majority of the throughput effects were achieved by limiting
the size and number of objects created in the JVM. More specifically, the Stopwatch [12] and
XPathUtil [16] patches cummulatively contributed to an
increase of 8.46 tps. We have noticed
at other customer installations that garbage collection performance has an
extremely large impact on JVM performance, due in part to the increased
frequency in minor and major (stop-the-world) collections. Under high input load conditions, even a
brief major collection can cause a drop in outflow coupled with an increasing
backlog on the inflow side; equilibrium between inflow and outflow in this case
is not possible.
The removal of several
synchronization points (patches [1], [10]
and [15]) helped to increase overall throughput, albeit
not at the same rate as the object size patches. Of interest here is patch [10] – the removal of a synchronized clone() call in the
WSIF layer – which could account for the observation that Accenture made
regarding the performance of WSIF versus DeliveryService API calls in outbound
partnerlink invocations.
The cummulative changes in
configuration helped to increase throughput a total of 4.53 tps. The changes in configuration were guided by
empirical observations of the slope of the eden and old generations in the
visualGC tool. “Choppy” or
“staircase-like” eden slopes typically indicate lulls in processing or spare
cpu cycles; smooth eden slopes indicate full utilization of all threads in the JVM. The goal during these runs was to find a
timer 3-tuple value that resulted in a smooth eden slope and a slow growing old
generation (indicating requests started in eden completed in eden).
|
Description |
tps |
% change in tps |
|
|
|
|
|
Baseline |
10.91 |
-- |
|
15 |
37.5 |
|
|
Configuration |
18.5 |
23 |
|
19.13 |
3.4 |
|
|
Patch [16] |
23.5 |
22.8 |
|
New processes, no call to LogService |
28.24 |
20.2 |
|
Configuration |
29.27 |
3.6 |
The net result of all these
changes was an overall throughput increase of 18.36 tps or 168% from the baseline
of 10.91 tps.
Deployed new DB adapter,
which included fix to remove synchronization in acquire session code in toplink
layer [1]. 9.85
tps, down from 10.91 tps.
Moved to oc4j managed
connection pool (instead of toplink connection pool), still at 50
max-connections. 10.11 tps.
Removed MaxTenuringThreshold JVM parameter, was 0. This parameter specifies how many collections an object can remain
in the young (eden) generation, after which it is promoted to the old
generation. The old generation is
growing steadily with each minor collection, which could either mean there is
not enough space in eden after the collection (in which case objects are
promoted to make room in eden) or the objects are promoted immediately because
of the MaxTenuringThreshold value. In
any event, the run produced a tps of 9.813.
Moved RTPE process to
mtsopa05 (4 dual-core CPU with 32GB RAM). This configuration produced a tps of
26.05. This setup is utilizing 12 CPUs
rather than 4 and the corresponding increase in throughput is 138%; not quite
linear considering the number of CPUs has increased 3-fold.
Increased the size of
max-connections in QUEUE and TOMDB connection pools to 100 from 50. The total number of listener threads is
100. tps is 25.45.
Moved RTPE, BC and setQueue
to mtsopa09, with the fetcher processes on mtsopa05. If threading is the issue then this configuration should result
in a tps value close to the previous test. Unfortunately, throughput was
measured at 10.5 tps.
Moved all processes back to
mtsopa09. Edwin, Glenn & Muruga
discovered that removing StopWatch calls from the requests results in a 20%
improvement in performance [12]. The way the keys concatenated for each
start() call in StopWatch means that the amount of memory consumed for recent
runs (which are using StopWatch heavily) will be huge. Other patches installed for this run
include:
Run produced tps of ~15 with
settings of 550/50/10.
Changed the settings to
550/25/20 – this produced throughput of ~17 tps. The old generation increased steadily during this run, implying
the number of objects created for inflow messages was too large for the outflow
rate; we need to increase the time between fetcher injections to allow the
engine to clear the previous batch.
Changed settings to 1000,
30, 15 – this produced throughput of ~14 tps. The old generation was very
stable during this run, increasing ever so slightly. Eden growth was “stair-case” like, implying that there were
periods of low cpu utilization.
Changed settings to 750, 30,
15 – this run produced throughput of ~16 tps. The old generation was stable,
with minimal growth. Eden growth is
still not smooth; there are still several small pauses visible in the slope.
Changed settings to 750, 30,
17 – this run produced throughput of ~17.5 tps. The eden slope is smooth with few objects surviving minor
collection (judging by the size of the survivor spaces) and old generation is stable.
Changed settings to 750, 30,
20 – this run produced throughput of ~18 tps. The eden slope is smooth with the
old generation growing very slightly.
Changed settings to 1000,
30, 25 – this run produced throughput of ~18 tps. The eden slope is “staircase-like”
with the old generation slowly growing.
Changed settings to 1000,
30, 30 – this run produced throughput of ~18.5 tps. The eden slope is smooth with the old generation growing at a
slow but steady pace. The old generation
eventually fills up but after a major collection comes back down to its initial
levels.
Changed settings to 1250,
30, 40 – this run produced throughput of ~18.5 tps. The eden slope is smooth with old generation again growing at a
slow but steady pace. Looks like we’ve hit
a plateau here.
Thread dump analysis shows
that there are quite a few threads pausing at Throwable.fillInStackTrace() from
the creation of the BaseScope object. Constructing a stack trace is a fairly
expensive operation as the JVM needs to pause processing to traverse the call
stack … doing this multiple times per request (as we create multiple Scope
objects) is even worse.
For the next run we remove
Throwable from BaseScope (BaseScope was extending Throwable) [13]. We also remove several logMessage() calls
from the toplink layer [15], which end up using
ResourceBundles to lookup the a localized message string. It turns out that ResourceBundles cache the
message strings using a SoftCache, which is static synchronized. These two changes take us to 19.13 tps. The eden slope is smooth and the old
generation is steady.
The invocation message
priority patch [14] – which allows a caller process to
specify the priority of the callee invocation message via a partnerLinkBinding
property – is deployed. The result is
~18.5 tps, which is below the previous run.
Invocation message priority is probably little use here since the engine
is capable of completing the entire message batch before the next batch is
injected into the system.
The priority patch is
removed and a fix to XPathUtil [16] is deployed. The
XPathUtil change consists of the removal of several debug messages (which cause
several large objects to have their toString() method called). With a setting
of 1250, 25, 45 the resulting throughput is ~21 tps. The eden slope is smooth and the old generation appears to have
no growth. A 500000 record batch
overnight is run, with a sustained throughput of 23.5 tps recorded. The
throughput appears to increase to its peak rate 10-15 minutes after the start
of the run.
Thread dump analysis shows
several threads marshalling XML to string format. The call to LogService in RTPE always prepares variables to be
logged even if the log level does not warrant it. In order to skip unnecessary object creation the call to new
WriteLog().addLog() is commented out completely; we did add the method canLog()
to the class com.accenture.sdp.logging.WriteLog which should be called prior to
the addLog() call, however in the interests of time we did not deploy the
change.
Additional changes included
switching over to the new SPM (revision 7.0.1) and RTPE (revision 8.0.1)
processes. These new revisions contain
the following changes:
Executing these two changes
with timer settings of 1250, 25, 45 give us 28.24 tps. The eden slope is not smooth and the old
generation is growing very very slowly; eden GCs seem to occur every 25
seconds. These JVM memory profile
observations indicate we have reduce the amount of memory consumption per
request and can probably increase the input rates.
For the next run we increase
the listener threads from 50 to 60 (dspMaxThreads is increased to 60 as well)
and the QUEUE and TOMDB connection pools are increased to 70. With a setting of 1250, 20, 50 we get a
sustained tps of 29.27. The eden slope
is smooth and constant; the old generation growth is very very slow; eden GCs
occur every 25 seconds on average.
[1]
DeliveryPersistPolicy – Enables setting of deliveryPersistPolicy from the
bpel.xml of a process (as opposed to setting in the domain.xml).
[2]
off.immediate – Allows for delivery of messages received via
post() to be dispatched immediately by the same thread without being
asynchronously delivered to the delivery layer.
[3]
memory leak – Fixed leak in DOM4J layer, cache was keeping soft
references to ICubeContext.
[4]
load balancing – Fixed authentication across multiple jvms;
previous behavior would force authentication for each jvm.
[5]
Turkish character – Fixed problem in XDK layer processing XML data
containing multi-byte characters.
[6]
SOAP shortcut – Fixed problem with SOAP shortcut not being taken
for consecutive partnerlink invocations.
[7]
Sensors – Resolved problem with multiple revisions of a
process with identical sensor configurations.
[8]
Dispatch recovery – Disable automatic recovery on engine restart.
[9]
DBAdapter acquire
session – Removed synchronization
from toplink layer for acquireSession call.
[10]
WSIF – Removed SAX reader synchronization (for WSIF
context clone).
[11]
AXIS – Removed extra clone() call (should not have effect
for co-located processes).
[12]
Stopwatch – Disabled stopwatch statistics gathering; current
implementation creates many Strings internally which contributes heavily to
memory consumption.
[13]
BaseScope throwable – Changed BaseScope implementation to not extend
java.lang.Throwable.
[14]
Partnerlink priority – Allows process to specify priority for outbound
invocation messages for BPEL processes.
[15]
Synchronization in
toplink logging – Removed logging
calls in toplink library that forced access to java.util.ResourceBundle (which
contains a static synchronized member).
[16]
XPathUtil debug – Removed several extraneous debug messages from
XPathUtil class (which caused several large objects to have toString() called).
|
dspMaxT |
Connections |
WorkerBean |
Polling Timer |
Submit Timer |
Poolsize |
tps |
Comments |
|
|
|
|
|
|
|
|
|
|
80 |
100/100/100 |
80 |
550 |
25 |
20 |
9.85 |
Patch [1]
(DB adapter) |
|
|
|
|
|
|
|
|
|
|
80 |
100/100/100 |
80 |
550 |
25 |
20 |
9.8 |
MaxTenuringThreshold = 0 |
|
80 |
50/50/100 |
80 |
550 |
25 |
20 |
26.05 |
Moved RTPE to mtsopa05 |
|
80 |
100/100/100 |
80 |
550 |
25 |
20 |
25.45 |
Changed QUEUE/TOMDB max
connections |
|
80 |
50/50/100 |
80 |
550 |
25 |
20 |
10.5 |
Moved RTPE to mtsopa09;
fetcher processes on mtsopa05 |
|
|
|
|
|
|
|
|
|
|
50 |
50/50/100 |
50 |
550 |
50 |
10 |
15 |
|
|
50 |
50/50/100 |
50 |
550 |
25 |
20 |
17 |
Old generation increasing
steadily |
|
50 |
50/50/100 |
50 |
1000 |
30 |
15 |
14 |
Stable old generation;
staircase eden growth |
|
50 |
50/50/100 |
50 |
750 |
30 |
15 |
16 |
Stable old generation;
staircase eden growth |
|
50 |
50/50/100 |
50 |
750 |
30 |
17 |
17.5 |
Smooth eden growth |
|
50 |
50/50/100 |
50 |
750 |
30 |
20 |
18 |
Smooth eden growth |
|
50 |
50/50/100 |
50 |
1000 |
30 |
25 |
18 |
Staircase eden |
|
50 |
50/50/100 |
50 |
1000 |
30 |
30 |
18.5 |
Smooth eden growth; old
generating growing steadily |
|
50 |
50/50/100 |
50 |
1250 |
30 |
40 |
18.5 |
Smooth eden growth; old
generating growing steadily |
|
50 |
50/50/100 |
50 |
1250 |
30 |
40 |
19.13 |
Patches [13]
and [15] (toplink logging synchronization and BaseScope
Throwable removal) |
|
50 |
50/50/100 |
50 |
1250 |
30 |
40 |
18.5 |
Patch [14]
(partnerlink priority) |
|
50 |
50/50/100 |
50 |
1250 |
25 |
45 |
21 |
|
|
50 |
50/50/100 |
50 |
1250 |
25 |
45 |
23.5 |
Overnight run of 500000
records |
|
50 |
50/50/100 |
50 |
1250 |
25 |
45 |
28.24 |
Change to new SPM, RTPE
processes; disable call to LogService |
|
60 |
70/70/100 |
60 |
1250 |
25 |
50 |
29.27 |
Change listener threads to
60 |
|
|
|
|
|
|
|
|
|