Turkcell Performance Report

 

Author: Albert Tam albert.tam@oracle.com

1.1        Summary of changes

 

Over the course of one and a half days we successfully managed to increase overall throughput in the Service Broker system from 10.91 tps (transactions per second) to 29.27 tps; this increase was realized by applying a series of engine code patches and fine-tuning the timer values (3-tuple of polling timer, submit timer and poolsize) of the fetcher processes.

 

The type of code patches applied can be categorized in the following manner:

 

Although we did not have the opportunity to break down the detailed effects of the patches by category, we postulate that the majority of the throughput effects were achieved by limiting the size and number of objects created in the JVM.  More specifically, the Stopwatch [12] and XPathUtil [16] patches cummulatively contributed to an increase of 8.46 tps.  We have noticed at other customer installations that garbage collection performance has an extremely large impact on JVM performance, due in part to the increased frequency in minor and major (stop-the-world) collections.  Under high input load conditions, even a brief major collection can cause a drop in outflow coupled with an increasing backlog on the inflow side; equilibrium between inflow and outflow in this case is not possible.

 

The removal of several synchronization points (patches [1], [10] and [15]) helped to increase overall throughput, albeit not at the same rate as the object size patches. Of interest here is patch [10] – the removal of a synchronized clone() call in the WSIF layer – which could account for the observation that Accenture made regarding the performance of WSIF versus DeliveryService API calls in outbound partnerlink invocations.

 

The cummulative changes in configuration helped to increase throughput a total of 4.53 tps.  The changes in configuration were guided by empirical observations of the slope of the eden and old generations in the visualGC tool.  “Choppy” or “staircase-like” eden slopes typically indicate lulls in processing or spare cpu cycles; smooth eden slopes indicate full utilization of all threads in the JVM.  The goal during these runs was to find a timer 3-tuple value that resulted in a smooth eden slope and a slow growing old generation (indicating requests started in eden completed in eden).

 

Description

tps

% change in tps

 

 

 

Baseline

10.91

--

Patches [10], [11], [12]

15

37.5

Configuration

18.5

23

Patches [13], [15]

19.13

3.4

Patch [16]

23.5

22.8

New processes, no call to LogService

28.24

20.2

Configuration

29.27

3.6

 

The net result of all these changes was an overall throughput increase of 18.36 tps or 168% from the baseline of 10.91 tps.

 

1.2        Detailed description of changes applied

1.2.1      8 Jun 2006

 

Deployed new DB adapter, which included fix to remove synchronization in acquire session code in toplink layer [1].  9.85 tps, down from 10.91 tps.

 

Moved to oc4j managed connection pool (instead of toplink connection pool), still at 50 max-connections.  10.11 tps.

 

Removed MaxTenuringThreshold JVM parameter, was 0.  This parameter specifies how many collections an object can remain in the young (eden) generation, after which it is promoted to the old generation.  The old generation is growing steadily with each minor collection, which could either mean there is not enough space in eden after the collection (in which case objects are promoted to make room in eden) or the objects are promoted immediately because of the MaxTenuringThreshold value.  In any event, the run produced a tps of 9.813.

 

Moved RTPE process to mtsopa05 (4 dual-core CPU with 32GB RAM). This configuration produced a tps of 26.05.  This setup is utilizing 12 CPUs rather than 4 and the corresponding increase in throughput is 138%; not quite linear considering the number of CPUs has increased 3-fold.

 

Increased the size of max-connections in QUEUE and TOMDB connection pools to 100 from 50.  The total number of listener threads is 100.  tps is 25.45.

 

Moved RTPE, BC and setQueue to mtsopa09, with the fetcher processes on mtsopa05.  If threading is the issue then this configuration should result in a tps value close to the previous test. Unfortunately, throughput was measured at 10.5 tps.

 

1.2.2      9 Jun 2006

 

Moved all processes back to mtsopa09.  Edwin, Glenn & Muruga discovered that removing StopWatch calls from the requests results in a 20% improvement in performance [12].  The way the keys concatenated for each start() call in StopWatch means that the amount of memory consumed for recent runs (which are using StopWatch heavily) will be huge.  Other patches installed for this run include:

Run produced tps of ~15 with settings of 550/50/10.

 

Changed the settings to 550/25/20 – this produced throughput of ~17 tps.  The old generation increased steadily during this run, implying the number of objects created for inflow messages was too large for the outflow rate; we need to increase the time between fetcher injections to allow the engine to clear the previous batch.

 

Changed settings to 1000, 30, 15 – this produced throughput of ~14 tps. The old generation was very stable during this run, increasing ever so slightly.  Eden growth was “stair-case” like, implying that there were periods of low cpu utilization.

 

Changed settings to 750, 30, 15 – this run produced throughput of ~16 tps. The old generation was stable, with minimal growth.  Eden growth is still not smooth; there are still several small pauses visible in the slope.

 

Changed settings to 750, 30, 17 – this run produced throughput of ~17.5 tps.  The eden slope is smooth with few objects surviving minor collection (judging by the size of the survivor spaces) and old generation is stable.

 

Changed settings to 750, 30, 20 – this run produced throughput of ~18 tps. The eden slope is smooth with the old generation growing very slightly.

 

Changed settings to 1000, 30, 25 – this run produced throughput of ~18 tps. The eden slope is “staircase-like” with the old generation slowly growing.

 

Changed settings to 1000, 30, 30 – this run produced throughput of ~18.5 tps.  The eden slope is smooth with the old generation growing at a slow but steady pace.  The old generation eventually fills up but after a major collection comes back down to its initial levels.

 

Changed settings to 1250, 30, 40 – this run produced throughput of ~18.5 tps.  The eden slope is smooth with old generation again growing at a slow but steady pace.  Looks like we’ve hit a plateau here.

 

Thread dump analysis shows that there are quite a few threads pausing at Throwable.fillInStackTrace() from the creation of the BaseScope object. Constructing a stack trace is a fairly expensive operation as the JVM needs to pause processing to traverse the call stack … doing this multiple times per request (as we create multiple Scope objects) is even worse.

 

For the next run we remove Throwable from BaseScope (BaseScope was extending Throwable) [13].  We also remove several logMessage() calls from the toplink layer [15], which end up using ResourceBundles to lookup the a localized message string.  It turns out that ResourceBundles cache the message strings using a SoftCache, which is static synchronized.  These two changes take us to 19.13 tps.  The eden slope is smooth and the old generation is steady.

 

The invocation message priority patch [14] – which allows a caller process to specify the priority of the callee invocation message via a partnerLinkBinding property – is deployed.  The result is ~18.5 tps, which is below the previous run.  Invocation message priority is probably little use here since the engine is capable of completing the entire message batch before the next batch is injected into the system.

 

The priority patch is removed and a fix to XPathUtil [16] is deployed. The XPathUtil change consists of the removal of several debug messages (which cause several large objects to have their toString() method called). With a setting of 1250, 25, 45 the resulting throughput is ~21 tps.  The eden slope is smooth and the old generation appears to have no growth.  A 500000 record batch overnight is run, with a sustained throughput of 23.5 tps recorded. The throughput appears to increase to its peak rate 10-15 minutes after the start of the run.

 

1.2.3      10 Jun 2006

 

Thread dump analysis shows several threads marshalling XML to string format.  The call to LogService in RTPE always prepares variables to be logged even if the log level does not warrant it.  In order to skip unnecessary object creation the call to new WriteLog().addLog() is commented out completely; we did add the method canLog() to the class com.accenture.sdp.logging.WriteLog which should be called prior to the addLog() call, however in the interests of time we did not deploy the change.

 

Additional changes included switching over to the new SPM (revision 7.0.1) and RTPE (revision 8.0.1) processes.  These new revisions contain the following changes:

 

Executing these two changes with timer settings of 1250, 25, 45 give us 28.24 tps.  The eden slope is not smooth and the old generation is growing very very slowly; eden GCs seem to occur every 25 seconds.  These JVM memory profile observations indicate we have reduce the amount of memory consumption per request and can probably increase the input rates.

 

For the next run we increase the listener threads from 50 to 60 (dspMaxThreads is increased to 60 as well) and the QUEUE and TOMDB connection pools are increased to 70.  With a setting of 1250, 20, 50 we get a sustained tps of 29.27.  The eden slope is smooth and constant; the old generation growth is very very slow; eden GCs occur every 25 seconds on average.

 

1.3        Patches

 

[1]    DeliveryPersistPolicy – Enables setting of deliveryPersistPolicy from the bpel.xml of a process (as opposed to setting in the domain.xml).

[2]    off.immediate – Allows for delivery of messages received via post() to be dispatched immediately by the same thread without being asynchronously delivered to the delivery layer.

[3]    memory leak – Fixed leak in DOM4J layer, cache was keeping soft references to ICubeContext.

[4]    load balancing – Fixed authentication across multiple jvms; previous behavior would force authentication for each jvm.

[5]    Turkish character – Fixed problem in XDK layer processing XML data containing multi-byte characters.

[6]    SOAP shortcut – Fixed problem with SOAP shortcut not being taken for consecutive partnerlink invocations.

[7]    Sensors – Resolved problem with multiple revisions of a process with identical sensor configurations.

[8]    Dispatch recovery – Disable automatic recovery on engine restart.

[9]    DBAdapter acquire session – Removed synchronization from toplink layer for acquireSession call.

[10]           WSIF – Removed SAX reader synchronization (for WSIF context clone).

[11]           AXIS – Removed extra clone() call (should not have effect for co-located processes).

[12]           Stopwatch – Disabled stopwatch statistics gathering; current implementation creates many Strings internally which contributes heavily to memory consumption.

[13]           BaseScope throwable – Changed BaseScope implementation to not extend java.lang.Throwable.

[14]           Partnerlink priority – Allows process to specify priority for outbound invocation messages for BPEL processes.

[15]           Synchronization in toplink logging – Removed logging calls in toplink library that forced access to java.util.ResourceBundle (which contains a static synchronized member).

[16]           XPathUtil debug – Removed several extraneous debug messages from XPathUtil class (which caused several large objects to have toString() called).

 

1.4        Appendix

 

dspMaxT

Connections
(QUEUE/TOMDB/orabpel)

WorkerBean

Polling Timer

Submit Timer

Poolsize

tps

Comments

 

 

 

 

 

 

 

 

80

100/100/100

80

550

25

20

9.85

Patch [1] (DB adapter)

 

 

 

 

 

 

 

 

80

100/100/100

80

550

25

20

9.8

MaxTenuringThreshold = 0

80

50/50/100

80

550

25

20

26.05

Moved RTPE to mtsopa05

80

100/100/100

80

550

25

20

25.45

Changed QUEUE/TOMDB max connections

80

50/50/100

80

550

25

20

10.5

Moved RTPE to mtsopa09; fetcher processes on mtsopa05

 

 

 

 

 

 

 

 

50

50/50/100

50

550

50

10

15

Patches [10], [11], [12]

50

50/50/100

50

550

25

20

17

Old generation increasing steadily

50

50/50/100

50

1000

30

15

14

Stable old generation; staircase eden growth

50

50/50/100

50

750

30

15

16

Stable old generation; staircase eden growth

50

50/50/100

50

750

30

17

17.5

Smooth eden growth

50

50/50/100

50

750

30

20

18

Smooth eden growth

50

50/50/100

50

1000

30

25

18

Staircase eden

50

50/50/100

50

1000

30

30

18.5

Smooth eden growth; old generating growing steadily

50

50/50/100

50

1250

30

40

18.5

Smooth eden growth; old generating growing steadily

50

50/50/100

50

1250

30

40

19.13

Patches [13] and [15] (toplink logging synchronization and BaseScope Throwable removal)

50

50/50/100

50

1250

30

40

18.5

Patch [14] (partnerlink priority)

50

50/50/100

50

1250

25

45

21

Deploy patch [16]; remove patch [14]

50

50/50/100

50

1250

25

45

23.5

Overnight run of 500000 records

50

50/50/100

50

1250

25

45

28.24

Change to new SPM, RTPE processes; disable call to LogService

60

70/70/100

60

1250

25

50

29.27

Change listener threads to 60