Hi Dan,
interesting observations! Try playing with the selector count: double it
for both the blocking and non-blocking cases, just to compare peak TPS.
You may also want to try different payload sizes.
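Something like this is where those knobs live (just a rough sketch
against the Grizzly 2.x builder API; the selector count is the thing to
vary between runs, and the counts themselves are placeholders):

    import org.glassfish.grizzly.nio.transport.TCPNIOTransport;
    import org.glassfish.grizzly.nio.transport.TCPNIOTransportBuilder;
    import org.glassfish.grizzly.strategies.SameThreadIOStrategy;

    // Non-blocking case: everything runs on the selector threads.
    // For the blocking case swap in WorkerThreadIOStrategy plus a worker
    // ThreadPoolConfig sized to the worker-thread count you want to test.
    static TCPNIOTransport buildTransport(final int selectorCount) {
        return TCPNIOTransportBuilder.newInstance()
                .setSelectorRunnersCount(selectorCount)
                .setIOStrategy(SameThreadIOStrategy.getInstance())
                .build();
    }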
A thread context switch is a relatively expensive operation, as are I/O
operations, but when you run load tests the two can compensate for each
other across different cases: for example you make more thread context
switches, but somehow it leads to fewer I/O ops...
Regarding the direct memory usage - it's expected, because Grizzly (the
JDK does the same) stores a thread-local direct ByteBuffer for read/write
operations, so more threads means more direct ByteBuffers. We hold them
in weak references, so they should be recycled at some point, but
still...
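The pattern is roughly this, just an illustrative sketch and not the
actual Grizzly code:

    import java.lang.ref.WeakReference;
    import java.nio.ByteBuffer;

    // Illustrative only: each thread lazily caches a direct ByteBuffer
    // behind a weak reference, so it can be reclaimed eventually, but
    // while N threads actively do I/O you still pay for N direct buffers.
    final class ThreadLocalDirectBuffer {
        private static final ThreadLocal<WeakReference<ByteBuffer>> CACHE =
                new ThreadLocal<WeakReference<ByteBuffer>>();

        static ByteBuffer get(final int size) {
            final WeakReference<ByteBuffer> ref = CACHE.get();
            ByteBuffer buffer = (ref != null) ? ref.get() : null;
            if (buffer == null || buffer.capacity() < size) {
                buffer = ByteBuffer.allocateDirect(size);
                CACHE.set(new WeakReference<ByteBuffer>(buffer));
            }
            buffer.clear();
            return buffer;
        }
    }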
Thanks.
WBR,
Alexey.
On 19.04.15 05:50, Daniel Feist wrote:
> Hi,
>
> Sorry, false alarm. There was a stupid bug in my code (between
> inbound and outbound) that was causing a deadlock in some of the
> selectors and it was this that was producing the timeouts and errors
> :-( Fixed now though..
>
> I've now been able to run a full set of tests at different target
> service latencies and concurrencies and it's running very well.
>
> Observations are:
> - With high latency (e.g. 1000ms) blocking/non-blocking perform the
> same. Of course blocking needs 1 thread per client thread, but giving
> the proxy a maxWorkerThreads of 10,000 just in case doesn't cause any
> adverse performance impact; it just doesn't use the threads.
> - With low latency (e.g. 0->5ms) blocking is faster, but not by much.
> The number of worker threads is crucial in this case though: with more
> worker threads than required to reach peak TPS, I start to see a
> degradation in TPS/latency.
> - With medium latency (e.g. 50ms) it appears that non-blocking is
> slightly faster, at least at higher concurrencies.
>
> Initially I was expecting to see more negative effects of having, say,
> 4000 worker threads from context-switching etc., but this causes
> minimal impact at low latencies and none at all at high latencies.
>
> One other interesting side effect of having thousands of worker threads
> rather than 24 selectors is the amount of direct memory used. I'm
> limiting buffer size via system properties of course, but if I wasn't,
> 4000 worker threads on the hardware I'm using (which reports a 16MB
> buffer size to Java) would require 125GB of direct memory vs 0.75GB,
> and that's just for the read buffer. My calculations might not be
> perfect but you get the idea..
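> Back-of-the-envelope, the kind of sums I'm doing (counting a read plus
> a write buffer per thread here, which is an assumption on my part):
>
>     long bufferBytes = 16L * 1024 * 1024;   // ~16MB, as reported to Java
>     long perThread   = 2 * bufferBytes;     // read + write buffer (assumption)
>     System.out.printf("4000 workers : %.2f GB%n", 4000 * perThread / Math.pow(2, 30));
>     System.out.printf("24 selectors : %.2f GB%n",   24 * perThread / Math.pow(2, 30));
>     // prints roughly 125.00 GB vs 0.75 GB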
>
> This is just an FYI, but if there is anything you think is strange,
> it'd be interesting to know..
>
> thanks!
>
> Dan
>
>
>
> On Fri, Apr 17, 2015 at 1:56 AM, Oleksiy Stashok
> <oleksiy.stashok_at_oracle.com> wrote:
>> Hi Dan,
>>
>> everything is possible, and maybe back pressure caused by blocking I/O
>> really does make the difference...
>> If you have time it would be interesting to investigate this more and
>> try to check whether you can catch any "lost" or "forgotten" requests
>> in your app. Try to dump all the request/response processing timestamps
>> to figure out where exactly the processing takes the most time and at
>> what stage the timeout occurs: jmeter (1)--> proxy (2)--> backend
>> (3)--> proxy (4)--> jmeter. According to your description it should be
>> either 3 or 4, but it would be interesting to see exactly how it happens.
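>> Even something as crude as this per-request record would do; just a
>> sketch, and the names here are made up:
>>
>>     // Stamp the four points the proxy sees for each request and log
>>     // the deltas, so a timeout can be pinned to a specific hop.
>>     final class RequestTrace {
>>         long inFromClient, outToBackend, inFromBackend, outToClient;
>>
>>         void log(final String requestId) {
>>             System.out.println(requestId
>>                 + " in-proxy(before): " + ms(outToBackend - inFromClient)
>>                 + " backend:          " + ms(inFromBackend - outToBackend)
>>                 + " in-proxy(after):  " + ms(outToClient - inFromBackend));
>>         }
>>
>>         private static String ms(final long nanos) {
>>             return (nanos / 1_000_000) + "ms";
>>         }
>>     }
>>     // e.g. trace.inFromClient = System.nanoTime(); ... trace.log(id);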
>>
>> Thanks.
>>
>> WBR,
>> Alexey.
>>
>>
>> On 16.04.15 17:24, Daniel Feist wrote:
>>> Ignore my last email about this affecting low concurrency; it doesn't.
>>> I was only seeing some errors at low concurrency due to side-effects
>>> of the previous test run, I think. I need 2000+ JMeter client threads
>>> to reproduce this consistently.
>>>
>>> I stripped everything out as much as possible so I'm not doing
>>> anything in between and AHC is invoking the inbound Grizzly response
>>> as directly as possible, but no difference. The exact error in JMeter
>>> is "java.net.SocketTimeoutException,Non HTTP response message: Read
>>> timed out".
>>>
>>> Question: this might sound stupid, but couldn't it simply be that the
>>> proxy, with the number of selectors it has (and not using worker
>>> threads), just cannot handle the load? And that we don't see errors
>>> with blocking because back-pressure is applied more directly, whereas
>>> with non-blocking the same type of back-pressure doesn't occur and so
>>> we get this type of error instead?
>>>
>>> Dan
>>>
>>> On Thu, Apr 16, 2015 at 10:16 PM, Daniel Feist <dfeist_at_gmail.com> wrote:
>>>> The thing is, if I remove the outbound call then it ceases to be a
>>>> proxy, and as such I don't have a separate thread processing the
>>>> response callback; instead it behaves as blocking (which works).
>>>>
>>>> Anyway, I'll try to simplify as much as possible in other ways and see
>>>> where that leads me...
>>>>
>>>> Dan
>>>>
>>>> On Thu, Apr 16, 2015 at 9:00 PM, Oleksiy Stashok
>>>> <oleksiy.stashok_at_oracle.com> wrote:
>>>>> Hi Dan,
>>>>>
>>>>> let's try to simplify the test: what happens if the proxy sends the
>>>>> response right away (no outbound calls)? Do you still see the
>>>>> timeouts?
>>>>>
>>>>> Thanks.
>>>>>
>>>>> WBR,
>>>>> Alexey.
>>>>>
>>>>>
>>>>> On 16.04.15 12:17, Daniel Feist wrote:
>>>>>> What I forgot to add is that I see the same issue with timeouts
>>>>>> between jmeter and the proxy even when "jmeter threads < selectors",
>>>>>> which kind of invalidates all of my ideas about selectors all
>>>>>> potentially being busy..
>>>>>>
>>>>>> Wow, even with 1 thread it's occurring.. must be something stupid... I
>>>>>> don't think it's related to persistent connections; maxKeepAlive on the
>>>>>> target service is 100, which wouldn't explain roughly 1 in 2000
>>>>>> client-side timeouts, especially given no errors are being logged.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 16, 2015 at 7:36 PM, Daniel Feist <dfeist_at_gmail.com> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Nothing different really, just that the blocking version returns the
>>>>>>> response when the stack returns after waiting on the outbound future
>>>>>>> returned by AHC, while the non-blocking version returns the response
>>>>>>> when the completion handler passed to AHC is invoked. Ah, also the
>>>>>>> blocking version uses WorkerThreadIOStrategy while the non-blocking
>>>>>>> version uses SameThreadIOStrategy for inbound.
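>>>>>>> In code terms the difference is roughly this; just a sketch, with
>>>>>>> "ahcClient", "targetUrl" and "sendResponseToClient" as placeholders
>>>>>>> rather than the real proxy code (exception handling elided):
>>>>>>>
>>>>>>>     // Blocking flavour: the inbound worker thread waits on AHC's
>>>>>>>     // future and then writes the response back itself.
>>>>>>>     Response response = ahcClient.prepareGet(targetUrl).execute().get();
>>>>>>>     sendResponseToClient(response);
>>>>>>>
>>>>>>>     // Non-blocking flavour: return immediately; the response is
>>>>>>>     // written from the completion handler AHC invokes later.
>>>>>>>     ahcClient.prepareGet(targetUrl).execute(new AsyncCompletionHandler<Response>() {
>>>>>>>         @Override
>>>>>>>         public Response onCompleted(final Response response) throws Exception {
>>>>>>>             sendResponseToClient(response);
>>>>>>>             return response;
>>>>>>>         }
>>>>>>>     });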
>>>>>>>
>>>>>>> I didn't reply earlier because I've been trying to get my head round
>>>>>>> what's going on. The errors are all timeout errors. Most of the
>>>>>>> timeouts are between JMeter and the proxy, but there are also some
>>>>>>> between the proxy and the target service, whereas with the blocking
>>>>>>> version there are no errors at all.
>>>>>>>
>>>>>>> Everything seems to be ok, and there are no exceptions being thrown
>>>>>>> (other than timeouts) by Grizzly/AHC. So my only hypothesis is that
>>>>>>> there is an issue with the selectors, either:
>>>>>>>
>>>>>>> i) For some reason the selectors are blocking (I see no evidence of
>>>>>>> this though; the only thing I have between inbound and outbound is
>>>>>>> some copying of headers).
>>>>>>> ii) A different number of inbound/outbound selectors could generate
>>>>>>> more inbound messages than outbound can handle (I've ensured both
>>>>>>> have the same number of selectors, and it doesn't help; giving
>>>>>>> outbound more selectors than inbound seemed to improve things, but
>>>>>>> not solve the problem). BTW this thought is what provoked my original
>>>>>>> email about shared transports/selectors.
>>>>>>> iii) By using dedicatedAcceptor the proxy is accepting all connection
>>>>>>> attempts immediately, but a selector doesn't manage to handle the
>>>>>>> read event before the timeout is reached (although changing this back
>>>>>>> to false didn't seem to help).
>>>>>>>
>>>>>>> I was initially testing with 4000 client threads, hitting the proxy
>>>>>>> on a 24-core machine which in turn hits a simple service with 5ms
>>>>>>> latency on another 24-core machine. But if I run with just 200 client
>>>>>>> threads I'm seeing the same :-(
>>>>>>>
>>>>>>> The last run I just did with a concurrency of 200 gave 1159 errors (6
>>>>>>> outbound timeouts and 1152 jmeter timeouts) in a total of 4,154,978
>>>>>>> requests. It's only 0.03%, but a lot more than with blocking, and
>>>>>>> there's no reason they should be happening.
>>>>>>>
>>>>>>> Any hints on where to look next would be greatly appreciated...
>>>>>>>
>>>>>>> thanks!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Apr 15, 2015 at 2:16 AM, Oleksiy Stashok
>>>>>>> <oleksiy.stashok_at_oracle.com> wrote:
>>>>>>>> What's the implementation diff of blocking vs. non-blocking? I mean,
>>>>>>>> is there any change in your code?
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> WBR,
>>>>>>>> Alexey.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 14.04.15 18:01, Daniel Feist wrote:
>>>>>>>>
>>>>>>>> Very interesting. My previous tests had been with a simple inbound
>>>>>>>> echo. When testing with a non-blocking proxy (1Kb payload, 5ms target
>>>>>>>> service latency) optimizedForMultiplexing=false appears to give
>>>>>>>> better TPS and latency :-)
>>>>>>>>
>>>>>>>> Having some issues with the non-blocking proxy in general though; I'm
>>>>>>>> getting a decent number of errors whereas in blocking mode I get zero.
>>>>>>>> Is it possible that stale connections aren't handled in the same way,
>>>>>>>> or is there something else that might be causing this? I'll do some
>>>>>>>> more digging around, but what I'm seeing right now is 0.05% of JMeter
>>>>>>>> client requests timing out after 60s.
>>>>>>>>
>>>>>>>> Dan
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Apr 14, 2015 at 9:25 PM, Oleksiy Stashok
>>>>>>>> <oleksiy.stashok_at_oracle.com> wrote:
>>>>>>>>> Hi Dan,
>>>>>>>>>
>>>>>>>>> yeah, there is no silver bullet solution for all kinds of use cases.
>>>>>>>>> optimizedForMultiplexing is useful for concurrent writes, because the
>>>>>>>>> outbound messages are always added to the queue and written from the
>>>>>>>>> selector/NIO thread, and at write time Grizzly packs all the
>>>>>>>>> available outbound messages (up to some limit) and sends them as one
>>>>>>>>> chunk, which reduces the number of I/O operations. When
>>>>>>>>> optimizedForMultiplexing is disabled (the default), Grizzly first
>>>>>>>>> tries to send the outbound message right away in the same thread (if
>>>>>>>>> the output queue is empty).
>>>>>>>>> So I'd say that when optimizedForMultiplexing is disabled we
>>>>>>>>> potentially reduce latency, and when it is enabled we increase
>>>>>>>>> throughput. But that's a very simplistic way to look at this config
>>>>>>>>> parameter; I bet in practice you can experience the opposite :))
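>>>>>>>>> In pseudo-code the write path is roughly this; a very simplified
>>>>>>>>> sketch with made-up names, not the actual Grizzly classes/methods:
>>>>>>>>>
>>>>>>>>>     // favour latency (direct write) vs. throughput (queue + pack)
>>>>>>>>>     void write(final ConnectionLike c, final MessageLike msg,
>>>>>>>>>                final boolean optimizedForMultiplexing) {
>>>>>>>>>         if (!optimizedForMultiplexing && c.writeQueueIsEmpty()) {
>>>>>>>>>             c.writeNow(msg);    // caller thread writes right away
>>>>>>>>>         } else {
>>>>>>>>>             c.enqueue(msg);     // selector thread later packs the queued
>>>>>>>>>         }                       // messages into one chunk: fewer I/O ops
>>>>>>>>>     }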
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>> WBR,
>>>>>>>>> Alexey.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 13.04.15 23:40, Daniel Feist wrote:
>>>>>>>>>
>>>>>>>>> Interestingly I saw a performance improvement using
>>>>>>>>> optimizedForMultiplexing with HTTP, although this potentially only
>>>>>>>>> affected
>>>>>>>>> my specific test scenario (simple low latency echo). Also note that
>>>>>>>>> this was
>>>>>>>>> when using worker threads, so not straight through using selectors.
>>>>>>>>>
>>>>>>>>> Let me turn off optimizedForMultiplexing, give inbound 1 selector per
>>>>>>>>> core and outbound 1 selector per core, and see how this runs...
>>>>>>>>>
>>>>>>>>> Dan
>>>>>>>>>
>>>>>>>>> On Mon, Apr 13, 2015 at 11:44 PM, Oleksiy Stashok
>>>>>>>>> <oleksiy.stashok_at_oracle.com> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> - Even if the same selector pool is configured for inbound and
>>>>>>>>>>> outbound, during response processing Grizzly will still do a thread
>>>>>>>>>>> handover before sending the response to the client because of the
>>>>>>>>>>> use of AsyncQueueIO. Is this right?
>>>>>>>>>>>
>>>>>>>>>>> Not sure I understand this; IMO there won't be any extra thread
>>>>>>>>>>> handover involved.
>>>>>>>>>>
>>>>>>>>>> I was referring to the AsyncWriteQueue. Currently I have
>>>>>>>>>> 'optimizedForMultiplexing' set to true, which I thought I'd
>>>>>>>>>> previously seen disables the direct writing you described further on
>>>>>>>>>> in your email. Perhaps I should try without this flag though.
>>>>>>>>>>
>>>>>>>>>> Right, optimizedForMultiplexing is useful when you concurrently write
>>>>>>>>>> packets to the connection, which is not the case with HTTP, unless
>>>>>>>>>> it's HTTP 2.0 :)
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>> WBR,
>>>>>>>>>> Alexey.
>>>>>>>>>>
>>>>>>>>>>