users@grizzly.java.net

Re: Optimal IOStrategy/ThreadPool configuration for proxy

From: Daniel Feist <dfeist_at_gmail.com>
Date: Sun, 19 Apr 2015 13:50:48 +0100

Hi,

Sorry, false alarm. There was a stupid bug in my code (between
inbound and outbound) that was causing a deadlock in some of the
selectors and it was this that was producing the timeouts and errors
:-( Fixed now though..

I've now been able to run a full set of tests at different target
service latencies and concurrencies and it's running very well.

Observations are:
- With high latency (e.g. 1000ms), blocking and non-blocking perform the
same. Of course blocking needs one thread per client thread, but giving
the proxy maxWorkerThreads of 10,000 just in case doesn't cause any
adverse performance; it just doesn't use the threads.
- With low latency (e.g. 0-5ms), blocking is faster, but not by much.
The number of worker threads is crucial in this case, though: with more
worker threads than required to reach peak TPS, I start to see a
degradation in TPS/latency.
- With medium latency (e.g. 50ms), non-blocking appears to be slightly
faster, at least at higher concurrencies.
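For anyone reproducing this, the 10,000 ceiling above is set via the worker thread pool config; a minimal sketch against the Grizzly 2.x API (the core pool size here is illustrative, not from the tests):

```java
import org.glassfish.grizzly.nio.transport.TCPNIOTransport;
import org.glassfish.grizzly.nio.transport.TCPNIOTransportBuilder;
import org.glassfish.grizzly.strategies.WorkerThreadIOStrategy;
import org.glassfish.grizzly.threadpool.ThreadPoolConfig;

// Sketch (Grizzly 2.x): blocking variant with a generous worker pool cap.
// Per the observations above, unused threads cost almost nothing.
public class WorkerPoolSetup {
    public static void main(String[] args) throws Exception {
        ThreadPoolConfig workers = ThreadPoolConfig.defaultConfig()
                .setCorePoolSize(64)      // illustrative starting size
                .setMaxPoolSize(10_000);  // the "just in case" ceiling

        TCPNIOTransport transport = TCPNIOTransportBuilder.newInstance()
                .setIOStrategy(WorkerThreadIOStrategy.getInstance())
                .setWorkerThreadPoolConfig(workers)
                .build();
        // ... install the HTTP filter chain and start the transport
    }
}
```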

Initially I was expecting to see more of the negative effects of having,
say, 4000 worker threads (context switching etc.), but this causes
minimal impact at low latencies and none at all at high latencies.

One other interesting side effect of having thousands of worker threads
rather than 24 selectors is the amount of direct memory used. I'm
limiting buffer size via system properties of course, but if I wasn't,
4000 worker threads on the hardware I'm using (which reports a 16MB
buffer size to Java) would require 125GB of direct memory vs 0.75GB, and
that's just for read buffers. My calculations might not be perfect but
you get the idea..
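As a sanity check on those totals, here's the back-of-the-envelope math. The 32MB-per-thread figure is my assumption, not from the thread: it reproduces the 0.75GB and 125GB totals exactly, and could correspond to a 16MB read buffer plus a 16MB write buffer per thread.

```java
// Back-of-the-envelope estimate of direct memory consumed by per-thread
// I/O buffers. 32MB per thread is an assumption chosen to reproduce the
// 0.75GB / 125GB totals quoted above (the hardware reports 16MB, so 32MB
// could be e.g. one read buffer plus one write buffer per thread).
public class DirectMemoryEstimate {

    static long estimateBytes(int threads, long perThreadBufferBytes) {
        return threads * perThreadBufferBytes;
    }

    public static void main(String[] args) {
        long perThread = 32L * 1024 * 1024;     // assumed bytes per thread
        double gib = 1024.0 * 1024 * 1024;

        // 24 selector threads vs 4000 worker threads
        System.out.printf("24 selectors:  %.2f GB%n",
                estimateBytes(24, perThread) / gib);    // 0.75 GB
        System.out.printf("4000 workers: %.2f GB%n",
                estimateBytes(4000, perThread) / gib);  // 125.00 GB
    }
}
```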

This is just an FYI, but if there is anything here you think is strange,
it'd be interesting to know..

thanks!

Dan



On Fri, Apr 17, 2015 at 1:56 AM, Oleksiy Stashok
<oleksiy.stashok_at_oracle.com> wrote:
> Hi Dan,
>
> everything is possible, and maybe back pressure caused by blocking I/O
> really makes the difference...
> If you have time it would be interesting to investigate this more and try to
> check if you can register any "lost" or "forgotten" request in your app. Try
> to dump all the request/response processing timestamps to figure out where
> exactly the processing takes the most time and at what stage the timeout
> occurs: jmeter (1)--> proxy (2)--> backend (3)--> proxy (4)--> jmeter.
> According to your description it should be either 3 or 4, but it would be
> interesting to see exactly how it happens.
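The sort of per-request timestamping being suggested could look like this (a sketch only; the class and stage names are hypothetical, not Grizzly or AHC API):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper for timestamping the four hops described above:
// jmeter (1)--> proxy (2)--> backend (3)--> proxy (4)--> jmeter.
// On a timeout, dump the elapsed times to see which stage is slow.
public class HopTimer {

    public enum Stage { INBOUND_RECEIVED, OUTBOUND_SENT,
                        OUTBOUND_RESPONSE, INBOUND_RESPONSE_SENT }

    private final ConcurrentMap<String, long[]> times = new ConcurrentHashMap<>();

    /** Record the current time for one stage of a request. */
    public void mark(String requestId, Stage stage) {
        times.computeIfAbsent(requestId,
                id -> new long[Stage.values().length])[stage.ordinal()] = System.nanoTime();
    }

    /** Millis spent between two stages, for dumping when a timeout fires. */
    public long elapsedMillis(String requestId, Stage from, Stage to) {
        long[] t = times.get(requestId);
        return (t[to.ordinal()] - t[from.ordinal()]) / 1_000_000;
    }
}
```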
>
> Thanks.
>
> WBR,
> Alexey.
>
>
> On 16.04.15 17:24, Daniel Feist wrote:
>>
>> Ignore my last email about this affecting low concurrency; it doesn't.
>> I was only seeing some errors at low concurrency due to side effects
>> of a previous test run, I think. I need 2000+ JMeter client threads to
>> reproduce this consistently.
>>
>> I stripped everything down as much as possible so I'm not doing
>> anything in between, and AHC is invoking the inbound Grizzly response
>> as directly as possible, but no difference. The exact error in JMeter is
>> "java.net.SocketTimeoutException,Non HTTP response message: Read timed
>> out".
>>
>> Question: this might sound stupid, but couldn't it simply be that the
>> proxy, with the number of selectors it has (and not using worker
>> threads), simply cannot handle the load? And that we don't see errors
>> with blocking because back-pressure is applied more directly, whereas
>> with non-blocking the same type of back-pressure doesn't occur and so
>> we get this type of error instead?
>>
>> Dan
>>
>> On Thu, Apr 16, 2015 at 10:16 PM, Daniel Feist <dfeist_at_gmail.com> wrote:
>>>
>>> The thing is, if I remove the outbound call then it ceases to be a
>>> proxy, and as such I don't have a separate thread processing the
>>> response callback; instead it behaves as blocking (which works).
>>>
>>> Anyway, I'll try to simplify as much as possible in other ways and see
>>> where that leads me...
>>>
>>> Dan
>>>
>>> On Thu, Apr 16, 2015 at 9:00 PM, Oleksiy Stashok
>>> <oleksiy.stashok_at_oracle.com> wrote:
>>>>
>>>> Hi Dan,
>>>>
>>>> let's try to simplify the test: what happens if the proxy sends the
>>>> response right away (no outbound calls)? Do you still see the
>>>> timeouts?
>>>>
>>>> Thanks.
>>>>
>>>> WBR,
>>>> Alexey.
>>>>
>>>>
>>>> On 16.04.15 12:17, Daniel Feist wrote:
>>>>>
>>>>> What I forgot to add is that I see the same issue with timeouts
>>>>> between JMeter and the proxy even when "jmeter threads < selectors",
>>>>> which kind of invalidates all of my ideas about the selectors all
>>>>> potentially being busy..
>>>>>
>>>>> Wow, even with 1 thread it's occurring.. must be something stupid... I
>>>>> don't think it's related to persistent connections; maxKeepAlive on
>>>>> the target service is 100, which wouldn't explain roughly 1 in 2000
>>>>> client-side timeouts, especially given no errors are being logged.
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Apr 16, 2015 at 7:36 PM, Daniel Feist <dfeist_at_gmail.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Nothing different really, just the blocking version returns the
>>>>>> response when the stack returns after waiting on the outbound future
>>>>>> returned by AHC, while the non-blocking version returns the response
>>>>>> when the completion handler passed to AHC is invoked. Ah, also the
>>>>>> blocking version uses WorkerThreadIOStrategy while the non-blocking
>>>>>> version uses SameThreadIOStrategy for inbound.
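Concretely, the two inbound configurations differ only in the IOStrategy; a sketch against the Grizzly 2.x builder API:

```java
import org.glassfish.grizzly.nio.transport.TCPNIOTransport;
import org.glassfish.grizzly.nio.transport.TCPNIOTransportBuilder;
import org.glassfish.grizzly.strategies.SameThreadIOStrategy;
import org.glassfish.grizzly.strategies.WorkerThreadIOStrategy;

// Sketch (Grizzly 2.x) of the two inbound setups being compared.
public class InboundStrategies {
    public static void main(String[] args) throws Exception {
        // Blocking variant: selector threads hand work off to a worker
        // pool, so handlers may block (e.g. waiting on the outbound future).
        TCPNIOTransport blocking = TCPNIOTransportBuilder.newInstance()
                .setIOStrategy(WorkerThreadIOStrategy.getInstance())
                .build();

        // Non-blocking variant: handlers run on the selector threads
        // themselves, so they must never block.
        TCPNIOTransport nonBlocking = TCPNIOTransportBuilder.newInstance()
                .setIOStrategy(SameThreadIOStrategy.getInstance())
                .build();
    }
}
```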
>>>>>>
>>>>>> I didn't reply earlier because I've been trying to get my head
>>>>>> around what's going on. The errors are all timeout errors. Most of
>>>>>> the timeout errors are between JMeter and the proxy, but there are
>>>>>> also some timeout errors between the proxy and the target service,
>>>>>> whereas with the blocking version there are no errors at all.
>>>>>>
>>>>>> Everything seems to be OK, and there are no exceptions being thrown
>>>>>> (other than timeouts) by grizzly/ahc. So my only hypothesis is that
>>>>>> there is an issue with the selectors, either:
>>>>>>
>>>>>> i) for some reason the selectors are blocking (I see no evidence of
>>>>>> this though; the only thing I have between inbound and outbound is
>>>>>> some copying of headers)
>>>>>> ii) different numbers of inbound/outbound selectors could generate
>>>>>> more inbound messages than can be handled by outbound (I've ensured
>>>>>> both have the same number of selectors, and it doesn't help; giving
>>>>>> outbound more selectors than inbound seemed to improve things, but
>>>>>> not solve the problem). BTW, this thought is what provoked my
>>>>>> original email about shared transports/selectors.
>>>>>> iii) by using dedicatedAcceptor the proxy is accepting all connection
>>>>>> attempts immediately, but a selector doesn't manage to handle the
>>>>>> read event before the timeout is reached (although changing this back
>>>>>> to false didn't seem to help).
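Hypothesis ii can be checked by pinning both transports to the same selector count, roughly like this (a sketch against the Grizzly 2.x builder API; wiring the outbound transport into AHC is elided):

```java
import org.glassfish.grizzly.nio.transport.TCPNIOTransport;
import org.glassfish.grizzly.nio.transport.TCPNIOTransportBuilder;

// Sketch (Grizzly 2.x): give inbound and outbound transports the same
// number of selector runners so one side can't out-produce the other.
public class SelectorParity {
    public static void main(String[] args) throws Exception {
        int selectors = Runtime.getRuntime().availableProcessors();

        TCPNIOTransport inbound = TCPNIOTransportBuilder.newInstance()
                .setSelectorRunnersCount(selectors)
                .build();
        TCPNIOTransport outbound = TCPNIOTransportBuilder.newInstance()
                .setSelectorRunnersCount(selectors)
                .build();
        // ... bind inbound, hand outbound to the AHC provider, start both
    }
}
```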
>>>>>>
>>>>>> I was initially testing with 4000 client threads, hitting the proxy
>>>>>> on a 24-core machine which in turn hits a simple service with 5ms
>>>>>> latency on another 24-core machine. But if I run with just 200 client
>>>>>> threads I'm seeing the same :-(
>>>>>>
>>>>>> The last run I did with a concurrency of 200 gave 1159 errors (6
>>>>>> outbound timeouts and 1152 JMeter timeouts) in a total of 4,154,978
>>>>>> requests. It's only 0.03%, but that's a lot more than blocking, and
>>>>>> there's no reason they should be happening.
>>>>>>
>>>>>> Any hints on where to look next would be greatly appreciated...
>>>>>>
>>>>>> thanks!
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 15, 2015 at 2:16 AM, Oleksiy Stashok
>>>>>> <oleksiy.stashok_at_oracle.com> wrote:
>>>>>>>
>>>>>>> What's the implementation diff of blocking vs. non-blocking? I
>>>>>>> mean, is there any change in your code?
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> WBR,
>>>>>>> Alexey.
>>>>>>>
>>>>>>>
>>>>>>> On 14.04.15 18:01, Daniel Feist wrote:
>>>>>>>
>>>>>>> Very interesting. My previous tests had been with a simple inbound
>>>>>>> echo. When testing with a non-blocking proxy (1KB payload, 5ms
>>>>>>> target service latency), optimizedForMultiplexing=false appears to
>>>>>>> give better TPS and latency :-)
>>>>>>>
>>>>>>> I'm having some issues with the non-blocking proxy in general
>>>>>>> though, getting a decent number of errors whereas in blocking mode I
>>>>>>> get zero. Is it possible that stale connections aren't handled in
>>>>>>> the same way, or is there something else that might be causing this?
>>>>>>> I'll do some more digging around, but what I'm seeing right now is
>>>>>>> 0.05% of jmeter client requests timing out after 60s.
>>>>>>>
>>>>>>> Dan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Apr 14, 2015 at 9:25 PM, Oleksiy Stashok
>>>>>>> <oleksiy.stashok_at_oracle.com> wrote:
>>>>>>>>
>>>>>>>> Hi Dan,
>>>>>>>>
>>>>>>>> yeah, there is no silver bullet solution for all kinds of use
>>>>>>>> cases. optimizedForMultiplexing is useful for concurrent writes,
>>>>>>>> because the outbound messages are always added to the queue and
>>>>>>>> written from the selector/nio thread, and at write time Grizzly
>>>>>>>> packs all (up to some limit) the available outbound messages and
>>>>>>>> sends them as one chunk, which reduces the number of I/O
>>>>>>>> operations. When optimizedForMultiplexing is disabled (the
>>>>>>>> default), Grizzly (if the output queue is empty) first tries to
>>>>>>>> send the outbound message right away in the same thread.
>>>>>>>> So I'd say when optimizedForMultiplexing is disabled we potentially
>>>>>>>> reduce latency, and when it's enabled we increase throughput. But
>>>>>>>> that's a very simple way to look at this config parameter; I bet in
>>>>>>>> practice you can experience the opposite :))
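For anyone following along, the flag is set on the transport builder; a sketch assuming the Grizzly 2.x API:

```java
import org.glassfish.grizzly.nio.transport.TCPNIOTransport;
import org.glassfish.grizzly.nio.transport.TCPNIOTransportBuilder;

// Sketch (Grizzly 2.x): enabling the write-queue batching described
// above. Disabled (the default), a write is attempted inline on the
// calling thread when the queue is empty; enabled, writes always go
// through the queue and get packed into fewer I/O operations.
public class MultiplexingConfig {
    public static void main(String[] args) throws Exception {
        TCPNIOTransport transport = TCPNIOTransportBuilder.newInstance()
                .setOptimizedForMultiplexing(true) // favor throughput over latency
                .build();
    }
}
```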
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> WBR,
>>>>>>>> Alexey.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 13.04.15 23:40, Daniel Feist wrote:
>>>>>>>>
>>>>>>>> Interestingly, I saw a performance improvement using
>>>>>>>> optimizedForMultiplexing with HTTP, although this potentially only
>>>>>>>> affected my specific test scenario (a simple low-latency echo).
>>>>>>>> Also note that this was when using worker threads, so not straight
>>>>>>>> through using selectors.
>>>>>>>>
>>>>>>>> Let me turn off optimizedForMultiplexing, give inbound one
>>>>>>>> selector per core and outbound one selector per core, and see how
>>>>>>>> this runs...
>>>>>>>>
>>>>>>>> Dan
>>>>>>>>
>>>>>>>> On Mon, Apr 13, 2015 at 11:44 PM, Oleksiy Stashok
>>>>>>>> <oleksiy.stashok_at_oracle.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> - Even if the same selector pool is configured for inbound and
>>>>>>>>>> outbound, during response processing Grizzly will still do a
>>>>>>>>>> thread handover before sending the response to the client,
>>>>>>>>>> because of the use of AsyncQueueIO. Is this right?
>>>>>>>>>>
>>>>>>>>>> Not sure I understand this, IMO there won't be any extra thread
>>>>>>>>>> handover
>>>>>>>>>> involved.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I was referring to the AsyncWriteQueue. Currently I have
>>>>>>>>> 'optimizedForMultiplexing' set to true, which I thought I'd seen
>>>>>>>>> previously disables the direct writing, as you described further
>>>>>>>>> on in your email. Perhaps I should try without this flag though.
>>>>>>>>>
>>>>>>>>> Right, optimizedForMultiplexing is useful, when you concurrently
>>>>>>>>> write
>>>>>>>>> packets to the connection, which is not the case with HTTP, unless
>>>>>>>>> it's HTTP
>>>>>>>>> 2.0 :)
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>> WBR,
>>>>>>>>> Alexey.
>>>>>>>>>
>>>>>>>>>
>