users@grizzly.java.net

Re: Optimal IOStrategy/ThreadPool configuration for proxy

From: Marc Arens <marc.arens_at_open-xchange.com>
Date: Tue, 21 Apr 2015 14:32:12 +0200 (CEST)


On 21 April 2015 at 13:56 Daniel Feist <dfeist@gmail.com> wrote:


Hi,

The size can be limited to 1MB or less using the following system
property: -Dorg.glassfish.grizzly.nio.transport.TCPNIOTransport.max-receive-buffer-size=1048576.
The send buffer is less problematic because it will only use 16MB if your
payload is 16MB, but it's still a good idea to limit the send buffer size
too, using -Dorg.glassfish.grizzly.nio.transport.TCPNIOTransport.max-send-buffer-size=1048576.
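
For reference, a minimal sketch of setting the same limits programmatically
instead of via -D flags. The property names are the ones quoted above;
setting them early, before the transport is created, is an assumption on my
part in case the values are only read once:

    // Intended to be equivalent to the -D flags above; set these before the
    // TCPNIOTransport is created so the limits are picked up.
    System.setProperty(
        "org.glassfish.grizzly.nio.transport.TCPNIOTransport.max-receive-buffer-size",
        String.valueOf(1024 * 1024)); // 1MB
    System.setProperty(
        "org.glassfish.grizzly.nio.transport.TCPNIOTransport.max-send-buffer-size",
        String.valueOf(1024 * 1024)); // 1MB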

To estimate direct memory usage for HTTP inbound:

direct memory usage = workerThreads x socket.getReceiveBufferSize() +
workerThreads x min(socket.getSendBufferSize() * 1.5,
sizeHttpResponse)

So if the OS reports 16MB, then with 4000 worker threads and a payload
of 1MB this means a total usage of about 66GB, whereas if we limit both
to 512KB, say, it's just under 4GB in total.
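
As a back-of-the-envelope sketch of that calculation (the figures below are
just the example numbers from above, not measured values):

    // Rough direct-memory estimate per the formula above (illustrative only).
    int workerThreads = 4000;
    long receiveBufferSize = 16L * 1024 * 1024; // what the OS reports, e.g. 16MiB
    long sendBufferSize = 16L * 1024 * 1024;
    long httpResponseSize = 1L * 1024 * 1024;   // 1MiB payload

    long estimate = (long) workerThreads * receiveBufferSize
            + workerThreads * Math.min((long) (sendBufferSize * 1.5), httpResponseSize);

    System.out.printf("~%.1f GiB of direct memory%n",
            estimate / (1024.0 * 1024 * 1024));
    // Prints roughly 66 GiB with these figures; capping both buffers at
    // 512KiB brings it down to a few GiB.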

Thanks, I'll do some in-house testing to see how this influences our application.

You shouldn't be seeing a leak as such, just grizzly wanting to use more
than it's got. The thing you are likely to see before memory actually
runs out is the JVM going into a continuous FullGC loop, as an explicit
GC is triggered on each http request when the remaining direct memory
is under a certain threshold..

Of course there _should_ be no leak, but something is leaking off-heap and we haven't found out yet what exactly it is. As grizzly is one of the frameworks that uses direct memory I'll have to investigate that part. Inspecting the process with pmap shows some huge and some constantly growing anon regions, e.g. 2GiB of resident RAM usage while the heap dump is around 200MiB. I was already looking at http://developerblog.redhat.com/2015/01/06/malloc-systemtap-probes-an-example/ for debugging this; any other recommendations if any of you had a similar problem? (to hijack this thread completely :D)
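
One cheap cross-check is to dump the JVM's own view of direct buffer usage
(via the standard BufferPoolMXBean) next to the pmap output, to see whether
the growing anon regions are NIO direct buffers at all or something else
(glibc arenas etc.). Roughly something like this (just a sketch); newer JDKs
also have -XX:NativeMemoryTracking=summary plus jcmd <pid> VM.native_memory
as another angle:

    import java.lang.management.BufferPoolMXBean;
    import java.lang.management.ManagementFactory;

    public class DirectBufferStats {
        public static void main(String[] args) {
            // Lists the "direct" and "mapped" buffer pools tracked by the JVM.
            for (BufferPoolMXBean pool :
                    ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
                System.out.printf("%s: count=%d used=%d bytes capacity=%d bytes%n",
                        pool.getName(), pool.getCount(),
                        pool.getMemoryUsed(), pool.getTotalCapacity());
            }
        }
    }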

Dan


On Tue, Apr 21, 2015 at 9:38 AM, Marc Arens <marc.arens@open-xchange.com> wrote:
> Hey Alexey, Daniel,
>
> this sounds interesting as some customers are seeing long-term off-heap
> memory leaks. Sadly I don't have much info yet as they haven't installed the
> JVM direct memory monitoring so far. How exactly was the direct memory
> limited in your tests and what are the recommendations when using many
> worker threads?
>
>
> On 21 April 2015 at 03:00 Oleksiy Stashok <oleksiy.stashok@oracle.com>
> wrote:
>
>
> Hi Dan,
>
> interesting observations! Try to play with the selectors count, try to
> double it for both cases blocking and non-blocking, just to compare peak
> tps. You may also want to try different payload sizes.
> A thread context switch is a relatively expensive operation, as are I/O
> operations, but when you run load tests they can compensate for each
> other in different cases; for example you make more thread context
> switches, but somehow it leads to fewer I/O ops...
>
> Regarding the direct memory usage - it's expected, because Grizzly (the JDK
> does the same) stores a thread-local direct ByteBuffer for read/write
> operations; more threads - more direct ByteBuffers. We store them in
> weak references, so they should be recycled at some point, but still...
>
> Thanks.
>
> WBR,
> Alexey.
>
>
>
> On 19.04.15 05:50, Daniel Feist wrote:
>> Hi,
>>
>> Sorry, false alarm. There was a stupid bug in my code (between
>> inbound and outbound) that was causing a deadlock in some of the
>> selectors and it was this that was producing the timeouts and errors
>> :-( Fixed now though..
>>
>> I've now been able to run a full set of tests at different target
>> service latencies and concurrencies and it's running very well.
>>
>> Observations are:
>> - With high latency (e.g. 1000ms) blocking/non-blocking perform the
>> same. Of course blocking needs 1 thread per client thread, but giving
>> the proxy maxWorkerThreads of 10,000 just in case doesn't cause any
>> adverse performance, it just doesn't use the threads.
>> - With low latency (e.g. 0->5ms) blocking is faster, but not by much.
>> The number of worker threads in this case is crucial though; with more
>> worker threads than required to reach peak tps, I start to see a
>> degradation in TPS/latency.
>> - With medium latency (e.g. 50ms) it appears that non-blocking is
>> slightly faster, at least at higher concurrencies.
>>
>> Initially I was expecting to see more negative effects of having, say,
>> 4000 worker threads, from context-switching etc, but this causes
>> minimal impact at low latencies and none at all at high latencies.
>>
>> One other interesting side effect of having 1000's of worker threads
>> rather than 24 selectors is the amount of direct memory used. I'm
>> limiting buffer size via system properties of course, but if I wasn't,
>> 4000 worker threads on the hardware I'm using (which reports a 16MB
>> buffer size to Java) would require 125GB of direct memory vs 0.75GB, and
>> that's just for the read buffer. My calculations might not be perfect but
>> you get the idea..
>>
>> This is just an FYI, but if there is anything you think is strange, it'd
>> be interesting to know..
>>
>> thanks!
>>
>> Dan
>>
>>
>>
>> On Fri, Apr 17, 2015 at 1:56 AM, Oleksiy Stashok
>> <oleksiy.stashok@oracle.com> wrote:
>>> Hi Dan,
>>>
>>> everything is possible, and maybe back pressure caused by blocking I/O
>>> really makes the difference...
>>> If you have time it would be interesting to investigate this more and try
>>> to
>>> check if you can register any "lost" or "forgotten" request in your app.
>>> Try
>>> to dump all the request/response processing timestamps to figure out
>>> where
>>> exactly the processing takes the most time and at what stage the timeout
>>> occurs: jmeter (1)--> proxy (2)--> backend (3)--> proxy (4)--> jmeter.
>>> According to your description it should be either 3 or 4, but it would be
>>> interesting to see exactly how it happens.
>>>
>>> Thanks.
>>>
>>> WBR,
>>> Alexey.
>>>
>>>
>>> On 16.04.15 17:24, Daniel Feist wrote:
>>>> Ignore my last email about this affecting low concurrency, it doesn't.
>>>> I was only seeing some errors at low concurrency due to side effects
>>>> of a previous test run, I think. I need 2000+ JMeter client threads to
>>>> reproduce this consistently.
>>>>
>>>> I stripped out everything as much as possible, so I'm not doing
>>>> anything in between and AHC is invoking the inbound grizzly response as
>>>> directly as possible, but it makes no difference. The exact error in jmeter is
>>>> "java.net.SocketTimeoutException,Non HTTP response message: Read timed
>>>> out".
>>>>
>>>> Question: this might sound stupid, but couldn't it simply be that the
>>>> proxy, with the number of selectors it has (and not using worker
>>>> threads), simply cannot handle the load? And that we don't see errors
>>>> with blocking because back-pressure is applied more directly, whereas
>>>> with non-blocking the same type of back-pressure doesn't occur and so
>>>> we get this type of error instead?
>>>>
>>>> Dan
>>>>
>>>> On Thu, Apr 16, 2015 at 10:16 PM, Daniel Feist <dfeist@gmail.com> wrote:
>>>>> The thing is, if I remove the outbound call then it ceases to be a
>>>>> proxy, and as such I don't have a separate thread processing the
>>>>> response callback and instead it behaves as blocking (which works).
>>>>>
>>>>> Anyway, I'll try to simplify as much as possible in other ways and see
>>>>> where that leads me...
>>>>>
>>>>> Dan
>>>>>
>>>>> On Thu, Apr 16, 2015 at 9:00 PM, Oleksiy Stashok
>>>>> <oleksiy.stashok@oracle.com> wrote:
>>>>>> Hi Dan,
>>>>>>
>>>>>> let's try to simplify the test, what happens if the proxy sends the
>>>>>> response
>>>>>> right away (no outbound calls), do you still see the timeouts?
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> WBR,
>>>>>> Alexey.
>>>>>>
>>>>>>
>>>>>> On 16.04.15 12:17, Daniel Feist wrote:
>>>>>>> What I forgot to add is that I see the same issue with timeouts
>>>>>>> between jmeter and the proxy even when "jmeter threads < selectors",
>>>>>>> which kind of invalidates all of my ideas about selectors all
>>>>>>> potentially being busy..
>>>>>>>
>>>>>>> Wow, even with 1 thread it's occurring.. must be something stupid... I
>>>>>>> don't think it's related to persistent connections, maxKeepAlive on
>>>>>>> the target service is 100, which wouldn't explain roughly 1 in 2000
>>>>>>> client-side timeouts, especially given no errors are being logged.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 16, 2015 at 7:36 PM, Daniel Feist <dfeist@gmail.com>
>>>>>>> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Nothing different really, just the blocking version returns the
>>>>>>>> response when the stack returns after waiting on the outbound future
>>>>>>>> returned by AHC, while the non-blocking version returns the response
>>>>>>>> when the completion handler passed to AHC is invoked. Ah, also the
>>>>>>>> blocking version uses WorkerThreadIOStrategy while the non-blocking
>>>>>>>> version uses SameThreadIOStrategy for inbound.
>>>>>>>>
>>>>>>>> I didn't reply earlier because I've been trying to get my head round
>>>>>>>> what's going on. The errors are all timeout errors. Most of the
>>>>>>>> timeout errors are between jmeter and the proxy, but there are also
>>>>>>>> some timeout errors between the proxy and the target service, whereas
>>>>>>>> with the blocking version there are no errors at all.
>>>>>>>>
>>>>>>>> Everything seems to be ok, and there are no exceptions being thrown
>>>>>>>> (other than timeouts) by grizzly/ahc. So my only hypothesis is that
>>>>>>>> there is an issue with the selectors, either:
>>>>>>>>
>>>>>>>> i) for some reason selectors are blocking (I see no evidence of this
>>>>>>>> though, the only thing I have between inbound and outbound is some
>>>>>>>> copying of headers)
>>>>>>>> ii) a different number of inbound/outbound selectors could generate
>>>>>>>> more inbound messages than can be handled by outbound (I've ensured
>>>>>>>> both have the same number of selectors, and it doesn't help; giving
>>>>>>>> outbound more selectors than inbound seemed to improve things, but
>>>>>>>> not solve the problem). BTW this thought is what provoked my original
>>>>>>>> email about shared transports/selectors.
>>>>>>>> iii) by using dedicatedAcceptor the proxy is accepting all connection
>>>>>>>> attempts immediately, but a selector doesn't manage to handle the
>>>>>>>> read event before the timeout is reached (although changing this back
>>>>>>>> to false didn't seem to help).
>>>>>>>>
>>>>>>>> I was initially testing with 4000 client threads, hitting the proxy
>>>>>>>> on a 24-core machine which in turn hits a simple service with 5ms
>>>>>>>> latency on another 24-core machine. But if I run with just 200 client
>>>>>>>> threads I'm seeing the same :-(
>>>>>>>>
>>>>>>>> The last run I just did with a concurrency of 200 gave 1159 errors (6
>>>>>>>> outbound timeouts and 1152 jmeter timeouts) in a total of 4,154,978
>>>>>>>> requests. It's only 0.03%, but a lot more than blocking, and there's
>>>>>>>> no reason they should be happening.
>>>>>>>>
>>>>>>>> Any hints on where to look next would be greatly appreciated...
>>>>>>>>
>>>>>>>> thanks!
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Apr 15, 2015 at 2:16 AM, Oleksiy Stashok
>>>>>>>> <oleksiy.stashok@oracle.com> wrote:
>>>>>>>>> What's the implementation diff of blocking vs. non-blocking? I mean
>>>>>>>>> is there any change in your code?
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>> WBR,
>>>>>>>>> Alexey.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 14.04.15 18:01, Daniel Feist wrote:
>>>>>>>>>
>>>>>>>>> Very interesting. My previous tests had been with a simple inbound
>>>>>>>>> echo. When testing with a non-blocking proxy (1Kb payload, 5ms
>>>>>>>>> target service latency) optimizedForMultiplexing=false appears to
>>>>>>>>> give better TPS and latency :-)
>>>>>>>>>
>>>>>>>>> Having some issues with the non-blocking proxy in general though,
>>>>>>>>> getting a decent number of errors whereas in blocking mode I get
>>>>>>>>> zero. Is it possible that stale connections aren't handled in the
>>>>>>>>> same way, or is there something else that might be causing this?
>>>>>>>>> I'll do some more digging around, but what I'm seeing right now is
>>>>>>>>> 0.05% of jmeter client requests timing out after 60s.
>>>>>>>>>
>>>>>>>>> Dan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Apr 14, 2015 at 9:25 PM, Oleksiy Stashok
>>>>>>>>> <oleksiy.stashok@oracle.com> wrote:
>>>>>>>>>> Hi Dan,
>>>>>>>>>>
>>>>>>>>>> yeah, there is no silver bullet solution for all kind of usecases.
>>>>>>>>>> optimizedForMultiplexing is useful for concurrent writes, because
>>>>>>>>>> the outbound messages are always added to the queue and written
>>>>>>>>>> from the selector/nio thread, and at write time Grizzly packs all
>>>>>>>>>> (up to some limit) the available outbound messages and sends them
>>>>>>>>>> as one chunk, which reduces the number of I/O operations. When
>>>>>>>>>> optimizedForMultiplexing is disabled (the default), Grizzly (if the
>>>>>>>>>> output queue is empty) first tries to send the outbound message
>>>>>>>>>> right away in the same thread.
>>>>>>>>>> So I'd say when optimizedForMultiplexing is disabled we potentially
>>>>>>>>>> reduce latency, and when optimizedForMultiplexing is enabled we
>>>>>>>>>> increase throughput. But that's a very simple way to look at this
>>>>>>>>>> config parameter; I bet in practice you can experience the
>>>>>>>>>> opposite :))
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>> WBR,
>>>>>>>>>> Alexey.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 13.04.15 23:40, Daniel Feist wrote:
>>>>>>>>>>
>>>>>>>>>> Interestingly I saw a performance improvement using
>>>>>>>>>> optimizedForMultiplexing with HTTP, although this potentially only
>>>>>>>>>> affected my specific test scenario (simple low latency echo). Also
>>>>>>>>>> note that this was when using worker threads, so not straight
>>>>>>>>>> through using selectors.
>>>>>>>>>>
>>>>>>>>>> Let me turn off optimizedForMultiplexing, give inbound 1 selector
>>>>>>>>>> per core, outbound 1 selector per core and see how this runs...
>>>>>>>>>>
>>>>>>>>>> Dan
>>>>>>>>>>
>>>>>>>>>> On Mon, Apr 13, 2015 at 11:44 PM, Oleksiy Stashok
>>>>>>>>>> <oleksiy.stashok@oracle.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> - Even if the same selector pool is configured for inbound and
>>>>>>>>>>>> outbound, during response processing Grizzly will still do a
>>>>>>>>>>>> thread handover before sending the response to the client because
>>>>>>>>>>>> of the use of AsyncQueueIO. Is this right?
>>>>>>>>>>>>
>>>>>>>>>>>> Not sure I understand this, IMO there won't be any extra thread
>>>>>>>>>>>> handover
>>>>>>>>>>>> involved.
>>>>>>>>>>>
>>>>>>>>>>> I was referring to the AsyncWriteQueue. Currently I have
>>>>>>>>>>> 'optimizedForMultiplexing' set to true, which I thought I'd
>>>>>>>>>>> previously seen disables the direct writing as you described
>>>>>>>>>>> further on in your email. Perhaps I should try without this flag
>>>>>>>>>>> though.
>>>>>>>>>>>
>>>>>>>>>>> Right, optimizedForMultiplexing is useful when you concurrently
>>>>>>>>>>> write packets to the connection, which is not the case with HTTP,
>>>>>>>>>>> unless it's HTTP 2.0 :)
>>>>>>>>>>>
>>>>>>>>>>> Thanks.
>>>>>>>>>>>
>>>>>>>>>>> WBR,
>>>>>>>>>>> Alexey.
>>>>>>>>>>>
>>>>>>>>>>>
>
>>
>
> ...it was evening and it was morning and there were already two ways to
> store Unicode...

...it was evening and it was morning and there were already two ways to store Unicode...