users@grizzly.java.net

Re: Optimal IOStrategy/ThreadPool configuration for proxy

From: Oleksiy Stashok <oleksiy.stashok_at_oracle.com>
Date: Tue, 21 Apr 2015 09:35:42 -0700

Hi guys,

just a small fix for the estimate, I think it's the following:

usage = (workerThreads + selectorThreads) *
    max(connection.getReadBufferSize(),
        min(connection.getWriteBufferSize() * 1.5, sizeHttpResponse));

where by default:
connection.getReadBufferSize() == socket.getReceiveBufferSize()
connection.getWriteBufferSize() == socket.getSendBufferSize();

There are also 2 system properties supported:

org.glassfish.grizzly.nio.transport.TCPNIOTransport.max-receive-buffer-size
org.glassfish.grizzly.nio.transport.TCPNIOTransport.max-send-buffer-size

which additionally cap the maximum read and write buffer sizes (both default
to Integer.MAX_VALUE).
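
For illustration, here is a quick back-of-the-envelope version of that
estimate in plain Java (the inputs are made up and the names just mirror the
formula; this is not a Grizzly API):

    public class DirectMemoryEstimate {
        public static void main(String[] args) {
            int workerThreads = 4000;
            int selectorThreads = 24;
            long readBufferSize   = 512 * 1024;   // connection.getReadBufferSize()
            long writeBufferSize  = 512 * 1024;   // connection.getWriteBufferSize()
            long sizeHttpResponse = 1024 * 1024;  // expected response payload

            long perThread = Math.max(readBufferSize,
                    Math.min((long) (writeBufferSize * 1.5), sizeHttpResponse));
            long usage = (workerThreads + selectorThreads) * perThread; // ~3 GB with these inputs

            System.out.printf("estimated direct memory: %.2f GB%n",
                    usage / (1024.0 * 1024 * 1024));
        }
    }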

WBR,
Alexey.


On 21.04.15 04:56, Daniel Feist wrote:
> Hi,
>
> The size can be limited to 1MB or less using the following system
> property: -Dorg.glassfish.grizzly.nio.transport.TCPNIOTransport.max-receive-buffer-size=1048576.
> The send buffer is less problematic because it will only use 16MB if your
> payload is 16MB, but it's still a good idea to limit the send buffer size
> too, using -Dorg.glassfish.grizzly.nio.transport.TCPNIOTransport.max-send-buffer-size=1048576.
>
> To estimate direct memory usage for HTTP inbound:
>
> direct memory usage = workerThreads x socket.getReceiveBufferSize() +
> workerThreads x min(socket.getSendBufferSize() * 1.5,
> sizeHttpResponse)
>
> So if the OS reports 16MB, then with 4000 worker threads and a payload
> of 1MB this means a total usage of 66GB, while if we limit both to 512KB,
> say, then that's just under 4GB in total.
>
> You shouldn't be seeing a leak as such, just Grizzly wanting to use more
> than it's got. The thing you are likely to see before memory actually
> runs out is the JVM going into a continuous FullGC loop, as an
> explicit GC occurs on each HTTP request when the remaining direct memory
> is under a certain threshold.
>
> Dan
>
>
> On Tue, Apr 21, 2015 at 9:38 AM, Marc Arens <marc.arens_at_open-xchange.com> wrote:
>> Hey Alexey, Daniel,
>>
>> this sounds interesting, as some customers are seeing long-term off-heap
>> memory leaks. Sadly I don't have much information yet, as they didn't install
>> the JVM direct memory monitoring until now. How exactly was the direct memory
>> limited in your tests, and what are the recommendations when using many worker
>> threads?
>>
>>
>> On 21 April 2015 at 03:00 Oleksiy Stashok <oleksiy.stashok_at_oracle.com>
>> wrote:
>>
>>
>> Hi Dan,
>>
>> interesting observations! Try to play with the selector count: try to
>> double it for both cases, blocking and non-blocking, just to compare peak
>> TPS. You may also want to try different payload sizes.
>> A thread context switch is a relatively expensive operation, as are I/O
>> operations, but when you run load tests they can compensate for each other
>> in different cases; for example, you make more thread context
>> switches, but somehow it leads to fewer I/O ops...
>>
>> Regarding the direct memory usage - it's expected, because Grizzly (the JDK
>> does the same) stores a thread-local direct ByteBuffer for read/write
>> operations: more threads means more direct ByteBuffers. We store them in
>> weak references, so they should be recycled at some point, but still...
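>>
>> The pattern is roughly the following (just an illustrative sketch of the
>> thread-local direct buffer idea, not the actual Grizzly code):
>>
>>     import java.nio.ByteBuffer;
>>
>>     class ThreadLocalDirectBufferSketch {
>>         static final int BUFFER_SIZE = 16 * 1024 * 1024; // e.g. what the OS reports
>>
>>         // Each I/O thread lazily gets its own direct buffer, so total
>>         // direct memory grows with the number of threads doing read/write.
>>         static final ThreadLocal<ByteBuffer> IO_BUFFER =
>>                 ThreadLocal.withInitial(() -> ByteBuffer.allocateDirect(BUFFER_SIZE));
>>
>>         public static void main(String[] args) throws InterruptedException {
>>             for (int i = 0; i < 4; i++) {
>>                 Thread t = new Thread(() -> IO_BUFFER.get()); // 4 threads -> 4 x 16MB direct
>>                 t.start();
>>                 t.join();
>>             }
>>         }
>>     }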
>>
>> Thanks.
>>
>> WBR,
>> Alexey.
>>
>>
>>
>> On 19.04.15 05:50, Daniel Feist wrote:
>>> Hi,
>>>
>>> Sorry, false alarm. There was a stupid bug in my code (between
>>> inbound and outbound) that was causing a deadlock in some of the
>>> selectors and it was this that was producing the timeouts and errors
>>> :-( Fixed now though..
>>>
>>> I've now been able to run a full set of tests at different target
>>> service latencies and concurrencies and it's running very well.
>>>
>>> Observations are:
>>> - With high latency (e.g. 1000ms) blocking/non-blocking perform the
>>> same. Of course blocking needs 1 thread per client thread, but giving
>>> the proxy maxWorkerThreads of 10,000 just in case doesn't cause any
>>> adverse performance, it just doesn't use the threads.
>>> - With low latency (e.g. 0->5ms) blocking is faster, but not by much.
>>> The number of worker threads is crucial in this case though: with more
>>> worker threads than required to reach peak TPS, I start to see a
>>> degradation in TPS/latency.
>>> - With medium latency (e.g. 50ms) it appears that non-blocking is
>>> slightly faster, at least at higher concurrencies.
>>>
>>> Initially I was expecting to see more negative effects of having, say,
>>> 4000 worker threads from context-switching etc., but this causes
>>> minimal impact at low latencies and none at all at high latencies.
>>>
>>> One other interesting side effect of having 1000's of worker threads
>>> rather than 24 selectors is the amount of direct memory used. I'm
>>> limiting buffer size via system properties of course, but if I wasn't,
>>> 4000 worker threads on the hardware I'm using (which reports a 16MB
>>> buffer size to Java) would require 125GB of direct memory vs 0.75GB, and
>>> that's just for the read buffer. My calculations might not be perfect but
>>> you get the idea..
>>>
>>> This is just an FYI, but if there is anything you think is strange, it
>>> would be interesting to know..
>>>
>>> thanks!
>>>
>>> Dan
>>>
>>>
>>>
>>> On Fri, Apr 17, 2015 at 1:56 AM, Oleksiy Stashok
>>> <oleksiy.stashok_at_oracle.com> wrote:
>>>> Hi Dan,
>>>>
>>>> everything is possible, and maybe back pressure caused by blocking I/O
>>>> really makes the difference...
>>>> If you have time it would be interesting to investigate this more and try to
>>>> check if you can register any "lost" or "forgotten" request in your app. Try
>>>> to dump all the request/response processing timestamps to figure out where
>>>> exactly the processing takes the most time and at what stage the timeout
>>>> occurs: jmeter (1)--> proxy (2)--> backend (3)--> proxy (4)--> jmeter.
>>>> According to your description it should be either 3 or 4, but it would be
>>>> interesting to see exactly how it happens.
>>>>
>>>> Thanks.
>>>>
>>>> WBR,
>>>> Alexey.
>>>>
>>>>
>>>> On 16.04.15 17:24, Daniel Feist wrote:
>>>>> Ignore my last email about this affecting low concurrency, it doesn't.
>>>>> I was only seeing some errors at low concurrency due to side effects
>>>>> of the previous test run, I think. I need 2000+ JMeter client threads to
>>>>> reproduce this consistently.
>>>>>
>>>>> I stripped everything out as much as possible, so I'm not doing
>>>>> anything in between and AHC is invoking the inbound Grizzly response as
>>>>> directly as possible, but no difference. The exact error in jmeter is
>>>>> "java.net.SocketTimeoutException,Non HTTP response message: Read timed
>>>>> out".
>>>>>
>>>>> Question: this might sound stupid, but couldn't it simply be that the
>>>>> proxy, with the number of selectors it has (and not using worker
>>>>> threads), cannot handle the load? And that we don't see errors
>>>>> with blocking because back-pressure is applied more directly, whereas
>>>>> with non-blocking the same type of back-pressure doesn't occur and so
>>>>> we get this type of error instead?
>>>>>
>>>>> Dan
>>>>>
>>>>> On Thu, Apr 16, 2015 at 10:16 PM, Daniel Feist <dfeist_at_gmail.com> wrote:
>>>>>> The thing is, if I remove the outbound call then it ceases to be a
>>>>>> proxy, and as such I don't have a separate thread processing the
>>>>>> response callback and instead it behaves as blocking (which works).
>>>>>>
>>>>>> Anyway, I'll try to simplify as much as possible in other ways and see
>>>>>> where that leads me...
>>>>>>
>>>>>> Dan
>>>>>>
>>>>>> On Thu, Apr 16, 2015 at 9:00 PM, Oleksiy Stashok
>>>>>> <oleksiy.stashok_at_oracle.com> wrote:
>>>>>>> Hi Dan,
>>>>>>>
>>>>>>> let's try to simplify the test: what happens if the proxy sends the
>>>>>>> response right away (no outbound calls), do you still see the timeouts?
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> WBR,
>>>>>>> Alexey.
>>>>>>>
>>>>>>>
>>>>>>> On 16.04.15 12:17, Daniel Feist wrote:
>>>>>>>> What I forgot to add is that I see the same issue with timeouts
>>>>>>>> between jmeter and the proxy even when "jmeter threads < selectors",
>>>>>>>> which kind of invalidates all of my ideas about selectors all
>>>>>>>> potentially being busy..
>>>>>>>>
>>>>>>>> Wow, even with 1 thread it's occurring.. must be something stupid... I
>>>>>>>> don't think it's related to persistent connections, maxKeepAlive on
>>>>>>>> the target service is 100, which wouldn't explain roughly 1 in 2000
>>>>>>>> client-side timeouts, especially given no errors are being logged.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Apr 16, 2015 at 7:36 PM, Daniel Feist <dfeist_at_gmail.com>
>>>>>>>> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Nothing different really, just the blocking version returns the response
>>>>>>>>> when the stack returns after waiting on the outbound future returned by
>>>>>>>>> AHC, while the non-blocking version returns the response when the
>>>>>>>>> completion handler passed to AHC is invoked. Ah, also the blocking version
>>>>>>>>> uses WorkerThreadIOStrategy while the non-blocking version uses
>>>>>>>>> SameThreadIOStrategy for inbound.
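>>>>>>>>>
>>>>>>>>> For reference, a minimal sketch of how the two modes can be wired up on a
>>>>>>>>> TCPNIOTransport (the selector count and pool sizes are illustrative, and
>>>>>>>>> the standard NIOTransport setters are assumed):
>>>>>>>>>
>>>>>>>>>     import org.glassfish.grizzly.nio.transport.TCPNIOTransport;
>>>>>>>>>     import org.glassfish.grizzly.nio.transport.TCPNIOTransportBuilder;
>>>>>>>>>     import org.glassfish.grizzly.strategies.SameThreadIOStrategy;
>>>>>>>>>     import org.glassfish.grizzly.strategies.WorkerThreadIOStrategy;
>>>>>>>>>     import org.glassfish.grizzly.threadpool.ThreadPoolConfig;
>>>>>>>>>
>>>>>>>>>     public class TransportModes {
>>>>>>>>>         public static void main(String[] args) {
>>>>>>>>>             TCPNIOTransport transport = TCPNIOTransportBuilder.newInstance().build();
>>>>>>>>>
>>>>>>>>>             // non-blocking: run the whole filter chain on the selector threads
>>>>>>>>>             transport.setIOStrategy(SameThreadIOStrategy.getInstance());
>>>>>>>>>             transport.setSelectorRunnersCount(24);
>>>>>>>>>
>>>>>>>>>             // blocking: hand each request off to a (large) worker pool instead
>>>>>>>>>             // transport.setIOStrategy(WorkerThreadIOStrategy.getInstance());
>>>>>>>>>             // transport.setWorkerThreadPoolConfig(ThreadPoolConfig.defaultConfig()
>>>>>>>>>             //         .setCorePoolSize(4000)
>>>>>>>>>             //         .setMaxPoolSize(4000));
>>>>>>>>>         }
>>>>>>>>>     }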
>>>>>>>>>
>>>>>>>>> I didn't reply earlier because I've been trying to get my head around
>>>>>>>>> what's going on. The errors are all timeout errors: most of them are
>>>>>>>>> timeouts between jmeter and the proxy, but there are also some timeout
>>>>>>>>> errors between the proxy and the target service, whereas with the blocking
>>>>>>>>> version there are no errors at all.
>>>>>>>>>
>>>>>>>>> Everything seems to be ok, and there are no exceptions being thrown
>>>>>>>>> (other than timeouts) by grizzly/ahc. So my only hypothesis is that
>>>>>>>>> there is an issue with the selectors, either:
>>>>>>>>>
>>>>>>>>> i) for some reason selectors are blocking (I see no evidence of this
>>>>>>>>> though; the only thing I have between inbound and outbound is some
>>>>>>>>> copying of headers)
>>>>>>>>> ii) a different number of inbound/outbound selectors could generate
>>>>>>>>> more inbound messages than can be handled by outbound (I've ensured
>>>>>>>>> both have the same number of selectors, and it doesn't help; giving
>>>>>>>>> outbound more selectors than inbound seemed to improve things, but
>>>>>>>>> not solve the problem). BTW that thought is what provoked my original
>>>>>>>>> email about shared transports/selectors.
>>>>>>>>> iii) by using dedicatedAcceptor the proxy is accepting all connection
>>>>>>>>> attempts immediately, but a selector doesn't manage to handle the read
>>>>>>>>> event before the timeout is reached (although changing this back to
>>>>>>>>> false didn't seem to help).
>>>>>>>>>
>>>>>>>>> I was initially testing with 4000 client threads, hitting a proxy on a
>>>>>>>>> 24-core machine which in turn hits a simple service with 5ms latency
>>>>>>>>> on another 24-core machine. But if I run with just 200 client threads
>>>>>>>>> I'm seeing the same :-(
>>>>>>>>>
>>>>>>>>> The last run I just did with a concurrency of 200 gave 1159 errors (6
>>>>>>>>> outbound timeouts and 1152 jmeter timeouts) in a total of 4,154,978
>>>>>>>>> requests. It's only 0.03%, but a lot more than blocking, and there's no
>>>>>>>>> reason they should be happening.
>>>>>>>>>
>>>>>>>>> Any hints on where to look next would be greatly appreciated...
>>>>>>>>>
>>>>>>>>> thanks!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Apr 15, 2015 at 2:16 AM, Oleksiy Stashok
>>>>>>>>> <oleksiy.stashok_at_oracle.com> wrote:
>>>>>>>>>> What's the implementation diff of blocking vs. non-blocking? I mean,
>>>>>>>>>> is there any change in your code?
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>> WBR,
>>>>>>>>>> Alexey.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 14.04.15 18:01, Daniel Feist wrote:
>>>>>>>>>>
>>>>>>>>>> Very interesting. My previous tests had been with a simple inbound echo.
>>>>>>>>>> When testing with a non-blocking proxy (1Kb payload, 5ms target service
>>>>>>>>>> latency) optimizedForMultiplexing=false appears to give better TPS and
>>>>>>>>>> latency :-)
>>>>>>>>>>
>>>>>>>>>> Having some issues with the non-blocking proxy in general though, getting
>>>>>>>>>> a decent number of errors whereas in blocking mode I get zero. Is it
>>>>>>>>>> possible that stale connections aren't handled in the same way, or is
>>>>>>>>>> there something else that might be causing this? I'll do some more digging
>>>>>>>>>> around, but what I'm seeing right now is 0.05% of jmeter client requests
>>>>>>>>>> timing out after 60s.
>>>>>>>>>>
>>>>>>>>>> Dan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 14, 2015 at 9:25 PM, Oleksiy Stashok
>>>>>>>>>> <oleksiy.stashok_at_oracle.com> wrote:
>>>>>>>>>>> Hi Dan,
>>>>>>>>>>>
>>>>>>>>>>> yeah, there is no silver bullet solution for all kinds of use cases.
>>>>>>>>>>> optimizedForMultiplexing is useful for concurrent writes, because the
>>>>>>>>>>> outbound messages are always added to the queue and written from the
>>>>>>>>>>> selector/nio thread, and at write time Grizzly packs all the available
>>>>>>>>>>> outbound messages (up to some limit) and sends them as one chunk, which
>>>>>>>>>>> reduces the number of I/O operations. When optimizedForMultiplexing is
>>>>>>>>>>> disabled (the default), Grizzly (if the output queue is empty) first
>>>>>>>>>>> tries to send the outbound message right away in the same thread.
>>>>>>>>>>> So I'd say when optimizedForMultiplexing is disabled we potentially
>>>>>>>>>>> reduce latency, and when optimizedForMultiplexing is enabled we increase
>>>>>>>>>>> throughput. But that's a very simple way to look at this config
>>>>>>>>>>> parameter; I bet in practice you can experience the opposite :))
>>>>>>>>>>>
>>>>>>>>>>> Thanks.
>>>>>>>>>>>
>>>>>>>>>>> WBR,
>>>>>>>>>>> Alexey.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 13.04.15 23:40, Daniel Feist wrote:
>>>>>>>>>>>
>>>>>>>>>>> Interestingly I saw a performance improvement using
>>>>>>>>>>> optimizedForMultiplexing with HTTP, although this potentially only
>>>>>>>>>>> affected my specific test scenario (simple low-latency echo). Also note
>>>>>>>>>>> that this was when using worker threads, so not straight through using
>>>>>>>>>>> selectors.
>>>>>>>>>>>
>>>>>>>>>>> Let me turn off optimizedForMultiplexing, give inbound 1 selector per
>>>>>>>>>>> core and outbound 1 selector per core, and see how this runs...
>>>>>>>>>>>
>>>>>>>>>>> Dan
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Apr 13, 2015 at 11:44 PM, Oleksiy Stashok
>>>>>>>>>>> <oleksiy.stashok_at_oracle.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> - Even if the same selector pool is configured for inbound and
>>>>>>>>>>>>> outbound, during response processing Grizzly will still do a thread
>>>>>>>>>>>>> handover before sending the response to the client, because of the
>>>>>>>>>>>>> use of AsyncQueueIO. Is this right?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Not sure I understand this; IMO there won't be any extra thread
>>>>>>>>>>>>> handover involved.
>>>>>>>>>>>> I was referring to the AsyncWriteQueue. Currently I have
>>>>>>>>>>>> 'optimizedForMultiplexing' set to true, which I thought I'd seen
>>>>>>>>>>>> previously disables the direct writing, as you described further on
>>>>>>>>>>>> in your email. Perhaps I should try without this flag though.
>>>>>>>>>>>>
>>>>>>>>>>>> Right, optimizedForMultiplexing is useful when you concurrently write
>>>>>>>>>>>> packets to the connection, which is not the case with HTTP, unless
>>>>>>>>>>>> it's HTTP 2.0 :)
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>
>>>>>>>>>>>> WBR,
>>>>>>>>>>>> Alexey.
>>>>>>>>>>>>
>>>>>>>>>>>>