users@grizzly.java.net

Re: Optimal IOStrategy/ThreadPool configuration for proxy

From: Daniel Feist <dfeist_at_gmail.com>
Date: Tue, 21 Apr 2015 19:05:20 +0100

That formula implies the same buffer is used for both read and write;
is that really true?

Of course the formula differs depending on i) the IOStrategy and ii) whether
the async write queue is used or not. My formula was assuming
WorkerThreadIOStrategy and no async writes.

Dan

On Tue, Apr 21, 2015 at 5:35 PM, Oleksiy Stashok
<oleksiy.stashok_at_oracle.com> wrote:
> Hi guys,
>
> just a small fix for the estimate, I think it's following:
>
> usage = (workerThreads + selectorThreads) *
> max(connection.getReadBufferSize(), min(connection.getWriteBufferSize() *
> 1.5, sizeHttpResponse))
>
> where by default:
> connection.getReadBufferSize() == socket.getReceiveBufferSize()
> connection.getWriteBufferSize() == socket.getSendBufferSize();
>
> There are also 2 system properties supported:
>
> org.glassfish.grizzly.nio.transport.TCPNIOTransport.max-receive-buffer-size
> org.glassfish.grizzly.nio.transport.TCPNIOTransport.max-send-buffer-size
>
> which additionally set the max read, write buffer sizes (by default
> Integer.MAX_VALUE).
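>
> For a rough sanity check, here is that worst-case estimate as a small Java
> snippet (just a sketch; the inputs are placeholders for your own thread
> counts and buffer sizes, not Grizzly API calls):
>
>     public class DirectMemoryEstimate {
>         public static void main(String[] args) {
>             // Placeholder inputs; substitute your own configuration.
>             long workerThreads = 4000, selectorThreads = 24;
>             long readBufferSize = 16L * 1024 * 1024;   // connection.getReadBufferSize()
>             long writeBufferSize = 16L * 1024 * 1024;  // connection.getWriteBufferSize()
>             long sizeHttpResponse = 1L * 1024 * 1024;  // typical response payload
>
>             long perThread = Math.max(readBufferSize,
>                     Math.min((long) (writeBufferSize * 1.5), sizeHttpResponse));
>             long usage = (workerThreads + selectorThreads) * perThread;
>             System.out.printf("estimated direct memory: ~%d MB%n", usage / (1024 * 1024));
>         }
>     }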
>
> WBR,
> Alexey.
>
>
>
> On 21.04.15 04:56, Daniel Feist wrote:
>>
>> Hi,
>>
>> The size can be limited to 1MB or less using the following system
>> property:
>> -Dorg.glassfish.grizzly.nio.transport.TCPNIOTransport.max-receive-buffer-size=1048576.
>> The send buffer is less problematic because it will only use 16MB if your
>> payload is 16MB, but still, it's a good idea to limit the send buffer size
>> too, using
>> -Dorg.glassfish.grizzly.nio.transport.TCPNIOTransport.max-send-buffer-size=1048576.
>>
>> To estimate direct memory usage for HTTP inbound:
>>
>> direct memory usage = workerThreads x socket.getReceiveBufferSize() +
>> workerThreads x min(socket.getSendBufferSize() * 1.5,
>> sizeHttpResponse)
>>
>> So if the OS reports 16MB, then with 4000 worker threads and a payload
>> of 1MB this means a total usage of 66GB, while if we limit both to 512KB,
>> say, then that's just under 4GB in total.
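>> (Roughly, for the unlimited case: 4000 x 16MB ~= 62.5GB for the receive
>> buffers, plus 4000 x min(1.5 x 16MB, 1MB) = 4000 x 1MB ~= 3.9GB for the
>> responses, which is where the ~66GB comes from.)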
>>
>> You shouldn't be seeing a leak as such, just grizzly wanting to use more
>> than it's got. The thing you are likely to see before memory actually
>> runs out is the JVM going into a continuous FullGC loop, as an explicit
>> GC occurs on each http request when the remaining direct memory is under
>> a certain threshold.
>>
>> Dan
>>
>>
>> On Tue, Apr 21, 2015 at 9:38 AM, Marc Arens <marc.arens_at_open-xchange.com>
>> wrote:
>>>
>>> Hey Alexey, Daniel,
>>>
>>> this sounds interesting, as some customers are seeing long-term off-heap
>>> memory leaks. Sadly I don't have much info yet, as they didn't install the
>>> JVM direct memory monitoring until now. How exactly was the direct memory
>>> limited in your tests, and what are the recommendations when using many
>>> worker threads?
>>>
>>>
>>> On 21 April 2015 at 03:00 Oleksiy Stashok <oleksiy.stashok_at_oracle.com>
>>> wrote:
>>>
>>>
>>> Hi Dan,
>>>
>>> interesting observations! Try to play with the selector count; try to
>>> double it for both cases, blocking and non-blocking, just to compare peak
>>> tps. You may also want to try different payload sizes.
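>>>
>>> If it helps, a minimal sketch of how that tuning might look with the
>>> transport builder (method names as I remember them from Grizzly 2.3.x,
>>> so please double-check against the version you're running):
>>>
>>>     import org.glassfish.grizzly.nio.transport.TCPNIOTransport;
>>>     import org.glassfish.grizzly.nio.transport.TCPNIOTransportBuilder;
>>>     import org.glassfish.grizzly.strategies.SameThreadIOStrategy;
>>>
>>>     public class TransportTuning {
>>>         public static void main(String[] args) throws Exception {
>>>             int cores = Runtime.getRuntime().availableProcessors();
>>>             TCPNIOTransportBuilder builder = TCPNIOTransportBuilder.newInstance();
>>>             // one selector runner per core; double it to compare peak tps
>>>             builder.setSelectorRunnersCount(cores * 2);
>>>             // non-blocking case; use WorkerThreadIOStrategy.getInstance() for blocking
>>>             builder.setIOStrategy(SameThreadIOStrategy.getInstance());
>>>             TCPNIOTransport transport = builder.build();
>>>             transport.start();
>>>             // ... bind listeners / run the test, then:
>>>             transport.shutdownNow();
>>>         }
>>>     }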
>>> A thread context switch is a relatively expensive operation, as are I/O
>>> operations, but when you run load tests, for different cases they can
>>> compensate for each other; for example, you make more thread context
>>> switches, but somehow it leads to fewer I/O ops...
>>>
>>> Regarding the direct memory usage - it's expected, because Grizzly (the
>>> JDK does the same) stores a thread-local direct ByteBuffer for read/write
>>> operations; more threads means more direct ByteBuffers. We store them in
>>> weak references, so they should be recycled at some point, but still...
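>>>
>>> Just to illustrate the pattern (a toy sketch of the idea, not Grizzly's
>>> actual buffer caching code):
>>>
>>>     import java.lang.ref.WeakReference;
>>>     import java.nio.ByteBuffer;
>>>
>>>     public class ThreadLocalDirectBuffers {
>>>         // Each thread lazily gets its own direct buffer, held via a weak
>>>         // reference so the GC can eventually reclaim it. With N I/O threads
>>>         // you can end up with up to N direct buffers of BUFFER_SIZE bytes.
>>>         private static final int BUFFER_SIZE = 16 * 1024 * 1024;
>>>         private static final ThreadLocal<WeakReference<ByteBuffer>> CACHE =
>>>                 new ThreadLocal<>();
>>>
>>>         static ByteBuffer threadBuffer() {
>>>             WeakReference<ByteBuffer> ref = CACHE.get();
>>>             ByteBuffer buf = (ref != null) ? ref.get() : null;
>>>             if (buf == null) {
>>>                 buf = ByteBuffer.allocateDirect(BUFFER_SIZE);
>>>                 CACHE.set(new WeakReference<>(buf));
>>>             }
>>>             buf.clear();
>>>             return buf;
>>>         }
>>>     }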
>>>
>>> Thanks.
>>>
>>> WBR,
>>> Alexey.
>>>
>>>
>>>
>>> On 19.04.15 05:50, Daniel Feist wrote:
>>>>
>>>> Hi,
>>>>
>>>> Sorry, false alarm. There was a stupid bug in my code (between
>>>> inbound and outbound) that was causing a deadlock in some of the
>>>> selectors, and it was this that was producing the timeouts and errors
>>>> :-( Fixed now though.
>>>>
>>>> I've now been able to run a full set of tests at different target
>>>> service latencies and concurrencies and it's running very well.
>>>>
>>>> Observations are:
>>>> - With high latency (e.g. 1000ms) blocking/non-blocking perform the
>>>> same. Of course blocking needs 1 thread per client thread, but giving
>>>> the proxy maxWorkerThreads of 10,000 just in case doesn't cause any
>>>> adverse performance, it just doesn't use the threads.
>>>> - With low latency (e.g. 0->5ms) blocking is faster, but not by much.
>>>> The number of worker threads in this case is crucial though: with more
>>>> worker threads than required to reach peak tps, I start to see a
>>>> degradation in TPS/latency.
>>>> - With medium latency (e.g. 50ms) it appears that non-blocking is
>>>> slightly faster, at least at higher concurrencies.
>>>>
>>>> Initially I was expecting to see more negative effects of having say
>>>> 4000 worker threads from context-switching etc, but this causes
>>>> minimal impact at low latencies and none at all at high latencies.
>>>>
>>>> One other interesting side effect of having 1000's of worker threads
>>>> rather than 24 selectors is the amount of direct memory used. I'm
>>>> limiting the buffer size via system properties of course, but if I
>>>> wasn't, 4000 worker threads on the hardware I'm using (which reports a
>>>> 16MB buffer size to Java) would require 125GB of direct memory vs
>>>> 0.75GB, and that's just for the read buffer. My calculations might not
>>>> be perfect but you get the idea.
>>>>
>>>> This is just a FYI, but if there is anything you think is strange, be
>>>> interesting to know..
>>>>
>>>> thanks!
>>>>
>>>> Dan
>>>>
>>>>
>>>>
>>>> On Fri, Apr 17, 2015 at 1:56 AM, Oleksiy Stashok
>>>> <oleksiy.stashok_at_oracle.com> wrote:
>>>>>
>>>>> Hi Dan,
>>>>>
>>>>> everything is possible, and maybe back pressure caused by blocking I/O
>>>>> really makes the difference...
>>>>> If you have time it would be interesting to investigate this more and
>>>>> try
>>>>> to
>>>>> check if you can register any "lost" or "forgotten" request in your
>>>>> app.
>>>>> Try
>>>>> to dump all the request/response processing timestamps to figure out
>>>>> where
>>>>> exactly the processing takes the most time and at what stage the
>>>>> timeout
>>>>> occurs: jmeter (1)--> proxy (2)--> backend (3)--> proxy (4)--> jmeter.
>>>>> According to your description it should be either 3 or 4, but it would
>>>>> be interesting to see exactly how it happens.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> WBR,
>>>>> Alexey.
>>>>>
>>>>>
>>>>> On 16.04.15 17:24, Daniel Feist wrote:
>>>>>>
>>>>>> Ignore my last email about this affecting low concurrency, it doesn't.
>>>>>> I was only seeing some errors at low concurrency due to side-effects
>>>>>> of previous test run I think. I need 2000+ JMeter client threads to
>>>>>> reproduce this consistently.
>>>>>>
>>>>>> I stripped everything out as much as possible so I'm not doing
>>>>>> anything in between, and AHC is invoking the inbound grizzly response
>>>>>> as directly as possible, but no difference. The exact error in jmeter is
>>>>>> "java.net.SocketTimeoutException,Non HTTP response message: Read timed
>>>>>> out".
>>>>>>
>>>>>> Question: this might sound stupid, but couldn't it simply be that the
>>>>>> proxy, with the number of selectors it has (and not using worker
>>>>>> threads) simply cannot handle the load? And that we don't see errors
>>>>>> with blocking because back-pressure is applied more directly, whereas
>>>>>> with non-blocking the same type of back-pressure doesn't occur and so
>>>>>> we get this type of error instead?
>>>>>>
>>>>>> Dan
>>>>>>
>>>>>> On Thu, Apr 16, 2015 at 10:16 PM, Daniel Feist <dfeist_at_gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> The thing is, if I remove the outbound call then it ceases to be a
>>>>>>> proxy, and as such I don't have a separate thread processing the
>>>>>>> response callback; instead it behaves like the blocking case (which works).
>>>>>>>
>>>>>>> Anyway, I'll try to simplify as much as possible in other ways and
>>>>>>> see
>>>>>>> where that leads me...
>>>>>>>
>>>>>>> Dan
>>>>>>>
>>>>>>> On Thu, Apr 16, 2015 at 9:00 PM, Oleksiy Stashok
>>>>>>> <oleksiy.stashok_at_oracle.com> wrote:
>>>>>>>>
>>>>>>>> Hi Dan,
>>>>>>>>
>>>>>>>> let's try to simplify the test, what happens if the proxy sends the
>>>>>>>> response
>>>>>>>> right away (no outbound calls), do you still see the timeouts?
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> WBR,
>>>>>>>> Alexey.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 16.04.15 12:17, Daniel Feist wrote:
>>>>>>>>>
>>>>>>>>> What I forgot to add is that I see the same issue with timeouts
>>>>>>>>> between jmeter and the proxy even when "jmeter threads <
>>>>>>>>> selectors",
>>>>>>>>> which kind of invalidates all of my ideas about selectors all
>>>>>>>>> potentially being busy..
>>>>>>>>>
>>>>>>>>> Wow, even with 1 thread it's occurring... must be something stupid...
>>>>>>>>> I don't think it's related to persistent connections; maxKeepAlive on
>>>>>>>>> the target service is 100, which wouldn't explain roughly 1 in 2000
>>>>>>>>> client-side timeouts, especially given no errors are being logged.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Apr 16, 2015 at 7:36 PM, Daniel Feist <dfeist_at_gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Nothing different really; the blocking version returns the response
>>>>>>>>>> when the stack returns after waiting on the outbound future returned
>>>>>>>>>> by AHC, while the non-blocking version returns the response when the
>>>>>>>>>> completion handler passed to AHC is invoked. Ah, also, the blocking
>>>>>>>>>> version uses WorkerThreadIOStrategy while the non-blocking version
>>>>>>>>>> uses SameThreadIOStrategy for inbound.
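>>>>>>>>>>
>>>>>>>>>> In AHC terms the difference is roughly this (just a sketch against
>>>>>>>>>> the com.ning AHC 1.x API; the URL is a placeholder, and this isn't
>>>>>>>>>> my actual proxy code):
>>>>>>>>>>
>>>>>>>>>>     import com.ning.http.client.AsyncCompletionHandler;
>>>>>>>>>>     import com.ning.http.client.AsyncHttpClient;
>>>>>>>>>>     import com.ning.http.client.Response;
>>>>>>>>>>
>>>>>>>>>>     public class BlockingVsNonBlocking {
>>>>>>>>>>         public static void main(String[] args) throws Exception {
>>>>>>>>>>             AsyncHttpClient client = new AsyncHttpClient();
>>>>>>>>>>
>>>>>>>>>>             // Blocking flavour: the worker thread waits on the outbound future.
>>>>>>>>>>             Response response = client.prepareGet("http://backend:8080/").execute().get();
>>>>>>>>>>             System.out.println(response.getStatusCode());
>>>>>>>>>>
>>>>>>>>>>             // Non-blocking flavour: the inbound response is written from the
>>>>>>>>>>             // completion handler instead, so no thread blocks on the future.
>>>>>>>>>>             client.prepareGet("http://backend:8080/").execute(
>>>>>>>>>>                     new AsyncCompletionHandler<Response>() {
>>>>>>>>>>                         @Override
>>>>>>>>>>                         public Response onCompleted(Response resp) {
>>>>>>>>>>                             // write resp back to the inbound connection here
>>>>>>>>>>                             return resp;
>>>>>>>>>>                         }
>>>>>>>>>>                     }).get(); // get() only so this demo waits before closing
>>>>>>>>>>             client.close();
>>>>>>>>>>         }
>>>>>>>>>>     }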
>>>>>>>>>>
>>>>>>>>>> I didn't reply earlier because I've been trying to get my head
>>>>>>>>>> around what's going on. The errors are all timeout errors: most are
>>>>>>>>>> between jmeter and the proxy, but there are also some between the
>>>>>>>>>> proxy and the target service, whereas with the blocking version
>>>>>>>>>> there are no errors at all.
>>>>>>>>>>
>>>>>>>>>> Everything seems to be ok, and there are no exceptions being thrown
>>>>>>>>>> (other than timeouts) by grizzly/ahc. So my only hypothesis is that
>>>>>>>>>> there is an issue with the selectors, either:
>>>>>>>>>>
>>>>>>>>>> i) for some reason selectors are blocking (I see no evidence of this
>>>>>>>>>> though; the only thing I have between inbound and outbound is some
>>>>>>>>>> copying of headers)
>>>>>>>>>> ii) a different number of inbound/outbound selectors could generate
>>>>>>>>>> more inbound messages than can be handled by outbound (I've ensured
>>>>>>>>>> both have the same number of selectors, and it doesn't help; giving
>>>>>>>>>> outbound more selectors than inbound seemed to improve things, but
>>>>>>>>>> not solve the problem). BTW, this thought is what provoked my
>>>>>>>>>> original email about shared transports/selectors.
>>>>>>>>>> iii) by using dedicatedAcceptor the proxy is accepting all connection
>>>>>>>>>> attempts immediately, but a selector doesn't manage to handle the
>>>>>>>>>> read event before the timeout is reached (although changing this back
>>>>>>>>>> to false didn't seem to help).
>>>>>>>>>>
>>>>>>>>>> I was initially testing with 4000 client threads, hitting the proxy
>>>>>>>>>> on a 24-core machine, which in turn hits a simple service with 5ms
>>>>>>>>>> latency on another 24-core machine. But if I run with just 200 client
>>>>>>>>>> threads I'm seeing the same :-(
>>>>>>>>>>
>>>>>>>>>> The last run I just did with a concurrency of 200 gave 1159 errors
>>>>>>>>>> (6 outbound timeouts and 1152 jmeter timeouts) in a total of
>>>>>>>>>> 4,154,978 requests. It's only 0.03%, but a lot more than blocking,
>>>>>>>>>> and there's no reason they should be happening.
>>>>>>>>>>
>>>>>>>>>> Any hints on where to look next would be greatly appreciated...
>>>>>>>>>>
>>>>>>>>>> thanks!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Apr 15, 2015 at 2:16 AM, Oleksiy Stashok
>>>>>>>>>> <oleksiy.stashok_at_oracle.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> What's the implementation diff of blocking vs. non-blocking? I
>>>>>>>>>>> mean
>>>>>>>>>>> is
>>>>>>>>>>> there
>>>>>>>>>>> any change in your code?
>>>>>>>>>>>
>>>>>>>>>>> Thanks.
>>>>>>>>>>>
>>>>>>>>>>> WBR,
>>>>>>>>>>> Alexey.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 14.04.15 18:01, Daniel Feist wrote:
>>>>>>>>>>>
>>>>>>>>>>> Very interesting. My previous tests had been with a simple inbound
>>>>>>>>>>> echo. When testing with a non-blocking proxy (1Kb payload, 5ms
>>>>>>>>>>> target service latency) optimizedForMultiplexing=false appears to
>>>>>>>>>>> give better TPS and latency :-)
>>>>>>>>>>>
>>>>>>>>>>> Having some issues with the non-blocking proxy in general though:
>>>>>>>>>>> getting a decent number of errors, whereas in blocking mode I get
>>>>>>>>>>> zero. Is it possible that stale connections aren't handled in the
>>>>>>>>>>> same way, or is there something else that might be causing this?
>>>>>>>>>>> I'll do some more digging around, but what I'm seeing right now is
>>>>>>>>>>> 0.05% of jmeter client requests timing out after 60s.
>>>>>>>>>>>
>>>>>>>>>>> Dan
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Apr 14, 2015 at 9:25 PM, Oleksiy Stashok
>>>>>>>>>>> <oleksiy.stashok_at_oracle.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Dan,
>>>>>>>>>>>>
>>>>>>>>>>>> yeah, there is no silver bullet solution for all kinds of use cases.
>>>>>>>>>>>> optimizedForMultiplexing is useful for concurrent writes, because
>>>>>>>>>>>> the outbound messages are always added to the queue and written
>>>>>>>>>>>> from the selector/nio thread, and at write time Grizzly packs all
>>>>>>>>>>>> the available outbound messages (up to some limit) and sends them
>>>>>>>>>>>> as one chunk, which reduces the number of I/O operations. When
>>>>>>>>>>>> optimizedForMultiplexing is disabled (the default), Grizzly (if the
>>>>>>>>>>>> output queue is empty) first tries to send the outbound message
>>>>>>>>>>>> right away in the same thread.
>>>>>>>>>>>> So I'd say when optimizedForMultiplexing is disabled we potentially
>>>>>>>>>>>> reduce latency, and when it is enabled we increase throughput. But
>>>>>>>>>>>> that's a very simple way to look at this config parameter; I bet in
>>>>>>>>>>>> practice you can experience the opposite :))
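>>>>>>>>>>>>
>>>>>>>>>>>> To illustrate the idea, here is a toy sketch of the two write paths
>>>>>>>>>>>> (purely conceptual, not Grizzly's actual async write queue code):
>>>>>>>>>>>>
>>>>>>>>>>>>     import java.util.ArrayDeque;
>>>>>>>>>>>>     import java.util.Queue;
>>>>>>>>>>>>
>>>>>>>>>>>>     public class WritePathSketch {
>>>>>>>>>>>>         private final Queue<byte[]> outbound = new ArrayDeque<>();
>>>>>>>>>>>>         private final boolean optimizedForMultiplexing;
>>>>>>>>>>>>
>>>>>>>>>>>>         WritePathSketch(boolean optimizedForMultiplexing) {
>>>>>>>>>>>>             this.optimizedForMultiplexing = optimizedForMultiplexing;
>>>>>>>>>>>>         }
>>>>>>>>>>>>
>>>>>>>>>>>>         // Called by whichever thread produced the message.
>>>>>>>>>>>>         synchronized void write(byte[] msg) {
>>>>>>>>>>>>             if (!optimizedForMultiplexing && outbound.isEmpty()) {
>>>>>>>>>>>>                 send(msg);         // write inline, right away (lower latency)
>>>>>>>>>>>>             } else {
>>>>>>>>>>>>                 outbound.add(msg); // queue for the selector thread (higher throughput)
>>>>>>>>>>>>             }
>>>>>>>>>>>>         }
>>>>>>>>>>>>
>>>>>>>>>>>>         // Called from the selector/nio thread: drain the queue as one batch.
>>>>>>>>>>>>         synchronized void flushFromSelectorThread() {
>>>>>>>>>>>>             byte[] msg;
>>>>>>>>>>>>             int batched = 0;
>>>>>>>>>>>>             while (batched++ < 64 && (msg = outbound.poll()) != null) {
>>>>>>>>>>>>                 send(msg);         // the real thing packs these into one I/O op
>>>>>>>>>>>>             }
>>>>>>>>>>>>         }
>>>>>>>>>>>>
>>>>>>>>>>>>         private void send(byte[] msg) {
>>>>>>>>>>>>             System.out.println("wrote " + msg.length + " bytes");
>>>>>>>>>>>>         }
>>>>>>>>>>>>     }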
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>
>>>>>>>>>>>> WBR,
>>>>>>>>>>>> Alexey.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 13.04.15 23:40, Daniel Feist wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Interestingly I saw a performance improvement using
>>>>>>>>>>>> optimizedForMultiplexing with HTTP, although this potentially
>>>>>>>>>>>> only
>>>>>>>>>>>> affected
>>>>>>>>>>>> my specific test scenario (simple low latency echo). Also note
>>>>>>>>>>>> that
>>>>>>>>>>>> this was
>>>>>>>>>>>> when using worker threads, so not straight through using
>>>>>>>>>>>> selectors.
>>>>>>>>>>>>
>>>>>>>>>>>> Let me turn off optimizedForMultiplexing, give inbound 1
>>>>>>>>>>>> selector
>>>>>>>>>>>> per
>>>>>>>>>>>> core, outbound 1 selector per core and see how this runs...
>>>>>>>>>>>>
>>>>>>>>>>>> Dan
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Apr 13, 2015 at 11:44 PM, Oleksiy Stashok
>>>>>>>>>>>> <oleksiy.stashok_at_oracle.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Even if the same selector pool is configured for inbound and
>>>>>>>>>>>>>> outbound, during response processing Grizzly will still do a
>>>>>>>>>>>>>> thread handover before sending the response to the client,
>>>>>>>>>>>>>> because of the use of AsyncQueueIO. Is this right?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Not sure I understand this, IMO there won't be any extra
>>>>>>>>>>>>>> thread
>>>>>>>>>>>>>> handover
>>>>>>>>>>>>>> involved.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I was referring to the AsyncWriteQueue. Currently I have
>>>>>>>>>>>>> 'optimizedForMultiplexing' set to true, which I thought I'd seen
>>>>>>>>>>>>> previously disables the direct writing you described further on in
>>>>>>>>>>>>> your email. Perhaps I should try without this flag though.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Right, optimizedForMultiplexing is useful when you concurrently
>>>>>>>>>>>>> write packets to the connection, which is not the case with HTTP,
>>>>>>>>>>>>> unless it's HTTP 2.0 :)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> WBR,
>>>>>>>>>>>>> Alexey.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>> ...it was evening and it was morning and there were already two ways to
>>> store Unicode...
>
>