Hi Dan,
On 21.04.15 11:05, Daniel Feist wrote:
> That formula means that the same buffer is used for both read and write,
> is that really true?
Yes, the same direct buffer is used for both.
> Of course the formula is different based on i) IOStrategy ii) if async
> write queue is used or not. My formula was assuming WorkerThreadIOStrategy
> and no async writes.
Right, in that case selector threads shouldn't be counted.
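So with WorkerThreadIOStrategy and no async write queue the estimate reduces
to the worker threads only. A rough sketch, just to make the arithmetic
concrete (the variable names and numbers below are mine, purely illustrative):

    // Rough estimate only: with WorkerThreadIOStrategy and no async write queue,
    // only worker threads hold the (shared read/write) thread-local direct buffer.
    int workerThreads = 4000;                       // illustrative
    long readBuffer   = 16L * 1024 * 1024;          // connection.getReadBufferSize()
    long writeBuffer  = 16L * 1024 * 1024;          // connection.getWriteBufferSize()
    long responseSize = 1L * 1024 * 1024;           // typical HTTP response size

    long perThread = Math.max(readBuffer,
            Math.min((long) (writeBuffer * 1.5), responseSize));
    long totalDirectMemory = workerThreads * perThread;   // ~62.5 GiB with these numbers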
Thanks.
WBR,
Alexey.
> Dan
>
> On Tue, Apr 21, 2015 at 5:35 PM, Oleksiy Stashok
> <oleksiy.stashok_at_oracle.com> wrote:
>> Hi guys,
>>
>> just a small fix for the estimate, I think it's the following:
>>
>> usage = (workerThreads + selectorThreads) *
>>         max(connection.getReadBufferSize(),
>>             min(connection.getWriteBufferSize() * 1.5, sizeHttpResponse))
>>
>> where by default:
>> connection.getReadBufferSize() == socket.getReceiveBufferSize()
>> connection.getWriteBufferSize() == socket.getSendBufferSize();
>>
>> There are also 2 system properties supported:
>>
>> org.glassfish.grizzly.nio.transport.TCPNIOTransport.max-receive-buffer-size
>> org.glassfish.grizzly.nio.transport.TCPNIOTransport.max-send-buffer-size
>>
>> which additionally cap the maximum read and write buffer sizes (by default
>> Integer.MAX_VALUE).
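>>
>> They can also be set programmatically, presumably before the transport is
>> created (the 1MB value below is just an illustration):
>>
>>     // Cap Grizzly's per-connection direct buffer sizes; 1MB here is only an example.
>>     // Presumably this needs to run before the TCPNIOTransport is created.
>>     System.setProperty(
>>             "org.glassfish.grizzly.nio.transport.TCPNIOTransport.max-receive-buffer-size",
>>             String.valueOf(1024 * 1024));
>>     System.setProperty(
>>             "org.glassfish.grizzly.nio.transport.TCPNIOTransport.max-send-buffer-size",
>>             String.valueOf(1024 * 1024));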
>>
>> WBR,
>> Alexey.
>>
>>
>>
>> On 21.04.15 04:56, Daniel Feist wrote:
>>> Hi,
>>>
>>> The size can be limited to 1MB or less using the following system
>>> property:
>>> -Dorg.glassfish.grizzly.nio.transport.TCPNIOTransport.max-receive-buffer-size=1048576.
>>> The send buffer is less problematic because it will only use 16MB if your
>>> payload is 16MB, but still, it's a good idea to limit the send buffer size
>>> too, using
>>> -Dorg.glassfish.grizzly.nio.transport.TCPNIOTransport.max-send-buffer-size=1048576.
>>>
>>> To estimate direct memory usage for HTTP inbound:
>>>
>>> direct memory usage = workerThreads x socket.getReceiveBufferSize() +
>>> workerThreads x min(socket.getSendBufferSize() * 1.5,
>>> sizeHttpResponse)
>>>
>>> So if the OS reports 16MB then with 4000 worker threads and a payload
>>> of 1MB, this means a total usage of 66GB. Whereas if we limit both to
>>> 512KB, say, then that's just under 4GB in total.
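>>>
>>> Spelling the first case out (reading the OS-reported 16MB as 16 MiB):
>>>
>>>     4000 x 16 MiB                    = 64,000 MiB ~= 62.5 GiB  (read buffers)
>>>   + 4000 x min(16 MiB x 1.5, 1 MiB)  =  4,000 MiB ~=  3.9 GiB  (write side, capped by the 1MB payload)
>>>   -----------------------------------------------------------
>>>     total                            ~= 66.4 GiB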
>>>
>>> You shouldn't be seeing a leak as such, just Grizzly wanting to use more
>>> than it's got. The thing you are likely to see before memory actually
>>> runs out is the JVM going into a continuous Full GC loop, as an
>>> explicit GC occurs on each HTTP request when the remaining direct memory
>>> is under a certain threshold.
>>>
>>> Dan
>>>
>>>
>>> On Tue, Apr 21, 2015 at 9:38 AM, Marc Arens <marc.arens_at_open-xchange.com>
>>> wrote:
>>>> Hey Alexey, Daniel,
>>>>
>>>> this sounds interesting, as some customers are seeing long-term off-heap
>>>> memory leaks. Sadly I don't have much info yet, as they didn't install the
>>>> JVM direct memory monitoring until now. How exactly was the direct memory
>>>> limited in your tests, and what are the recommendations when using many
>>>> worker threads?
>>>>
>>>>
>>>> On 21 April 2015 at 03:00 Oleksiy Stashok <oleksiy.stashok_at_oracle.com>
>>>> wrote:
>>>>
>>>>
>>>> Hi Dan,
>>>>
>>>> interesting observations! Try to play with the selector count: double it
>>>> for both cases, blocking and non-blocking, just to compare peak TPS. You
>>>> may also want to try different payload sizes.
>>>> A thread context switch is a relatively expensive operation, as are I/O
>>>> operations, but when you run load tests they can compensate for each
>>>> other in different cases; for example, you make more thread context
>>>> switches, but somehow it leads to fewer I/O ops...
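>>>>
>>>> If it helps, a minimal sketch of how I'd expect that to be configured
>>>> (assuming the Grizzly 2.x builder API; double-check the method names
>>>> against your version):
>>>>
>>>>     import org.glassfish.grizzly.nio.transport.TCPNIOTransport;
>>>>     import org.glassfish.grizzly.nio.transport.TCPNIOTransportBuilder;
>>>>     import org.glassfish.grizzly.strategies.SameThreadIOStrategy;
>>>>
>>>>     // Double the selector runner count relative to the core count, just to
>>>>     // compare peak TPS; swap in WorkerThreadIOStrategy for the blocking setup.
>>>>     TCPNIOTransport transport = TCPNIOTransportBuilder.newInstance()
>>>>             .setIOStrategy(SameThreadIOStrategy.getInstance())
>>>>             .setSelectorRunnersCount(2 * Runtime.getRuntime().availableProcessors())
>>>>             .build();
>>>>     transport.start();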
>>>>
>>>> Regarding the direct memory usage - it's expected, because Grizzly (the
>>>> JDK does the same) stores a thread-local direct ByteBuffer for read/write
>>>> operations, so more threads means more direct ByteBuffers. We store them
>>>> behind weak references, so they should be recycled at some point, but
>>>> still...
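>>>>
>>>> Just to illustrate the pattern (this is not Grizzly's actual class, only
>>>> the idea described above):
>>>>
>>>>     import java.lang.ref.WeakReference;
>>>>     import java.nio.ByteBuffer;
>>>>
>>>>     // Each thread caches one direct ByteBuffer behind a weak reference: it can
>>>>     // be reclaimed eventually, but while threads are alive and busy the memory
>>>>     // is effectively pinned - hence more threads, more direct memory.
>>>>     final class DirectBufferCache {
>>>>         private static final ThreadLocal<WeakReference<ByteBuffer>> CACHE =
>>>>                 new ThreadLocal<>();
>>>>
>>>>         static ByteBuffer get(int size) {
>>>>             WeakReference<ByteBuffer> ref = CACHE.get();
>>>>             ByteBuffer buffer = (ref != null) ? ref.get() : null;
>>>>             if (buffer == null || buffer.capacity() < size) {
>>>>                 buffer = ByteBuffer.allocateDirect(size);
>>>>                 CACHE.set(new WeakReference<>(buffer));
>>>>             }
>>>>             buffer.clear();
>>>>             return buffer;
>>>>         }
>>>>     }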
>>>>
>>>> Thanks.
>>>>
>>>> WBR,
>>>> Alexey.
>>>>
>>>>
>>>>
>>>> On 19.04.15 05:50, Daniel Feist wrote:
>>>>> Hi,
>>>>>
>>>>> Sorry, false alarm. There was a stupid bug in my code (between
>>>>> inbound and outbound) that was causing a deadlock in some of the
>>>>> selectors and it was this that was producing the timeouts and errors
>>>>> :-( Fixed now though..
>>>>>
>>>>> I've now been able to run a full set of tests at different target
>>>>> service latencies and concurrencies and it's running very well.
>>>>>
>>>>> Observations are:
>>>>> - With high latency (e.g. 1000ms) blocking/non-blocking perform the
>>>>> same. Of course blocking needs 1 thread per client thread, but giving
>>>>> the proxy maxWorkerThreads of 10,000 just in case doesn't cause any
>>>>> adverse performance, it just doesn't use the threads.
>>>>> - With low latency (e.g. 0-5ms) blocking is faster, but not by much.
>>>>> The number of worker threads in this case is crucial though; with more
>>>>> worker threads than required to reach peak TPS, I start to see a
>>>>> degradation in TPS/latency.
>>>>> - With medium latency (e.g. 50ms) it appears that non-blocking is
>>>>> slightly faster, at least at higher concurrencies.
>>>>>
>>>>> Initially I was expecting to see more negative effects of having say
>>>>> 4000 worker threads from context-switching etc, but this causes
>>>>> minimal impact at low latencies and none at all at high latencies.
>>>>>
>>>>> One other interesting side effect of having 1000s of worker threads
>>>>> rather than 24 selectors is the amount of direct memory used. I'm
>>>>> limiting buffer size via system properties of course, but if I wasn't,
>>>>> 4000 worker threads on the hardware I'm using (which reports a 16MB
>>>>> buffer size to Java) would require 125GB of direct memory vs 0.75GB, and
>>>>> that's just for the read buffer. My calculations might not be perfect but
>>>>> you get the idea..
>>>>>
>>>>> This is just an FYI, but if there is anything you think is strange, it'd
>>>>> be interesting to know..
>>>>>
>>>>> thanks!
>>>>>
>>>>> Dan
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Apr 17, 2015 at 1:56 AM, Oleksiy Stashok
>>>>> <oleksiy.stashok_at_oracle.com> wrote:
>>>>>> Hi Dan,
>>>>>>
>>>>>> everything is possible, and maybe back pressure caused by blocking I/O
>>>>>> really makes the difference...
>>>>>> If you have time it would be interesting to investigate this more and
>>>>>> try to check if you can register any "lost" or "forgotten" request in
>>>>>> your app. Try to dump all the request/response processing timestamps to
>>>>>> figure out where exactly the processing takes the most time and at what
>>>>>> stage the timeout occurs:
>>>>>> jmeter (1)--> proxy (2)--> backend (3)--> proxy (4)--> jmeter.
>>>>>> According to your description it should be either 3 or 4, but it would
>>>>>> be interesting to see exactly how it happens.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> WBR,
>>>>>> Alexey.
>>>>>>
>>>>>>
>>>>>> On 16.04.15 17:24, Daniel Feist wrote:
>>>>>>> Ignore my last email about this affecting low concurrency, it doesn't.
>>>>>>> I was only seeing some errors at low concurrency due to side-effects
>>>>>>> of the previous test run, I think. I need 2000+ JMeter client threads to
>>>>>>> reproduce this consistently.
>>>>>>>
>>>>>>> I stripped everything out as much as possible so I'm not doing
>>>>>>> anything in between, and AHC is invoking the inbound Grizzly response as
>>>>>>> directly as possible, but no difference. The exact error in jmeter is
>>>>>>> "java.net.SocketTimeoutException,Non HTTP response message: Read timed
>>>>>>> out".
>>>>>>>
>>>>>>> Question: this might sound stupid, but couldn't it simply be that the
>>>>>>> proxy, with the number of selectors it has (and not using worker
>>>>>>> threads), cannot handle the load? And that we don't see errors
>>>>>>> with blocking because back-pressure is applied more directly, whereas
>>>>>>> with non-blocking the same type of back-pressure doesn't occur and so
>>>>>>> we get this type of error instead?
>>>>>>>
>>>>>>> Dan
>>>>>>>
>>>>>>> On Thu, Apr 16, 2015 at 10:16 PM, Daniel Feist <dfeist_at_gmail.com>
>>>>>>> wrote:
>>>>>>>> The thing is, if I remove the outbound call then it ceases to be a
>>>>>>>> proxy, and as such I don't have a separate thread processing the
>>>>>>>> response callback; instead it behaves as blocking (which works).
>>>>>>>>
>>>>>>>> Anyway, I'll try to simplify as much as possible in other ways and
>>>>>>>> see
>>>>>>>> where that leads me...
>>>>>>>>
>>>>>>>> Dan
>>>>>>>>
>>>>>>>> On Thu, Apr 16, 2015 at 9:00 PM, Oleksiy Stashok
>>>>>>>> <oleksiy.stashok_at_oracle.com> wrote:
>>>>>>>>> Hi Dan,
>>>>>>>>>
>>>>>>>>> let's try to simplify the test, what happens if the proxy sends the
>>>>>>>>> response
>>>>>>>>> right away (no outbound calls), do you still see the timeouts?
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>> WBR,
>>>>>>>>> Alexey.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 16.04.15 12:17, Daniel Feist wrote:
>>>>>>>>>> What I forgot to add is that I see the same issue with timeouts
>>>>>>>>>> between jmeter and the proxy even when "jmeter threads < selectors",
>>>>>>>>>> which kind of invalidates all of my ideas about the selectors all
>>>>>>>>>> potentially being busy..
>>>>>>>>>>
>>>>>>>>>> Wow, even with 1 thread it's occurring.. must be something stupid...
>>>>>>>>>> I don't think it's related to persistent connections; maxKeepAlive
>>>>>>>>>> on the target service is 100, which wouldn't explain roughly 1 in
>>>>>>>>>> 2000 client-side timeouts, especially given no errors are being
>>>>>>>>>> logged.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 16, 2015 at 7:36 PM, Daniel Feist <dfeist_at_gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Nothing different really, just the blocking version returns the
>>>>>>>>>>> response when the stack returns after waiting on the outbound future
>>>>>>>>>>> returned by AHC, while the non-blocking version returns the response
>>>>>>>>>>> when the completion handler passed to AHC is invoked. Ah, also the
>>>>>>>>>>> blocking version uses WorkerThreadIOStrategy while the non-blocking
>>>>>>>>>>> version uses SameThreadIOStrategy for inbound.
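>>>>>>>>>>>
>>>>>>>>>>> Roughly, in case it helps to see the difference (a sketch against the
>>>>>>>>>>> com.ning AsyncHttpClient API of that era; targetUrl and
>>>>>>>>>>> writeInboundResponse are hypothetical placeholders, exception handling
>>>>>>>>>>> omitted):
>>>>>>>>>>>
>>>>>>>>>>>     // Blocking flavour: the inbound worker thread parks on the future until
>>>>>>>>>>>     // the outbound response arrives, then writes the inbound response itself.
>>>>>>>>>>>     Response response = asyncHttpClient.prepareGet(targetUrl).execute().get();
>>>>>>>>>>>     writeInboundResponse(response);
>>>>>>>>>>>
>>>>>>>>>>>     // Non-blocking flavour: the inbound selector thread returns immediately;
>>>>>>>>>>>     // the inbound response is written from whatever thread invokes the handler.
>>>>>>>>>>>     asyncHttpClient.prepareGet(targetUrl).execute(new AsyncCompletionHandler<Response>() {
>>>>>>>>>>>         @Override
>>>>>>>>>>>         public Response onCompleted(Response response) {
>>>>>>>>>>>             writeInboundResponse(response);
>>>>>>>>>>>             return response;
>>>>>>>>>>>         }
>>>>>>>>>>>     });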
>>>>>>>>>>>
>>>>>>>>>>> I didn't reply earlier because I've been trying to get my head
>>>>>>>>>>> round what's going on. The errors are all timeout errors. Most of the
>>>>>>>>>>> timeout errors are between jmeter and the proxy, but there are also
>>>>>>>>>>> some timeout errors between the proxy and the target service, whereas
>>>>>>>>>>> with the blocking version there are no errors at all.
>>>>>>>>>>>
>>>>>>>>>>> Everything seems to be OK, and there are no exceptions being thrown
>>>>>>>>>>> (other than timeouts) by grizzly/ahc. So my only hypothesis is that
>>>>>>>>>>> there is an issue with the selectors, either:
>>>>>>>>>>>
>>>>>>>>>>> i) for some reason selectors are blocking (I see no evidence of this
>>>>>>>>>>> though; the only thing I have between inbound and outbound is some
>>>>>>>>>>> copying of headers)
>>>>>>>>>>> ii) a different number of inbound/outbound selectors could generate
>>>>>>>>>>> more inbound messages than can be handled by outbound (I've ensured
>>>>>>>>>>> both have the same number of selectors and it doesn't help; giving
>>>>>>>>>>> outbound more selectors than inbound seemed to improve things, but
>>>>>>>>>>> not solve the problem). BTW this thought is what provoked my original
>>>>>>>>>>> email about shared transports/selectors.
>>>>>>>>>>> iii) by using dedicatedAcceptor the proxy is accepting all connection
>>>>>>>>>>> attempts immediately, but a selector doesn't manage to handle the
>>>>>>>>>>> read event before the timeout is reached (although changing this back
>>>>>>>>>>> to false didn't seem to help).
>>>>>>>>>>>
>>>>>>>>>>> I was initially testing with 4000 client threads, hitting the proxy
>>>>>>>>>>> on a 24-core machine which in turn hits a simple service with 5ms
>>>>>>>>>>> latency on another 24-core machine. But if I run with just 200 client
>>>>>>>>>>> threads I'm seeing the same :-(
>>>>>>>>>>>
>>>>>>>>>>> The last run I just did with a concurrency of 200 gave 1159 errors (6
>>>>>>>>>>> outbound timeouts and 1152 jmeter timeouts) in a total of 4,154,978
>>>>>>>>>>> requests. It's only 0.03%, but a lot more than blocking, and there's
>>>>>>>>>>> no reason they should be happening.
>>>>>>>>>>>
>>>>>>>>>>> Any hints on where to look next would be greatly appreciated...
>>>>>>>>>>>
>>>>>>>>>>> thanks!
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Apr 15, 2015 at 2:16 AM, Oleksiy Stashok
>>>>>>>>>>> <oleksiy.stashok_at_oracle.com> wrote:
>>>>>>>>>>>> What's the implementation diff of blocking vs. non-blocking? I mean,
>>>>>>>>>>>> is there any change in your code?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>
>>>>>>>>>>>> WBR,
>>>>>>>>>>>> Alexey.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 14.04.15 18:01, Daniel Feist wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Very interesting. My previous tests had been with a simple inbound
>>>>>>>>>>>> echo. When testing with a non-blocking proxy (1KB payload, 5ms
>>>>>>>>>>>> target service latency) optimizedForMultiplexing=false appears to
>>>>>>>>>>>> give better TPS and latency :-)
>>>>>>>>>>>>
>>>>>>>>>>>> Having some issues with the non-blocking proxy in general though:
>>>>>>>>>>>> I'm getting a decent number of errors whereas in blocking mode I get
>>>>>>>>>>>> zero. Is it possible that stale connections aren't handled in the
>>>>>>>>>>>> same way, or is there something else that might be causing this?
>>>>>>>>>>>> I'll do some more digging around, but what I'm seeing right now is
>>>>>>>>>>>> 0.05% of jmeter client requests timing out after 60s.
>>>>>>>>>>>>
>>>>>>>>>>>> Dan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Apr 14, 2015 at 9:25 PM, Oleksiy Stashok
>>>>>>>>>>>> <oleksiy.stashok_at_oracle.com> wrote:
>>>>>>>>>>>>> Hi Dan,
>>>>>>>>>>>>>
>>>>>>>>>>>>> yeah, there is no silver bullet solution for all kinds of use
>>>>>>>>>>>>> cases.
>>>>>>>>>>>>> optimizedForMultiplexing is useful for concurrent writes, because
>>>>>>>>>>>>> the outbound messages are always added to the queue and written
>>>>>>>>>>>>> from the selector/NIO thread, and at write time Grizzly packs all
>>>>>>>>>>>>> (up to some limit) the available outbound messages and sends them
>>>>>>>>>>>>> as one chunk, which reduces the number of I/O operations. When
>>>>>>>>>>>>> optimizedForMultiplexing is disabled (the default), Grizzly (if the
>>>>>>>>>>>>> output queue is empty) first tries to send the outbound message
>>>>>>>>>>>>> right away in the same thread.
>>>>>>>>>>>>> So I'd say when optimizedForMultiplexing is disabled we potentially
>>>>>>>>>>>>> reduce latency, and when optimizedForMultiplexing is enabled we
>>>>>>>>>>>>> increase throughput. But it's a very simple way to look at this
>>>>>>>>>>>>> config parameter; I bet in practice you can experience the
>>>>>>>>>>>>> opposite :))
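>>>>>>>>>>>>>
>>>>>>>>>>>>> To make the two modes concrete, a purely conceptual sketch of that
>>>>>>>>>>>>> write path (not Grizzly's actual code, just the behaviour described
>>>>>>>>>>>>> above):
>>>>>>>>>>>>>
>>>>>>>>>>>>>     import java.io.IOException;
>>>>>>>>>>>>>     import java.nio.ByteBuffer;
>>>>>>>>>>>>>     import java.nio.channels.SocketChannel;
>>>>>>>>>>>>>     import java.util.ArrayDeque;
>>>>>>>>>>>>>     import java.util.Queue;
>>>>>>>>>>>>>
>>>>>>>>>>>>>     final class WritePathSketch {
>>>>>>>>>>>>>         private final Queue<ByteBuffer> outputQueue = new ArrayDeque<>();
>>>>>>>>>>>>>         private final boolean optimizedForMultiplexing;
>>>>>>>>>>>>>
>>>>>>>>>>>>>         WritePathSketch(boolean optimizedForMultiplexing) {
>>>>>>>>>>>>>             this.optimizedForMultiplexing = optimizedForMultiplexing;
>>>>>>>>>>>>>         }
>>>>>>>>>>>>>
>>>>>>>>>>>>>         // Called from whatever thread produced the outbound message.
>>>>>>>>>>>>>         synchronized void write(SocketChannel channel, ByteBuffer message)
>>>>>>>>>>>>>                 throws IOException {
>>>>>>>>>>>>>             if (!optimizedForMultiplexing && outputQueue.isEmpty()) {
>>>>>>>>>>>>>                 channel.write(message);   // default: write right away, lower latency
>>>>>>>>>>>>>                 if (!message.hasRemaining()) {
>>>>>>>>>>>>>                     return;               // fully written, nothing queued
>>>>>>>>>>>>>                 }
>>>>>>>>>>>>>             }
>>>>>>>>>>>>>             // optimizedForMultiplexing (or a non-empty queue): enqueue and let
>>>>>>>>>>>>>             // the selector/NIO thread flush everything later in one gathering
>>>>>>>>>>>>>             // write, trading a little latency for fewer I/O operations.
>>>>>>>>>>>>>             outputQueue.add(message);
>>>>>>>>>>>>>         }
>>>>>>>>>>>>>
>>>>>>>>>>>>>         // Called from the selector/NIO thread.
>>>>>>>>>>>>>         synchronized void flush(SocketChannel channel) throws IOException {
>>>>>>>>>>>>>             ByteBuffer[] batch = outputQueue.toArray(new ByteBuffer[0]);
>>>>>>>>>>>>>             channel.write(batch);         // one gathering write for the whole batch
>>>>>>>>>>>>>             outputQueue.removeIf(b -> !b.hasRemaining());
>>>>>>>>>>>>>         }
>>>>>>>>>>>>>     }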
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> WBR,
>>>>>>>>>>>>> Alexey.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 13.04.15 23:40, Daniel Feist wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Interestingly I saw a performance improvement using
>>>>>>>>>>>>> optimizedForMultiplexing with HTTP, although this potentially only
>>>>>>>>>>>>> affected my specific test scenario (simple low-latency echo). Also
>>>>>>>>>>>>> note that this was when using worker threads, so not straight
>>>>>>>>>>>>> through using selectors.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let me turn off optimizedForMultiplexing, give inbound 1 selector
>>>>>>>>>>>>> per core and outbound 1 selector per core, and see how this runs...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Dan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Apr 13, 2015 at 11:44 PM, Oleksiy Stashok
>>>>>>>>>>>>> <oleksiy.stashok_at_oracle.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Even if the same selector pool is configured for inbound and
>>>>>>>>>>>>>>> outbound, during response processing Grizzly will still do a
>>>>>>>>>>>>>>> thread handover before sending the response to the client,
>>>>>>>>>>>>>>> because of the use of AsyncQueueIO. Is this right?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Not sure I understand this; IMO there won't be any extra thread
>>>>>>>>>>>>>>> handover involved.
>>>>>>>>>>>>>> I was referring to the AsyncWriteQueue. Currently I have
>>>>>>>>>>>>>> 'optimizedForMultiplexing' set to true, which I thought I'd seen
>>>>>>>>>>>>>> previously disables the direct writing you described further on
>>>>>>>>>>>>>> in your email. Perhaps I should try without this flag though.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Right, optimizedForMultiplexing is useful when you concurrently
>>>>>>>>>>>>>> write packets to the connection, which is not the case with HTTP,
>>>>>>>>>>>>>> unless it's HTTP 2.0 :)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> WBR,
>>>>>>>>>>>>>> Alexey.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>> ...it was evening and it was morning and there were already two ways to
>>>> store Unicode...
>>