> On 21 April 2015 at 14:32 Marc Arens <marc.arens_at_open-xchange.com> wrote:
>
>
>
> >
> > On 21 April 2015 at 13:56 Daniel Feist <dfeist_at_gmail.com> wrote:
> >
> >
> > Hi,
> >
> > The receive buffer size can be limited to 1MB or less using the following
> > system property:
> > -Dorg.glassfish.grizzly.nio.transport.TCPNIOTransport.max-receive-buffer-size=1048576.
> > The send buffer is less problematic because it will only use 16MB if your
> > payload is 16MB, but still, it's a good idea to limit the send buffer size
> > too, using
> > -Dorg.glassfish.grizzly.nio.transport.TCPNIOTransport.max-send-buffer-size=1048576.
> >
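> > For reference, the same limits can be applied programmatically; a minimal
> > sketch, assuming these properties are read when the transport classes are
> > initialized, so this has to run before any Grizzly class is loaded (the -D
> > flags are the safer option):
> >
> >     public final class BufferLimits {
> >         public static void main(String[] args) throws Exception {
> >             final String prefix =
> >                     "org.glassfish.grizzly.nio.transport.TCPNIOTransport.";
> >             // Cap both socket buffer limits at 1MB, mirroring the -D flags above.
> >             System.setProperty(prefix + "max-receive-buffer-size", "1048576");
> >             System.setProperty(prefix + "max-send-buffer-size", "1048576");
> >             // ... create and start the HTTP server/client transports as usual ...
> >         }
> >     }
> >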
> > To estimate direct memory usage for HTTP inbound:
> >
> > direct memory usage = workerThreads * socket.getReceiveBufferSize()
> >                     + workerThreads * min(socket.getSendBufferSize() * 1.5,
> >                                           sizeHttpResponse)
> >
> > So if the OS reports 16MB, then with 4000 worker threads and a payload
> > of 1MB this means a total usage of 66GB, whereas if we limit both to
> > 512KB, say, it's just under 4GB in total.
> >
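> > A back-of-the-envelope version of that estimate in code (the names are
> > illustrative, not a Grizzly API):
> >
> >     static long estimateDirectMemory(final long workerThreads,
> >                                      final long receiveBufferSize,
> >                                      final long sendBufferSize,
> >                                      final long httpResponseSize) {
> >         return workerThreads * receiveBufferSize
> >                 + workerThreads * Math.min((long) (sendBufferSize * 1.5),
> >                                            httpResponseSize);
> >     }
> >
> >     // estimateDirectMemory(4000, 16 << 20, 16 << 20, 1 << 20) gives ~66GB.
> >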
>
> Thanks, I'll do some in-house testing to see how this influences our
> application.
>
> >
> > You shouldn't be seeing a leak as such, just Grizzly wanting to use
> > more than it's got. The thing you are likely to see before memory
> > actually runs out is the JVM going into a continuous FullGC loop, as
> > an explicit GC occurs on each HTTP request when the remaining direct
> > memory is under a certain threshold.
> >
>
> Of course there _should_ be no leak, but something is leaking off-heap and
> we haven't found out what exactly it is yet. As Grizzly is one of the
> frameworks that uses direct memory, I'll have to investigate that part.
> Inspecting the process pmap shows some huge and some constantly growing anon
> regions, e.g. 2GiB resident RAM usage while the heap dump is around 200MiB. I
> was looking at
> http://developerblog.redhat.com/2015/01/06/malloc-systemtap-probes-an-example/
> already for debugging this; any other recommendations, in case anybody here
> has hit a similar problem? (to hijack this thread completely :D)
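>
> For what it's worth, the JVM's own view of NIO buffer usage can be dumped
> from inside the process via the buffer pool MXBeans (Java 7+); a small
> sketch - anything growing in pmap beyond these numbers is being allocated
> outside NIO buffers (e.g. by native malloc):
>
>     import java.lang.management.BufferPoolMXBean;
>     import java.lang.management.ManagementFactory;
>
>     public final class DirectMemoryDump {
>         public static void main(String[] args) {
>             // One bean per pool, typically "direct" and "mapped".
>             for (BufferPoolMXBean pool : ManagementFactory
>                     .getPlatformMXBeans(BufferPoolMXBean.class)) {
>                 System.out.printf("%s: count=%d used=%d capacity=%d%n",
>                         pool.getName(), pool.getCount(),
>                         pool.getMemoryUsed(), pool.getTotalCapacity());
>             }
>         }
>     }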
Just a late update on this one. There was no direct problem with Grizzly
but rather with the JVM. The customer updated from
java-1.7.0-oracle-devel-1.7.0.55-1jpp.1.el6_5.x86_64 to 1.7.0_75 on RHEL 6,
which solved the memory leak issue for them. Might be good to know in case any
of you are suffering from strange long-term off-heap leaks now or in the future.
>
> >
> > Dan
> >
> >
> > On Tue, Apr 21, 2015 at 9:38 AM, Marc Arens
> > <marc.arens_at_open-xchange.com> wrote:
> > > Hey Alexey, Daniel,
> > >
> > > this sounds interesting, as some customers are seeing long-term
> > > off-heap memory leaks. Sadly I don't have much info yet, as they didn't
> > > install the JVM direct memory monitoring until now. How exactly was the
> > > direct memory limited in your tests, and what are the recommendations
> > > when using many worker threads?
> > >
> > >
> > > On 21 April 2015 at 03:00 Oleksiy Stashok <oleksiy.stashok_at_oracle.com>
> > > wrote:
> > >
> > >
> > > Hi Dan,
> > >
> > > interesting observations! Try playing with the selector count: double
> > > it for both cases, blocking and non-blocking, just to compare peak
> > > TPS. You may also want to try different payload sizes.
> > > A thread context switch is a relatively expensive operation, as are I/O
> > > operations, but when you run load tests they can compensate for each
> > > other in different cases; for example, you make more thread context
> > > switches, but somehow it leads to fewer I/O ops...
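> > >
> > > Something like this, if it helps (a sketch; the builder method names
> > > are from memory, so double-check them):
> > >
> > >     import org.glassfish.grizzly.nio.transport.TCPNIOTransport;
> > >     import org.glassfish.grizzly.nio.transport.TCPNIOTransportBuilder;
> > >
> > >     public final class SelectorCountDemo {
> > >         public static void main(String[] args) throws Exception {
> > >             final int cores = Runtime.getRuntime().availableProcessors();
> > >             final TCPNIOTransport transport = TCPNIOTransportBuilder
> > >                     .newInstance()
> > >                     // Default is roughly one selector runner per core;
> > >                     // try doubling it and compare peak TPS.
> > >                     .setSelectorRunnersCount(cores * 2)
> > >                     .build();
> > >             transport.start();
> > >             // ... attach the filter chain / run the load test ...
> > >             transport.shutdownNow();
> > >         }
> > >     }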
> > >
> > > Regarding the direct memory usage - it's expected, because Grizzly
> > > (the JDK does the same) stores a thread-local direct ByteBuffer for
> > > read/write operations; more threads means more direct ByteBuffers. We
> > > store them in weak references, so they should be recycled at some
> > > point, but still...
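> > >
> > > Roughly this pattern, just to illustrate the idea (not the actual
> > > Grizzly code):
> > >
> > >     import java.lang.ref.WeakReference;
> > >     import java.nio.ByteBuffer;
> > >
> > >     final class ThreadLocalDirectBuffers {
> > >         private static final ThreadLocal<WeakReference<ByteBuffer>> CACHE =
> > >                 new ThreadLocal<WeakReference<ByteBuffer>>();
> > >
> > >         // Each thread caches at most one direct buffer, weakly referenced
> > >         // so the GC may reclaim it; more threads, more direct memory.
> > >         static ByteBuffer take(final int size) {
> > >             final WeakReference<ByteBuffer> ref = CACHE.get();
> > >             ByteBuffer buffer = (ref != null) ? ref.get() : null;
> > >             if (buffer == null || buffer.capacity() < size) {
> > >                 buffer = ByteBuffer.allocateDirect(size);
> > >                 CACHE.set(new WeakReference<ByteBuffer>(buffer));
> > >             }
> > >             buffer.clear();
> > >             return buffer;
> > >         }
> > >     }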
> > >
> > > Thanks.
> > >
> > > WBR,
> > > Alexey.
> > >
> > >
> > >
> > > On 19.04.15 05:50, Daniel Feist wrote:
> > >> Hi,
> > >>
> > >> Sorry, false alarm. There was a stupid bug in my code (between
> > >> inbound and outbound) that was causing a deadlock in some of the
> > >> selectors and it was this that was producing the timeouts and errors
> > >> :-( Fixed now though..
> > >>
> > >> I've now been able to run a full set of tests at different target
> > >> service latencies and concurrencies and it's running very well.
> > >>
> > >> Observations are:
> > >> - With high latency (e.g. 1000ms), blocking and non-blocking perform
> > >> the same. Of course blocking needs 1 thread per client thread, but
> > >> giving the proxy maxWorkerThreads of 10,000 just in case doesn't cause
> > >> any adverse performance; it just doesn't use the threads.
> > >> - With low latency (e.g. 0-5ms), blocking is faster, but not by much.
> > >> The number of worker threads is crucial in this case though: with more
> > >> worker threads than required to reach peak TPS, I start to see a
> > >> degradation in TPS/latency.
> > >> - With medium latency (e.g. 50ms), it appears that non-blocking is
> > >> slightly faster, at least at higher concurrencies.
> > >>
> > >> Initially I was expecting to see more negative effects of having,
> > >> say, 4000 worker threads due to context switching etc., but this
> > >> causes minimal impact at low latencies and none at all at high
> > >> latencies.
> > >>
> > >> One other interesting side effect of having 1000s of worker threads
> > >> rather than 24 selectors is the amount of direct memory used. I'm
> > >> limiting the buffer size via system properties of course, but if I
> > >> wasn't, 4000 worker threads on the hardware I'm using (which reports a
> > >> 16MB buffer size to Java) would require 125GB of direct memory vs
> > >> 0.75GB, and that's just for the read buffer. My calculations might not
> > >> be perfect but you get the idea..
> > >>
> > >> This is just an FYI, but if there is anything you think is strange,
> > >> it'd be interesting to know..
> > >>
> > >> thanks!
> > >>
> > >> Dan
> > >>
> > >>
> > >>
> > >> On Fri, Apr 17, 2015 at 1:56 AM, Oleksiy Stashok
> > >> <oleksiy.stashok_at_oracle.com> wrote:
> > >>> Hi Dan,
> > >>>
> > >>> everything is possible, and maybe back pressure caused by blocking
> > >>> I/O really makes the difference...
> > >>> If you have time it would be interesting to investigate this more and
> > >>> check whether you can register any "lost" or "forgotten" requests in
> > >>> your app. Try to dump all the request/response processing timestamps
> > >>> to figure out where exactly the processing takes the most time and at
> > >>> what stage the timeout occurs: jmeter (1)--> proxy (2)--> backend
> > >>> (3)--> proxy (4)--> jmeter. According to your description it should
> > >>> be either 3 or 4, but it would be interesting to see exactly how it
> > >>> happens.
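> > >>>
> > >>> For the timestamps, a filter dropped into both filter chains would
> > >>> do; a minimal sketch (one instance per chain, labelled with the
> > >>> stage):
> > >>>
> > >>>     import java.io.IOException;
> > >>>     import org.glassfish.grizzly.filterchain.BaseFilter;
> > >>>     import org.glassfish.grizzly.filterchain.FilterChainContext;
> > >>>     import org.glassfish.grizzly.filterchain.NextAction;
> > >>>
> > >>>     public final class TimestampFilter extends BaseFilter {
> > >>>         private final String stage; // e.g. "proxy-inbound"
> > >>>
> > >>>         public TimestampFilter(final String stage) {
> > >>>             this.stage = stage;
> > >>>         }
> > >>>
> > >>>         @Override
> > >>>         public NextAction handleRead(final FilterChainContext ctx)
> > >>>                 throws IOException {
> > >>>             // Nanosecond timestamp as the message passes this point.
> > >>>             System.out.printf("%d %s read %s%n", System.nanoTime(),
> > >>>                     stage, ctx.getConnection());
> > >>>             return ctx.getInvokeAction();
> > >>>         }
> > >>>
> > >>>         @Override
> > >>>         public NextAction handleWrite(final FilterChainContext ctx)
> > >>>                 throws IOException {
> > >>>             System.out.printf("%d %s write %s%n", System.nanoTime(),
> > >>>                     stage, ctx.getConnection());
> > >>>             return ctx.getInvokeAction();
> > >>>         }
> > >>>     }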
> > >>>
> > >>> Thanks.
> > >>>
> > >>> WBR,
> > >>> Alexey.
> > >>>
> > >>>
> > >>> On 16.04.15 17:24, Daniel Feist wrote:
> > >>>> Ignore my last email about this affecting low concurrency; it
> > >>>> doesn't. I was only seeing some errors at low concurrency due to
> > >>>> side effects of a previous test run, I think. I need 2000+ JMeter
> > >>>> client threads to reproduce this consistently.
> > >>>>
> > >>>> I stripped everything down as much as possible so I'm not doing
> > >>>> anything in between, and AHC is invoking the inbound Grizzly
> > >>>> response as directly as possible, but no difference. The exact error
> > >>>> in JMeter is "java.net.SocketTimeoutException,Non HTTP response
> > >>>> message: Read timed out".
> > >>>>
> > >>>> Question: this might sound stupid, but couldn't it simply be that
> > >>>> the proxy, with the number of selectors it has (and not using worker
> > >>>> threads), simply cannot handle the load? And that we don't see
> > >>>> errors with blocking because back-pressure is applied more directly,
> > >>>> whereas with non-blocking the same type of back-pressure doesn't
> > >>>> occur and so we get this type of error instead?
> > >>>>
> > >>>> Dan
> > >>>>
> > >>>> On Thu, Apr 16, 2015 at 10:16 PM, Daniel Feist <dfeist_at_gmail.com>
> > >>>> wrote:
> > >>>>> The thing is, if I remove the outbound call then it ceases to be a
> > >>>>> proxy, and as such I don't have a separate thread processing the
> > >>>>> response callback; instead it behaves like the blocking version
> > >>>>> (which works).
> > >>>>>
> > >>>>> Anyway, I'll try to simplify as much as possible in other ways and
> > >>>>> see
> > >>>>> where that leads me...
> > >>>>>
> > >>>>> Dan
> > >>>>>
> > >>>>> On Thu, Apr 16, 2015 at 9:00 PM, Oleksiy Stashok
> > >>>>> <oleksiy.stashok_at_oracle.com> wrote:
> > >>>>>> Hi Dan,
> > >>>>>>
> > >>>>>> let's try to simplify the test: what happens if the proxy sends
> > >>>>>> the response right away (no outbound calls)? Do you still see the
> > >>>>>> timeouts?
> > >>>>>>
> > >>>>>> Thanks.
> > >>>>>>
> > >>>>>> WBR,
> > >>>>>> Alexey.
> > >>>>>>
> > >>>>>>
> > >>>>>> On 16.04.15 12:17, Daniel Feist wrote:
> > >>>>>>> What I forgot to add is that I see the same issue with timeouts
> > >>>>>>> between jmeter and the proxy even when "jmeter threads <
> > >>>>>>> selectors", which kind of invalidates all of my ideas about the
> > >>>>>>> selectors all potentially being busy..
> > >>>>>>>
> > >>>>>>> Wow, it's occurring even with 1 thread.. must be something
> > >>>>>>> stupid... I don't think it's related to persistent connections;
> > >>>>>>> maxKeepAlive on the target service is 100, which wouldn't explain
> > >>>>>>> roughly 1 in 2000 client-side timeouts, especially given no
> > >>>>>>> errors are being logged.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Thu, Apr 16, 2015 at 7:36 PM, Daniel Feist <dfeist_at_gmail.com>
> > >>>>>>> wrote:
> > >>>>>>>> Hi,
> > >>>>>>>>
> > >>>>>>>> Nothing different really; the blocking version returns the
> > >>>>>>>> response when the stack returns after waiting on the outbound
> > >>>>>>>> future returned by AHC, while the non-blocking version returns
> > >>>>>>>> the response when the completion handler passed to AHC is
> > >>>>>>>> invoked. Ah, also the blocking version uses
> > >>>>>>>> WorkerThreadIOStrategy while the non-blocking version uses
> > >>>>>>>> SameThreadIOStrategy for inbound.
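> > >>>>>>>>
> > >>>>>>>> i.e. roughly this difference (a fragment; com.ning AHC API from
> > >>>>>>>> memory, and writeResponseToClient/targetUrl are stand-ins for my
> > >>>>>>>> code):
> > >>>>>>>>
> > >>>>>>>>     // Blocking version: the worker thread waits on the future.
> > >>>>>>>>     Response response =
> > >>>>>>>>             asyncHttpClient.prepareGet(targetUrl).execute().get();
> > >>>>>>>>     writeResponseToClient(response);
> > >>>>>>>>
> > >>>>>>>>     // Non-blocking version: the response is written from the
> > >>>>>>>>     // completion callback instead.
> > >>>>>>>>     asyncHttpClient.prepareGet(targetUrl).execute(
> > >>>>>>>>             new AsyncCompletionHandler<Response>() {
> > >>>>>>>>                 @Override
> > >>>>>>>>                 public Response onCompleted(final Response response) {
> > >>>>>>>>                     writeResponseToClient(response);
> > >>>>>>>>                     return response;
> > >>>>>>>>                 }
> > >>>>>>>>             });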
> > >>>>>>>>
> > >>>>>>>> I didn't reply earlier because I've been trying to get my head
> > >>>>>>>> round what's going on. The errors are all timeout errors. Most
> > >>>>>>>> of the timeouts are between jmeter and the proxy, but there are
> > >>>>>>>> also some between the proxy and the target service, whereas with
> > >>>>>>>> the blocking version there are no errors at all.
> > >>>>>>>>
> > >>>>>>>> Everything seems to be ok, and there are no exceptions being
> > >>>>>>>> thrown (other than timeouts) by grizzly/ahc. So my only
> > >>>>>>>> hypothesis is that there is an issue with the selectors, either:
> > >>>>>>>>
> > >>>>>>>> i) for some reason selectors are blocking (I see no evidence of
> > >>>>>>>> this though; the only thing I have between inbound and outbound
> > >>>>>>>> is some copying of headers)
> > >>>>>>>> ii) a different number of inbound/outbound selectors could
> > >>>>>>>> generate more inbound messages than can be handled by outbound
> > >>>>>>>> (I've ensured both have the same number of selectors, and it
> > >>>>>>>> doesn't help; giving outbound more selectors than inbound seemed
> > >>>>>>>> to improve things, but not solve the problem). BTW, this is what
> > >>>>>>>> provoked my original email about shared transports/selectors.
> > >>>>>>>> iii) by using dedicatedAcceptor the proxy is accepting all
> > >>>>>>>> connection attempts immediately, but a selector doesn't manage
> > >>>>>>>> to handle the read event before the timeout is reached (although
> > >>>>>>>> changing this back to false didn't seem to help).
> > >>>>>>>>
> > >>>>>>>> I was initially testing with 4000 client threads hitting a proxy
> > >>>>>>>> on a 24-core machine, which in turn hits a simple service with
> > >>>>>>>> 5ms latency on another 24-core machine. But if I run with just
> > >>>>>>>> 200 client threads I'm seeing the same :-(
> > >>>>>>>>
> > >>>>>>>> The last run I just did with a concurrency of 200 gave 1159
> > >>>>>>>> errors (6 outbound timeouts and 1152 jmeter timeouts) in a total
> > >>>>>>>> of 4,154,978 requests. It's only 0.03%, but that's a lot more
> > >>>>>>>> than blocking, and there's no reason they should be happening.
> > >>>>>>>>
> > >>>>>>>> Any hints on where to look next would be greatly appreciated...
> > >>>>>>>>
> > >>>>>>>> thanks!
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Wed, Apr 15, 2015 at 2:16 AM, Oleksiy Stashok
> > >>>>>>>> <oleksiy.stashok_at_oracle.com> wrote:
> > >>>>>>>>> What's the implementation diff of blocking vs. non-blocking? I
> > >>>>>>>>> mean, is there any change in your code?
> > >>>>>>>>>
> > >>>>>>>>> Thanks.
> > >>>>>>>>>
> > >>>>>>>>> WBR,
> > >>>>>>>>> Alexey.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On 14.04.15 18:01, Daniel Feist wrote:
> > >>>>>>>>>
> > >>>>>>>>> Very interesting. My previous tests had been with a simple
> > >>>>>>>>> inbound echo. When testing with a non-blocking proxy (1KB
> > >>>>>>>>> payload, 5ms target service latency),
> > >>>>>>>>> optimizedForMultiplexing=false appears to give better TPS and
> > >>>>>>>>> latency :-)
> > >>>>>>>>>
> > >>>>>>>>> Having some issues with the non-blocking proxy in general
> > >>>>>>>>> though: getting a decent number of errors, whereas in blocking
> > >>>>>>>>> mode I get zero. Is it possible that stale connections aren't
> > >>>>>>>>> handled in the same way, or is there something else that might
> > >>>>>>>>> be causing this? I'll do some more digging around, but what I'm
> > >>>>>>>>> seeing right now is 0.05% of jmeter client requests timing out
> > >>>>>>>>> after 60s.
> > >>>>>>>>>
> > >>>>>>>>> Dan
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Tue, Apr 14, 2015 at 9:25 PM, Oleksiy Stashok
> > >>>>>>>>> <oleksiy.stashok_at_oracle.com> wrote:
> > >>>>>>>>>> Hi Dan,
> > >>>>>>>>>>
> > >>>>>>>>>> yeah, there is no silver bullet solution for all kinds of use
> > >>>>>>>>>> cases.
> > >>>>>>>>>> optimizedForMultiplexing is useful for concurrent writes,
> > >>>>>>>>>> because outbound messages are always added to the queue and
> > >>>>>>>>>> written from the selector/NIO thread, and at write time
> > >>>>>>>>>> Grizzly packs all the available outbound messages (up to some
> > >>>>>>>>>> limit) and sends them as one chunk, which reduces the number
> > >>>>>>>>>> of I/O operations. When optimizedForMultiplexing is disabled
> > >>>>>>>>>> (the default), Grizzly (if the output queue is empty) first
> > >>>>>>>>>> tries to send the outbound message right away in the same
> > >>>>>>>>>> thread.
> > >>>>>>>>>> So I'd say when optimizedForMultiplexing is disabled we
> > >>>>>>>>>> potentially reduce latency, and when it's enabled we increase
> > >>>>>>>>>> throughput. But that's a very simple way to look at this
> > >>>>>>>>>> config parameter; I bet in practice you can experience the
> > >>>>>>>>>> opposite :))
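> > >>>>>>>>>>
> > >>>>>>>>>> For reference, it can be toggled on the transport builder;
> > >>>>>>>>>> something like this (a sketch - the setter name is from
> > >>>>>>>>>> memory, so double-check it):
> > >>>>>>>>>>
> > >>>>>>>>>>     final TCPNIOTransport transport = TCPNIOTransportBuilder
> > >>>>>>>>>>             .newInstance()
> > >>>>>>>>>>             // true: queue and pack writes; false (default):
> > >>>>>>>>>>             // try a direct write in the calling thread first.
> > >>>>>>>>>>             .setOptimizedForMultiplexing(true)
> > >>>>>>>>>>             .build();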
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks.
> > >>>>>>>>>>
> > >>>>>>>>>> WBR,
> > >>>>>>>>>> Alexey.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On 13.04.15 23:40, Daniel Feist wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> Interestingly, I saw a performance improvement using
> > >>>>>>>>>> optimizedForMultiplexing with HTTP, although this potentially
> > >>>>>>>>>> only affected my specific test scenario (a simple low-latency
> > >>>>>>>>>> echo). Also note that this was when using worker threads, so
> > >>>>>>>>>> not straight through using selectors.
> > >>>>>>>>>>
> > >>>>>>>>>> Let me turn off optimizedForMultiplexing, give inbound 1
> > >>>>>>>>>> selector
> > >>>>>>>>>> per
> > >>>>>>>>>> core, outbound 1 selector per core and see how this runs...
> > >>>>>>>>>>
> > >>>>>>>>>> Dan
> > >>>>>>>>>>
> > >>>>>>>>>> On Mon, Apr 13, 2015 at 11:44 PM, Oleksiy Stashok
> > >>>>>>>>>> <oleksiy.stashok_at_oracle.com> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>> - Even if the same selector pool is configured for inbound
> > >>>>>>>>>>>> and outbound, during response processing Grizzly will still
> > >>>>>>>>>>>> do a thread handover before sending the response to the
> > >>>>>>>>>>>> client because of the use of AsyncQueueIO. Is this right?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Not sure I understand this; IMO there won't be any extra
> > >>>>>>>>>>>> thread handover involved.
> > >>>>>>>>>>>
> > >>>>>>>>>>> I was referring to the AsyncWriteQueue. Currently I have
> > >>>>>>>>>>> 'optimizedForMultiplexing' set to true, which I thought I'd
> > >>>>>>>>>>> seen previously disables the direct writing you described
> > >>>>>>>>>>> further on in your email. Perhaps I should try without this
> > >>>>>>>>>>> flag though.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Right, optimizedForMultiplexing is useful when you
> > >>>>>>>>>>> concurrently write packets to the connection, which is not
> > >>>>>>>>>>> the case with HTTP, unless it's HTTP 2.0 :)
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks.
> > >>>>>>>>>>>
> > >>>>>>>>>>> WBR,
> > >>>>>>>>>>> Alexey.
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >
--
...it was evening and it was morning and there were already two ways to store
Unicode...