Hello,
I have experimented in the websocket branch with free-standing designs, to see
what performance and scalability I could attain while still keeping in/out
data flows under control.
Simply by following the existing natural data flows and letting data be
transferred in bulk as a side effect of an overall efficient design, there is
much to gain.
By very cheaply throttling, limiting queued IO per websocket in a
configurable way, the problems can be dealt with at the appropriate level.
Incoming dataframes concurrently trigger the websocketcontext listener's
onMessage method.
The incoming rate vs. the listener consumption rate is limited to a total
number of bytes, including datastructure overhead, per socket. Reads are
throttled if a configured limit is passed: read interest is disabled,
leaving throttling to TCP, and TCP is good at it (it's one of the hardest
things to get right when writing an IO framework over UDP).
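A minimal sketch of that inbound throttling idea, assuming plain NIO; the
names (InboundThrottle, INBOUND_LIMIT) are mine, not the branch code:

import java.nio.channels.SelectionKey;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative per-socket inbound accounting, not the actual branch code.
final class InboundThrottle {
    private static final int INBOUND_LIMIT = 64 * 1024; // configurable per socket
    private final AtomicInteger queuedBytes = new AtomicInteger();

    // Called in the selector thread after a read, before dispatching onMessage.
    void onFrameRead(SelectionKey key, int frameBytes) {
        if (queuedBytes.addAndGet(frameBytes) > INBOUND_LIMIT) {
            // Stop reading: the kernel receive buffer fills, the TCP window
            // closes, and the peer is throttled by TCP itself.
            key.interestOps(key.interestOps() & ~SelectionKey.OP_READ);
        }
    }

    // Called when the listener has consumed a frame.
    void onFrameConsumed(SelectionKey key, int frameBytes) {
        if (queuedBytes.addAndGet(-frameBytes) <= INBOUND_LIMIT) {
            key.interestOps(key.interestOps() | SelectionKey.OP_READ); // resume reads
            key.selector().wakeup();
        }
    }
}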
Reads are performed in the selector thread too, further reducing context
switches etc.
WebSocket.send allows DataFrames to be sent concurrently.
When a websocketcontext listener representing a chat receives an onMessage
event, it can loop over the target sockets and send using the same DataFrame
and the same internal bytebuffer; data is never copied.
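A sketch of that broadcast pattern; the WebSocket and DataFrame types here
are stand-ins for the branch's own API. The point is that every socket gets
its own view over the same backing bytes:

import java.nio.ByteBuffer;
import java.util.Collection;

// Illustrative stand-ins for the branch's WebSocket/DataFrame types.
interface WebSocket { void send(DataFrame frame); }

final class DataFrame {
    private final ByteBuffer payload; // one backing buffer, shared by all sends
    DataFrame(ByteBuffer payload) { this.payload = payload; }
    // Each caller gets an independent position/limit over the same bytes.
    ByteBuffer sharedView() { return payload.duplicate(); }
}

final class ChatListener {
    // Fan a received frame out to every member without copying the payload.
    void onMessage(DataFrame frame, Collection<WebSocket> members) {
        for (WebSocket ws : members) {
            ws.send(frame); // send() slices from frame.sharedView() internally
        }
    }
}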
In all send calls, data that is immediately writable is written; the rest is
sliced from the dataframe's internal bytebuffer and put on a per-socket queue.
Outstanding sends (data that has not reached the socket's native buffer) are
limited too. The websocket non-formal spec docs recommend closing the
connection when the limit is passed, so that is what is done for now.
To efficiently deal with real-world data flows in an easy way, the
websocketimpl send method does the following:
The amount of data to be sent is added to a per-socket atomic integer.
If the new value is > the queue limit, we close.
If the previous value was > 0, there is already queued send data.
If the previous value was 0, there is no concurrent write and all data in
non-native layers is flushed.
It is then safe to write to the socket channel.
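As a sketch of that bookkeeping (names are mine; getAndAdd is used so the
pre-add value drives the decision):

import java.nio.ByteBuffer;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative per-socket send accounting, not the actual websocketimpl code.
final class SendAccounting {
    static final int QUEUE_LIMIT = 256 * 1024; // configurable
    final AtomicInteger outstandingBytes = new AtomicInteger();
    final Queue<ByteBuffer> sendQueue = new ConcurrentLinkedQueue<ByteBuffer>();

    void send(ByteBuffer data) {
        final int size = data.remaining();
        final int before = outstandingBytes.getAndAdd(size);
        if (before + size > QUEUE_LIMIT) {
            close();                // spec docs recommend closing at this point
        } else if (before > 0) {
            sendQueue.offer(data);  // a write is already in flight; just queue
        } else {
            writeDirect(data);      // no concurrent write; safe to hit the channel
        }
    }

    void writeDirect(ByteBuffer data) { /* channel.write, then the drain below */ }
    void close() { /* close the websocket */ }
}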
When the first websocket.send has completed its channel.write, and IF there
is no buffered data in SSL etc., it checks the atomic integer to detect
whether there was any concurrent send call. If so, it grabs that data from
the queue, if it is immediately available, and writes it until the queue is
drained or the native socket stops accepting.
If data still remains to write, write interest is registered.
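Continuing the same hypothetical class, the completion side could look like
this: the counter tells the writer whether concurrent sends queued more data,
and it drains until the channel refuses:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.SocketChannel;

// Illustrative drain path for the SendAccounting sketch above.
void onWriteCompleted(SocketChannel channel, SelectionKey key,
                      int bytesWritten) throws IOException {
    int remaining = outstandingBytes.addAndGet(-bytesWritten);
    while (remaining > 0) {
        ByteBuffer next = sendQueue.peek();
        if (next == null) {
            return; // concurrent send counted but not yet enqueued;
                    // a full implementation retries or hands off here
        }
        remaining = outstandingBytes.addAndGet(-channel.write(next));
        if (next.hasRemaining()) {
            // Native socket buffer is full: register write interest and let
            // the selector thread finish the remainder.
            key.interestOps(key.interestOps() | SelectionKey.OP_WRITE);
            return;
        }
        sendQueue.poll(); // fully written, drop it
    }
}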
The selector thread then finishes the writes. That is highly efficient
because there is no worker-thread context switch per write, and the natural
time lag at high load causes writes to happen more in bulk, increasing
overall efficiency further; there is significantly less load on the
threadpool.
Round-robin load balancing among multiple selector threads keeps the
not-directly-writable IO work where it naturally belongs.
The threadpool is not overused and can handle its remaining work better.
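The balancing itself can be as cheap as an atomic counter; a minimal sketch,
with names of my choosing:

import java.nio.channels.Selector;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative round-robin handoff of new connections to selector threads.
final class SelectorBalancer {
    private final Selector[] selectors;
    private final AtomicInteger next = new AtomicInteger();

    SelectorBalancer(Selector[] selectors) { this.selectors = selectors; }

    // Pick the selector thread that will own all IO for a new connection.
    Selector nextSelector() {
        return selectors[(next.getAndIncrement() & Integer.MAX_VALUE) % selectors.length];
    }
}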
Multiple design decisions allow me to safely use my poison-based fixed pool
(recently patched in trunk to remove unnecessary atomic statistics) with the
now stable and highly improved LTQ.
The 1.12 LTQ in grizzly, which I prematurely pulled from Doug's CVS early
this year, is slow in comparison, just like the Sun perf team verified.
It was still a win vs. the CLQ for object caching; it's another matter that
both CLQ and LTQ are now vastly improved and what's optimal has changed.
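For readers who have not seen the pattern, a poison-based fixed pool can be
as simple as the following sketch (illustrative only, not the patched trunk
code): workers block on take(), and one poison task per worker ends them at
shutdown, so no per-task shutdown flag or statistics are needed:

import java.util.concurrent.LinkedTransferQueue;

// Illustrative poison-pill fixed pool on LTQ, not the actual trunk code.
final class PoisonFixedPool {
    private static final Runnable POISON = new Runnable() { public void run() { } };
    private final LinkedTransferQueue<Runnable> queue = new LinkedTransferQueue<Runnable>();
    private final int workerCount;

    PoisonFixedPool(int workerCount) {
        this.workerCount = workerCount;
        for (int i = 0; i < workerCount; i++) {
            new Thread(new Runnable() {
                public void run() { workerLoop(); }
            }, "pool-worker-" + i).start();
        }
    }

    void execute(Runnable task) { queue.put(task); }

    // One poison per worker drains the pool; no per-task shutdown check needed.
    void shutdown() {
        for (int i = 0; i < workerCount; i++) queue.put(POISON);
    }

    private void workerLoop() {
        try {
            for (Runnable task; (task = queue.take()) != POISON; ) {
                task.run();
            }
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
    }
}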
The threadpool is only one of several factors, but it alone is interesting to
compare.
Performance and scalability drop by two-digit to rather high three-digit
percentages when I use the grizzly syncedpool instead, in 3+ minute load
tests each using 64 to 1024 sockets and DataFrame sizes from 2 to 16384
bytes on a Core i7.
When talking about glassfish integration and having to use what is available
in grizzly, the various detrimental effects are not circumventable, because
it is not viable to have an entire extra IO framework, doubling configuration
needs and overall complexity.
In grizzly 1.9 or 2.0 there is per design just too much _enforced_ overhead
in the form of:
1. too many context switches,
2. the switches in 1. are extra costly due to the syncedpool design. The cost
of load-adapted pool size etc. is _extreme_ compared to recently available
alternatives like LTQ, but in order to take advantage of those, old ways of
thinking must be abandoned.
That requires people who don't react with fear but instead become
inspired...
3. the overall design indirectly leads to more channel reads and writes than
needed: very roughly 30K+ cycles of overhead per call, plus we risk
fragmenting TCP packets more than needed.
I would happily integrate into 1.9 anyhow, just to let GF v3.x have
websocket.
My thought was that things would be good in grizzly 2.0.
It would be nice if there were different layers to hook into, so each
service could get what it needs and nothing more.
When I want to talk about the possibility of letting services like websocket
in grizzly 2.0 not be forced into a low-scaling, high-context-switching
design, the answer is short: it is not possible to make changes.
The only argument is the project's inflated paper status; there is no
technical reason or discussion.
Well, there is a general feeling of a dead horse: no weekly project meetings
anymore, etc.
How are people supposed to work together in a team and still produce good
stuff?
I would gladly change my design, or anything else, when I am proved wrong or
convinced by sound technical reasoning; then I would learn something and it
is a win-win situation.
Neither I nor anyone else can, as a single person, drive the project and
make it into something that is competitive and not merely "working".
But when the current situation is what it is, and there is a lack of
interest and resources to do what it takes to get a competitive product, it
is very hard for me to find motivation to integrate at all.
To me, the reward, the "paycheck" for work, is the feeling of having done a
good job: one that performs and scales not just well, but as well as
possible given reasonable circumstances.
--
regards
gustav trede