2009/12/3 Oleksiy Stashok <Oleksiy.Stashok_at_sun.com>
> Hi Gustav,
>
> first of all, thank you for the work you've done on websockets branch!
> I'm really looking forward to seeing some websocket examples and blogs.
im not good at blogging, i need to learn it !. I also lost my blog
introduction mail i got long time ago.
Its sad svnsearch is down, would be nice to link to the code at least. it
contains docs and easy to follow code.
>
>
> Incoming dataframes concurrently trigger the websocketcontext listener's
>> onMessage method.
>> The incoming rate vs. the listener consumption rate is limited by a total
>> number of bytes, including data structure overhead, per socket; reads are
>> throttled if a configured limit is passed (read interest is disabled,
>> leaving the throttling to TCP, and TCP is good at it;
>> it's one of the hardest things to get right when writing an IO framework
>> over UDP).
>>
>> Reads are performed in selector thread too, further reducing the context
>> switches etc.
>>
> Are you using Grizzly selector threads, or does the websocket infrastructure
> re-register websocket channels to its own selector threads?
>
>
Yes. At the moment Grizzly is only used server side to receive connections;
I have not implemented a full standalone mode, since the primary target was to
integrate.
The SocketChannel is moved to the websocket's own selector threads, which
load balance new connections with round robin.
For testing purposes the number of selector threads is derived automatically:
cpu count / 2, with a minimum of 1. User-friendly configuration is not added
everywhere yet.
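Roughly, the assignment looks like this (a minimal sketch only; the class and
method names below are my illustration, not the actual branch code):

import java.nio.channels.SocketChannel;
import java.util.concurrent.atomic.AtomicInteger;

final class SelectorThreadBalancer {

    /** Stand-in for the real websocket selector thread; only registration matters here. */
    interface SelectorThreadHandle {
        void register(SocketChannel channel);
    }

    private final SelectorThreadHandle[] selectorThreads;
    private final AtomicInteger nextIndex = new AtomicInteger();

    SelectorThreadBalancer(SelectorThreadHandle[] selectorThreads) {
        this.selectorThreads = selectorThreads;
    }

    /** cpu count / 2, with a minimum of 1, as described above. */
    static int defaultSelectorThreadCount() {
        return Math.max(1, Runtime.getRuntime().availableProcessors() / 2);
    }

    /** Hands a newly accepted channel to the next selector thread, round robin. */
    void assign(SocketChannel channel) {
        int index = (nextIndex.getAndIncrement() & Integer.MAX_VALUE) % selectorThreads.length;
        selectorThreads[index].register(channel);
    }
}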
The total number of context switches and channel reads and writes is minimized
this way; there might be room for further improvements, but I never got any
feedback on the code.
If multiple small DataFrames need to be sent, nothing stops the caller from
putting many of them inside one ByteBuffer; there is a send method for that.
It's the same internal method that a DataFrame's internal ByteBuffer is sent
with.
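For illustration, packing several already-serialized frames for one send call
could look like this (the helper class is mine; only the idea of reusing the
single send(ByteBuffer) path comes from the branch):

import java.nio.ByteBuffer;

final class BatchedSendExample {
    /** Packs several already-serialized frames into one buffer for a single send call. */
    static ByteBuffer batchFrames(ByteBuffer... serializedFrames) {
        int total = 0;
        for (ByteBuffer frame : serializedFrames) {
            total += frame.remaining();
        }
        ByteBuffer combined = ByteBuffer.allocate(total);
        for (ByteBuffer frame : serializedFrames) {
            combined.put(frame);
        }
        combined.flip();
        // hand this to the same send(ByteBuffer) method a single DataFrame's buffer goes through
        return combined;
    }
}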
>
> The selector thread then finishes the writes, which is highly efficient because
>> there is no worker thread context switch per write,
>> and the natural time lag at high load causes writes to happen more in bulk,
>> increasing overall efficiency further; there is significantly less load on the
>> threadpool.
>> Round-robin load balancing among multiple selector threads keeps the
>> not-directly-writable IO work where it naturally belongs.
>> The threadpool is not overused and can handle its remaining work better.
>>
> The same here. As I understand you use own websocket selector-threads both
> for reads and writes?
>
Yes: all reads, and then the writes that don't fit in the socket's native buffer.
"All writes" is not 100% true; currently, if a writeQueue.poll returns null,
write interest is registered.
It could be changed to 100%, but the cost would be a worker thread doing
nothing for an undefined length of time, depending on when the producing thread
actually performs the writeQueue.add(rawdataframe.duplicate()) after the
bufferedSendBytes.addAndGet(tosend).
Further testing is needed to evaluate this design; it is however quite
efficient and scalable as is.
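A rough sketch of that write path (field and method names are illustrative, and
the producer/selector races are simplified compared to the real code):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.SocketChannel;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

final class WebSocketWriter {
    private final SocketChannel channel;
    private final SelectionKey key;
    private final Queue<ByteBuffer> writeQueue = new ConcurrentLinkedQueue<ByteBuffer>();
    private final AtomicInteger bufferedSendBytes = new AtomicInteger();

    WebSocketWriter(SocketChannel channel, SelectionKey key) {
        this.channel = channel;
        this.key = key;
    }

    /** Called from the producing thread with an already-serialized dataframe. */
    void send(ByteBuffer rawdataframe) throws IOException {
        if (bufferedSendBytes.get() == 0) {
            // Nothing queued yet: try to push the frame out directly.
            channel.write(rawdataframe);
        }
        int tosend = rawdataframe.remaining();
        if (tosend > 0) {
            // Did not fit in the socket's native buffer: account for the bytes first,
            // then publish the remainder, matching the ordering mentioned above.
            bufferedSendBytes.addAndGet(tosend);
            writeQueue.add(rawdataframe.duplicate());
            // Let the selector thread finish the write when the channel is writable.
            key.interestOps(key.interestOps() | SelectionKey.OP_WRITE);
            key.selector().wakeup();
        }
    }

    /** Called in the selector thread on OP_WRITE. */
    void flush() throws IOException {
        ByteBuffer frame;
        while ((frame = writeQueue.peek()) != null) {
            int before = frame.remaining();
            channel.write(frame);
            bufferedSendBytes.addAndGet(-(before - frame.remaining()));
            if (frame.hasRemaining()) {
                // Socket buffer is full again; keep OP_WRITE and retry on the next select.
                return;
            }
            writeQueue.poll();  // frame fully written, remove it
        }
        // Queue drained: drop write interest.
        key.interestOps(key.interestOps() & ~SelectionKey.OP_WRITE);
    }
}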
Whenever a dataframe is ready, the atomic int that tracks the total memory load
of outstanding (not yet handled by the listener) incoming dataframes is increased
by (byte length + 100).
The DataFrame is then sent to the corresponding websocketcontext's event listener
using the threadpool: the listener's onMessage method is called in the runnable,
and the consumer side of the incoming dataframe queue then updates the atomic int,
removing the same amount.
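In sketch form (the 100-byte per-frame overhead and the read-interest throttling
are from the mail; everything else, including the names, is illustrative):

import java.nio.channels.SelectionKey;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicInteger;

final class IncomingFrameThrottle {
    private static final int PER_FRAME_OVERHEAD = 100;

    private final AtomicInteger outstandingBytes = new AtomicInteger();
    private final int readThrottleLimit;
    private final Executor workerPool;

    IncomingFrameThrottle(int readThrottleLimit, Executor workerPool) {
        this.readThrottleLimit = readThrottleLimit;
        this.workerPool = workerPool;
    }

    /** Stand-in for the websocketcontext event listener mentioned above. */
    interface WebSocketListener {
        void onMessage(byte[] frame);
    }

    /** Called in the selector thread when a complete DataFrame has been parsed. */
    void onFrameParsed(final byte[] payload, final SelectionKey key,
                       final WebSocketListener listener) {
        final int cost = payload.length + PER_FRAME_OVERHEAD;
        if (outstandingBytes.addAndGet(cost) > readThrottleLimit) {
            // Stop reading; TCP flow control throttles the remote sender.
            key.interestOps(key.interestOps() & ~SelectionKey.OP_READ);
        }
        workerPool.execute(new Runnable() {
            public void run() {
                try {
                    listener.onMessage(payload);
                } finally {
                    if (outstandingBytes.addAndGet(-cost) <= readThrottleLimit) {
                        // Consumer caught up: re-enable reads.
                        key.interestOps(key.interestOps() | SelectionKey.OP_READ);
                        key.selector().wakeup();
                    }
                }
            }
        });
    }
}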
>
> Multiple design decisions allow me to safely use my poison-based fixed
>> pool (recently patched in trunk to remove unnecessary atomic statistics)
>> together with the now stable and highly improved LTQ.
>> The 1.12 LTQ in grizzly that I prematurely pulled from Doug's CVS early this
>> year is slow in comparison, just as the Sun perf team verified.
>> It was still a win vs. the CLQ for object caching; it's another matter that
>> both CLQ and LTQ are now vastly improved and what's optimal
>> has changed.
>>
> I remember you mentioned some logic, which detects the current JDK version
> and uses appropriate collection. Is it still the case?
Yes. The quality of some concurrent data structures like CLQ needs to be
ensured by us; the latest JDK 6 version still has data race problems, and its
performance problems were only fixed recently this year.
Java 6 can still hope for more backports; according to Martin Buchholz, a
JSR 166y backporter, it's unlikely for Java 5, as it's not part of his deal.
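The selection logic is roughly of this shape (the candidate class names and the
fallback below are my illustration; the actual code may also key on the exact
JDK 6 update):

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

final class QueueFactory {
    // Candidate LinkedTransferQueue implementations, tried in order.
    private static final String[] LTQ_CANDIDATES = {
        "java.util.concurrent.LinkedTransferQueue", // JDK 7+
        "jsr166y.LinkedTransferQueue"               // jsr166y backport jar, if on the classpath
    };

    @SuppressWarnings("unchecked")
    static <E> Queue<E> createQueue() {
        for (String className : LTQ_CANDIDATES) {
            try {
                return (Queue<E>) Class.forName(className).newInstance();
            } catch (Exception notAvailable) {
                // fall through and try the next candidate
            }
        }
        // Nothing better found: fall back to ConcurrentLinkedQueue.
        return new ConcurrentLinkedQueue<E>();
    }
}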
>
> The threadpool is only one of several factors, but it alone is interesting
>> to compare.
>> Performance and scalability are lowered by 2-digit to rather high 3-digit
>> percentages when I use the grizzly syncedpool, in
>> 3+ minute load tests, each using 64 to 1024 sockets and DataFrame sizes
>> from 2 to 16384 bytes, on a Core i7.
>>
> IMO, it's not fair to compare those 2. SyncThreadPool is mostly used for
> GF, because it supports min/max threads and is more predictable in its
> behavior. Otherwise nobody forces people to use SyncThreadPool.
>
True, but GlassFish is what stuff needs to work in if real-world penetration
is wanted and Sun is to benefit from new functionality.
I think the recent advances in alternatives to the synced concept need to
be compared, and the cost of load-dependent automatic runtime pool resizing
must be evaluated: whether it's worth it, and if so, in what context it's truly
needed.
> In grizzly 1.9 or 2.0 there is, per design, just too much _enforced_
>> overhead in the form of
>> 1. too many context switches,
>>
> Can you pls. provide details, which switches we can avoid?
>
Channel reads,
and the writes when the socket's native buffer is full or likely to be full,
depending on how it's implemented.
>
> 2. 1. are extra costly due to the syncedpool design.
>>
> syncthreadpool is not a design, it's an implementation of the ThreadPool and
> ExecutorService interface. Using the same interfaces (which are part of the
> design), you can build an async implementation. So I'd say *design* is not the
> correct word here.
>
>
> the cost for a load-adapted pool size etc. is _extreme_ when compared to
>> recently available alternatives like LTQ, but in order to be able to take
>> advantage of that, old ways of thinking must be abandoned.
>> That requires people who don't react with fear but instead become
>> inspired...
>>
> LTQ is available since 1.6, Grizzly 1.9.x is JDK 1.5 compatible.
> Personally I'm very inspired with the new LTQ, but in practice customers
> will use JDK 1.5 for 5+ more years and will be happy.
>
>
Yes, people can be happy.
But if concurrency that scales truly well is wanted, newer concepts than
what 1.5 offers seem to be needed.
It is of course dependent on the context: how short the jobs are and the
overall design affect how pronounced the efficiency of the threadpool and other
concurrent queues is.
With 1K websockets ping-ponging chat-sized dataframes (around 100 bytes each)
for more than enough time for the system to stabilize,
the difference in system throughput on my Core i7 is several hundred percent.
It is almost full CPU load versus roughly 20% load, because threads pile up at
a certain place.
Doug Lea deserves some recognition for his great work, imo.
> 3. overall design is indirectly leading to more channel reads and writes
>> than needed. very roughly 30K+ cycles overhead per call, plus we risk
>> fragmenting tcp packets more than needed.
>>
> Can you pls. provide more details. Will appreciate your help in improvement
> of those areas.
>
>
This is more something I try to think about myself, and I expressed myself
poorly when I made it sound like one of the major problems.
The other effects discussed above are of more interest and are more
controllable.
I have a 10-line text about this topic to paste in, but there is not much point
since it's just too hard to control.
> When I want to talk about the possibility of allowing services like websocket
>> in grizzly 2.0 not to be forced into a
>> low-scaling, high-context-switching design, the answer is short: it's
>> not possible to make changes.
>>
> You make such a statement even without looking into 2.0, where for each I/O
> operation you can choose whether you want to process it in the same or worker
> thread.
> Regarding changes... when you were told, that it's not possible to make
> changes? If I remember correctly I was talking, that we're not planning
> *big* *design* changes, but all reasonable improvements are welcome.
>
>
I admit that I have not followed 2.0 so much; it has been too uncertain whether
the bird will ever fly.
It's 1.x that is used in Sun products in the real world, and that is naturally
the primary focus.
The HTTP stack needs an overhaul etc., to implement or prepare for http5, which
brings a lot of goodies but also increases the implementation complexity that
needs to be handled efficiently.
Porting of existing features from 1.x is not completed, yet the official status
gives a general impression of being close to release?
Please correct me if the matters pointed out above are still not valid for
2.0:
if a service needs multiple reads or (remaining) writes to happen before it
needs a logic state change, those IO operations should be possible to do in
one of the many selector threads, with the triggered logic then run optionally,
but likely per default, in worker threads.
If not, there is enforced context switching and pressure on the threadpool:
costs that are a mismatch with the actual logic needs in many cases.
It is the specific needs of each service or protocol that should decide how, or
whether, expensive operations trigger.
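Something along these lines is what I mean, purely as an illustration and not an
existing 2.0 interface:

// Purely illustrative per-service hook, not an existing Grizzly 2.0 API.
interface IoEventPolicy {
    /** Should follow-up reads/remaining writes for this connection stay in the selector thread? */
    boolean completeIoInSelectorThread();

    /** Should the resulting protocol/application logic be dispatched to a worker thread? */
    boolean dispatchLogicToWorkerThread();
}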
Imo it is real-life applicable performance and scalability that matter: how
things are when integrated into a future GlassFish and other major products,
and not what's possible in theory in a standalone framework.
Any core changes should preferably be discussed and reviewed by a team or so,
and not just more or less committed based on one person's findings.
It would be nice to have regression testing along with normal unit testing,
preferably automated overnight on the multitude of platforms and JDK versions
that Sun intends to commercially support.
Currently too many of the problems or regressions are found not only the hard
way but the embarrassing way: by external people or teams.
> Well there is a general feeling of dead horse, no weekly project meetings
>> anymore etc.
>>
> Agree, this should be changed. Having finished our work on GFv3, we will
> have more time to work on Grizzly, so we will have info to share on
> meetings.
>
> Meetings! That's great news.
>
> How are people supposed to work together in a team and still produce good
>> stuff?
>>
> We still have the mailing list, where we try to answer questions as quickly
> as possible.
>
> Yes, but internal dev talk is handled in private chats or emails; no group
talk, it seems?
>
> I would gladly change my design or anything when I'm proven wrong or
>> convinced by sound technical reasoning,
>>
> Ok.
>
>
> Neither I nor anyone else can, as a single person, drive the project and
>> make it into something that would be competitive and not
>> only "working".
>>
> I really appreciate the work you do!
>
Thanks!
I have tried to discuss these matters in private chats etc.; it should not take
public "whining" from sad and frustrated devs to get attention. I don't know if
the sudden interest is real or more about damage control.
--
regards
gustav trede