users@websocket-spec.java.net

[jsr356-users] Re: [jsr356-experts] Re: MessageHandler.Async(Text) and split UTF8 unicode bytes

From: Joakim Erdfelt <joakim_at_intalio.com>
Date: Wed, 14 Nov 2012 13:00:42 -0700

On Wed, Nov 14, 2012 at 12:48 PM, Scott Ferguson <ferg_at_caucho.com> wrote:

> On 11/14/12 11:35 AM, Joakim Erdfelt wrote:
>
> Hit a snag implementing support in Jetty for MessageHandler.Async(Text).
> I realize that the MessageHandler.Async has changed in SVN, but this
> fundamental concept of RFC 6455 and TEXT message framing I fear has been
> missed (at least in documentation).
>
> Namely, what partialMessage is expected to mean in the
> onMessage(T partialMessage, boolean last) method.
> If the partialMessage type is anything but an array of bytes, for example
> a String for TEXT messages, then we have a problem with what to do with
> multi-byte UTF-8 code points that are fragmented within the code point.
> If we assume that this is called for frames of type TEXT (opcode 0x1) then
> the server implementations need to hold onto unprocessed multi-byte UTF-8
> code points within that frame to prepend onto the next frame (if it
> arrives).
>
>
> Yes, but why is that a problem?
>
> The partialMessage has no necessary relationship to the fragmentation.
> When the next fragment arrives, the char can be completed and it becomes
> the first char of the next partialMessage.
>
>

The problem isn't technical; Jetty already supports this carry-over
with a custom UTF-8 decoder/validator.
But this behavior of the API needs to be documented, both for implementors
and eventually users of the API.

Implementor Concern:
What if the fragment received results in a zero length string?
Should the implementors still call MessageHandler.onMessage(String
partialMessage, boolean last) with an empty String?
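
To make the carry-over and the zero-length case concrete, here is a minimal
sketch built on the JDK's CharsetDecoder. It is not Jetty's actual decoder;
the class name FragmentTextDecoder and its decode method are made up for
illustration. Incomplete trailing bytes are held back and prepended to the
next fragment, so a fragment that only continues a code point decodes to "".

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not Jetty's implementation. */
final class FragmentTextDecoder {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);

    // Bytes of an incomplete code point left over from the previous fragment.
    private ByteBuffer carry = ByteBuffer.allocate(0);

    /** Decodes one TEXT/CONTINUATION payload; 'last' mirrors the FIN bit. */
    String decode(byte[] payload, boolean last) {
        // Prepend the carried-over bytes to the new payload.
        ByteBuffer in = ByteBuffer.allocate(carry.remaining() + payload.length);
        in.put(carry).put(payload);
        in.flip();

        // UTF-8 never produces more chars than it consumes bytes.
        CharBuffer out = CharBuffer.allocate(in.remaining());
        CoderResult result = decoder.decode(in, out, last);
        if (!result.isError() && last) {
            result = decoder.flush(out);
        }
        if (result.isError()) {
            // A real endpoint would fail the connection (close code 1007).
            throw new IllegalArgumentException("invalid UTF-8: " + result);
        }
        if (last) {
            decoder.reset(); // ready for the next message
        }

        // Whatever the decoder did not consume is an incomplete code point.
        carry = ByteBuffer.allocate(in.remaining());
        carry.put(in);
        carry.flip();

        out.flip();
        return out.toString();
    }
}
```

Feeding the Euro sign (bytes E2 82 AC) split across three fragments, with
an 'a' before it and a 'b' after it, yields "a", then "", then the Euro sign
followed by "b". Whether that middle empty String should reach onMessage is
exactly the open question.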

User Concern:
The documentation should also include text aimed at users of this
interface, explaining that for String types some bytes might be carried
over to the next call of onMessage(String partialMessage, boolean last)
as a result of UTF-8 code points being split by fragmentation.
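
From the user side, a handler written along these lines would be unaffected
by where the partial boundaries fall. This is only a sketch:
PartialTextHandler is a hypothetical stand-in for the draft callback, not
the actual JSR 356 interface, and AccumulatingHandler is likewise made up.
It assumes an empty partialMessage may be delivered and treats it as a no-op.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the draft callback; not the JSR 356 interface. */
interface PartialTextHandler {
    void onMessage(String partialMessage, boolean last);
}

/** Collects partial strings and hands out whole messages only. */
final class AccumulatingHandler implements PartialTextHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> completed = new ArrayList<>();

    @Override
    public void onMessage(String partialMessage, boolean last) {
        // Appending an empty String is harmless, so a zero-length
        // partial needs no special handling here.
        buffer.append(partialMessage);
        if (last) {
            completed.add(buffer.toString());
            buffer.setLength(0); // start fresh for the next message
        }
    }

    List<String> completedMessages() {
        return completed;
    }
}
```

The point for the documentation is that the String passed in one call need
not correspond to the bytes of one frame, so user code should not infer
frame boundaries from it.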

- Joakim