jsr356-experts@websocket-spec.java.net

[jsr356-experts] Re: [jsr356-users] MessageHandler.Async(Text) and split UTF8 unicode bytes

From: Scott Ferguson <ferg_at_caucho.com>
Date: Wed, 14 Nov 2012 11:48:41 -0800

On 11/14/12 11:35 AM, Joakim Erdfelt wrote:
> Hit a snag implementing support in Jetty for MessageHandler.Async(Text).
> I realize that the MessageHandler.Async has changed in SVN, but this
> fundamental concept of RFC 6455 and TEXT message framing I fear has
> been missed (at least in documentation).
>
> Namely, the expectation of what partialMessage means in the
> onMessage(T partialMessage, boolean last) method.
> If the partialMessage type is anything but an array of bytes, for
> example a String for TEXT messages, then we have a problem with what
> to do with multi-byte UTF-8 code points that are fragmented within the
> code point.
> If we assume that this is called for frames of type TEXT (opcode 0x1)
> then the server implementations need to hold onto unprocessed
> multi-byte UTF-8 code points within that frame to prepend onto the
> next frame (if it arrives).

Yes, but why is that a problem?

The partialMessage has no necessary relationship to the fragmentation.
When the next fragment arrives, the char can be completed and it becomes
the first char of the next partialMessage.
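
A minimal sketch of that carry-over, assuming the implementation decodes each frame payload with a java.nio CharsetDecoder. The class name PartialTextDecoder and its decode(payload, fin) method are hypothetical, not part of JSR 356 or Jetty. Bytes of an incomplete trailing code point are held back and prepended to the next frame, so every String handed to onMessage contains only whole characters:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not an API of JSR 356 or Jetty.
final class PartialTextDecoder {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);

    // Undecoded tail of the previous frame (an incomplete code point), if any.
    private ByteBuffer pending = ByteBuffer.allocate(0);

    // Decodes one frame payload; fin is the FIN bit of that frame.
    String decode(byte[] payload, boolean fin) {
        // Prepend the bytes held back from the previous frame.
        ByteBuffer in = ByteBuffer.allocate(pending.remaining() + payload.length);
        in.put(pending).put(payload);
        in.flip();

        // UTF-8 never yields more chars than bytes, so this is large enough.
        CharBuffer out = CharBuffer.allocate(in.remaining() + 1);

        // With endOfInput=false an incomplete trailing sequence is not an
        // error: the decoder stops in front of it and leaves it in 'in'.
        CoderResult result = decoder.decode(in, out, fin);
        if (result.isError()) {
            decoder.reset();
            pending = ByteBuffer.allocate(0);
            throw new IllegalArgumentException("invalid UTF-8 in TEXT message: " + result);
        }
        if (fin) {
            decoder.flush(out);
            decoder.reset(); // ready for the next message
        }

        // Keep whatever was not consumed for the next frame.
        pending = ByteBuffer.allocate(in.remaining());
        pending.put(in);
        pending.flip();

        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        PartialTextDecoder d = new PartialTextDecoder();
        byte[] buf1 = {(byte) 0xce, (byte) 0xba, (byte) 0xe1};
        byte[] buf2 = {(byte) 0xbd, (byte) 0xb9, (byte) 0xcf,
                       (byte) 0x83, (byte) 0xce, (byte) 0xbc};
        System.out.printf("part1: %s%n", d.decode(buf1, false));
        System.out.printf("part2: %s%n", d.decode(buf2, true));
    }
}
```

With the two frames from the example quoted below, the first call returns only the first character and the held-back 0xe1 is completed by the second frame. A message that ends (FIN=true) in the middle of a code point is reported as an error in this sketch.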

-- Scott

>
> Example:
>
> // Frame 1 payload (TEXT, FIN=false)
> byte buf1[] = new byte[] {(byte)0xce, (byte)0xba, (byte)0xe1 };
> // Frame 2 payload (CONTINUATION, FIN=true)
> byte buf2[] = new byte[] {(byte)0xbd, (byte)0xb9, (byte)0xcf,
> (byte)0x83, (byte)0xce, (byte)0xbc};
> String part1 = new String(buf1, Charset.forName("UTF-8"));
> String part2 = new String(buf2, Charset.forName("UTF-8"));
> System.out.printf("part1: %s%n", part1);
> System.out.printf("part2: %s%n", part2);
> // As a Message
> byte bufMsg[] = new byte[buf1.length + buf2.length];
> System.arraycopy(buf1,0,bufMsg,0,buf1.length);
> System.arraycopy(buf2,0,bufMsg,buf1.length,buf2.length);
> String msg = new String(bufMsg, Charset.forName("UTF-8"));
> System.out.printf("msg : %s%n", msg);
>
> The 3rd byte of buf1 (0xe1) is the start of the 2nd code point, but by
> itself is invalid.
>
> This is allowed per RFC 6455.
> The UTF-8 bytes that make up a TEXT message are valid only when viewed
> as a message, not as a fragment.
>
> I would suggest that the signature for onMessage(T partialMessage,
> boolean last) is insufficient for this case.
> At least the partialMessage parameter needs to always be ByteBuffer
> (or similar).
> However, then the single method approach of the new interface becomes
> insufficient to determine TEXT vs BINARY fragments.
> That would lead us back to ...
> MessageHandler.Async.Text
> onTextFragment(ByteBuffer partialMessage, boolean last);
> MessageHandler.Async.Binary
> onBinaryFragment(ByteBuffer partialMessage, boolean last);
>
> --
> Joakim Erdfelt <joakim_at_intalio.com>
>