users@websocket-spec.java.net

[jsr356-users] MessageHandler.Async(Text) and split UTF8 unicode bytes

From: Joakim Erdfelt <joakim_at_intalio.com>
Date: Wed, 14 Nov 2012 12:35:35 -0700

Hit a snag implementing support in Jetty for MessageHandler.Async(Text).
I realize that MessageHandler.Async has changed in SVN, but I fear that a
fundamental aspect of RFC 6455 TEXT message framing has been missed (at
least in the documentation).

Namely, it is unclear what partialMessage is expected to contain in the
onMessage(T partialMessage, boolean last) method.
If the partialMessage type is anything but an array of bytes, for example a
String for TEXT messages, then we have a problem with what to do with
multi-byte UTF-8 code points that are fragmented within the code point.
If we assume that this is called for frames of type TEXT (opcode 0x1) then
the server implementations need to hold onto unprocessed multi-byte UTF-8
code points within that frame to prepend onto the next frame (if it
arrives).
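
To make the "hold onto and prepend" idea concrete, here is a minimal sketch
of how an implementation might do it with java.nio's CharsetDecoder. The
class name Utf8FragmentDecoder and its decode(byte[], boolean) method are
made up for illustration; they are not part of JSR 356 or Jetty. The sketch
assumes each frame payload arrives as a complete byte array and relies on
the decoder leaving an incomplete trailing sequence unconsumed while
endOfInput is false.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of JSR 356 or Jetty.
final class Utf8FragmentDecoder {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);

    // Bytes of an incomplete code point held back from the previous frame.
    private byte[] carry = new byte[0];

    // Decodes one frame payload; 'last' mirrors the FIN bit of the frame.
    String decode(byte[] payload, boolean last) {
        ByteBuffer in = ByteBuffer.allocate(carry.length + payload.length);
        in.put(carry);
        in.put(payload);
        in.flip();

        // UTF-8 never yields more chars than it has input bytes.
        CharBuffer out = CharBuffer.allocate(in.remaining());
        CoderResult result = decoder.decode(in, out, last);
        if (result.isError()) {
            decoder.reset();
            carry = new byte[0];
            throw new IllegalArgumentException("Invalid UTF-8 in TEXT message: " + result);
        }
        if (last) {
            decoder.flush(out);
            decoder.reset(); // ready for the next message
        }

        // Whatever the decoder did not consume is an incomplete code point.
        carry = new byte[in.remaining()];
        in.get(carry);

        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        Utf8FragmentDecoder d = new Utf8FragmentDecoder();
        // Same payloads as the two frames in the example in this mail.
        byte[] buf1 = {(byte) 0xce, (byte) 0xba, (byte) 0xe1};
        byte[] buf2 = {(byte) 0xbd, (byte) 0xb9, (byte) 0xcf,
                       (byte) 0x83, (byte) 0xce, (byte) 0xbc};
        String part1 = d.decode(buf1, false); // one char; 0xe1 is held back
        String part2 = d.decode(buf2, true);  // three chars, starting with U+1F79
        System.out.printf("part1 length: %d%n", part1.length());
        System.out.printf("part2 length: %d%n", part2.length());
    }
}
```

With something like this, a String-based partialMessage never contains a
broken code point, at the cost of the implementation buffering up to three
bytes between frames.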

Example:

        // Frame 1 payload (TEXT, FIN=false)
        byte buf1[] = new byte[] {(byte)0xce, (byte)0xba, (byte)0xe1 };
        // Frame 2 payload (CONTINUATION, FIN=true)
        byte buf2[] = new byte[] {(byte)0xbd, (byte)0xb9, (byte)0xcf,
                (byte)0x83, (byte)0xce, (byte)0xbc};
        String part1 = new String(buf1, Charset.forName("UTF-8"));
        String part2 = new String(buf2, Charset.forName("UTF-8"));

        System.out.printf("part1: %s%n", part1);
        System.out.printf("part2: %s%n", part2);

        // As a Message
        byte bufMsg[] = new byte[buf1.length + buf2.length];
        System.arraycopy(buf1,0,bufMsg,0,buf1.length);
        System.arraycopy(buf2,0,bufMsg,buf1.length,buf2.length);
        String msg = new String(bufMsg, Charset.forName("UTF-8"));
        System.out.printf("msg : %s%n", msg);

The 3rd byte of buf1 (0xe1) is the start of the 2nd code point, but by
itself it is invalid UTF-8.

This is allowed per RFC 6455.
The UTF-8 bytes that make up a TEXT message are only required to be valid
when viewed as a whole message, not as an individual fragment.

I would suggest that the signature for onMessage(T partialMessage, boolean
last) is insufficient for this case.
At the very least, the partialMessage parameter would always need to be a
ByteBuffer (or similar).
However, then the single method approach of the new interface becomes
insufficient to determine TEXT vs BINARY fragments.
That would lead us back to ...
MessageHandler.Async.Text
   onTextFragment(ByteBuffer partialMessage, boolean last);
MessageHandler.Async.Binary
   onBinaryFragment(ByteBuffer partialMessage, boolean last);

--
Joakim Erdfelt <joakim_at_intalio.com>