Re: FI for microjava

From: Thomas Skjølberg <thomas.skjolberg_at_iet.ntnu.no>
Date: Fri, 23 Sep 2005 14:32:06 +0200

Hi,

On Fri, 23 Sep 2005 11:27:31 +0200, Paul Sandoz <Paul.Sandoz_at_Sun.COM>
wrote:

> Hi Thomas,
>
> Thomas Skjølberg wrote:
>> Hi,
>> this is actually a really delayed response, but I've been busy.
>
> I fully empathize!
>
>
>> First of all, I got FIME working fine (great!), my problems were due
>> to different implementations of the same interfaces => _use_ the Eclipse
>> refactoring guys (or add a comment to the readme?).
>>
>
> Great. I have not had any time to work on FIME. If you have some
> proposed improvements perhaps you could work with Ias, if he has any
> time :-)
>
>
>> Second, I've been composing a document which tries to address 'my'
>> needs in terms of (binary) XML and some imagined API. I've sent the
>> same document to the mpeg workgroup on mpeg-21 Digital Item Streaming.
>> Whatever comments - I need input, so feel free to tell me how
>> incompetent I am;) - there are a lot of loose threads and I need to
>> find a working solution (or at least API).
>>
>
> I would be happy to review the document if you send it to me. I will be
> objective and not be biased towards Fast Infoset.

pointers: http://www.itscj.ipsj.or.jp/sc29/open/29view/29n4314t.doc (old
but OK),
http://umabase.adactus.no/SpiderWeb/temp/embedded.pdf

I actually thought I attached the PDF to the previous list post, but who
knows what became of it. Maybe it had a stupid filename. Actually, the
content also seems more stupid the more I think about it..

>
>
>> I think that data navigation should be a natural part of any binary
>> format, and that it in the XML case should be exposed through an API,
>> have you considered it (the database 'view' of XML)?
>
> Do you mean random access into the document, or selected access to
> certain parts.

I want random access via pointers inserted at serialization time, and also
the ability to jump to 'addresses' saved after analyzing the document.
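
To make the idea concrete, here is a minimal Java sketch of jump points
recorded by the serializer. It is not Fast Infoset's format or API: the
class JumpPointSketch, its Indexed holder and the section labels are all
made up for illustration, and the "document" is just concatenated UTF-8
sections. The writer notes the byte offset and length of each section as
it goes, and a reader later uses that table to go straight to one section
without scanning what precedes it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of any Fast Infoset API.
class JumpPointSketch {

    /** Serialized bytes plus a table of label -> {offset, length}. */
    static final class Indexed {
        final byte[] data;
        final Map<String, int[]> jumps;

        Indexed(byte[] data, Map<String, int[]> jumps) {
            this.data = data;
            this.jumps = jumps;
        }
    }

    /** Writes the sections in order and records where each one starts. */
    static Indexed serialize(Map<String, String> sections) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Map<String, int[]> jumps = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : sections.entrySet()) {
            byte[] bytes = e.getValue().getBytes(StandardCharsets.UTF_8);
            // out.size() is the offset at which this section will begin.
            jumps.put(e.getKey(), new int[] { out.size(), bytes.length });
            out.write(bytes, 0, bytes.length);
        }
        return new Indexed(out.toByteArray(), jumps);
    }

    /** Decodes one section directly from its saved offset and length. */
    static String read(Indexed doc, String label) {
        int[] span = doc.jumps.get(label);
        return new String(doc.data, span[0], span[1], StandardCharsets.UTF_8);
    }
}
```

The same table could equally be built after the fact by a first analysis
pass and saved for later jumps; how such a table would travel with the
document is left open here.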

>
> A DOM representation provides random access to the infoset, but first
> the DOM representation has to be instantiated.
>
yes:(

> XML databases do seem to work quite well and i am sure there are loads
> of proprietary formats (with good reasons because it is not a high
> requirement for these formats to be interoperable) for representing XML
> infosets such that general querying of such data is efficient.
>
> What we wanted to avoid with Fast Infoset was a 'memory representation'
> of an XML infoset. We concentrated on a format to stream XML infosets.
OK, but I don't want DOM in my J2EE system. Perhaps J2EE supports
XPath/XQuery query caching (having the same document in memory at the same
time)..
>
> For Fast Infoset we designed it such that it is possible to provide
> 'jump points' into the encoding i.e. selected access as chosen by the
> serializer. This technique is not standardized but the capability is
> there, so for example it could be possible if say an XML document
> represented multiple pages to provide 'jump points' to each page.
> Certain constraints have to be met in terms of the encoding and the XML
> infoset for it to work and the current APIs would need to be extended.
> In fact the concept of 'jump points' can be applied equally to XML
> documents, although how you send the jump points with the XML document
> is not quite so easy.

Ok I think this is just what I need:) Where can I read more?

>
> An alternative approach is to support simple and composable linear XPath
> expressions operating on a stream of SAX events. This can be extremely
> efficient. But it depends what type of processing you need to do on the
> XML infoset.
>

Basically, insert information into the stream (preferably without
re-buffering the whole thing) while reading it at the same time.

>
>> It is not only that 'some namespace cannot be understood', it is that
>> 'the sets of understood namespaces may differ in parts of my
>> application (or potentially in some 3rd party software)'. It is not
>> inventing the wheel, but it will be useful.
>>
>
> Not sure i understand. Can you give an example?

<XML>
        <girlsTalk:conversationStart>
                ....
        </girlsTalk:conversationStart>
        <girlsTalk:conversationStart>
                ....
        </girlsTalk:conversationStart>
        <girlsTalk:conversationStart>
                ....
        </girlsTalk:conversationStart>
        ...

        <boysTalk:GameStart>
                ....
        </boysTalk:GameStart>
        <boysTalk:GameStart>
                ....
        </boysTalk:GameStart>
        <boysTalk:GameStart>
                ....
        </boysTalk:GameStart>
                ....

</XML>

First, analyze girls talk by traversing the document. Then analyze boys
talk by traversing the document again (from the top). Now imagine that
girls talk is 1 MB of XML and I have to do this as many times as possible
on a PC. It is stupid not to analyze them together, but I cannot predict
whether I will want to analyze boys talk.

Or another scenario: I get tired of girls talk and want to skip to boys
game (and I know in advance that that is a real possibility). :P
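
For the first scenario, a single traversal can feed both analyses if the
handler dispatches on the namespace of each element. The sketch below uses
the standard JAXP SAX API and only counts elements per namespace; the
class name NamespaceTally and the namespace URIs in the usage note are
assumptions, since the example above does not declare any. Skipping
straight to boys talk (the second scenario) is not something plain SAX
can do; that would need jump points in the encoding.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration: one pass over the document, one tally per namespace.
class NamespaceTally {

    /** Returns namespace URI -> number of elements seen in that namespace. */
    static Map<String, Integer> countByNamespace(String xml) throws Exception {
        final Map<String, Integer> counts = new HashMap<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so that 'uri' is filled in
        factory.newSAXParser().parse(
            new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    // A real application would route to a per-namespace
                    // analyzer here instead of just counting.
                    counts.merge(uri, 1, Integer::sum);
                }
            });
        return counts;
    }
}
```

Called on a well-formed variant of the document above that binds
girlsTalk and boysTalk to, say, urn:example:girlsTalk and
urn:example:boysTalk, this yields one count per URI plus an entry under
the empty string for elements in no namespace.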

>
>
>> I have not investigated BiM, the MPEG-7 XML compression, but I saw
>> before writing this mail that some of the features exist, but I'll
>> have to check it out more. If BiM definitely is schema-based, there is
>> no way I can use it just out of the box, because I need to compress
>> XML without prior knowledge of schemas.
>
> BiM 1.0 is definitely schema-based. I think the BiM 1.x or 2.0 (i cannot
> recall) has the ability to encode an XML infoset without a schema but as
> i understand it is not very efficient. I think this feature is mainly
> there to better encode instances of xsd:any.
>

Ok, but I don't know if I want a <what namespaces do you support and here
are the schemas you're missing> negotiation.

>
>> But:
>> http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm#E11E20 .
>> Also, without knowing, I imagine that schema-based solutions either
>> use much memory and/or encode slow.
>>
>
> Actually schema-based solutions tend to be the fastest in terms of
> encoding and decoding. The reason is that special
> encoders/decoders can be generated from the schema and the encodings
> tend to be more compact than XML infoset-based solutions (especially for
> small infosets), although using certain techniques of Fast Infoset it is
> possible to close the gap.
>
> A companion standard to FI , called X.694 (or Fast Schema as i like to
> call it) provides functionality that is similar to BiM. See:
>
> http://asn1.elibel.tm.fr/xml/#schema-mapping
>
>
> The downside of such schema-based encodings is that they are not
> self-describing or self-structuring, which is the reason why BiM defines
> forwards and backwards support in terms of a higher-layer of the
> encoding.
>

I will read up on this soon, but consider the following scenario:

I have a schema on both sides of some server-client solution.

The encoder schema has an extra attribute/namespace which is used in a
document instance.

The (application) server does (naturally) not understand how this new
attribute/namespace works. But it reads the new attribute/namespace,
re-encodes the document to text XML and deposits it in an Oracle XML
database.

Somewhat later, the client wants the document back in its original form
(with the new attribute/namespace), so it must be encoded and decoded
again.

Does BiM support that? Any idea about the granularity of BiM on this sort
of thing (does it support any scenario)? What if there is no schema for
some XML?

I need my system to be really robust (no really, I just don't want to do
the same job twice;))

> If you could send me useful pointers to MPEG-21 (and your document) i
> may be able to help you on your requirements based on the MPEG-21
> use-cases.
>
>
>> Anyways, having BiM in your online Parser performance / Compactness
>> charts would be sweet.
>>
>
> That would be tricky since i do not have a BiM implementation available
> to me.
>

OK, if I find one I'll let you know.

> Note that Fast Infoset can be applied effectively to X3D (already has
> been applied effectively), SVG and Geography Markup Language documents
> when there are encodings algorithms specified for the efficient encoding
> of co-ordinate information represented as integers or real numbers. In
> these cases i think Fast Infoset will perform well against a
> schema-based encoding that use the same encodings algorithms.
>

That sounds good. Memory vs performance is my primary concern (J2ME
environment). Footprint is not that important.

Many thanks for your thorough reply!

Thomas

dev@fi.java.net
