Re: Realistic generation and use of external vocabularies

From: Paul Sandoz <Paul.Sandoz_at_Sun.COM>
Date: Thu, 13 Apr 2006 13:29:36 +0200

Jaakko Kangasharju wrote:
> Mark Swanson <mark_at_ScheduleWorld.com> writes:
>
>
>>Paul Sandoz wrote:
>>
>>>Hi,
>>>Some people on this list may be interested in this blog I just wrote:
>>>http://blogs.sun.com/roller/page/sandoz?entry=realistic_generation_and_use_of
>>
>>Good article.
>>I was also wondering about the last column. You seem to imply that if
>>the most frequent information is at the end of the document then it
>>won't be taken into consideration by the indexer. Wouldn't it make
>>sense for the indexer to base its decisions after reading the entire
>>document (which seems to be the case if more than one file is given)?
>
>
> I thought that the indexer simply gives a running number to the data
> that it indexes, so things indexed later get larger index numbers, and
> representing these larger numbers takes up more space in the encoded
> document. When using a schema and samples, the most frequent
> information is available sooner, so it gets smaller index numbers.
>

Yes. For example, if an external vocabulary were generated from the
following XML document without taking frequency of occurrence into
account:

<foo>
   <bar1/>
   <bar2/>
   <bar3/>
   <bar4/>
   ...
   <bar100/>
   <baz/>
   <baz/>
   ...
   <baz/>
   <baz/>
   <baz/>
</foo>

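the element baz would be assigned an index of 102 (foo gets 1, and
bar1 to bar100 get 2 to 101), even though it occurs far more often
than any other element.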

This would be the same index if the Fast Infoset document were created
without using an external vocabulary.
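
To make the numbering concrete, here is a minimal sketch in Python of
the two schemes. It is an illustration only, not the actual Fast
Infoset indexer: the function names are hypothetical, and the number
of baz elements (50) is an assumption, since the example above elides
it.

```python
from collections import Counter

def element_names(baz_count=50):
    # Element names in document order for the example above.
    # baz_count is an assumed value; the example elides the real count.
    return ["foo"] + [f"bar{i}" for i in range(1, 101)] + ["baz"] * baz_count

def first_seen_indexes(names):
    # Running number: each new name gets the next index, starting at 1.
    indexes = {}
    for name in names:
        if name not in indexes:
            indexes[name] = len(indexes) + 1
    return indexes

def frequency_indexes(names):
    # Most frequent names get the smallest indexes; ties keep document order.
    counts = Counter(names)
    first = first_seen_indexes(names)
    ordered = sorted(counts, key=lambda n: (-counts[n], first[n]))
    return {name: i + 1 for i, name in enumerate(ordered)}

names = element_names()
print(first_seen_indexes(names)["baz"])  # 102
print(frequency_indexes(names)["baz"])   # 1
```

With frequency taken into account, baz moves from index 102 to index
1, and every other element shifts up by one.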

Paul.

-- 
| ? + ? = To question
----------------\
    Paul Sandoz
         x38109
+33-4-76188109