Jaakko Kangasharju wrote:
> Mark Swanson <mark_at_ScheduleWorld.com> writes:
>
>
>>Paul Sandoz wrote:
>>
>>>Hi,
>>>Some people on this list may be interested in this blog i just wrote:
>>>http://blogs.sun.com/roller/page/sandoz?entry=realistic_generation_and_use_of
>>
>>Good article.
>>I was also wondering about the last column. You seem to imply that if
>>the most frequent information is at the end of the document then it
>>won't be taken into consideration by the indexer. Wouldn't it make
>>sense for the indexer to base its decisions after reading the entire
>>document (which seems to be the case if more than one file is given)?
>
>
> I thought that the indexer simply gives a running number to the data
> that it indexes, so things indexed later get larger index numbers, and
> representing these larger numbers takes up more space in the encoded
> document. When using a schema and samples, the most frequent
> information is available sooner, so it gets smaller index numbers.
>
Yes. For example, if an external vocabulary were generated from the
following XML document without taking frequency of occurrence into
account:
<foo>
<bar1/>
<bar2/>
<bar3/>
<bar4/>
...
<bar100/>
<baz/>
<baz/>
...
<baz/>
<baz/>
<baz/>
</foo>
the element baz would be assigned an index of 102: foo gets index 1,
bar1 to bar100 get indexes 2 to 101, and baz, first seen last, gets 102.
This is the same index baz would get if the Fast Infoset document were
created without using an external vocabulary.
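
To make the difference concrete, here is a rough Python sketch of the two
assignment strategies. It is not the Fast Infoset API: the function names are
made up for illustration, it models only a single table of element names, and
the count of 50 baz elements is an arbitrary assumption, since the example
above elides the real number.

```python
from collections import Counter


def first_seen_indexes(names):
    """Assign 1-based indexes in order of first occurrence (running number)."""
    table = {}
    for name in names:
        if name not in table:
            table[name] = len(table) + 1
    return table


def frequency_indexes(names):
    """Assign 1-based indexes so that the most frequent names come first."""
    counts = Counter(names)
    # sorted() is stable, so names with equal counts keep first-seen order.
    ordered = sorted(counts, key=lambda name: -counts[name])
    return {name: index for index, name in enumerate(ordered, start=1)}


# Element names in document order: foo, bar1..bar100, then repeated baz.
# ASSUMPTION: 50 baz elements; the original example does not give the count.
names = ["foo"] + [f"bar{i}" for i in range(1, 101)] + ["baz"] * 50

print(first_seen_indexes(names)["baz"])  # 102
print(frequency_indexes(names)["baz"])   # 1
```

With frequency taken into account, baz moves from index 102 to index 1, which
is the kind of gain a frequency-aware vocabulary generator is after.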
Paul.
--
| ? + ? = To question
----------------\
Paul Sandoz
x38109
+33-4-76188109