users@fi.java.net

Re: Realistic generation and use of external vocabularies

From: Jaakko Kangasharju <jkangash_at_hiit.fi>
Date: Thu, 13 Apr 2006 12:25:24 +0300

Mark Swanson <mark_at_ScheduleWorld.com> writes:

> Paul Sandoz wrote:
>> Hi,
>> Some people on this list may be interested in this blog I just wrote:
>> http://blogs.sun.com/roller/page/sandoz?entry=realistic_generation_and_use_of
>
> Good article.
> I was also wondering about the last column. You seem to imply that if
> the most frequent information is at the end of the document then it
> won't be taken into consideration by the indexer. Wouldn't it make
> sense for the indexer to base its decisions after reading the entire
> document (which seems to be the case if more than one file is given)?

As I understand it, the indexer simply assigns a running number to
each item as it indexes it, so items indexed later get larger index
numbers, and larger numbers take up more space in the encoded
document. When a schema and samples are used, the most frequent
information is seen sooner, so it gets smaller index numbers.
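
To make that concrete, here is a minimal sketch of the behaviour
described above. It is not the Fast Infoset implementation: the class
RunningIndexSketch, its methods, and the base-128 variable-length size
calculation are all assumptions chosen for illustration, and the real
encoding of index values may differ in its details.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not part of any Fast Infoset API. */
class RunningIndexSketch {
    // Maps each indexed string to the running number it was given.
    private final Map<String, Integer> table = new HashMap<>();

    /** Returns the existing index of a value, or assigns the next running number. */
    int indexOf(String value) {
        Integer existing = table.get(value);
        if (existing != null) {
            return existing;
        }
        int next = table.size() + 1; // running numbers start at 1
        table.put(value, next);
        return next;
    }

    /**
     * Bytes needed for a generic base-128 variable-length integer
     * (7 payload bits per byte). This is an assumed encoding, used
     * only to show that larger indexes cost more bytes.
     */
    static int encodedSize(int index) {
        int bytes = 1;
        while (index >= 128) {
            index >>>= 7;
            bytes++;
        }
        return bytes;
    }

    public static void main(String[] args) {
        // Case 1: the frequent string only shows up after 200 other strings.
        RunningIndexSketch late = new RunningIndexSketch();
        for (int i = 0; i < 200; i++) {
            late.indexOf("rare" + i);
        }
        int lateIndex = late.indexOf("frequent");

        // Case 2: the frequent string is indexed first, as it would be
        // if it were known in advance from a schema or samples.
        RunningIndexSketch early = new RunningIndexSketch();
        int earlyIndex = early.indexOf("frequent");

        System.out.println("late:  index " + lateIndex + ", "
                + encodedSize(lateIndex) + " byte(s) per reference");
        System.out.println("early: index " + earlyIndex + ", "
                + encodedSize(earlyIndex) + " byte(s) per reference");
    }
}
```

Under these assumptions, every reference to the late-indexed string
costs one byte more than a reference to the early-indexed one, and
that extra cost is paid each time the string occurs.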

-- 
Jaakko Kangasharju, Helsinki Institute for Information Technology
I + NT = Problem