Designed primarily for use with unstructured data, the First
module ranks documents by how close the query terms are to the beginning of the
document.
The First module groups its results into variably sized strata. The
strata differ in size because, while the first word is probably more
relevant than the tenth word, the 301st word is probably not much more
relevant than the 310th word. The module relies on the observation that the
closer something is to the beginning of a document, the more likely it is to
be relevant.
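The idea of strata that widen with distance from the start of the document can be sketched as follows. This is only an illustration: the boundary values in `STRATUM_BOUNDS`, the function name `stratum_for`, and the use of 0-based word positions are assumptions made for the example, not the boundaries the First module actually uses.

```python
from bisect import bisect_right

# Hypothetical exclusive upper bounds for each stratum, in word positions.
# Each stratum is wider than the one before it, so early positions are
# distinguished finely and late positions coarsely.
STRATUM_BOUNDS = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]


def stratum_for(position: int) -> int:
    """Map a 0-based word position to a 0-based stratum index.

    Lower stratum numbers are better. Positions at or beyond the last
    bound all fall into one final catch-all stratum.
    """
    if position < 0:
        raise ValueError("position must be non-negative")
    # bisect_right counts how many bounds are <= position, which is
    # exactly the index of the stratum containing that position.
    return bisect_right(STRATUM_BOUNDS, position)
```

With these assumed boundaries, the first word (position 0) and the tenth word (position 9) land in different strata, while the 301st and 310th words (positions 300 and 309) share one.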
The First module works as follows:
- When the query has a single term, First’s behavior is straightforward: it
retrieves the first absolute position of the word in the document, then
calculates which stratum contains that position. The score for this document
is based upon that stratum; earlier strata are better than later strata.
- When the query has multiple terms, First determines the first absolute
position of each query term and then calculates the median of these
positions. This median is treated as the position of the query in the
document and is stratified as described in the single-term case.
- With query expansion (using stemming, spelling correction, or the
thesaurus), the First module treats expanded terms as if they occurred in the
source query. For example, the phrase "glucose intolerence" would be
corrected to "glucose intolerance" (with "intolerence" spell-corrected to
"intolerance"). First then continues as it does in the non-expansion case:
the first position of each term is computed and the median of these is taken.
- In a partially matched query, where only some of the query terms cause a
document to match, First behaves as if the terms that occur in both the
document and the original query were the entire query. For example, if the
query "cat bird dog" is partially matched to a document on the terms "cat"
and "bird", then the document is scored as if the query were "cat bird". If
no terms match, then the document is scored in the lowest stratum.
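The cases above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the module's implementation: the function names are hypothetical, the document is modeled as a plain list of words with 0-based positions, and for an even number of matched terms the lower of the two middle positions is used as the median (the text does not say how ties are resolved). Query expansion is assumed to have already been applied to `query_terms`.

```python
from statistics import median_low
from typing import Dict, List, Optional


def first_positions(doc_words: List[str], query_terms: List[str]) -> Dict[str, int]:
    """Return the first 0-based position of each query term found in the document."""
    wanted = set(query_terms)
    positions: Dict[str, int] = {}
    for index, word in enumerate(doc_words):
        # Record only the first occurrence of each query term.
        if word in wanted and word not in positions:
            positions[word] = index
    return positions


def query_position(doc_words: List[str], query_terms: List[str]) -> Optional[int]:
    """Position of the query in the document, or None if no term matches.

    Terms missing from the document are ignored, which mirrors the
    partial-match rule: only terms present in both the query and the
    document contribute. A caller would place a None result in the
    lowest stratum.
    """
    found = first_positions(doc_words, query_terms)
    if not found:
        return None
    # Assumption: lower-middle value for an even number of positions.
    return median_low(list(found.values()))
```

For example, with the document `"the cat sat near a bird".split()`, the query `["cat", "bird", "dog"]` is treated as if it were `["cat", "bird"]`, because `dog` never occurs in the document.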
Note: The First module does not work with Boolean searches, cross-field
matching, or wildcard search. It assigns all such matches a score of zero.