Determining the Data Change

One of Timothy's concerns is tracking the data: he wants to know when it changes. Philip explains it this way: crawlers normally perform "incremental crawls", which provide only the documents that are new or have been modified since the last crawl. A crawler can declare that it does not support incremental crawling, in which case it performs a complete recrawl of all data each time it is invoked. For performance reasons, this should be avoided where possible. Hence, we need to know which documents are new or have changed since the last crawl. Depending on the source of the data, this information may be available through a simple lookup in some central directory, or it may require a "surface" crawl of all documents to check each one's date and time of last modification. When this information is not available, it may be necessary to do a complete crawl and rely on the duplicate detection mechanisms within SES to ignore documents that are already indexed.
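The selection logic Philip describes can be sketched in a few lines. This is only an illustration, not the SES crawler plug-in API: `DocInfo`, `select_for_crawl`, and the timestamps are hypothetical names and values chosen for the example. It assumes the "surface" crawl has already gathered a last-modified time for each document where the source exposes one.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class DocInfo:
    """Hypothetical metadata gathered by a surface crawl (not an SES type)."""
    url: str
    last_modified: Optional[datetime]  # None when the source exposes no timestamp


def select_for_crawl(docs: Iterable[DocInfo],
                     last_crawl: Optional[datetime]) -> List[DocInfo]:
    """Return the documents an incremental crawl should submit for indexing."""
    selected = []
    for doc in docs:
        if last_crawl is None:
            # First crawl: there is nothing to compare against, so take everything.
            selected.append(doc)
        elif doc.last_modified is None:
            # Modification time unknown: submit the document anyway and leave it
            # to duplicate detection downstream to drop unchanged content.
            selected.append(doc)
        elif doc.last_modified > last_crawl:
            # New or modified since the previous crawl.
            selected.append(doc)
    return selected


if __name__ == "__main__":
    last_crawl = datetime(2024, 1, 10, tzinfo=timezone.utc)
    docs = [
        DocInfo("doc/a", datetime(2024, 1, 5, tzinfo=timezone.utc)),   # unchanged
        DocInfo("doc/b", datetime(2024, 1, 12, tzinfo=timezone.utc)),  # modified
        DocInfo("doc/c", None),                                        # unknown
    ]
    for doc in select_for_crawl(docs, last_crawl):
        print(doc.url)
```

In this sketch the fallback is applied per document: only entries without a timestamp are resubmitted. A source that exposes no modification times at all would therefore degrade to the complete crawl described above.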