Determining the Data Change

One of Timothy's concerns is tracking the data: he wants to know when it changes.

Philip explains it this way:

Crawlers normally perform "incremental crawls", which provide only the documents that are new or have been modified since the last crawl. A crawler can declare that it does not support incremental crawling, in which case it performs a complete recrawl of all data each time it is invoked. For performance reasons, this should be avoided where possible.
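The difference between an incremental crawl and a complete recrawl can be sketched in a few lines. This is only an illustration, not the SES crawler API: the `Document` class, the `select_for_crawl` function, and the `supports_incremental` flag are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Document:
    """Hypothetical record for one crawlable item."""
    url: str
    last_modified: datetime


def select_for_crawl(
    documents: Iterable[Document],
    last_crawl: Optional[datetime],
    supports_incremental: bool = True,
) -> List[Document]:
    """Return the documents a single crawl run should provide."""
    # Complete recrawl: the first run, or a crawler that declares
    # it does not support incremental crawling.
    if not supports_incremental or last_crawl is None:
        return list(documents)
    # Incremental crawl: only documents that are new or were
    # modified after the previous crawl.
    return [d for d in documents if d.last_modified > last_crawl]
```

With two documents modified in January and March, a crawl whose previous run was in February would return only the March document, while the same call with `supports_incremental=False` would return both.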

Hence, we need to know which documents are new or have changed since the last crawl. Depending on the source of the data, this information may be available through a simple lookup in some central directory, or it may require a "surface" crawl of all documents to check each one's date and time of last modification. When this information is not available, it may be necessary to do a complete crawl and rely on the duplicate detection mechanisms within SES to ignore documents that are already indexed.
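To make the last fallback concrete, here is a minimal sketch of duplicate detection by content fingerprint. It does not describe how SES implements duplicate detection internally; hashing with SHA-256 and the names `content_fingerprint` and `filter_unindexed` are assumptions made for this example.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def content_fingerprint(content: bytes) -> str:
    """Hash the document body so identical content compares equal."""
    return hashlib.sha256(content).hexdigest()


def filter_unindexed(
    crawled: Iterable[Tuple[str, bytes]],
    indexed_fingerprints: Dict[str, str],
) -> List[str]:
    """Return URLs whose content is new or differs from the indexed copy.

    `indexed_fingerprints` maps URL to the fingerprint recorded at the
    previous crawl and is updated in place.
    """
    to_index = []
    for url, content in crawled:
        fingerprint = content_fingerprint(content)
        if indexed_fingerprints.get(url) != fingerprint:
            indexed_fingerprints[url] = fingerprint
            to_index.append(url)
    return to_index
```

A complete crawl still has to fetch every document to compute its fingerprint, which is why this approach only saves indexing work, not crawling work.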