
Deciding on How to Crawl and Recrawl the Data Source


Timothy should now decide how to crawl the data source. Crawling the data source simply means going through the list of documents in the data source and retrieving them for indexing.

Crawlers normally perform "incremental crawls", which retrieve only those documents that are new or have been modified since the last crawl. A crawler can declare that it does not support incremental crawling, in which case it performs a complete recrawl of all data each time it is invoked. Obviously, this should be avoided where possible for performance reasons.
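
How a plug-in advertises this capability depends on the crawler framework in use. As a minimal sketch, using hypothetical interface and method names rather than the actual SES crawler SDK, the contract might look like this in Java:

    import java.time.Instant;

    // Hypothetical plug-in contract, for illustration only -- not the actual SES crawler SDK.
    public interface CrawlerPlugin {

        // Minimal handle to a document in the data source.
        record DocumentRef(String key, Instant lastModified) { }

        // True if the plug-in can enumerate only documents that are new or modified
        // since a given point in time; false forces a complete recrawl on every run.
        boolean supportsIncrementalCrawl();

        // Returns the documents to fetch. On the initial crawl (or when incremental
        // crawling is unsupported) lastCrawlTime is null and every document is returned.
        Iterable<DocumentRef> enumerate(Instant lastCrawlTime);
    }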

Hence, we need to know which documents are new or have changed since the last crawl. Depending on the source of the data, this information may be available by a simple lookup in some central directory, or it may require a "surface" crawl of all documents to check the date and time of last modification. When this information is not available, it may be necessary to do a complete crawl, and rely on the duplicate detection mechanisms within SES to ignore documents which are already indexed.
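
For the "surface" crawl case, the check amounts to listing every document and comparing its last-modified time against the time of the previous crawl. A minimal sketch, assuming a hypothetical Entry type and listing that are not part of the SES SDK:

    import java.time.Instant;
    import java.util.ArrayList;
    import java.util.List;

    // Illustrative helper: keep only documents changed since the last crawl.
    public final class IncrementalFilter {

        // Hypothetical stand-in for one entry in the data source's document listing.
        public record Entry(String key, Instant lastModified) { }

        // A "surface" pass: fetch nothing yet, just compare each entry's
        // last-modified time against the previous crawl time.
        public static List<Entry> changedSince(List<Entry> listing, Instant lastCrawlTime) {
            List<Entry> changed = new ArrayList<>();
            for (Entry e : listing) {
                // A null lastCrawlTime means this is the first crawl: take everything.
                if (lastCrawlTime == null || e.lastModified().isAfter(lastCrawlTime)) {
                    changed.add(e);
                }
            }
            return changed;
        }
    }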

These considerations help Timothy answer the questions he has in mind:

  • What should the plug-in do on subsequent crawls?
  • Will the plug-in be able to detect the set of inserted, updated, and deleted documents since the last crawl? (A sketch of one approach follows this list.)
  • Is detecting the deleted document set trivial? Generally, it is not.
  • How does the plug-in know that this is a recrawl and that it has to behave differently?
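
One common way to address all four questions, sketched below with hypothetical names (this is not the SES SDK), is to persist a checkpoint of the keys and last-modified times seen by the previous crawl. The very existence of a checkpoint tells the plug-in it is performing a recrawl, and diffing the current listing against it yields the inserted, updated, and deleted sets; the deleted set is simply the keys that were in the checkpoint but are no longer listed.

    import java.time.Instant;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Illustrative checkpoint diff (hypothetical, not the SES SDK).
    public final class CrawlDiff {

        public record Entry(String key, Instant lastModified) { }

        public record Result(Set<String> inserted, Set<String> updated, Set<String> deleted) { }

        // Compares the current listing with the checkpoint persisted by the previous
        // crawl. An empty checkpoint means this is the initial crawl: everything is
        // an insert and nothing can be a delete.
        public static Result diff(List<Entry> current, Map<String, Instant> checkpoint) {
            Set<String> inserted = new HashSet<>();
            Set<String> updated  = new HashSet<>();
            Set<String> deleted  = new HashSet<>(checkpoint.keySet()); // assume deleted until seen

            for (Entry e : current) {
                Instant previous = checkpoint.get(e.key());
                if (previous == null) {
                    inserted.add(e.key());            // never seen before
                } else {
                    deleted.remove(e.key());          // still present in the source
                    if (e.lastModified().isAfter(previous)) {
                        updated.add(e.key());         // modified since the last crawl
                    }
                }
            }
            return new Result(inserted, updated, deleted);
        }
    }

After a successful crawl, the plug-in would overwrite the checkpoint with the new key/timestamp map; when no checkpoint can be loaded, it knows it is on its initial crawl rather than a recrawl and should enumerate everything.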