Tell Me

Determining Crawler Authentication

Philip says that it is important to determine who crawls the data and how the crawler authenticates.

A secure crawler generally has to run in some sort of “superuser” mode. The crawler needs read access to all of the documents to be indexed. This can be achieved in one of two ways:

A username and password for a privileged user are provided as parameters to the crawler, from the admin screens. The crawler logs on to the data source using these credentials, and fetches information as that privileged user.

“Service to Service” authentication (S2S). The data source trusts SES and gives it unrestricted access to all information, knowing that SES will enforce security at query time. The SES instance authenticates itself (proves that it is the actual trusted SES instance) by means of an authentication key (or password) that is known to both systems.

These two methods are very similar: in both, a password proves that the SES instance is authorized. The main difference is that in the first method, the data source does not need to “know” that it is being crawled by SES.
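The two approaches can be sketched from the data source's point of view. This is a minimal illustration, not SES code: the DataSource class, its method names, and the sample credentials are all hypothetical, and passwords are held in plain text only to keep the example short.

```python
import hmac
from dataclasses import dataclass, field


def _same(a: str, b: str) -> bool:
    # Constant-time comparison, so timing does not leak how much of a secret matched.
    return hmac.compare_digest(a.encode(), b.encode())


@dataclass
class DataSource:
    """Hypothetical data source; none of these names come from SES."""
    users: dict            # username -> password (plain text for brevity only)
    privileged: set        # usernames allowed to read every document
    s2s_key: str           # key shared with the trusted search instance
    documents: dict = field(default_factory=dict)

    def fetch_all_as_user(self, username: str, password: str) -> dict:
        # Method 1: the crawler logs on as an ordinary privileged user.
        # The data source cannot tell that the caller is a crawler.
        expected = self.users.get(username)
        if expected is None or not _same(expected, password):
            raise PermissionError("invalid credentials")
        if username not in self.privileged:
            raise PermissionError("user may not read all documents")
        return dict(self.documents)

    def fetch_all_s2s(self, presented_key: str) -> dict:
        # Method 2: the caller proves it is the trusted search instance
        # by presenting the key that both systems know.
        if not _same(self.s2s_key, presented_key):
            raise PermissionError("caller is not the trusted instance")
        return dict(self.documents)


# Sample setup with made-up values.
source = DataSource(
    users={"crawler_admin": "s3cret", "alice": "pw"},
    privileged={"crawler_admin"},
    s2s_key="shared-key",
    documents={"doc1": "Quarterly report"},
)

all_docs_via_login = source.fetch_all_as_user("crawler_admin", "s3cret")
all_docs_via_s2s = source.fetch_all_s2s("shared-key")
```

In both cases a shared secret unlocks every document. The only structural difference in the sketch is that the first path goes through the ordinary user login, while the second is a dedicated entry point that exists because the data source knows about the search instance.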