Filtering removes unwanted URL links. The Oracle SES crawler provides
several mechanisms by which users can control which URL links are
inserted into the crawling queue:
- The robots.txt file on the target Web site; for example, disallow URLs
from the /cgi directory (see the samples after this list)
- Host inclusion and exclusion rules; for example, only allow URLs from
www.example.com
- File path inclusion and exclusion rules; for example, only allow URLs
under the /archive directory
- MIME type inclusion rules; for example, only allow HTML and PDF files
- The robots NOFOLLOW metatag; for example, extract no links from a page
that carries this tag
- A URL blacklist; that is, URLs explicitly singled out not to be
crawled
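For instance, the /cgi restriction in the first item would be expressed
on the target Web site with a standard robots.txt entry, and the
NOFOLLOW directive with a standard robots metatag in a page's HTML head:

    # robots.txt: applies to all crawlers; blocks everything under /cgi
    User-agent: *
    Disallow: /cgi/

    <!-- robots metatag: do not follow any links on this page -->
    <meta name="robots" content="nofollow">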
With these mechanisms, only URL links that meet the filtering criteria
are processed. However, users might want to filter URL links by other
criteria. For example:
- Allow URLs with certain file name extensions
- Allow URLs only from a particular port number
- Disallow any PDF file if it is from a particular directory
The set of possible criteria is open-ended, which is why this
evaluation is delegated to a user-implemented module that the crawler
invokes for each extracted URL link, as in the sketch below.
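The following Java sketch shows what such a module might look like for
the three criteria above. The class and method names here are
hypothetical, not the actual Oracle SES crawler plug-in interface; a
real module would implement whatever interface the plug-in API defines.

    import java.net.MalformedURLException;
    import java.net.URL;

    // Hypothetical filter module; the crawler would call accept() for
    // each extracted link and discard any link that returns false.
    public class CustomUrlFilter {

        public boolean accept(String link) {
            URL url;
            try {
                url = new URL(link);
            } catch (MalformedURLException e) {
                return false; // reject anything that is not a well-formed URL
            }

            String path = url.getPath().toLowerCase();

            // Allow only URLs with certain file name extensions.
            if (!(path.endsWith(".html") || path.endsWith(".htm")
                    || path.endsWith(".pdf"))) {
                return false;
            }

            // Allow URLs only from a particular port number (80, or the
            // protocol default when no port is given explicitly).
            int port = (url.getPort() == -1) ? url.getDefaultPort()
                                             : url.getPort();
            if (port != 80) {
                return false;
            }

            // Disallow any PDF file if it comes from a particular
            // directory (/internal is an arbitrary example here).
            if (path.endsWith(".pdf") && path.startsWith("/internal/")) {
                return false;
            }

            return true;
        }
    }

All three checks operate on the parsed URL alone, so a module of this
shape can evaluate each extracted link without fetching it.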