URL Link Filtering

Filtering removes unwanted URL links. The Oracle SES crawler provides a mechanism by which users can control which types of URL links are allowed into the crawling queue.

The mechanism includes:

  • robots.txt file on the target Web site; for example, disallow URLs from the /cgi directory (see the sample after this list)
  • Host inclusion and exclusion rules; for example, only allow URLs from www.example.com
  • File path inclusion and exclusion rules; for example, only allow URLs under the /archive directory
  • MIME type inclusion rules; for example, only allow HTML and PDF files
  • Robots META tag NOFOLLOW; for example, do not extract any links from a page that carries the tag
  • URL blacklist; for example, URLs explicitly singled out not to be crawled

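For reference, the first of these mechanisms is configured on the Web site itself rather than in Oracle SES. Using the standard robots.txt syntax, an entry that disallows the /cgi directory for all crawlers looks like this:

    User-agent: *
    Disallow: /cgi/

Similarly, an individual page can opt out of link extraction with the robots META tag:

    <meta name="robots" content="nofollow">
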
With these mechanisms, only URL links that meet the filtering criteria are processed. However, users might want to filter URL links on other criteria as well. For example:

  • Allow URLs with certain file name extensions
  • Allow URLs only from a particular port number
  • Disallow any PDF file if it is from a particular directory

The set of possible criteria is too large to cover with built-in rules, so this decision is delegated to a user-implemented module that the crawler invokes when evaluating an extracted URL link.
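
As an illustration, the following minimal sketch shows how such a module might implement the three example criteria above. The class and method names (CustomLinkFilter, accept) and the /drafts directory are hypothetical, chosen only to make the idea concrete; the real plug-in interface is whatever the crawler SDK defines.

    import java.net.MalformedURLException;
    import java.net.URL;

    // Hypothetical user-implemented filter module; the crawler would call
    // accept() for each extracted link and queue only those that pass.
    public class CustomLinkFilter {

        public boolean accept(String link) {
            URL url;
            try {
                url = new URL(link);
            } catch (MalformedURLException e) {
                return false;  // reject anything that does not parse as a URL
            }

            String path = url.getPath().toLowerCase();

            // Allow only URLs with certain file name extensions.
            if (!(path.endsWith(".html") || path.endsWith(".pdf"))) {
                return false;
            }

            // Allow URLs only from a particular port number (80 here).
            int port = (url.getPort() == -1) ? url.getDefaultPort()
                                             : url.getPort();
            if (port != 80) {
                return false;
            }

            // Disallow any PDF file if it comes from a particular
            // directory (/drafts is an assumed example).
            if (path.endsWith(".pdf") && path.startsWith("/drafts/")) {
                return false;
            }

            return true;
        }
    }

Returning false for an unwanted link keeps it out of the queue entirely, which is cheaper than fetching the document and discarding it afterward.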