URL Rewriter API

A URL rewriter is a user-supplied Java module that implements the Oracle SES UrlRewriter Java interface. When activated, the crawler uses it to filter and rewrite extracted URL links before they are inserted into the URL queue. The URL Rewriter API is used for Web sources.
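The following is a minimal, self-contained sketch of where such a module sits relative to the URL queue. It does not use the Oracle SES classes: the LinkRewriter interface, the crawl method, and the in-memory site map are hypothetical names introduced only for illustration, and the real UrlRewriter interface may differ in its method names and types.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CrawlQueueSketch {

    /** Hypothetical hook for illustration; NOT the Oracle SES UrlRewriter interface. */
    interface LinkRewriter {
        /** Returns the URL to enqueue, or null to filter the link out. */
        String rewrite(String extractedUrl);
    }

    /**
     * Crawls an in-memory "site" (URL -> links found on that page) and
     * returns the URLs in the order they were fetched.
     */
    static List<String> crawl(String seed, Map<String, List<String>> site, LinkRewriter rewriter) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> fetched = new ArrayList<>();
        queue.add(seed);
        seen.add(seed);
        while (!queue.isEmpty()) {
            String url = queue.poll();          // next URL from the queue
            fetched.add(url);                   // stands in for fetching the content
            for (String link : site.getOrDefault(url, List.of())) {
                // The rewriter sees every extracted link before it is queued.
                String rewritten = rewriter.rewrite(link);
                if (rewritten != null && seen.add(rewritten)) {
                    queue.add(rewritten);
                }
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = Map.of(
            "http://www.example.com/index.html",
            List.of("http://www.example.com/a.html?session=1",
                    "http://www.example.com/private/b.html"),
            "http://www.example.com/a.html", List.of());

        // Example policy (an assumption): drop /private/ links, strip query strings.
        LinkRewriter rewriter = link -> {
            if (link.contains("/private/")) {
                return null;
            }
            int q = link.indexOf('?');
            return q < 0 ? link : link.substring(0, q);
        };

        System.out.println(crawl("http://www.example.com/index.html", site, rewriter));
    }
}
```

In this sketch, returning null drops a link and returning a different string rewrites it, so a single hook covers both filtering and rewriting.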

During a Web crawl, the crawler obtains the next URL from the URL queue, fetches its content, and extracts the URL links from that content. The links are then inserted into the URL queue. For some applications, however, security requirements mean that the URL crawled is different from the one seen by the end user.

For example, the crawler might actually crawl http://www.example_qa.us.com:9393/index.html, which is the access URL.

However, what the end user sees after clicking the search link is http://www.example.com/index.html, which is the display URL.

Therefore, when the URL link http://www.example_qa.us.com:9393/index.html is extracted, the crawler generates a new display URL and a new access URL for it before it is inserted into the queue. Because the extracted URL link is rewritten, the crawler can crawl the internal Web site without exposing it to the end user.
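The access URL and display URL pairing in this example can be sketched as follows. This is an illustration under stated assumptions, not the Oracle SES API: the RewrittenLink class and the rewrite method are hypothetical, and the mapping is a plain string-prefix swap built from the two example URLs above. Prefix matching is used instead of java.net.URI because that class does not report a host for names containing an underscore, such as www.example_qa.us.com.

```java
class DisplayUrlSketch {

    // Prefixes taken from the example URLs in the text.
    static final String ACCESS_PREFIX = "http://www.example_qa.us.com:9393";
    static final String DISPLAY_PREFIX = "http://www.example.com";

    /** Hypothetical holder for the two URLs associated with one link. */
    static final class RewrittenLink {
        final String accessUrl;   // what the crawler fetches
        final String displayUrl;  // what the end user sees in search results

        RewrittenLink(String accessUrl, String displayUrl) {
            this.accessUrl = accessUrl;
            this.displayUrl = displayUrl;
        }
    }

    /** Derives the access/display pair for an extracted link. */
    static RewrittenLink rewrite(String extractedUrl) {
        if (extractedUrl.startsWith(ACCESS_PREFIX + "/")) {
            // Keep the path, swap the internal host and port for the public host.
            String path = extractedUrl.substring(ACCESS_PREFIX.length());
            return new RewrittenLink(extractedUrl, DISPLAY_PREFIX + path);
        }
        // Links outside the internal site are left unchanged.
        return new RewrittenLink(extractedUrl, extractedUrl);
    }

    public static void main(String[] args) {
        RewrittenLink link = rewrite("http://www.example_qa.us.com:9393/index.html");
        System.out.println("access:  " + link.accessUrl);
        System.out.println("display: " + link.displayUrl);
    }
}
```

The crawler would fetch the access URL, while the search result would show only the display URL, so the internal host name and port are never presented to the end user.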