Display and Access URLs


In some applications, security requirements mean that the URL crawled differs from the URL the end user sees. For example, the crawler might crawl an internal Web site behind a firewall without security checking, while search results returned to an end user must use a corresponding mirror URL outside the firewall.

  • A display URL is a URL string used for search result display. This is the URL used when users click the search result link.
  • An access URL is a URL string used by the crawler for crawling and indexing.
    • An access URL is optional. If it does not exist, then the crawler uses the display URL for crawling and indexing.
    • If it does exist, then the crawler uses it instead of the display URL for crawling and indexing.
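
The fallback rule in the list above can be sketched in a few lines of Python. This is only an illustration under assumed names: the `UrlPair` class and its `crawl_url` method are hypothetical and are not part of any crawler API described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UrlPair:
    """Hypothetical container for the two URLs attached to one document."""

    display_url: str                   # shown in search results, followed on click
    access_url: Optional[str] = None   # optional; used only by the crawler

    def crawl_url(self) -> str:
        """Return the URL the crawler should fetch and index.

        The access URL wins when it exists; otherwise the crawler
        falls back to the display URL.
        """
        if self.access_url is not None:
            return self.access_url
        return self.display_url


# Regular Web crawling: only a display URL is available.
regular = UrlPair(display_url="http://www.example.com/index.html")

# Mirrored site: crawl the internal URL, display the external one.
mirrored = UrlPair(
    display_url="http://www.example.com/index.html",
    access_url="http://www.example-qa.us.com:9393/index.html",
)
```

With these values, `regular.crawl_url()` returns the display URL, while `mirrored.crawl_url()` returns the internal access URL.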

For regular Web crawling, only display URLs are available. In some situations, however, the crawler needs an access URL to crawl the internal site while keeping a display URL for external use. For every internal URL, there is an external mirrored one. For example, when the link http://www.example-qa.us.com:9393/index.html is extracted, the crawler generates a new display URL and a new access URL for it before inserting it into the queue:

    Access URL:
    http://www.example-qa.us.com:9393/index.html

    Display URL:
    http://www.example.com/index.html


The extracted URL link is rewritten, and the crawler crawls the internal Web site without exposing it to the end user.
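
As a rough sketch of the rewriting step, the following Python maps an extracted internal link to an access/display pair by swapping the host and port. The `MIRROR_HOSTS` table and the `rewrite_link` function are assumptions made for illustration; the actual crawler may express this mapping differently (for example, through configured rewriting rules).

```python
from typing import Tuple
from urllib.parse import urlsplit, urlunsplit

# Hypothetical table: internal host:port -> external mirror host.
MIRROR_HOSTS = {
    "www.example-qa.us.com:9393": "www.example.com",
}


def rewrite_link(extracted_url: str) -> Tuple[str, str]:
    """Return (access_url, display_url) for an extracted link.

    The access URL is the link as extracted. The display URL is the same
    link with the internal host:port replaced by its external mirror.
    Links with no known mirror keep the same URL for both roles.
    """
    parts = urlsplit(extracted_url)
    mirror = MIRROR_HOSTS.get(parts.netloc)
    if mirror is None:
        return extracted_url, extracted_url
    display_url = urlunsplit(parts._replace(netloc=mirror))
    return extracted_url, display_url


access_url, display_url = rewrite_link(
    "http://www.example-qa.us.com:9393/index.html"
)
```

In this sketch the crawler would queue `access_url` for fetching and store `display_url` for the search results, so the internal host never appears in what the end user sees.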