Integrating and running CAS crawls that write to record output files

This section describes the high-level steps for integrating and running a CAS crawl that writes output to a record output file.

This section assumes you have completed the prerequisite setup for your deployment. Each of the steps below is described in detail in its own topic.

  1. Create a CAS crawl.
  2. Specify a CAS Server host in AppConfig.xml.
  3. Specify a CAS Server as a custom component for any CAS crawl that writes to record output files.
  4. Specify a pipeline to run in AppConfig.xml.
  5. Edit fetchCasCrawlDataConfig.xml to reflect the details of your crawling environment.
  6. Create a CAS crawl script, for the crawl you created in step 1, by running the [appdir]/control/cas/make_cas_crawl_scripts script.
  7. Run a baseline CAS crawl (a full crawl) using the sample CAS crawl pipeline. If the baseline update runs without failing, you can start to make further modifications to your deployment, such as using your custom pipeline.
  8. Optionally, run an incremental CAS crawl. Steps 7 and 8 verify that your configuration files are correct.
  9. Load the crawl files generated in the previous step to be processed by the sample CAS crawl pipeline.
  10. Run a baseline update using the sample CAS crawl pipeline with the new crawl record output files.
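To make steps 2 and 3 more concrete, the fragment below sketches what the relevant AppConfig.xml declarations might look like. This is a hypothetical illustration: the host name, ports, component id, and class name are placeholders, and the exact element and attribute names depend on your Deployment Template version, so verify them against your own AppConfig.xml and the Deployment Template documentation before copying.

```xml
<!-- Hypothetical AppConfig.xml fragment; all names and ports are placeholders. -->

<!-- Step 2: declare the host on which the CAS Server runs. -->
<host id="CASHost" hostName="cas.example.com" port="8888" />

<!-- Step 3: declare the CAS Server as a custom component for crawls
     that write to record output files. The class name shown here is an
     assumption; confirm it in your Deployment Template reference. -->
<custom-component id="CAS" host-id="CASHost"
    class="com.endeca.soleng.eac.toolkit.component.cas.ContentAcquisitionServerComponent">
  <properties>
    <!-- Port the CAS Server listens on (placeholder value). -->
    <property name="casPort" value="8500" />
  </properties>
</custom-component>
```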
Note: The instructions in this section apply to the Dgraph deployment type. If you are using an Agraph, all of the file system crawl integration components are deployed and work the same way, but you must customize the cas_crawl_pipeline (or create your own pipeline) to process the crawl data into an Agraph; the Deployment Template does not provide a sample pipeline for the Agraph case.
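The command sequence for steps 6 through 8 can be sketched as the following pseudocode. The make_cas_crawl_scripts path is taken from step 6 above; the names of the generated crawl scripts are illustrative assumptions, since the actual names depend on your crawl name and platform.

```
# Step 6: generate crawl scripts for the crawl created in step 1
cd [appdir]/control/cas
./make_cas_crawl_scripts

# Step 7: run a baseline (full) CAS crawl
# (generated script name is illustrative)
./<crawl_name>_baseline_crawl

# Step 8 (optional): run an incremental CAS crawl
./<crawl_name>_incremental_crawl
```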