Incremental crawl script

This topic describes how to set up an incremental crawl script that manages record output files.

Because the script is essentially the same for all file system and CMS crawl configurations, this topic uses the script from the EndecaCasCrawlConfig.xml sample (for a crawl named Endeca) as the example.

<script id="Endeca_incrementalCasCrawl">
    <![CDATA[
  crawlName = "Endeca";
  1. Check the output type. The script verifies that the crawl is configured to write output to a record output file, and throws an exception if the crawl is configured to write output to a Record Store instance.
      if (!CAS.isCrawlFileOutput(crawlName)) {
          throw new UnsupportedOperationException("The crawl " +
          crawlName + " does not have a File System output type. " +
          "The only supported output type for this script is " +
          "File System.");
      }
        log.info("Starting incremental CAS crawl '" + crawlName + "'.");
  2. Obtain a lock on the crawl. The incremental crawl attempts to set a flag in the EAC to serve as a lock or mutex. The name of the flag is the string "crawl_lock_" plus the name of the crawl (such as "crawl_lock_Endeca" for this example). If the flag is already set, this step fails, which ensures that two crawls (either baseline or incremental) cannot run at the same time, because concurrent crawls would interfere with data processing. The flag is removed when an error occurs or when the script completes successfully.
      // obtain lock
      if (LockManager.acquireLock("crawl_lock_" + crawlName)) {
    
  3. Clean the output directories. Any previous crawl output file (either baseline or incremental) is removed from the crawl's configured output directory (that is, the output directory that was configured when the crawl was created). The data/complete_cas_crawl_output directory is not affected.
      CAS.cleanOutputDir(crawlName);
  4. Run the incremental crawl. The incremental crawl is run with the crawl name as the ID.
      CAS.runIncrementalCasCrawl(crawlName);
  5. Rename the output file. The incremental crawl output file is renamed by prefixing the crawl name and a timestamp, to indicate the order in which the incremental crawl file was generated relative to others.
      CAS.renameIncrementalCrawlOutput(crawlName);
  6. Obtain a second lock on the complete crawl data directory. The script attempts to set a flag to serve as a lock on the data/complete_cas_crawl_output directory. If the flag is already set, this step waits for up to 10 minutes (600 seconds) for the flag to become available. If the flag is still set after 10 minutes, the step fails and the renamed output file is not copied. This step ensures that access to the directory is synchronized, so that a downstream process such as the baseline update does not retrieve a half-delivered crawl file.
      // try to acquire a lock on the complete crawl data directory
      // for up to 10 minutes
      if (LockManager.acquireLockBlocking("complete_cas_crawl_data_lock",
          600))
    
  7. Get the path of the output destination directory. The path of the destination directory (to which the incremental crawl output file will be copied) is obtained. The directory name is specified by the casCrawlIncrementalOutputDestDir property (which is data/complete_cas_crawl_output/incremental by default) in the fetchCasCrawlDataConfig.xml file.
      destDir = PathUtils.getAbsolutePath(CAS.getWorkingDir(),
        CAS.getFsCrawlIncrementalOutputDestDir());
    
  8. Create the destination directory. The destination directory for the crawl output file is created if it does not exist. The name is in the destDir variable.
      // create the target dir, if it doesn't already exist
      mkDirUtil = new CreateDirUtility(CAS.getAppName(),
        CAS.getEacHost(), CAS.getEacPort(), CAS.isSslEnabled());
      mkDirUtil.init(Forge.getHostId(), destDir, CAS.getWorkingDir());
      mkDirUtil.run();
    
  9. Copy the output file to the destination directory. The renamed incremental crawl output file is copied from the original output directory to the destination directory (data/complete_cas_crawl_output/incremental by default).
      // deliver crawl output to destination directory
      CAS.copyIncrementalCrawlOutputToDestinationDir(crawlName);
    
  10. Release the second lock. The "complete_cas_crawl_data_lock" flag is removed from the EAC, indicating that the copy operation was successful.
      // release lock on the crawl data directory
      LockManager.releaseLock("complete_cas_crawl_data_lock");
    
  11. Release the first lock. The "crawl_lock_Endeca" flag is removed from the EAC (indicating that the crawl operation was successful) and a "finished" message is logged.
      LockManager.releaseLock("crawl_lock_" + crawlName);
      ...
      log.info("Finished incremental CAS crawl '" + crawlName +
        "'.");