Coordinating CAS crawls and baseline or partial updates

After you run a CAS crawl, you run a baseline or partial update that incorporates the records from a Record Store instance. How you coordinate your CAS crawls with baseline or partial updates depends on the complexity of your environment. This topic describes several scenarios.

A single CAS crawl and then a Forge update

In the simplest case, the Deployment Template runs a full CAS crawl and then runs a baseline update that incorporates the records from the crawl. To create this sequential workflow of a full CAS crawl followed by a baseline update, do the following:
  • Remove the default Forge.isDataReady check from the baseline update script. This call checks whether a flag was set to indicate that the incoming data files are ready for Forge. Removing the call means that the lock manager does not check the flag, or wait on the flag, before running a CAS crawl.
  • Add a call to CAS.runBaselineCasCrawl() to run the full CAS crawl.
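For reference, the check that you remove wraps the update steps of the default baseline update script in a conditional similar to the following sketch (the exact code varies by Deployment Template version):

    // test whether extracted data files are ready for Forge
    if (Forge.isDataReady()) {
        // ... baseline update processing ...
    }

Because the CAS crawl itself produces the records in this workflow, the data-readiness gate is unnecessary.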
For example, this baseline update script calls CAS.runBaselineCasCrawl("MyCrawl"), which runs a full CAS crawl named MyCrawl. The script then continues with baseline update processing.
<!--
    ########################################################################
    # Baseline update script
    #
  -->
  <script id="BaselineUpdate">
    <log-dir>./logs/provisioned_scripts</log-dir>
    <provisioned-script-command>./control/baseline_update.bat</provisioned-script-command>
    <bean-shell-script>
      <![CDATA[ 
    log.info("Starting baseline update script.");
    // obtain lock
    if (LockManager.acquireLock("update_lock")) {
        
        // call the baseline crawl script to run a full CAS 
        // crawl.
        CAS.runBaselineCasCrawl("MyCrawl");
        
        if (ConfigManager.isWebStudioEnabled()) {
          // get Web Studio config, merge with Dev Studio config
          ConfigManager.downloadWsConfig();
          ConfigManager.fetchMergedConfig();
        } else {
          ConfigManager.fetchDsConfig();
        }
        
        // clean directories
        Forge.cleanDirs();
        PartialForge.cleanCumulativePartials();
        Dgidx.cleanDirs();
        
        // fetch extracted data files to forge input
        Forge.getIncomingData();
        LockManager.removeFlag("baseline_data_ready");
        
        // fetch config files to forge input
        Forge.getConfig();
        
        // archive logs and run ITL
        Forge.archiveLogDir();
        Forge.run();
        Dgidx.archiveLogDir();
        Dgidx.run();
        
        // distributed index, update Dgraphs
        DistributeIndexAndApply.run();

        // if Web Studio is integrated, update Web Studio with latest 
        // dimension values
        if (ConfigManager.isWebStudioEnabled()) {
          ConfigManager.cleanDirs();
          Forge.getPostForgeDimensions();
          ConfigManager.updateWsDimensions();
        }
        
        // archive state files, index
        Forge.archiveState();
        Dgidx.archiveIndex();
        
        // (start or) cycle the LogServer
        LogServer.cycle();
  
        // release lock
        LockManager.releaseLock("update_lock");
        log.info("Baseline update script finished.");
    } else {
        log.warning("Failed to obtain lock.");
    }
      ]]>
    </bean-shell-script>
  </script>

You run the baseline update by running baseline_update in the apps/[appDir]/control directory.

For example:
C:\Endeca\apps\DocApp\control>baseline_update.bat
There is also a very similar case in which the Deployment Template runs an incremental CAS crawl and then runs a partial update that incorporates the records from the crawl. To create this sequential workflow of an incremental CAS crawl followed by a partial update, do the following:
  • Remove the default PartialForge.isPartialDataReady check from the partial update script. This call checks whether a flag was set to indicate that the incoming data files are ready for Forge. Removing the call means that the lock manager does not check the flag, or wait on the flag, before running a CAS crawl.
  • Add a call to CAS.runIncrementalCasCrawl() to run the incremental CAS crawl.
  • Remove the call to PartialForge.getPartialIncomingData() that fetches extracted data files.
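For reference, the lines that you remove from the default partial update script look similar to the following sketch (the exact code varies by Deployment Template version):

    // test whether partial data files are ready for Forge
    if (PartialForge.isPartialDataReady()) {
        // ...
        // fetch extracted data files to forge input
        PartialForge.getPartialIncomingData();
        // ...
    }

In the sequential workflow, the incremental CAS crawl writes its records directly to the Record Store, so neither the readiness check nor the data fetch is needed.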
For example, this partial update script calls CAS.runIncrementalCasCrawl("MyCrawl"), which runs an incremental CAS crawl named MyCrawl. The script then continues with partial update processing.
  <!--
    ########################################################################
    # Partial update script
    #
  -->
  <script id="PartialUpdate">
    <log-dir>./logs/provisioned_scripts</log-dir>
    <provisioned-script-command>./control/partial_update.bat</provisioned-script-command>
    <bean-shell-script>
      <![CDATA[ 
    log.info("Starting partial update script.");
    
    // obtain lock
    if (LockManager.acquireLock("update_lock")) {

        // call the partial crawl script to run an incremental
        // CAS crawl.
        CAS.runIncrementalCasCrawl("MyCrawl");

        // archive logs
        PartialForge.archiveLogDir();
        
        // clean directories
        PartialForge.cleanDirs();
        
        // fetch config files to forge input
        PartialForge.getConfig();
        
        // run ITL
        PartialForge.run();
        
        // timestamp partial, save to cumulative partials dir
        PartialForge.timestampPartials();
        PartialForge.fetchPartialsToCumulativeDir();
        
        // distribute partial update, update Dgraphs
        DgraphCluster.cleanLocalPartialsDirs();
        DgraphCluster.copyPartialUpdateToDgraphServers();
        DgraphCluster.applyPartialUpdates();
        
        // archive partials
        PartialForge.archiveCumulativePartials();

      // release lock
      LockManager.releaseLock("update_lock");
      log.info("Partial update script finished.");
    } else {
      log.warning("Failed to obtain lock.");
    }
      ]]>
    </bean-shell-script>
  </script>

You run the partial update by running partial_update in the apps/[appDir]/control directory.

For example:
C:\Endeca\apps\DocApp\control>partial_update.bat

Multiple CAS crawls and multiple Forge updates

There is a more complicated case in which multiple CAS crawls run on their own schedules and baseline or partial updates run on theirs. To coordinate this asynchronous workflow of CAS crawls and baseline or partial updates, you add code that calls methods in ContentAcquisitionServerComponent.

For details about ContentAcquisitionServerComponent, see the EAC Component API Reference for CAS Server (Javadoc), installed in <Endeca installation path>\CAS\<version>\doc\cas-dt-javadoc.

In your AppConfig.xml code, the main coordination task is to determine when the CAS crawls run and when the baseline or partial updates that consume records from those crawls run. For example, suppose you have an application that runs three full CAS crawls whose records are consumed by a single baseline update. In that scenario, each of the three full crawls has its own full crawl script in AppConfig.xml that runs on a nightly schedule, and the AppConfig.xml file contains a baseline update script that runs nightly to consume the latest generation of records from each of the three crawls. The Forge.isDataReady check is not required in the baseline update script because the source data is not locked.

Here is an example script for one of the full CAS crawls, named endeca.
<!--
    ########################################################################
    # full crawl script
    #
 -->

  <script id="endeca_fullCasCrawldoc">
    <log-dir>./logs/provisioned_scripts</log-dir>
    <provisioned-script-command>./control/runcommand.bat endeca_fullCasCrawldoc</provisioned-script-command>
    <bean-shell-script>
      <![CDATA[ 
    crawlName = "endeca";
         
    log.info("Starting full CAS crawl '" + crawlName + "'.");
    
    // obtain lock
    if (LockManager.acquireLock("crawl_lock_" + crawlName)) {

      CAS.runBaselineCasCrawl(crawlName);

      LockManager.releaseLock("crawl_lock_" + crawlName);
    }
    else {
      log.warning("Failed to obtain lock.");
    }
    
    log.info("Finished full CAS crawl '" + crawlName + "'.");
      ]]>
    </bean-shell-script>
  </script>
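You can run this crawl script by passing its script ID to runcommand in the apps/[appDir]/control directory. For example (the DocApp path is illustrative):

C:\Endeca\apps\DocApp\control>runcommand.bat endeca_fullCasCrawldoc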
Here is an example script for a Forge baseline update that processes records from the three CAS crawls mentioned above. (The record adapter configuration in the Developer Studio project specifies that Forge reads from multiple Record Store instances.)
<!--
    ########################################################################
    # Baseline update script
    #
  -->
  <script id="BaselineUpdate">
    <log-dir>./logs/provisioned_scripts</log-dir>
    <provisioned-script-command>./control/baseline_update.bat</provisioned-script-command>
    <bean-shell-script>
      <![CDATA[ 
    log.info("Starting baseline update script.");
    // obtain lock
    if (LockManager.acquireLock("update_lock")) {
              
        if (ConfigManager.isWebStudioEnabled()) {
          // get Web Studio config, merge with Dev Studio config
          ConfigManager.downloadWsConfig();
          ConfigManager.fetchMergedConfig();
        } else {
          ConfigManager.fetchDsConfig();
        }
        
        // clean directories
        Forge.cleanDirs();
        PartialForge.cleanCumulativePartials();
        Dgidx.cleanDirs();
        
        // fetch extracted data files to forge input
        Forge.getIncomingData();
        LockManager.removeFlag("baseline_data_ready");
        
        // fetch config files to forge input
        Forge.getConfig();
        
        // archive logs and run ITL
        Forge.archiveLogDir();
        Forge.run();
        Dgidx.archiveLogDir();
        Dgidx.run();
        
        // distributed index, update Dgraphs
        DistributeIndexAndApply.run();

        // if Web Studio is integrated, update Web Studio with latest 
        // dimension values
        if (ConfigManager.isWebStudioEnabled()) {
          ConfigManager.cleanDirs();
          Forge.getPostForgeDimensions();
          ConfigManager.updateWsDimensions();
        }
        
        // archive state files, index
        Forge.archiveState();
        Dgidx.archiveIndex();
        
        // (start or) cycle the LogServer
        LogServer.cycle();
  
        // release lock
        LockManager.releaseLock("update_lock");
        log.info("Baseline update script finished.");
    } else {
        log.warning("Failed to obtain lock.");
    }
      ]]>
    </bean-shell-script>
  </script>