Fetch baseline crawl data script

This fetch script copies the crawl data to the appropriate directories for all baseline update operations, including those performed with a delta update pipeline. The script is reproduced in this section, with numbered steps describing the actions performed at each point in the script.

Note that the script does not actually perform the baseline update itself; that update operation is managed by scripts in the AppConfig.xml document.

<script id="fetchFullCasCrawlData">
    <![CDATA[
  log.info("Fetching full CAS crawl data for processing.");
  1. Obtain a lock on the complete crawl data directory. The script attempts to set a flag that serves as a lock on the data/complete_cas_crawl_output directory. If the flag is already set, this step waits for up to 10 minutes for the flag to become available. If the flag is still set after 10 minutes, the step fails and the crawl output is not copied. This lock synchronizes access to the directory, so that a downstream process such as the baseline update does not retrieve a half-delivered crawl file.
      // try to acquire a lock on the complete crawl data directory
      // for up to 10 minutes
      if (LockManager.acquireLockBlocking("complete_cas_crawl_data_lock",
          600)) {
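The blocking acquire described in Step 1 amounts to a poll-and-wait loop over a shared flag. The following is a minimal, self-contained sketch of that pattern; the in-memory flag set and the method name are illustrative stand-ins, not the actual LockManager API:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical in-memory flag store standing in for the EAC flag state.
Set<String> flags = new HashSet<>();

// Try to set the flag, polling until it succeeds or the timeout expires.
// The timeout is in milliseconds here for brevity; the script uses seconds.
boolean tryAcquire(Set<String> flags, String name, long timeoutMs)
        throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    do {
        // Set.add returns true only if the flag was not already set,
        // so a successful add is equivalent to acquiring the lock.
        if (flags.add(name)) {
            return true;
        }
        Thread.sleep(10); // idle briefly before retrying
    } while (System.currentTimeMillis() < deadline);
    return false; // flag never became available: the fetch fails
}
```

A second caller attempting to acquire the same flag before it is removed spins until the deadline and then returns false, which is the failure path described above.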
    
  2. Remove the baseline data ready flag. The "baseline_data_ready" flag is removed from the EAC, which ensures that a baseline update does not start until all of the data sources have been copied to the proper directories.
      // remove baseline data ready flag, ensuring baseline doesn't start
      // before data is completely copied and ready for processing
      LockManager.removeFlag("baseline_data_ready");
    
  3. Get the paths of the source data directories. The directory names are set by the casCrawlFullOutputDestDir and casCrawlIncrementalOutputDestDir properties of the custom-component section.
      fullSrcDir = PathUtils.getAbsolutePath(CAS.getWorkingDir(),
        CAS.getCasCrawlFullOutputDestDir()) + "/*";
      incrSrcDir = PathUtils.getAbsolutePath(CAS.getWorkingDir(),
        CAS.getCasCrawlIncrementalOutputDestDir()) + "/*";
    
  4. Get the paths of the destination directories. The paths of the destination directories (to which the crawl output files will be copied) are obtained. The base directory is specified by the IncomingDataDir property (./data/incoming is the default) in the Forge section of the AppConfig.xml file. Note that /full and /incremental are appended to the incoming directory name.
      fullDestDir = PathUtils.getAbsolutePath(Forge.getWorkingDir(),
        Forge.getIncomingDataDir()) + "/full";
      incrDestDir = PathUtils.getAbsolutePath(Forge.getWorkingDir(),
        Forge.getIncomingDataDir()) + "/incremental";
    
  5. Create the destination directories. The destination directories for the source data files are created if they do not exist. The directory names are in the fullDestDir and incrDestDir variables.
      // create destination directories
      mkDirUtil = new CreateDirUtility(Forge.getAppName(),
        Forge.getEacHost(), Forge.getEacPort(), Forge.isSslEnabled());
      mkDirUtil.init(Forge.getHostId(), fullDestDir, Forge.getWorkingDir());
      mkDirUtil.run();
    
      mkDirUtil.init(Forge.getHostId(), incrDestDir, Forge.getWorkingDir());
      mkDirUtil.run();
    
  6. Copy the source data to the destination directories. A CopyUtility object (named crawlDataCopy) is instantiated and used to copy the full and incremental source data to the data/incoming directories.
      crawlDataCopy = new CopyUtility(Forge.getAppName(),
        Forge.getEacHost(), Forge.getEacPort(), Forge.isSslEnabled()); 
    
      // copy full crawl data
      crawlDataCopy.init("copy_complete_cas_full_crawl_data",
        CAS.getCasCrawlOutputDestHost(), Forge.getHostId(), fullSrcDir,
        fullDestDir, true);
      crawlDataCopy.run();
    
      // copy incremental crawl data
      crawlDataCopy.init("copy_complete_cas_incremental_crawl_data",
        CAS.getCasCrawlOutputDestHost(), Forge.getHostId(), incrSrcDir,
        incrDestDir, true);
      crawlDataCopy.run();
    
  7. If no incremental files exist, create a dummy file. Forge fails when running the delta pipeline if there are no incremental files. Therefore, the script checks whether incremental files exist and, if none do, creates a dummy file named "placeholder.bin.gz". If at least one incremental file exists, the script skips to Step 8 (the else statement). Note that the comments in the following code were added to explain the steps.
    Note: This step is required to support the behavior of the default pipeline included with the Deployment Template. Specifically, the baseline pipeline always expects to read a set of full crawl files and a set of incremental crawl files and joins these by keeping the most recent copy of each record that's available between the files. Forge fails when no incremental files are available, so this dummy file ensures that the pipeline works when a full crawl has been run, but no incremental crawls have been run. For pipeline implementations that do not require such a dummy file (e.g., pipelines that only process full crawls), this step can be removed.
      // test for existing incremental files, since the dummy file is only
      // needed when there are no real incremental files
      if (! fileUtil.dirContainsFiles(incrDestDir, Forge.getHostId())) {
        // create a variable for the dummy file name and location
        placeholder = incrDestDir + "/placeholder.bin";
        // create Unix touch and gzip commands
        touchCmd = "touch " + placeholder;
        zipCmd = "gzip " + placeholder;
        // for Windows platforms, rewrite the commands using Win commands
        if (System.getProperty("os.name").startsWith("Win")) {
          touchCmd = "%ENDECA_ROOT%\\utilities\\touch.exe " + placeholder;
          zipCmd = "%ENDECA_ROOT%\\utilities\\gzip.exe " + placeholder;
        }
        // use a ShellUtility to touch (i.e. create) the dummy file
        shell = new ShellUtility(Forge.getAppName(), Forge.getEacHost(),
          Forge.getEacPort(), Forge.isSslEnabled());
        shell.init("create_incremental_cas_crawl_placeholder",
          Forge.getHostId(), touchCmd, Forge.getWorkingDir());
        shell.run();
        // use the same ShellUtility to produce a .bin.gz compressed file
        shell.init("zip_incremental_cas_crawl_placeholder",
          Forge.getHostId(), zipCmd, Forge.getWorkingDir());
        shell.run();
      } // end of if clause
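Note that a valid empty .gz file can also be produced without shelling out to platform-specific touch and gzip binaries. The sketch below uses the standard java.util.zip classes for the same effect; it is an alternative illustration, not the approach the script above takes:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Create the placeholder and write a zero-byte gzip stream into it;
// closing the stream emits a valid gzip header and trailer, so a reader
// sees the file as an empty but well-formed compressed input.
File placeholder = File.createTempFile("placeholder", ".bin.gz");
new GZIPOutputStream(new FileOutputStream(placeholder)).close();
```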
    
  8. If the test at Step 7 showed that the incremental directory was not empty, rename the incremental files (which have timestamped names) so that they are read in reverse chronological order. This means that the name of the latest (most recent) file must begin with 000001, the next most recent with 000002, and so on.
    Note: As with the previous step, this logic is required to support the behavior of the default pipeline. This step ensures that Forge keeps the most up-to-date copy of any record, ignoring any older copies. Since files are read and processed in alphanumeric order, renaming them ensures that the most recent records are processed first.
      // incremental files do exist, so rename them
      else {
        // get the number of files, to be used to generate the prefix
        incrFiles = fileUtil.getDirContents(incrDestDir, Forge.getHostId());
        fileNum = incrFiles.size();
        // import Java classes we will use for the renaming
        import java.text.NumberFormat;
        import java.text.DecimalFormat;
        import java.util.SortedMap;
        import java.util.TreeMap;
        import java.io.File;
        // instantiate a NumberFormat to format the prefix name
        NumberFormat formatter = new DecimalFormat("000000");
        // instantiate a SortedMap and add the file names,
        // which will be in an ascending key order
        SortedMap sortedFiles = new TreeMap();
        sortedFiles.putAll(incrFiles);
        // loop through the sorted treemap
        for (incrFile : sortedFiles.keySet()) {
          // generate a filename prefix, based on the number of files left
          prefix = formatter.format(fileNum);
          // get the original filename and prepend the generated prefix
          origFileName = PathUtils.getFileNameFromPath(incrFile);
          newFileName = prefix + "_" + origFileName;
          // generate the pathname to which we will rename the file
          absNewFile = PathUtils.getAbsolutePath(Forge.getWorkingDir(),
            Forge.getIncomingDataDir()) + File.separator + "incremental" +
            File.separator + newFileName;
          // use the LocalMoveUtility to rename the file
          renameUtil = new LocalMoveUtility(Forge.getAppName(),
            Forge.getEacHost(), Forge.getEacPort(), Forge.isSslEnabled());
          renameUtil.init(Forge.getHostId(), incrFile, absNewFile,
            Forge.getWorkingDir());
          renameUtil.run();
          // decrease the fileNum variable by one so that the name of the
          // next file will be numerically more recent
          fileNum--;
        } // end of for loop
      } // end of else clause
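The effect of the renaming loop can be checked with a small worked example. With three timestamped files (the names below are illustrative), the prefix counts down from the file total as the loop walks the ascending sort order, so the most recent file receives the 000001 prefix and sorts first:

```java
import java.text.DecimalFormat;
import java.text.NumberFormat;
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

NumberFormat formatter = new DecimalFormat("000000");

// Three incremental crawl files; ascending sort order is oldest first.
SortedMap<String, String> files = new TreeMap<>();
files.put("crawl-20240101.bin.gz", "");
files.put("crawl-20240102.bin.gz", "");
files.put("crawl-20240103.bin.gz", "");

List<String> renamed = new ArrayList<>();
int fileNum = files.size();
for (String name : files.keySet()) {
    // the oldest file gets the highest prefix, the newest gets 000001
    renamed.add(formatter.format(fileNum) + "_" + name);
    fileNum--;
}
// renamed: [000003_crawl-20240101.bin.gz, 000002_crawl-20240102.bin.gz,
//           000001_crawl-20240103.bin.gz]
```

Reading the renamed files back in alphanumeric order therefore starts with 000001_crawl-20240103.bin.gz, the newest file, which is exactly the ordering the delta pipeline relies on.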
    
  9. Set the baseline data ready flag. The "baseline_data_ready" flag is set in the EAC, signaling that all of the data has been copied and a baseline update can now safely run.
      // (re)set flag indicating that the baseline can process incoming data
      LockManager.setFlag("baseline_data_ready");
    
  10. Release the lock. The "complete_cas_crawl_data_lock" flag is removed from the EAC, indicating that the fetch operation was successful. A "finished" message is also logged.
      // release lock on the crawl data directory
      LockManager.releaseLock("complete_cas_crawl_data_lock");
      ...
      log.info("Crawl data fetch script finished.");