This topic describes how to set up a baseline crawl script that manages record output files.
Because the script is essentially the same for all file system and CMS crawl configurations, this topic uses the EndecaCasCrawlConfig.xml sample, which defines a crawl named Endeca, to illustrate it.
<script id="Endeca_baselineCasCrawl">
<![CDATA[
crawlName = "Endeca";
- Check if the crawl is set to write output to a record output file, and throw an exception if the crawl is set to output to a Record Store instance.
if (!CAS.isCrawlFileOutput(crawlName)) {
throw new UnsupportedOperationException("The crawl " + crawlName +
" does not have a File System output type. The only supported " +
"output type for this script is File System.");
}
log.info("Starting full CAS crawl '" + crawlName + "'.");
- Obtain a lock on the crawl. The baseline crawl attempts to set a flag in the EAC that serves as a lock (mutex). The flag name is the string "crawl_lock_" followed by the crawl name ("crawl_lock_Endeca" in this example). If the flag is already set, this step fails, which ensures that a crawl (baseline or incremental) cannot run more than once at a time and interfere with data processing. The flag is removed when the script completes successfully or when an error occurs.
// obtain lock
if (LockManager.acquireLock("crawl_lock_" + crawlName)) {
- Clean the output directories. Baseline and incremental crawl output files from previous crawls are removed from the crawl's configured output directory.
CAS.cleanOutputDir(crawlName);
- Run the baseline crawl. The baseline crawl is run with the crawl name as the ID.
CAS.runBaselineCasCrawl(crawlName);
- Rename the output file. The baseline crawl output file is renamed by prefixing the crawl name.
CAS.renameBaselineCrawlOutput(crawlName);
- Obtain a second lock on the complete crawl data directory. The script attempts to set a flag that serves as a lock on the data/complete_cas_crawl_output directory. If the flag is already set, this step waits up to 10 minutes (600 seconds) for the flag to become available. If the flag is still set after 10 minutes, the step fails and the renamed output file is not copied. This lock synchronizes access to the directory, so that a downstream process such as the baseline update does not retrieve a partially delivered crawl file.
// try to acquire a lock on the complete crawl data directory
// for up to 10 minutes
if (LockManager.acquireLockBlocking("complete_cas_crawl_data_lock",
600)) {
- Get the path of the output destination directory. This is the directory to which the baseline crawl output file will be copied. The directory name is specified by the casCrawlFullOutputDestDir property in the fetchCasCrawlDataConfig.xml file; the default is data/complete_cas_crawl_output/full.
destDir = PathUtils.getAbsolutePath(CAS.getWorkingDir(),
CAS.getCasCrawlFullOutputDestDir());
- Create the destination directory. The destination directory for the crawl output file is created if it does not exist. Its absolute path is held in the destDir variable.
// create the target dir, if it doesn't already exist
mkDirUtil = new CreateDirUtility(CAS.getAppName(),
CAS.getEacHost(), CAS.getEacPort(), CAS.isSslEnabled());
mkDirUtil.init(Forge.getHostId(), destDir, CAS.getWorkingDir());
mkDirUtil.run();
- Delete existing baselines. To ensure that no previous baseline crawl files are left, all baseline output files (if they exist) must be removed.
// clear the destination dir of full crawl files, in case
// we are not overwriting the same file such as when the
// crawl output format has changed.
CAS.clearFullCrawlOutputFromDestinationDir(crawlName);
- Delete existing incrementals. Because this is a baseline crawl, existing incremental output files must be removed.
// remove previously collected incremental crawl files,
// which are expected to be incorporated in this full crawl
CAS.clearIncrementalCrawlOutputFromDestinationDir(crawlName);
- Copy the output file to the destination directory. The renamed baseline crawl output file is copied from the original output directory to the destination directory (data/complete_cas_crawl_output/full by default).
// deliver crawl output to destination directory
CAS.copyBaselineCrawlOutputToDestinationDir(crawlName);
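The delivery steps above (create the destination directory, clear earlier output, copy the new file) can be approximated with plain java.nio calls. The following is a minimal sketch under stated assumptions, not the CAS component's implementation: the class CrawlOutputDelivery, its deliver method, and the idea that earlier output is recognized by a crawl-name prefix on the file name are hypothetical and serve only to illustrate the ordering. It also handles a single directory, whereas the script clears full and incremental output separately.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical stand-in for the delivery steps; not the CAS component API.
class CrawlOutputDelivery {

    // Creates destDir if needed, removes earlier output for this crawl,
    // then copies the renamed output file into destDir.
    static Path deliver(Path renamedOutput, Path destDir, String crawlName)
            throws IOException {
        // No-op if the directory already exists.
        Files.createDirectories(destDir);

        // Assumption: earlier output files start with the crawl name.
        // Matching on the prefix (not the full name) also removes files whose
        // extension differs, such as after an output format change.
        try (DirectoryStream<Path> old =
                 Files.newDirectoryStream(destDir, crawlName + "*")) {
            for (Path p : old) {
                if (Files.isRegularFile(p)) {
                    Files.delete(p);
                }
            }
        }

        // Copy rather than move, so the crawl's own output directory is untouched.
        Path target = destDir.resolve(renamedOutput.getFileName());
        return Files.copy(renamedOutput, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Clearing before copying matters because a plain overwrite would leave a stale file behind whenever the new output file has a different name than the old one.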
- Release the second lock. The "complete_cas_crawl_data_lock" flag is removed from the EAC, indicating that the copy operation was successful.
// release lock on the crawl data directory
LockManager.releaseLock("complete_cas_crawl_data_lock");
- Release the first lock. The "crawl_lock_Endeca" flag is removed from the EAC (indicating that the crawl operation was successful) and a "finished" message is logged.
LockManager.releaseLock("crawl_lock_" + crawlName);
...
log.info("Finished full CAS crawl '" + crawlName + "'.");
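The two locking styles used by the script, a fail-fast acquire for the crawl lock and a blocking acquire with a timeout for the data directory lock, can be modeled in plain Java. This is a hedged sketch only: FlagLocks and BaselineFlow are hypothetical in-memory stand-ins, whereas the real flags are stored in the EAC and managed by LockManager. The try/finally blocks model the statement that the flag is removed both on success and on error.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory model of named flags; the real flags live in the EAC.
class FlagLocks {
    private final Set<String> flags = ConcurrentHashMap.newKeySet();

    // Fail-fast: returns false immediately if the flag is already set.
    boolean acquireLock(String name) {
        return flags.add(name);
    }

    // Blocking: polls until the flag is free or timeoutSeconds have elapsed.
    boolean acquireLockBlocking(String name, int timeoutSeconds)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeoutSeconds * 1_000_000_000L;
        while (!flags.add(name)) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out; caller must skip the protected step
            }
            Thread.sleep(50);
        }
        return true;
    }

    void releaseLock(String name) {
        flags.remove(name);
    }
}

// Hypothetical outline of the script's lock ordering.
class BaselineFlow {
    static boolean run(FlagLocks locks, String crawlName, int timeoutSeconds,
                       Runnable crawl, Runnable deliver) throws InterruptedException {
        String crawlLock = "crawl_lock_" + crawlName;
        if (!locks.acquireLock(crawlLock)) {
            return false; // another crawl with this name is already running
        }
        try {
            crawl.run(); // clean, run, and rename steps
            String dataLock = "complete_cas_crawl_data_lock";
            if (!locks.acquireLockBlocking(dataLock, timeoutSeconds)) {
                return false; // output is not delivered
            }
            try {
                deliver.run(); // create dir, clear old output, copy new output
            } finally {
                locks.releaseLock(dataLock);
            }
            return true;
        } finally {
            locks.releaseLock(crawlLock); // released on success and on error
        }
    }
}
```

A call such as `BaselineFlow.run(locks, "Endeca", 600, crawlStep, deliverStep)` mirrors the ten-minute wait in the script. The inner lock is released before the outer one, matching the order of the release steps above.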