Running the sample CAS pipeline using the CAS crawl record output files

The Deployment Template includes a sample CAS crawl pipeline that is located in the [appdir]/config/cas_crawl_pipeline directory.

Make sure that you have updated the AppConfig.xml file to use the sample CAS crawl pipeline located in [appdir]/config/cas_crawl_pipeline. Specifically, update the devStudioConfigDir property in the ConfigManager section of that file.

Also, make sure you revise both pipeline.epx and partial_pipeline.epx so that the record adapter reads records in the format that the CAS crawl outputs. The format can be either XML or binary, and the URL indicates whether the files are compressed or uncompressed. For example, the following record adapter reads compressed XML (see the FORMAT and URL attributes):
<RECORD_ADAPTER COL_DELIMITER="" COMPRESSION_LEVEL="1" DIRECTION="INPUT" FILTER_EMPTY_PROPS="TRUE" FORMAT="XML" FRC_PVAL_IDX="FALSE"
MULTI="TRUE" NAME="LoadFullCrawls" PREFIX="" REC_DELIMITER="" REQUIRE_DATA="FALSE" ROW_DELIMITER="" STATE="FALSE" URL="./full/*.xml.gz">
<COMMENT></COMMENT>
</RECORD_ADAPTER>
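
As a quick sanity check before running an update, the adapter definition can be inspected programmatically. The following Python sketch parses the record adapter shown above and reports its FORMAT and whether its URL points at compressed files. The helper name describe_adapter is hypothetical, and treating a ".gz" suffix as the sign of compression is an assumption based on the example URL, not a documented rule.

```python
import xml.etree.ElementTree as ET

# The record adapter element copied from the example above.
ADAPTER_XML = """<RECORD_ADAPTER COL_DELIMITER="" COMPRESSION_LEVEL="1" DIRECTION="INPUT" FILTER_EMPTY_PROPS="TRUE" FORMAT="XML" FRC_PVAL_IDX="FALSE"
MULTI="TRUE" NAME="LoadFullCrawls" PREFIX="" REC_DELIMITER="" REQUIRE_DATA="FALSE" ROW_DELIMITER="" STATE="FALSE" URL="./full/*.xml.gz">
<COMMENT></COMMENT>
</RECORD_ADAPTER>"""


def describe_adapter(xml_text):
    """Summarize a RECORD_ADAPTER element (hypothetical helper)."""
    adapter = ET.fromstring(xml_text)
    url = adapter.get("URL", "")
    return {
        "name": adapter.get("NAME"),
        "format": adapter.get("FORMAT"),
        # Assumption: a ".gz" suffix in the URL means compressed input.
        "compressed": url.endswith(".gz"),
    }


info = describe_adapter(ADAPTER_XML)
print(info)
```

If the reported format or compression does not match the files that the CAS crawl wrote, adjust the FORMAT and URL attributes in both pipeline files.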

The pipeline is configured for delta updates: it reads in a baseline crawl data file (or multiple baseline crawl data files, if you create and run multiple crawls in the CAS Server) and one or more incremental crawl data files.

The pipeline looks as follows:

As mentioned above, the LoadFullCrawls record adapter reads in the baseline crawl data while the LoadIncrementalCrawls record adapter reads in all the incremental data. A record assembler performs a First Record join on all the data sets, while the RemoveDeletes record manipulator removes any record whose Endeca.Action property has a value of "DELETE".
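
The join-then-filter behavior described above can be illustrated with a small Python sketch. This is not the pipeline's implementation, only a model of the idea under stated assumptions: records are plain dictionaries, the join key is the Endeca.Id property, the incremental source is consulted before the baseline source so that newer records win, and the "UPSERT" action value and "title" property are illustrative sample data. The function names are hypothetical.

```python
from typing import Dict, Iterable, List

Record = Dict[str, str]


def first_record_join(sources: Iterable[Iterable[Record]],
                      key: str = "Endeca.Id") -> List[Record]:
    """For each key value, keep the first record seen across the sources, in order."""
    first_seen: Dict[str, Record] = {}
    for source in sources:
        for record in source:
            # setdefault stores the record only if the key has not been seen yet.
            first_seen.setdefault(record[key], record)
    return list(first_seen.values())


def remove_deletes(records: Iterable[Record]) -> List[Record]:
    """Drop any record whose Endeca.Action property has the value "DELETE"."""
    return [r for r in records if r.get("Endeca.Action") != "DELETE"]


# Illustrative data only.
baseline = [
    {"Endeca.Id": "a", "Endeca.Action": "UPSERT", "title": "old a"},
    {"Endeca.Id": "b", "Endeca.Action": "UPSERT", "title": "old b"},
    {"Endeca.Id": "c", "Endeca.Action": "UPSERT", "title": "old c"},
]
incremental = [
    {"Endeca.Id": "a", "Endeca.Action": "UPSERT", "title": "new a"},
    {"Endeca.Id": "b", "Endeca.Action": "DELETE"},
]

# Assumption: incremental data is listed first so that it takes precedence.
merged = remove_deletes(first_record_join([incremental, baseline]))
print(merged)
```

In this model, record "a" takes its incremental version, record "b" is dropped because its surviving version is a delete, and record "c" passes through unchanged from the baseline.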

To run the sample CAS pipeline using the CAS crawl record output files:

Run the baseline_update script located in [appdir]/control.
The baseline_update script displays the following informational message when the process is complete:
INFO: Baseline update script finished.
After completion, the Dgraph should be running on the host and port specified in the AppConfig.xml configuration file.
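
If you want to drive the update from another tool, the step above can be wrapped in a short Python sketch that runs the script and looks for the completion message. The script file name baseline_update.sh, the assumption that the script exits with status 0 on success, and both function names are hypothetical; substitute the script variant and checks that apply to your platform.

```python
import subprocess
from pathlib import Path

# Completion message quoted in the step above.
FINISHED_MESSAGE = "INFO: Baseline update script finished."


def baseline_finished(output: str) -> bool:
    """Return True if any line of the script output contains the completion message."""
    return any(FINISHED_MESSAGE in line for line in output.splitlines())


def run_baseline_update(appdir: str, script_name: str = "baseline_update.sh") -> bool:
    """Run the script in [appdir]/control and report whether it finished.

    Assumptions: script_name matches your platform, and exit status 0 means success.
    """
    script = Path(appdir) / "control" / script_name
    result = subprocess.run([str(script)], capture_output=True, text=True, check=False)
    # Check both streams, since the message may be logged to either one.
    return result.returncode == 0 and baseline_finished(result.stdout + result.stderr)
```

For example, run_baseline_update("/path/to/appdir") returns True only when the script exits cleanly and the completion message appears in its output.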