The Deployment Template includes a sample CAS crawl pipeline that is located in the [appdir]/config/cas_crawl_pipeline directory.
Make sure that you have updated the AppConfig.xml file to use the sample CAS crawl pipeline located in [appdir]/config/cas_crawl_pipeline. Specifically, you will need to update the devStudioConfigDir property in the ConfigManager section of the document.
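A minimal sketch of what that property might look like is shown below. This is illustrative only: the surrounding markup of the ConfigManager section varies by Deployment Template version, and only the property name devStudioConfigDir and the directory path are taken from the text above.

```xml
<!-- Illustrative fragment only: the enclosing ConfigManager element
     and exact attribute syntax depend on your Deployment Template
     version. Point devStudioConfigDir at the sample pipeline. -->
<property name="devStudioConfigDir" value="./config/cas_crawl_pipeline" />
```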
For example, the LoadFullCrawls record adapter, which reads the baseline crawl data, is defined as follows:

<RECORD_ADAPTER COL_DELIMITER="" COMPRESSION_LEVEL="1" DIRECTION="INPUT"
    FILTER_EMPTY_PROPS="TRUE" FORMAT="XML" FRC_PVAL_IDX="FALSE" MULTI="TRUE"
    NAME="LoadFullCrawls" PREFIX="" REC_DELIMITER="" REQUIRE_DATA="FALSE"
    ROW_DELIMITER="" STATE="FALSE" URL="./full/*.xml.gz">
  <COMMENT></COMMENT>
</RECORD_ADAPTER>
The pipeline is configured for delta updates: it reads in a baseline crawl data file (or multiple baseline crawl data files, if you create and run multiple crawls on the CAS Server) and one or more incremental crawl data files.
The pipeline looks as follows:
In this pipeline, the LoadFullCrawls record adapter reads in the baseline crawl data, while the LoadIncrementalCrawls record adapter reads in all the incremental data. A record assembler performs a First Record join on all the data sources, and the RemoveDeletes record manipulator then removes any record whose Endeca.Action property has a value of "DELETE".
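The join-and-remove logic described above can be sketched in Python. This is an illustrative model of the semantics, not Forge itself; the use of Endeca.Id as the join key and the sample records are assumptions for the example.

```python
# Sketch of the pipeline's merge semantics (not Forge code).
# A "First Record" join keeps, for each record key, the record from the
# earliest-listed source; listing the incremental source first lets its
# records take precedence over the baseline.

def first_record_join(sources, key="Endeca.Id"):
    """sources: list of record lists, ordered by priority (first wins)."""
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(record[key], record)
    return list(merged.values())

def remove_deletes(records):
    """Mirror the RemoveDeletes manipulator: drop DELETE-flagged records."""
    return [r for r in records if r.get("Endeca.Action") != "DELETE"]

# Hypothetical sample data for illustration only.
incremental = [
    {"Endeca.Id": "1", "Endeca.Action": "UPSERT", "title": "Updated doc"},
    {"Endeca.Id": "2", "Endeca.Action": "DELETE"},
]
baseline = [
    {"Endeca.Id": "1", "title": "Original doc"},
    {"Endeca.Id": "2", "title": "Doc to remove"},
    {"Endeca.Id": "3", "title": "Unchanged doc"},
]

result = remove_deletes(first_record_join([incremental, baseline]))
# Record 1 is replaced by its incremental version, record 2 is dropped,
# and record 3 passes through from the baseline unchanged.
```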
To run the sample CAS pipeline using the CAS crawl record output files:
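As a rough sketch, the run might look like the following. The paths are illustrative placeholders, and the assumption is that the crawl output files are copied into the directories the record adapters read from (./full and ./incremental relative to the Forge processing directory) before the Deployment Template's baseline update script is invoked.

```shell
# Illustrative only -- adjust [appdir] and the CAS output locations
# to match your environment.
cp /path/to/cas/output/baseline/*.xml.gz [appdir]/data/processing/full/
cp /path/to/cas/output/incremental/*.xml.gz [appdir]/data/processing/incremental/
cd [appdir]/control
./baseline_update.sh
```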
INFO: Baseline update script finished.

After completion, the Dgraph should be running on the host and port specified in the AppConfig.xml configuration file.