Loading crawl record output files for use in the sample CAS pipeline

To move the crawl output files to the appropriate incoming directory, use one of the scripts located in the [appdir]/control/cas directory.

To load crawl record output files for use in the sample CAS pipeline:

  1. To load baseline crawl data into the incoming/full directory and also copy any incremental data into the incoming/incremental directory, run the load_full_cas_crawl_data script. The record adapters in the pipeline look in these directories for the data files. Use this script if you are running baseline updates or delta updates.
  2. To load incremental crawl data into the data/partials/incoming directory, run the load_incremental_cas_crawl_data script. Use this script if you are running partial updates.

Note that by default, the load_full_cas_crawl_data script renames the incremental data files when they are copied to the incoming/incremental directory. For example, this file in the data/complete_cas_crawl_output/incremental directory:
MyCrawl_2008.07.18.02.46.15_CrawlerOutput-INCR-sgmt000.bin.gz
will be renamed to this when copied to the incoming/incremental directory:
000001_MyCrawl_2008.07.18.02.46.15_CrawlerOutput-INCR-sgmt000.bin.gz
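The renaming scheme can be sketched as follows. This is an illustrative reconstruction, not the script itself: the rename_incremental function is hypothetical, and it assumes the zero-padded counters are assigned newest file first, so that reading the renamed files in lexicographic order reads them in reverse chronological order (the reason for this ordering is explained below).

```python
def rename_incremental(filenames):
    """Return (old_name, new_name) pairs; the newest file gets prefix 000001.

    Crawl output names embed a sortable timestamp (YYYY.MM.DD.HH.MM.SS),
    so a plain lexicographic sort orders them chronologically; reversing
    that sort puts the newest file first.
    """
    newest_first = sorted(filenames, reverse=True)
    return [(name, "%06d_%s" % (i + 1, name))
            for i, name in enumerate(newest_first)]

pairs = rename_incremental([
    "MyCrawl_2008.07.18.02.46.15_CrawlerOutput-INCR-sgmt000.bin.gz",
    "MyCrawl_2008.07.19.01.10.05_CrawlerOutput-INCR-sgmt000.bin.gz",
])
# The 2008.07.19 file (the newer one) receives the 000001_ prefix.
```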

The reason is that the delta update pipeline must keep the most up-to-date copy of each incremental record. Because the pipeline keeps only the first copy of a record that it reads, the incremental crawl files must be read in reverse chronological order (newest first), so that the first copy read of each record is also the most recent one.

This reordering is not required for a partial update pipeline, which reads the incremental files in chronological order and creates updates that the Dgraph processes in the same order (so the most recent update to each record is applied last and therefore takes effect). Therefore, incremental files copied to the data/partials/incoming directory are not renamed.
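For contrast, the partial update case can be sketched the same way (again with illustrative names and a simplified (key, value) record form): when updates are applied in chronological order, the last copy of each record naturally overwrites the earlier ones, so no renaming or reordering is needed.

```python
def apply_in_order(records):
    """Apply (key, value) updates in read order; later updates overwrite
    earlier ones, so chronological order leaves the newest value in place."""
    state = {}
    for key, value in records:
        state[key] = value
    return state

# Files read in chronological order: the older copy of rec1 (v1) is
# applied first and then overwritten by the newer copy (v3).
final = apply_in_order([("rec1", "v1"), ("rec2", "v2"), ("rec1", "v3")])
```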