To move the crawl output files to the appropriate incoming directory, use one of the scripts located in the [appdir]/control/cas directory.
To load crawl record output files for use in the sample CAS pipeline:
MyCrawl_2008.07.18.02.46.15_CrawlerOutput-INCR-sgmt000.bin.gz will be renamed to the following when it is copied to the incoming/incremental directory:
000001_MyCrawl_2008.07.18.02.46.15_CrawlerOutput-INCR-sgmt000.bin.gz
The reason is that the delta update pipeline must keep only the most up-to-date copy of each incremental record. To make sure this happens, the incremental crawl files must be read in reverse chronological order, so that the first copy of each record read (which is the only copy the pipeline keeps) is the most recent one.
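The "first copy wins" behavior described above can be illustrated with a minimal Python sketch. It is not the CAS pipeline itself: the function name keep_first_copy, the record IDs, and the payload values are hypothetical, and each crawl file is modeled simply as a list of (record_id, payload) pairs.

```python
def keep_first_copy(files_newest_first):
    """Keep only the first copy seen of each record (hypothetical model).

    Each "file" is modeled as a list of (record_id, payload) pairs.
    """
    kept = {}
    for records in files_newest_first:
        for record_id, payload in records:
            # Later copies of the same record are ignored, so the
            # order in which files are read decides which copy survives.
            if record_id not in kept:
                kept[record_id] = payload
    return kept


# Three hypothetical incremental crawl files, listed oldest to newest.
chronological = [
    [("doc1", "v1"), ("doc2", "v1")],
    [("doc1", "v2")],
    [("doc2", "v3"), ("doc3", "v1")],
]

# Reading newest first means the first copy seen is the most recent one.
latest = keep_first_copy(reversed(chronological))

# Reading oldest first would keep stale copies instead.
stale = keep_first_copy(chronological)
```

In this sketch, latest holds the newest version of every record, while stale keeps the oldest copies of doc1 and doc2, which is the outcome the reversed file order is meant to prevent.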
This reordering is not required for a partial update pipeline, which reads the incremental files in chronological order and creates updates that will be processed by the Dgraph in the same order (effectively applying the most recent updates last). Therefore, incremental files being copied to the data/partials/incoming directory are not renamed.
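By contrast, applying updates in chronological order can be sketched as a simple overwrite, where the last update applied to a record wins. This is a simplified illustration under assumed names (apply_in_order and the sample records are hypothetical) and does not represent how the Dgraph actually processes updates.

```python
def apply_in_order(files_oldest_first):
    """Apply every update in the order read; later updates overwrite earlier ones.

    Each "file" is modeled as a list of (record_id, payload) pairs.
    """
    state = {}
    for records in files_oldest_first:
        for record_id, payload in records:
            # No "first copy" check: each update replaces the previous value.
            state[record_id] = payload
    return state


# The same kind of hypothetical incremental files, oldest to newest.
chronological = [
    [("doc1", "v1"), ("doc2", "v1")],
    [("doc1", "v2")],
    [("doc2", "v3"), ("doc3", "v1")],
]

# No reversal is needed: the most recent update is simply applied last.
current = apply_in_order(chronological)
```

Because each update overwrites the previous one, reading the files in their natural order already leaves the most recent version of every record in place.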