Batch failure and recovery

The Batch Processor uses parallel processors to progress through cases quickly and efficiently. However, when multiple parallel processors are in use, an error can prevent some, but not all, of the processors from continuing. In this scenario it is important that the Batch Processor handles the error gracefully and that it can be restarted to complete the processing of any cases that were missed.

Handling errors

Two types of errors can occur while the Batch Processor is running: fatal and non-fatal. The Batch Processor handles each type differently.

Fatal errors

Fatal errors are unexpected errors that apply to a processor as a whole; for example, if the database the processor is reading from or writing to becomes unexpectedly unavailable. Fatal errors may affect one or more of the parallel processors, but do not necessarily affect all parallel processors.

When a fatal error is encountered, the error details are logged according to the log configuration, and the affected processor is stopped. If multiple parallel processors are in use, any unaffected processors continue working so that as many cases as possible are processed. The final summary message provided by the Batch Processor indicates that an error occurred during processing and identifies the total number of cases that were successfully processed.

Non-Fatal errors

Non-Fatal errors are predictable errors that apply to a single case only; for example, data validation errors (such as trying to read the value 'abc' into a numeric attribute) and errors returned by a specific rule in a rulebase.

When a non-fatal error is encountered, the error details are logged according to the log configuration, the affected case is ignored, and the processor continues on to the next case. The final summary message provided by the Batch Processor identifies the total number of cases that were successfully processed and the total number of cases that were ignored due to non-fatal errors.
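The difference between the two error types described above can be illustrated with a minimal Python sketch. It is not the Batch Processor's implementation: the FatalError and NonFatalError classes, the Summary record, and the run_processor and parse_amount functions are all hypothetical names introduced only for illustration. The sketch shows a single processor skipping a case on a non-fatal error, stopping on a fatal error, and reporting counts similar to those in the final summary message.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("batch_sketch")


class FatalError(Exception):
    """Hypothetical: the processor as a whole cannot continue."""


class NonFatalError(Exception):
    """Hypothetical: a single case cannot be processed."""


@dataclass
class Summary:
    processed: int = 0   # cases completed successfully
    ignored: int = 0     # cases skipped due to non-fatal errors
    fatal: bool = False  # True if the processor stopped early


def run_processor(cases, process_case):
    """Process cases in order; skip on NonFatalError, stop on FatalError."""
    summary = Summary()
    for case in cases:
        try:
            process_case(case)
            summary.processed += 1
        except NonFatalError as exc:
            log.warning("case %r ignored: %s", case, exc)
            summary.ignored += 1
        except FatalError as exc:
            log.error("processor stopped: %s", exc)
            summary.fatal = True
            break
    return summary


def parse_amount(case):
    # Reading 'abc' into a numeric attribute is a per-case validation error.
    try:
        return float(case["amount"])
    except ValueError as exc:
        raise NonFatalError(f"invalid amount {case['amount']!r}") from exc


summary = run_processor(
    [{"amount": "10"}, {"amount": "abc"}, {"amount": "2.5"}],
    parse_amount,
)
```

In this sketch the second case is ignored and the other two are processed. With several processors running in parallel, each would run its own loop, so a fatal error in one would not stop the others.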

Recovery after failure

If the Batch Processor has encountered a fatal error, it is likely there will be cases that were not processed. Once the cause of the fatal error has been identified and resolved, the Batch Processor can be run again to reprocess all cases, including those missed previously due to the fatal error.

Best practice - identifying processed and unprocessed cases

To assist in recovery from a failure, a means of easily identifying processed and unprocessed cases should be built into the data sources and rulebases being used. The recommended approach is to use a top-level attribute that records whether each case has been processed.

When the Batch Processor writes its output to a database, always treat the database tables being updated as the authoritative record of which cases were processed, regardless of the log messages that the Batch Processor has produced.

Using a top-level attribute, the data source can be examined after a fatal error to easily identify the cases that have been successfully processed and the cases that have not.
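As a rough illustration of examining the data source after a failure, the following Python sketch uses an in-memory SQLite database. The cases table, its columns (including the processed flag that stands in for the top-level attribute), and the case_ids helper are assumptions made for the example, not names required by the Batch Processor.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical case table; 'processed' stands in for the top-level attribute.
conn.execute(
    """
    CREATE TABLE cases (
        case_id   INTEGER PRIMARY KEY,
        income    REAL,
        eligible  INTEGER,
        processed INTEGER NOT NULL DEFAULT 0
    )
    """
)
conn.executemany(
    "INSERT INTO cases (case_id, income) VALUES (?, ?)",
    [(1, 100.0), (2, 250.0), (3, 75.0), (4, 410.0)],
)

# Simulate a run that wrote results for cases 1 and 2 before a fatal error.
conn.execute(
    "UPDATE cases SET eligible = 1, processed = 1 WHERE case_id IN (1, 2)"
)
conn.commit()


def case_ids(conn, processed):
    """Return the ids of cases whose processed flag matches the argument."""
    rows = conn.execute(
        "SELECT case_id FROM cases WHERE processed = ? ORDER BY case_id",
        (int(processed),),
    )
    return [row[0] for row in rows]


done = case_ids(conn, True)     # cases the failed run completed
missed = case_ids(conn, False)  # cases still to be processed
```

Because the flag lives in the same table that receives the results, the query reflects what was actually written rather than what the logs report.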

Best practice - presenting unprocessed cases with database views

When connecting to a database, the Batch Processor may be responsible for reading and updating data for a very large number of cases. To make recovery from failure more efficient, the data source can be designed so that only unprocessed cases are presented to the Batch Processor. The recommended approach is to combine a method of clearly identifying cases that need to be processed (such as the recommended approach described above) with database views.

Using this approach, the Batch Processor will only read and update data for cases that need to be processed. When restarting the Batch Processor after a fatal error, it will only process those cases that were not processed in the previous run.
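The view-based approach can be sketched in Python with an in-memory SQLite database. The cases table, the processed column, the unprocessed_cases view, and the pending helper are hypothetical names chosen for this example. The sketch assumes the processed flag is set when a case's results are written back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical base table with a flag recording processed cases.
    CREATE TABLE cases (
        case_id   INTEGER PRIMARY KEY,
        income    REAL,
        processed INTEGER NOT NULL DEFAULT 0
    );
    INSERT INTO cases (case_id, income)
    VALUES (1, 100.0), (2, 250.0), (3, 75.0), (4, 410.0);

    -- The view exposes only the cases that still need processing.
    CREATE VIEW unprocessed_cases AS
        SELECT case_id, income FROM cases WHERE processed = 0;
    """
)


def pending(conn):
    """Return the case ids currently presented by the view."""
    rows = conn.execute(
        "SELECT case_id FROM unprocessed_cases ORDER BY case_id"
    )
    return [row[0] for row in rows]


first_run = pending(conn)  # every case is presented on the first run

# Simulate a run that completed cases 1 and 2 before a fatal error.
conn.execute("UPDATE cases SET processed = 1 WHERE case_id IN (1, 2)")
conn.commit()

restart = pending(conn)  # only the missed cases are presented on restart
```

The sketch updates the base table directly because SQLite views are read-only. Whether updates can be made through the view itself depends on the database product in use.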