Retrieving and updating records in bulk

Bulk retrieval
Bulk updates
Bulk deletes

When retrieving or modifying large numbers of records, the per-call overhead of the method calls can often dominate performance. Berkeley DB offers bulk get, put, and delete interfaces that can significantly increase performance for some applications.

Bulk retrieval

To retrieve records in bulk, an application buffer must be specified to the DB->get() or DBC->get() methods. This is done in the C API by setting the data and ulen fields of the data DBT to reference an application buffer, and the flags field of that structure to DB_DBT_USERMEM. In the Berkeley DB C++ and Java APIs, the actions are similar, although there are API-specific methods to set the DBT values. Then, the DB_MULTIPLE or DB_MULTIPLE_KEY flags are specified to the DB->get() or DBC->get() methods, which cause multiple records to be returned in the specified buffer.

The difference between DB_MULTIPLE and DB_MULTIPLE_KEY is as follows: DB_MULTIPLE returns multiple data items for a single key. For example, the DB_MULTIPLE flag would be used to retrieve all of the duplicate data items for a single key in a single call. The DB_MULTIPLE_KEY flag is used to retrieve multiple key/data pairs, where each returned key may or may not have duplicate data items.

Once the DB->get() or DBC->get() method has returned, the application will walk through the buffer handling the returned records. This is implemented for the C and C++ APIs using four macros: DB_MULTIPLE_INIT, DB_MULTIPLE_NEXT, DB_MULTIPLE_KEY_NEXT, and DB_MULTIPLE_RECNO_NEXT. For the Java API, this is implemented as three iterator classes: MultipleDataEntry, MultipleKeyDataEntry, and MultipleRecnoDataEntry.

The DB_MULTIPLE_INIT macro is always called first. It initializes a local application variable and the data DBT for stepping through the set of returned records. Then, the application calls one of the remaining three macros: DB_MULTIPLE_NEXT, DB_MULTIPLE_KEY_NEXT, or DB_MULTIPLE_RECNO_NEXT.

If the DB_MULTIPLE flag was specified to the DB->get() or DBC->get() method, the application will always call the DB_MULTIPLE_NEXT macro. If the DB_MULTIPLE_KEY flag was specified to the DB->get() or DBC->get() method, and the underlying database is a Btree or Hash database, the application will always call the DB_MULTIPLE_KEY_NEXT macro. If the DB_MULTIPLE_KEY flag was specified to the DB->get() or DBC->get() method, and the underlying database is a Queue or Recno database, the application will always call the DB_MULTIPLE_RECNO_NEXT macro. The DB_MULTIPLE_NEXT, DB_MULTIPLE_KEY_NEXT, and DB_MULTIPLE_RECNO_NEXT macros are called repeatedly, until the end of the returned records is reached. The end of the returned records is detected by the application's local pointer variable being set to NULL.

Note that if you want to use a cursor for bulk retrieval of records in a Btree database, you should open the cursor using the DB_CURSOR_BULK flag. This optimizes the cursor for bulk retrieval.

The following is an example of a routine that displays the contents of a Btree database using the bulk return interfaces.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <db.h>

int
rec_display(DB *dbp)
{
    DBC *dbcp;
    DBT key, data;
    size_t retklen, retdlen;
    void *retkey, *retdata;
    int ret, t_ret;
    void *p;

    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));

    /* Review the database in 5MB chunks. */
#define    BUFFER_LENGTH    (5 * 1024 * 1024)
    if ((data.data = malloc(BUFFER_LENGTH)) == NULL)
        return (errno);
    data.ulen = BUFFER_LENGTH;
    data.flags = DB_DBT_USERMEM;

    /* Acquire a cursor for the database. */
    if ((ret = dbp->cursor(dbp, NULL, &dbcp, DB_CURSOR_BULK)) != 0) {
        dbp->err(dbp, ret, "DB->cursor");
        free(data.data);
        return (ret);
    }

    for (;;) {
        /*
         * Acquire the next set of key/data pairs.  This code 
         * does not handle single key/data pairs that won't fit 
         * in a BUFFER_LENGTH size buffer, instead returning 
         * DB_BUFFER_SMALL to our caller.
         */
        if ((ret = dbcp->get(dbcp,
            &key, &data, DB_MULTIPLE_KEY | DB_NEXT)) != 0) {
            if (ret == DB_NOTFOUND)
                ret = 0;    /* End of database: not an error. */
            else
                dbp->err(dbp, ret, "DBcursor->get");
            break;
        }

        for (DB_MULTIPLE_INIT(p, &data);;) {
            DB_MULTIPLE_KEY_NEXT(p,
                &data, retkey, retklen, retdata, retdlen);
            if (p == NULL)
                break;
            printf("key: %.*s, data: %.*s\n",
                (int)retklen, (char *)retkey, (int)retdlen, 
                (char *)retdata);
        }
    }

    /* Report the close error itself, and keep any earlier error. */
    if ((t_ret = dbcp->close(dbcp)) != 0) {
        dbp->err(dbp, t_ret, "DBcursor->close");
        if (ret == 0)
            ret = t_ret;
    }

    free(data.data);

    return (ret);
}

Bulk updates

To put records in bulk with the btree or hash access methods, construct bulk buffers in the key and data DBT using DB_MULTIPLE_WRITE_INIT and DB_MULTIPLE_WRITE_NEXT. To put records in bulk with the recno or queue access methods, construct the bulk buffer in the data DBT in the same way, but construct the key DBT using DB_MULTIPLE_RECNO_WRITE_INIT and DB_MULTIPLE_RECNO_WRITE_NEXT with a data size of zero. In both cases, set the DB_MULTIPLE flag to DB->put().

Alternatively, for btree and hash access methods, construct a single bulk buffer in the key DBT using DB_MULTIPLE_WRITE_INIT and DB_MULTIPLE_KEY_WRITE_NEXT. For recno and queue access methods, construct a bulk buffer in the key DBT using DB_MULTIPLE_RECNO_WRITE_INIT and DB_MULTIPLE_RECNO_WRITE_NEXT. In both cases, set the DB_MULTIPLE_KEY flag to DB->put().

A successful bulk operation is logically equivalent to a loop through each key/data pair, performing a DB->put() for each one.

Bulk deletes

To delete all records with a specified set of keys with the btree or hash access methods, construct a bulk buffer in the key DBT using DB_MULTIPLE_WRITE_INIT and DB_MULTIPLE_WRITE_NEXT. To delete a set of records with the recno or queue access methods, construct the key DBT using DB_MULTIPLE_RECNO_WRITE_INIT and DB_MULTIPLE_RECNO_WRITE_NEXT with a data size of zero. In both cases, set the DB_MULTIPLE flag to DB->del(). This is equivalent to calling DB->del() for each key in the bulk buffer. In particular, if the database supports duplicates, all records with the matching key are deleted.

Alternatively, to delete a specific set of key/data pairs, which may be items within a set of duplicates, there are also two cases depending on whether the access method uses record numbers for keys. For btree and hash access methods, construct a single bulk buffer in the key DBT using DB_MULTIPLE_WRITE_INIT and DB_MULTIPLE_KEY_WRITE_NEXT. For recno and queue access methods, construct a bulk buffer in the key DBT using DB_MULTIPLE_RECNO_WRITE_INIT and DB_MULTIPLE_RECNO_WRITE_NEXT. In both cases, set the DB_MULTIPLE_KEY flag to DB->del().

A successful bulk operation is logically equivalent to a loop through each key/data pair, performing a DB->del() for each one.