Thursday, March 14, 2013

Couchbase Map/Reduce/Rereduce

MapReduce (http://en.wikipedia.org/wiki/MapReduce) has recently become one of my favorite things to talk about in computing. I am currently in charge of building a system that uses Couchbase to store large amounts of denormalized data, and that data needs to be rebuilt frequently depending on many factors. To help others, I thought I'd post a small chunk of code here with a couple of brief explanations. Enjoy.

// Map

function(doc, meta) {
  if (doc.type == "invoice") {
    for (var i = 0; i < doc.items.length; i++) {
      var item = doc.items[i];
      emit(
        [ doc.year, item.id, item.unit ],
        { cost: item.cost, quantity: item.quantity}
      );            
    }
  }
}

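This map emits one row per invoice item, keyed by year, item id, and unit, with the item's cost and quantity as the value. Because the key is a compound (array) key, the "group" and "group_level" query parameters let me choose how much of it to group by: group_level=1 groups by year, group_level=2 by year and id, and group_level=3 (or group=true) by the full key of year, id, and unit.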
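To make the grouping concrete, here is a minimal sketch that runs the map above outside Couchbase. The sample invoice, the stand-in emit() function, and the groupKeys() helper are all assumptions made up for illustration; groupKeys() is only a rough model of what group_level does to a compound key, not how the server implements it.

```javascript
// Hypothetical sample document; the field values are made up for illustration.
var sampleInvoice = {
  type: "invoice",
  year: 2013,
  items: [
    { id: "widget", unit: "each", cost: 4.5, quantity: 2 },
    { id: "widget", unit: "box", cost: 40, quantity: 1 }
  ]
};

// Stand-in for the emit() that Couchbase provides inside a view.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Same map logic as above, given a name so it can be called directly.
function map(doc, meta) {
  if (doc.type == "invoice") {
    for (var i = 0; i < doc.items.length; i++) {
      var item = doc.items[i];
      emit(
        [ doc.year, item.id, item.unit ],
        { cost: item.cost, quantity: item.quantity }
      );
    }
  }
}

map(sampleInvoice, {});

// Rough model of group_level: keep only the first N elements of each key
// and collect the distinct prefixes.
function groupKeys(rows, groupLevel) {
  var seen = {};
  rows.forEach(function (row) {
    var prefix = row.key.slice(0, groupLevel);
    seen[JSON.stringify(prefix)] = prefix;
  });
  return Object.keys(seen).map(function (k) { return seen[k]; });
}

console.log(JSON.stringify(groupKeys(rows, 1))); // [[2013]]
console.log(JSON.stringify(groupKeys(rows, 2))); // [[2013,"widget"]]
console.log(JSON.stringify(groupKeys(rows, 3))); // [[2013,"widget","each"],[2013,"widget","box"]]
```

Note how the two "widget" rows only stay separate at level 3, which is why grouping by unit matters for the totals.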

The next step is to aggregate totals for cost and quantity. It is important in this step that I group by unit, because the same item may come up more than once with a different unit. In that case I'd need to convert to some base unit before aggregating, but I'll save that for later to keep this simple.

// Reduce

function(key, values, rereduce) {
  var result = {
    TotalCost: 0,
    TotalQuantity: 0,
    ItemCount: 0
  };

  for(var i = 0; i < values.length; i++) {
    if (rereduce) {
      result.TotalCost += values[i].TotalCost;
      result.TotalQuantity += values[i].TotalQuantity;
      result.ItemCount += values[i].ItemCount;
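    } else {
      // values holds raw map output here; accumulate rather than assign,
      // since a single call can receive several values for the same key.
      result.TotalCost += values[i].cost;
      result.TotalQuantity += values[i].quantity;
      result.ItemCount += 1;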
    }
  }
  return(result);
}

In this example, I am taking advantage of the rereduce parameter, which is managed by Couchbase itself. This is important, and it confused me for a while. Couchbase Server uses internal logic to determine whether rereduce is true or false. It is NOT something you provide; what you control is how your reduce function handles each case.

For more info please see the following link: http://www.couchbase.com/docs/couchbase-devguide-2.0/understanding-custom-reduce.html.

Think of Reduce/Rereduce as two passes: one through the values coming from your map, and another through the results of the reduce logic itself. The first pass takes the values from your map and converts them into a new collection of values, which is passed back into your reduce function, this time with the "rereduce" boolean set to true. The rereduce branch of the if statement above then combines those partial totals, and the view returns one row per group. In practice the server may rereduce more than once, or not at all for a small group, so both branches need to return a result of the same shape.
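The two passes can be sketched by calling a reduce function by hand. This is only a simulation under stated assumptions: the mapped values are made up, the split into two chunks is arbitrary (the real server decides how values are batched), and the reduce shown here accumulates with += in both branches so that a chunk with several values is summed correctly.

```javascript
// Reduce that accumulates in both branches.
function reduce(key, values, rereduce) {
  var result = { TotalCost: 0, TotalQuantity: 0, ItemCount: 0 };
  for (var i = 0; i < values.length; i++) {
    if (rereduce) {
      // values are results of earlier reduce calls
      result.TotalCost += values[i].TotalCost;
      result.TotalQuantity += values[i].TotalQuantity;
      result.ItemCount += values[i].ItemCount;
    } else {
      // values are raw map output
      result.TotalCost += values[i].cost;
      result.TotalQuantity += values[i].quantity;
      result.ItemCount += 1;
    }
  }
  return result;
}

// Hypothetical map output for a single group.
var mapped = [
  { cost: 10, quantity: 1 },
  { cost: 20, quantity: 2 },
  { cost: 30, quantity: 3 },
  { cost: 40, quantity: 4 }
];

// First pass: reduce each chunk of map output (rereduce = false).
// The key argument is unused by this reduce, so null is passed.
var partials = [
  reduce(null, mapped.slice(0, 2), false),
  reduce(null, mapped.slice(2), false)
];

// Second pass: combine the partial results (rereduce = true).
var total = reduce(null, partials, true);

console.log(JSON.stringify(total));
// {"TotalCost":100,"TotalQuantity":10,"ItemCount":4}
```

Reducing all four values in a single call with rereduce set to false gives the same totals, which is the property a reduce function needs: the answer should not depend on how the values happen to be batched.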

I would also suggest looking up the Couchbase documentation on translating SQL to Map/Reduce. That proved very helpful in my case.


