CouchDB and wrangling large amounts of data
CouchDB is a new toy I’ve been playing with recently. It’s a document database that lends itself to semi-structured data. I’ve been using it for storing report data. Report data goes so well with CouchDB because it’s usually structured but the schema is fluid. To be clearer, the structure of reporting data changes frequently, especially during the development of an application. To use an example from my primary field of work, say a client wants to track impressions and clicks by keyword. No problem, the document schema would look something like this:
{
date: 1211612400
keyword: nascar
impressions: 450
clicks: 121
}
Excellent, we have data for one keyword during one day. We can create documents for each keyword during each day and then perform a roll up of data for any time period using CouchDB’s “views”. Views allow you to create an index on any or all of the attributes of a document. The most obvious view for this type of document would be one with the keyword and date as the key. To create a view, CouchDB takes a map function written in javascript. For the view I just described, the map function would look like this:
function (doc) {
if(doc.keyword && doc.date) {
emit([doc.keyword, doc.date], doc);
}
}
This function will return all documents sorted by the keyword and date. Documents are just a series of key/value pairs, so you can easily write a reduce function to get total amounts of impressions and clicks for any date period. Christopher Lenz has written an incredible Python API for interfacing with CouchDB. I’ll include a little code to show how I’d take a keyword and find total clicks for a time period:
server = couchdb.Server('http://myserver.com:5984')
db = server['mydb']
#Assume I’ve created a view entitled “by_keyword”
rKeyword = db.view(’_view/myviews/by_keyword’)
#The library treats all documents as one big array. Great use
#of Pythonic concepts
docs = rKeyword[[startDate, 'nascar']:[endDate, 'nascar']]
clicks = reduce(lambda x, y: x+y.value['clicks'], docs)
print clicks
I’ve done some handwaving around the date stuff. I store dates as seconds since the epoch in CouchDB because they’re easier to sort on. I have a sneaking suspicion that I could actually save them as javascript date objects, but I’ve never tried it.
Alright, so we have a quick way to retrieve and massage data. Now the client starts wanting to include which account the keyword is in. With a relational database, this means a schema change and possible lost data. What a pain. With CouchDB, it means adding a key/value pair to the document. It’ll now look like this:
{
date: 1211612400
keyword: nascar
impressions: 450
clicks: 121
account_id:4
}
But what about the code to roll up performance data? What about previous documents not fitting the new schema? Don’t worry about it. None of those things need to change. We don’t have to change a thing and we continue to get the same data we’ve always had. Now, if we want to start splitting the data by account, we need a new view, but we don’t have to change any of the old stuff. We just create a new view:
function (doc) {
if(doc.account_id && doc.keyword && doc.date) {
emit([doc.account_id, doc.keyword, doc.date], doc);
}
}
et Voila! You can now pull performance by account, keyword and date. I particularly like this arrangement because it means, even if new attributes need to be tracked, it won’t affect current code.
There is one more concept that I like using for data wrangling on large reporting datasets. It’s called currying and it allows me to cut down on the amount of code I’m writing. The problem is that I usually need to operate over a number of different attributes. The example above only sums the value of clicks. I’ll almost always need to get more than just clicks. Currying involves writing a function that returns another function. You’ll see that the reduce statement above uses a lambda function that reduces on clicks. I’m going to write a function that returns a function that will sum any attribute value and use that in the reduce function:
def _add(attribute):
def toReduce(x, y):
return x + y.value[attribute]
return toReduce
totalImpressions = reduce(_add(’impressions’), docs, 0)
totalClicks = reduce(_add(’clicks’), docs, 0)
The code above is simple, but I think it shows how incredibly powerful a few small programming concepts and CouchDB can become. It’s great for large datasets with information that doesn’t have to be relational. I use it specifically for huge amounts of web analytics data. I don’t lose any granularity, it takes almost no time to maintain and it’s incredibly fast. CouchDB just recently added a reduce option to its views. I haven’t had time to play with it, but it looks great. I plan on using it to take even more tedious operations out of my applications.