SEM Development

June 19, 2008

CouchDB and wrangling large amounts of data

Filed under: Python, Tips & Tricks — Bob @ 5:39 pm

CouchDB is a new toy I’ve been playing with recently. It’s a document database that lends itself to semi-structured data. I’ve been using it for storing report data. Report data goes so well with CouchDB because it’s usually structured but the schema is fluid. To be clearer, the structure of reporting data changes frequently, especially during the development of an application. To use an example from my primary field of work, say a client wants to track impressions and clicks by keyword. No problem, the document schema would look something like this:

{
  "date": 1211612400,
  "keyword": "nascar",
  "impressions": 450,
  "clicks": 121
}

Excellent, we have data for one keyword during one day. We can create documents for each keyword during each day and then perform a roll up of data for any time period using CouchDB’s “views”. Views allow you to create an index on any or all of the attributes of a document. The most obvious view for this type of document would be one with the keyword and date as the key. To create a view, CouchDB takes a map function written in JavaScript. For the view I just described, the map function would look like this:

function (doc) {
  if(doc.keyword && doc.date) {
    emit([doc.keyword, doc.date], doc);
  }
}

A view built from this function returns all documents sorted by keyword and then date. Documents are just a series of key/value pairs, so you can easily write a reduce function to get total amounts of impressions and clicks for any date period. Christopher Lenz has written an incredible Python API for interfacing with CouchDB. I’ll include a little code to show how I’d take a keyword and find total clicks for a time period:

import couchdb

server = couchdb.Server('http://myserver.com:5984')
db = server['mydb']

# Assume I've created a view entitled "by_keyword"
rKeyword = db.view('_view/myviews/by_keyword')

# The library treats all documents as one big array. Great use
# of Pythonic concepts.
# Keys are [keyword, date], the same order the view emits them in
docs = rKeyword[['nascar', startDate]:['nascar', endDate]]
# Start from 0 so each row's clicks are added to a running number
clicks = reduce(lambda x, y: x + y.value['clicks'], docs, 0)
print clicks

I’ve done some handwaving around the date stuff. I store dates as seconds since the epoch in CouchDB because they’re easier to sort on. I have a sneaking suspicion that I could actually save them as JavaScript date objects, but I’ve never tried it.
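To make the date handling and the key order a little more concrete, here is a rough pure-Python imitation of what the view does, with no CouchDB server involved. The helper names (`to_epoch`, `by_keyword`, `total_clicks`) and the sample documents are made up for illustration, and sorting Python lists only approximates CouchDB's key collation for this simple string-then-number key.

```python
import calendar
from datetime import datetime

def to_epoch(year, month, day):
    """Seconds since the epoch for midnight UTC on the given day."""
    return calendar.timegm(datetime(year, month, day).timetuple())

# Hypothetical sample documents in the shape described above.
docs = [
    {"date": to_epoch(2008, 5, 24), "keyword": "nascar", "impressions": 450, "clicks": 121},
    {"date": to_epoch(2008, 5, 25), "keyword": "nascar", "impressions": 300, "clicks": 80},
    {"date": to_epoch(2008, 5, 25), "keyword": "indycar", "impressions": 90, "clicks": 12},
    {"date": to_epoch(2008, 6, 1), "keyword": "nascar", "impressions": 500, "clicks": 140},
]

def by_keyword(docs):
    """Imitate the map function: emit ([keyword, date], doc) rows, sorted by key."""
    rows = [([d["keyword"], d["date"]], d) for d in docs
            if d.get("keyword") and d.get("date")]
    return sorted(rows, key=lambda row: row[0])

def total_clicks(rows, keyword, start, end):
    """Sum clicks for rows whose key lies between [keyword, start] and [keyword, end]."""
    lo, hi = [keyword, start], [keyword, end]
    return sum(doc["clicks"] for key, doc in rows if lo <= key <= hi)

rows = by_keyword(docs)
may_clicks = total_clicks(rows, "nascar", to_epoch(2008, 5, 1), to_epoch(2008, 5, 31))
print(may_clicks)
```

Putting the keyword first in the key is what keeps all of one keyword's days next to each other, so a single start/end pair can select a date range for that keyword.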

Alright, so we have a quick way to retrieve and massage data. Now the client starts wanting to include which account the keyword is in. With a relational database, this means a schema change and possible lost data. What a pain. With CouchDB, it means adding a key/value pair to the document. It’ll now look like this:

{
  "date": 1211612400,
  "keyword": "nascar",
  "impressions": 450,
  "clicks": 121,
  "account_id": 4
}

But what about the code to roll up performance data? What about previous documents not fitting the new schema? Don’t worry about it. None of those things need to change: the old view simply ignores the extra attribute, so we continue to get the same data we’ve always had. Now, if we want to start splitting the data by account, we need a new view, but we don’t have to change any of the old stuff. We just create a new view, and documents written before account_id existed simply aren’t emitted by it:

function (doc) {
  if(doc.account_id && doc.keyword && doc.date) {
    emit([doc.account_id, doc.keyword, doc.date], doc);
  }
}

Et voilà! You can now pull performance by account, keyword and date. I particularly like this arrangement because it means, even if new attributes need to be tracked, it won’t affect current code.
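A small simulation may help show why old and new documents can live side by side. It is a sketch only: the two map functions are Python translations of the JavaScript ones above, `build_view` is a made-up helper standing in for CouchDB's indexing, and the second document's values are invented.

```python
def map_by_keyword(doc):
    """Original view: needs only keyword and date."""
    if doc.get("keyword") and doc.get("date"):
        yield [doc["keyword"], doc["date"]], doc

def map_by_account(doc):
    """Newer view: additionally requires account_id."""
    if doc.get("account_id") and doc.get("keyword") and doc.get("date"):
        yield [doc["account_id"], doc["keyword"], doc["date"]], doc

def build_view(map_fn, docs):
    """Run a map function over every document and sort the emitted rows by key."""
    rows = [row for doc in docs for row in map_fn(doc)]
    return sorted(rows, key=lambda row: row[0])

# One document from before the change, one from after (values are invented).
old_doc = {"date": 1211612400, "keyword": "nascar", "impressions": 450, "clicks": 121}
new_doc = {"date": 1211698800, "keyword": "nascar", "impressions": 300,
           "clicks": 80, "account_id": 4}

keyword_rows = build_view(map_by_keyword, [old_doc, new_doc])
account_rows = build_view(map_by_account, [old_doc, new_doc])

print(len(keyword_rows))  # both documents appear in the old view
print(len(account_rows))  # only the document with account_id appears in the new one
```

One thing the sketch makes visible: because the guard is a truthiness test, a document with an account_id of 0 would be skipped too, in JavaScript just as in Python.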

There is one more concept that I like using for data wrangling on large reporting datasets. It’s called currying and it allows me to cut down on the amount of code I’m writing. The problem is that I usually need to operate over a number of different attributes. The example above only sums the value of clicks. I’ll almost always need to get more than just clicks. Currying, as I’m using the term here, involves writing a function that returns another function (strictly speaking that’s a closure, or function factory, but the effect is similar). You’ll see that the reduce statement above uses a lambda function that reduces on clicks. I’m going to write a function that returns a function that will sum any attribute value and use that in the reduce function:


def _add(attribute):
  def toReduce(x, y):
    return x + y.value[attribute]
  return toReduce
 
totalImpressions = reduce(_add('impressions'), docs, 0)
totalClicks = reduce(_add('clicks'), docs, 0)

The code above is simple, but I think it shows how incredibly powerful a few small programming concepts and CouchDB can become. It’s great for large datasets with information that doesn’t have to be relational. I use it specifically for huge amounts of web analytics data. I don’t lose any granularity, it takes almost no time to maintain and it’s incredibly fast. CouchDB just recently added a reduce option to its views. I haven’t had time to play with it, but it looks great. I plan on using it to take even more tedious operations out of my applications.
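Taking the same idea one step further, the function factory can be driven by a list of attribute names so that every total comes back in one dictionary. This is only a sketch written for Python 3: `Row` is a stand-in that models nothing but the `.value` attribute used above, and `totals` is a hypothetical helper, not part of any library.

```python
from collections import namedtuple
from functools import reduce

# Stand-in for a view row; only the .value attribute used in the post is modeled.
Row = namedtuple("Row", ["key", "value"])

def _add(attribute):
    """Return a reducer that adds row.value[attribute] to a running total."""
    def to_reduce(total, row):
        return total + row.value[attribute]
    return to_reduce

def totals(rows, attributes):
    """Return {attribute: sum of that attribute over all rows}."""
    rows = list(rows)  # materialize so we can iterate once per attribute
    return {name: reduce(_add(name), rows, 0) for name in attributes}

docs = [
    Row(["nascar", 1211612400], {"impressions": 450, "clicks": 121}),
    Row(["nascar", 1211698800], {"impressions": 300, "clicks": 80}),
]

summary = totals(docs, ["impressions", "clicks"])
print(summary)
```

Adding another metric to a report then means adding one more name to the list rather than another reduce call.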

March 11, 2008

Pylons and Django newforms — SelectMultiple problems

Filed under: Python, Tips & Tricks — Bob @ 2:38 pm

Pylons has become my framework of choice for any new projects where I get to choose the technology. I’ve had trouble, however, finding a form library I like. Max Ischenko posted an entry about how to get Pylons to work with Django’s newforms library and I’ve been hooked ever since. It just seems to stay out of my way better than any other form framework out there.
I ran into a problem a couple of days ago when using the SelectMultiple widget. The MultipleChoiceField expects a list from the SelectMultiple widget. In the Django framework, this isn’t a problem because it checks for a MultiValueDict:

def value_from_datadict(self, data, files, name):
  if isinstance(data, MultiValueDict):
    return data.getlist(name)
  return data.get(name, None)

Unfortunately for me, and any other WSGI framework users, MultiValueDict is a Django-specific data structure. I also think that isinstance is not a very Pythonic way to handle this. Duck typing seems the correct way to do this and I’ve used it in my patch. It checks to see if the data object has a getall attribute. This tells me it’s a Paste MultiDict object, which is more prevalent in other WSGI frameworks, such as Pylons.

Just take the patch I’ve created to fix the SelectMultiple widget and apply it to django/newforms/widgets.py. Now the widget will return your values in a list, exactly how MultipleChoiceField wants it.
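The patch itself isn't reproduced here, so what follows is only a guess at the shape of the idea rather than the actual change. It is written as a free function without the widget's `self` and `files` arguments, and `FakeMultiDict` is a hypothetical stand-in that imitates just the `getall` method of a Paste-style request parameter object.

```python
def value_from_datadict(data, name):
    """Return every submitted value for `name` as a list when the container allows it."""
    if hasattr(data, "getlist"):   # Django-style multi-value dict
        return data.getlist(name)
    if hasattr(data, "getall"):    # Paste-style multi-value dict
        return data.getall(name)
    return data.get(name, None)    # plain mapping: a single value or None


class FakeMultiDict(object):
    """Hypothetical stand-in: keeps (key, value) pairs and exposes getall()."""
    def __init__(self, pairs):
        self._pairs = list(pairs)

    def getall(self, name):
        return [value for key, value in self._pairs if key == name]


params = FakeMultiDict([("tags", "python"), ("tags", "pylons"), ("title", "hello")])
selected = value_from_datadict(params, "tags")
single = value_from_datadict({"tags": "python"}, "tags")
print(selected)
print(single)
```

Checking for the method rather than the class means any container that can return all values for a key is handled, whichever framework it came from.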

December 14, 2007

AdWords API, Reports, and Python (The Decent, the Ugly and the Good, respectively)

Filed under: Google AdWords, Python — Bob @ 8:28 am

I’ve always been annoyed with the way the ReportService in the AdWords API works with most interpreted languages. Yes, you could say it’s a problem with SOAP (and it is), but I don’t buy that. Google engineers use Python enough that you’d think they would be more sensitive to this problem. The sample code given for reports is … disgusting. If you decide it’s too terrible to look at directly, let me fill you in. They build the XML manually in a string and then do something akin to an sprintf to fill in values. I know it’s just an example, but come on, that’s terrible.

I decided I couldn’t live like this and went about finding a better way. A few quick explanations:

  • There may be a way to do this better in ZSI. I haven’t tried, I’m using SOAPpy.
  • I’ll be using my AdWords client, but you can use any one you want. It’s just the conversion of the report data structure into a SOAPpy type that matters.

Alright, let’s get started. First I’ll set up the client:

import adwords.client as client
 
aw = client.AdWordsClient(
  email='...',
  password='...',
  client_email='...',
  developer_token='...',
  application_token='...',
  user_agent='...'
  )

Now to build the report data structure. Of course, this is just an example so any of the data in this report could be different:

reportJob = {'selectedReportType': 'Account'}
reportJob['aggregationTypes'] = ['Daily']
reportJob['startDay'] = '2007-10-31'
reportJob['endDay'] = '2007-11-05'
reportJob['selectedColumns'] = [
  'CustomerName',
  'AdWordsType',
  'CPC',
  'CPM',
  'Clicks'
]

I’ll now build a SOAPpy representation of the data. With most API calls you would be able to just plop that data structure into the call and SOAPpy would take care of it. With reports, however, Google makes a small statement at the top of the DefinedReportJob page:

DefinedReportJob   - V11 (AdWords API Developer's Guide)

To do this, you need to create a SOAPpy type and then set some attributes on that type. Here’s how I do it:

import SOAPpy

# Wrap the dict in a SOAPpy struct so XML attributes can be attached to it
reportJob = SOAPpy.Types.structType(reportJob)
reportJob._setAttr('xmlns:impl', 'https://adwords.google.com/api/adwords/v11')
reportJob._setAttr('xsi:type', 'impl:DefinedReportJob')

As you can see, SOAPpy offers up some types. The structType is the best for dict objects, but if you run into other typing problems it supplies array types, string types, numerical types, etc. Now the only thing left to do is schedule the job:

jobId = int(aw.scheduleReportJob(reportJob))

It’s actually better practice to first validate your job with the new validateReportJob function in v11 of the API, but you get the picture.

There’s actually one additional problem I ran into when checking the status of a job. The Java service expects a long variable but if I cast it to long in Python, the API returns an error because it’s passed as a BigInt. I have no idea why it does this, but to get around it, I used some of SOAPpy’s types again:

status = aw.getReportJobStatus(SOAPpy.Types.longType(jobId))

Make sure your jobId variable is an int or SOAPpy will throw an exception.
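Checking the status usually happens in a loop, so here is a hedged sketch of a polling helper. Nothing in it comes from the AdWords client: `wait_for_report` is a hypothetical name, the status strings "Pending", "InProgress", "Completed" and "Failed" are assumptions about what the service returns, and `get_status` is any callable you supply (for example one wrapping the getReportJobStatus call above).

```python
import time

def wait_for_report(get_status, job_id, interval=30, max_checks=20, sleep=time.sleep):
    """Poll get_status(job_id) until the job completes, fails, or the checks run out.

    The status strings compared against below are assumptions, not confirmed API values.
    """
    for _ in range(max_checks):
        status = get_status(job_id)
        if status == "Completed":
            return True
        if status == "Failed":
            return False
        sleep(interval)  # still pending or in progress, so wait and try again
    raise RuntimeError("report job %d not finished after %d checks" % (job_id, max_checks))


# Fake status source so the sketch runs without any API access.
fake_statuses = iter(["Pending", "InProgress", "Completed"])
finished = wait_for_report(lambda job_id: next(fake_statuses), 42,
                           interval=0, sleep=lambda seconds: None)
print(finished)
```

Passing the status function in keeps the SOAPpy long-type workaround in one place and lets the loop be tried out without touching the API.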

So, I was able to get reports up and running in Python and I didn’t have to resort to archaic search and replace functionality in my code. I’ll probably be adding this to my AdWords client soon so I don’t have to think about it every time I create a report.

November 15, 2007

Extending the Python AdWords API client to understand “unbounded”

Filed under: Google AdWords, Programming, Python, pyadwords-client — Bob @ 4:48 pm

I’ve been adding some functionality to the sample AdWords API client. After using it for a couple of minutes I found that it does not understand when a list should be returned. In other words, if I call an API method that should return a list, getAdGroupList for example, and only one ad group is returned, it is returned as a single instance, not a list of length one. So now instead of:

for adgroup in service.getAdGroupList(listofAdGroupIds):
  print adgroup.name

I have to do some duck typing on each return value:

adgroups = service.getAdGroupList(listofAdGroupIds)
if not hasattr(adgroups, 'sort'):
  adgroups = [adgroups]
for adgroup in adgroups:
  print adgroup.name

It’s more than doubled the lines of code needed just to call a simple API method. Now, I could just overload the service to turn all returned values into a list. The problem, however, is that some methods aren’t supposed to return a list, updateAdGroup, for example. To make the client understand when a list is needed I have to deconstruct the WSDL. I’ve already been loading the WSDLs to extract API method names so I just need to find where the return value was defined. For the AdWords API, it’s listed pretty far down. Here’s a look at the WSDL where it defines the getAdGroupList response element:

<element name="getAdGroupListResponse">
  <complexType>
    <sequence>
      <element name="getAdGroupListReturn" maxOccurs="unbounded" type="impl:AdGroup"/>
    </sequence>
  </complexType>
</element>

In the getAdGroupListReturn element you’ll see the maxOccurs value is unbounded. This means the return value should be a list. I’m not sure why, but the SOAPpy module doesn’t seem to understand this. To fix the behavior, I first had to find out where SOAPpy was putting the *methodName*Return element in the data structure. This was pretty much trial and error with a lot of dir() commands until I found the culprit. I’m not going to post the convoluted code here but, if you’re interested, you can look through it yourself. The method name is getPluralMethods.
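For a rough idea of what such a lookup involves, here is a sketch that reads the schema directly with the standard library instead of digging through SOAPpy's internals. It is not the getPluralMethods code: `plural_methods` is a hypothetical name, the sample schema is a trimmed fragment, and the empty updateAdGroupResponse element in it is invented to provide a non-list case.

```python
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2001/XMLSchema"

# Trimmed, partly invented schema fragment for illustration.
SAMPLE = """
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:impl="https://adwords.google.com/api/adwords/v11">
  <element name="getAdGroupListResponse">
    <complexType>
      <sequence>
        <element name="getAdGroupListReturn" maxOccurs="unbounded" type="impl:AdGroup"/>
      </sequence>
    </complexType>
  </element>
  <element name="updateAdGroupResponse">
    <complexType>
      <sequence/>
    </complexType>
  </element>
</schema>
"""

def plural_methods(schema_xml):
    """Return the set of method names whose response element may repeat."""
    element_tag = "{%s}element" % XSD
    root = ET.fromstring(schema_xml)
    plurals = set()
    for response in root.iter(element_tag):
        name = response.get("name", "")
        if not name.endswith("Response"):
            continue
        # Look at the elements nested inside this response definition.
        for child in response.iter(element_tag):
            if child is not response and child.get("maxOccurs") == "unbounded":
                plurals.add(name[:-len("Response")])
    return plurals

print(sorted(plural_methods(SAMPLE)))
```

The method name is recovered by stripping the "Response" suffix, which relies on the naming pattern visible in the fragment above.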

Ok, so I’ve gotten a list of method names that need to return a list. What now? Well, I created a method named expectsList that takes another method as an argument:

def expectsList(self, fn):
  """Decorator that guarantees that the
  return value of a function is a list
 
  Args:
    fn: function
  Returns:
    function
  """
  def returnList(*args, **kwargs):
    out = fn(*args, **kwargs)
 
    #Quack, quack: duck typing
    if not hasattr(out, 'reverse'):
      if not hasattr(out, 'id'):
        #Empty return? Return an empty list
        return []
      #Single return element? Return it in a list
      return [out]
    #Otherwise, it must already be a list
    return out
  return returnList

In the comments, I call this a decorator although it doesn’t technically follow the decorator syntax in the next bit of code where I wrap it around plural API methods:

plurals = self.getPluralMethods(wsdl)
for meth in wsdl.methods.keys():
  methFn = getattr(service, meth)
  if meth in plurals:
    methFn = self.expectsList(methFn)
  setattr(self, meth, methFn)

So, as you can see I’m wrapping the methods that expect a list in the expectsList function. This will make sure that all data returned from these methods is in the correct format and it’s been working so far.
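For comparison, here is how the same wrapper might look as a standalone function used with the `@` decorator syntax. This is a sketch for Python 3, not the client's code: `expects_list`, the `AdGroup` class and the three sample functions are invented, and the empty string standing in for an "empty" response is an assumption about what comes back when nothing matches.

```python
from functools import wraps

def expects_list(fn):
    """Wrap fn so that its return value is always a list."""
    @wraps(fn)  # keep the wrapped function's name and docstring
    def return_list(*args, **kwargs):
        out = fn(*args, **kwargs)
        # Duck typing, mirroring the checks in the post
        if not hasattr(out, "reverse"):
            if not hasattr(out, "id"):
                return []       # nothing useful came back
            return [out]        # a single object: wrap it in a list
        return out              # already list-like
    return return_list


class AdGroup(object):
    """Hypothetical stand-in for a returned ad group."""
    def __init__(self, ident, name):
        self.id = ident
        self.name = name


@expects_list
def one_group():
    return AdGroup(1, "shoes")

@expects_list
def two_groups():
    return [AdGroup(1, "shoes"), AdGroup(2, "boots")]

@expects_list
def no_groups():
    return ""  # assumed shape of an empty response

for adgroup in one_group():
    print(adgroup.name)
```

Using functools.wraps keeps each wrapped method's own name, which is handy when the methods are attached by name with setattr.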

November 12, 2007

Extending the python AdWords client sample code

Filed under: Google AdWords, Programming, Python, pyadwords-client — Bob @ 5:12 pm

I’ve recently started using Python and I’m really starting to love it. Now, with my reintroduction to SEM, I’ve had to figure out whether I should port my old Perl clients over to Python or just see if there’s already something written in Python. For AdWords, there’s a whole site of code samples in Python. One of these samples is actually a small Python client for AdWords. I loaded it up and started playing around with it. It’s a good starting point but, as usual, I want more.

First, and this is pretty much a port of some Perl code I had, I wanted to be able to access all the API methods from one object. I mean ALL of them. I don’t want to load each service separately. I want to be able to access the getAllAdWordsCampaigns method from the same object I access getAllAdGroups. I know, I know, “What about method name collisions?” or “That’s not a strict interpretation of OO design.” is what I’m hearing. And both are valid concerns, however here’s what I think:

Method name collisions: I haven’t seen any evidence that Google will start overlapping method names in different services. If they had that intention they would have used getAll as a method name in different services, not getAllAdWordsCampaigns, getAllAdGroups, getAllCriteria, and so on and so forth. If, for some reason, they decide to switch things up, I’ll have to rethink, but I’m willing to bet they won’t.

OO Design: This is easy. I don’t care. I don’t like SOAP. I see it as a play by Sun and MS to create such a complicated protocol that people are forced to use their language to implement it completely and correctly. I’ll stick with the view that the API is one big object when I’m replicating between my local db and the remote API. I’ll employ OO techniques when I’m using my local DB as a model layer.

So, I’ve created a small Python AdWords API client in Google Code. I’ll be writing about it now and again. Here are some of the first changes I’ve made:

All API methods are loaded as actual methods in the API objects. By that I mean you can just do:

awclient = AdWordsClient(**loginParams)
campaigns = awclient.getAllAdWordsCampaigns(1)

Yup, you don’t need to get any services or call any wrapper “call” methods. Just call the method directly from the client object. I was able to do that by loading all the WSDLs and extracting the method names. I use the setattr command to add the API call as an actual method. Also, the WSDLs are cached so you don’t have to keep grabbing them remotely.
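The setattr trick can be shown without any SOAP machinery. In this sketch `FakeService` and `FlatClient` are invented classes, the services are plain objects with a list of method names, and the collision check is an addition of the sketch (the post bets that collisions won't happen).

```python
class FakeService(object):
    """Hypothetical stand-in for one SOAP service proxy."""
    def __init__(self, **methods):
        self.method_names = sorted(methods)
        for name, fn in methods.items():
            setattr(self, name, fn)


class FlatClient(object):
    """Expose the methods of several services on a single object."""
    def __init__(self, services):
        for service in services:
            for name in service.method_names:
                if hasattr(self, name):
                    # Added safeguard, not described in the post
                    raise ValueError("method name collision: %s" % name)
                setattr(self, name, getattr(service, name))


campaign_service = FakeService(getAllAdWordsCampaigns=lambda dummy: ["campaign-1"])
adgroup_service = FakeService(getAllAdGroups=lambda campaign_id: ["adgroup-%d" % campaign_id])

client = FlatClient([campaign_service, adgroup_service])
print(client.getAllAdWordsCampaigns(1))
print(client.getAllAdGroups(7))
```

If two services ever did share a method name, the constructor would fail loudly instead of silently letting one method replace the other.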

This project is in its infancy and really just tailored to what I want right now. Of course, I’m always open to suggestions. Again, the code is at http://pyadwords-client.googlecode.com. It’s under the “Source” tab.