SEM Development

October 4, 2008

Playing around with Ruby and Digg

Filed under: Tips & Tricks — Bob @ 7:07 pm

“The Google Cache” blog is on my RSS list and an interesting post about Digg SEO came up recently.

It details a way to make yourself known to other diggers who happen to be power users, thus opening the door to getting your stories dug. It seemed pretty easy to code up and I’ve been playing with Ruby lately, so I thought it’d be fun to solve the first half of the tedium: finding the top diggers.

This code is my very quick reply. It uses John Wulff’s digg-ruby library:

Finding this library was the best part of this little exercise. It includes some incredible meta-programming.

You can either look at my small, undocumented Digg class or clone it:

git://github.com/rbriski/top-digger-tool.git

June 19, 2008

CouchDB and wrangling large amounts of data

Filed under: Python, Tips & Tricks — Bob @ 5:39 pm

CouchDB is a new toy I’ve been playing with recently. It’s a document database that lends itself to semi-structured data. I’ve been using it for storing report data. Report data goes so well with CouchDB because it’s usually structured but the schema is fluid. To be clearer, the structure of reporting data changes frequently, especially during the development of an application. To use an example from my primary field of work, say a client wants to track impressions and clicks by keyword. No problem, the document schema would look something like this:

{
  "date": 1211612400,
  "keyword": "nascar",
  "impressions": 450,
  "clicks": 121
}

Excellent, we have data for one keyword during one day. We can create documents for each keyword during each day and then perform a roll-up of data for any time period using CouchDB’s “views”. Views allow you to create an index on any or all of the attributes of a document. The most obvious view for this type of document would be one with the keyword and date as the key. To create a view, CouchDB takes a map function written in JavaScript. For the view I just described, the map function would look like this:

function (doc) {
  if(doc.keyword && doc.date) {
    emit([doc.keyword, doc.date], doc);
  }
}

This function will return all documents sorted by the keyword and date. Documents are just a series of key/value pairs, so you can easily write a reduce function to get total amounts of impressions and clicks for any date period. Christopher Lenz has written an incredible Python API for interfacing with CouchDB. I’ll include a little code to show how I’d take a keyword and find total clicks for a time period:

import couchdb
from functools import reduce

server = couchdb.Server('http://myserver.com:5984')
db = server['mydb']

# Assume I've created a view entitled "by_keyword"
rKeyword = db.view('_view/myviews/by_keyword')

# The library treats all documents as one big array. Great use
# of Pythonic concepts. Keys follow the view's [keyword, date] order.
docs = rKeyword[['nascar', startDate]:['nascar', endDate]]
# Start from 0 so the first row's clicks are added to the total
clicks = reduce(lambda x, y: x + y.value['clicks'], docs, 0)
print(clicks)

I’ve done some handwaving around the date stuff. I store dates as seconds since the epoch in CouchDB because they’re easier to sort on. I have a sneaking suspicion that I could actually save them as JavaScript Date objects, but I’ve never tried it.
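The date handling can be as small as the sketch below. It assumes every date is treated as UTC, and `to_epoch` is a hypothetical helper name, not part of any library. The resulting values can be dropped into range keys that follow the view's [keyword, date] order.

```python
import calendar
from datetime import datetime


def to_epoch(year, month, day, hour=0):
    """Convert a UTC calendar date (and optional hour) to seconds since the epoch."""
    return calendar.timegm(datetime(year, month, day, hour).timetuple())


# Range keys for one keyword over one week (dates chosen only for illustration)
start_key = ["nascar", to_epoch(2008, 5, 24)]
end_key = ["nascar", to_epoch(2008, 5, 31)]

# The sample document's date, 1211612400, is 07:00 UTC on May 24, 2008
sample_date = to_epoch(2008, 5, 24, 7)
```

Because the values are plain integers, they sort numerically and compare cleanly inside a compound key.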

Alright, so we have a quick way to retrieve and massage data. Now the client starts wanting to include which account the keyword is in. With a relational database, this means a schema change and possible lost data. What a pain. With CouchDB, it means adding a key/value pair to the document. It’ll now look like this:

{
  "date": 1211612400,
  "keyword": "nascar",
  "impressions": 450,
  "clicks": 121,
  "account_id": 4
}

But what about the code to roll up performance data? What about previous documents not fitting the new schema? Don’t worry about it. None of those things need to change, and we continue to get the same data we’ve always had. Now, if we want to start splitting the data by account, we need a new view, but we don’t have to change any of the old stuff. We just create a new view:

function (doc) {
  if(doc.account_id && doc.keyword && doc.date) {
    emit([doc.account_id, doc.keyword, doc.date], doc);
  }
}

Et voilà! You can now pull performance by account, keyword and date. I particularly like this arrangement because it means that, even if new attributes need to be tracked, current code is unaffected.
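To see why old documents and old code keep working, here is a rough pure-Python imitation of the two views. It is only a sketch, not CouchDB itself: `build_view` and `total` are made-up helper names, the sample documents are invented, and Python's list comparison stands in for CouchDB's key collation, which only lines up for simple keys like these.

```python
def by_keyword(doc):
    """Original view: key is [keyword, date]."""
    if doc.get("keyword") and doc.get("date"):
        yield [doc["keyword"], doc["date"]], doc


def by_account(doc):
    """Newer view: key is [account_id, keyword, date]; older documents emit nothing."""
    if doc.get("account_id") and doc.get("keyword") and doc.get("date"):
        yield [doc["account_id"], doc["keyword"], doc["date"]], doc


def build_view(map_fn, docs):
    """Apply a map function to every document and sort the emitted rows by key."""
    rows = [row for doc in docs for row in map_fn(doc)]
    return sorted(rows, key=lambda row: row[0])


def total(view, attribute, start_key, end_key):
    """Sum one attribute over rows whose key lies between start_key and end_key."""
    return sum(doc[attribute] for key, doc in view if start_key <= key <= end_key)


docs = [
    # Written before account_id existed
    {"date": 1211612400, "keyword": "nascar", "impressions": 450, "clicks": 121},
    # Written after the client asked for accounts
    {"date": 1211698800, "keyword": "nascar", "impressions": 300, "clicks": 80,
     "account_id": 4},
]

keyword_view = build_view(by_keyword, docs)
account_view = build_view(by_account, docs)

# The old roll-up still sees both documents.
clicks_all = total(keyword_view, "clicks",
                   ["nascar", 1211612400], ["nascar", 1211698800])
# The new view only contains documents that carry an account_id.
clicks_acct = total(account_view, "clicks",
                    [4, "nascar", 0], [4, "nascar", 1211698800])
```

The older document simply never appears in the account view, so nothing has to be migrated before the new view becomes useful.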

There is one more concept that I like using for data wrangling on large reporting datasets. I’ll loosely call it currying (strictly speaking, it’s a closure) and it allows me to cut down on the amount of code I’m writing. The problem is that I usually need to operate over a number of different attributes. The example above only sums the value of clicks, and I’ll almost always need more than just clicks. The technique involves writing a function that returns another function. You’ll see that the reduce statement above uses a lambda function that reduces on clicks. I’m going to write a function that returns a function that will sum any attribute value, and use that in the reduce call:


def _add(attribute):
  # Build a reducer that adds one attribute of each row to a running total
  def toReduce(x, y):
    return x + y.value[attribute]
  return toReduce

totalImpressions = reduce(_add('impressions'), docs, 0)
totalClicks = reduce(_add('clicks'), docs, 0)

The code above is simple, but I think it shows how incredibly powerful a few small programming concepts and CouchDB can become. It’s great for large datasets with information that doesn’t have to be relational. I use it specifically for huge amounts of web analytics data. I don’t lose any granularity, it takes almost no time to maintain and it’s incredibly fast. CouchDB just recently added a reduce option to its views. I haven’t had time to play with it, but it looks great. I plan on using it to take even more tedious operations out of my applications.

March 11, 2008

Pylons and Django newforms — SelectMultiple problems

Filed under: Python, Tips & Tricks — Bob @ 2:38 pm

Pylons has become my framework of choice for any new projects where I get to choose the technology. I’ve had trouble, however, finding a form library I like. Max Ischenko posted an entry about how to get Pylons to work with Django’s newforms library and I’ve been hooked ever since. It just seems to stay out of my way better than any other form framework out there.

I ran into a problem a couple of days ago when using the SelectMultiple widget. The MultipleChoiceField expects a list from the SelectMultiple widget. In the Django framework, this isn’t a problem because it checks for a MultiValueDict:

def value_from_datadict(self, data, files, name):
  if isinstance(data, MultiValueDict):
    return data.getlist(name)
  return data.get(name, None)

Unfortunately for me, and any other WSGI framework users, MultiValueDict is a Django-specific data structure. I also think that isinstance is not a very Pythonic way to handle this. Duck typing seems the correct way to do this and I’ve used it in my patch. It checks to see if the data object has a getall attribute. This tells me it’s a Paste MultiDict object, which is more prevalent in other WSGI frameworks, such as Pylons.

Just take the patch I’ve created to fix the SelectMultiple widget and apply it to django/newforms/widgets.py. Now the widget will return your values in a list, exactly how MultipleChoiceField wants it.
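The duck-typing idea can be sketched roughly as follows. This is not the patch itself, just an illustration written as a standalone function rather than a widget method. `FakeMultiDict` is a hypothetical stand-in for a Paste-style multidict so the example runs without Pylons installed.

```python
def value_from_datadict(data, files, name):
    """Return every submitted value for `name` as a list when the container allows it."""
    if hasattr(data, "getall"):    # Paste-style multidict (Pylons)
        return data.getall(name)
    if hasattr(data, "getlist"):   # Django-style MultiValueDict
        return data.getlist(name)
    return data.get(name, None)    # plain dict: single value or None


class FakeMultiDict(dict):
    """Hypothetical stand-in for a Paste multidict; stores a list per key."""

    def getall(self, name):
        return list(self.get(name, []))


params = FakeMultiDict(tags=["python", "pylons"])
selected = value_from_datadict(params, None, "tags")
single = value_from_datadict({"tags": "python"}, None, "tags")
```

Checking for the method rather than the class means any container that offers `getall` or `getlist` works, whichever framework produced it.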

May 4, 2006

Regression tests and web service APIs

Filed under: Tips & Tricks — Bob @ 4:17 pm

I’ve recently realized what a pain it is to write regression tests for modules that access the Google AdWords and Yahoo DTCXML API directly.  I already hate writing tests so this is pain^2.  Here’s my setup:


The green box is Google’s AdWords API.  The blue box is my API object that uses SOAP::Lite to access the API.  The red box is supposed to represent a Criterion object.  I’ve talked about how I create an object representation of the API services before.  Of course there are more objects but they’re not important for this discussion.

I can take two approaches when writing tests for the API object:

  1. Ask for login info and do live tests on the API.
  2. Write canned SOAP envelopes and insert them into my module.

I originally took the first approach, which works OK but dies if the service is down, the person enters the wrong login info, and on and on. Now, I know I could write tests that make sure the service is up, or retry a call if a 500 is returned, but these are just tests. When I have to introduce exception handling into my test scripts they’ve ceased to be tests. Also, it uses up Google’s quota units and gets stuck on Yahoo’s rate limits without careful planning.

I got the idea for the second approach by looking over Mike Schilli’s CPAN contributions. Specifically, I looked at Net::Amazon, which is a Perl interface for the Amazon shopping API. He’s written canned files that hold various responses that should be expected from the Amazon service. He then tests against these files rather than the actual service.
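The canned-response approach might look something like the sketch below, shown in Python rather than Perl. Everything in it is assumed for illustration: the `ApiClient` class, the `get_criteria` method, and the element names in the envelope are invented and do not mirror the AdWords, Yahoo, or Amazon schemas.

```python
import xml.etree.ElementTree as ET

# A hand-written response kept alongside the tests. The element names are
# invented for this example and do not mirror any real API schema.
CANNED_RESPONSE = """<Envelope>
  <Body>
    <getCriteriaResponse>
      <criterion><id>1</id><text>nascar</text></criterion>
      <criterion><id>2</id><text>indycar</text></criterion>
    </getCriteriaResponse>
  </Body>
</Envelope>"""


class ApiClient:
    """Hypothetical API wrapper whose transport can be swapped out in tests."""

    def __init__(self, transport):
        self.transport = transport  # callable: request name -> raw XML string

    def get_criteria(self):
        root = ET.fromstring(self.transport("getCriteria"))
        return [
            {"id": int(c.findtext("id")), "text": c.findtext("text")}
            for c in root.iter("criterion")
        ]


def canned_transport(request_name):
    """Return the stored envelope instead of calling the live service."""
    return CANNED_RESPONSE


criteria = ApiClient(canned_transport).get_criteria()
```

The parsing code under test is the same code that would handle a live response; only the transport is replaced, so the test needs no login, no quota, and no network.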

I think I’m going to try the second way out for the bulk of my tests and then throw a few live tests in for a sanity check. I guess I could also keep local versions of the WSDLs and then do diffs against them to make sure they haven’t changed. The problem is that my API modules are written to change with the WSDL (or schema for Yahoo! Search Marketing). Taking that into account, I’m not going to keep local versions of the schema or WSDL … for testing anyway. It’s a good idea to keep local versions to save loading them over the internet every time an object is instantiated.

I’ll talk about how I test the object layer modules in a later post.

February 19, 2006

“Forking” with POE

Filed under: Tips & Tricks — Bob @ 9:17 am

I’ve found a good way to allay my previous fear of forking: using the POE library. The website says POE stands for “Perl Object Environment” but that doesn’t really explain much about it. It’s an easy way to timeslice the execution of your program: cooperative multitasking, something like threading inside a single thread. The explanation of the library can get pretty complex so I’ll leave it to the people who wrote the documentation site. There’s a comprehensive cookbook that has examples of many things that are likely similar to some process you’re trying to write.

How does this all relate to SEM? It allows me to run multiple HTTP requests to the same engine at the same time. This works incredibly well for my saveAll and getAll functions in my Net objects. Now I can just spray a bunch of keywords at these functions, they’ll split the keywords into their appropriate ad groups and then update them in parallel instead of in series. What used to take me 12 hours now only takes me 2.
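POE itself is Perl, but the shape of the idea, many requests in flight inside one thread, can be sketched with Python's asyncio as a rough analogy. The function names below are hypothetical, and the short sleep stands in for the HTTP call a real update would make.

```python
import asyncio
from collections import defaultdict


def group_by_ad_group(keywords):
    """Split (ad_group, keyword) pairs into one list of keywords per ad group."""
    groups = defaultdict(list)
    for ad_group, keyword in keywords:
        groups[ad_group].append(keyword)
    return groups


async def update_ad_group(ad_group, keywords):
    """Stand-in for one HTTP update; the sleep marks where the request would wait."""
    await asyncio.sleep(0.01)
    return ad_group, len(keywords)


async def save_all(keywords):
    """Start one update per ad group and wait for all of them together."""
    groups = group_by_ad_group(keywords)
    tasks = [update_ad_group(name, kws) for name, kws in groups.items()]
    return dict(await asyncio.gather(*tasks))


updated = asyncio.run(save_all([
    ("racing", "nascar"),
    ("racing", "indycar"),
    ("tickets", "nascar tickets"),
]))
```

While one update is waiting on the network, the event loop moves on to the next, which is where the time savings come from when the work is mostly waiting on HTTP responses.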

January 31, 2006

The terror of forking and LWP::Parallel

Filed under: Tips & Tricks — Bob @ 6:37 pm

I’m not sure about anyone else but I hate using the raw fork() command in Perl. It may just be a psychological thing but even in programs I know will run well with good synchronization, I always harbor a deep fear that the parent will start spawning zombie processes like an undead rabbit. Up until now, I’ve had to deal with that fear in my campaign updating processes. I fork both Google and YSM updates and it saves a ton of time.

Today, however, I was directed toward LWP::Parallel. It allows you to run HTTP calls in parallel. There’s also an LWP::Parallel::UserAgent. This is perfect. I currently inherit from LWP::UserAgent and override the request methods for both Google and YSM. Then I just use these modules as I would LWP::UserAgent. Now I’m going to add some logic to take advantage of this parallelization. I’ll keep you apprised of the details.

December 13, 2005

Making web service APIs behave the same

Filed under: Tips & Tricks — Bob @ 5:18 pm

YSM and Google’s account management APIs look incredibly different but do, essentially, the same thing. That is, they allow you to manage your SEM campaigns through a web service. Ask Jeeves has a sponsored listings product with an API for larger advertisers and MSN will soon be releasing its sponsored search product. So, how are you going to keep a similar programming interface across all of these products? I’ll tell you how I’m doing it.

Essentially, these web services boil down to a CRUD (Create, Retrieve, Update, Destroy) service for your search advertising campaigns. So, why not create a class that does all of these things for each part of a campaign?

Firstly, I want to express the way of handling CRUD that is the most intuitive to me. I use four methods in all of my classes. They are:

new() or new(ID):
Instantiates an empty object or tries to instantiate a populated object if an ID is supplied. Dies if the ID does not exist.

exists(ID):
Instantiates an object with the ID supplied or returns false.

remove():
Deletes the object.

save():
Creates the object if none exists yet or updates it if it already exists.
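The four methods can be sketched in Python along these lines. This is an illustration of the interface only: the `Record` base class and its in-memory `_store` are made up for the example, with the dictionary standing in for the remote web service.

```python
class Record:
    """Base class giving every campaign object the same four operations."""

    _store = {}    # stand-in for the remote service, keyed by (class name, ID)
    _next_id = 1

    def __init__(self, id=None):
        # new(): empty object; new(ID): load it, or raise if the ID is unknown
        self.id = id
        self.fields = {}
        if id is not None:
            self.fields = dict(Record._store[(type(self).__name__, id)])

    @classmethod
    def exists(cls, id):
        """Return a populated object for this ID, or False if there is none."""
        return cls(id) if (cls.__name__, id) in Record._store else False

    def save(self):
        """Create the object if it has no ID yet, otherwise update it."""
        if self.id is None:
            self.id = Record._next_id
            Record._next_id += 1
        Record._store[(type(self).__name__, self.id)] = dict(self.fields)

    def remove(self):
        """Delete the object from the store."""
        del Record._store[(type(self).__name__, self.id)]


class Keyword(Record):
    pass


kw = Keyword()
kw.fields["text"] = "nascar"
kw.save()
found = Keyword.exists(kw.id)
kw.remove()
missing = Keyword.exists(kw.id)
```

Each engine-specific class only has to supply its own storage calls; the calling code sees the same four operations everywhere.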

For Google, I’ve written 5 classes:

  • Account
  • Campaign
  • AdGroup
  • Creative
  • Keyword

For YSM, there are only 3 classes:

  • Account
  • Category
  • Listing

This allows for code that looks kind of like this (in Perl, by the way):

# Get the ad group
my @ags = Google::AdGroup->getAll;
my $ag = pop @ags;

# Make the keyword
my $kw = Google::Keyword->new;
$kw->maxCpc(40000); # micros (millionths of a currency unit), remember
$kw->text('somethin');
$kw->destinationUrl('http://www.example.com');

# Save the keyword in the ad group
$ag->addKeyword($kw);

More intuitive, don’t you think?