SEM Development

December 11, 2008

Top Keywords — Yahoo! Pipes

Filed under: SEO — Bob @ 3:58 pm

In my last post, I described a way to extract data from Google Analytics into an XML document for consumption by Yahoo! Pipes. Now I’d like to take this to its logical conclusion and create an RSS feed from this XML document.

1) XML Attachment from Google Groups
This is the module that takes an XML attachment from Google Groups. I created it in the previous post. This is another thing I love about Pipes. You can create a custom pipe, wrap it up, and use it in other pipes. It’s a great way to practice the art of DRY (Don’t Repeat Yourself). Every time you see you’re doing something more than once, just create a separate pipe for the process and use it over and over again. But, I digress. The output of the “XML Attachment” process is a data feed containing whatever information is in the last XML document in your Google Groups thread.

Yahoo! Pipe -- XML Document

2) Fetch Data Loop
I’m going to take this attachment and go straight to the list of keywords. It’s down a couple of levels in the XML document as you can see when I put “Report.Table.Row” in the “Path to item list.” Here’s a little better picture of the data I’ll be getting:

Example XML

This process returns a list of the rows in the XML attachment.

Fetch Data Loop -- Y! Pipes

3) Sub-element

There’s a lot of extraneous data that comes along with the keyword. I’m going to ignore it and just grab the keyword. It happens to be tagged as “PrimaryKey” in the feed:

XML Output -- Y! Pipes

After this process runs, I’ll have a list of only the keywords.

Sub element -- Y! Pipes

4) String Builder Loop
The sub-element process outputs the keyword with the key “item.content.” This isn’t correctly formatted for RSS readers so I’m going to change item.content to item.title. RSS readers should be able to pick up the feed now.

String Builder -- Y! Pipes

There you go, an RSS feed of the top keywords that result in traffic to your site.
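For anyone who’d rather see the same three steps outside of Pipes, here’s a rough Python 3 sketch. The sample XML is made up (a real export carries many more fields), and I’m assuming Report is the root element; only the Report.Table.Row path and the PrimaryKey tag come from the pipe above. The function name keyword_titles is just for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample; the Visits field is purely illustrative.
SAMPLE = """<Report>
  <Table>
    <Row><PrimaryKey>nascar</PrimaryKey><Visits>12</Visits></Row>
    <Row><PrimaryKey>daytona 500</PrimaryKey><Visits>7</Visits></Row>
  </Table>
</Report>"""


def keyword_titles(xml_text):
    """Return one {'title': keyword} item per Report.Table.Row."""
    root = ET.fromstring(xml_text)  # assumed to be the <Report> element
    items = []
    for row in root.findall("./Table/Row"):   # step 2: path to item list
        keyword = row.findtext("PrimaryKey")  # step 3: keep only the keyword
        if keyword:
            items.append({"title": keyword})  # step 4: expose it as the title
    return items


print(keyword_titles(SAMPLE))
```

Each dictionary plays the role of one feed item whose title is the keyword, which is all an RSS reader needs to list it.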

December 8, 2008

Google Analytics API - Yahoo! Pipes Retread

Filed under: SEO — Bob @ 9:52 pm

In preparation for a presentation on SEO Tools at SES Chicago this year, I’ve been rounding up some sites I find most useful for SEO. Google Analytics tops the list. Since the meat of my speech will be explaining how to use Yahoo! Pipes to glue SEO information together, I need my tools to have APIs. Google Analytics doesn’t yet.

After a little bit of searching, I found someone else’s novel approach to putting an API on Analytics. It involves scheduling an XML report to be sent to a Google Groups page and then accessing that report through Yahoo! Pipes. I don’t know if Google saw this guy’s article and intentionally changed how Groups handles attachments or, more likely, it was just changed in a recent release for other reasons, but part of his recipe no longer works. Scheduling the report and setting up the group are still prerequisites and I’ll let him explain:

Setting up the Google Group

Since Google Analytics doesn’t provide an API, or allow you to link directly to any exported reports, we’ll use a Google Group to host the files which we’ll schedule Google Analytics to email to us. When you setup your Google Group, choose the Announcement-only option. Once created, under the Group settings menu item, select Access and make sure that Anybody can view group content, Do not list this group and People have to be invited are all selected. This is so that no one else can post to the group, which would cause issues when trying to retrieve the Analytics message. Keeping the group unlisted makes it less likely for someone to stumble across your Analytics reports when searching Google Groups. Although it would be preferential to make the group private, this would prevent public access to the feeds for the group, which we’ll need later.

While we could email our reports directly to the Google Groups email address, each message would then contain an “opt-out” link because it’s not the email address we’ve got registered with Google Analytics. Given that our messages will be publicly available, we’ll be using Gmail to forward the messages from the same Gmail address we use for Google Accounts so that if anyone manages to find the Google Group, they can’t stop our scheduled report. Simply create a new filter, looking for any email with Analytics in the subject that has attachments and have Gmail forward the email to your Google Group. (You can choose to “skip the inbox” so you don’t have automated reports cluttering up your inbox too.)

Setting up Google Analytics

In Google Analytics, under the Content section, view the Top Content report and change Show rows from 10 to 50. (You can’t configure how many results to include in your report any other way; it just remembers the last setting you selected.) Now click the Email link button near the top of the page, beneath the page title. Select the Schedule tab, change the report format to XML, set the date range/schedule to Monthly (unless you have a really active blog, then you might want to keep it on Weekly) and click the Schedule button at the bottom. Just to test everything, select the Send Now tab, choose XML as the format and click the Send button.

If everything worked correctly, after a few seconds your Google Group should have a Top Content XML report in it! :o)

I’ve created a Yahoo! Pipe that takes the base address of your group, finds the latest attachment, and returns the XML. If you want, you could actually just stop reading here, clone my pipe and start using it immediately. If you’d like a little more explanation, however, I’ll oblige.

You can check out the full pipe but I’ll go over each part individually. Quick nomenclature note; I’ll be calling each of the little boxes a “process.” Unix habits die hard.

1) Group Page (user input):
This allows any user of the pipe to input a custom page. This process just passes that user input into the pipe.

Group Page (Y! Pipe)

2) Feed Auto-Discovery:
If the page includes a data feed, either RSS or Atom, this process will return it along with its address.

Feed Auto-Discovery (Y! Pipes)

3) Filter:
Pretty self-explanatory: the process filters any data coming into it. In this case, it returns only the RSS feed. I doubt it really matters, but I had to pick one.

Filter (Y! Pipes)

4) Fetch Feed Loop:
The data after the filter only includes one item, but I still have to loop over it to get to the internal data. In this case, I want to get the RSS feed address and return the actual feed.

Fetch Feed Loop (Y! Pipes)

5) Truncate:
Takes a list of items and returns however many you want. In this case, I just want the first, and latest, item in the feed. This translates into the latest post in the group.

Truncate (Y! Pipes)

6) Links Loop:
“Links” is a process I created. It returns all href links on an HTML page. Specifically, it’s returning all links in the first post of your group.

Links Loops (Y! Pipes)

7) Filter:

This second filter will take the list of links and only return those containing the characters “/attach/” and “.xml”. That should give us the link to the XML attachment in the post.

Filter (Y! Pipes)

8) Truncate:

The truncate here is purely a precaution. It’s unlikely the filter will return any more than one link, but if you’ve accidentally sent two reports, this will grab the first one only.
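The second filter and the truncate that follows it boil down to very little logic. Here’s a minimal Python sketch of the pair; the URLs are invented placeholders, not real Google Groups addresses, and first_xml_attachment is a name I made up for the example.

```python
def first_xml_attachment(links):
    """Keep links containing both '/attach/' and '.xml'; return the first, or None."""
    matches = [link for link in links if "/attach/" in link and ".xml" in link]
    return matches[0] if matches else None


# Hypothetical links, as the Links process might return them
links = [
    "http://groups.example.com/group/demo/topics",
    "http://groups.example.com/group/demo/attach/abc123/report.xml?part=2",
    "http://groups.example.com/group/demo/attach/def456/report.xml?part=2",
]
print(first_xml_attachment(links))
```

If two reports were sent by accident, only the first matching link survives, which is exactly what the truncate is there for.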

9) Links Loop:
Oh, you thought that opening the XML attachment’s link would give you an XML attachment? Oh dear heart, I’ll forgive you for your naivete. Instead, Google sends you to another page with the “real” link. This process grabs that link for you.

10) Regex:

Regex is short for “regular expression.” Regular expressions are a very powerful way to find and replace text. So powerful, in fact, that a full explanation is beyond this post. Entire books have been written on this topic, the best by Jeffrey Friedl. This regex is very simple: it just deletes any part of the link that looks like “amp;”. The reason is that Google returns the attachment link as encoded HTML. In HTML, each “&” is encoded as “&amp;”. I need to switch it back to create the correct attachment URL.
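The same substitution is a one-liner in most languages. A small Python sketch, using an invented URL and a function name of my own choosing:

```python
import re


def decode_ampersands(url):
    """Delete every literal 'amp;' so an HTML-encoded ampersand becomes a bare '&'."""
    return re.sub(r"amp;", "", url)


# Hypothetical encoded link; the real one comes from the Google Groups page
encoded = "http://groups.example.com/attach/abc123/report.xml?part=2&amp;view=1"
print(decode_ampersands(encoded))
```

Deleting “amp;” is blunt but enough for this link. If other HTML entities could show up, Python’s html.unescape handles all of them and is probably the safer choice.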

Regex (Y! Pipes)

Et Voila! A fully formed URL for retrieving your XML attachment from Google Groups. Once you’ve set up Analytics to email the report to your Google Groups page, just plop this pipe into the beginning of your own pipe and you’ll be able to access the correct XML.

October 4, 2008

Playing around with Ruby and Digg

Filed under: Tips & Tricks — Bob @ 7:07 pm

“The Google Cache” blog is on my RSS list and an interesting post about Digg SEO came up recently.

It details a way to make yourself known to other diggers that happen to be power users, thus opening the door to get your stories dug. It seemed pretty easy to code up and I’ve been playing with Ruby lately so I thought it’d be fun to solve the first half of the tedium: finding the top diggers.

This code is my very quick reply. It uses John Wulff’s digg-ruby library:

Finding this library was the best part of this little exercise. It includes some incredible meta-programming.

You can either look at my small, undocumented Digg class or clone it:

git://github.com/rbriski/top-digger-tool.git

June 19, 2008

CouchDB and wrangling large amounts of data

Filed under: Python, Tips & Tricks — Bob @ 5:39 pm

CouchDB is a new toy I’ve been playing with recently. It’s a document database that lends itself to semi-structured data. I’ve been using it for storing report data. Report data goes so well with CouchDB because it’s usually structured but the schema is fluid. To be clearer, the structure of reporting data changes frequently, especially during the development of an application. To use an example from my primary field of work, say a client wants to track impressions and clicks by keyword. No problem, the document schema would look something like this:

{
  "date": 1211612400,
  "keyword": "nascar",
  "impressions": 450,
  "clicks": 121
}

Excellent, we have data for one keyword during one day. We can create documents for each keyword during each day and then perform a roll up of data for any time period using CouchDB’s “views”. Views allow you to create an index on any or all of the attributes of a document. The most obvious view for this type of document would be one with the keyword and date as the key. To create a view, CouchDB takes a map function written in JavaScript. For the view I just described, the map function would look like this:

function (doc) {
  if(doc.keyword && doc.date) {
    emit([doc.keyword, doc.date], doc);
  }
}

This function will return all documents sorted by the keyword and date. Documents are just a series of key/value pairs, so you can easily write a reduce function to get total amounts of impressions and clicks for any date period. Christopher Lenz has written an incredible Python API for interfacing with CouchDB. I’ll include a little code to show how I’d take a keyword and find total clicks for a time period:

import couchdb
from functools import reduce

server = couchdb.Server('http://myserver.com:5984')
db = server['mydb']

# Assume I've created a view entitled "by_keyword"
rKeyword = db.view('_view/myviews/by_keyword')

# The library treats all documents as one big array. Great use
# of Pythonic concepts. Keys follow the view's [keyword, date] order.
docs = rKeyword[['nascar', startDate]:['nascar', endDate]]
# Start the sum at 0 so the first row's clicks are counted too
clicks = reduce(lambda x, y: x + y.value['clicks'], docs, 0)
print(clicks)

I’ve done some handwaving around the date stuff. I store dates as seconds since the epoch in CouchDB because they’re easier to sort on. I have a sneaking suspicion that I could actually save them as JavaScript date objects, but I’ve never tried it.

Alright, so we have a quick way to retrieve and massage data. Now the client starts wanting to include which account the keyword is in. With a relational database, this means a schema change and possible lost data. What a pain. With CouchDB, it means adding a key/value pair to the document. It’ll now look like this:

{
  "date": 1211612400,
  "keyword": "nascar",
  "impressions": 450,
  "clicks": 121,
  "account_id": 4
}

But what about the code to roll up performance data? What about previous documents not fitting the new schema? Don’t worry about it. None of those things need to change, and we continue to get the same data we’ve always had. Now, if we want to start splitting the data by account, we need a new view, but we don’t have to change any of the old stuff. We just create a new view:

function (doc) {
  if(doc.account_id && doc.keyword && doc.date) {
    emit([doc.account_id, doc.keyword, doc.date], doc);
  }
}

Et Voila! You can now pull performance by account, keyword and date. I particularly like this arrangement because it means, even if new attributes need to be tracked, it won’t affect current code.
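To make the “old documents don’t break” point concrete, here’s a small Python stand-in for the two map functions. It only imitates the idea (CouchDB runs the real JavaScript maps and has its own key collation), and the second document’s numbers are invented for the example.

```python
def by_keyword(doc):
    """Stand-in for the first map function: key on keyword and date."""
    if doc.get("keyword") and doc.get("date"):
        yield [doc["keyword"], doc["date"]], doc


def by_account(doc):
    """Stand-in for the second map function: also requires account_id."""
    if doc.get("account_id") and doc.get("keyword") and doc.get("date"):
        yield [doc["account_id"], doc["keyword"], doc["date"]], doc


def run_view(map_fn, docs):
    """Apply a map function to every document and sort the rows by key."""
    rows = [row for doc in docs for row in map_fn(doc)]
    return sorted(rows, key=lambda row: row[0])


old_doc = {"date": 1211612400, "keyword": "nascar", "impressions": 450, "clicks": 121}
new_doc = {"date": 1211698800, "keyword": "nascar", "impressions": 300, "clicks": 80,
           "account_id": 4}
docs = [new_doc, old_doc]

print(len(run_view(by_keyword, docs)))  # old and new documents are both indexed
print(len(run_view(by_account, docs)))  # only the document carrying account_id
```

The old view keeps seeing every document, while the new view quietly skips the ones written before account_id existed.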

There is one more concept that I like using for data wrangling on large reporting datasets. I call it currying (strictly speaking, it’s a function that returns a closure) and it allows me to cut down on the amount of code I’m writing. The problem is that I usually need to operate over a number of different attributes. The example above only sums the value of clicks. I’ll almost always need to get more than just clicks. The technique involves writing a function that returns another function. You’ll see that the reduce statement above uses a lambda function that reduces on clicks. I’m going to write a function that returns a function that will sum any attribute value and use that in the reduce function:


def _add(attribute):
  def toReduce(x, y):
    return x + y.value[attribute]
  return toReduce

totalImpressions = reduce(_add('impressions'), docs, 0)
totalClicks = reduce(_add('clicks'), docs, 0)

The code above is simple, but I think it shows how incredibly powerful a few small programming concepts and CouchDB can become. It’s great for large datasets with information that doesn’t have to be relational. I use it specifically for huge amounts of web analytics data. I don’t lose any granularity, it takes almost no time to maintain and it’s incredibly fast. CouchDB just recently added a reduce option to its views. I haven’t had time to play with it, but it looks great. I plan on using it to take even more tedious operations out of my applications.
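If you want to try the function-returning-a-function trick without a CouchDB server handy, here’s a self-contained Python 3 sketch. The Row tuple is a stand-in I made up for the library’s view rows, and the second row’s numbers are invented.

```python
from collections import namedtuple
from functools import reduce

# Stand-in for a view row; real rows come from the CouchDB library.
Row = namedtuple("Row", ["key", "value"])


def add_attribute(attribute):
    """Return a reducer that adds row.value[attribute] to a running total."""
    def to_reduce(total, row):
        return total + row.value[attribute]
    return to_reduce


rows = [
    Row(["nascar", 1211612400], {"impressions": 450, "clicks": 121}),
    Row(["nascar", 1211698800], {"impressions": 300, "clicks": 80}),
]

total_impressions = reduce(add_attribute("impressions"), rows, 0)
total_clicks = reduce(add_attribute("clicks"), rows, 0)
print(total_impressions, total_clicks)
```

The initial value of 0 matters: without it, reduce would hand the first row itself to the reducer as the running total.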

March 11, 2008

Pylons and Django newforms — SelectMultiple problems

Filed under: Python, Tips & Tricks — Bob @ 2:38 pm

Pylons has become my framework of choice for any new projects where I get to choose the technology. I’ve had trouble, however, finding a form library I like. Max Ischenko posted an entry about how to get Pylons to work with Django’s newforms library and I’ve been hooked ever since. It just seems to stay out of my way better than any other form framework out there.
I ran into a problem a couple of days ago when using the SelectMultiple widget. The MultipleChoiceField expects a list from the SelectMultiple widget. In the Django framework, this isn’t a problem because it checks for a MultiValueDict:

def value_from_datadict(self, data, files, name):
  if isinstance(data, MultiValueDict):
    return data.getlist(name)
  return data.get(name, None)

Unfortunately for me, and any other WSGI framework users, MultiValueDict is a Django-specific data structure. I also think that isinstance is not a very Pythonic way to handle this. Duck typing seems the correct way to do this and I’ve used it in my patch. It checks to see if the data object has a getall attribute. This tells me it’s a Paste MultiDict object, which is more prevalent in other WSGI frameworks, such as Pylons.

Just take the patch I’ve created to fix the SelectMultiple widget and apply it to django/newforms/widgets.py. Now the widget will return your values in a list, exactly how MultipleChoiceField wants it.
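The duck-typing idea looks roughly like this. It’s a simplified, standalone sketch rather than the patch itself: the function drops the widget’s self and files arguments, the plain-dict fallback is my own assumption, and FakePasteDict is a toy class that mimics only the getall method.

```python
def value_from_datadict(data, name):
    """Return every value submitted under name as a list, whatever the dict type."""
    if hasattr(data, "getall"):    # Paste-style multi-dict (Pylons)
        return data.getall(name)
    if hasattr(data, "getlist"):   # Django-style MultiValueDict
        return data.getlist(name)
    value = data.get(name)         # plain dict: wrap a lone value
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


class FakePasteDict(dict):
    """Toy stand-in that mimics only the getall method."""
    def getall(self, name):
        return list(self.get(name, []))


print(value_from_datadict(FakePasteDict(colors=["red", "blue"]), "colors"))
print(value_from_datadict({"colors": "red"}, "colors"))
```

Checking for the method instead of the class means the widget no longer cares which framework produced the request data.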

December 14, 2007

AdWords API, Reports, and Python (The Decent, the Ugly and the Good, respectively)

Filed under: Google AdWords, Python — Bob @ 8:28 am

I’ve always been annoyed with the way the ReportService in the AdWords API works with most interpreted languages. Yes, you could say it’s a problem with SOAP (and it is), but I don’t buy that. Google engineers use Python enough that you’d think they would be more sensitive to this problem. The sample code given for reports is … disgusting. If you decide it’s too terrible to look at directly, let me fill you in. They build the XML manually in a string and then do something akin to an sprintf to fill in values. I know it’s just an example, but come on, that’s terrible.

I decided I couldn’t live like this and went about finding a better way. A few quick explanations:

  • There may be a way to do this better in ZSI. I haven’t tried, I’m using SOAPpy.
  • I’ll be using my AdWords client, but you can use any one you want, it’s just the conversion of the report data structure into a SOAPpy type that matters.

Alright, let’s get started. First I’ll set up the client:

import adwords.client as client
 
aw = client.AdWordsClient(
  email='...',
  password='...',
  client_email='...',
  developer_token='...',
  application_token='...',
  user_agent='...'
  )

Now to build the report data structure. Of course, this is just an example so any of the data in this report could be different:

reportJob = {'selectedReportType': 'Account'}
reportJob['aggregationTypes'] = ['Daily']
reportJob['startDay'] = '2007-10-31'
reportJob['endDay'] = '2007-11-05'
reportJob['selectedColumns'] = [
  'CustomerName',
  'AdWordsType',
  'CPC',
  'CPM',
  'Clicks'
]

I’ll now build a SOAPpy representation of the data. With most API calls you would be able to just plop that data structure into the call and SOAPpy would take care of it. With reports, however, Google makes a small statement at the top of the DefinedReportJob page:

DefinedReportJob   - V11 (AdWords API Developer's Guide)

To do this, you need to create a SOAPpy type and then set some attributes on that type. Here’s how I do it:

import SOAPpy

# Wrap the dict in a SOAPpy struct so XML attributes can be attached to it
reportJob = SOAPpy.Types.structType(reportJob)
reportJob._setAttr('xmlns:impl', 'https://adwords.google.com/api/adwords/v11')
reportJob._setAttr('xsi:type', 'impl:DefinedReportJob')

As you can see, SOAPpy offers up some types. The structType is the best for dict objects, but if you run into other typing problems it supplies array types, string types, numerical types, etc. Now the only thing left to do is schedule the job:

jobId = int(aw.scheduleReportJob(reportJob))

It’s actually better practice to first validate your job with the new validateReportJob function in v11 of the API, but you get the picture.

There’s actually one additional problem I ran into when checking the status of a job. The Java service expects a long variable but if I cast it to long in Python, the API returns an error because it’s passed as a BigInt. I have no idea why it does this, but to get around it, I used some of SOAPpy’s types again:

status = aw.getReportJobStatus(SOAPpy.Types.longType(jobId))

Make sure your jobId variable is an int or SOAPpy will throw an exception.

So, I was able to get reports up and running in Python and I didn’t have to resort to archaic search and replace functionality in my code. I’ll probably be adding this to my AdWords client soon so I don’t have to think about it every time I create a report.

November 15, 2007

Extending the Python AdWords API client to understand “unbounded”

Filed under: Google AdWords, Programming, Python, pyadwords-client — Bob @ 4:48 pm

I’ve been adding some functionality to the sample AdWords API client. After using it for a couple of minutes I found that it does not understand when a list should be returned. In other words, if I call an API method that should return a list, getAdGroupList for example, and only one ad group is returned, it is returned as a single instance, not a list of length one. So now instead of:

for adgroup in service.getAdGroupList(listofAdGroupIds):
  print adgroup.name

I have to do some duck typing on each return value:

adgroups = service.getAdGroupList(listofAdGroupIds)
if not hasattr(adgroups, 'sort'):
  adgroups = [adgroups]
for adgroup in adgroups:
  print adgroup.name

It’s more than doubled the lines of code needed just to call a simple API method. Now, I could just overload the service to turn all returned values into a list. The problem, however, is that some methods aren’t supposed to return a list, updateAdGroup, for example. To make the client understand when a list is needed I have to deconstruct the WSDL. I’ve already been loading the WSDLs to extract API method names so I just need to find where the return value was defined. For the AdWords API, it’s listed pretty far down. Here’s a look at the WSDL where it defines the getAdGroupList response element:

<element name="getAdGroupListResponse">
  <complexType>
    <sequence>
      <element name="getAdGroupListReturn" maxOccurs="unbounded" type="impl:AdGroup"/>
    </sequence>
  </complexType>
</element>

In the getAdGroupListReturn element you’ll see the maxOccurs value is unbounded. This means the return value should be a list. I’m not sure why, but the SOAPpy module doesn’t seem to understand this. To fix the behavior, I first had to find out where SOAPpy was putting the *methodName*Return element in the data structure. This was pretty much trial and error with a lot of dir() commands until I found the culprit. I’m not going to post the convoluted code here but, if you’re interested, you can look through it yourself. The method name is getPluralMethods.
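For a feel of what finding the plural methods involves, here’s a Python 3 sketch that reads the schema XML directly with ElementTree. To be clear, this is not how getPluralMethods works (that one digs through SOAPpy’s parsed structures); it’s an alternative illustration, and the updateAdGroupResponse element in the sample is made up to give a non-list case.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

# Trimmed, partly invented schema fragment for illustration only
WSDL_FRAGMENT = """<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:impl="https://adwords.google.com/api/adwords/v11">
  <element name="getAdGroupListResponse">
    <complexType><sequence>
      <element name="getAdGroupListReturn" maxOccurs="unbounded" type="impl:AdGroup"/>
    </sequence></complexType>
  </element>
  <element name="updateAdGroupResponse">
    <complexType><sequence/></complexType>
  </element>
</schema>"""


def plural_methods(schema_text):
    """Return method names whose response holds an element with maxOccurs="unbounded"."""
    root = ET.fromstring(schema_text)
    plurals = set()
    for response in root.findall(XSD + "element"):
        name = response.get("name", "")
        if not name.endswith("Response"):
            continue
        for child in response.iter(XSD + "element"):
            if child is not response and child.get("maxOccurs") == "unbounded":
                plurals.add(name[:-len("Response")])
    return plurals


print(plural_methods(WSDL_FRAGMENT))
```

The result is a set of method names, which is all the wrapping step needs to decide which calls must always hand back a list.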

Ok, so I’ve gotten a list of method names that need to return a list. What now? Well, I created a method named expectsList that takes another method as an argument:

def expectsList(self, fn):
  """Decorator that guarantees that the
  return value of a function is a list
 
  Args:
    fn: function
  Returns:
    function
  """
  def returnList(*args, **kwargs):
    out = fn(*args, **kwargs)
 
    #Quack, quack: duck typing
    if not hasattr(out, 'reverse'):
      if not hasattr(out, 'id'):
        #Empty return? Return an empty list
        return []
      #Single return element? Return it in a list
      return [out]
    #Otherwise, it must already be a list
    return out
  return returnList

In the comments, I call this a decorator although it doesn’t technically follow the decorator syntax in the next bit of code where I wrap it around plural API methods:

plurals = self.getPluralMethods(wsdl)
for meth in wsdl.methods.keys():
  methFn = getattr(service, meth)
  if meth in plurals:
    methFn = self.expectsList(methFn)
  setattr(self, meth, methFn)

So, as you can see I’m wrapping the methods that expect a list in the expectsList function. This will make sure that all data returned from these methods is in the correct format and it’s been working so far.

November 12, 2007

Extending the python AdWords client sample code

Filed under: Google AdWords, Programming, Python, pyadwords-client — Bob @ 5:12 pm

I’ve recently started using Python and I’m really starting to love it. Now, with my reintroduction to SEM, I’ve had to figure out whether I should port my old Perl clients over to Python or just see if there’s already something written in Python. For AdWords, there’s a whole site of code samples in Python. One of these samples is actually a small Python client for AdWords. I loaded it up and started playing around with it. It’s a good starting point but, as usual, I want more.

First, and this is pretty much a port of some Perl code I had, I wanted to be able to access all the API methods from one object. I mean ALL of them. I don’t want to load each service separately. I want to be able to access the getAllAdWordsCampaigns method from the same object I access getAllAdGroups. I know, I know, “What about method name collisions?” or “That’s not a strict interpretation of OO design.” is what I’m hearing. Both are valid concerns. However, here’s what I think:

Method name collisions: I haven’t seen any evidence that Google will start overlapping method names in different services. If they had that intention they would have used getAll as a method name in different services, not getAllAdWordsCampaigns, getAllAdGroups, getAllCriteria, and so on and so forth. If, for some reason, they decide to switch things up, I’ll have to rethink, but I’m willing to bet they won’t.

OO Design: This is easy. I don’t care. I don’t like SOAP. I see it as a play by Sun and MS to create such a complicated protocol that people are forced to use their language to implement it completely and correctly. I’ll stick with the view that the API is one big object when I’m replicating between my local db and the remote API. I’ll employ OO techniques when I’m using my local DB as a model layer.

So, I’ve created a small python AdWords API client in Google Code. I’ll be writing about it now and again. Here are some of the first changes I’ve made:

All API methods are loaded as actual methods in the API objects. By that I mean you can just do:

awclient = AdWordsClient(**loginParams)
campaigns = awclient.getAllAdWordsCampaigns(1)

Yup, you don’t need to get any services or call any wrapper “call” methods. Just call the method directly from the client object. I was able to do that by loading all the WSDLs and extracting the method names. I use the setattr command to add the API call as an actual method. Also, the WSDLs are cached so you don’t have to keep grabbing them remotely.
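Stripped of the SOAP plumbing, the setattr trick looks something like the sketch below. The two Fake classes and FlatClient are stand-ins invented for this example, not classes from the client, and the collision check is an extra I added for illustration.

```python
class FakeCampaignService(object):
    """Stand-in for a SOAP service proxy; not a real AdWords class."""
    def getAllAdWordsCampaigns(self, dummy):
        return ["campaign-1", "campaign-2"]


class FakeAdGroupService(object):
    """Second stand-in service with its own method."""
    def getAllAdGroups(self, campaign_id):
        return ["adgroup-for-%s" % campaign_id]


class FlatClient(object):
    """Expose every listed method of every service on one object."""
    def __init__(self, services):
        # services maps a service proxy to the method names found in its WSDL
        for service, method_names in services.items():
            for name in method_names:
                if hasattr(self, name):
                    raise AttributeError("method name collision: " + name)
                setattr(self, name, getattr(service, name))


client = FlatClient({
    FakeCampaignService(): ["getAllAdWordsCampaigns"],
    FakeAdGroupService(): ["getAllAdGroups"],
})
print(client.getAllAdWordsCampaigns(1))
print(client.getAllAdGroups(7))
```

Because getattr returns a bound method, each call still runs against its original service even though it’s reached through the flat client.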

This project is in its infancy and really just tailored to what I want right now. Of course, I’m always open to suggestions. Again, the code is at http://pyadwords-client.googlecode.com. It’s under the “Source” tab.

November 1, 2007

Back in the saddle

Filed under: Uncategorized — Bob @ 8:06 am

As you can see, there have been no posts for quite a while now. The reason has to do with me moving teams within Yahoo! to something completely different than SEM, that is Yahoo! Search operations. In two weeks, however, I’ll be leaving Yahoo! to work as a consultant for Raybeam Solutions. Generic website aside, I’m excited to get back into SEM. Raybeam has contracts with some pretty hard hitters in the industry and I’m looking forward to getting my feet wet again.

May 23, 2006

Google AdWords releases Version 4 of its API

Filed under: Google AdWords — Bob @ 3:00 pm

Some highlights of the recent version upgrade of Google AdWords:

Local Time Zone Support
This is nice.  Now I don’t have to figure out the timezone myself.  I think it makes my code a little more portable since it’s one less thing I have to configure.  I’m not sure if it was part of this release but it looks like the ReportService uses endDay (format YYYY-MM-DD) instead of requiring the time as well in endDate (format YYYY-MM-DDTHH:MM:SS).  Now I don’t have to specify that I want the report from midnight to midnight.

Traffic Estimator

Another change in this service.  I think it highlights how hard forecasting is.  They replace avgPosition with lowerAvgPosition and upperAvgPosition.  Same goes for clickPerDay and cpc estimations.  Next step … changing the values to include insideTheBallpark and wayTheHellOff.

Zero Impression Reporting
This is fantastic.  Now I can see if Google has changed the status of any of my low traffic keywords without grabbing them all with getKeywordList.  A nice savings on my quota.  Unfortunately, it’s not active yet.

Unique Request ID
This may be good for internal auditing but not much else.  I doubt Google is going to use it as a reference number for tech support.
