HTML Scraping, Starring: Ruby, Hpricot and Firebug

by peter on 09/09/2009

I had a really sticky problem.  I’m building a web 2.0 application that aggregates data from disparate social media sites.  The good news is that many sites have really great REST APIs; the bad news is that many do not.  My nemesis was Google Blog Search.  Sorry Charlie, no API!

What to do?

I’ve burned several nights trying to work around this problem, with mixed and mostly disappointing results.  After plugging away and looking at dozens of different ways to tackle it, I finally settled on an approach suggested by igvita.com.  While this got me 90% of the way there, I had to do a bit of twiddling to get it to work properly.

First, the igvita.com article didn’t explain how to submit data.  No big deal, you can pass the search criteria in the query string of the request URL:

@url = "http://blogsearch.google.com/blogsearch?hl=en&ie=UTF-8&q=ruby+rails"

So far, so good.
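If you want to search for something other than "ruby rails", the query has to be URL-encoded before it goes into the query string. Here is a minimal sketch using Ruby's standard `uri` library. The helper name `blog_search_url` and the constant `BLOG_SEARCH_BASE` are my own inventions for illustration, and `URI.encode_www_form` needs Ruby 1.9.2 or later.

```ruby
require 'uri'

# Hypothetical constant and helper, not part of the original script.
BLOG_SEARCH_BASE = 'http://blogsearch.google.com/blogsearch'

# Builds the search URL for an arbitrary query string.
# URI.encode_www_form escapes each value and encodes spaces as "+".
def blog_search_url(query)
  params = URI.encode_www_form('hl' => 'en', 'ie' => 'UTF-8', 'q' => query)
  "#{BLOG_SEARCH_BASE}?#{params}"
end

puts blog_search_url('ruby rails')
```

For the query 'ruby rails' this produces the same URL as the hard-coded one above.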

The next trick is to use XPath to locate and snag the piece of data you’re interested in.  This is where Firebug completely rocks.  If you’re not using Firebug for web development, you’re either a glutton for punishment or have been living under a rock for quite a while.  In my particular case, the only piece of data I was interested in was the number of hits for the query:

[Screenshot: Google Blog Search results page for the "ruby rails" query]
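To get a feel for what an absolute XPath with positional indexes actually selects, here is a small offline sketch. It uses REXML (which ships with Ruby) instead of Hpricot so that it runs without any extra gems. The markup is made up for illustration and is much simpler than the real Google results page, and the method name `hit_count_text` is hypothetical.

```ruby
require 'rexml/document'

# Made-up sample markup, NOT the real Google Blog Search page.
SAMPLE_PAGE = <<~XML
  <html>
    <body>
      <div>header</div>
      <div>Results <b>1</b> - <b>10</b> of about <b>1,234</b></div>
    </body>
  </html>
XML

# Returns the text of the third <b> inside the second <div> of <body>,
# or nil when the path matches nothing.
def hit_count_text(markup)
  doc  = REXML::Document.new(markup)
  node = REXML::XPath.first(doc, '/html/body/div[2]/b[3]')
  node && node.text
end

puts hit_count_text(SAMPLE_PAGE)
```

Note that `div[2]` and `b[3]` count from 1, and that a path like this breaks as soon as the page layout shifts by a single element.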

The igvita.com article goes into detail on how to find the right XPath using Firebug.  Firebug is pretty slick and the article does a great job explaining how to use it, so I won’t go into the details here.  The catch is that the HTML needs to be parsed as XML (with Hpricot.XML rather than the plain Hpricot parser) before the XPath will work properly.  Below is the Ruby code in its entirety:

require 'rubygems'
require 'open-uri'
require 'hpricot'

@url = "http://blogsearch.google.com/blogsearch?hl=en&ie=UTF-8&q=ruby+rails"

begin
  # Hpricot RDoc: http://code.whytheluckystiff.net/hpricot/
  # Fetch the page and parse it as XML so the XPath from Firebug applies.
  xml = Hpricot.XML(open(@url).read)

  # Retrieve the number of hits
  number_of_hits = (xml/"/html/body/div[5]/table[3]/tbody/tr/td[2]/font/b[3]").inner_html
  puts "Number of hits: #{number_of_hits}"
rescue Exception => e
  # Print the error (network failure, parse error, etc.) and carry on
  print e, "\n"
end
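One follow-up: inner_html hands back a string, so if you want to store or compare the count you will probably want an integer. A minimal sketch, assuming the scraped text looks something like "1,234" (I have not verified the exact format Google returns). The helper name `parse_hit_count` is hypothetical.

```ruby
# Hypothetical helper: converts scraped text such as "1,234" into an Integer.
# Returns nil when the text contains no digits, for example when the XPath
# matched nothing and the scraped string came back empty.
def parse_hit_count(text)
  digits = text.to_s.gsub(/\D/, '')
  digits.empty? ? nil : digits.to_i
end

puts parse_hit_count('1,234')
```

Returning nil instead of 0 keeps "no match" distinguishable from a genuine zero-hit result.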

Happy hacking,

Peter


pramod 09/14/2009 at 1:24 pm

This is a good one for learning the basics of Hpricot using Firebug. Really helpful to novices in the field of Hpricot.
