Scraping HTML with Ruby + Hpricot and Firebug
I had a really sticky problem. I’m building a web 2.0 application that aggregates data from disparate social media sites. The good news is that many sites have really great REST APIs; the bad news is that many do not. My nemesis was Google Blog Search. Sorry Charlie, no API!
What to do?
I’ve burned away several nights trying to work around this problem, with mixed results. After plugging away and looking at dozens of different approaches, I finally settled on one suggested by igvita.com. While this got me 90% of the way there, I had to do a bit of twiddling to get it to work properly.
First, the igvita.com article didn’t explain how to submit data. No big deal. You can pass the search criteria in the query string of the request URL:
@url = "http://blogsearch.google.com/blogsearch?hl=en&ie=UTF-8&q=ruby+rails"
So far, so good.
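If the search terms aren’t hard-coded, the query string has to be escaped before it goes into the URL. Here is a minimal sketch using only Ruby’s standard library; the helper name blog_search_url and the BASE_URL constant are made up for illustration, and the parameters are simply the ones from the URL above.

```ruby
require 'uri'

# Endpoint copied from the URL above.
BASE_URL = "http://blogsearch.google.com/blogsearch"

# Hypothetical helper (not from the igvita.com article): builds the
# search URL for an arbitrary query, escaping spaces and special characters.
def blog_search_url(query)
  params = { "hl" => "en", "ie" => "UTF-8", "q" => query }
  "#{BASE_URL}?#{URI.encode_www_form(params)}"
end

puts blog_search_url("ruby rails")
```

For the query "ruby rails" this should produce the same URL as the hard-coded one, since URI.encode_www_form turns spaces into plus signs.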
The next trick is to use XPath to locate and snag the piece of data you’re interested in. This is where Firebug completely rocks. If you’re not using Firebug for web development, you’re either a glutton for punishment or have been living under a rock for quite a while. In my particular case, the only piece of data that I was interested in was the number of hits for the query.

The igvita.com article goes into detail on how to find the right XPath. It’s pretty slick, but I won’t go into it here. The problem was that the HTML needed to be parsed as XML (with Hpricot.XML) before the XPath worked properly. Below is the Ruby code in its entirety:
require 'rubygems'
require 'open-uri'
require 'hpricot'

@url = "http://blogsearch.google.com/blogsearch?hl=en&ie=UTF-8&q=ruby+rails"

begin
  # Hpricot RDoc: http://code.whytheluckystiff.net/hpricot/
  # Fetch the page and parse it as XML so the XPath below matches.
  # On Ruby 2.5+ use URI.open(@url); plain open(url) was removed in Ruby 3.0.
  xml = Hpricot.XML(open(@url).read)

  # Retrieve the number of hits using the XPath found with Firebug
  number_of_hits = (xml/"/html/body/div[5]/table[3]/tbody/tr/td[2]/font/b[3]").inner_html
  puts "Number of hits: #{number_of_hits}"
rescue StandardError => e
  # Network and parse errors end up here; print the message and carry on
  puts e.message
end
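The script above leaves number_of_hits as a string. If the count is needed as a number, something like the following sketch could convert it. It assumes the scraped text is a count that may contain thousands separators (for example "1,234"), which the original page markup may or may not guarantee; parse_hit_count is a hypothetical helper name.

```ruby
# Hypothetical helper: turn scraped text such as "1,234" into an Integer.
# Assumes the text is a count with optional separators; returns nil when
# no digits are present (for example if the XPath matched nothing).
def parse_hit_count(text)
  digits = text.to_s.gsub(/\D/, "")
  digits.empty? ? nil : digits.to_i
end

puts parse_hit_count("1,234,567").inspect
puts parse_hit_count("").inspect
```

Returning nil rather than 0 for empty input keeps "no match" distinguishable from a real count of zero, which matters because an absolute XPath like the one above stops matching as soon as the page layout changes.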