Tue Nov 18 12:27:05 EST 2008

Python and BeautifulSoup

I have always wondered how changes in politics and news affect the “most emailed” list of articles posted by the NY Times. So I've written a short script that parses and logs that list, which I can run from a cron job at regular intervals. This was a fun exercise and helped me learn some cool things about Python, including a very handy HTML/XML parsing library, BeautifulSoup. It was not part of my standard install (EPD), so I installed it with:

sudo easy_install BeautifulSoup

Extracting from my code, these are the parts that make it all work:

Import the modules:

import re, urllib2  # regular expressions and HTTP fetching, both used below
from BeautifulSoup import BeautifulSoup

Get the web page:

url = 'http://www.nytimes.com'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

Parse the page, looking for the mostEmailed section:

me = soup.find('div', id="mostEmailed").findAll('a', href=re.compile('.*'))

This line looks for the first div tag whose id is mostEmailed. That returns the entire section of list entry tags and their contents, but we only want the contents of the anchor tag within each list entry, so the findAll call returns all the anchor tags inside that div. I'm matching href against the .* regular expression which, of course, matches everything, so every anchor that has an href attribute comes back.

Extract the data:

melist = []   # article URLs, in list order
metitle = []  # article titles, in list order
for item in me:
    melist.append(item.get('href'))
    metitle.append(item.renderContents())

Finally, the URLs are extracted from the anchor tags by “getting” the href attribute, and the title of each article is returned by renderContents.
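To make the selection logic concrete without fetching the live page, here is a rough sketch of the same two steps (find the div with id mostEmailed, then collect each anchor's href and text) using only the standard library's html.parser in Python 3. This is a swapped-in technique, not BeautifulSoup's API and not my actual script: the sample markup, the MostEmailedParser class and the most_emailed_links function are all made up for illustration. Unlike renderContents, it keeps only the text and drops any markup inside the anchor.

```python
from html.parser import HTMLParser

# Hypothetical sample markup; the real page is far larger and may differ.
SAMPLE_HTML = """
<div id="other"><a href="/ignored.html">Ignored</a></div>
<div id="mostEmailed">
  <ol>
    <li><a href="http://example.com/a.html?em">First article</a></li>
    <li><a href="http://example.com/b.html?em">Second <b>article</b></a></li>
    <li><a name="no-href">Anchor without href</a></li>
  </ol>
</div>
"""


class MostEmailedParser(HTMLParser):
    """Collect (href, text) for anchors inside the div with the target id."""

    def __init__(self, target_id="mostEmailed"):
        super().__init__()
        self.target_id = target_id
        self.div_depth = 0        # greater than 0 while inside the target div
        self.current_href = None  # set while inside an anchor that has an href
        self.current_text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1   # nested div inside the target
            elif attrs.get("id") == self.target_id:
                self.div_depth = 1    # entered the target div
        elif tag == "a" and self.div_depth > 0 and "href" in attrs:
            self.current_href = attrs["href"]
            self.current_text = []

    def handle_data(self, data):
        # Only text inside a matching anchor is kept.
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "a" and self.current_href is not None:
            title = "".join(self.current_text).strip()
            self.links.append((self.current_href, title))
            self.current_href = None


def most_emailed_links(html):
    parser = MostEmailedParser()
    parser.feed(html)
    parser.close()
    return parser.links


links = most_emailed_links(SAMPLE_HTML)
melist = [href for href, _ in links]
metitle = [title for _, title in links]
```

The anchor outside the target div and the anchor without an href are both skipped, which mirrors what the find and findAll calls do above.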

I am absolutely sure there is a more direct method within BeautifulSoup to extract this data, but as a first go, this works fine.

I'm saving the results in a flat file that looks like this:

2008-11-17T19:23:49.772734      1       http://www.nytimes.com/2008/11/16/arts/design/16ouro.html?em    Architecture: Saving Buffalo’s Untold Beauty
2008-11-17T19:23:49.772734      2       http://www.nytimes.com/2008/11/16/opinion/16rich.html?em        Frank Rich: The Moose Stops Here
2008-11-17T19:23:49.772734      3       http://www.nytimes.com/2008/11/17/opinion/17mcwilliams.html?em  Op-Ed Contributor: Our Home-Grown Melamine Problem
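Rows like these could be produced along the following lines. This is a minimal sketch in Python 3, assuming the columns are tab-separated and that the timestamp comes from datetime's isoformat; format_rows and append_rows are hypothetical helper names, not taken from my script.

```python
from datetime import datetime


def format_rows(timestamp, urls, titles):
    """Return one tab-separated line per article: timestamp, rank, URL, title."""
    stamp = timestamp.isoformat()
    return [
        "\t".join([stamp, str(rank), url, title])
        for rank, (url, title) in enumerate(zip(urls, titles), start=1)
    ]


def append_rows(path, rows):
    """Append the formatted rows to the log file at the given (hypothetical) path."""
    with open(path, "a", encoding="utf-8") as log:
        for row in rows:
            log.write(row + "\n")


# Sample values copied from the first two rows of the excerpt above.
when = datetime(2008, 11, 17, 19, 23, 49, 772734)
rows = format_rows(
    when,
    [
        "http://www.nytimes.com/2008/11/16/arts/design/16ouro.html?em",
        "http://www.nytimes.com/2008/11/16/opinion/16rich.html?em",
    ],
    [
        "Architecture: Saving Buffalo’s Untold Beauty",
        "Frank Rich: The Moose Stops Here",
    ],
)
```

Using one timestamp per run, rather than one per row, keeps every entry from the same snapshot grouped together, which matches the repeated timestamp in the excerpt.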

So now the question is how to illustrate the data in some meaningful way.


Posted by vschmidt | File under: python