I have always wondered how changes in politics and news affect the “most emailed” list of articles posted by the NY Times. So I've written a short script that parses and logs that list, which I can run from a cron job at regular intervals. This was a fun exercise and helped me learn some cool things about Python, including a very handy HTML/XML parsing library, BeautifulSoup. It was not part of my standard install (EPD), so I installed it with:
sudo easy_install BeautifulSoup
Extracting from my code, these are the parts that make it all work:
Import the modules:
import re
import urllib2
from BeautifulSoup import BeautifulSoup
Get the web page:
url = 'http://www.nytimes.com'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
Parse the page, looking for the mostEmailed section:
me = soup.find('div', id="mostEmailed").findAll('a',href = re.compile('.*'))
This line finds the first div tag whose id is mostEmailed, which returns the entire section of list entry tags and their contents. We only want the contents of each anchor tag within each list entry, so the findAll call returns all the anchor tags. I'm filtering on href with the .* regular expression, which of course matches everything, so in effect every anchor that carries an href comes back. The filter isn't strictly required; findAll('a') on its own returns every anchor tag, with or without an href.
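To make the "find the div, then collect its anchors" idea concrete without needing BeautifulSoup installed, here is a rough sketch that does the same job with only the standard library's html.parser in Python 3. This is a swapped-in technique, not what my script does, and the SAMPLE_HTML fragment, the MostEmailedParser class and the most_emailed_links helper are all made up for illustration; the real page markup is not reproduced here.

```python
from html.parser import HTMLParser

# Made-up fragment standing in for the real page (an assumption, not the
# actual NY Times markup).
SAMPLE_HTML = """
<div id="header"><a href="/index.html">Home</a></div>
<div id="mostEmailed">
  <ol>
    <li><a href="/2008/11/16/arts/a.html?em">First Story</a></li>
    <li><a href="/2008/11/16/opinion/b.html?em">Second Story</a></li>
    <li><a name="no-href">Not a link</a></li>
  </ol>
</div>
<div id="footer"><a href="/about.html">About</a></div>
"""


class MostEmailedParser(HTMLParser):
    """Collect (href, title) pairs for anchors inside <div id="mostEmailed">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0        # > 0 while inside the target div
        self.current_href = None  # set while inside a matching <a href=...>
        self.current_text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1   # nested div inside the target
            elif attrs.get("id") == "mostEmailed":
                self.div_depth = 1    # entering the target div
        elif tag == "a" and self.div_depth > 0 and attrs.get("href"):
            # Only anchors that actually carry an href are kept.
            self.current_href = attrs["href"]
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "a" and self.current_href is not None:
            title = "".join(self.current_text).strip()
            self.links.append((self.current_href, title))
            self.current_href = None


def most_emailed_links(html):
    parser = MostEmailedParser()
    parser.feed(html)
    parser.close()
    return parser.links


links = most_emailed_links(SAMPLE_HTML)
for href, title in links:
    print(href, title)
```

The depth counter is what keeps the Home and About links out of the result: anchors only count while the parser is somewhere inside the mostEmailed div.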
Extract the data:
melist = []
metitle = []
for item in me:
    melist.append(item.get('href'))
    metitle.append(item.renderContents())
Finally, the URLs are extracted from the anchor tag data by “getting” the href attribute, and the title of each article is returned with renderContents.
I am absolutely sure there is a more direct method within BeautifulSoup to extract this data, but as a first go, this works fine.
I'm saving the results in a flat file that looks like this:
2008-11-17T19:23:49.772734 1 http://www.nytimes.com/2008/11/16/arts/design/16ouro.html?em Architecture: Saving Buffalo’s Untold Beauty
2008-11-17T19:23:49.772734 2 http://www.nytimes.com/2008/11/16/opinion/16rich.html?em Frank Rich: The Moose Stops Here
2008-11-17T19:23:49.772734 3 http://www.nytimes.com/2008/11/17/opinion/17mcwilliams.html?em Op-Ed Contributor: Our Home-Grown Melamine Problem
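For anyone wanting to produce lines like these, here is a minimal sketch in Python 3 of how the URL and title lists could be turned into log lines. The function names format_log_lines and append_to_log are hypothetical, and the tab separator is an assumption; the timestamp is simply what datetime's isoformat() produces.

```python
from datetime import datetime


def format_log_lines(urls, titles, now=None):
    """Return one log line per article: timestamp, rank, url, title."""
    # One timestamp is shared by every line from the same run, as in the
    # sample output. The tab separator is an assumption.
    stamp = (now or datetime.now()).isoformat()
    return [
        "\t".join([stamp, str(rank), url, title])
        for rank, (url, title) in enumerate(zip(urls, titles), start=1)
    ]


def append_to_log(path, lines):
    # Append so that each cron run adds to the existing history.
    with open(path, "a", encoding="utf-8") as log:
        for line in lines:
            log.write(line + "\n")


example = format_log_lines(
    ["http://www.nytimes.com/2008/11/16/opinion/16rich.html?em"],
    ["Frank Rich: The Moose Stops Here"],
    now=datetime(2008, 11, 17, 19, 23, 49, 772734),
)
print(example[0])
```

Passing the timestamp in as an optional argument keeps the formatting easy to check by hand; a real run would just leave it out and take the current time.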
So now the question is how to illustrate the data in some meaningful way.
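One possible first step, sketched here in Python 3 and not something I've settled on, is to read the log back and group the ranks by article so that each URL gets its own history over time. The rank_history function is hypothetical, the example.com URLs are placeholders, and the code assumes the four fields are separated by tabs.

```python
from collections import defaultdict


def rank_history(lines):
    """Map each article URL to a list of (timestamp, rank) pairs."""
    history = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumes tab-separated fields; the title is whatever follows
        # the third tab.
        stamp, rank, url, _title = line.split("\t", 3)
        history[url].append((stamp, int(rank)))
    return dict(history)


# Placeholder log lines, not real data.
sample = [
    "2008-11-17T19:23:49.772734\t1\thttp://example.com/a\tStory A\n",
    "2008-11-17T19:23:49.772734\t2\thttp://example.com/b\tStory B\n",
    "2008-11-17T20:23:49.100000\t1\thttp://example.com/b\tStory B\n",
    "2008-11-17T20:23:49.100000\t2\thttp://example.com/a\tStory A\n",
]
history = rank_history(sample)
for url, points in history.items():
    print(url, points)
```

Each per-article list of (timestamp, rank) pairs could then be fed to whatever plotting tool turns out to suit the data.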