With the caveats that (a) I don’t know much about Python, (b) I don’t want to learn that much about it right now, and (c) I’m not concerned with performance at the moment, here’s a Python script that does the following:
- Downloads an RSS feed from the URL given on the command line.
- Checks a database to see if the title of each feed entry is already there and, if so, whether it was added more than 12 hours ago.
- Prints only the “new” RSS feed titles (those that aren’t in the database, or that were added less than 12 hours ago).
- Writes the titles that aren’t already in the database to it, along with a timestamp.
Here’s the admittedly-crappy-but-functional Python source code:
#!/usr/bin/python
import feedparser
import time
from subprocess import check_output
import sys
#feed_name = 'TRIBUNE'
#url = 'http://chicagotribune.feedsportal.com/c/34253/f/622872/index.rss'
feed_name = sys.argv[1]
url = sys.argv[2]
db = '/var/www/radio/data/feeds.db'
limit = 12 * 3600 * 1000
#
# function to get the current time
#
current_time_millis = lambda: int(round(time.time() * 1000))
current_timestamp = current_time_millis()
# return True if the title appears anywhere in the database
def post_is_in_db(title):
    with open(db, 'r') as database:
        for line in database:
            if title in line:
                return True
    return False

# return True if the title is in the database and was stored more than `limit` ms ago
def post_is_in_db_with_old_timestamp(title):
    with open(db, 'r') as database:
        for line in database:
            if title in line:
                # split on the last '|' so a title containing '|' still parses
                ts_as_string = line.rsplit('|', 1)[1]
                # int() works on both Python 2 and 3 (long() is Python 2 only)
                ts = int(ts_as_string)
                if current_timestamp - ts > limit:
                    return True
    return False
#
# get the feed data from the url
#
feed = feedparser.parse(url)
#
# figure out which posts to print
#
posts_to_print = []
posts_to_skip = []
for post in feed.entries:
    # skip a post only if it has been in the database longer than the limit
    title = post.title
    if post_is_in_db_with_old_timestamp(title):
        posts_to_skip.append(title)
    else:
        posts_to_print.append(title)
#
# add all the posts we're going to print to the database with the current timestamp
# (but only if they're not already in there)
#
# the with block closes the file when the loop is done
with open(db, 'a') as f:
    for title in posts_to_print:
        if not post_is_in_db(title):
            f.write(title + "|" + str(current_timestamp) + "\n")
#
# output all of the new posts
#
count = 1
blockcount = 1
for title in posts_to_print:
    # print a header before every block of five titles
    if count % 5 == 1:
        print("\n" + time.strftime("%a, %b %d %I:%M %p") + ' ((( ' + feed_name + ' - ' + str(blockcount) + ' )))')
        print("-----------------------------------------\n")
        blockcount += 1
    print(title + "\n")
    count += 1
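The script reopens the database file for every title, which is fine at this size. As a rough sketch of an alternative, the same skip-or-print decision can be made against a dictionary that is loaded once. The names `load_db` and `split_posts` are hypothetical and not part of the script above, and the sketch assumes the same `title|timestamp` layout that the script writes. Note one difference: the dictionary matches whole titles, while the script uses a substring test (`title in line`).

```python
LIMIT_MS = 12 * 3600 * 1000  # same 12-hour window as the script


def load_db(lines):
    """Build {title: timestamp_ms} from 'title|timestamp' lines (hypothetical helper)."""
    seen = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # split on the last '|' in case a title contains one
        title, ts = line.rsplit('|', 1)
        seen[title] = int(ts)
    return seen


def split_posts(titles, seen, now_ms, limit_ms=LIMIT_MS):
    """Return (to_print, to_skip); a title is skipped only if it was
    stored more than limit_ms milliseconds ago."""
    to_print, to_skip = [], []
    for title in titles:
        ts = seen.get(title)
        if ts is not None and now_ms - ts > limit_ms:
            to_skip.append(title)
        else:
            to_print.append(title)
    return to_print, to_skip
```

With this in place, something like `seen = load_db(open(db))` followed by `split_posts([p.title for p in feed.entries], seen, current_timestamp)` would stand in for the loop above, reading the file once instead of once per title.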
The database
When run with the Chicago Tribune RSS feed URL shown, the script writes data like the following to its “database” (which is a text file with the fields separated by a | character):
Tiger Woods won't play in U.S. Open|1401322649189
Photos: Giants 5, Cubs 0|1401322649189
Baker foils Giants' no-hit bid, but Cubs lose 5-0|1401322649189
Sox Game Day: Noesi still searching for 1st Sox victory|1401322649189
Bears claim tackle Ola off waivers|1401322649189
Stanley Cup Final to start Wednesday|1401322649189
Renteria pushing Samardzija for All-Star game|1401322649189
Chicago teen Townsend stuns French star in French Open|1401322649189
Emanuel: No Wrigley Field hearing next month|1401322649189
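Each record is a title, a | character, and the time the title was stored, in milliseconds since the Unix epoch (the script computes it as `time.time() * 1000`). As a small sketch of reading one record back, the helpers below are hypothetical names of my own, not functions from the script:

```python
def parse_record(line):
    """Split one 'title|timestamp' line into (title, timestamp_ms).

    Splits on the last '|' so a title that contains '|' is kept whole.
    """
    title, ts = line.rstrip('\n').rsplit('|', 1)
    return title, int(ts)


def age_in_hours(timestamp_ms, now_ms):
    """How long ago a record was stored, in hours."""
    return (now_ms - timestamp_ms) / 3600000.0


title, stored = parse_record("Photos: Giants 5, Cubs 0|1401322649189\n")
# six hours after it was stored, the record is still inside the 12-hour window
print(title, age_in_hours(stored, stored + 6 * 3600 * 1000))
```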
Motivation
I use this code and database because:
- I’ve started to show RSS feed data on my Raspberry Pi Radio using the xscreensaver “Phosphor” screensaver
- The RSS feed data I’m showing doesn’t change that often
- I hate to walk into the kitchen, look at the screensaver, and see the same data today that I saw last night
I currently get data from several different RSS feeds and then rotate that data for the screensaver every two minutes, so if there are no bugs in this script, when I look at the screensaver tomorrow morning I shouldn’t see the same data that I’m seeing this evening (as I write this post).
For example, right now I see a title that says, “Baker foils Giants’ no-hit bid, but Cubs lose 5-0”. It’s fine to see that now, but I don’t need to see it again tomorrow morning. In reality I don’t need to see it more than once, but because the screensaver isn’t interactive (and I don’t want it to be), I hope that hiding stories after ~12 hours is a good compromise.

