Lately I’ve been getting a kick out of consolidating my newsfeeds. More precisely, siphoning everything into RSS and reading it with Newsboat. I know that RSS is, for some, an obsolete technology, but despite its lack of glitz and glamour I feel it is one of the best ways to enjoy online content. A major upside of RSS is that it lets you circumvent (some of) the endless recommendation engines, cookie pop-ups and other nefarious aspects of the rampant attention economy. For example, I use Newsboat to manage YouTube subscriptions without having to use the standard YouTube web interface. It’s easy to pipe the YouTube links to mpv from Newsboat, which makes the whole experience a lot better (in my opinion). If you are interested in alternate web interfaces for YouTube alone, FreeTube looks interesting. But this post is about RSS, so let’s dive in.
❗Sidenote❗
I know that the content-creation ecosystem on YouTube depends on accurate subscription metrics tied to Google accounts, and what I am doing slightly hurts the creators since I don’t feed the metrics or view advertisements on the platform. That being said, the manipulative practices & dark patterns adopted by YouTube/Google strongly discourage me from using the official front-end to view YouTube content. If you have fallen down time-devouring rabbit holes / felt that you are being spied on / manipulated into being glued to the screen, you know what I mean. As long as RSS remains officially supported by YouTube, I’ll keep using it.
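For reference, YouTube exposes per-channel feeds at a URL of the form below, and a Newsboat macro along these lines can hand the selected item to mpv. Treat it as a sketch: pressing ,v opens the current item in mpv, and the trailing set browser simply restores whatever browser you normally use (firefox here is only an example).

    # added to Newsboat's urls file
    https://www.youtube.com/feeds/videos.xml?channel_id=<CHANNEL_ID>

    # added to Newsboat's config file
    macro v set browser "mpv %u" ; open-in-browser ; set browser "firefox %u"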
Inconveniences of conventional RSS feeds
Despite some decline in popularity, many sites still offer an RSS feed, for example Hacker News and Hackaday. The problem with many feeds is that the amount of content can quickly become overwhelming. When I see 500 unread posts, I tend to delete all of them rather than start sifting through the pile. Ideally, there would be some programmatic way to tweak and filter existing feeds to suit one’s needs better. Furthermore, it would be nice to turn non-RSS sources into RSS feeds (while retaining the ability to filter / augment the feed programmatically).
RSS-ification is not a novel topic. There are already many RSS feed generators that can turn websites into RSS feeds. Tools like Zapier let you build RSS feeds from a wide variety of sources, and you can also use e.g. repl.it to produce an RSS feed from an email newsletter. However, in this post, we’ll roll our own using Python and Heroku.
Generating custom RSS feeds
Generating custom RSS feeds combines programmatic content filtering (using e.g. web scraping) with generating and serving XML content. Both of these are easy to do in Python. Additionally, services like Heroku make it easy to host such simple Python web apps for free. Heroku is pretty versatile and you can even deploy your own TTRSS instance on it.
For our custom RSS feed, we need three components:
- Flask app that serves the RSS feed (XML content)
- Datastore that functions as a cache for the content
- Scheduled scraper that reads a site at set intervals and returns content based on our filters
I’ll refer to an example app named hackaday-videos throughout the rest of this post. This app scrapes Hackaday posts and extracts the ones with an embedded YouTube link. I decided to write it since I found myself almost always skipping straight to the video when browsing HaD. You can find the source for the app on GitLab.
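To give a rough idea of the moving parts, the app boils down to a handful of files. The layout below is a sketch rather than the exact contents of the repo: datastore and scrape match the imports used in the snippets that follow, while app.py and clock.py are assumed names for the Flask app and the scheduler process.

    hackaday-videos/
        app.py            # Flask app that serves the XML
        datastore.py      # Redis read/write helpers
        scrape.py         # scraper that extracts embedded video links
        clock.py          # APScheduler clock process
        Procfile          # tells Heroku which processes to run
        requirements.txt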
Flask app
This part is simple: you can feed Flask pretty much any type of content and it’ll happily serve it up. The XML for the RSS feed is generated with PyRSS2Gen. This package hasn’t been updated since 2012 (when it was ported to Python 3), which was initially a bit worrying. Nonetheless, I haven’t had any issues with it and it seems rock solid.
import PyRSS2Gen
from flask import Flask

import datastore

app = Flask(__name__)


def outputXML(links):
    # Turn the cached link dicts into RSS items and render the feed as XML
    rss_items = []
    for entry in links:
        rss_items.append(PyRSS2Gen.RSSItem(title=entry['title'],
                                           link=entry['link'],
                                           description=entry['description']))
    rss = PyRSS2Gen.RSS2(
        title="Hackaday Videos",
        link="",
        description="Videos from Hackaday posts",
        items=rss_items)
    return rss.to_xml()


@app.route('/')
def root():
    # Serve the feed straight from the Redis cache
    links = datastore.readRedis()
    return outputXML(links)


if __name__ == "__main__":
    app.run(threaded=True, port=5000)
Datastore
Feeds can be generated on a per-request basis or using scheduling combined with a datastore. The per-request option is simpler but limited both in terms of speed and coverage. For example, in our hackaday-videos app the RSS feed gets the seven latest posts (the HaD RSS default) and checks them for embedded videos. The daily post count on HaD ranges from 10 to 12, so unless you update your RSS feed approximately twice a day you might miss some posts. Furthermore, it is far faster to read the Redis datastore than to scrape at every update. You could naturally replace Redis with Postgres (and psycopg2); there is a sketch of that variant after the Redis example below. Having the free Redis add-on requires adding a credit card to your Heroku account whereas the Postgres one is available without adding a card. The example below demonstrates how easy it is to read and write to a Heroku Redis instance. Dumping the links (a list of dicts) into a JSON string is a bit lazy, but it works so why not 🦥.
PRO-TIP: The URLs of these datastores change over time, so Heroku handles access to them via environment variables.
import os
import json

import redis

# Heroku rotates the datastore credentials, so always read the URL from the environment
redisClient = redis.from_url(os.environ.get("REDIS_URL"))


def readRedis():
    # Deserialize the cached list of link dicts
    serialized = redisClient.get('links')
    links = json.loads(serialized)
    return links


def updateRedis(links):
    # Store the whole list as a single JSON string under one key
    serialized = json.dumps(links)
    redisClient.set('links', serialized)
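If you’d rather avoid the credit-card requirement, the same two functions could be backed by Heroku Postgres instead. The sketch below is illustrative rather than what the repo uses: it assumes a single-row links table created with CREATE TABLE links (data jsonb), and the readPostgres / updatePostgres names are made up for the example (DATABASE_URL is provided by the Postgres add-on).

    import os

    import psycopg2
    from psycopg2.extras import Json

    # Like the Redis URL, DATABASE_URL can rotate, so read it from the environment
    DATABASE_URL = os.environ.get("DATABASE_URL")


    def readPostgres():
        conn = psycopg2.connect(DATABASE_URL)
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT data FROM links LIMIT 1")
                row = cur.fetchone()
                # psycopg2 parses jsonb columns into Python lists/dicts automatically
                return row[0] if row else []
        finally:
            conn.close()


    def updatePostgres(links):
        conn = psycopg2.connect(DATABASE_URL)
        try:
            with conn.cursor() as cur:
                # Mirror the Redis approach: one row holding the whole list
                cur.execute("DELETE FROM links")
                cur.execute("INSERT INTO links (data) VALUES (%s)", (Json(links),))
            conn.commit()
        finally:
            conn.close()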
Scheduled scraper
Similarly to everything else, scheduling functions on Heroku is straightforward. There are a couple of options, including the Heroku Scheduler add-on as well as several ways of implementing custom clock processes. Since this is a Python project, I went with the Advanced Python Scheduler (APScheduler), for which there are clear instructions in the Heroku Dev Center. In the example below, the timed_job function is executed every hour.
from apscheduler.schedulers.blocking import BlockingScheduler

import datastore
import scrape

sched = BlockingScheduler()


@sched.scheduled_job('interval', minutes=60)
def timed_job():
    # Merge freshly scraped links with the cached ones (new links first)
    # and keep the cache bounded to 100 entries
    oldLinks = datastore.readRedis()
    links = scrape.extractVideoLinks()
    links.extend(x for x in oldLinks if x not in links)
    datastore.updateRedis(links[:100])


sched.start()
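On Heroku, the scheduler runs as a separate clock process declared next to the web process in the Procfile. A minimal sketch, assuming the Flask app lives in app.py, the scheduler above in clock.py, and gunicorn as the WSGI server (the usual choice for Flask on Heroku):

    web: gunicorn app:app
    clock: python clock.py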
The actual scrape function is a bit verbose as it handles some edge cases; you can find it in the GitLab repo.
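To give a rough idea of what it does, here is a heavily simplified sketch rather than the code from the repo: it reads the HaD feed with feedparser, fetches each post, and keeps the ones embedding a YouTube player. The feed URL, the library choices and the decision to point the RSS item straight at the video are assumptions made for illustration.

    import feedparser
    import requests
    from bs4 import BeautifulSoup

    HAD_FEED = "https://hackaday.com/blog/feed/"  # assumed feed URL


    def extractVideoLinks():
        links = []
        for entry in feedparser.parse(HAD_FEED).entries:
            html = requests.get(entry.link, timeout=10).text
            soup = BeautifulSoup(html, "html.parser")
            # Keep only posts with an embedded YouTube player
            iframe = soup.find("iframe", src=lambda s: s and "youtube.com" in s)
            if iframe:
                links.append({
                    "title": entry.title,
                    "link": iframe["src"],  # point the feed item at the video itself
                    "description": entry.get("summary", ""),
                })
        return links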
Closing remarks
Once everything is deployed, you can easily browse all the videos embedded in Hackaday posts as if they were a channel of their own. Note that there might be some edge cases I haven’t taken into account, so YMMV if you deploy this as is. I think this type of ‘self-hosted filtration of a media aggregator’ is a very interesting and extensible approach.
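Hooking the deployed feed into Newsboat is then a one-line addition to the urls file; the app name below is a placeholder for whatever you called your Heroku app, and the quoted ~title is optional:

    https://your-app-name.herokuapp.com/ "~Hackaday Videos"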
Future RSS feed ideas include:
- Combining several scientific journal RSS feeds and filtering based on specific search terms
  - This should be faster than waiting for journals to be indexed on Google Scholar
- Converting email newsletters from e.g. Substack into RSS
  - This could probably be done with e.g. the CloudMailIn add-on