Sometimes I like to listen to podcasts using Plex, but an RSS feed is not always available (or, if it is, it may not have all the content I want). So my idea was to build a web scraper that generates an RSS feed from online audio files and uploads it to an S3 bucket (or any other hosting service). Then I can just point the app to the feed location and listen to the audio as if it were part of a “real” podcast.
To make this process more robust and reusable, I wanted something that played nicely with Scrapy (which I think is the best web scraping framework in Python). So I decided to make a Python package that implements a pipeline, a feed export[^1], and two special types of items.
To understand the logic behind the code, here is a small summary of what I wanted: scrape the pages that list the audio files, collect general information about the podcast and about each episode, generate an RSS feed from that data, and store the resulting XML either locally or in an S3 bucket.

To solve this, the package implements:
- `PodcastDataItem`: inherits from `scrapy.Item` and stores general information about the content.
- `PodcastEpisodeItem`: also inherits from `scrapy.Item` and stores information about each “episode.”
- `PodcastPipeline`: processes the items for the exporter.
- `PodcastBaseItemExporter`: a base class that (through its metaclass) forces subclasses to implement `save_to_storage`, and also takes care of all the trouble of generating the RSS feed (it uses the package `feedgen` to create the XML)[^2].
- `PodcastToFileItemExporter`: inherits from `PodcastBaseItemExporter` and simply saves the XML locally.
- `PodcastToS3ItemExporter`: inherits from `PodcastBaseItemExporter` and uploads the XML to an S3 bucket (using `boto3`).
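To make the exporter contract more concrete, here is a conceptual sketch (not the package's actual code) of the pattern described above: a base class whose metaclass, `abc.ABCMeta`, refuses to instantiate a subclass that does not provide `save_to_storage`. The class names in the sketch are illustrative.

```python
import abc


class BaseExporterSketch(metaclass=abc.ABCMeta):
    """Conceptual stand-in for PodcastBaseItemExporter (not the real class)."""

    @abc.abstractmethod
    def save_to_storage(self, xml_content):
        """Persist the generated RSS XML somewhere (local file, S3, ...)."""


class ToFileExporterSketch(BaseExporterSketch):
    """A subclass that omitted save_to_storage would raise TypeError on instantiation."""

    def __init__(self, output_path):
        self.output_path = output_path

    def save_to_storage(self, xml_content):
        with open(self.output_path, "w", encoding="utf-8") as f:
            f.write(xml_content)
```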
To use the package:

1. Install it:

   ```
   pip install scrapy-podcast-rss
   ```

   If you want to use the S3 storage option, install it with the `s3_storage` extra:

   ```
   pip install scrapy-podcast-rss[s3_storage]
   ```
   You will also need to have the AWS CLI installed (you can read more here).

2. Define `OUTPUT_URI` in your `settings.py`[^3]:

   ```python
   OUTPUT_URI = './my-podcast.xml'  # Local file.
   OUTPUT_URI = 's3://my-bucket/my-podcast.xml'  # S3 bucket.
   ```
3. Add `PodcastPipeline` to `ITEM_PIPELINES` in your `settings.py`:

   ```python
   ITEM_PIPELINES = {
       'scrapy_podcast_rss.pipelines.PodcastPipeline': 300,
   }
   ```
4. `yield` a `PodcastDataItem`:

   ```python
   from scrapy_podcast_rss import PodcastDataItem
   # (...)
   podcast_data_item = PodcastDataItem()
   podcast_data_item['title'] = "Podcast title"
   # (...)
   yield podcast_data_item
   ```
5. `yield` a `PodcastEpisodeItem` for each “episode” scraped:

   ```python
   from scrapy_podcast_rss import PodcastEpisodeItem
   # (...)
   episode_item = PodcastEpisodeItem()
   episode_item['title'] = "Episode title"
   # (...)
   yield episode_item
   ```
Done! Your web scraper should be generating valid RSS files.
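For reference, here is a minimal sketch of a spider that puts those steps together. The spider name, start URL, and CSS selectors are placeholders, and any field other than `title` is intentionally left out (check the package's item definitions for the available fields).

```python
import scrapy
from scrapy_podcast_rss import PodcastDataItem, PodcastEpisodeItem


class MyPodcastSpider(scrapy.Spider):
    name = "my_podcast"  # Placeholder name.
    start_urls = ["https://example.com/audio-archive"]  # Placeholder URL.

    def parse(self, response):
        # One PodcastDataItem describes the feed as a whole.
        podcast_data_item = PodcastDataItem()
        podcast_data_item["title"] = "Podcast title"
        # (...)
        yield podcast_data_item

        # One PodcastEpisodeItem per audio file found on the page.
        for entry in response.css("div.episode"):  # Placeholder selector.
            episode_item = PodcastEpisodeItem()
            episode_item["title"] = entry.css("h2::text").get()
            # (...)
            yield episode_item
```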
The next step to make this more convenient is to run it automatically whenever you want. I wrote a small example that uses AWS Lambda functions to run the code. You can then set up a CloudWatch event to run your scraper daily/weekly/monthly or with a custom trigger.
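As a rough idea of what the Lambda side can look like (this is a sketch, not the exact code from that example), the handler can start the spider with Scrapy's `CrawlerProcess`. The spider name and return value are placeholders, and the Scrapy project is assumed to be packaged with the Lambda deployment; a CloudWatch/EventBridge scheduled rule (e.g. `rate(1 day)`) can then invoke the handler.

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def lambda_handler(event, context):
    # Load the project's settings.py (OUTPUT_URI, ITEM_PIPELINES, ...).
    process = CrawlerProcess(get_project_settings())
    process.crawl("my_podcast")  # Placeholder spider name.
    process.start()  # Blocks until the crawl, and the feed export, finish.
    return {"status": "done"}
```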
You can find all the code here.

You can also find a repository with a minimal working example here.
[^1]: I debated a lot whether this should be a pipeline and a feed export, or only a feed export (which actually makes more sense), but I ended up implementing the first option because it was more flexible.
[^2]: The advantage of this is that if you want to host the RSS feed on, for example, Google Drive, you simply need to:
    1. Create an exporter that inherits from `PodcastBaseItemExporter` and implements `save_to_storage` (which receives the generated XML content as its first parameter). This should be really straightforward thanks to PyDrive!
    2. Modify `PodcastPipeline._get_exporter` to decide when to use your exporter (again, this is not ideal, but this was the option that meant more flexibility). A sketch of such an exporter follows below.

[^3]: Depending on the type of URI, the pipeline will decide which exporter to use.
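To illustrate the extension point described in footnote 2, here is a hedged sketch of what a Google Drive exporter could look like using PyDrive. The import path of `PodcastBaseItemExporter`, the exact signature of `save_to_storage`, and the Drive file title are assumptions; only the class and hook names come from the description above.

```python
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

from scrapy_podcast_rss.exporters import PodcastBaseItemExporter  # Assumed module path.


class PodcastToDriveItemExporter(PodcastBaseItemExporter):
    def save_to_storage(self, xml_content):
        # save_to_storage receives the generated RSS XML as its first parameter.
        gauth = GoogleAuth()
        gauth.LocalWebserverAuth()  # Opens a browser for OAuth on the first run.
        drive = GoogleDrive(gauth)
        drive_file = drive.CreateFile({"title": "my-podcast.xml"})  # Placeholder name.
        drive_file.SetContentString(xml_content)
        drive_file.Upload()
```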