Web scrape anything into a podcast

Sometimes I like to listen to podcasts using Plex, but an RSS feed is not always available (or, if it is available, it may not have all the content I want). So my idea was to build a web scraper that generates an RSS feed from online audio files and uploads it to an S3 bucket (or any other hosting service). Then I can just point the app to the feed location and listen to the audio files as if they were part of a “real” podcast.

To make this process more robust and reusable, I wanted something that played nicely with scrapy (which I think is the best web scraping framework in Python). So I decided to make a Python package that implements a pipeline, a feed export [1], and two special types of items.

To understand the logic behind the code, here is a small summary of what I wanted:

  1. Scrape general information about the content (title, description, image, etc.).
  2. Scrape information about each “episode” (title, description, publication date, audio URL, etc.).
  3. Combine the information into a valid RSS feed (XML format).
  4. Export and upload the file.

To solve this, the package implements:

  • PodcastDataItem which inherits from scrapy.Item and stores general information about the content.
  • PodcastEpisodeItem which also inherits from scrapy.Item and stores information about each “episode.”
  • PodcastPipeline that processes the items for the exporter.
  • PodcastBaseItemExporter which uses a metaclass to force subclasses to implement save_to_storage, and also takes care of all the work of generating the RSS feed (it uses the feedgen package to create the XML) [2].
  • PodcastToFileItemExporter inherits from PodcastBaseItemExporter and simply saves the XML locally.
  • PodcastToS3ItemExporter inherits from PodcastBaseItemExporter and uploads the XML to an S3 bucket (using boto3).

To use the package:

  1. Install the package with pip install scrapy-podcast-rss.
    If you want to upload your files to an S3 bucket, use pip install scrapy-podcast-rss[s3_storage]. You will also need to have the AWS CLI installed (you can read more here).
  2. Define OUTPUT_URI in your settings.py [3].
    For example:
     OUTPUT_URI = './my-podcast.xml'  # Local file.
     OUTPUT_URI = 's3://my-bucket/my-podcast.xml'  # S3 bucket.
    
  3. Add PodcastPipeline to ITEM_PIPELINES in your settings.py:
    For example:
     ITEM_PIPELINES = {
         'scrapy_podcast_rss.pipelines.PodcastPipeline': 300,
     }
    
  4. On your spider, yield a PodcastDataItem:
    For example:
     from scrapy_podcast_rss import PodcastDataItem
     # (...)
     podcast_data_item = PodcastDataItem()
     podcast_data_item['title'] = "Podcast title"
     # (...)
     yield podcast_data_item
    
  5. On your spider, yield a PodcastEpisodeItem for each “episode” scraped (a complete minimal spider sketch follows this list):
    For example:
     from scrapy_podcast_rss import PodcastEpisodeItem
     # (...)
     episode_item = PodcastEpisodeItem()
     episode_item['title'] = "Episode title"
     # (...)
     yield episode_item
    
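Putting steps 4 and 5 together, a minimal spider could look like the sketch below. Everything beyond the imports and the title field shown above is an assumption for illustration (the field names description, publication_date and audio_url, the example URL, and the CSS selectors), so check the package documentation for the exact item schema.

     import datetime

     import scrapy
     from scrapy_podcast_rss import PodcastDataItem, PodcastEpisodeItem


     class MyPodcastSpider(scrapy.Spider):
         """Hypothetical spider that turns a page of audio links into podcast items."""
         name = 'my-podcast'
         start_urls = ['https://example.com/audio-archive']  # Assumed page.

         def parse(self, response):
             # One PodcastDataItem describes the feed as a whole.
             podcast_data_item = PodcastDataItem()
             podcast_data_item['title'] = "Podcast title"
             podcast_data_item['description'] = "Feed generated with scrapy-podcast-rss"
             yield podcast_data_item

             # One PodcastEpisodeItem per audio file found on the page.
             for entry in response.css('.episode'):  # Assumed selector.
                 episode_item = PodcastEpisodeItem()
                 episode_item['title'] = entry.css('.title::text').get()
                 episode_item['publication_date'] = datetime.datetime.now(datetime.timezone.utc)
                 episode_item['audio_url'] = entry.css('a::attr(href)').get()
                 yield episode_item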

Done! Your web scraper should be generating valid RSS files.

The next step to make this more convenient is to run it automatically whenever you want. I wrote a small example that uses AWS Lambda functions to run the code. You can then set up a CloudWatch event to run your scraper daily/weekly/monthly or with a custom trigger.
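As a very rough sketch of what that Lambda entry point could look like, the handler below simply kicks off the spider in-process with scrapy's CrawlerProcess. The bucket, module path, and MyPodcastSpider class are assumptions carried over from the earlier sketch, not something the package provides.

     from scrapy.crawler import CrawlerProcess

     # MyPodcastSpider is the hypothetical spider sketched earlier.
     from my_scraper.spiders import MyPodcastSpider


     def lambda_handler(event, context):
         """Entry point triggered by a scheduled CloudWatch event."""
         process = CrawlerProcess(settings={
             'OUTPUT_URI': 's3://my-bucket/my-podcast.xml',  # Assumed bucket.
             'ITEM_PIPELINES': {
                 'scrapy_podcast_rss.pipelines.PodcastPipeline': 300,
             },
         })
         process.crawl(MyPodcastSpider)
         process.start()  # Blocks until the crawl finishes.
         return {'status': 'feed updated'}

Keep in mind that Twisted's reactor cannot be restarted within the same process, so a warm Lambda container may need extra care (for example, running the crawl in a subprocess).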


You can find all the code here.
You can also find a repository with a minimal working example here.

  1. I debated a lot about whether this should be both a pipeline and a feed export, or only a feed export (which actually makes more sense), but I ended up implementing the first option because it was more flexible.

  2. The advantage of this is that if you want to host the RSS feed on, for example, Google Drive, you simply need to:

    1. Create a new exporter class and inherit from PodcastBaseItemExporter.
    2. Implement the method save_to_storage (which receives the generated XML content as its first parameter). This should be really straightforward thanks to PyDrive!
    3. Modify PodcastPipeline._get_exporter to decide when to use your exporter (again, this is not ideal, but it was the option that offered more flexibility). A rough sketch of such an exporter follows these notes.

  3. Depending on the type of URI, the pipeline will decide which exporter to use.
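
To make note 2 concrete, here is a rough sketch of what a Google Drive exporter could look like. The import path of PodcastBaseItemExporter and the exact type passed to save_to_storage are assumptions based on the description above; the only part confirmed by the package is that you subclass it and implement save_to_storage.

     from pydrive.auth import GoogleAuth
     from pydrive.drive import GoogleDrive

     # Import path assumed; check the package for the actual module layout.
     from scrapy_podcast_rss.exporters import PodcastBaseItemExporter


     class PodcastToDriveItemExporter(PodcastBaseItemExporter):
         """Hypothetical exporter that uploads the generated feed to Google Drive."""

         def save_to_storage(self, xml_content):
             # Authenticate with PyDrive (assumes a configured client_secrets.json).
             gauth = GoogleAuth()
             gauth.LocalWebserverAuth()
             drive = GoogleDrive(gauth)

             # The generated XML may arrive as bytes; Drive expects a string.
             if isinstance(xml_content, bytes):
                 xml_content = xml_content.decode('utf-8')

             # Create a file on Drive and upload the RSS feed as its content.
             feed_file = drive.CreateFile({'title': 'my-podcast.xml'})
             feed_file.SetContentString(xml_content)
             feed_file.Upload()

You would then make PodcastPipeline._get_exporter return this class for whatever URI scheme you choose, as described in step 3 of that note.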