A couple of weeks ago I wrote a sort of “podcast maker” web scraper (you can read about it here), and I wanted to run it regularly, for free, and without too much hassle. I decided that AWS Lambda was the best option. To make it available to others, I wrote a minimal example that you can clone here.
To configure it and make it work, you will need to install AWS SAM CLI. You will also need to:
podcast_scraper/settings.py to the bucket you just created.
podcast_scraper/spiders (or just try the minimal example which is already in the folder).
scrapy_podcast_rss.pipelines.PodcastPipeline, yield one PodcastDataItem, and one PodcastEpisodeItem for each episode (this is also shown in the
To build and deploy our code, we just need to run:
Scrapy is a great web scraping framework that allows parallel requests. To achieve this, it was built on top of Twisted and runs inside a Twisted reactor. An important fact about Twisted reactors is that they cannot be restarted. This is not a problem on a server or a local machine, but it can be an issue when running code “serverlessly.” For example, if we run our Lambda function twice with little time in between, AWS may re-use the container created for the first execution. The second execution then attempts to restart the Twisted reactor, which fails with an error (Twisted raises ReactorNotRestartable).
To solve this issue, I simply run each new scraper in a new custom process (using multiprocessing), so each execution gets its own reactor and runs separately from past executions.
This may not be ideal, but it is the easiest solution, and in the worst case we just waste a little memory. You can see the custom process here, and where it is being used here1.
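The separate-process approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: `run_in_fresh_process`, `crawl`, and `lambda_handler` are hypothetical names, the spider name "minimal" is assumed from the example folder, and the "fork" start method assumes a Linux environment such as Lambda's.

```python
import multiprocessing


def run_in_fresh_process(target, *args):
    """Run target(*args) in a child process and return its exit code."""
    # Assumption: "fork" is available (true on Linux, which Lambda uses).
    ctx = multiprocessing.get_context("fork")
    process = ctx.Process(target=target, args=args)
    process.start()
    process.join()  # wait for the child to finish
    return process.exitcode


def crawl(spider_name):
    """Start a Scrapy crawl; meant to be called inside the child process."""
    # Imported lazily so only the child process loads Scrapy and Twisted.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    crawler = CrawlerProcess(get_project_settings())
    crawler.crawl(spider_name)
    crawler.start()  # starts the reactor and blocks until the crawl ends


def lambda_handler(event, context):
    # Hypothetical handler: every invocation gets a new process, and
    # therefore a new reactor, even if AWS re-uses the container.
    exit_code = run_in_fresh_process(crawl, "minimal")
    return {"exit_code": exit_code}
```

Because the reactor only ever starts inside the child, the parent process that AWS keeps alive between invocations never holds a stopped reactor. The exit code is 0 when the child finishes normally and non-zero when it dies from an unhandled exception.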