Serverless web scraping

A couple of weeks ago I wrote a sort of “podcast maker” web scraper (you can read about it here). I wanted to run it regularly, for free, and without too much hassle, so I decided that AWS Lambda was the best option. To make it available to others, I wrote a minimal example that you can clone here.

To configure and run it, you will need to install the AWS SAM CLI. You will also need to:

  • Set stack_name, s3_bucket, s3_prefix, and region in samconfig.toml.
  • If necessary, modify Timeout in template.yaml (a rough sketch of the relevant section appears after the deploy commands below).
    Remember that 900 seconds (i.e. 15 minutes) is the maximum for AWS Lambda.
  • Create an S3 bucket that has public read access. This is where the podcast app will obtain the RSS feed.
  • Set OUTPUT_BUCKET in podcast_scraper/settings.py to the bucket you just created.
  • Create a spider on podcast_scraper/spiders (or just try the minimal example which is already in the folder).
  • Remember to include the pipeline scrapy_podcast_rss.pipelines.PodcastPipeline, and to yield one PodcastDataItem plus one PodcastEpisodeItem per episode (this is also shown in the minimal example, and sketched right after this list).
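
For orientation, a spider that satisfies those last two points could look roughly like the sketch below. This is a hedged sketch, not the repository's code: the spider name, start URL, selectors, and the exact item field names are illustrative assumptions of mine; check the scrapy-podcast-rss README for the fields the pipeline actually expects.

import datetime

import scrapy
from scrapy_podcast_rss import PodcastDataItem, PodcastEpisodeItem


class MyPodcastSpider(scrapy.Spider):
    # Placeholder name and URL; replace with your own.
    name = "my_podcast"
    start_urls = ["https://example.com/episodes"]

    def parse(self, response):
        # Exactly one PodcastDataItem describes the feed as a whole.
        show = PodcastDataItem()
        show["title"] = "My Podcast"  # field names assumed; see the library's README
        show["description"] = "Episodes scraped and served as a podcast feed."
        yield show

        # One PodcastEpisodeItem per episode.
        for entry in response.css("div.episode"):  # placeholder selector
            episode = PodcastEpisodeItem()
            episode["title"] = entry.css("h2::text").get()
            episode["description"] = entry.css("p::text").get()
            episode["publication_date"] = datetime.datetime.now(datetime.timezone.utc)
            episode["audio_url"] = entry.css("audio::attr(src)").get()
            yield episode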

To build and deploy our code, we just need to run:

$ sam build --use-container
$ sam deploy
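
For reference, the relevant part of template.yaml could look something like this. Treat it as a sketch: the resource name, handler, runtime, and schedule are placeholders of mine, though Timeout and Schedule events are standard properties of AWS::Serverless::Function in SAM.

Resources:
  PodcastScraperFunction:            # placeholder resource name
    Type: AWS::Serverless::Function
    Properties:
      Handler: handler.handler       # placeholder module.function path
      Runtime: python3.8
      Timeout: 900                   # seconds; the Lambda maximum
      Events:
        ScheduledRun:
          Type: Schedule             # triggers the scraper periodically
          Properties:
            Schedule: rate(1 day)    # placeholder; adjust to taste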

How the example code works

Scrapy is a great web scraping framework that makes requests in parallel; to achieve this, it is built on top of Twisted and runs inside a Twisted reactor. An important fact about Twisted reactors is that they cannot be restarted. This is not a problem when running on a server or a local machine, but it can be an issue when running code “serverlessly.” For example, if we run our Lambda function twice with little time in between, AWS may re-use the container created for the first execution, and the second run will fail because it attempts to restart the Twisted reactor.
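
To make the failure concrete, here is a hedged sketch of the naive handler (the handler and spider names are placeholders, not the repository's code):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def handler(event, context):
    # Works on a cold container. On a warm (re-used) container the
    # reactor has already run and stopped, so process.start() raises
    # twisted.internet.error.ReactorNotRestartable.
    process = CrawlerProcess(get_project_settings())
    process.crawl("my_podcast")  # placeholder spider name
    process.start()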

To solve this issue, I simply ran each new scraper in a new custom process (using multiprocessing), so each execution is isolated from past ones.
This may not be ideal, but it is the easiest solution, and in the worst case we just waste a little memory. You can see the custom process here, and where it is used here¹.
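
In code, the idea looks roughly like this. It is a minimal sketch assuming a plain multiprocessing.Process; the repository actually uses a small custom Process subclass (linked above) to the same effect, and the spider name is again a placeholder.

import multiprocessing

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def _crawl(spider_name):
    # Runs in a brand-new process, so Twisted gets a fresh reactor on
    # every invocation, even when Lambda re-uses the container.
    process = CrawlerProcess(get_project_settings())
    process.crawl(spider_name)
    process.start()  # blocks until the crawl finishes

def handler(event, context):
    p = multiprocessing.Process(target=_crawl, args=("my_podcast",))  # placeholder spider name
    p.start()
    p.join()  # wait, so Lambda does not freeze the container mid-crawl

One caveat worth knowing: Lambda's environment has no /dev/shm, so shared-memory primitives like multiprocessing.Queue and multiprocessing.Pool fail there; a plain Process (optionally with a Pipe) works fine.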


You can find all the code here.
You can also find instructions for using scrapy-podcast-rss here.

  1. This, this, and this helped me a lot in solving the ReactorNotRestartable error.