Downloading RSS feeds

RSS (Really Simple Syndication) and Atom feeds are a convenient way to retrieve new information from websites, and it was important to make them available on an internal, disconnected network. However, articles are often truncated in these feeds, so we also want the linked web page. Downloading web pages is more complicated: the days when a page was a single HTML file with a few linked images are over. Pages are now often generated with JavaScript, and only a full browser can render them correctly.
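
As a minimal sketch of the first step, collecting the linked pages from a feed, the following uses only the Python standard library. The function name `extract_links` and the sample feed `SAMPLE_RSS` are illustrative assumptions, not the code actually used here; it only handles the common RSS 2.0 and Atom layouts.

```python
import xml.etree.ElementTree as ET
from typing import List

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def extract_links(feed_xml: str) -> List[str]:
    """Return the article URLs found in an RSS 2.0 or Atom feed (hypothetical helper)."""
    root = ET.fromstring(feed_xml)
    links = []
    # RSS 2.0: <item><link>URL</link></item>
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    # Atom: <entry><link href="URL" rel="alternate"/></entry>
    for entry in root.iter(ATOM_NS + "entry"):
        for link in entry.findall(ATOM_NS + "link"):
            # A missing rel attribute means "alternate" in Atom.
            if link.get("rel", "alternate") == "alternate" and link.get("href"):
                links.append(link.get("href"))
                break
    return links


# Made-up feed used only to illustrate the call.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example</title>
<item><title>First</title><link>https://example.org/a</link></item>
<item><title>Second</title><link>https://example.org/b</link></item>
</channel></rss>"""

for url in extract_links(SAMPLE_RSS):
    print(url)
```

Each URL returned this way is a candidate for the page download described next.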

There are several solutions for downloading web pages, most of which drive a real browser:

  • raw download using the requests Python library, which is fine for downloading binary files,
  • Chrome (chrome --print-to-pdf --headless --disable-gpu),
  • Puppeteer, which requires npm and a generated JavaScript file to control Chrome,
  • pyppeteer, which is no longer maintained,
  • playwright, a recent project from Microsoft that has a nice Python API.

We currently use Playwright, with Puppeteer kept as a fall-back solution. Chrome is launched in a separate thread with a queue of URLs to download, which avoids launching a new Chrome process for each download.
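
The thread-plus-queue arrangement can be sketched as follows. This is an illustration under stated assumptions rather than the actual implementation: the names `playwright_session` and `DownloadWorker` are hypothetical, the sketch launches Playwright's bundled Chromium rather than an installed Chrome, and it returns the rendered HTML instead of a PDF. The browser session is injectable so the queue logic can be exercised without a browser.

```python
import queue
import threading
from contextlib import contextmanager


@contextmanager
def playwright_session():
    """Start one headless browser and yield a function that fetches a URL."""
    # Imported lazily so the worker class can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        def fetch(url):
            # Wait for network activity to settle so scripted content is present.
            page.goto(url, wait_until="networkidle")
            return page.content()

        try:
            yield fetch
        finally:
            browser.close()


class DownloadWorker(threading.Thread):
    """Hypothetical worker: one browser, fed by a queue of URLs."""

    def __init__(self, session=playwright_session):
        super().__init__(daemon=True)
        self._session = session
        self._urls = queue.Queue()
        self.results = {}  # url -> page content, or the exception raised

    def submit(self, url):
        self._urls.put(url)

    def stop(self):
        self._urls.put(None)  # sentinel: finish queued URLs, then exit

    def run(self):
        # The session is opened inside the thread: Playwright's sync API
        # objects must be used from the thread that created them.
        with self._session() as fetch:
            while True:
                url = self._urls.get()
                if url is None:
                    break
                try:
                    self.results[url] = fetch(url)
                except Exception as exc:  # keep the worker alive on a bad URL
                    self.results[url] = exc
```

A caller would create one `DownloadWorker`, call `start()`, `submit()` each URL, then `stop()` and `join()`; the browser is started once, however many URLs are queued.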