Walk on a Sunday, na, scrape distrowatch
A friend asked me today if I could do some Perl programming for him. I'm a bit rusty in Perl, but I still understand it well enough to port the code to Python. The source module was on CPAN and basically scraped all of the metadata off of distrowatch.com. The new module lives on PyPI as pulldistros. It scrapes most of that metadata.
The main libraries used are BeautifulSoup and requests.
One nice thing to note: BeautifulSoup lets you choose between different parsers. Python's built-in HTML parser (html.parser) will interpret something like
<ul>
<li>foo
<li>bar
</ul>
as a tree where each <li> element sits below the previous <li> element. The html5lib parser, however, is much closer to how a browser interprets this markup: its output lists all <li> tags as siblings.
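One way to see where the nesting comes from is to look at the events Python's built-in html.parser reports for that snippet. The following is a small sketch using only the standard library; EventLogger and tag_events are names made up for this illustration and are not part of BeautifulSoup or pulldistros. It shows that no closing </li> is ever reported, so a tree builder that simply follows these events has nothing telling it to close the first <li> before the second one opens.

```python
from html.parser import HTMLParser

SNIPPET = "<ul>\n<li>foo\n<li>bar\n</ul>"


class EventLogger(HTMLParser):
    """Record the start/end tag events that html.parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def tag_events(html):
    """Feed the markup to the parser and return the recorded tag events."""
    logger = EventLogger()
    logger.feed(html)
    logger.close()
    return logger.events


print(tag_events(SNIPPET))
# [('start', 'ul'), ('start', 'li'), ('start', 'li'), ('end', 'ul')]
```

To compare the parsers in BeautifulSoup itself, the parser name goes in as the second argument, for example BeautifulSoup(html, "html.parser") versus BeautifulSoup(html, "html5lib"). Note that html5lib is a separate package that has to be installed.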
A nice gem Tobi pointed me to is requests_cache, which creates a transparent cache for any requests that are sent out by requests. This speeds up testing a lot, since I had to run around 1000 requests to test the whole setup.
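A rough sketch of how scraping code can stay unaware of the cache: the fetching function only talks to a session object, so a plain requests session and a cached one are interchangeable. The function name fetch_pages and the cache name distrowatch_cache are assumptions for this example and are not taken from pulldistros.

```python
def fetch_pages(session, urls, timeout=30):
    """Download each URL through `session` and return {url: html_text}.

    `session` is anything with a requests-style .get() method, for example
    requests.Session() or requests_cache.CachedSession("distrowatch_cache").
    """
    pages = {}
    for url in urls:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # fail loudly on 4xx/5xx responses
        pages[url] = response.text
    return pages
```

With requests_cache installed, passing requests_cache.CachedSession("distrowatch_cache") as the session means a second test run is served from the local cache instead of hitting the site again. Alternatively, calling requests_cache.install_cache("distrowatch_cache") once at startup patches requests globally, so existing requests.get() calls are cached without further changes.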
Web scraping with Python is surprisingly fun.
The source lives on GitHub here.
Cover image: Matthew Harrigan [CC BY-SA 2.0 (https://creativecommons.org/licenses/by-sa/2.0)], via Wikimedia Commons, https://commons.wikimedia.org/wiki/File:Sandcastle1.jpg