Back in 2012, nearly two years ago, I wrote a post about crawling the web for images. To be more specific, I explained how I went about crawling the Boston Globe’s Big Picture archive and storing all of its pictures on your hard drive. Since then the website has changed and the code obviously doesn’t work anymore. (Go ahead and try: gist link here) Looking at the code nowadays, I realize it has many mistakes and issues that keep it from being as portable as it could be. Furthermore, there have been new developments in the Python world with regard to crawling, and it would be a shame not to get to know them.
Since I had some extra time this week I decided to take on the “challenge” and publish an updated version.
Past
Two years ago I used a rather simple setup. I chose Python to write the script, and with the amazing Beautiful Soup library I was able to finish everything rather quickly. You can still find the old and broken script here.
Present
I wanted to implement a much nicer and cleaner version this time, and there are a couple of helpful tools and libraries out there to pick from.
After hearing lots of praise for lxml I thought about switching to it, but since the new version of Beautiful Soup (4.0) can use lxml behind the scenes, I went with the same library again after all. This also makes for an even better comparison with the original post.
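For reference, telling Beautiful Soup 4 to use lxml under the hood only takes an extra argument (a minimal sketch, assuming both packages are installed):

```python
from bs4 import BeautifulSoup

# Beautiful Soup 4 keeps its friendly API but lets lxml do the actual parsing.
soup = BeautifulSoup("<html><body><p>Hello</p></body></html>", "lxml")
print(soup.p.text)  # -> Hello
```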
The Beauty of the Universe
To get started I will choose a simple website: NASA’s Astronomy Picture of the Day. To be honest, this website was how it all started for me anyway. I read a Lifehacker post about crawling the page with a simple one-line wget command. Intrigued by the idea of crawling pictures from a site, I went on and wrote my first post about crawling the Boston Globe’s “The Big Picture”.
Indexing
NASA makes it pretty easy for us to get a list of all images, dating back to the site’s creation. We can simply download the archive page and crawl all of the necessary URLs. The archive has a very simple format: essentially one long list of links, one per day, each pointing to that day’s picture page.
Based on this simple structure we can write our crawler.
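A minimal sketch of that crawler, assuming requests and Beautiful Soup 4 are installed and that the archive still lists the daily pages as apYYMMDD.html links (the function name is just illustrative):

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://apod.nasa.gov/apod/"
ARCHIVE_URL = BASE_URL + "archivepix.html"

def collect_page_urls():
    """Download the archive page and return the URL of every daily page."""
    response = requests.get(ARCHIVE_URL)
    soup = BeautifulSoup(response.text, "lxml")
    urls = []
    for link in soup.find_all("a"):
        href = link.get("href", "")
        # Daily pages follow the pattern apYYMMDD.html
        if href.startswith("ap") and href.endswith(".html"):
            urls.append(BASE_URL + href)
    return urls
```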
In the next step we are going to use these URLs to actually download the images themselves.
Crawling
This step is a little more difficult, but still doable. Let’s take a look at an example page, say the one for July 19th, 2014. The structure is again really simple: the picture is wrapped in an <a> tag that links to the high-quality image on the server, while the nested <img> tag only contains a low-quality thumbnail. The strategy here will basically be to collect that <img> tag and pick one of the two URLs. The Lifehacker post that I mentioned above simply downloaded all images, thumbnails and originals alike; we are going to add an option to decide which one we want.
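A sketch of that strategy, reusing the imports and BASE_URL from the indexing snippet above (the function name and the high_quality flag are my own, and the exact markup may have changed since 2014):

```python
def collect_image_url(page_url, high_quality=True):
    """Return the URL of the picture on a single daily page."""
    response = requests.get(page_url)
    soup = BeautifulSoup(response.text, "lxml")
    img = soup.find("img")
    if img is None:
        return None  # some days feature a video instead of a picture
    if high_quality and img.parent.name == "a":
        # The surrounding <a> links to the full-resolution file.
        return BASE_URL + img.parent["href"]
    # Otherwise fall back to the low-quality thumbnail referenced by the <img> itself.
    return BASE_URL + img["src"]
```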
Downloading
This one is pretty self-explanatory. We take the image URLs collected above and download each one.
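Something along these lines does the job (again just a sketch; the folder name is arbitrary):

```python
import os

def download_image(image_url, target_dir="apod"):
    """Save a single image into target_dir, skipping files we already have."""
    if not os.path.exists(target_dir):
        os.makedirs(target_dir)
    filename = os.path.join(target_dir, image_url.split("/")[-1])
    if not os.path.exists(filename):
        response = requests.get(image_url)
        with open(filename, "wb") as f:
            f.write(response.content)
    return filename
```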
Crawling Expanded
You can find the complete code at the end of this page. I hope this post is a little clearer and easier to understand than my first one. Stay tuned for the second part of this post, where I will dive into more complex topics that can make crawling faster and more fun.
While I was writing the script and this post I found some helpful articles and tools that might interest you.
If you want a more complete solution for crawling a website, check out scrapy. (Not to be confused with scapy, a Python packet manipulation tool.) It comes with many tools that make crawling and storing website information pretty easy.
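A toy spider gives you an idea of the shape of a scrapy project (just a sketch, not something the script in this post uses):

```python
import scrapy

class ApodSpider(scrapy.Spider):
    """Collect every link on the APOD archive page."""
    name = "apod"
    start_urls = ["http://apod.nasa.gov/apod/archivepix.html"]

    def parse(self, response):
        for href in response.xpath("//a/@href").extract():
            yield {"href": href}
```

Running it with scrapy runspider apod_spider.py -o links.json writes the collected links straight to disk.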
Articles:
- Python concurrency
- Lifehacker article about crawling NASA’s “Astronomy Picture of the Day” with wget
- Comparison between lxml and BeautifulSoup on Stack Overflow
- Web Crawling with scrapy
- The Beauty of Big Pictures - cwoebker
Tools:
You can also find all the images that these two scripts download in the torrent files below.
This helps take some unnecessary load off their servers 😃
- APOD-2014
- bigPicture-2014 - N/A
Final code and example output
So in the end, after combining the code above, catching the errors that would otherwise interrupt the script, and adding some extra stuff here and there, we have the complete code.
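A condensed sketch of how the pieces above fit together, with the network calls wrapped so that a single failing page doesn’t stop the whole run:

```python
def main(high_quality=True):
    for page_url in collect_page_urls():
        try:
            image_url = collect_image_url(page_url, high_quality)
            if image_url is None:
                continue  # video days have nothing to download
            print("Downloading", image_url)
            download_image(image_url)
        except requests.RequestException as error:
            # Don't let one broken page or image interrupt the script.
            print("Skipping", page_url, "-", error)

if __name__ == "__main__":
    main()
```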