The Beauty of Python Crawling (1/2)

Back in 2012, nearly two years ago, I wrote a post about crawling the web for images. To be more specific, I explained how I went about crawling the Boston Globe’s Big Picture archive and storing all of its pictures on my hard drive. Since then the website has changed and the code obviously doesn’t work anymore. (Go ahead and try: gist link here) Looking at the code nowadays, I realize it has many mistakes and issues that keep it from being as portable as it could be. Furthermore, there have been new developments in the Python world with regard to crawling, and it would be a shame not to get to know them.

Since I had some extra time this week I decided to take on the “challenge” and publish an updated version.

Past

Two years ago I used a rather simple setup. I chose Python to write the script, and with the amazing Beautiful Soup library I was able to finish everything rather quickly. You can still find the old and broken script here.

Present

I wanted to implement a much nicer and cleaner version this time. There are a couple of helpful tools and libraries out there to choose from.

After hearing lots of praise for lxml I thought about using it, but since the new version of Beautiful Soup (4.0) supports lxml behind the curtains, I went with the same library again after all. This will also make for a better comparison with the original post.

The Beauty of the Universe

To get started I will choose a simple website: NASA’s Astronomy Picture of the Day. To be honest, this website was how it all started for me anyway. I read a Lifehacker post about crawling the page with a simple one-line wget command. Intrigued by the idea of crawling pictures from a site, I went on to write my first post about crawling the Boston Globe’s “The Big Picture”.

Indexing

NASA makes it pretty easy for us to get a list of all images, dating all the way back to the site’s creation. We can simply download the archive page and extract all of the necessary URLs. Since the page has a very simple format, getting the content we need is straightforward. The archive basically looks like this:

<b>

2014 July 19:  <a href="ap140719.html">Alicante Beach Moonrise</a><br>
2014 July 18:  <a href="ap140718.html">Ou4: A Giant Squid Nebula</a><br>
2014 July 17:  <a href="ap140717.html">3D Homunculus Nebula</a><br>
2014 July 16: ...
</b>
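Before reaching for Beautiful Soup, we can sanity-check this structure offline. The following sketch (written against Python 3’s standard library, with the snippet above hard-coded, so it is a stand-in for the real page rather than a live crawl) pulls the hrefs out of the `<b>` block without any third-party dependencies:

```python
from html.parser import HTMLParser

# Hard-coded copy of the archive snippet shown above.
ARCHIVE_SNIPPET = """
<b>
2014 July 19:  <a href="ap140719.html">Alicante Beach Moonrise</a><br>
2014 July 18:  <a href="ap140718.html">Ou4: A Giant Squid Nebula</a><br>
2014 July 17:  <a href="ap140717.html">3D Homunculus Nebula</a><br>
</b>
"""

class ArchiveParser(HTMLParser):
    """Collects the href of every <a> that sits inside the <b> block."""

    def __init__(self):
        super().__init__()
        self.inside_b = False
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == 'b':
            self.inside_b = True
        elif tag == 'a' and self.inside_b:
            self.urls.append(dict(attrs).get('href'))

    def handle_endtag(self, tag):
        if tag == 'b':
            self.inside_b = False

parser = ArchiveParser()
parser.feed(ARCHIVE_SNIPPET)
print(parser.urls)  # ['ap140719.html', 'ap140718.html', 'ap140717.html']
```

The same per-entry URLs are exactly what the Beautiful Soup version below extracts, only with far less ceremony.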

Based on this simple structure we can write our crawler.

import urllib2

from bs4 import BeautifulSoup

ROOT_URL = 'http://apod.nasa.gov/apod/'

def load():
    # Download the archive page and collect the relative URL of every entry.
    data = urllib2.urlopen(ROOT_URL + 'archivepix.html').read()
    soup = BeautifulSoup(data, 'lxml')
    results = soup.find('b').findAll('a')
    return [result['href'] for result in results]

In the next step we are going to use these URLs to actually download the images themselves.

Crawling

This step is a little more difficult, but still doable. Let’s take a look at an example page, a snippet from the July 19th, 2014 page:

<center>
<h1> Astronomy Picture of the Day </h1>
<p>

<a href="archivepix.html">Discover the cosmos!</a>
Each day a different image or photograph of our fascinating universe is
featured, along with a brief explanation written by a professional astronomer.
<p>

2014 July 19
<br>

<a href="image/1407/MoonAlicante_Gonzalez.jpg">
<IMG SRC="image/1407/MoonAlicante_Gonzalez1024.jpg"
alt="See Explanation.  Clicking on the picture will download
 the highest resolution version available."></a>

</center>

Basically, the strategy here is simply to collect the <img> tag. The page again has a really simple structure. Note that the <img> contains a low-quality thumbnail, while the surrounding <a> links to the high-quality image on the server. The Lifehacker post I mentioned above simply downloaded all images, thumbnails and originals alike. We are going to add an option to decide which one we want.

from clint.textui import puts, progress

def getPhotos(urls, thumbs=False):
    puts("Locating Photos...")
    photos = {}
    for url in progress.bar(urls):
        data = urllib2.urlopen(ROOT_URL + url).read()
        soup = BeautifulSoup(data, 'lxml')
        result = soup.find('img')
        if thumbs:
            photos[url] = result['src']  # low-quality thumbnail
        else:
            photos[url] = result.parent['href']  # high-quality original
    return photos
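The pairing of thumbnail and original relies purely on the <img> sitting inside the <a>. As a quick offline check of that relationship, here is a Python 3, stdlib-only sketch fed with a condensed copy of the page snippet from above (again a stand-in, not a live request):

```python
from html.parser import HTMLParser

# Condensed copy of the example page shown above.
PAGE_SNIPPET = """
<a href="image/1407/MoonAlicante_Gonzalez.jpg">
<IMG SRC="image/1407/MoonAlicante_Gonzalez1024.jpg" alt="See Explanation."></a>
"""

class ImageParser(HTMLParser):
    """Pairs each <img> thumbnail with the enclosing <a>'s full-size href."""

    def __init__(self):
        super().__init__()
        self.current_href = None
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so SRC becomes src.
        attrs = dict(attrs)
        if tag == 'a':
            self.current_href = attrs.get('href')
        elif tag == 'img' and self.current_href:
            # (low-quality thumbnail, high-quality original)
            self.pairs.append((attrs.get('src'), self.current_href))

    def handle_endtag(self, tag):
        if tag == 'a':
            self.current_href = None

parser = ImageParser()
parser.feed(PAGE_SNIPPET)
print(parser.pairs)
# [('image/1407/MoonAlicante_Gonzalez1024.jpg',
#   'image/1407/MoonAlicante_Gonzalez.jpg')]
```

Beautiful Soup’s `result.parent['href']` expresses the same nesting relationship in a single attribute lookup.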

Downloading

This one is pretty self-explanatory. We take the image URLs collected above and download each one.

import os

def downloadPhoto(folder, photo):
    # The hrefs on the page are relative, so prepend the root URL.
    u = urllib2.urlopen(ROOT_URL + photo)
    localFile = open(os.path.join(folder, photo.split('/')[-1]), "wb")
    localFile.write(u.read())
    localFile.close()
    u.close()
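The local filename is just the last ’/’-separated segment of the image URL, joined onto the target folder. A quick check of that derivation:

```python
import os

photo = 'image/1407/MoonAlicante_Gonzalez.jpg'
folder = 'apod'

# The script names the file after the last segment of the URL path.
filename = photo.split('/')[-1]
path = os.path.join(folder, filename)
print(filename)  # MoonAlicante_Gonzalez.jpg
```

This keeps NASA’s original filenames, so re-running the script into the same folder simply overwrites each image in place.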

Crawling Expanded

You can find the complete code at the end of this page. I hope this post is a little clearer and easier to understand than my first one. Stay tuned for the second part, where I will dive into more complex topics that can make crawling faster and more fun.

While I was writing the script and this post I found some helpful articles and tools that might interest you.

If you want a more complete solution for crawling websites, check out Scrapy. (Not to be confused with Scapy, a Python network packet generator.) It comes with many tools that make crawling and storing website information pretty easy.

Articles:

Tools:

Also, you can find all the images that these two scripts download in the torrent files below.

This helps take some unnecessary load off their servers 😃

Final code and example output

So in the end, after combining the code above, catching some errors that could interrupt the script, and adding some extra bits here and there, we have the complete code.

Output

*********************************************
 NASA's picture of the day since 1995.
 by Cecil Woebker (http://cwoebker.com)
*********************************************

 Usage:
 $ ./apod.py

 Saves all pictures from NASA's Astronomy Picture of the Day archive
 to the current directory.

Loading archive...
Opening archive...
[################################] 6967/6967 - 00:00:00
Found 6967 links.
Locating Photos...
[################################] 6967/6967 - 00:00:00
--------------
Downloading...
--------------
[################################] 6967/6967 - 00:00:00
Finished

By Cecil Woebker

