Python web scraping resource

If you need to extract data from a web page, the chances are you first looked for an API. Unfortunately one isn't always available, and you sometimes have to fall back to web scraping.

In this article I'm going to cover the things that apply to most web scraping projects, and how to overcome some common gotchas.

Important

Always check whether you can extract the information via an API first, if one is provided. RSS / Atom feeds are also a great option when available.

Prerequisites

We'll primarily be using two external Python libraries.

Requests

We'll be using the requests library instead of urllib2. It's better in every way. I could go into details and explain why, but I think the requests page sums it all up in this short paragraph:

Python's standard urllib2 module provides most of the HTTP capabilities you need, but the API is thoroughly broken. It was built for a different time — and a different web. It requires an enormous amount of work (even method overrides) to perform the simplest of tasks.

lxml

lxml is an XML/HTML parser, so we'll be using it to extract the data from the responses we get back. Some people prefer BeautifulSoup, but I'm comfortable writing my own XPath expressions, so I prefer to stick with lxml for raw speed.

Crawling / Spidering

Crawling works exactly like a "general" scrape, except that you find the next URLs to scrape on each page you parse. Crawling is the technique used to build large databases of information.

Examples:

  • Crawling an eBay category to gather product information. (Don't - they have an API).
  • Crawling a category in the yellow pages for the contact information of a certain profession.
  • etc...

Storing seen URLs

You'll want to store the URLs you've already visited, so that you don't scrape them multiple times. For a small scrape (< 50k URLs), I would suggest simply using a set for this. You can then check for the presence of a URL in the set before fetching it / adding it to the "To Fetch Queue".
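
A minimal sketch of that pattern, using a deque as the "To Fetch Queue" (it's the same idea the single threaded crawler at the end of this article uses):

import collections

to_fetch = collections.deque(['http://jakeaustwick.me'])
seen_urls = set(['http://jakeaustwick.me'])

while len(to_fetch):
    url = to_fetch.popleft()
    # ... fetch and parse the page here, collecting new_links ...
    new_links = []
    for link in new_links:
        if link not in seen_urls:
            seen_urls.add(link)
            to_fetch.append(link)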

If you're going to crawl a large site, the above technique of simply using a set won't hold up. The set will grow huge and start using a ton of memory. The solution to this is a Bloom filter. I won't go into much detail, but a Bloom filter essentially lets us record which URLs we have seen, then ask the question "Have I seen this URL before?" and get the correct answer with very high probability, all while using very little memory. pybloomfiltermmap is a good Bloom filter implementation for Python.
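
A rough sketch with pybloomfiltermmap (the BloomFilter(capacity, error_rate, filename) constructor is how I remember its API, so double-check it against the project's docs):

from pybloomfilter import BloomFilter

# Room for 10 million URLs at a 0.1% false positive rate, backed by an mmap'd file
seen_urls = BloomFilter(10000000, 0.001, 'seen_urls.bloom')

url = 'http://jakeaustwick.me/some/page'
if url not in seen_urls:
    seen_urls.add(url)
    # fetch it / add it to the "To Fetch Queue" here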

Important: Make sure you normalize your URLs before inserting them into the set / Bloom filter. Depending on the site, you may want to remove all query params, etc. You don't want to be storing the same URL a ton of times.
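
A rough normalization sketch; exactly what you strip (query strings, fragments, trailing slashes) depends on the site:

import urlparse

def normalize_url(url):
    # Lowercase the scheme and host, drop the query string and fragment
    parts = urlparse.urlsplit(url)
    return urlparse.urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                                parts.path.rstrip('/'), '', ''))

print normalize_url('HTTP://JakeAustwick.me/page/?utm_source=feed#top')
# http://jakeaustwick.me/page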

CSS Selectors

If you're coming from a JavaScript background, you might be more comfortable querying the DOM with CSS selectors rather than XPath. lxml provides a way to do this: its cssselect support includes a css_to_xpath function. Check out this page for a short guide on how to use it:

http://lxml.de/cssselect.html
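
For example, something along these lines (CSSSelector compiles the selector to XPath under the hood; newer lxml versions may need the separate cssselect package installed):

from lxml import html
from lxml.cssselect import CSSSelector

parsed_body = html.fromstring('<div class="post"><a href="/about">About</a></div>')

sel = CSSSelector('div.post a')
print sel.path                                    # the XPath it was translated to
print [e.get('href') for e in sel(parsed_body)]   # ['/about']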

Concurrency

We all know you want your data now, but hitting the server with 200 concurrent requests is going to make for one angry webmaster. Don't be that guy. Do everybody a favour and stick to a reasonable number. 5 is usually enough.

To send requests concurrently with minimal effort, check out the grequests library. It allows you to make your code concurrent in pretty much a line or two.

rs = (grequests.get(u) for u in urls)
responses = grequests.map(rs)
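
If memory serves, grequests.map also takes a size argument that caps how many requests are in flight at once, which is an easy way to stick to that polite limit of 5 (treat the exact keyword as an assumption and check the grequests README):

import grequests

urls = ['http://jakeaustwick.me', 'http://example.com']

rs = (grequests.get(u) for u in urls)
responses = grequests.map(rs, size=5)  # size=5: at most 5 concurrent requests (assumed keyword)
print [r.status_code for r in responses if r is not None]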

Robots.txt

Almost every guide out there will tell you to always follow the rules in robots.txt. I'm not going to do that. The simple fact is that they are usually really restrictive and don't even reflect the webmaster's actual restrictions. Most sites just use the default robots.txt for their framework, or whatever their SEO plugin generates. Just make sure you're reasonable.

An exception to this rule is if you are writing a general spider, something like Google/Bing. In that case, I would follow the rules: respect the disallowed directories and the crawl delay. If they don't want something indexed, then don't index it in your search engine or whatever.
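
If you do go down that route, the standard library's robotparser module (urllib.robotparser on Python 3) handles the disallow rules for you; note it doesn't expose the crawl delay, so you'd handle that yourself. A quick sketch:

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://jakeaustwick.me/robots.txt')
rp.read()

url = 'http://jakeaustwick.me/some/page'
if rp.can_fetch('MyCrawler/1.0', url):
    print 'Allowed to fetch %s' % url
else:
    print 'Disallowed by robots.txt, skipping %s' % url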

Avoiding Detection

Sometimes you don't want to get caught. I get that. It isn't actually that difficult to achieve. The first thing to do is randomize your user agent. Your IP address will still be the same for all the requests, but that isn't unusual: places like universities and businesses often have a ton of computers that all route traffic via one IP.
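
Randomizing the user agent is just a matter of picking one per request; the strings below are only examples, so swap in whatever real browser agents you like:

import random
import requests

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.72 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Version/6.0.5 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; rv:22.0) Gecko/20100101 Firefox/22.0',
]

headers = {'User-Agent': random.choice(USER_AGENTS)}
response = requests.get('http://jakeaustwick.me', headers=headers)
print response.request.headers['User-Agent']  # confirm what was actually sent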

The next step is to use proxies. You can pick up shared proxies at a very small monthly cost, and when you're using randomized user agents and sending requests from 100 different IPs spread across different IP ranges and even different datacenters, they're going to have a hard time detecting you.

Recommended Proxy Provider: Proxy51

Common Problems

Installing lxml

You may run into some trouble when trying to install lxml, especially on a Linux distro. lxml requires the following packages to be installed first:

apt-get install libxml2-dev libxslt-dev python-dev lib32z1-dev

You can then run the pip install lxml command as per usual.

The site uses AJAX, I can't scrape it!

This is simply incorrect. Often if a site uses AJAX, it can make your job even easier. Use the Network tab in Chrome Developer Tools to find the AJAX request; you'll usually be greeted by a response in JSON. This saves you the effort of parsing the page, because you can just hit the AJAX endpoint directly.
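
Once you've spotted the endpoint in the Network tab, you just request it directly; the URL, header and field names below are placeholders for whatever the site actually uses:

import requests

# Hypothetical AJAX endpoint found via the Network tab
response = requests.get('http://example.com/ajax/products?page=1',
                        headers={'X-Requested-With': 'XMLHttpRequest'})  # some sites check this header

data = response.json()  # requests parses the JSON body for you
for item in data.get('products', []):
    print item.get('name'), item.get('price')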

The site is showing different content to my scraper

  1. Check the page source in your browser and make sure the information you want to scrape is actually there. It could be that the content is injected into the page via JavaScript, and our scraper doesn't run JavaScript. If this is the case, your only real option is to run something like PhantomJS. You can control Phantom from Python: just use something like selenium or splinter (there's a rough sketch after this list).

  2. Sites often serve different content to different user agents. Check the user agent your browser is sending and set your scraper to send the same one.

  3. It's possible that the site is serving different content to different locations using geotargeting. If you're running your script from a server, then this could be affecting you.
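
For point 1, here's a rough sketch of driving PhantomJS from Python with selenium (webdriver.PhantomJS assumes the phantomjs binary is on your PATH; splinter works in much the same way):

from selenium import webdriver
from lxml import html

driver = webdriver.PhantomJS()  # requires the phantomjs binary to be installed
driver.get('http://jakeaustwick.me')

# page_source is the DOM after JavaScript has run, so parse that instead of the raw response
parsed_body = html.fromstring(driver.page_source)
print parsed_body.xpath('//title/text()')

driver.quit()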

I'm getting 403 errors / rate limited

The simple solution to this is: don't hit the site so damn hard. Be conservative with the number of concurrent requests you use. Do you really need the data within the hour? Can you really not wait 24 hours for it?

If you really need to keep sending requests to the site, then you're going to have to invest in some proxies. You can then create a simple proxy manager class that makes sure you only use the same proxy once every X interval, so that it doesn't get blocked.
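
The proxy manager doesn't need to be fancy; here's a minimal sketch that just cycles through a pool and waits if a proxy was used too recently (the addresses are placeholders):

import itertools
import time
import requests

class ProxyManager(object):
    def __init__(self, proxies, min_interval=10):
        self.pool = itertools.cycle(proxies)
        self.min_interval = min_interval  # seconds to wait between uses of the same proxy
        self.last_used = {}

    def get(self):
        proxy = next(self.pool)
        wait = self.min_interval - (time.time() - self.last_used.get(proxy, 0))
        if wait > 0:
            time.sleep(wait)
        self.last_used[proxy] = time.time()
        return {'http': proxy, 'https': proxy}

manager = ProxyManager(['http://102.32.3.1:8080', 'http://102.32.3.2:8080'])
response = requests.get('http://jakeaustwick.me', proxies=manager.get())
print response.status_code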

Check out the Concurrency and Avoiding Detection sections above.

Code Examples

The following examples aren't really of any use on their own. They're simply here as a reference for you.

Simple requests example

I'm not really going to show much in this example; you can find out all you need over at the requests documentation.

import requests

response = requests.get('http://jakeaustwick.me')

# Response
print response.status_code # Response Code
print response.headers # Response Headers
print response.content # Response Body Content

# Request
print response.request.headers # Headers you sent with the request
Request with proxy

proxy = {'http' : 'http://102.32.3.1:8080',
         'https': 'http://102.32.3.1:4444'}

response = requests.get('http://jakeaustwick.me', proxies=proxy)
Parsing the response body

It's easy to parse the HTML returned with lxml. Once we have it parsed into a tree, we can run XPath queries on it to grab the data we are interested in.

import requests
from lxml import html

response = requests.get('http://jakeaustwick.me')

# Parse the body into a tree
parsed_body = html.fromstring(response.text)

# Perform xpaths on the tree
print parsed_body.xpath('//title/text()') # Get page title
print parsed_body.xpath('//a/@href') # Get href attribute of all links
Download all images on a page

The following script will download all the images, and save them in downloaded_images/. Make sure that directory exists before running it.

import requests
from lxml import html
import sys
import urlparse

response = requests.get('http://imgur.com/')
parsed_body = html.fromstring(response.text)

# Grab links to all images
images = parsed_body.xpath('//img/@src')
if not images:
    sys.exit("Found No Images")

# Convert any relative urls to absolute urls
images = [urlparse.urljoin(response.url, url) for url in images]
print 'Found %s images' % len(images)

# Only download first 10
for url in images[0:10]:
    r = requests.get(url)
    # Write the image out in binary mode
    f = open('downloaded_images/%s' % url.split('/')[-1], 'wb')
    f.write(r.content)
    f.close()
Single Threaded Crawler (basic)

This crawler just demonstrates the basic principle behind crawling. It uses a deque, so you can pop from the left of the queue and therefore fetch URLs in the order you find them. It doesn't do anything useful with the pages, it just prints out each page title.

import requests
from lxml import html
import urlparse
import collections

STARTING_URL = 'http://jakeaustwick.me'

urls_queue = collections.deque()
urls_queue.append(STARTING_URL)
found_urls = set()
found_urls.add(STARTING_URL)

while len(urls_queue):
    url = urls_queue.popleft()
    response = requests.get(url)
    parsed_body = html.fromstring(response.content)

    # Prints the page title
    print parsed_body.xpath('//title/text()')

    # Find all links
    links = [urlparse.urljoin(response.url, link) for link in parsed_body.xpath('//a/@href')]
    for link in links:
        # Only enqueue HTTP/HTTPS links we haven't seen before
        if link not in found_urls and link.startswith('http'):
            found_urls.add(link)
            urls_queue.append(link)
