If you need to extract data from a web page, chances are you've already looked for an API. Unfortunately one isn't always available, and you sometimes have to fall back to web scraping.
In this article I'm going to cover a lot of the things that apply to all web scraping projects and how to overcome some common gotchas.
- Prerequisites
- Crawling / Spidering
- CSS Selectors
- Concurrency
- Robots.txt
- Avoiding Detection
- Common Problems
- Code Examples
Important
Always check whether you can extract the information via an API first, if one is provided. RSS / Atom feeds are also a great option when available.
Prerequisites
We'll primarily be using two external Python libraries.
Requests
We'll be using the requests library instead of urllib2. It's better in every way. I could go into details and explain why, but I think the requests page sums it all up in this short paragraph:
Python's standard urllib2 module provides most of the HTTP capabilities you need, but the API is thoroughly broken. It was built for a different time — and a different web. It requires an enormous amount of work (even method overrides) to perform the simplest of tasks.
lxml
lxml is an XML/HTML parser, so we'll be using it to extract data from the responses we get back. Some people prefer BeautifulSoup, but I'm comfortable writing my own XPath expressions, so I prefer to stick with lxml for its raw speed.
Crawling / Spidering
A crawler is exactly the same as a "general" scraper, except that it finds the next URLs to scrape on each page it parses. Crawling is the technique used to build large databases of information.
Examples:
- Crawling an eBay category to gather product information. (Don't - they have an API).
- Crawling a category in the yellow pages for the contact information of a certain profession.
- etc...
Storing seen URLs
You'll want to store the URLs you've already visited so that you don't scrape them multiple times. For a small scrape (< 50k URLs), I would suggest simply using a set for this. You can then check for the presence of a URL in the set before fetching it / adding it to the "To Fetch" queue.
If you're going to crawl a large site, then the above technique of simply using a set won't hold up. The set will grow huge and start using a ton of memory. The solution to this is a Bloom filter. I won't go into much detail, but a Bloom filter essentially lets us record which URLs we have seen, then ask the question "Have I seen this URL before?" and probably (very likely) get the correct answer, all while using very little memory. pybloomfiltermmap is a good Bloom filter implementation for Python.
Important: Make sure you normalize your URLs before inserting them into the set / Bloom filter. Depending on the site you may want to remove all query params, etc. You don't want to be storing the same URL a ton of times.
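As a rough sketch, here's one way the set-based approach might look, with a naive normalizer that strips query strings and fragments (whether you actually want to drop the query string depends on the site):

```python
import urlparse

found_urls = set()

def normalize(url):
    # Naive normalization: lowercase the host, drop the query string and fragment.
    # Adjust to taste -- on some sites the query string matters.
    parts = urlparse.urlsplit(url)
    return urlparse.urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, '', ''))

def should_fetch(url):
    key = normalize(url)
    if key in found_urls:
        return False
    found_urls.add(key)
    return True

# For a large crawl, swap the set for a Bloom filter (e.g. pybloomfiltermmap)
# and keep the same should_fetch() logic.
```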
CSS Selectors
If you're coming from a JavaScript background, you might be more comfortable querying the DOM with CSS selectors rather than XPath. lxml provides a way to do this: its cssselect support can translate a CSS selector into the equivalent XPath expression. Check out the lxml cssselect documentation for a short guide on how to use it.
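A minimal sketch of what that looks like (with newer lxml versions this relies on the separate cssselect package being installed):

```python
from lxml import html
from lxml.cssselect import CSSSelector

tree = html.fromstring('<div id="content"><a href="/about">About</a></div>')

# CSSSelector compiles the CSS selector into XPath under the hood
sel = CSSSelector('div#content a')
print sel.path                             # the translated XPath expression
print [e.get('href') for e in sel(tree)]   # ['/about']

# Parsed lxml.html trees also have a cssselect() convenience method
print [e.get('href') for e in tree.cssselect('div#content a')]
```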
Concurrency
We all know you want your data now, but hitting the server with 200 concurrent requests is going to make for one angry webmaster. Don't be that guy. Do everybody a favour and stick to a reasonable amount. 5 is usually enough.
To send requests concurrently with requests, check out the grequests library. It allows you to make your code concurrent in pretty much a line or two.
```python
import grequests

rs = (grequests.get(u) for u in urls)
responses = grequests.map(rs)
```
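If you want to cap the number of requests in flight, grequests.map also accepts a size argument (for example grequests.map(rs, size=5)), which pairs nicely with the advice above.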
Robots.txt
Almost every guide out there will tell you to always follow the robots.txt rules. I'm not going to do that. The simple fact is that they are usually really restrictive and don't even reflect what the webmaster's actual restrictions would be. Most sites just use the default robots.txt for their framework, or whatever their SEO plugin generates. Just make sure you're reasonable.
An exception to this rule is if you are writing a general spider, something like Google/Bing. In that case, I would follow the rules: the disallowed directories and the crawl delay. If they don't want something indexed, then don't index it in your search engine or whatever.
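If you do go down that route, the standard library can do the parsing for you. A minimal sketch using Python 2's robotparser module (urllib.robotparser on Python 3):

```python
import robotparser  # urllib.robotparser on Python 3

rp = robotparser.RobotFileParser()
rp.set_url('http://jakeaustwick.me/robots.txt')
rp.read()

# Check a URL against the Disallow rules before fetching it
if rp.can_fetch('MyCrawler/1.0', 'http://jakeaustwick.me/some/page'):
    pass  # safe to fetch according to robots.txt
```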
Avoiding Detection
Sometimes you don't want to get caught. I get that. It isn't actually that difficult to achieve. The first thing to do is randomize your user agent. Your IP address will still be the same for all the requests, but that isn't unusual; places like universities and businesses often have a ton of computers that all route traffic via one IP.
The next step is to use proxies. You can pick up shared proxies for a very small monthly cost, and when you're using randomized user agents and sending requests from 100 different IPs spread across different IP ranges and even different datacenters, they're going to have a hard time detecting you.
Recommended Proxy Provider: Proxy51
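Putting the two together might look something like this (the user agent strings and proxy addresses below are just placeholders; substitute your own list and whatever your provider gives you):

```python
import random
import requests

# Placeholder desktop user agents -- swap in a bigger, up-to-date list
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.72 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1',
]

# Placeholder proxies -- use the ones from your provider
PROXIES = ['http://10.0.0.1:8080', 'http://10.0.0.2:8080']

def fetch(url):
    proxy = random.choice(PROXIES)
    return requests.get(url,
                        headers={'User-Agent': random.choice(USER_AGENTS)},
                        proxies={'http': proxy, 'https': proxy})

response = fetch('http://jakeaustwick.me')
print response.status_code
```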
Common Problems
Installing lxml
You may run into some trouble when trying to install lxml, especially on a Linux distro. lxml requires the following packages to be installed first:
```
apt-get install libxml2-dev libxslt-dev python-dev lib32z1-dev
```
You can then run pip install lxml as per usual.
The site uses AJAX, I can't scrape it!
This is simply incorrect. In fact, if a site uses AJAX it can often make your job even easier. Use the Network tab in Chrome Developer Tools to find the AJAX request; you'll usually be greeted by a response in JSON. This saves you the effort of having to parse the page: you can just hit the AJAX endpoint directly.
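For instance, once you've found the endpoint in the Network tab, hitting it directly is just another requests call (the URL and response fields below are hypothetical; use whatever the real endpoint returns):

```python
import requests

# Hypothetical AJAX endpoint spotted in the Network tab
response = requests.get('http://example.com/api/products?page=1',
                        headers={'X-Requested-With': 'XMLHttpRequest'})

data = response.json()  # already structured data, no HTML parsing required
for product in data.get('products', []):
    print product.get('name')
```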
The site is showing different content to my scraper
Check the page source in your browser and make sure the information you want to scrape is actually there. It could be that the content was injected into the page via JavaScript, and your scraper doesn't run JavaScript. If this is the case, your only real option is to run something like PhantomJS. You can control Phantom from Python, though; just use something like selenium or splinter.
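As a rough sketch with selenium (this assumes the phantomjs binary is installed and on your PATH):

```python
from selenium import webdriver
from lxml import html

driver = webdriver.PhantomJS()
driver.get('http://jakeaustwick.me')

# page_source is the DOM after any javascript has run,
# so you can feed it straight into lxml as before
parsed_body = html.fromstring(driver.page_source)
print parsed_body.xpath('//title/text()')

driver.quit()
```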
Sites often serve different content to different user agents. Check the user agent your browser is sending and set your scraper to send the same one.
It's possible that the site is serving different content to different locations using geotargeting. If you're running your script from a server, then this could be affecting you.
I'm getting 403 errors / rate limited
The simple solution to this is: don't hit the site so damn hard. Be conservative with the number of concurrent requests you use. Do you really need the data within the hour? Can you really not wait 24 hours for it?
If you really need to keep sending requests to the site, then you're going to have to invest in some proxies. You can then write a simple proxy manager class that makes sure you only use the same proxy once every X interval, so that it doesn't get blocked.
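Something along these lines would do (a sketch only; the class name and interval are made up, and the proxy addresses are placeholders):

```python
import time
import itertools

class ProxyManager(object):
    """Round-robin proxy pool that waits until a proxy hasn't been
    used for min_interval seconds before handing it out again."""

    def __init__(self, proxies, min_interval=10):
        self.pool = itertools.cycle(proxies)
        self.min_interval = min_interval
        self.last_used = {}

    def get(self):
        proxy = next(self.pool)
        wait = self.min_interval - (time.time() - self.last_used.get(proxy, 0))
        if wait > 0:
            time.sleep(wait)
        self.last_used[proxy] = time.time()
        return {'http': proxy, 'https': proxy}

# Usage with requests:
# manager = ProxyManager(['http://10.0.0.1:8080', 'http://10.0.0.2:8080'])
# response = requests.get('http://jakeaustwick.me', proxies=manager.get())
```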
Check out the Concurrency and Avoiding Detection sections above.
Code Examples
The following examples aren't really of any use on their own. They're simply here as a reference for you.
Simple requests example
I'm not really going to show much in this example, you can find out all you need over at the requests documentation.
```python
import requests

response = requests.get('http://jakeaustwick.me')

# Response
print response.status_code      # Response Code
print response.headers          # Response Headers
print response.content          # Response Body Content

# Request
print response.request.headers  # Headers you sent with the request
```
Request with proxy
```python
proxy = {
    'http': 'http://102.32.3.1:8080',
    'https': 'http://102.32.3.1:4444'
}
response = requests.get('http://jakeaustwick.me', proxies=proxy)
```
Parsing the response body
It's easy to parse the HTML returned with lxml. Once we have it parsed into a tree, we can run XPath queries on it to grab the data we're interested in.
```python
import requests
from lxml import html

response = requests.get('http://jakeaustwick.me')

# Parse the body into a tree
parsed_body = html.fromstring(response.text)

# Perform xpaths on the tree
print parsed_body.xpath('//title/text()')  # Get page title
print parsed_body.xpath('//a/@href')       # Get href attribute of all links
```
Download all images on a page
The following script will download all the images and save them in downloaded_images/. Make sure that directory exists before running it.
```python
import requests
from lxml import html
import sys
import urlparse

response = requests.get('http://imgur.com/')
parsed_body = html.fromstring(response.text)

# Grab links to all images
images = parsed_body.xpath('//img/@src')
if not images:
    sys.exit("Found No Images")

# Convert any relative urls to absolute urls
images = [urlparse.urljoin(response.url, url) for url in images]
print 'Found %s images' % len(images)

# Only download the first 10
for url in images[0:10]:
    r = requests.get(url)
    f = open('downloaded_images/%s' % url.split('/')[-1], 'wb')
    f.write(r.content)
    f.close()
```
Single Threaded Crawler (basic)
This crawler just demonstrates the basic principle behind crawling. It uses a deque so you can pop from the left of the queue and therefore fetch URLs in the order you found them. It doesn't do anything useful with the pages, it just prints out the page title.
```python
import requests
from lxml import html
import urlparse
import collections

STARTING_URL = 'http://jakeaustwick.me'

urls_queue = collections.deque()
urls_queue.append(STARTING_URL)
found_urls = set()
found_urls.add(STARTING_URL)

while len(urls_queue):
    url = urls_queue.popleft()

    response = requests.get(url)
    parsed_body = html.fromstring(response.content)

    # Prints the page title
    print parsed_body.xpath('//title/text()')

    # Find all links and make them absolute
    links = [urlparse.urljoin(response.url, href) for href in parsed_body.xpath('//a/@href')]
    for link in links:
        # Only enqueue HTTP/HTTPS links we haven't seen before
        if link not in found_urls and link.startswith('http'):
            found_urls.add(link)
            urls_queue.append(link)
```