(Originally posted July 28, 2013)
Web scraping is fun. I think it has something to do with the fact that you're seemingly stealing information that shouldn't be stolen. In reality, it's all public anyway, so there isn't really any issue beyond possible privacy policy concerns and making sure you aren't accidentally DDoS-ing the site. Actually, the privacy policy thing might be a real issue, since people are overly protective of their data when they shouldn't be, but that's another discussion. Either way, it's fun, and you can make some really cool things with it.
Currently, I'm working on a fantasy golf app for some friends. There are two parts of this that need scraping. The first is getting all the information about the players, including their world ranking, into the database. The second is getting up-to-date information on the tournament results while a tournament is happening. Theoretically, the first could be done by hand, since it's just inputting names and only takes place once a quarter according to the rules of the fantasy game, but doing things by hand when you can do them programmatically is a giant waste of time. So I'm going to show how to use web scraping to update the Django database at the same time. Since later we're going to be scraping another website on a schedule using celery, we're going to write this in tasks.py.
We're going to need some new installs for the scraping. The first is requests, which makes getting web pages really simple compared to urllib2 or the like. The other is beautifulsoup4, which is what we're going to use to pick apart the DOM. Assuming that your virtualenv is running,
(env)$ pip install requests
(env)$ pip install beautifulsoup4
The initial setup of the scraping function is the following.
import requests
from bs4 import BeautifulSoup

def wr_scrape():
    r = requests.get('http://www.officialworldgolfranking.com/rankings/default.sps')
    soup = BeautifulSoup(r.text)

wr_scrape()
After you have the DOM loaded into Beautiful Soup, getting at the information you're looking for turns into something of an art. I'm by no means an expert, but I've done enough to know a few tricks. First, you need to go searching through the DOM using something like Chrome's Developer Tools to find a tag that includes everything you want. Hopefully there is a class or id on it that you can use to get access quickly. Then loop through, or slice out, the information from there. In this case, we're looking for the table that has the info for each player, and it turns out it has the title "Click on player names to be taken to their individual tournaments page". So we'll use that as the identifier.
tables = soup.find_all('table', title="Click on player names to be taken to their individual tournaments page")
This oddly returns two such tables, only one of which we want. From here we want to loop through the rows in the table with the players' info. I'm going to stop describing every step I took here, since I could go on forever about the little details. Like I mentioned earlier, this is an art; there's a bunch of trial and error that goes into getting the info. Practice and you'll get better. After some finagling, here is the code to get the relevant information. We want the player's name, their rank, and the unique id that the World Ranking website gives to the player.
import requests
from bs4 import BeautifulSoup
import cgi

def wr_scrape():
    r = requests.get('http://www.officialworldgolfranking.com/rankings/default.sps')
    soup = BeautifulSoup(r.text)
    tables = soup.find_all('table', title="Click on player names to be taken to their individual tournaments page")
    table = tables[1]
    for info in table.contents:
        try:
            data = info.contents[7]
            player_name = data.a.string
            player_url = data.a['href']
            # Parse the whole href as a query string to pull out the
            # rank and the site's unique player id.
            qs = cgi.parse_qs(data.a['href'])
            player_rank = qs['Rank'][0]
            player_ID = qs['/players/bio.sps?ID'][0]
            print '%s, %s, %s' % (player_name, player_rank, player_ID)
        except AttributeError:  # in case it's just a string
            pass
        except IndexError:  # this is if there is no [7]
            pass

wr_scrape()
We put everything in a try block and use exception catching to filter out the false rows in the table: anything that doesn't raise on the info.contents[7] or data.a['href'] lines passes through. The print line at the end prints out the info as a check to make sure that we have all the information we want and that it's correct.
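As a side note, the odd-looking '/players/bio.sps?ID' key comes from feeding the entire href to cgi.parse_qs, which only splits on '&' and '=', so the path fuses with the first parameter name. A quick sketch with a made-up href shows what's going on (the ID and Rank values here are invented for illustration):

import cgi

# A hypothetical href in the same shape as the ones on the rankings page.
href = '/players/bio.sps?ID=8793&Rank=1'

# parse_qs only splits on '&' and '=', so the path and the first
# parameter name end up fused into one key.
qs = cgi.parse_qs(href)
print qs
# {'/players/bio.sps?ID': ['8793'], 'Rank': ['1']}  (key order may vary)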
Now we want to interface with the models. This is the current version of the model.
class Player(models.Model):
    name = models.CharField(max_length=40)
    current_wr = models.IntegerField()
    id_wr = models.IntegerField()
    id_pga = models.IntegerField()
We have three of the four attributes that we want. The other, id_pga, is for scraping the leaderboards, which we'll deal with later. In order to use models and other Django pieces in the script, we need to set up the Django settings in the script itself. To do this, we want to do a few things. The first is to create a directory at the same level as manage.py called bin. Inside, we create a file that runs the python command to execute the script as a module, so we can use relative imports. This file is called "update_wr".
#!/bin/bash
python -m fg.apps.players.tasks
Then we
$ chmod u+x bin/update_wr
so we can run the command from the command line. Then, in the tasks.py file, we need to do a few things. The first is to make sure that the settings are configured to allow for Django model use. Since we aren't calling this from manage.py, we need to do the configuration ourselves at the top of tasks.py.
from django.conf import settings
if not settings.configured:
    from ... import settings
    from django.core.management import setup_environ
    setup_environ(settings)
We check to see if the settings are configured, and if they aren't, we use a relative import to grab them and then set up the environment. I should mention that there's another way to make this run with the Django settings: making it callable from the command line the same way we do syncdb or runserver, i.e., as a custom management command. We also want to move the call to wr_scrape() at the bottom of the file to
if __name__ == '__main__':
wr_scrape()
This is because we only want the script to run if we call it specifically. For reference, the management-command alternative mentioned above would look roughly like the sketch below.
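This is a minimal sketch, not what I'm actually using: it assumes the app lives at fg.apps.players and that the file is saved as fg/apps/players/management/commands/update_wr.py (the command name is up to you).

from django.core.management.base import NoArgsCommand

from fg.apps.players.tasks import wr_scrape


class Command(NoArgsCommand):
    help = 'Scrape the world rankings and update the Player table.'

    def handle_noargs(self, **options):
        # manage.py has already configured settings by the time this runs.
        wr_scrape()

That would let you run python manage.py update_wr without any of the settings bootstrapping above. Either way, the last step here is to add the part of the script that adds to the database!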
try:
    player = Player.objects.get(id_wr=player_ID)
except Player.DoesNotExist:
    player = None
if player is None:
    player = Player(name=player_name, current_wr=player_rank, id_wr=player_ID, id_pga=-1)
    player.save()
else:
    player.current_wr = player_rank
    player.save()
This snippet should go right before the print statement (I kept that in so I can see that everything is working fine). If you've imported the Player model at the top of the file, below the settings configuration code, then you should now have populated the first 50 players in the world rankings! The final thing to do is to get more than just the top 50; something like the top 250 would be reasonable for now. Looking at the urls when we ask for players 50-100, we can see that the site just uses an incrementing paging system. If we just change the page number in the url, we get a different page of players.
base_url = 'http://www.officialworldgolfranking.com/rankings/default.sps?region=world&PageCount='
for i in range(1, 6):
    url = base_url + str(i)
    r = requests.get(url)
We put this snippet at the start of the function, and if you make sure that all the lines are indented correctly, then we should have just added the players ranked 1-250 in the world rankings!
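Since the indentation is easy to get wrong, here's roughly how the pieces fit together once the paging loop wraps the scrape. This is a condensed sketch, not the exact file; the database block is the one from above:

import requests
from bs4 import BeautifulSoup
import cgi

def wr_scrape():
    base_url = 'http://www.officialworldgolfranking.com/rankings/default.sps?region=world&PageCount='
    for i in range(1, 6):  # pages 1-5, 50 players each
        r = requests.get(base_url + str(i))
        soup = BeautifulSoup(r.text)
        tables = soup.find_all('table', title="Click on player names to be taken to their individual tournaments page")
        table = tables[1]
        for info in table.contents:
            try:
                data = info.contents[7]
                player_name = data.a.string
                qs = cgi.parse_qs(data.a['href'])
                player_rank = qs['Rank'][0]
                player_ID = qs['/players/bio.sps?ID'][0]
                # ... Player get/update block from above goes here ...
                print '%s, %s, %s' % (player_name, player_rank, player_ID)
            except (AttributeError, IndexError):  # false rows in the table
                pass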
You can see the code in its entirety on github here. We still have a little more scraping to do – getting the tournament leaderboards into the database and matching them with the players we have. But that's for next time.