Simple web scraping using Python

I just posted a GitHub gist showing how to use Python to scrape a web page and extract a product price.

As you can see, web scraping is pretty simple, but there are some challenges you need to keep in mind:

  • First, the page you are scraping may change without notice, causing your code to fail when it can no longer find the information it is looking for.
  • Second, if you request too many pages from a site or request them too frequently, the site you are scraping may start blocking or denying your requests.
  • Finally, depending on the site you are scraping, automated scraping may violate the site's terms of service.
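
The first two challenges can be softened a little in code. Here is a minimal sketch, written for Python 3 and using only the standard library. The names `parse_price` and `Throttle` are hypothetical helpers invented for illustration and are not part of the gist. The idea is to return None instead of crashing when the scraped text no longer looks like a price, and to enforce a minimum delay between requests.

```python
import time
from decimal import Decimal, InvalidOperation


def parse_price(text):
    """Return the price in `text` as a Decimal, or None if it is not a price."""
    if text is None:
        return None
    # Drop surrounding whitespace, a leading "$", and thousands separators.
    cleaned = text.strip().lstrip("$").replace(",", "")
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        # The page probably changed and the selector now matches something else.
        return None
    # Decimal also accepts "NaN" and "Infinity"; treat those as "not a price".
    return value if value.is_finite() else None


class Throttle:
    """Enforce a minimum number of seconds between successive wait() calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
```

Calling `wait()` on a shared `Throttle` before each request would space requests at least `min_interval` seconds apart. A suitable interval depends on the site, so any particular value is a guess rather than a rule.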

Here is the Python code that scrapes an example product web page:

from bs4 import BeautifulSoup
import decimal

try:
    from urllib2 import Request, urlopen  # Python 2
except ImportError:
    from urllib.request import Request, urlopen  # Python 3

def findPrice(url, selector):
    # Identify as a desktop browser via the User-Agent header.
    userAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
    req = Request(url, None, {'User-Agent': userAgent})
    html = urlopen(req).read()
    soup = BeautifulSoup(html, "lxml")
    # Take the first element matching the CSS selector, then strip whitespace and the "$" sign.
    return decimal.Decimal(soup.select(selector)[0].contents[0].strip().strip("$"))

print(findPrice("https://cdn.rawgit.com/brianpursley/661071c026b9bf130971/raw/94a914d15e977150b531c5c44cbee1545f9e70f0/example-scrape-target.html", "#priceRow > div:nth-of-type(2)"))

This code requires Python, Beautiful Soup, and the lxml parser that it asks Beautiful Soup to use.

On Ubuntu, you can use the following commands to install the dependencies, download the Python script from the gist, and run it:

$ apt-get install -y wget python python-bs4

$ wget https://gist.githubusercontent.com/brianpursley/c0c56b03f8e0095f77db/raw/f5133f2569da75f070f5a871118d0c1e76dffce0/scrape.py

$ python scrape.py
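
To experiment with the selector without fetching anything over the network, the same lookup can be run against an inline HTML string. This is only a sketch for Python 3. The markup in `SAMPLE_HTML` and the function name `find_price_in_html` are made up for this example, and the real target page may be structured differently. It uses Python's built-in "html.parser" so that lxml is not needed.

```python
from decimal import Decimal

from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example; the real page may differ.
SAMPLE_HTML = """
<div id="priceRow">
  <div>Price:</div>
  <div> $19.99 </div>
</div>
"""


def find_price_in_html(html, selector):
    """Return the price matched by `selector` as a Decimal, or None if nothing matches."""
    soup = BeautifulSoup(html, "html.parser")  # built-in parser, no lxml required
    matches = soup.select(selector)
    if not matches:
        # Selector found nothing, for example because the page layout changed.
        return None
    return Decimal(matches[0].get_text().strip().lstrip("$"))


print(find_price_in_html(SAMPLE_HTML, "#priceRow > div:nth-of-type(2)"))
```

Returning None when the selector matches nothing lets the caller decide how to react, rather than failing with an IndexError.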
