Tutorial -- Building a Web Scraper

In this tutorial you will build a basic Python web scraper to download and process data. We will build the first part of the scraper together in class, and you will complete the second part on your own.

Getting Started

If you haven't already done so:

  • Install Python 3.6+
  • Install the following packages: pandas, beautifulsoup4 (imported as bs4), requests
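A quick way to confirm the setup is to try the imports in a Python session. Note that the BeautifulSoup package installs under the name beautifulsoup4 but is imported as bs4:

```python
# Environment check: these imports succeed only if the packages are installed.
import sys
import pandas
import requests
import bs4  # installed as "beautifulsoup4", imported as "bs4"

assert sys.version_info >= (3, 6), "Python 3.6 or newer is required"
print("pandas", pandas.__version__)
print("requests", requests.__version__)
print("beautifulsoup4", bs4.__version__)
```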


In-Class Practice - "Wikipedia" Data

Your first task is to scrape and parse a Wikipedia page. To get you started, we have provided an initial script that pulls data from the pages and saves it to a CSV file. Download the Tutorial 1 Example Jupyter Notebook above. Then we will go over how to:

  • Modify the script and use it to download Wikipedia pages and save them as HTML files on your local machine.
  • Load the saved HTML files and use BeautifulSoup to extract the records.
  • Save the data to a CSV file using a comma as the delimiter (the default when writing pandas dataframes to CSV).
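The three steps above can be sketched roughly as follows. This is a minimal outline, not the provided notebook: the URL and the tags being extracted are placeholders that you would adapt to the actual Wikipedia pages used in class. The example at the bottom parses a small inline HTML snippet standing in for a saved file, so it runs without network access:

```python
# Sketch of the in-class workflow: download a page, save the raw HTML,
# parse it with BeautifulSoup, and write the extracted records to CSV.
import requests
import pandas as pd
from bs4 import BeautifulSoup

def download_page(url, path):
    """Fetch a page and save the raw HTML to a local file."""
    response = requests.get(url)
    response.raise_for_status()
    with open(path, "w", encoding="utf-8") as f:
        f.write(response.text)

def parse_page(html):
    """Extract section headings from the HTML (placeholder logic --
    swap in the tags/attributes that match your target page)."""
    soup = BeautifulSoup(html, "html.parser")
    return [{"heading": h.get_text(strip=True)} for h in soup.find_all("h2")]

# Small inline snippet standing in for a saved HTML file:
sample_html = "<html><body><h2>History</h2><h2>Usage</h2></body></html>"
records = parse_page(sample_html)

# Writing with pandas uses a comma delimiter by default.
df = pd.DataFrame(records)
df.to_csv("wikipedia_records.csv", index=False)
print(df)
```

In real use you would call `download_page()` first and then read the saved file back before parsing; separating the download from the parsing means you only hit the server once while you iterate on your extraction code.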

On Your Own - "Publication" Data

Once you have completed the "Wikipedia" dataset, modify your scraper to practice on a version of the publication data set that I have used in previous versions of this course. Download the Exercise #2 Starter Code from above and extract the data from this URL:
https://www.lri.fr/~isenberg/VA/vispubdata/

Make sure that all data fields are meaningfully separated and that the data are ready for analysis.
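One thing to check when judging whether fields are "meaningfully separated": values in publication data (titles, author lists) often contain commas themselves. When written with pandas, such fields are quoted automatically, so a round trip through CSV preserves them. The titles below are made up purely for illustration:

```python
# Sanity check: a field containing a comma survives a CSV round trip
# because pandas quotes it on write.
import pandas as pd

df = pd.DataFrame({
    "title": ["A Study of Graphs", "Maps, Trees, and More"],  # second title contains a comma
    "year": [2019, 2020],
})
df.to_csv("publications.csv", index=False)

roundtrip = pd.read_csv("publications.csv")
assert roundtrip.shape == (2, 2)  # the embedded comma did not split the field
print(roundtrip)
```

If your scraper builds rows by joining strings with commas by hand instead of going through pandas or the csv module, this is exactly the case that breaks.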

Assignment

Head over to the AssignmentCollection page to see how scraping might be part of your next assignment.