Tutorial 1 -- Building a Web Scraper

In this tutorial you build a basic R web scraper to download and process data that you will use to help solve the challenge over the next few weeks. We will build part of the scraper together in class, and you will complete the second part on your own.

You should submit the completed assignment to us before 23:00 on Wednesday, September 21 (details below).

Getting Started

Files

In-Class Practice - "Wikipedia" Data

Your first task is to scrape and parse a Wikipedia page. To get you started, we have provided an initial script that can pull data from the pages and save it to a CSV file. Download the Tutorial 1 Example Script above. Then we will go over how to:

  • Modify the R script and use it to download the Wikipedia pages and save them as HTML files on your local machine.
  • Load and parse the HTML files and use rvest to extract the records.
  • Save the data to a CSV file, using a comma as a delimiter.
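The steps above can be sketched roughly as follows. This is a minimal outline, not the provided example script: the URL, file names, and the `h2` selector are placeholders that you will replace with the actual pages and elements we target in class.

```r
# Sketch of the in-class workflow. Assumes the rvest package is installed;
# the URL, file names, and CSS selector below are illustrative placeholders.
library(rvest)

url <- "https://en.wikipedia.org/wiki/R_(programming_language)"

# Step 1: download the page and save the raw HTML to your local machine
download.file(url, destfile = "page.html", quiet = TRUE)

# Step 2: load and parse the saved HTML, then extract records with rvest
page <- read_html("page.html")
headings <- html_text2(html_elements(page, "h2"))

# Step 3: save the extracted data to a comma-delimited CSV file
write.csv(data.frame(heading = headings), "output.csv", row.names = FALSE)
```

Saving the raw HTML first (rather than parsing the live page directly) means you only hit the server once and can re-run your parsing code as often as you like.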

On Your Own - "Publication" Data

Once you have completed the "Wikipedia" dataset, you should then modify your scraper to work with the "publication" data that we will be using throughout the course. Download the Assignment #1 Starter Code from above and extract the data from this URL:
https://www.lri.fr/~isenberg/VA/vispubdata/

Make sure that all data fields are meaningfully separated so that we can proceed with analyzing the data in the following classes.
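As a rough starting point, a table-oriented extraction might look like the sketch below. This assumes the publication records are exposed in an HTML table on the page, which you will need to verify; the selector and output file name are placeholders, and you should adapt them to the page's actual structure.

```r
# Minimal sketch for the publication data. Assumes (to be verified!)
# that the page presents its records in an HTML table.
library(rvest)

url <- "https://www.lri.fr/~isenberg/VA/vispubdata/"

# Download once, then parse the local copy
download.file(url, destfile = "vispubdata.html", quiet = TRUE)
page <- read_html("vispubdata.html")

# Pull the first table on the page into a data frame, if one exists
tables <- html_elements(page, "table")
if (length(tables) > 0) {
  pubs <- html_table(tables[[1]])
  # Comma-delimited output, one row per publication record
  write.csv(pubs, "publications.csv", row.names = FALSE)
}
```

If the data is not in a table, you will instead need to select the individual elements (titles, authors, years, and so on) and assemble them into a data frame yourself, keeping each field in its own column so the data stays meaningfully separated.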

Submitting The Assignment

  • WHAT - You should submit a single ZIP file called "{YOUR_LASTNAME}-Assignment1.zip" via email. This should contain:
    • One CSV file - containing all of the publication data you extracted.
    • A file containing the R script you used to download, parse, and save the data. This code must be clearly commented.
  • WHERE - You should email the file to petra.isenberg@inria.fr
  • WHEN - Remember that Assignment 1 is due before 23:00 on Wednesday, September 21.