Tutorial and Assignment 1 -- Building a Web Scraper

In this tutorial you will build a basic R web scraper to download and process data that you will use to help solve the challenge over the next few weeks. We will build the first part of the scraper together in class, and you will complete the second part on your own.

You should submit the completed assignment to us before 23:00 on Monday, September 21 (details below).

Getting Started

Files

In-Class Assignment - "Movement" Data

Your first task is to scrape and parse the "movement" dataset:
https://www.lri.fr/~isenberg/VA/movement/

To get you started, we have provided an initial script that can pull data from the pages and save it to a CSV file (see Example script above).

We will go over how to:

  • Modify the R script and use it to download pages containing the movement records from the website and save them as HTML files to your local machine. (Hint: There are multiple pages of records.)
  • Load and parse the HTML files and use rvest to extract the records.
  • Save data to a CSV file.
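
The three steps above can be sketched roughly as follows. This is only a minimal outline, not the provided example script: the function names download_pages, parse_page, and save_records are hypothetical, and the code assumes that each page holds its records in the first HTML table. Check the real page structure in your browser and adjust the selector if that assumption does not hold.

```r
library(rvest)  # also re-exports read_html() from xml2

# Download each page and save it as a local HTML file.
# `page_urls` is a character vector of page URLs. How the site paginates
# is NOT known here; find the page URLs yourself and pass them in.
download_pages <- function(page_urls, out_dir = "pages") {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  paths <- file.path(out_dir, sprintf("page_%03d.html", seq_along(page_urls)))
  for (i in seq_along(page_urls)) {
    download.file(page_urls[i], destfile = paths[i], mode = "wb", quiet = TRUE)
    Sys.sleep(1)  # pause between requests to be polite to the server
  }
  paths
}

# Parse one saved page and return its records as a data frame.
# ASSUMPTION: the records sit in the first <table> on the page.
parse_page <- function(path) {
  page <- read_html(path)
  tables <- html_table(html_elements(page, "table"))
  if (length(tables) == 0) return(NULL)  # no table found on this page
  as.data.frame(tables[[1]])
}

# Combine the records from all pages and write a single CSV file.
save_records <- function(paths, csv_path) {
  records <- do.call(rbind, lapply(paths, parse_page))
  write.csv(records, csv_path, row.names = FALSE)
  invisible(records)
}

# Example usage (left commented out so nothing is downloaded by accident):
# page_urls <- c("https://www.lri.fr/~isenberg/VA/movement/")  # add the other pages
# paths <- download_pages(page_urls)
# save_records(paths, "movement.csv")
```

Saving the HTML files first and parsing them in a separate step means you can rerun and debug the parsing code without downloading the pages from the server again.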

On Your Own - "Communication" Data

Once you have completed the "movement" dataset, you should then modify your scraper to work with the "communication" data:
https://www.lri.fr/~isenberg/VA/communication/
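
One way to reuse the same scraper for both datasets is to collect everything that differs between them in one place. The sketch below is only a suggestion: dataset_config and the directory and file names it returns are hypothetical. It also assumes nothing about the layout of the "communication" pages, which may have different columns from the "movement" pages, so you will probably still need to adapt your parsing code.

```r
# Base location of the course datasets (taken from the URLs in this assignment).
base_url <- "https://www.lri.fr/~isenberg/VA"

# Build the settings for one dataset so the same scraper code can be reused.
dataset_config <- function(name) {
  list(
    url      = sprintf("%s/%s/", base_url, name),  # start page of the dataset
    html_dir = file.path("pages", name),           # where downloaded pages are kept
    csv_path = paste0(name, ".csv")                # output file for the parsed records
  )
}

for (name in c("movement", "communication")) {
  cfg <- dataset_config(name)
  message("Scraping ", cfg$url, " into ", cfg$csv_path)
  # Call your own download, parse, and save steps here, using cfg.
}
```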

Submitting The Assignment

  • WHAT - You should submit a single ZIP file called "{YOUR_NAME}-Assignment1.zip" via email. This should contain:
    • Two CSV files - one containing all of the communication records and the other containing all of the movement data.
    • A file containing the R scripts you used to download, parse, and save the data. This code must be clearly commented.
  • WHERE - You should email the file to petra.isenberg@inria.fr
  • WHEN - Remember that Assignment 1 is due before 23:00 on Monday, February 29.