Tutorial 2 - Data Cleaning

In this tutorial you will use Google Refine to parse and clean the "loyalty card" and "credit card" data you scraped in the first assignment. We will perform some cleaning together in class, and you will then perform some additional data cleaning on your own.

You should submit the completed assignment to us before 23:00 on Monday, October 6th (details below).

Getting Started

Install Google Refine. You can download and install it from here: http://openrefine.org/download.html
You should install the version called "Google Refine 2.5 - Stable version" at the top of the page. (In case you're confused about the name - the product was originally created by Google, but in the past year it has been open-sourced - hence the new name "Open Refine". However, we will still be using the last Google-branded version, since it is more reliable than the current open-source builds.)

The documentation for Google Refine / Open Refine is available here.

There is also a set of nice introductory tutorials available on YouTube: Part 1, Part 2, Part 3


Movies-Data.csv - A file containing sample data we will use in the tutorial.

You will also use your two CSV files from Assignment 1.


In the assignment you should clean the two data files you produced last week.

Load the files into Google Refine and process them to remove any formatting problems, including:

  • Remove any extra whitespace and miscellaneous characters (quotes, square braces, etc.) from your columns.
  • Make sure all numbers and times are formatted correctly.
  • Identify and correct any duplicate or misspelled names and locations.
  • (Keep a typed list of the errors you fix, and submit it along with the cleaned data files.)
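
The assignment itself is done in Google Refine, but it can help to see what the first and third clean-up steps amount to in plain code. The following is a minimal Python sketch, not Refine's implementation: `clean_cell`, `fingerprint`, and `group_variants` are hypothetical helper names, the sample values are made up, and `fingerprint` is only a simplified take on the key-collision idea used for clustering similar values.

```python
import re
from collections import defaultdict


def clean_cell(value: str) -> str:
    """Trim whitespace, quotes, and square brackets from both ends of a cell,
    then collapse runs of internal whitespace into a single space."""
    trimmed = value.strip(" \t\r\n\"'[]")
    return re.sub(r"\s+", " ", trimmed)


def fingerprint(value: str) -> str:
    """Build a simplified comparison key: lowercase, drop punctuation,
    then sort and de-duplicate the words (an assumption, not Refine's exact rules)."""
    without_punctuation = re.sub(r"[^\w\s]", "", value.strip().lower())
    return " ".join(sorted(set(without_punctuation.split())))


def group_variants(values):
    """Return only those keys that are shared by more than one distinct spelling."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}
```

Each group returned by `group_variants` is a candidate duplicate that you would still review by hand before merging, and each merge belongs on your typed list of corrected errors.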

The final files should contain these five columns:
timestamp, location, price, FirstName, LastName
(Timestamp for the credit card data should include the date and time).
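
Before exporting, you may want to confirm that a file has exactly the expected header. This is a small Python sketch under the assumption that the CSV has a single header row; `missing_and_extra_columns` is a hypothetical helper, not part of Refine.

```python
import csv
import io

# Column names as listed in the assignment text.
EXPECTED_COLUMNS = ["timestamp", "location", "price", "FirstName", "LastName"]


def missing_and_extra_columns(csv_file):
    """Read the header row from an open CSV file and report which expected
    columns are missing and which unexpected columns are present."""
    header = [name.strip() for name in next(csv.reader(csv_file), [])]
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    extra = [name for name in header if name not in EXPECTED_COLUMNS]
    return missing, extra


# Example with an in-memory file; for a real file use open(path, newline="").
sample = io.StringIO("timestamp,location,price,name\n")
missing, extra = missing_and_extra_columns(sample)
print("missing:", missing, "extra:", extra)
```

The check is case-sensitive on purpose, since the assignment spells the names `FirstName` and `LastName` with capitals.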

When you are done with each file, export it as a new CSV.

Once you've finished, you should also extract JSON scripts containing all of the operations you performed on each of the files. (Select Extract at the top of the Undo/Redo tab. Then copy and paste the JSON script into a new file in your text editor.)
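
To double-check a saved script before submitting it, you can list the operations it contains. The sketch below assumes that the extracted script is a JSON array of objects, each with an `op` and a `description` field; verify this against your own file, since the exact layout is an assumption here. `summarize_operations` is a hypothetical helper name.

```python
import json


def summarize_operations(script_text: str) -> list[str]:
    """Parse an extracted operation script and return one summary line per step.
    Assumes a JSON array of objects with "op" and "description" keys."""
    operations = json.loads(script_text)
    return [
        f'{step.get("op", "?")}: {step.get("description", "")}'
        for step in operations
    ]


# Made-up example script with a single operation, for illustration only.
example = '[{"op": "core/text-transform", "description": "Trim whitespace in column location"}]'
for line in summarize_operations(example):
    print(line)
```

If `json.loads` raises an error, the file was probably not copied completely from the Extract dialog.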

Submitting the Assignment

WHAT - You should submit a single ZIP file called "YOUR_NAME-Assignment2.zip" via email. It should contain:

  1. A text file named "YOUR_NAME-Assignment2.txt" containing (a) your name (b) a list containing a short description for each of the errors you found and corrected in your data.
  2. A CSV file named "YOUR_NAME-Assignment2-loyalty.csv" containing the cleaned loyalty card data.
  3. A JSON file named "YOUR_NAME-Assignment2-loyalty-script.json" containing the operations you used to clean the loyalty card data.
  4. A CSV file named "YOUR_NAME-Assignment2-credit.csv" containing the cleaned credit card data.
  5. A JSON file named "YOUR_NAME-Assignment2-credit-script.json" containing the operations you used to clean the credit card data.
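
If you prefer to build the archive with a script rather than by hand, the naming scheme above can be sketched in Python as follows. This is only one possible approach: `expected_files` and `build_submission` are hypothetical helpers, and the sketch assumes all five files sit in one folder.

```python
import zipfile
from pathlib import Path


def expected_files(your_name: str) -> list[str]:
    """Return the five file names the submission must contain."""
    prefix = f"{your_name}-Assignment2"
    return [
        f"{prefix}.txt",
        f"{prefix}-loyalty.csv",
        f"{prefix}-loyalty-script.json",
        f"{prefix}-credit.csv",
        f"{prefix}-credit-script.json",
    ]


def build_submission(your_name: str, folder: Path) -> Path:
    """Check that every required file exists in `folder`, then zip them."""
    names = expected_files(your_name)
    missing = [name for name in names if not (folder / name).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing files: {', '.join(missing)}")
    archive = folder / f"{your_name}-Assignment2.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            # arcname keeps the files at the top level of the archive.
            zf.write(folder / name, arcname=name)
    return archive
```

For example, `build_submission("YOUR_NAME", Path("."))` would create `YOUR_NAME-Assignment2.zip` in the current folder, or raise an error naming whichever files are still missing.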

WHERE - You should email the file to wesley.willett@inria.fr with the subject VA-Assignment2.

WHEN - Remember that Assignment 2 is due before 23:00 on Monday, October 6th.