Tutorial and Assignment 2 - Data Cleaning

In this tutorial you will use Google Refine to clean a dataset. We will perform some cleaning together in class.

Getting Started


Install OpenRefine (formerly Google Refine). You can download and install it from here:

 http://openrefine.org/download.html

You should install the version called "OpenRefine 2.6-rc2 Release Candidate 2" at the top of the page.

The documentation for Google Refine / Open Refine is available here.

There is also a set of nice introductory tutorials available on YouTube: Part 1, Part 2, Part 3.

Here are helpful pointers to the OpenRefine Expression Language.

Files


universityData.csv - A file containing sample data we will use in the tutorial.

Assignment


For your assignment you will be working on a slightly expanded version of the csv file you created in the last assignment. This file contains a few more columns:

  • Deduped.author.name: a column that lists author names that have been manually cleaned using Jigsaw
  • OCR.Title: a title extracted from the pdf of each paper using a data extraction tool called grobid
  • OCR.Authors: the authors extracted from the paper PDFs using grobid. It differs from the authors and deduped authors columns in that it includes the full first names of the authors. It is ordered like the other two columns, as lastname, firstname.

The last two columns are not actually important for this assignment.

For the assignment load the following data file into OpenRefine:

Use the following settings when creating your project (otherwise your work may not be graded correctly):

Your task

Create two new csv files:

File 1 should contain data in this form:

Paper.DOI       Deduped Author Name

That means that if a paper has three authors, such as 10.0001.0001, Isenberg,P.;Dragicevic,P.;Fekete,J.D., the file should contain one row per author, like this:

10.0001.0001    Isenberg,P.
10.0001.0001    Dragicevic,P.
10.0001.0001    Fekete,J.D.
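
The row layout above can be sketched in a few lines of Python. This is only an illustration of the expected result, not the required method: the assignment itself should be done in OpenRefine, where the same effect is typically obtained with "Split multi-valued cells" (separator ";") followed by "Fill down" on the DOI column. The function name split_authors and the in-memory sample row are hypothetical, and the sketch assumes that authors are separated by semicolons as in the example.

```python
import csv
import io


def split_authors(rows):
    """Yield one (doi, author) pair for each author of each (doi, authors) row.

    Assumes the authors in a cell are separated by semicolons.
    """
    for doi, authors in rows:
        for author in authors.split(";"):
            author = author.strip()
            if author:  # skip empty pieces, e.g. from a trailing semicolon
                yield doi, author


# Hypothetical input row, modeled on the example in the text.
rows = [("10.0001.0001", "Isenberg,P.;Dragicevic,P.;Fekete,J.D.")]
pairs = list(split_authors(rows))

# Write the result as CSV. Author names contain commas, so the csv
# module wraps them in double quotes automatically.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Paper.DOI", "Deduped Author Name"])
writer.writerows(pairs)
print(buffer.getvalue())
```

A sketch like this can also serve as a sanity check: the number of rows in File 1 should equal the total number of author entries across all papers.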

File 2 should contain:

Deduped Author Name    Affiliation

File 2 should not contain rows with empty data. On File 2, also perform at least three different types of cleaning operations, either ones we practiced in class or any others you may want to apply. These cleaning operations must be performed on multiple cells at once (cleaning a single cell does not count).
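
As a rough illustration of what bulk cleaning and dropping empty rows mean here, the Python sketch below normalizes whitespace in every cell and then discards rows where either cell is empty. It is only a sketch under stated assumptions: the function names clean_cell and clean_rows and the sample values are made up, "empty data" is read as "either the author or the affiliation is blank", and whitespace normalization is just one possible cleaning operation (it roughly corresponds to OpenRefine's "Trim leading and trailing whitespace" and "Collapse consecutive whitespace" transforms). The actual work should still be done and recorded in OpenRefine.

```python
def clean_cell(value):
    """Trim leading/trailing whitespace and collapse internal runs of whitespace."""
    return " ".join(value.split())


def clean_rows(rows):
    """Clean every cell, then keep only rows where both cells are non-empty.

    Assumption: a row counts as "empty data" if either the author
    name or the affiliation is blank after cleaning.
    """
    cleaned = []
    for author, affiliation in rows:
        author = clean_cell(author)
        affiliation = clean_cell(affiliation)
        if author and affiliation:
            cleaned.append((author, affiliation))
    return cleaned


# Made-up sample rows for illustration only.
sample = [
    ("Author,A.", "  Example   University "),
    ("Author,B.", ""),
    ("   ", "Example Institute"),
]
result = clean_rows(sample)
print(result)
```

The point of the sketch is that each operation is applied to a whole column at once, which is what distinguishes it from fixing individual cells by hand.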

Submitting the Assignment


WHAT - You should submit a single ZIP file called "YOUR_LASTNAME-Assignment2.zip" via email. It should contain:

  1. Two CSV files named "YOUR_LAST_NAME-Assignment2-File-#.csv" containing the cleaned data.
  2. Two JSON files named "YOUR_LAST_NAME-Assignment2-File-#.json" containing the operations you used to clean the data.
  3. A txt file called "YOUR_LAST_NAME-explanation.txt" explaining the cleaning operations you performed.

WHERE - You should email the file to petra.isenberg@inria.fr with the subject VA-Assignment2.

WHEN - Remember that Assignment 2 is due before 23:00 on Wednesday, September 28th.