Assignment 1 - Data Cleaning using OpenRefine

At this point you should have worked through, or at least read, the OpenRefine tutorial page. That page also tells you which version of OpenRefine to install.

For the assignment, download the following dataset: Google Appstore Apps and Ratings

Load the dataset into OpenRefine using the default settings.

Tasks:

  • Task 1: Analyse the dataset and report the errors you observe.
  • Task 2: Convert the "Size" column to something analyzable numerically. Describe how you did so.
  • Task 3: How could you modify the dataset so that you can easily see a histogram of which genre is the most common?
  • Task 4: The "current version" format is inconsistent. What multi-edit steps could be performed to clean up this column without having to edit individual entries?
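
To make "analyzable numerically" in Task 2 concrete, here is a minimal Python sketch of the kind of conversion meant. It is not an OpenRefine recipe and not the expected solution. It assumes, hypothetically, that sizes look like "19M" or "201k" and that some cells hold free text such as "Varies with device"; check what the real column contains before relying on any of this. The function name size_to_bytes and the decimal multipliers are illustrative choices.

```python
import re
from typing import Optional

# Assumed (hypothetical) format: a number followed by an optional k/M/G suffix.
_SIZE_PATTERN = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*([kKmMgG]?)\s*$")

# Decimal multipliers are an assumption; the dataset does not document its units.
_MULTIPLIERS = {"": 1, "k": 1_000, "m": 1_000_000, "g": 1_000_000_000}


def size_to_bytes(raw: str) -> Optional[float]:
    """Convert a size string such as '19M' to a number of bytes.

    Returns None for anything that does not match the assumed format,
    so non-numeric cells stay visibly empty instead of becoming 0.
    """
    match = _SIZE_PATTERN.match(raw)
    if match is None:
        return None
    number, suffix = match.groups()
    return float(number) * _MULTIPLIERS[suffix.lower()]


if __name__ == "__main__":
    for cell in ["19M", "201k", "Varies with device"]:
        print(repr(cell), "->", size_to_bytes(cell))
```

Returning None rather than 0 for unparseable cells keeps them out of any later average or histogram. The same decision has to be made, and described, when you do the conversion in OpenRefine.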

What to submit

  • Submit a .zip archive (no .rar, .tar, ...) called YOURLASTNAME-assignment-1.zip. In this zip archive put the following 6 files:
    • Create a file called YOURLASTNAME-errors.txt in which you list the types of errors you observed. Note that you do not need to list each individual error - if you can group them, do so. For example, instead of "App 1 has no timestamp, App 2 has no timestamp" report "Several apps have no timestamp".
    • To work on Task 2, start OpenRefine again and load the dataset one more time. Perform the required operations. Next, in OpenRefine go to "Undo/Redo" on the left and click "Select All". Copy the selected text on the right into a file called YOURLASTNAME-operations-task2.json. Also add a file YOURLASTNAME-operations-task2.txt in which you describe, in a couple of bullet points, the operations that you performed. I strongly suggest that you double-check the .json file you submit by applying it to a fresh copy of the dataset and confirming that you can retrace your own steps. Every year students lose points because they submit unreadable .json files.
    • To work on Task 3, again start with a clean copy of OpenRefine. Perform the required steps and export them the same way as for Task 2 above. Files to be created are YOURLASTNAME-operations-task3.json for the steps you took and YOURLASTNAME-operations-task3.txt.
    • For Task 4 create a file called YOURLASTNAME-operations-task4.txt and describe the steps that could be performed. No .json file needed this time. Again, group types of operations into a meaningful category instead of listing each individual step.
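
One quick way to catch an unreadable .json file before you submit is to check that it parses. The sketch below is a suggestion, not part of the assignment. It assumes the copied operation history is a JSON array of objects that each carry an "op" field, which is how OpenRefine's extracted histories usually look; adjust the check if your version differs. The function name check_operations_file is hypothetical.

```python
import json
import sys
from pathlib import Path


def check_operations_file(path: str) -> int:
    """Return the number of operations in the file.

    Raises ValueError if the file is not valid JSON or does not look like
    an operation history (assumed here: a list of objects with an "op" key).
    """
    text = Path(path).read_text(encoding="utf-8")
    try:
        data = json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(f"{path} is not valid JSON: {err}") from err

    if not isinstance(data, list):
        raise ValueError(f"{path}: expected a JSON array of operations")

    for index, step in enumerate(data):
        if not isinstance(step, dict) or "op" not in step:
            raise ValueError(f"{path}: entry {index} has no 'op' field")

    return len(data)


if __name__ == "__main__":
    # Usage: python check_ops.py YOURLASTNAME-operations-task2.json
    for filename in sys.argv[1:]:
        print(filename, "contains", check_operations_file(filename), "operations")
```

A file that passes this check is only known to be well-formed. Re-applying it to a fresh copy of the dataset in OpenRefine is still the real test.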

How & when to submit

WHERE - You should email the file to petra.isenberg@inria.fr and Anastasia.Bezerianos@lri.fr with the subject InfoVis-Assignment1.

WHEN - Assignment 1 is due before 23:00 on November 28th.