Assignment-Cleaning

Assignment

For your project, you should now have data to start analyzing, but you might first have to tidy and clean it. You might also have to reduce it to a more manageable size.

Your tasks

Your first task is to transform your data into something ready for analysis, that is, something that looks like one or more tidy data tables. This task might involve the following steps:

If your data is huge (more than a few hundred MB), reduce its size (e.g. filter out some years, remove unnecessary columns and empty rows, etc.). OpenRefine might not be able to help you with this task, and you might have to do it in R, Python, or another language.
If you found unstructured data (tweets, media files, ...), extract the data from it that you might need for analysis. This could require running a sentiment analysis or extracting word counts, metadata, etc.
If you already have a data table, make sure it is in a tidy format (see the last section).
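
As an illustration of the first and third steps, here is a minimal Python sketch that uses only the standard library. The column names (country, year, notes, cases), the cut-off year, and the helper names reduce_rows and wide_to_long are made up for this example; your own data will need different ones. The rows are processed one at a time, so the same idea should also work on files that are too large to open in OpenRefine.

```python
import csv
import io


def reduce_rows(reader, keep_columns, min_year, year_column="year"):
    """Yield rows from min_year onward, restricted to keep_columns."""
    for row in reader:
        # Skip rows in which every cell is empty (e.g. a line like ",,,").
        if not any(isinstance(v, str) and v.strip() for v in row.values()):
            continue
        try:
            year = int(row[year_column])
        except (KeyError, TypeError, ValueError):
            continue  # no usable year: this sketch simply drops the row
        if year >= min_year:
            yield {column: row[column] for column in keep_columns}


def wide_to_long(rows, id_column, value_columns, var_name, value_name):
    """Turn one column per variable value into one row per observation."""
    for row in rows:
        for column in value_columns:
            yield {id_column: row[id_column], var_name: column, value_name: row[column]}


# Made-up sample standing in for a large CSV file.
raw = io.StringIO(
    "country,year,notes,cases\n"
    "FR,2009,x,10\n"
    ",,,\n"
    "FR,2015,y,12\n"
    "DE,2016,z,7\n"
)
reduced = list(reduce_rows(csv.DictReader(raw), ["country", "year", "cases"], 2010))
print(reduced)

# A "wide" table with one column per year, reshaped into a tidy "long" table.
wide = [{"country": "FR", "2015": "12", "2016": "14"}]
long_rows = list(wide_to_long(wide, "country", ["2015", "2016"], "year", "cases"))
print(long_rows)
```

For a real file, you could pass csv.DictReader(open(path, newline="")) instead of the in-memory sample and write the result with csv.DictWriter, so that the whole file never has to fit in memory.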

Your second task is to do some data cleaning. Take your transformed data and load it into OpenRefine. Here, inspect the data as we learned in the tutorial and correct data errors. Keep track of the types of possible data errors you found and what changes you made to the dataset. Also save your operations in a .json file so you can reuse them in case you need to change your dataset again.

FAQ

My data does not or cannot contain errors because of how I obtained the data. What should I do?

If your dataset does not or cannot contain errors, then for Task 2 look instead at the distributions of the data variables and at potential outliers, and check whether the data contains what you need for analysis. In your report, describe what you see and include some pictures of the distributions. (Don't compute descriptive statistics for your data yet; we will get to that later.)
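
If you want a quick look before producing proper pictures, the following Python sketch bins one variable into a text histogram and flags potential outliers. The sample values and the helper names histogram and iqr_outliers are invented for illustration, and the 1.5 x IQR rule used here is only one common rule of thumb, not something this assignment prescribes.

```python
import statistics


def histogram(values, bins=5):
    """Return (bin_start, bin_end, count) tuples for equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / bins or 1  # avoid a zero width if all values are equal
    counts = [0] * bins
    for value in values:
        # The maximum value would land in bin number `bins`; clamp it into the last bin.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return [(low + i * width, low + (i + 1) * width, counts[i]) for i in range(bins)]


def iqr_outliers(values, k=1.5):
    """Return values more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if v < q1 - k * spread or v > q3 + k * spread]


# Made-up example variable: ages with one suspicious entry.
ages = [21, 22, 23, 23, 24, 25, 26, 27, 28, 95]

for start, end, count in histogram(ages):
    print(f"{start:6.1f} to {end:6.1f} | {'#' * count}")

print("Potential outliers:", iqr_outliers(ages))
```

A flagged value is not necessarily an error; it is a value worth inspecting and describing in your report.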

My dataset contains unstructured data and I cannot load it into a program for cleaning

If you collected unstructured data such as tweets, extract metadata for each item and turn it into a tabular format. For example, tweets contain metadata on users, locations, likes, ...
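
A minimal Python sketch of this extraction is shown below. It assumes that each tweet is stored as one JSON object per line, with hypothetical fields id, text, likes, and user (containing name and location); the export you actually collected may use different field names, so adapt tweet_to_row accordingly.

```python
import csv
import io
import json


def tweet_to_row(tweet):
    """Flatten one tweet (a dict) into a single table row."""
    user = tweet.get("user") or {}
    text = tweet.get("text") or ""
    return {
        "tweet_id": tweet.get("id"),
        "user_name": user.get("name"),
        "location": user.get("location"),
        "likes": tweet.get("likes"),  # stays empty if missing, rather than guessed
        "word_count": len(text.split()),
    }


# Made-up sample standing in for a file with one JSON object per line.
raw_lines = [
    '{"id": 1, "text": "Cleaning data takes time", "likes": 3, '
    '"user": {"name": "ana", "location": "Paris"}}',
    '{"id": 2, "text": "Tidy tables help", "user": {"name": "ben"}}',
]
rows = [tweet_to_row(json.loads(line)) for line in raw_lines]

# Write the rows as CSV; an in-memory buffer is used here instead of a file.
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(output.getvalue())
```

The resulting CSV has one row per tweet and one column per extracted variable, so it can be loaded into OpenRefine for the cleaning task.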

I collected more than one dataset that I want to use. What should I do?

In the report you submit, describe the results for just one of the datasets. (However, it may be in your best interest to clean all datasets you need for later stages of the project.)

Submitting the Assignment

WHAT - You should submit a report about your data transformations called "YOUR_LASTNAMEs-Assignment4.pdf" via email. It should contain the following content:

Your names, your topic, and your focus research question
Roughly one page explaining the types of data transformations you had to perform and what the dataset looked like before and after. Show a table with your final data structure (no need to print out all observations, but describe the general structure).
Roughly one page in which you detail the types of errors in the data you uncovered and how you fixed them. Feel free to add some screenshots. If you find no errors in the data then instead show some distributions of the data values for the different variables in your data.

WHERE - You should email the file to petra.isenberg@inria.fr with the subject VA-Assignment4.

WHEN - Remember that Assignment 4 is due before 23:00 on October 15th.