Data Analysis Challenge

Throughout this course we will be working with a variety of datasets on a larger data analysis task. This task (as explained in the first tutorial) centers around scientific publications of the IEEE Visualization conference. The full dataset is available here: http://vispubdata.org.

Over the course of the class we will tackle a number of challenges related to this dataset. The first challenge concerns questions of gender equality. Gender equality is an important issue in science as well as in most businesses and enterprises. To analyze questions relating to gender equality, we need to be able to identify genders from the author names of scientific publications.

Your Task

  • Form a group of up to 4 students (we will have done this in class; if you weren't there, search for a team on Slack)
  • Download the data linked to from http://vispubdata.org as a csv file
  • Create an R project to extract author names. You can use the following code to extract first names from the dataset:
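# Read the vispubdata export; keep author strings as plain character data
data <- read.csv("IEEE VIS papers 1990-2016 - Main dataset.csv", stringsAsFactors = FALSE)
authors <- data$Author.Names

firstnames <- c()

# Each entry lists all authors of one paper, separated by ';'
for (paper_authors in authors){
  ns <- strsplit(paper_authors, ';', fixed = TRUE)
  for (name in ns[[1]]){
    # Trim stray whitespace, then treat the first space-separated token as the first name
    firstname <- strsplit(trimws(name), ' ', fixed = TRUE)[[1]][1]
    firstnames <- c(firstnames, firstname)
  }
}

# Normalize case so that "Anna" and "anna" count as one name
firstnameslower <- tolower(firstnames)
firstnamesunique <- unique(firstnameslower)

write.csv(firstnamesunique, "IEEEVIS-authors-firstnames.csv", row.names = FALSE)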
  • Scroll through this dataset in any tool you like (Excel, Google Spreadsheets, a text editor, RStudio, ...) and have a look at it to get a feel for the types of names included in the data.
  • As a team, think about a way to meaningfully extend this dataset to help you answer questions about gender based on people's first names. Which data column(s) do you need to include? What values should this/these column(s) contain?
  • Each team member should now go out and find one dataset online that will give you a mapping (or likely mapping) between a first name and a gender. You are welcome to use the dataset you created above to query APIs or use code from the webscraper we built in class to scrape data from webpages (as long as you are not doing anything illegal). You are not allowed to pay for data.

I will give bonus points if you actively try to cover names from different cultures (French, English, Chinese, Indian, Iranian, etc.) and if you attempt to create a file covering many names, or at least many of the names from the set you extracted above (e.g., if you decide to query an existing corpus using an API, it is OK to use only the names you need from the vispubdata corpus). You can help out your fellow team members with advice, or sit down with them to search for or extract data, but each team member should be primarily responsible for just one dataset extraction.
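
One possible shape for the extended dataset, as a minimal sketch only: the column names `firstname`, `gender`, and `source` and all toy values are assumptions for illustration, not a required format. It attaches a small name-to-gender lookup table to the extracted first names with base R's `merge()`, keeping names without a match as `NA`.

```r
# Hypothetical stand-in for the first names extracted from the vispubdata file
firstnames <- data.frame(firstname = c("anna", "wei", "john", "j."),
                         stringsAsFactors = FALSE)

# Hypothetical lookup table found online; values are made up for illustration
lookup <- data.frame(firstname = c("anna", "john", "maria"),
                     gender    = c("female", "male", "female"),
                     source    = "example-census",
                     stringsAsFactors = FALSE)

# Left join: keep every extracted name, fill gender/source where a match exists
extended <- merge(firstnames, lookup, by = "firstname", all.x = TRUE)

print(extended)
```

Keeping unmatched names in the table (rather than dropping them) makes it easy to see later which names are still missing a gender entry.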

  • Combine the datasets from all team members in R to create one big data table using the format of the file you chose (some help here)
  • Remove duplicates (e.g. using https://rdrr.io/cran/dplyr/man/distinct.html)
  • Compare your first names to the IEEE VIS first names to see how many of them you found
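
The three steps above could look roughly like this in base R. This is a sketch under assumptions: the data frames `team_a` and `team_b`, the vector `vis_names`, the column names, and all values are hypothetical stand-ins for your own files.

```r
# Hypothetical datasets from two team members, already in the shared format
team_a <- data.frame(firstname = c("Anna", "john"),
                     gender    = c("female", "male"),
                     stringsAsFactors = FALSE)
team_b <- data.frame(firstname = c("anna", "wei"),
                     gender    = c("female", "male"),
                     stringsAsFactors = FALSE)

# Stack the tables; rbind() requires identical column names
combined <- rbind(team_a, team_b)

# Lowercase names so they match the lowercased IEEE VIS first names
combined$firstname <- tolower(combined$firstname)

# Drop repeated names, keeping the first occurrence
# (dplyr::distinct(combined, firstname, .keep_all = TRUE) does the same)
combined <- combined[!duplicated(combined$firstname), ]

# Hypothetical stand-in for the unique IEEE VIS first names
vis_names <- c("anna", "john", "wei", "petra")

# Count how many IEEE VIS first names appear in the combined table
found <- sum(vis_names %in% combined$firstname)
cat(found, "of", length(vis_names), "names found\n")
```

Note that deduplicating on the name alone keeps whichever gender entry appears first; names for which your sources disagree may deserve a closer look before you drop rows.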

Submitting the Assignment


WHAT - To complete the assignment you should:

  1. Upload each team member's data file (up to four) as well as your combined file to a public Dropbox, Google Drive, or Box account
  2. Submit a single ZIP file called "YOUR_TEAMNAME-Assignment-1.zip" via email. It should contain:
    1. A file team.csv that lists the names of your team members as well as their email addresses
    2. A report.pdf document that includes:
      1. The link to the data files
      2. A description of which team member contributed which dataset
      3. For each dataset, a short description (max. 1 paragraph) of a) where it was found and how it was collected, b) when it was downloaded/extracted/scraped, c) how it was processed (if at all), d) what kind of file it is (e.g. a student name list, a census dataset, etc.)
      4. At most one paragraph describing the final dataset, its columns, and their data types. Also add a comparison between the IEEE VIS names and the names in your dataset. What percentage of the names did you cover? Which types of names are not included and why?
    3. OPTIONAL: Any supporting material you wish to add, such as code written for a scraper or for querying an API.
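
For the comparison requested in the report, a minimal sketch in base R. The vectors `vis_names` and `covered_names` are made-up stand-ins, and the check for initials assumes that some extracted "first names" are really initials such as "j."; verify that against your real data.

```r
# Hypothetical stand-ins for the IEEE VIS first names and the names you cover
vis_names     <- c("anna", "john", "wei", "j.", "petra")
covered_names <- c("anna", "john", "wei", "maria")

# Share of IEEE VIS first names that appear in your dataset, in percent
coverage_pct <- 100 * mean(vis_names %in% covered_names)

# Names you still lack, to inspect by hand for patterns
not_covered <- setdiff(vis_names, covered_names)

# Flag entries that look like initials (one letter, optional trailing dot)
is_initial <- grepl("^[a-z]\\.?$", not_covered)

cat(sprintf("Coverage: %.1f%%\n", coverage_pct))
print(data.frame(name = not_covered, is_initial = is_initial))
```

Grouping the uncovered names this way is only a starting point for answering which types of names are missing and why.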

WHERE - You should email the file to petra.isenberg@inria.fr with the subject VA-Assignment-1.

WHEN - Assignment 1 is due before 23:00 on Oct 24th.