# Scraping Tutorial

* Course: Visual Analytics, CentraleSupelec
* Author: Petra Isenberg, Wesley Willett

## Step 1

We will scrape information from a Wikipedia page like this one: https://fr.wikipedia.org/wiki/Chat
Your first step should always be to familiarize yourself with the website(s) you want to scrape. Take a look at the cat page.

Look at the URL and see if you can bring up the "chien" page (that's "dog" in French).
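
The page name is the last segment of the URL, so swapping `Chat` for `Chien` should be enough. Here is a minimal sketch of building such URLs in Python. The helper name `wiki_url` is hypothetical (it is not part of the tutorial's code), and the underscore replacement assumes the usual Wikipedia convention for spaces in page names. `urllib.parse.quote` percent-encodes characters such as accented letters.

```python
from urllib.parse import quote

WIKIPEDIA_FR_ROOT = "https://fr.wikipedia.org/wiki/"

def wiki_url(page_name):
    """Return the French Wikipedia URL for a page name (hypothetical helper)."""
    # Wikipedia page URLs use underscores in place of spaces
    return WIKIPEDIA_FR_ROOT + quote(page_name.replace(" ", "_"))

print(wiki_url("Chien"))
print(wiki_url("Chat sauvage"))
```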


## Step 2

Next, we have to download the pages we want to extract data from.
Here, we want to get the HTML code of an arbitrary Wikipedia page (we'll start with cats; everyone loves them on the internet).

In [1]:
# We first need a few things

import os
from os import path
import requests
# If you get an error here, run
#   pip3 install requests
# in your console, or, if you are on Anaconda Python:
#   conda install -c anaconda requests



In [28]:
#Here are some variables we might want to define

wikipedia_fr_root = 'https://fr.wikipedia.org/wiki/' #that's the root url of the french wikipedia
directory = "scraped_wikipages"

# and we check that we have a directory to save our scraped pages to
if not path.isdir(directory):
    os.mkdir(directory)

Here we define a function that downloads a Wikipedia page, given its page name, and saves it to the folder "scraped_wikipages".

In [36]:
# Load the fr.wikipedia.org page with a given page_name and
# save it as an HTML file with that same name.

def fetch_and_save_wiki_page(page_name):

    full_url = wikipedia_fr_root + page_name

    # the name of the file to save to
    file_name = directory + "/" + page_name + ".html"

    page = requests.get(full_url, allow_redirects=True)

    if page.status_code != 200:
        print("Some error occurred loading the page. Status code: " + str(page.status_code))
    else:
        # write the raw bytes; the with-block closes the file when done
        with open(file_name, 'wb') as f:
            f.write(page.content)

    return page

Good, let's see if it works. Try saving a file by calling this:

fetch_and_save_wiki_page('chat')
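
When scraping several pages, it can help to skip pages that are already saved and to pause between requests. The following is a minimal sketch of that idea, not part of the original tutorial code. The function name `fetch_if_missing` and its `download` and `delay_seconds` parameters are hypothetical; `download` is passed in so the sketch can be tried without network access.

```python
import os
import time

def fetch_if_missing(page_name, directory, download, delay_seconds=1.0):
    """Download a page only if it is not already saved locally.

    `download` is any function that takes a page name and returns the
    page content as bytes (hypothetical parameter for this sketch).
    Returns (file_name, downloaded), where downloaded is True only
    when the page was actually fetched.
    """
    os.makedirs(directory, exist_ok=True)
    file_name = os.path.join(directory, page_name + ".html")

    if os.path.isfile(file_name):
        return file_name, False  # already on disk, nothing downloaded

    content = download(page_name)
    with open(file_name, "wb") as f:
        f.write(content)

    time.sleep(delay_seconds)  # pause so we do not hammer the server
    return file_name, True
```

With `requests`, the `download` argument could be something like `lambda name: requests.get(wikipedia_fr_root + name, timeout=10).content`, where `timeout` keeps a stalled connection from hanging the notebook.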



## Scraping the content of the page

We can use the BeautifulSoup library to parse our downloaded page and extract the text we are interested in. More info on BeautifulSoup here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [68]:
import pandas as pd
from bs4 import BeautifulSoup

# If this doesn't work, install the package like this: pip install beautifulsoup4
# or, on Anaconda Python: conda install -c anaconda beautifulsoup4

In [46]:
# This function opens a saved file and uses BeautifulSoup to extract the title of the page

def get_wiki_page_title(file_name):

    with open(file_name, encoding='utf-8') as fp:
        soup = BeautifulSoup(fp, 'html.parser')
        title = soup.title
        return title.text
 


[TRY IT] Try extracting the title by calling:

title = get_wiki_page_title('scraped_wikipages/chat.html')
print(title)

In [47]:
get_wiki_page_title('scraped_wikipages/chat.html')

'Chat — Wikipédia'

In [51]:
# Opens a saved Wikipedia page and extracts the infobox

def get_wiki_info_box(file_name):
    with open(file_name, encoding='utf-8') as fp:
        soup = BeautifulSoup(fp, 'html.parser')
        # find() returns the first matching tag, or None if the page has no such div
        tags = soup.find("div", class_="infobox_v3")
        return tags

[TRY IT] Try extracting an infobox by calling:

```python
info_box_html = get_wiki_info_box('scraped_wikipages/chat.html')
print(info_box_html.prettify())
```

Find more info on how to get specific content out of your files in the Beautiful Soup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
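
As an illustration of getting specific content out of a parsed page, here is a minimal sketch that turns an infobox into a dictionary of label and value pairs. The HTML string is a simplified, made-up stand-in; the markup of a real Wikipedia infobox may well differ, so the `tr`/`th`/`td` structure is an assumption. The function name `info_box_rows` is hypothetical.

```python
from bs4 import BeautifulSoup

# Simplified, made-up HTML standing in for a saved page (assumption:
# real infobox markup may be structured differently)
sample_html = """
<html><body>
<div class="infobox_v3">
  <table>
    <tr><th>Règne</th><td>Animalia</td></tr>
    <tr><th>Classe</th><td>Mammalia</td></tr>
  </table>
</div>
</body></html>
"""

def info_box_rows(html):
    """Return {header: value} pairs from the first infobox, or {} if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    box = soup.find("div", class_="infobox_v3")
    if box is None:  # find() returns None when nothing matches
        return {}
    rows = {}
    for tr in box.find_all("tr"):
        th, td = tr.find("th"), tr.find("td")
        # keep only rows that have both a header cell and a value cell
        if th is not None and td is not None:
            rows[th.get_text(strip=True)] = td.get_text(strip=True)
    return rows

print(info_box_rows(sample_html))
```

Checking for `None` before using the result matters because calling a method such as `prettify()` on a missing infobox would raise an `AttributeError`.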

## Saving Results to a CSV File

Next, we want to create a dataset out of the content we scrape. Here is an example:



In [69]:
# Loads a saved Wikipedia page and saves all the links to a CSV file

def process_wiki_links_to_csv(file_name):
    # Read in the file and parse the html
    all_links = []

    with open(file_name, encoding='utf-8') as fp:
        soup = BeautifulSoup(fp, 'html.parser')

        # Find all the links in the document
        all_links = soup.find_all('a', href=True)

    all_hrefs = [a['href'] for a in all_links]
    link_texts = [a.text for a in all_links]

    # Build a data frame (a "table") with info for each link
    df = pd.DataFrame({'link': all_hrefs,
                       'text': link_texts
                       })
    csv_file_name = file_name + "_links.csv"
    df.to_csv(csv_file_name)
    print("Saved " + csv_file_name)
 
 

 

[TRY IT] Try saving all of the links as a CSV file:

```python
process_wiki_links_to_csv('scraped_wikipages/chat.html')
```

Saved scraped_wikipages/chat.html_links.csv
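
The saved links are a mix of relative paths, in-page anchors, and external URLs. Below is a minimal sketch of one way to keep only links to other articles and turn them into absolute URLs. The rows are made up and only mimic the `link` and `text` columns of the CSV; the rule "starts with `/wiki/` and contains no colon" is a rough heuristic and an assumption, since some article titles do contain colons.

```python
from urllib.parse import urljoin
import pandas as pd

BASE_URL = "https://fr.wikipedia.org"

# Made-up rows in the same shape as the links CSV ('link' and 'text' columns)
df = pd.DataFrame({
    "link": ["/wiki/Chien", "#cite_note-1", "https://www.example.org/", "/wiki/Fichier:Chat.jpg"],
    "text": ["Chien", "[1]", "Example", ""],
})

# Heuristic (assumption): article links start with /wiki/ and have no namespace colon
is_article = df["link"].str.startswith("/wiki/") & ~df["link"].str.contains(":", regex=False)
articles = df[is_article].copy()

# Turn relative paths into absolute URLs
articles["url"] = articles["link"].apply(lambda href: urljoin(BASE_URL, href))

print(articles)
```

To apply this to a saved file, the CSV could be loaded with `pd.read_csv('scraped_wikipages/chat.html_links.csv', index_col=0)`; `index_col=0` accounts for the index column that `to_csv` writes by default.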


In [71]:
#and try something else
fetch_and_save_wiki_page('mouton')
process_wiki_links_to_csv('scraped_wikipages/mouton.html')

Saved scraped_wikipages/mouton.html_links.csv
