{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Scraping Tutorial\n", "\n", "* Course: Visual Analytics, CentraleSupelec\n", "* Author: Petra Isenberg, Wesley Willett" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1\n", "\n", "We will scrape information from a Wikipedia page like this one: https://fr.wikipedia.org/wiki/Chat\n", "Your first step should always be to familiarize youself with the website/s you want to scrape. Take a look at the cat website.\n", "\n", "Look at the URL and see if you can bring up the \"chien\" website (that's dog in French)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2\n", "\n", "Next we have to download the pages we want to extract data from\n", "Here, we want to get the html code of an arbitrary wikipedia page (we'll start with cats, everyone loves them on the internet)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "#We first need a few things\n", "\n", "import os.path\n", "from os import path\n", "import requests \n", "#if you get an error here, call \n", "# pip3 install requests on your console\n", "#or if you are on Anaconda Python: conda install -c anaconda requests \n", "\n" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "#Here are some variables we might want to define\n", "\n", "wikipedia_fr_root = 'https://fr.wikipedia.org/wiki/' #that's the root url of the french wikipedia\n", "directory = \"scraped_wikipages\"\n", "\n", "#and we check that we have a directory to save our scraped pages to\n", "if not path.isdir(directory):\n", " os.mkdir(directory)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we define a function that will allow us to download and save a wikipedia page given its page name to a folder called \"scraped_wikipages\"" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "# Load the fr.wikipedia.org page with a given page_name and \n", "# save it as an html file with that same name.\n", "\n", "def fetch_and_save_wiki_page(page_name):\n", "\n", " full_url = wikipedia_fr_root+page_name\n", "\n", " #the name of the file to save to\n", " file_name = directory+\"/\"+page_name+\".html\"\n", " \n", " page = requests.get(full_url, allow_redirects=True)\n", " \n", " if not page.status_code == 200:\n", " print(\"Some error occurred loading the page. Status code: \" + str(page.status_code))\n", " else:\n", " open(file_name, 'wb').write(page.content)\n", " \n", " return page" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Good, let's see if it works. Try saving a file by calling this:\n", "\n", "fetch_and_save_wiki_page('chat')\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scraping the content of the page\n", "\n", "We can use the BeautifulSoup library to parse our downloaded page and extract the text we are interested in. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Scraping the content of the page\n", "\n", "We can use the BeautifulSoup library to parse our downloaded page and extract the text we are interested in.\n", "More info on BeautifulSoup here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/" ] },
{ "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from bs4 import BeautifulSoup\n", "\n", "# If this doesn't work, install the package like this: pip install beautifulsoup4\n", "# or on Anaconda Python: conda install -c anaconda beautifulsoup4" ] },
{ "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "# This function opens a saved file and uses BeautifulSoup to extract the title of the page\n", "\n", "def get_wiki_page_title(file_name):\n", "\n", "    with open(file_name, encoding='utf-8') as fp:\n", "        soup = BeautifulSoup(fp, 'html.parser')\n", "        title = soup.title\n", "        return title.text\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "[TRY IT] Try extracting the title by calling:\n", "\n", "```python\n", "title = get_wiki_page_title('scraped_wikipages/chat.html')\n", "print(title)\n", "```" ] },
{ "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Chat — Wikipédia'" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_wiki_page_title('scraped_wikipages/chat.html')" ] },
{ "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "# Opens a saved Wikipedia page and extracts the infobox\n", "\n", "def get_wiki_info_box(file_name):\n", "    with open(file_name, encoding='utf-8') as fp:\n", "        soup = BeautifulSoup(fp, 'html.parser')\n", "        tags = soup.find(\"div\", class_=\"infobox_v3\")\n", "        return tags" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "[TRY IT] Try extracting an infobox by calling:\n", "\n", "```python\n", "info_box_html = get_wiki_info_box('scraped_wikipages/chat.html')\n", "print(info_box_html.prettify())\n", "```" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Find more info on how to get specific content out of your files in the Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/" ] },
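{ "cell_type": "markdown", "metadata": {}, "source": [ "Once you have the infobox, you will usually want to pull structured values out of it. The cell below is a rough sketch of one way to do that: it assumes the infobox rows are `tr` elements with a `th` label and a `td` value, which holds for many (but not all) `infobox_v3` boxes, so inspect the HTML of your own pages first." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A rough sketch: turn the infobox into a dictionary of label -> value pairs.\n", "# It assumes rows are <tr> elements with a <th> label and a <td> value,\n", "# which holds for many (but not all) infobox_v3 boxes -- inspect your own pages.\n", "\n", "def info_box_to_dict(info_box_html):\n", "    info = {}\n", "    if info_box_html is None:\n", "        return info\n", "    for row in info_box_html.find_all('tr'):\n", "        label = row.find('th')\n", "        value = row.find('td')\n", "        if label and value:\n", "            info[label.get_text(' ', strip=True)] = value.get_text(' ', strip=True)\n", "    return info\n", "\n", "# For example:\n", "# info_box_to_dict(get_wiki_info_box('scraped_wikipages/chat.html'))" ] },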
Here is an example\n", "\n" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "# Loads a saved Wikipedia page and saves all the links to a CSV file\n", "\n", "def process_wiki_links_to_csv(file_name):\n", " # Read in the file and parse the html\n", " all_links = []\n", " \n", " with open(file_name, encoding='utf-8') as fp:\n", " soup = BeautifulSoup(fp, 'html.parser') \n", " \n", " # Find all the links in the document\n", " all_links = soup.find_all('a', href=True)\n", " \n", " all_hrefs = [a['href'] for a in all_links]\n", " link_texts = [a.text for a in all_links]\n", " \n", " # Build a data frame (a \"table\") with info for each link\n", " df = pd.DataFrame({'link':all_hrefs,\n", " 'text':link_texts\n", " })\n", " csv_file_name = file_name+\"_links.csv\"\n", " df.to_csv(csv_file_name)\n", " print(\"Saved \" +csv_file_name)\n", " \n", " \n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[TRY IT] Try saving all of the links as a CSV file\n", "\n", "```python\n", "process_wiki_links_to_csv('scraped_wikipages/chat.html')\n", "```" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Saved scraped_wikipages/chat.html_links.csv\n" ] } ], "source": [] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Saved scraped_wikipages/mouton.html_links.csv\n" ] } ], "source": [ "#and try something else\n", "fetch_and_save_wiki_page('mouton')\n", "process_wiki_links_to_csv('scraped_wikipages/mouton.html')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }