{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Extended Scraping Tutorial\n", "\n", "* Course: Visual Analytics, CentraleSupelec\n", "* Authors: Petra Isenberg, Wesley Willett\n", "\n", "The purpose of this exercise is to familiarize you with scraping across multiple pages and with inspecting a website to find the pieces of information to scrape.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Some imports you might need\n", "\n", "import pandas as pd\n", "from bs4 import BeautifulSoup\n", "\n", "import os.path\n", "from os import path\n", "import requests\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TRY IT\n", "Write a function to fetch and save all of the vispubdata pages found at the location defined in the cell below.\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "vispubdata_root = 'https://www.lri.fr/~isenberg/VA/vispubdata'\n", "vispubdata_first = 'pub_output_0.html'\n", "\n", "\n", "def fetch_vispubdata_pages():\n", "    print('TODO: Fetch vispubdata pages and save them locally. Notice that there are multiple pages of records to download.')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Write a function that processes all of the saved vispubdata files and produces a CSV file.\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def process_vispubdata_pages():\n", "    # TODO: Process vispubdata pages\n", "    print('TODO: Process vispubdata pages to a CSV')\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TODO: Fetch vispubdata pages and save them locally. Notice that there are multiple pages of records to download.\n", "TODO: Process vispubdata pages to a CSV\n" ] } ], "source": [ "fetch_vispubdata_pages()\n", "process_vispubdata_pages()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }