POL 478H1 F

Intro to Webscraping

Olga Chyzh [www.olgachyzh.com]

1 / 16

Outline

  • What is webscraping?

  • Webscraping using rvest

  • Examples

    • IMDB show cast

    • 2020 US election returns

2 / 16

What is Webscraping?

  • Extract data from websites

    • Tables

    • Links to other websites

    • Text

3 / 16

Why Webscrape?

  • Because copy-paste is tedious

  • Because it's fast

  • Because you can automate it

  • Because it helps reduce/catch errors

4 / 16

Webscraping: Broad Strokes

  • Most websites are written in HTML

  • HTML code is messy and difficult to parse manually

  • We will use R to

    • read the HTML (or other) code
    • clean it up to extract the data we need
  • Need only a very rudimentary understanding of HTML
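The two steps above (read the HTML, then extract what is needed) can be sketched on a tiny page written inline. The page content here is made up for illustration; with a real site, the string would be replaced by a URL.

```r
library(rvest)

# A tiny HTML page written inline (hypothetical content, for illustration only)
page <- read_html("<html><body>
  <h1>Election results</h1>
  <p>Turnout was <b>high</b> this year.</p>
</body></html>")

# Extract only the pieces we care about, dropping the markup
heading <- page %>% html_nodes("h1") %>% html_text()
paragraph <- page %>% html_nodes("p") %>% html_text()

heading
paragraph
```

Note that html_text() keeps the text of nested tags (the bold word) but drops the tags themselves.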

5 / 16

Webscraping with rvest: Step-by-Step Start Guide

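Load the tidyverse and rvest, installing them first if needed:

# load the packages if you already have them
library(tidyverse)
library(rvest)
# if not, install them first (installing the tidyverse also installs rvest):
install.packages("tidyverse")
library(tidyverse) # attaches only the "core" tidyverse packages
library(rvest) # rvest is not part of the core, so attach it separately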
6 / 16

Step 1: What Website Are You Scraping?

# character variable containing the url you want to scrape
myurl <- "https://www.imdb.com/title/tt0068646/"
7 / 16

Step 2: Read HTML into R

  • HTML is HyperText Markup Language.

  • On any web page, right-click and choose "View Page Source" to see the HTML

library(tidyverse)
library(rvest)
myhtml <- read_html(myurl)
myhtml
## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="styleguide-v2" class="fixed">\n <img height="1" widt ...
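As a small sketch of what the parsed object contains, read_html() also accepts a literal HTML string, which makes it easy to experiment without a live site. The page below is invented for illustration.

```r
library(rvest)

# read_html() accepts a URL, a file path, or a literal HTML string
page <- read_html("<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>")

# The top-level children of the document are the <head> and <body> elements
top_level <- page %>% html_children() %>% html_name()
top_level

# Any element can then be reached by its tag name
page_title <- page %>% html_nodes("title") %>% html_text()
page_title
```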
8 / 16

Step 3: Where in the HTML Code Are Your Data?

  • Need to find your data within the myhtml object.

  • In HTML, every element (a table, paragraph, hyperlink, heading, and so on) is enclosed in "tags", which are written between < and > symbols

  • Examples of tags:

    • <p> This is a paragraph.</p>
    • <h1> This is a heading. </h1>
    • <a> This is a link. </a>
    • <li> item in a list </li>
    • <table>This is a table. </table>
  • Can use Selector Gadget to find the exact location. Enter vignette("selectorgadget") for an overview.

  • Can also skim through the raw html code looking for possible tags.

  • For more on HTML, check out the W3schools' tutorial

  • You don't need to be an expert in HTML to webscrape with rvest!
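To make the tag examples concrete, here is a minimal sketch that selects elements by tag name. The page and the example.com address are placeholders, not taken from a real site.

```r
library(rvest)

# Hypothetical page that uses several of the tags listed above
page <- read_html("<html><body>
  <h1>Cast</h1>
  <p>See the <a href='https://example.com/full'>full cast</a>.</p>
  <ul><li>Actor A</li><li>Actor B</li></ul>
</body></html>")

# Select by tag name, then pull out the text
items <- page %>% html_nodes("li") %>% html_text()
link_text <- page %>% html_nodes("a") %>% html_text()

# Attributes such as a link's target are read with html_attr()
link_url <- page %>% html_nodes("a") %>% html_attr("href")

items
link_text
link_url
```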

9 / 16

Step 4: Extract Your Data

Pass a CSS selector (a tag name, or an id prefixed with #) to html_nodes() to extract the part of the page you are interested in. Once you have the matching nodes, use html_text() to extract their text or html_table() to extract a table.

mysummary <- html_nodes(myhtml, "#titleCast") # gets everything in the element whose id is "titleCast"
mysummary
## {xml_nodeset (1)}
## [1] <div class="article" id="titleCast">\n <span class="rightcornerlink">\ ...
html_text(mysummary)
## [1] "\n \n Edit\n \n Cast\n \n Cast overview, first billed only:\n \n \n Marlon Brando\n \n \n ...\n \n \n Don Vito Corleone \n \n \n \n \n \n Al Pacino\n \n \n ...\n \n \n Michael Corleone \n \n \n \n \n \n James Caan\n \n \n ...\n \n \n Sonny Corleone \n \n \n \n \n \n Richard S. Castellano\n \n \n ...\n \n \n Clemenza \n \n \n (as Richard Castellano)\n \n \n \n \n \n \n Robert Duvall\n \n \n ...\n \n \n Tom Hagen \n \n \n \n \n \n Sterling Hayden\n \n \n ...\n \n \n Capt. McCluskey \n \n \n \n \n \n John Marley\n \n \n ...\n \n \n Jack Woltz \n \n \n \n \n \n Richard Conte\n \n \n ...\n \n \n Barzini \n \n \n \n \n \n Al Lettieri\n \n \n ...\n \n \n Sollozzo \n \n \n \n \n \n Diane Keaton\n \n \n ...\n \n \n Kay Adams \n \n \n \n \n \n Abe Vigoda\n \n \n ...\n \n \n Tessio \n \n \n \n \n \n Talia Shire\n \n \n ...\n \n \n Connie \n \n \n \n \n \n Gianni Russo\n \n \n ...\n \n \n Carlo \n \n \n \n \n \n John Cazale\n \n \n ...\n \n \n Fredo \n \n \n \n \n \n Rudy Bond\n \n \n ...\n \n \n Cuneo \n \n \n \n See full cast&nbsp;»\n \n \n \n \nView production, box office, & company info\n\n \n \n "
#Or you can combine the operations into a pipe:
myhtml %>% html_nodes("#titleCast") %>% html_text()
## [1] "\n \n Edit\n \n Cast\n \n Cast overview, first billed only:\n \n \n Marlon Brando\n \n \n ...\n \n \n Don Vito Corleone \n \n \n \n \n \n Al Pacino\n \n \n ...\n \n \n Michael Corleone \n \n \n \n \n \n James Caan\n \n \n ...\n \n \n Sonny Corleone \n \n \n \n \n \n Richard S. Castellano\n \n \n ...\n \n \n Clemenza \n \n \n (as Richard Castellano)\n \n \n \n \n \n \n Robert Duvall\n \n \n ...\n \n \n Tom Hagen \n \n \n \n \n \n Sterling Hayden\n \n \n ...\n \n \n Capt. McCluskey \n \n \n \n \n \n John Marley\n \n \n ...\n \n \n Jack Woltz \n \n \n \n \n \n Richard Conte\n \n \n ...\n \n \n Barzini \n \n \n \n \n \n Al Lettieri\n \n \n ...\n \n \n Sollozzo \n \n \n \n \n \n Diane Keaton\n \n \n ...\n \n \n Kay Adams \n \n \n \n \n \n Abe Vigoda\n \n \n ...\n \n \n Tessio \n \n \n \n \n \n Talia Shire\n \n \n ...\n \n \n Connie \n \n \n \n \n \n Gianni Russo\n \n \n ...\n \n \n Carlo \n \n \n \n \n \n John Cazale\n \n \n ...\n \n \n Fredo \n \n \n \n \n \n Rudy Bond\n \n \n ...\n \n \n Cuneo \n \n \n \n See full cast&nbsp;»\n \n \n \n \nView production, box office, & company info\n\n \n \n "
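The raw text above is full of newlines and repeated spaces. One possible way to tidy it, sketched on a made-up page rather than the IMDB page itself, is stringr's str_squish(), which trims the ends and collapses runs of whitespace.

```r
library(rvest)
library(stringr)

# Hypothetical page with an id and a class to select on
page <- read_html("<html><body>
  <div id='titleCast'>
    <span>  Cast  </span>
    <span>
      Actor A
    </span>
  </div>
  <div class='other'>Not wanted</div>
</body></html>")

# "#titleCast" selects by id; ".other" would select by class
raw_text <- page %>% html_nodes("#titleCast") %>% html_text()
other_text <- page %>% html_nodes(".other") %>% html_text()

# Collapse newlines and repeated spaces into single spaces
clean_text <- str_squish(raw_text)
clean_text
```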
10 / 16

Most Often, We Want to Extract a Table

myhtml %>% html_nodes("table") %>% html_table(header = TRUE)
## [[1]]
## Cast overview, first billed only: Cast overview, first billed only:
## 1 NA Marlon Brando
## 2 NA Al Pacino
## 3 NA James Caan
## 4 NA Richard S. Castellano
## 5 NA Robert Duvall
## 6 NA Sterling Hayden
## 7 NA John Marley
## 8 NA Richard Conte
## 9 NA Al Lettieri
## 10 NA Diane Keaton
## 11 NA Abe Vigoda
## 12 NA Talia Shire
## 13 NA Gianni Russo
## 14 NA John Cazale
## 15 NA Rudy Bond
## Cast overview, first billed only:
## 1 ...
## 2 ...
## 3 ...
## 4 ...
## 5 ...
## 6 ...
## 7 ...
## 8 ...
## 9 ...
## 10 ...
## 11 ...
## 12 ...
## 13 ...
## 14 ...
## 15 ...
## Cast overview, first billed only:
## 1 Don Vito Corleone
## 2 Michael Corleone
## 3 Sonny Corleone
## 4 Clemenza \n \n \n (as Richard Castellano)
## 5 Tom Hagen
## 6 Capt. McCluskey
## 7 Jack Woltz
## 8 Barzini
## 9 Sollozzo
## 10 Kay Adams
## 11 Tessio
## 12 Connie
## 13 Carlo
## 14 Fredo
## 15 Cuneo
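A minimal sketch of the same call on a small inline table (invented names, not IMDB data) shows the shape of the result: html_table() applied to a set of nodes returns a list with one data frame per table.

```r
library(rvest)

# Hypothetical two-column table with a header row
page <- read_html("<html><body><table>
  <tr><th>Actor</th><th>Role</th></tr>
  <tr><td>Actor A</td><td>Role A</td></tr>
  <tr><td>Actor B</td><td>Role B</td></tr>
</table></body></html>")

tables <- page %>% html_nodes("table") %>% html_table(header = TRUE)

# One list element per <table> found on the page
length(tables)

# Pull out the first (and here only) table
cast <- tables[[1]]
names(cast)
nrow(cast)
```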
11 / 16

Step 5: Save and Clean the Data

  • You may want to remove all columns except Actor and Role.

  • Here is some sample code to clean this, but there are many ways to do the same:

library(stringr)
library(magrittr)
mydat <- myhtml %>%
  html_nodes("table") %>%
  extract2(1) %>% # html_nodes() returns a set of nodes; take the first table
  html_table(header = TRUE)
mydat <- mydat[, c(2, 4)] # keep only the actor and role columns
names(mydat) <- c("Actor", "Role")
mydat <- mydat %>%
  mutate(Role = str_extract(Role, "[^\\n]+")) # [^\\n] is any character except a newline; + means one or more
mydat
## Actor Role
## 1 Marlon Brando Don Vito Corleone
## 2 Al Pacino Michael Corleone
## 3 James Caan Sonny Corleone
## 4 Richard S. Castellano Clemenza
## 5 Robert Duvall Tom Hagen
## 6 Sterling Hayden Capt. McCluskey
## 7 John Marley Jack Woltz
## 8 Richard Conte Barzini
## 9 Al Lettieri Sollozzo
## 10 Diane Keaton Kay Adams
## 11 Abe Vigoda Tessio
## 12 Talia Shire Connie
## 13 Gianni Russo Carlo
## 14 John Cazale Fredo
## 15 Rudy Bond Cuneo
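The regular expression in the cleaning step can be tried on its own. This sketch applies it to two of the role strings from the table above; the extra str_trim() call is an addition here, used to drop the trailing space that the pattern leaves behind.

```r
library(stringr)

# Two role strings as they come out of the table, one with embedded newlines
roles <- c("Clemenza \n \n (as Richard Castellano)", "Tom Hagen")

# Keep the first run of characters that contains no newline
first_line <- str_extract(roles, "[^\\n]+")

# Remove leading and trailing whitespace
clean_roles <- str_trim(first_line)
clean_roles
```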
12 / 16

Your Turn (5 min)

  • Follow the same steps to get the cast of the Wizard of Oz movie.

  • Clean up the output as best you can. Feel free to consult the stringr cheatsheet.

13 / 16

Why Is This Useful?

  • Can write a loop to get cast of a long list of movies

  • Can write a loop to get any tables from any website/websites
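A loop of this kind might look like the sketch below. The helper get_first_table() and the two inline pages are hypothetical stand-ins; with live sites, pages would be a named vector of URLs instead.

```r
library(rvest)

# Hypothetical helper: read one page and return its first table as a data frame
get_first_table <- function(src) {
  tables <- read_html(src) %>%
    html_nodes("table") %>%
    html_table(header = TRUE)
  as.data.frame(tables[[1]])
}

# Inline stand-ins for real pages (made-up content)
pages <- c(
  movie_a = paste0(
    "<html><body><table><tr><th>Actor</th><th>Role</th></tr>",
    "<tr><td>Actor A</td><td>Role A</td></tr>",
    "<tr><td>Actor B</td><td>Role B</td></tr></table></body></html>"
  ),
  movie_b = paste0(
    "<html><body><table><tr><th>Actor</th><th>Role</th></tr>",
    "<tr><td>Actor C</td><td>Role C</td></tr></table></body></html>"
  )
)

results <- list()
for (title in names(pages)) {
  tab <- get_first_table(pages[[title]])
  tab$movie <- title # record which page each row came from
  results[[title]] <- tab
}

# Stack the per-page tables into one data frame
all_casts <- do.call(rbind, unname(results))
all_casts
```

When looping over real URLs, it is usually considerate to pause between requests, for example with Sys.sleep().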

14 / 16

Key Functions: html_text

  • html_text(x) extracts all text from the nodeset x
  • Good for cleaning output
read_html(myurl) %>%
  html_nodes("p") %>% # first get all the paragraphs
  html_nodes("a") %>% # then get all the links in those paragraphs
  html_text() # get the linked text only

myurl <- "https://www.tripadvisor.ca/Attraction_Review-g155019-d155483-Reviews-CN_Tower-Toronto_Ontario.html"
read_html(myurl) %>%
  html_nodes(".cPQsENeY") %>%
  html_text()
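The nested selection (paragraphs first, then links inside them) can be checked on a small made-up page. The sketch also shows the descendant selector "p a", which should give the same result in one step.

```r
library(rvest)

# Hypothetical page: two links inside paragraphs, one link outside
page <- read_html("<html><body>
  <p>Read the <a href='/a'>first link</a>.</p>
  <a href='/b'>link outside a paragraph</a>
  <p>And the <a href='/c'>second link</a>.</p>
</body></html>")

# Two chained calls: paragraphs, then links within them
nested <- page %>% html_nodes("p") %>% html_nodes("a") %>% html_text()

# One call with a descendant selector
combined <- page %>% html_nodes("p a") %>% html_text()

nested
combined
```

The link that sits outside any paragraph is skipped by both versions.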
15 / 16

Key Functions: html_table

  • html_table(x, header, fill) - parse html table(s) from x into a data frame or list of data frames
  • Structure of HTML makes finding and extracting tables easy!
myurl <- "https://electionresults.utah.gov/elections/countyCount/399789495"
read_html(myurl) %>%
  html_nodes("table") %>% # get the tables
  html_table(header = FALSE) %>%
  head()
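With header = FALSE, the first row of the table arrives as data and the columns get placeholder names. One possible follow-up, sketched here on an invented table rather than the Utah results, is to promote that first row to column names by hand.

```r
library(rvest)

# Hypothetical table whose header row uses ordinary <td> cells
page <- read_html("<html><body><table>
  <tr><td>County</td><td>Votes</td></tr>
  <tr><td>County A</td><td>100</td></tr>
  <tr><td>County B</td><td>200</td></tr>
</table></body></html>")

tables <- page %>% html_nodes("table") %>% html_table(header = FALSE)
tab <- as.data.frame(tables[[1]])

# Use the first row as column names, then drop it from the data
names(tab) <- as.character(unlist(tab[1, ]))
tab <- tab[-1, ]

# The counts were read as text because of the header row; convert them
tab$Votes <- as.integer(tab$Votes)
tab
```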
16 / 16
