POL 478H1 F

Intro to Webscraping

Olga Chyzh [www.olgachyzh.com]

Outline

What is webscraping?
Webscraping using rvest
Examples
- IMDB show cast
- 2020 US election returns

What is Webscraping?

Extract data from websites
- Tables
- Links to other websites
- Text

Why Webscrape?

Because copy-paste is tedious
Because it's fast
Because you can automate it
Because it helps reduce/catch errors

Webscraping: Broad Strokes

Most websites are written in HTML
HTML code is messy and difficult to parse manually
We will use R to
- read the HTML (or other) code
- clean it up to extract the data we need
Need only a very rudimentary understanding of HTML
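
As a rough sketch of the two steps above (read the HTML, then clean it up), the example below parses a tiny made-up page stored in a string instead of a live website. The page content and the object names `page_source`, `page`, and `heading` are assumptions for illustration only.

```r
library(rvest)

# A made-up page stored as a string (assumption: stands in for a real website)
page_source <- "<html><body>
  <h1>Election Night</h1>
  <p>Turnout was high.</p>
</body></html>"

page <- read_html(page_source)   # step 1: read the HTML code

heading <- page %>%              # step 2: keep only the piece we need
  html_nodes("h1") %>%           # select the <h1> heading
  html_text()                    # drop the tags, keep the text
heading
```

With a real site, `page_source` would be replaced by the page's URL.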

Webscraping with `rvest`: Step-by-Step Start Guide

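Load the tidyverse and rvest packages, installing them first if needed:

# check whether you already have the packages
library(tidyverse)
library(rvest)
# if not, install them (rvest is installed along with the tidyverse):
install.packages("tidyverse")
library(tidyverse) # attaches only the "core" tidyverse packages
library(rvest)     # rvest is not in the core, so attach it separately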

Step 1: What Website Are You Scraping?

# character variable containing the url you want to scrape
myurl <- "https://www.imdb.com/title/tt0068646/"

Step 2: Read `HTML` into R

HTML is HyperText Markup Language.
Go to any website, right-click, and choose "View Page Source" to see the HTML

library(tidyverse)
library(rvest)
myhtml <- read_html(myurl)
myhtml

## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="styleguide-v2" class="fixed">\n            <img height="1" widt ...

Step 3: Where in the HTML Code Are Your Data?

Need to find your data within the myhtml object.
In HTML, all objects, such as tables, paragraphs, hyperlinks, and headings, are enclosed in "tags", which are written between < and > symbols
Examples of tags:
- <p> This is a paragraph.</p>
- <h1> This is a heading. </h1>
- <a> This is a link. </a>
- <li> item in a list </li>
- <table>This is a table. </table>
Can use Selector Gadget to find the exact location. Enter vignette("selectorgadget") for an overview.
Can also skim through the raw html code looking for possible tags.
For more on HTML, check out the W3schools' tutorial
You don't need to be an expert in HTML to webscrape with rvest!
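
To see how tags map onto selectors, here is a minimal sketch that uses a made-up snippet built from the example tags above. The snippet and the object names are assumptions for illustration. The selector passed to `html_nodes()` is simply the tag name without the <> symbols.

```r
library(rvest)

# Made-up HTML using the tags listed above (an assumption for illustration)
snippet <- read_html("<html><body>
  <h1>This is a heading.</h1>
  <p>This is a paragraph.</p>
  <a href='https://www.example.com'>This is a link.</a>
  <ul><li>first item</li><li>second item</li></ul>
</body></html>")

# Select by tag name, then keep only the text inside the tag
heading_text <- snippet %>% html_nodes("h1") %>% html_text()
list_items   <- snippet %>% html_nodes("li") %>% html_text()

heading_text
list_items   # one element per <li> tag found
```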

Step 4: Extract Your Data

Pass HTML tags (or other selectors, such as an element's id) to html_nodes() to extract the data of interest. Once you have the content you are looking for, use html_text() to extract text or html_table() to extract a table.

mysummary <- html_nodes(myhtml, "#titleCast") # gets everything in the element with id "titleCast"
mysummary

## {xml_nodeset (1)}
## [1] <div class="article" id="titleCast">\n    <span class="rightcornerlink">\ ...

html_text(mysummary)

## [1] "\n    \n            Edit\n    \n        Cast\n        \n        Cast overview, first billed only:\n          \n          \n Marlon Brando\n          \n          \n              ...\n          \n          \n            Don Vito Corleone \n                  \n          \n      \n          \n          \n Al Pacino\n          \n          \n              ...\n          \n          \n            Michael Corleone \n                  \n          \n      \n          \n          \n James Caan\n          \n          \n              ...\n          \n          \n            Sonny Corleone \n                  \n          \n      \n          \n          \n Richard S. Castellano\n          \n          \n              ...\n          \n          \n            Clemenza \n  \n  \n  (as Richard Castellano)\n  \n                  \n          \n      \n          \n          \n Robert Duvall\n          \n          \n              ...\n          \n          \n            Tom Hagen \n                  \n          \n      \n          \n          \n Sterling Hayden\n          \n          \n              ...\n          \n          \n            Capt. McCluskey \n                  \n          \n      \n          \n          \n John Marley\n          \n          \n              ...\n          \n          \n            Jack Woltz \n                  \n          \n      \n          \n          \n Richard Conte\n          \n          \n              ...\n          \n          \n            Barzini \n                  \n          \n      \n          \n          \n Al Lettieri\n          \n          \n              ...\n          \n          \n            Sollozzo \n                  \n          \n      \n          \n          \n Diane Keaton\n          \n          \n              ...\n          \n          \n            Kay Adams \n                  \n          \n      \n          \n          \n Abe Vigoda\n          \n          \n              ...\n          \n          \n            Tessio \n                  \n          \n      \n          \n          \n Talia Shire\n          \n          \n              ...\n          \n          \n            Connie \n                  \n          \n      \n          \n          \n Gianni Russo\n          \n          \n              ...\n          \n          \n            Carlo \n                  \n          \n      \n          \n          \n John Cazale\n          \n          \n              ...\n          \n          \n            Fredo \n                  \n          \n      \n          \n          \n Rudy Bond\n          \n          \n              ...\n          \n          \n            Cuneo \n                  \n          \n      \n            See full cast&nbsp;»\n        \n        \n    \n \nView production, box office, & company info\n\n    \n        \n    "

#Or you can combine the operations into a pipe:
myhtml %>% html_nodes("#titleCast") %>% html_text()

## (same output as html_text(mysummary) above)
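
The `#` in a selector such as "#titleCast" refers to an element's id. A leading `.` refers to a class instead. The sketch below contrasts the two on a made-up page (the ids, classes, and object names are assumptions, not IMDB's actual markup) and uses `str_squish()` from stringr to tidy the whitespace that `html_text()` leaves behind.

```r
library(rvest)
library(stringr)

# Made-up page; the id and class names are assumptions for illustration
cast_page <- read_html("<html><body>
  <div id='cast'>
    <span class='actor'>
        Marlon Brando
    </span>
    <span class='actor'>
        Al Pacino
    </span>
  </div>
  <div id='crew'><span class='director'>Francis Ford Coppola</span></div>
</body></html>")

# '#cast' selects the one element whose id is "cast"
cast_block <- cast_page %>% html_nodes("#cast") %>% html_text()

# '.actor' selects every element whose class is "actor"
actors <- cast_page %>% html_nodes(".actor") %>% html_text()

# str_squish() trims both ends and collapses runs of spaces and newlines
actors_clean <- str_squish(actors)
actors_clean
```

Selecting the narrower `.actor` elements returns one string per actor, which is usually easier to clean than one long string for the whole block.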

Most Often, We Want to Extract a Table

myhtml %>% html_nodes("table") %>% html_table(header = TRUE)

## [[1]]
##    Cast overview, first billed only: Cast overview, first billed only:
## 1                                 NA                     Marlon Brando
## 2                                 NA                         Al Pacino
## 3                                 NA                        James Caan
## 4                                 NA             Richard S. Castellano
## 5                                 NA                     Robert Duvall
## 6                                 NA                   Sterling Hayden
## 7                                 NA                       John Marley
## 8                                 NA                     Richard Conte
## 9                                 NA                       Al Lettieri
## 10                                NA                      Diane Keaton
## 11                                NA                        Abe Vigoda
## 12                                NA                       Talia Shire
## 13                                NA                      Gianni Russo
## 14                                NA                       John Cazale
## 15                                NA                         Rudy Bond
##    Cast overview, first billed only:
## 1                                ...
## 2                                ...
## 3                                ...
## 4                                ...
## 5                                ...
## 6                                ...
## 7                                ...
## 8                                ...
## 9                                ...
## 10                               ...
## 11                               ...
## 12                               ...
## 13                               ...
## 14                               ...
## 15                               ...
##               Cast overview, first billed only:
## 1                             Don Vito Corleone
## 2                              Michael Corleone
## 3                                Sonny Corleone
## 4  Clemenza \n  \n  \n  (as Richard Castellano)
## 5                                     Tom Hagen
## 6                               Capt. McCluskey
## 7                                    Jack Woltz
## 8                                       Barzini
## 9                                      Sollozzo
## 10                                    Kay Adams
## 11                                       Tessio
## 12                                       Connie
## 13                                        Carlo
## 14                                        Fredo
## 15                                        Cuneo
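
The `[[1]]` at the top of the output shows that `html_table()` returned a list, with one data frame per table found. A minimal sketch on a made-up table (the markup and object names are assumptions) shows how to pull a single data frame out of that list:

```r
library(rvest)

# Made-up page with a single table (an assumption for illustration)
table_page <- read_html("<html><body>
  <table>
    <tr><th>Actor</th><th>Role</th></tr>
    <tr><td>Marlon Brando</td><td>Don Vito Corleone</td></tr>
    <tr><td>Al Pacino</td><td>Michael Corleone</td></tr>
  </table>
</body></html>")

tables <- table_page %>%
  html_nodes("table") %>%
  html_table(header = TRUE)

length(tables)        # one list element per <table> on the page
cast <- tables[[1]]   # pull the first (here, only) table out of the list
cast$Actor
```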

Step 5: Save and Clean the Data

You may want to remove all columns except Actor and Role.
Here is some sample code to clean this, but there are many ways to do the same:

library(stringr)
library(magrittr) # for extract2()
mydat <- myhtml %>% 
  html_nodes("table") %>%
  extract2(1) %>% # html_nodes() returns a set of nodes; take the first table, like [[1]]
  html_table(header = TRUE)
mydat <- mydat[, c(2, 4)] # keep only the actor and role columns
names(mydat) <- c("Actor", "Role")
mydat <- mydat %>% 
  mutate(Role = str_extract(Role, "[^\\n]+")) # keep the first run of one or more (+) characters that are not (^) a newline (\n)
mydat

##                    Actor              Role
## 1          Marlon Brando Don Vito Corleone
## 2              Al Pacino  Michael Corleone
## 3             James Caan    Sonny Corleone
## 4  Richard S. Castellano         Clemenza 
## 5          Robert Duvall         Tom Hagen
## 6        Sterling Hayden   Capt. McCluskey
## 7            John Marley        Jack Woltz
## 8          Richard Conte           Barzini
## 9            Al Lettieri          Sollozzo
## 10          Diane Keaton         Kay Adams
## 11            Abe Vigoda            Tessio
## 12           Talia Shire            Connie
## 13          Gianni Russo             Carlo
## 14           John Cazale             Fredo
## 15             Rudy Bond             Cuneo
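
To see what the regular expression does, the sketch below applies it to the one messy role from the table. The pattern keeps everything up to the first line break, which is why "Clemenza " still carries a trailing space in the output above. The extra `str_trim()` step is one optional way to remove it, and the object names are assumptions for illustration.

```r
library(stringr)

# The messy cell from the scraped table
role_raw <- "Clemenza \n  \n  \n  (as Richard Castellano)"

# [^\\n]+ matches the first run of characters that are not a newline,
# so everything from the first line break onward is dropped
role_first_line <- str_extract(role_raw, "[^\\n]+")

# str_trim() removes the trailing space that the pattern leaves behind
role_clean <- str_trim(role_first_line)
role_clean
```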

Your Turn (5 min)

Follow the same steps to get the cast of The Wizard of Oz.
Clean up the output as best you can. Feel free to consult the stringr cheatsheet.

Why Is This Useful?

Can write a loop to get cast of a long list of movies
Can write a loop to get any tables from any website/websites
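
A loop of this kind might look like the sketch below. To keep it self-contained, the "pages" are made-up HTML strings. In practice `pages` would be a named character vector of URLs, ideally with a short pause such as `Sys.sleep(1)` between requests. The helper `get_first_table()` and all names and data here are hypothetical.

```r
library(rvest)

# Hypothetical stand-ins for real pages (in practice: a named vector of URLs)
pages <- c(
  movie_a = "<table><tr><th>Actor</th></tr><tr><td>Actor One</td></tr></table>",
  movie_b = "<table><tr><th>Actor</th></tr><tr><td>Actor Two</td></tr><tr><td>Actor Three</td></tr></table>"
)

# Hypothetical helper: read one page and return its first table
get_first_table <- function(src) {
  tables <- read_html(src) %>%
    html_nodes("table") %>%
    html_table(header = TRUE)
  tables[[1]]
}

# Loop over the pages, storing each table under the movie's name
casts <- list()
for (movie in names(pages)) {
  casts[[movie]] <- get_first_table(pages[[movie]])
}
names(casts)
```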

Key Functions: `html_text`

html_text(x) extracts all text from the nodeset x
Good for cleaning output

# Example 1: linked text inside paragraphs (myurl is the IMDB page from Step 1)
read_html(myurl) %>% 
  html_nodes("p") %>% # first get all the paragraphs 
  html_nodes("a") %>% # then get all the links in those paragraphs
  html_text() # get the linked text only 

# Example 2: select by CSS class (class names like this one may change over time)
myurl <- "https://www.tripadvisor.ca/Attraction_Review-g155019-d155483-Reviews-CN_Tower-Toronto_Ontario.html"
read_html(myurl) %>% 
  html_nodes(".cPQsENeY") %>%
  html_text()
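
`html_text()` returns only the visible text of a link. If the goal is the link's target, `html_attr()` can read the `href` attribute from the same nodes. The sketch below shows both on a made-up page. The URLs and object names are assumptions for illustration.

```r
library(rvest)

# Made-up page: one link inside a paragraph, one outside (an assumption)
link_page <- read_html("<html><body>
  <p>Read the <a href='https://www.example.com/results'>full results</a>.</p>
  <a href='https://www.example.com/about'>About</a>
</body></html>")

links_in_paragraphs <- link_page %>%
  html_nodes("p") %>%   # paragraphs only
  html_nodes("a")       # links inside those paragraphs

link_text <- html_text(links_in_paragraphs)           # the visible, linked text
link_urls <- html_attr(links_in_paragraphs, "href")   # where each link points
link_text
link_urls
```

The "About" link is skipped because it does not sit inside a `<p>` tag.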

Key Functions: `html_table`

html_table(x, header, fill) - parse html table(s) from x into a data frame or list of data frames
Structure of HTML makes finding and extracting tables easy!

myurl <- "https://electionresults.utah.gov/elections/countyCount/399789495"
read_html(myurl) %>% 
  html_nodes("table") %>% # get the tables 
  html_table(header = FALSE) %>% # do not treat the first row as column names
  head()
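
The `header` argument controls whether the first row of the table is used as column names. Here is a minimal sketch of the difference on a made-up table. The county names and vote counts are invented for illustration and are not real election returns.

```r
library(rvest)

# Made-up table whose first row holds the column labels (an assumption)
results_page <- read_html("<table>
  <tr><td>County</td><td>Votes</td></tr>
  <tr><td>Alpha</td><td>120</td></tr>
  <tr><td>Beta</td><td>85</td></tr>
</table>")
results_table <- html_nodes(results_page, "table")

# header = FALSE: the first row stays in the data
no_header <- html_table(results_table, header = FALSE)[[1]]

# header = TRUE: the first row becomes the column names
with_header <- html_table(results_table, header = TRUE)[[1]]

nrow(no_header)
names(with_header)
```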
