What is webscraping?
Webscraping using rvest
Examples
IMDB show cast
2020 US election returns
Extract data from websites
Tables
Links to other websites
Text
Because copy-paste is tedious
Because it's fast
Because you can automate it
Because it helps reduce/catch errors
All websites are written in HTML (mostly)
HTML code is messy and difficult to parse manually
We will use R to parse HTML (or other) code
Need only a very rudimentary understanding of HTML
rvest: Step-by-Step Start Guide
Install all tidyverse packages:
# check if you already have it
library(tidyverse)
library(rvest)
# if not:
install.packages("tidyverse")
library(tidyverse) # only calls the "core" of tidyverse
library(rvest)     # rvest is installed with tidyverse but is not part of the core, so load it separately
# character variable containing the url you want to scrape
myurl <- "https://www.imdb.com/title/tt0068646/"
Read HTML into R
HTML is HyperText Markup Language.
Go to any website, right click, click "View Page Source" to see the HTML
library(tidyverse)
library(rvest)
myhtml <- read_html(myurl)
myhtml
## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="styleguide-v2" class="fixed">\n <img height="1" widt ...
Need to find your data within the myhtml object.
In HTML, all objects, such as tables, paragraphs, hyperlinks, and headings, are inside "tags" that are surrounded by <> symbols
Examples of tags:
<p>This is a paragraph.</p>
<h1>This is a heading.</h1>
<a>This is a link.</a>
<li>item in a list</li>
<table>This is a table.</table>
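To see how these tags map onto rvest calls, here is a minimal sketch that parses a small hand-written page instead of a live website. The page content and the object names (page, heading, items) are made up for illustration; only read_html(), html_nodes(), and html_text() come from the slides.

```r
library(rvest)

# A tiny hand-written page (hypothetical content, for illustration only)
page <- read_html(paste0(
  "<html><body>",
  "<h1>This is a heading.</h1>",
  "<p>This is a paragraph.</p>",
  "<a href='https://example.com'>This is a link.</a>",
  "<ul><li>item in a list</li><li>another item</li></ul>",
  "</body></html>"
))

# Select nodes by tag name, then pull out the text inside them
heading <- page %>% html_nodes("h1") %>% html_text()
items   <- page %>% html_nodes("li") %>% html_text()

heading
items
```

Because the page is a literal string, this runs without an internet connection, which makes it a convenient way to practise before pointing the same calls at a real URL.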
Can use SelectorGadget to find the exact location. Enter vignette("selectorgadget") for an overview.
Can also skim through the raw html code looking for possible tags.
For more on HTML, check out the W3Schools tutorial
You don't need to be an expert in HTML to webscrape with rvest!
Pass HTML tags (or other CSS selectors) to html_nodes() to extract the data you are interested in. Once you have the content you are looking for, use html_text() to extract text or html_table() to get a table
mysummary <- html_nodes(myhtml, "#titleCast") # Gets everything in the element
mysummary
## {xml_nodeset (1)}
## [1] <div class="article" id="titleCast">\n <span class="rightcornerlink">\ ...
html_text(mysummary)
## [1] "\n \n Edit\n \n Cast\n \n Cast overview, first billed only:\n \n \n Marlon Brando\n \n \n ...\n \n \n Don Vito Corleone \n \n \n \n \n \n Al Pacino\n \n \n ...\n \n \n Michael Corleone \n \n \n \n \n \n James Caan\n \n \n ...\n \n \n Sonny Corleone \n \n \n \n \n \n Richard S. Castellano\n \n \n ...\n \n \n Clemenza \n \n \n (as Richard Castellano)\n \n \n \n \n \n \n Robert Duvall\n \n \n ...\n \n \n Tom Hagen \n \n \n \n \n \n Sterling Hayden\n \n \n ...\n \n \n Capt. McCluskey \n \n \n \n \n \n John Marley\n \n \n ...\n \n \n Jack Woltz \n \n \n \n \n \n Richard Conte\n \n \n ...\n \n \n Barzini \n \n \n \n \n \n Al Lettieri\n \n \n ...\n \n \n Sollozzo \n \n \n \n \n \n Diane Keaton\n \n \n ...\n \n \n Kay Adams \n \n \n \n \n \n Abe Vigoda\n \n \n ...\n \n \n Tessio \n \n \n \n \n \n Talia Shire\n \n \n ...\n \n \n Connie \n \n \n \n \n \n Gianni Russo\n \n \n ...\n \n \n Carlo \n \n \n \n \n \n John Cazale\n \n \n ...\n \n \n Fredo \n \n \n \n \n \n Rudy Bond\n \n \n ...\n \n \n Cuneo \n \n \n \n See full cast »\n \n \n \n \nView production, box office, & company info\n\n \n \n "
# Or you can combine the operations into a pipe:
myhtml %>% html_nodes("#titleCast") %>% html_text()
## [1] "\n \n Edit\n \n Cast\n \n Cast overview, first billed only:\n \n \n Marlon Brando\n \n \n ...\n \n \n Don Vito Corleone \n \n \n \n \n \n Al Pacino\n \n \n ...\n \n \n Michael Corleone \n \n \n \n \n \n James Caan\n \n \n ...\n \n \n Sonny Corleone \n \n \n \n \n \n Richard S. Castellano\n \n \n ...\n \n \n Clemenza \n \n \n (as Richard Castellano)\n \n \n \n \n \n \n Robert Duvall\n \n \n ...\n \n \n Tom Hagen \n \n \n \n \n \n Sterling Hayden\n \n \n ...\n \n \n Capt. McCluskey \n \n \n \n \n \n John Marley\n \n \n ...\n \n \n Jack Woltz \n \n \n \n \n \n Richard Conte\n \n \n ...\n \n \n Barzini \n \n \n \n \n \n Al Lettieri\n \n \n ...\n \n \n Sollozzo \n \n \n \n \n \n Diane Keaton\n \n \n ...\n \n \n Kay Adams \n \n \n \n \n \n Abe Vigoda\n \n \n ...\n \n \n Tessio \n \n \n \n \n \n Talia Shire\n \n \n ...\n \n \n Connie \n \n \n \n \n \n Gianni Russo\n \n \n ...\n \n \n Carlo \n \n \n \n \n \n John Cazale\n \n \n ...\n \n \n Fredo \n \n \n \n \n \n Rudy Bond\n \n \n ...\n \n \n Cuneo \n \n \n \n See full cast »\n \n \n \n \nView production, box office, & company info\n\n \n \n "
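The "#titleCast" selector used above targets an element by its id. As a rough sketch of how tag, id, and class selectors differ, the example below uses a made-up page; the ids, classes, and names in it are hypothetical and not taken from IMDB.

```r
library(rvest)

# Hypothetical page with one id and two classes (not real IMDB markup)
page <- read_html(paste0(
  "<html><body>",
  "<div id='cast'>",
  "<p class='actor'>Actor One</p>",
  "<p class='actor'>Actor Two</p>",
  "</div>",
  "<p class='note'>Not part of the cast</p>",
  "</body></html>"
))

by_tag   <- page %>% html_nodes("p") %>% html_text()      # every <p> on the page
by_id    <- page %>% html_nodes("#cast") %>% html_text()  # the element with id="cast"
by_class <- page %>% html_nodes(".actor") %>% html_text() # every element with class="actor"

by_tag
by_id
by_class
```

Selecting by id returns a single node whose text is everything inside it run together, which is why the #titleCast output above is one long string.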
myhtml %>% html_nodes("table") %>% html_table(header = TRUE)
## [[1]]
## Cast overview, first billed only: Cast overview, first billed only:
## 1 NA Marlon Brando
## 2 NA Al Pacino
## 3 NA James Caan
## 4 NA Richard S. Castellano
## 5 NA Robert Duvall
## 6 NA Sterling Hayden
## 7 NA John Marley
## 8 NA Richard Conte
## 9 NA Al Lettieri
## 10 NA Diane Keaton
## 11 NA Abe Vigoda
## 12 NA Talia Shire
## 13 NA Gianni Russo
## 14 NA John Cazale
## 15 NA Rudy Bond
## Cast overview, first billed only:
## 1 ...
## 2 ...
## 3 ...
## 4 ...
## 5 ...
## 6 ...
## 7 ...
## 8 ...
## 9 ...
## 10 ...
## 11 ...
## 12 ...
## 13 ...
## 14 ...
## 15 ...
## Cast overview, first billed only:
## 1 Don Vito Corleone
## 2 Michael Corleone
## 3 Sonny Corleone
## 4 Clemenza \n \n \n (as Richard Castellano)
## 5 Tom Hagen
## 6 Capt. McCluskey
## 7 Jack Woltz
## 8 Barzini
## 9 Sollozzo
## 10 Kay Adams
## 11 Tessio
## 12 Connie
## 13 Carlo
## 14 Fredo
## 15 Cuneo
You may want to remove all columns except Actor and Role.
Here is some sample code to clean this, but there are many ways to do the same:
library(stringr)
library(magrittr)
mydat <- myhtml %>%
  html_nodes("table") %>%
  extract2(1) %>% # our table is actually nested within a list element [[]]
  html_table(header = TRUE)
mydat <- mydat[, c(2, 4)] # keep only the actor and role columns
names(mydat) <- c("Actor", "Role")
mydat <- mydat %>%
  mutate(Actor = Actor,
         Role = str_extract(Role, "[^\\n]+")) # [^\\n]+ : one or more (+) characters that are anything but ([^]) \n
mydat
## Actor Role
## 1 Marlon Brando Don Vito Corleone
## 2 Al Pacino Michael Corleone
## 3 James Caan Sonny Corleone
## 4 Richard S. Castellano Clemenza
## 5 Robert Duvall Tom Hagen
## 6 Sterling Hayden Capt. McCluskey
## 7 John Marley Jack Woltz
## 8 Richard Conte Barzini
## 9 Al Lettieri Sollozzo
## 10 Diane Keaton Kay Adams
## 11 Abe Vigoda Tessio
## 12 Talia Shire Connie
## 13 Gianni Russo Carlo
## 14 John Cazale Fredo
## 15 Rudy Bond Cuneo
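To make the regular expression in the cleaning step concrete, here is a small sketch applied to a single string shaped like the Role value in row 4. The string is typed by hand (with real newline characters) rather than scraped, and the str_trim() step is an optional extra that is not part of the code above.

```r
library(stringr)

# A role string shaped like row 4 of the scraped table (hand-typed for illustration)
role <- "Clemenza \n  \n  (as Richard Castellano)"

# "[^\\n]+" matches the first run of characters that are not a newline
first_line <- str_extract(role, "[^\\n]+")
first_line

# Optional extra step: str_trim() also drops the trailing space
role_clean <- str_trim(first_line)
role_clean
```

The extracted value keeps the space before the newline, which is why "Clemenza" appears with trailing whitespace in the cleaned table unless it is trimmed as well.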
Follow the same steps to get the cast of the Wizard of Oz movie.
Clean up the output the best you can. Feel free to consult the stringr cheatsheet
Can write a loop to get cast of a long list of movies
Can write a loop to get any tables from any website/websites
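A loop of this kind might look like the sketch below. To keep it runnable offline, the "pages" are hand-written HTML strings with made-up movie names and cast entries; with real websites, each element would instead be a URL passed to read_html(). The names pages, casts, and all_casts are illustrative only.

```r
library(rvest)

# Hypothetical stand-ins for downloaded pages (in practice: a vector of URLs)
pages <- list(
  movie_a = paste0(
    "<html><body><table>",
    "<tr><th>Actor</th><th>Role</th></tr>",
    "<tr><td>Actor One</td><td>Role One</td></tr>",
    "<tr><td>Actor Two</td><td>Role Two</td></tr>",
    "</table></body></html>"
  ),
  movie_b = paste0(
    "<html><body><table>",
    "<tr><th>Actor</th><th>Role</th></tr>",
    "<tr><td>Actor Three</td><td>Role Three</td></tr>",
    "</table></body></html>"
  )
)

casts <- list()
for (movie in names(pages)) {
  # Same steps as for a single page: read, find the tables, convert them
  tables <- read_html(pages[[movie]]) %>%
    html_nodes("table") %>%
    html_table(header = TRUE)
  cast <- as.data.frame(tables[[1]]) # first table on the page
  cast$movie <- movie                # remember which page each row came from
  casts[[movie]] <- cast
}

# Stack the per-movie tables into one data frame
all_casts <- do.call(rbind, casts)
all_casts
```

Storing each result in a named list and binding at the end keeps the loop body identical to the single-page workflow.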
html_text
html_text(x) extracts all text from the nodeset x
read_html(myurl) %>%
  html_nodes("p") %>% # first get all the paragraphs
  html_nodes("a") %>% # then get all the links in those paragraphs
  html_text()         # get the linked text only

myurl <- "https://www.tripadvisor.ca/Attraction_Review-g155019-d155483-Reviews-CN_Tower-Toronto_Ontario.html"
read_html(myurl) %>%
  html_nodes(".cPQsENeY") %>%
  html_text()
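The paragraph-then-link chain above can be tried on a small hand-written page, as in the sketch below. The page and its URLs are made up for illustration. The html_attr() call is an addition not shown in the slides; it is rvest's function for reading an attribute, here used to get the link targets alongside the link text.

```r
library(rvest)

# Hypothetical page: two links inside a paragraph, one outside
page <- read_html(paste0(
  "<html><body>",
  "<p>See the <a href='https://example.com/a'>first link</a> and the ",
  "<a href='https://example.com/b'>second link</a>.</p>",
  "<div><a href='https://example.com/c'>outside a paragraph</a></div>",
  "</body></html>"
))

links <- page %>%
  html_nodes("p") %>% # first get all the paragraphs
  html_nodes("a")     # then get all the links in those paragraphs

link_text <- html_text(links)         # the linked text only
link_url  <- html_attr(links, "href") # where each link points

link_text
link_url
```

The third link is skipped because it is not inside a <p> element, which shows how chaining html_nodes() calls narrows the selection.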
html_table
html_table(x, header, fill) - parse html table(s) from x into a data frame or list of data frames
myurl <- "https://electionresults.utah.gov/elections/countyCount/399789495"
read_html(myurl) %>%
  html_nodes("table") %>% # get the tables
  html_table(header = FALSE) %>%
  head()
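To illustrate what the header argument changes, here is a minimal sketch on a hand-written table. The county names and vote counts are invented and have nothing to do with the Utah results page.

```r
library(rvest)

# Hypothetical table whose first row holds the column names
page <- read_html(paste0(
  "<html><body><table>",
  "<tr><td>County</td><td>Votes</td></tr>",
  "<tr><td>Alpha</td><td>120</td></tr>",
  "<tr><td>Beta</td><td>75</td></tr>",
  "</table></body></html>"
))

tbl_nodes <- page %>% html_nodes("table")

# html_table() on a nodeset returns a list, so take the first element
with_header <- html_table(tbl_nodes, header = TRUE)[[1]]  # first row becomes the column names
no_header   <- html_table(tbl_nodes, header = FALSE)[[1]] # first row is kept as data

with_header
no_header
```

With header = FALSE the names end up as an ordinary data row, so columns that would otherwise be numeric are read as text.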