class: center, middle, inverse, title-slide

# POL 478H1 F
## Intro to Webscraping
### Olga Chyzh [www.olgachyzh.com]

---
## Outline

- What is webscraping?

- Webscraping using `rvest`

- Examples
  + IMDB show cast
  + 2020 US election returns

---
## What is Webscraping?

- Extract data from websites
  + Tables
  + Links to other websites
  + Text

<img src="./images/USHouse.png" width="100%" />

---
## Why Webscrape?

- Because copy-paste is tedious

- Because it's fast

- Because you can automate it

- Because it helps reduce/catch errors

<img src="./images/copypaste.png" width="50%" style="display: block; margin: auto;" />

---
## Webscraping: Broad Strokes

- Most websites are written in `HTML`

- `HTML` code is messy and difficult to parse manually

- We will use R to
  - read the `HTML` (or other) code
  - clean it up to extract the data we need

- You need only a very rudimentary understanding of `HTML`

---
## Webscraping with `rvest`: Step-by-Step Start Guide

Install all tidyverse packages:

```r
# check if you already have it
library(tidyverse)
library(rvest)
# if not:
install.packages("tidyverse")
library(tidyverse) # only calls the "core" of tidyverse
library(rvest) # rvest is not part of the core, so load it separately
```

---
## Step 1: What Website Are You Scraping?

```r
# character variable containing the url you want to scrape
myurl <- "https://www.imdb.com/title/tt0068646/"
```

---
## Step 2: Read `HTML` into R

- `HTML` is HyperText Markup Language.

- Go to any [website](https://www.imdb.com/title/tt0068646/), right click, and click "View Page Source" to see the HTML

```r
library(tidyverse)
library(rvest)
myhtml <- read_html(myurl)
myhtml
```

```
## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="styleguide-v2" class="fixed">\n <img height="1" widt ...
```

---
## Step 3: Where in the HTML Code Are Your Data?

- You need to find your data within the `myhtml` object.
- In `HTML`, all objects, such as tables, paragraphs, hyperlinks, and headings, are inside "tags" that are surrounded by `<>` symbols

- Examples of tags:
  - `<p>` This is a paragraph. `</p>`
  - `<h1>` This is a heading. `</h1>`
  - `<a>` This is a link. `</a>`
  - `<li>` item in a list `</li>`
  - `<table>` This is a table. `</table>`

- You can use [Selector Gadget](http://selectorgadget.com/) to find the exact location. Enter `vignette("selectorgadget")` for an overview.

- You can also skim through the raw HTML code looking for possible tags.

- For more on HTML, check out the [W3schools' tutorial](http://www.w3schools.com/html/html_intro.asp)

- You don't need to be an expert in HTML to webscrape with `rvest`!

---
## Step 4: Pass HTML Tags to `html_nodes()` to Extract Your Data of Interest

Once you have the content you are looking for, use `html_text()` to extract text or `html_table()` to get a table.

```r
mysummary <- html_nodes(myhtml, "#titleCast") # gets everything in the element with id "titleCast"
mysummary
```

```
## {xml_nodeset (1)}
## [1] <div class="article" id="titleCast">\n <span class="rightcornerlink">\ ...
```

```r
html_text(mysummary)
```

```
## [1] "\n \n Edit\n \n Cast\n \n Cast overview, first billed only:\n \n \n Marlon Brando\n \n \n ...\n \n \n Don Vito Corleone \n \n \n \n \n \n Al Pacino\n \n \n ...\n \n \n Michael Corleone \n \n \n \n \n \n James Caan\n \n \n ...\n \n \n Sonny Corleone \n \n \n \n \n \n Richard S. Castellano\n \n \n ...\n \n \n Clemenza \n \n \n (as Richard Castellano)\n \n \n \n \n \n \n Robert Duvall\n \n \n ...\n \n \n Tom Hagen \n \n \n \n \n \n Sterling Hayden\n \n \n ...\n \n \n Capt.
McCluskey \n \n \n \n \n \n John Marley\n \n \n ...\n \n \n Jack Woltz \n \n \n \n \n \n Richard Conte\n \n \n ...\n \n \n Barzini \n \n \n \n \n \n Al Lettieri\n \n \n ...\n \n \n Sollozzo \n \n \n \n \n \n Diane Keaton\n \n \n ...\n \n \n Kay Adams \n \n \n \n \n \n Abe Vigoda\n \n \n ...\n \n \n Tessio \n \n \n \n \n \n Talia Shire\n \n \n ...\n \n \n Connie \n \n \n \n \n \n Gianni Russo\n \n \n ...\n \n \n Carlo \n \n \n \n \n \n John Cazale\n \n \n ...\n \n \n Fredo \n \n \n \n \n \n Rudy Bond\n \n \n ...\n \n \n Cuneo \n \n \n \n See full cast »\n \n \n \n \nView production, box office, & company info\n\n \n \n "
```

```r
# Or you can combine the operations into a pipe:
myhtml %>%
  html_nodes("#titleCast") %>%
  html_text()
```

```
## [1] "\n \n Edit\n \n Cast\n \n Cast overview, first billed only:\n \n \n Marlon Brando\n \n \n ...\n \n \n Don Vito Corleone \n \n \n \n \n \n Al Pacino\n \n \n ...\n \n \n Michael Corleone \n \n \n \n \n \n James Caan\n \n \n ...\n \n \n Sonny Corleone \n \n \n \n \n \n Richard S. Castellano\n \n \n ...\n \n \n Clemenza \n \n \n (as Richard Castellano)\n \n \n \n \n \n \n Robert Duvall\n \n \n ...\n \n \n Tom Hagen \n \n \n \n \n \n Sterling Hayden\n \n \n ...\n \n \n Capt.
McCluskey \n \n \n \n \n \n John Marley\n \n \n ...\n \n \n Jack Woltz \n \n \n \n \n \n Richard Conte\n \n \n ...\n \n \n Barzini \n \n \n \n \n \n Al Lettieri\n \n \n ...\n \n \n Sollozzo \n \n \n \n \n \n Diane Keaton\n \n \n ...\n \n \n Kay Adams \n \n \n \n \n \n Abe Vigoda\n \n \n ...\n \n \n Tessio \n \n \n \n \n \n Talia Shire\n \n \n ...\n \n \n Connie \n \n \n \n \n \n Gianni Russo\n \n \n ...\n \n \n Carlo \n \n \n \n \n \n John Cazale\n \n \n ...\n \n \n Fredo \n \n \n \n \n \n Rudy Bond\n \n \n ...\n \n \n Cuneo \n \n \n \n See full cast »\n \n \n \n \nView production, box office, & company info\n\n \n \n "
```

---
## Most Often, We Want to Extract a Table

```r
myhtml %>%
  html_nodes("table") %>%
  html_table(header = TRUE)
```

```
## [[1]]
##    Cast overview, first billed only: Cast overview, first billed only:
## 1                                 NA                     Marlon Brando
## 2                                 NA                         Al Pacino
## 3                                 NA                        James Caan
## 4                                 NA             Richard S. Castellano
## 5                                 NA                     Robert Duvall
## 6                                 NA                   Sterling Hayden
## 7                                 NA                       John Marley
## 8                                 NA                     Richard Conte
## 9                                 NA                       Al Lettieri
## 10                                NA                      Diane Keaton
## 11                                NA                        Abe Vigoda
## 12                                NA                       Talia Shire
## 13                                NA                      Gianni Russo
## 14                                NA                       John Cazale
## 15                                NA                         Rudy Bond
##    Cast overview, first billed only:
## 1                                ...
## 2                                ...
## 3                                ...
## 4                                ...
## 5                                ...
## 6                                ...
## 7                                ...
## 8                                ...
## 9                                ...
## 10                               ...
## 11                               ...
## 12                               ...
## 13                               ...
## 14                               ...
## 15                               ...
##            Cast overview, first billed only:
## 1                          Don Vito Corleone
## 2                           Michael Corleone
## 3                             Sonny Corleone
## 4  Clemenza \n \n \n (as Richard Castellano)
## 5                                  Tom Hagen
## 6                            Capt. McCluskey
## 7                                 Jack Woltz
## 8                                    Barzini
## 9                                   Sollozzo
## 10                                 Kay Adams
## 11                                    Tessio
## 12                                    Connie
## 13                                     Carlo
## 14                                     Fredo
## 15                                     Cuneo
```

---
## Step 5: Save and Clean the Data

- You may want to remove all columns except Actor and Role.
- Here is some sample code to clean this, but there are many ways to do the same:

```r
library(stringr)
library(magrittr)
mydat <- myhtml %>%
  html_nodes("table") %>%
  extract2(1) %>% # our table is the first element of the node set, so pull it out with [[1]]
  html_table(header = TRUE)
mydat <- mydat[, c(2, 4)]
names(mydat) <- c("Actor", "Role")
mydat <- mydat %>%
  mutate(Role = str_extract(Role, "[^\\n]+")) # keep the first run of one or more (+) characters that are not (^) a newline (\n)
mydat
```

```
##                    Actor              Role
## 1          Marlon Brando Don Vito Corleone
## 2              Al Pacino  Michael Corleone
## 3             James Caan    Sonny Corleone
## 4  Richard S. Castellano          Clemenza
## 5          Robert Duvall         Tom Hagen
## 6        Sterling Hayden   Capt. McCluskey
## 7            John Marley        Jack Woltz
## 8          Richard Conte           Barzini
## 9            Al Lettieri          Sollozzo
## 10          Diane Keaton         Kay Adams
## 11            Abe Vigoda            Tessio
## 12           Talia Shire            Connie
## 13          Gianni Russo             Carlo
## 14           John Cazale             Fredo
## 15             Rudy Bond             Cuneo
```

---
## Your Turn (5 min)

- Follow the same steps to get the cast of the movie The Wizard of Oz.

- Clean up the output as best you can. Feel free to consult the [`stringr` cheatsheet](https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf)

---
## Why Is This Useful?

- You can write a loop to get the cast of a long list of movies

- You can write a loop to get tables from one or more websites

---
## Key Functions: `html_text`

- `html_text(x)` extracts all text from the nodeset `x`

- Good for cleaning output

```r
read_html(myurl) %>%
  html_nodes("p") %>% # first get all the paragraphs
  html_nodes("a") %>% # then get all the links in those paragraphs
  html_text() # get the linked text only

myurl <- "https://www.tripadvisor.ca/Attraction_Review-g155019-d155483-Reviews-CN_Tower-Toronto_Ontario.html"
read_html(myurl) %>%
  html_nodes(".cPQsENeY") %>%
  html_text()
```

---
## Key Functions: `html_table`

- `html_table(x, header, fill)`
  - parses HTML table(s) from `x` into a data frame or list of data frames

- The structure of HTML makes finding and extracting tables easy!

```r
myurl <- "https://electionresults.utah.gov/elections/countyCount/399789495"
read_html(myurl) %>%
  html_nodes("table") %>% # get the tables
  html_table(header = FALSE) %>%
  head()
```
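
---
## Key Functions: An Offline Sketch

Live pages change or disappear, so it can help to try `html_text()` and `html_table()` on a small HTML string first. The snippet below is only a sketch: the HTML in `toy_html`, the `example.com` link, and the `id="cast"` attribute are made up for illustration and are not taken from IMDB or any other real site.

```r
library(rvest)

# Hypothetical page: a made-up HTML string standing in for a real website
toy_html <- read_html('
  <html><body>
    <p>See the <a href="https://example.com/cast">full cast</a>.</p>
    <table id="cast">
      <tr><th>Actor</th><th>Role</th></tr>
      <tr><td>Marlon Brando</td><td>Don Vito Corleone</td></tr>
      <tr><td>Al Pacino</td><td>Michael Corleone</td></tr>
    </table>
  </body></html>')

# html_text(): the text of every link that sits inside a paragraph
link_text <- toy_html %>%
  html_nodes("p a") %>%
  html_text()

# html_table(): returns a list with one data frame per matched <table> node
cast <- toy_html %>%
  html_nodes("#cast") %>%
  html_table(header = TRUE)

# pull the single table out of the list
cast_df <- cast[[1]]
cast_df
```

Once the selectors work on a toy string like this, the same pipe can be pointed at `read_html(myurl)` for a real page.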