class: center, middle, inverse, title-slide

# POL 478H1 F
## Loops
### Olga Chyzh [www.olgachyzh.com]

---

## Writing Reproducible Code

1. Why is it important to write easy-to-follow code?

2. What are some ways to enhance the readability of your code?

More Reading:

- [Karl Broman's website](https://kbroman.org/steps2rr/)

- [Christine Bahlai's blog](https://practicaldatamanagement.wordpress.com/2014/10/23/baby-steps-for-the-open-curious/)

---

## Why Use Loops?

Loops are a way to shorten repetitive code like this:

```r
ALA<-read.table("ALA_PctResults20161108.txt", sep="\t", header=FALSE, fill=TRUE)
BRA<-read.table("BRA_PctResults20161108.txt", sep="\t", header=FALSE, fill=TRUE)
CAL<-read.table("CAL_PctResults20161108.txt", sep="\t", header=FALSE, fill=TRUE)
mydata<-rbind(ALA,BRA,CAL)
```

into this:

```r
myfilenames<-c("ALA_PctResults20161108.txt","BRA_PctResults20161108.txt","CAL_PctResults20161108.txt")
mydata<-NULL  # empty object that will hold the combined data
for (i in myfilenames){
  d<-read.table(i, sep="\t", header=FALSE, fill=TRUE)
  mydata<-rbind(mydata,d)  # append the current file to the result
}
```

---

## Loops Help

- Shorten/clarify the code
- Reduce the probability of typing errors
- Speed up coding
- Can loop over indices, names, and values

---

## Loop Components

1. The wrapper `for (variable in vector){}`
2. An (initially empty) object to store the result, usually created before the loop
3. A series of commands that will be applied to each element in the `vector` indexed by `variable`
4. The last line usually (but not always) appends the result to the empty object we started with in 2.

---

## Example 1: Florida Election Returns 2016

1. Download the zip file that contains Florida 2016 Election Returns [here](https://pol478.netlify.app/materials/florida2016.zip);
2. Unzip the data and set your working directory to the folder where you saved it;
3. We are going to write a loop that opens each of the 68 files in this folder and combines them into a single dataset.


```r
myfilenames<-list.files()  # names of all files in the working directory
mydata<-NULL
for (i in myfilenames){
  d<-read.table(i, quote="",comment.char="", sep="\t", header=FALSE)
  mydata<-rbind(mydata,d)
}
```

---

## Options for `read.table`

- `header`--whether the first row of the data contains variable names
- `quote`--which characters, if any, are treated as quotes that enclose character values. Setting `quote=""` reads quotes as a part of the text (e.g., a name with an apostrophe). If not specified, the function tries to read text within quotes as character.

```r
d<-read.table(myfilenames[18], quote="",sep="\t", header=FALSE, fill=TRUE)
```

- `comment.char`--the default behavior for `read.table` is to treat \# as the beginning of a comment and ignore what follows. We need to turn this option off, as some of the files contain \# to denote the district number.

```r
d<-read.table(myfilenames[62], quote="", comment.char="",sep="\t", header=FALSE, fill=TRUE)
```

- `sep`--the column separator; the default is white space, but in this case it is tab.

---

## Your Turn

Edit our loop so that we only keep data on `Trump Vote` and `Clinton Vote` for each county. Use a pipe.

---

## Example 3: World Bank Data

1. Change your working directory to where you stored the WDI data from last week.
2. We can get these data in the long form the way we did in the homework or by using a loop with the indicator name as our variable.


The old way:

```r
d<-read_csv("WDIData.csv", col_names=T) %>%
  filter(`Indicator Name` %in% c("GDP (constant 2010 US$)","Foreign direct investment, net inflows (% of GDP)")) %>%
  select(-`Indicator Code`,-`Country Code`,-`2020`,-X66) %>%
  slice(-(1:94)) %>%
  pivot_longer(`1960`:`2019`,names_to="year", values_to="Indicator") %>%
  pivot_wider(names_from="Indicator Name", values_from="Indicator")
```

The new way:

```r
indname<-c("GDP (constant 2010 US$)","Foreign direct investment, net inflows (% of GDP)")
mydata<-NULL
for (ind in indname){
  # keep one indicator at a time and reshape it to long form
  d<-read_csv("WDIData.csv", col_names=T) %>%
    filter(`Indicator Name`==ind) %>%
    select(-`Indicator Code`,-`Country Code`,-`2020`,-X66) %>%
    slice(-(1:47)) %>%
    pivot_longer(`1960`:`2019`,names_to="year", values_to="Indicator")
  mydata<-rbind(mydata,d)
}
# one column per indicator
mydata<-mydata %>% pivot_wider(names_from="Indicator Name", values_from="Indicator")
```

---

## Automate Indicator Names

```r
d<-read_csv("WDIData.csv", col_names=T)
indname<-unique(d$`Indicator Name`)  # all indicator names in the data
mydata<-NULL
for (ind in indname[1:5]){
  d<-read_csv("WDIData.csv", col_names=T) %>%
    filter(`Indicator Name`==ind) %>%
    select(-`Indicator Code`,-`Country Code`,-`2020`,-X66) %>%
    slice(-(1:47)) %>%
    pivot_longer(`1960`:`2019`,names_to="year", values_to="Indicator")
  mydata<-rbind(mydata,d)
}
mydata<-mydata %>% pivot_wider(names_from="Indicator Name", values_from="Indicator")
```

---

## A Word of Caution

- Loops are not always faster (see the example above)
- R built-in loop functions, such as `apply`, are generally faster, but (beyond the simplest cases) require more advanced programming.
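
To make the second point concrete, here is a minimal sketch of the same "read several files and stack them" task written both as a loop and with `lapply`, one of R's built-in loop functions. The temporary folder, the file names (`AAA.txt`, etc.), and the toy `county`/`votes` values are hypothetical and created only for this illustration; they are not the Florida data.

```r
# Hypothetical toy data: three small tab-separated files in a temporary folder
tmpdir <- tempfile("loopdemo")
dir.create(tmpdir)
for (nm in c("AAA", "BBB", "CCC")) {
  toy <- data.frame(county = nm, votes = 1:2)
  write.table(toy, file.path(tmpdir, paste0(nm, ".txt")),
              sep = "\t", col.names = FALSE, row.names = FALSE, quote = FALSE)
}
myfilenames <- list.files(tmpdir, full.names = TRUE)

# Loop version: grow the result one file at a time
loopdata <- NULL
for (i in myfilenames) {
  d <- read.table(i, sep = "\t", header = FALSE)
  loopdata <- rbind(loopdata, d)
}

# lapply version: read every file into a list, then bind the list once
applydata <- do.call(rbind, lapply(myfilenames, read.table,
                                   sep = "\t", header = FALSE))

# Check that the two approaches produce the same data
all.equal(loopdata, applydata)
```

The `lapply` version is shorter, but the extra arguments passed through to `read.table` and the `do.call(rbind, ...)` step are less transparent than the explicit loop, which is the trade-off this slide describes.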

---

## Loops Are Invaluable

- Long repetitive scripts
- Working with network data

---

## Your Turn 1

- Convert the following repeated code into a loop:

```r
library(classdata)
data("terr_attacks.wide")
a<-mean(terr_attacks.wide[,5],na.rm=T)
b<-mean(terr_attacks.wide[,6],na.rm=T)
d<-mean(terr_attacks.wide[,7],na.rm=T)
e<-mean(terr_attacks.wide[,8],na.rm=T)
f<-mean(terr_attacks.wide[,9],na.rm=T)
g<-mean(terr_attacks.wide[,10],na.rm=T)
h<-mean(terr_attacks.wide[,11],na.rm=T)
i<-mean(terr_attacks.wide[,12],na.rm=T)
j<-mean(terr_attacks.wide[,13],na.rm=T)
k<-mean(terr_attacks.wide[,14],na.rm=T)
l<-mean(terr_attacks.wide[,15],na.rm=T)
m<-mean(terr_attacks.wide[,16],na.rm=T)
mymeans<-c(a,b,d,e,f,g,h,i,j,k,l,m)
```

- Now get the means of these variables using `summarise`
- Which one is easier/faster?

---

## Your Turn 2

Convert the following repeated code into a loop:

```r
library(classdata)
data("terr_attacks.wide")
a<-mean(terr_attacks.wide$GDPpc,na.rm=T)
b<-mean(terr_attacks.wide$population,na.rm=T)
d<-mean(terr_attacks.wide$tradeofgdp,na.rm=T)
e<-mean(terr_attacks.wide$`Hostage Taking (Kidnapping)`,na.rm=T)
f<-mean(terr_attacks.wide$Hijacking,na.rm=T)
mymeans<-c(a,b,d,e,f)
```