class: center, middle, inverse, title-slide # POL 478H1 F ## Reshaping ### Olga Chyzh [www.olgachyzh.com] --- # Reshaping Data - `tidyr: pivot_longer` and `pivot_wider` (former `gather` and `spread`) --- ## Multiple ways to store the same information Option #1 ``` ## province X2017 X2010 X2000 ## 2 Alberta 81.5 81.3 79.4 ## 3 British Columbia 82.2 82.1 80.3 ## 4 Manitoba 80.0 79.9 78.0 ## 5 New Brunswick 80.7 80.7 78.9 ## 6 Newfoundland and Labrador 79.8 79.6 77.3 ## 7 Northwest Territories 77.0 77.8 75.5 ``` Option #2 ``` ## province year life_exp ## 1 Alberta 2017 81.5 ## 2 Alberta 2010 81.3 ## 3 Alberta 2000 79.4 ## 4 Alberta 1990 77.8 ## 5 Alberta 1980 75.0 ## 6 British Columbia 2017 82.2 ``` Source: [Wikipedia](https://en.wikipedia.org/wiki/List_of_Canadian_provinces_and_territories_by_life_expectancy) --- ## Example: Canada Life Expectancy Data - Copy the data on Canada's life expectancy available [here](https://github.com/ochyzh/POLS478/blob/master/public/materials/canada_le_long.txt) into a text doc. - Since this dataset is stored in txt format, you can load it using `read.table()` ```r canada_le<-read.table("canada_le_long.txt", header=T) ``` Or, you can skip the copy/paste: ```r myurl<-"https://raw.githubusercontent.com/ochyzh/POLS478/master/public/materials/canada_le_long.txt" canada_le<-read.table(myurl, header=T) ``` --- ## Reshape from long to wide ``` ## province year life_exp ## 1 Alberta 2017 81.5 ## 2 Alberta 2010 81.3 ## 3 Alberta 2000 79.4 ## 4 Alberta 1990 77.8 ## 5 Alberta 1980 75.0 ## 6 British Columbia 2017 82.2 ``` ```r canada_le %>% pivot_wider(names_from=year, values_from=life_exp) ``` Or: ```r canada_le %>% spread(key=year, value=life_exp) ``` ``` ## province 1980 1990 2000 2010 2017 ## 1 Alberta 75.0 77.8 79.4 81.3 81.5 ## 2 British Columbia 76.0 78.1 80.3 82.1 82.2 ## 3 Manitoba 75.3 77.4 78.0 79.9 80.0 ## 4 New Brunswick 74.5 77.3 78.9 80.7 80.7 ## 5 Newfoundland and Labrador 74.8 76.1 77.3 79.6 79.8 ## 6 Northwest Territories 68.7 72.4 75.5 77.8 77.0 ## 7 Nova Scotia 74.6 76.7 78.5 80.3 80.5 ## 8 Nunavut 62.4 65.5 67.5 70.4 71.6 ## 9 Ontario 75.4 77.8 79.4 81.7 82.4 ## 10 Prince Edward Island 76.2 76.9 78.4 81.3 81.8 ## 11 Quebec 74.4 77.0 79.0 81.4 82.6 ## 12 Saskatchewan 75.9 77.9 78.5 79.6 80.3 ## 13 Yukon 69.8 74.4 75.0 77.3 77.8 ``` ```r canada_le.wide<-canada_le %>% spread(key=year, value=life_exp) ``` --- ## Reshape from wide to long ``` ## province 1980 1990 2000 2010 2017 ## 1 Alberta 75.0 77.8 79.4 81.3 81.5 ## 2 British Columbia 76.0 78.1 80.3 82.1 82.2 ## 3 Manitoba 75.3 77.4 78.0 79.9 80.0 ## 4 New Brunswick 74.5 77.3 78.9 80.7 80.7 ## 5 Newfoundland and Labrador 74.8 76.1 77.3 79.6 79.8 ## 6 Northwest Territories 68.7 72.4 75.5 77.8 77.0 ``` ```r canada_le.wide %>% pivot_longer(cols=c(2:6),names_to="year",values_to="life_exp") ``` Or: ```r canada_le.wide %>% gather(key="year",value="life_exp",-province) ``` ``` ## province year life_exp ## 1 Alberta 1980 75.0 ## 2 British Columbia 1980 76.0 ## 3 Manitoba 1980 75.3 ## 4 New Brunswick 1980 74.5 ## 5 Newfoundland and Labrador 1980 74.8 ## 6 Northwest Territories 1980 68.7 ## 7 Nova Scotia 1980 74.6 ## 8 Nunavut 1980 62.4 ## 9 Ontario 1980 75.4 ## 10 Prince Edward Island 1980 76.2 ## 11 Quebec 1980 74.4 ## 12 Saskatchewan 1980 75.9 ## 13 Yukon 1980 69.8 ## 14 Alberta 1990 77.8 ## 15 British Columbia 1990 78.1 ## 16 Manitoba 1990 77.4 ## 17 New Brunswick 1990 77.3 ## 18 Newfoundland and Labrador 1990 76.1 ## 19 Northwest Territories 1990 72.4 ## 20 Nova Scotia 1990 76.7 ## 21 Nunavut 1990 65.5 ## 22 Ontario 1990 77.8 ## 23 Prince Edward Island 1990 76.9 ## 24 Quebec 1990 77.0 ## 25 Saskatchewan 1990 77.9 ## 26 Yukon 1990 74.4 ## 27 Alberta 2000 79.4 ## 28 British Columbia 2000 80.3 ## 29 Manitoba 2000 78.0 ## 30 New Brunswick 2000 78.9 ## 31 Newfoundland and Labrador 2000 77.3 ## 32 Northwest Territories 2000 75.5 ## 33 Nova Scotia 2000 78.5 ## 34 Nunavut 2000 67.5 ## 35 Ontario 2000 79.4 ## 36 Prince Edward Island 2000 78.4 ## 37 Quebec 2000 79.0 ## 38 Saskatchewan 2000 78.5 ## 39 Yukon 2000 75.0 ## 40 Alberta 2010 81.3 ## 41 British Columbia 2010 82.1 ## 42 Manitoba 2010 79.9 ## 43 New Brunswick 2010 80.7 ## 44 Newfoundland and Labrador 2010 79.6 ## 45 Northwest Territories 2010 77.8 ## 46 Nova Scotia 2010 80.3 ## 47 Nunavut 2010 70.4 ## 48 Ontario 2010 81.7 ## 49 Prince Edward Island 2010 81.3 ## 50 Quebec 2010 81.4 ## 51 Saskatchewan 2010 79.6 ## 52 Yukon 2010 77.3 ## 53 Alberta 2017 81.5 ## 54 British Columbia 2017 82.2 ## 55 Manitoba 2017 80.0 ## 56 New Brunswick 2017 80.7 ## 57 Newfoundland and Labrador 2017 79.8 ## 58 Northwest Territories 2017 77.0 ## 59 Nova Scotia 2017 80.5 ## 60 Nunavut 2017 71.6 ## 61 Ontario 2017 82.4 ## 62 Prince Edward Island 2017 81.8 ## 63 Quebec 2017 82.6 ## 64 Saskatchewan 2017 80.3 ## 65 Yukon 2017 77.8 ``` --- ## Your Turn 1. Use `pivot_wider` or `spread` to reshape `terr_attacks` data into a wide format. 2. Use `pivot_longer` or `gather` to reshape `terr_attacks.wide` into a long format. 3. Is `terr_attacks.wide` the only way to reshape `terr_attacks` into a wide form? What are some other ways? Try them. --- ## Tips for Reshaping Wide (Advanced) 1. Identify the reason for repeating observations (e.g., repeating countries over multiple years, repeating country-years for each attack type). 2. Determine what variables vary and what variables stay the same. For example, in the `terr_data` dataset, variables `country`, `ccode`, and `cabb` do not vary by `year` (for the same country), but variables `num_attacks`,`GDPpc`, `population`, `tradeofgdp`, and `polity` do. So if you wanted to spread values across columns so that each year is in its own column, you would have to spread values of not just `num_attacks`, but all those other time-varying variables as well. 3. Decide what variable you would like to spread across multiple columns (e.g., `num_attacks` so that each year is in its own column )---then the variable you are spreading becomes your `values_from` argument, and the variable that describes these new columns is your `names_from` argument. If more than one variable varies by year, in this example, you must specify all of the varying variables as `values_from` (e.g., `values_from=c(num_attacks,population,GDPpc,tradeofgdp,polity2)`) or remove them from the data (`select(-population,-GDPpc,-tradeofgdp,-polity2)`). If you do this incorrectly, there will be a lot of missing values in the new dataset. --- ## Tips for Reshaping Long 1. Determine the reason for repeating columns (e.g., each attack type is in a separate column). 2. Specify the repeating columns as your `cols` argument, your `names_to` argument is just the name you want to give to the variable that will contain the names of the repeating columns, and your `values_to` argument is the name you want to give to the variable that will contain all the values from these columns. 3. Try your code without saving the output into a new object (view it on the screen only) until it works. --- ## Messy Data 1: World Development Indicators - Download the World Development Indicators Data directly from [the World Bank](http://datatopics.worldbank.org/world-development-indicators/). Click on `CSV download` under `Bulk Downloads.` The file is quite large, do give it some time. The file you need is `WDIdata.csv` (feel free to delete the rest). - As you can see if you open it, this file is very large and unwieldy, yet it contains information on key variables used in Social Sciences. - We are going to clean up these data in the following ways: 1. Extract just the data on GDP (constant 2010 USD). Remove all the unnecessary rows and columns. 2. Convert these data from wide to long form. 3. Add a variable `ccode` that will allow us to merge these data with other datasets. --- ## Your Turn Take a minute to open the file and look at it. Can you complete steps 1 and 2? --- ## Adding `ccode` Package `countrycode` ```r library(countrycode) d$ccode<-countrycode(d$`Country Name`,'country.name','cown') ``` ## Your Turn 1. Edit the code we used for the previous exercise, so as to extract GDP and `Foreign direct investment, net inflows (% of GDP)`. 2. Use `pivot_wider` and/or `pivot_longer` functions to transform these data into long format (i.e., country-years as rows; country names, country codes, year, GDP and FDI as columns)