+ - 0:00:00
Notes for current slide
Notes for next slide

POL 478H1 F

Reshaping

Olga Chyzh [www.olgachyzh.com]

1 / 12

Reshaping Data

  • tidyr: pivot_longer and pivot_wider (former gather and spread)
2 / 12

Multiple ways to store the same information

Option #1

## province X2017 X2010 X2000
## 2 Alberta 81.5 81.3 79.4
## 3 British Columbia 82.2 82.1 80.3
## 4 Manitoba 80.0 79.9 78.0
## 5 New Brunswick 80.7 80.7 78.9
## 6 Newfoundland and Labrador 79.8 79.6 77.3
## 7 Northwest Territories 77.0 77.8 75.5

Option #2

## province year life_exp
## 1 Alberta 2017 81.5
## 2 Alberta 2010 81.3
## 3 Alberta 2000 79.4
## 4 Alberta 1990 77.8
## 5 Alberta 1980 75.0
## 6 British Columbia 2017 82.2

Source: Wikipedia

3 / 12

Example: Canada Life Expectancy Data

  • Copy the data on Canada's life expectancy available here into a text doc.
  • Since this dataset is stored in txt format, you can load it using read.table()
canada_le<-read.table("canada_le_long.txt", header=T)

Or, you can skip the copy/paste:

myurl<-"https://raw.githubusercontent.com/ochyzh/POLS478/master/public/materials/canada_le_long.txt"
canada_le<-read.table(myurl, header=T)
4 / 12

Reshape from long to wide

## province year life_exp
## 1 Alberta 2017 81.5
## 2 Alberta 2010 81.3
## 3 Alberta 2000 79.4
## 4 Alberta 1990 77.8
## 5 Alberta 1980 75.0
## 6 British Columbia 2017 82.2
canada_le %>% pivot_wider(names_from=year, values_from=life_exp)

Or:

canada_le %>% spread(key=year, value=life_exp)
## province 1980 1990 2000 2010 2017
## 1 Alberta 75.0 77.8 79.4 81.3 81.5
## 2 British Columbia 76.0 78.1 80.3 82.1 82.2
## 3 Manitoba 75.3 77.4 78.0 79.9 80.0
## 4 New Brunswick 74.5 77.3 78.9 80.7 80.7
## 5 Newfoundland and Labrador 74.8 76.1 77.3 79.6 79.8
## 6 Northwest Territories 68.7 72.4 75.5 77.8 77.0
## 7 Nova Scotia 74.6 76.7 78.5 80.3 80.5
## 8 Nunavut 62.4 65.5 67.5 70.4 71.6
## 9 Ontario 75.4 77.8 79.4 81.7 82.4
## 10 Prince Edward Island 76.2 76.9 78.4 81.3 81.8
## 11 Quebec 74.4 77.0 79.0 81.4 82.6
## 12 Saskatchewan 75.9 77.9 78.5 79.6 80.3
## 13 Yukon 69.8 74.4 75.0 77.3 77.8
canada_le.wide<-canada_le %>% spread(key=year, value=life_exp)
5 / 12

Reshape from wide to long

## province 1980 1990 2000 2010 2017
## 1 Alberta 75.0 77.8 79.4 81.3 81.5
## 2 British Columbia 76.0 78.1 80.3 82.1 82.2
## 3 Manitoba 75.3 77.4 78.0 79.9 80.0
## 4 New Brunswick 74.5 77.3 78.9 80.7 80.7
## 5 Newfoundland and Labrador 74.8 76.1 77.3 79.6 79.8
## 6 Northwest Territories 68.7 72.4 75.5 77.8 77.0
canada_le.wide %>% pivot_longer(cols=c(2:6),names_to="year",values_to="life_exp")

Or:

canada_le.wide %>% gather(key="year",value="life_exp",-province)
## province year life_exp
## 1 Alberta 1980 75.0
## 2 British Columbia 1980 76.0
## 3 Manitoba 1980 75.3
## 4 New Brunswick 1980 74.5
## 5 Newfoundland and Labrador 1980 74.8
## 6 Northwest Territories 1980 68.7
## 7 Nova Scotia 1980 74.6
## 8 Nunavut 1980 62.4
## 9 Ontario 1980 75.4
## 10 Prince Edward Island 1980 76.2
## 11 Quebec 1980 74.4
## 12 Saskatchewan 1980 75.9
## 13 Yukon 1980 69.8
## 14 Alberta 1990 77.8
## 15 British Columbia 1990 78.1
## 16 Manitoba 1990 77.4
## 17 New Brunswick 1990 77.3
## 18 Newfoundland and Labrador 1990 76.1
## 19 Northwest Territories 1990 72.4
## 20 Nova Scotia 1990 76.7
## 21 Nunavut 1990 65.5
## 22 Ontario 1990 77.8
## 23 Prince Edward Island 1990 76.9
## 24 Quebec 1990 77.0
## 25 Saskatchewan 1990 77.9
## 26 Yukon 1990 74.4
## 27 Alberta 2000 79.4
## 28 British Columbia 2000 80.3
## 29 Manitoba 2000 78.0
## 30 New Brunswick 2000 78.9
## 31 Newfoundland and Labrador 2000 77.3
## 32 Northwest Territories 2000 75.5
## 33 Nova Scotia 2000 78.5
## 34 Nunavut 2000 67.5
## 35 Ontario 2000 79.4
## 36 Prince Edward Island 2000 78.4
## 37 Quebec 2000 79.0
## 38 Saskatchewan 2000 78.5
## 39 Yukon 2000 75.0
## 40 Alberta 2010 81.3
## 41 British Columbia 2010 82.1
## 42 Manitoba 2010 79.9
## 43 New Brunswick 2010 80.7
## 44 Newfoundland and Labrador 2010 79.6
## 45 Northwest Territories 2010 77.8
## 46 Nova Scotia 2010 80.3
## 47 Nunavut 2010 70.4
## 48 Ontario 2010 81.7
## 49 Prince Edward Island 2010 81.3
## 50 Quebec 2010 81.4
## 51 Saskatchewan 2010 79.6
## 52 Yukon 2010 77.3
## 53 Alberta 2017 81.5
## 54 British Columbia 2017 82.2
## 55 Manitoba 2017 80.0
## 56 New Brunswick 2017 80.7
## 57 Newfoundland and Labrador 2017 79.8
## 58 Northwest Territories 2017 77.0
## 59 Nova Scotia 2017 80.5
## 60 Nunavut 2017 71.6
## 61 Ontario 2017 82.4
## 62 Prince Edward Island 2017 81.8
## 63 Quebec 2017 82.6
## 64 Saskatchewan 2017 80.3
## 65 Yukon 2017 77.8
6 / 12

Your Turn

  1. Use pivot_wider or spread to reshape terr_attacks data into a wide format.

  2. Use pivot_longer or gather to reshape terr_attacks.wide into a long format.

  3. Is terr_attacks.wide the only way to reshape terr_attacks into a wide form? What are some other ways? Try them.

7 / 12

Tips for Reshaping Wide (Advanced)

  1. Identify the reason for repeating observations (e.g., repeating countries over multiple years, repeating country-years for each attack type).

  2. Determine what variables vary and what variables stay the same. For example, in the terr_data dataset, variables country, ccode, and cabb do not vary by year (for the same country), but variables num_attacks,GDPpc, population, tradeofgdp, and polity do. So if you wanted to spread values across columns so that each year is in its own column, you would have to spread values of not just num_attacks, but all those other time-varying variables as well.

  3. Decide what variable you would like to spread across multiple columns (e.g., num_attacks so that each year is in its own column )---then the variable you are spreading becomes your values_from argument, and the variable that describes these new columns is your names_from argument. If more than one variable varies by year, in this example, you must specify all of the varying variables as values_from (e.g., values_from=c(num_attacks,population,GDPpc,tradeofgdp,polity2)) or remove them from the data (select(-population,-GDPpc,-tradeofgdp,-polity2)). If you do this incorrectly, there will be a lot of missing values in the new dataset.

8 / 12

Tips for Reshaping Long

  1. Determine the reason for repeating columns (e.g., each attack type is in a separate column).

  2. Specify the repeating columns as your cols argument, your names_to argument is just the name you want to give to the variable that will contain the names of the repeating columns, and your values_to argument is the name you want to give to the variable that will contain all the values from these columns.

  3. Try your code without saving the output into a new object (view it on the screen only) until it works.

9 / 12

Messy Data 1: World Development Indicators

  • Download the World Development Indicators Data directly from the World Bank. Click on CSV download under Bulk Downloads. The file is quite large, do give it some time. The file you need is WDIdata.csv (feel free to delete the rest).

  • As you can see if you open it, this file is very large and unwieldy, yet it contains information on key variables used in Social Sciences.

  • We are going to clean up these data in the following ways:

  1. Extract just the data on GDP (constant 2010 USD). Remove all the unnecessary rows and columns.
  2. Convert these data from wide to long form.
  3. Add a variable ccode that will allow us to merge these data with other datasets.
10 / 12

Your Turn

Take a minute to open the file and look at it. Can you complete steps 1 and 2?

11 / 12

Adding ccode

Package countrycode

library(countrycode)
d$ccode<-countrycode(d$`Country Name`,'country.name','cown')

Your Turn

  1. Edit the code we used for the previous exercise, so as to extract GDP and Foreign direct investment, net inflows (% of GDP).
  2. Use pivot_wider and/or pivot_longer functions to transform these data into long format (i.e., country-years as rows; country names, country codes, year, GDP and FDI as columns)
12 / 12

Reshaping Data

  • tidyr: pivot_longer and pivot_wider (former gather and spread)
2 / 12
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow