10 minute read

Here is our introduction to data frame manipulation using tidyr, one of the many useful packages from the tidyverse!

Data manipulation is often required when we inherit data in an excessively long or wide format.

Tidyr lets us change the format of our data between long and wide formats, so it is more useable to us (e.g., better for plotting using ggplot2, more readible, etc.).

This tutorial uses two datasets gapminder_data and gapminder_wide.

Set-up

Load the required libraries

# If not previously installed, run the following without '#':
# install.packages("tidyverse")

library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ─────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(here)
## here() starts at /Users/Kribin/Documents/Analytics/R

Set the working environment

here::here()
## [1] "/Users/Kribin/Documents/Analytics/R"

Load the datasets (original data and wide format) and look at gap_wide data

gapminder <- read.csv("gapminder_data.csv") # original data
gap_wide <- read.csv("gapminder_wide.csv") # wide format

head(gap_wide)
##   continent      country gdpPercap_1952 gdpPercap_1957 gdpPercap_1962
## 1    Africa      Algeria      2449.0082      3013.9760      2550.8169
## 2    Africa       Angola      3520.6103      3827.9405      4269.2767
## 3    Africa        Benin      1062.7522       959.6011       949.4991
## 4    Africa     Botswana       851.2411       918.2325       983.6540
## 5    Africa Burkina Faso       543.2552       617.1835       722.5120
## 6    Africa      Burundi       339.2965       379.5646       355.2032
##   gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987
## 1      3246.9918      4182.6638      4910.4168      5745.1602      5681.3585
## 2      5522.7764      5473.2880      3008.6474      2756.9537      2430.2083
## 3      1035.8314      1085.7969      1029.1613      1277.8976      1225.8560
## 4      1214.7093      2263.6111      3214.8578      4551.1421      6205.8839
## 5       794.8266       854.7360       743.3870       807.1986       912.0631
## 6       412.9775       464.0995       556.1033       559.6032       621.8188
##   gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007 lifeExp_1952
## 1      5023.2166      4797.2951      5288.0404      6223.3675       43.077
## 2      2627.8457      2277.1409      2773.2873      4797.2313       30.015
## 3      1191.2077      1232.9753      1372.8779      1441.2849       38.223
## 4      7954.1116      8647.1423     11003.6051     12569.8518       47.622
## 5       931.7528       946.2950      1037.6452      1217.0330       31.975
## 6       631.6999       463.1151       446.4035       430.0707       39.031
##   lifeExp_1957 lifeExp_1962 lifeExp_1967 lifeExp_1972 lifeExp_1977 lifeExp_1982
## 1       45.685       48.303       51.407       54.518       58.014       61.368
## 2       31.999       34.000       35.985       37.928       39.483       39.942
## 3       40.358       42.618       44.885       47.014       49.190       50.904
## 4       49.618       51.520       53.298       56.024       59.319       61.484
## 5       34.906       37.814       40.697       43.591       46.137       48.122
## 6       40.533       42.045       43.548       44.057       45.910       47.471
##   lifeExp_1987 lifeExp_1992 lifeExp_1997 lifeExp_2002 lifeExp_2007 pop_1952
## 1       65.799       67.744       69.152       70.994       72.301  9279525
## 2       39.906       40.647       40.963       41.003       42.731  4232095
## 3       52.337       53.919       54.777       54.406       56.728  1738315
## 4       63.622       62.745       52.556       46.634       50.728   442308
## 5       49.557       50.260       50.324       50.650       52.295  4469979
## 6       48.211       44.736       45.326       47.360       49.580  2445618
##   pop_1957 pop_1962 pop_1967 pop_1972 pop_1977 pop_1982 pop_1987 pop_1992
## 1 10270856 11000948 12760499 14760787 17152804 20033753 23254956 26298373
## 2  4561361  4826015  5247469  5894858  6162675  7016384  7874230  8735988
## 3  1925173  2151895  2427334  2761407  3168267  3641603  4243788  4981671
## 4   474639   512764   553541   619351   781472   970347  1151184  1342614
## 5  4713416  4919632  5127935  5433886  5889574  6634596  7586551  8878303
## 6  2667518  2961915  3330989  3529983  3834415  4580410  5126023  5809236
##   pop_1997 pop_2002 pop_2007
## 1 29072015 31287142 33333216
## 2  9875024 10866106 12420476
## 3  6066080  7026113  8078314
## 4  1536536  1630347  1639131
## 5 10352843 12251209 14326203
## 6  6121610  7021078  8390505

Look at dimensions of each dataset

dim(gapminder)
## [1] 1704    6
dim(gap_wide)
## [1] 142  38

Take a look at the wide format gapminder dataset. Often, we receive data in odd formats such as data with a lot of columns, that we would like to convert to a more useable format, such as a long format or long/intermediate format.

To do so, we will first convert this dataset into an entirely long-format dataset, that is more useable.

Pivot wide to long format

Think about which column names you want to ‘pivot’ into a column of column names. In this case, we will be pivotting every column starting with gdpPercap, lifeExp, and pop into a column of column names and a column of values for each.

To do so, we start out by making a vector that selects the columns that we want (using the starts_with() function to make our lives easier), and then set the new column names we are creating to “obstype_year” and “obs_values” for the name and value column names.

gap_wide %>%
  pivot_longer(cols = c(starts_with("gdpPercap"), 
                        starts_with("lifeExp"), 
                        starts_with("pop")),
               names_to = "obstype_year", values_to = "obs_values")
## # A tibble: 5,112 x 4
##    continent country obstype_year   obs_values
##    <chr>     <chr>   <chr>               <dbl>
##  1 Africa    Algeria gdpPercap_1952      2449.
##  2 Africa    Algeria gdpPercap_1957      3014.
##  3 Africa    Algeria gdpPercap_1962      2551.
##  4 Africa    Algeria gdpPercap_1967      3247.
##  5 Africa    Algeria gdpPercap_1972      4183.
##  6 Africa    Algeria gdpPercap_1977      4910.
##  7 Africa    Algeria gdpPercap_1982      5745.
##  8 Africa    Algeria gdpPercap_1987      5681.
##  9 Africa    Algeria gdpPercap_1992      5023.
## 10 Africa    Algeria gdpPercap_1997      4797.
## # … with 5,102 more rows

Another way of doing the same thing would be to select every column except for country and continent. The following returns the same long-format dataset as above:

gap_wide %>%
  pivot_longer(cols = c(-continent, -country),
               names_to = "obstype_year", values_to = "obs_values")
## # A tibble: 5,112 x 4
##    continent country obstype_year   obs_values
##    <chr>     <chr>   <chr>               <dbl>
##  1 Africa    Algeria gdpPercap_1952      2449.
##  2 Africa    Algeria gdpPercap_1957      3014.
##  3 Africa    Algeria gdpPercap_1962      2551.
##  4 Africa    Algeria gdpPercap_1967      3247.
##  5 Africa    Algeria gdpPercap_1972      4183.
##  6 Africa    Algeria gdpPercap_1977      4910.
##  7 Africa    Algeria gdpPercap_1982      5745.
##  8 Africa    Algeria gdpPercap_1987      5681.
##  9 Africa    Algeria gdpPercap_1992      5023.
## 10 Africa    Algeria gdpPercap_1997      4797.
## # … with 5,102 more rows

Finally, we currently have a composite variable of obstype_year, which is the combined name of the observation type (i.e. gdpPercap, lifeExp, pop) and the year (i.e. 1952, 1957, etc.). To separate the two, we can use the separate() function.

gap_wide %>%
  pivot_longer(cols = c(starts_with("gdpPercap"), 
                        starts_with("lifeExp"), 
                        starts_with("pop")),
               names_to = "obstype_year", values_to = "obs_values") %>%
	separate(obstype_year, into = c("obs_type", "year"), sep = "_")
## # A tibble: 5,112 x 5
##    continent country obs_type  year  obs_values
##    <chr>     <chr>   <chr>     <chr>      <dbl>
##  1 Africa    Algeria gdpPercap 1952       2449.
##  2 Africa    Algeria gdpPercap 1957       3014.
##  3 Africa    Algeria gdpPercap 1962       2551.
##  4 Africa    Algeria gdpPercap 1967       3247.
##  5 Africa    Algeria gdpPercap 1972       4183.
##  6 Africa    Algeria gdpPercap 1977       4910.
##  7 Africa    Algeria gdpPercap 1982       5745.
##  8 Africa    Algeria gdpPercap 1987       5681.
##  9 Africa    Algeria gdpPercap 1992       5023.
## 10 Africa    Algeria gdpPercap 1997       4797.
## # … with 5,102 more rows

The separate function takes the obstype_year column and separates it based on the underscore character (via the sep = "_" argument), and finally, we choose the resulting new column names using the into argument. Awesome! Now let’s save a new object called gap_long using the above.

gap_long <- gap_wide %>%
  pivot_longer(cols = c(starts_with("gdpPercap"), 
                        starts_with("lifeExp"), 
                        starts_with("pop")),
               names_to = "obstype_year", values_to = "obs_values") %>%
	separate(obstype_year, into = c("obs_type", "year"), sep = "_")

Here, we can calculate summary statistics, such as the means for life expectancy, population, and gdpPercap for each continent by using the group_by() and summarise() functions seen in the previous lesson on package dplyr.

gap_long %>%
  group_by(continent, obs_type) %>%
  summarise(means = mean(obs_values))
## `summarise()` regrouping output by 'continent' (override with `.groups` argument)
## # A tibble: 15 x 3
## # Groups:   continent [5]
##    continent obs_type       means
##    <chr>     <chr>          <dbl>
##  1 Africa    gdpPercap     2194. 
##  2 Africa    lifeExp         48.9
##  3 Africa    pop        9916003. 
##  4 Americas  gdpPercap     7136. 
##  5 Americas  lifeExp         64.7
##  6 Americas  pop       24504795. 
##  7 Asia      gdpPercap     7902. 
##  8 Asia      lifeExp         60.1
##  9 Asia      pop       77038722. 
## 10 Europe    gdpPercap    14469. 
## 11 Europe    lifeExp         71.9
## 12 Europe    pop       17169765. 
## 13 Oceania   gdpPercap    18622. 
## 14 Oceania   lifeExp         74.3
## 15 Oceania   pop        8874672.

This output looks a bit odd, because the data are in a very long format. Ideally, we would have gdpPercap, lifeExp, and pop be separate columns and values. To get to this intermediate format data, we can pivot_wider() by obs_type!

Pivot long to intermediate format

To get to a wider format, we need two things at minimum: a column of column names that will become the new headers on the new columns we create (names_from argument), and a column of the associated values (values_from argument).

We will get the new column names from the obs_type column, and the values to fil in from the obs_values column. That’s all we need to pivot_wider(), tidyr does the rest!

We will save this new data frame as gap_intermediate.

gap_intermediate <- gap_long %>%
  pivot_wider(names_from = obs_type, values_from = obs_values)

Comparing this to the original dataset, we see it is now exactly the same size

dim(gap_intermediate)
## [1] 1704    6
dim(gapminder) # original gapminder data
## [1] 1704    6

And the same column names

names(gap_intermediate) %in% names(gapminder)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE

Reordering columns

Currently, the columns are in a different order…

names(gap_intermediate)
## [1] "continent" "country"   "year"      "gdpPercap" "lifeExp"   "pop"
names(gapminder)
## [1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"
# all.equal(gap_intermediate, gapminder) # another way of comparing, not run

We need to change the column order using relocate() (but can also use select() insead)

gap_intermediate2 <- gap_intermediate %>%
  relocate(names(gapminder))
names(gap_intermediate2)
## [1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"

This creates the same order of column names as the original gapminder dataset.

Reordering rows

Another difference is that the original dataset has the rows sorted by country, rather than continent first!

head(gap_intermediate2, 3) 
## # A tibble: 3 x 6
##   country year       pop continent lifeExp gdpPercap
##   <chr>   <chr>    <dbl> <chr>       <dbl>     <dbl>
## 1 Algeria 1952   9279525 Africa       43.1     2449.
## 2 Algeria 1957  10270856 Africa       45.7     3014.
## 3 Algeria 1962  11000948 Africa       48.3     2551.
head(gapminder, 3)
##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007

We can fix this using the arrange() function to arrange by country, then by continent.

gap_intermediate3 <- gap_intermediate2 %>%
	arrange(country, continent)

head(gap_intermediate3, 3) 
## # A tibble: 3 x 6
##   country     year       pop continent lifeExp gdpPercap
##   <chr>       <chr>    <dbl> <chr>       <dbl>     <dbl>
## 1 Afghanistan 1952   8425333 Asia         28.8      779.
## 2 Afghanistan 1957   9240934 Asia         30.3      821.
## 3 Afghanistan 1962  10267083 Asia         32.0      853.
head(gapminder, 3)
##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007

Changing column class

Finally, if we try comparing our two datasets now using the all.equal() function:

all.equal(gap_intermediate3, gapminder)
## [1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
## [2] "Attributes: < Component \"class\": 1 string mismatch >"                                
## [3] "Component \"year\": Modes: character, numeric"                                         
## [4] "Component \"year\": target is character, current is numeric"

We see that there is one last difference. The year in our dataset is a numeric, rather than an integer. This can be easily changed using the mutate() function seen previously. Additionally, the original gapminder dataset is not a tibble, so we will revert back to a data frame only using the as.data.frame() function.

gap_intermediate4 <- gap_intermediate3 %>%
	mutate(year = as.integer(year)) %>%
	as.data.frame()

all.equal(gap_intermediate4, gapminder)
## [1] TRUE

Now, (finally!) the data are exactly the same as the original!

Combining all steps together to reduce code

Finally, once we have each step working coherently, we can reduce our code into a single, long pipe that we can run. This will get the data to the exact format we want it to be, that can then be saved for later as a processed dataset, keeping our original, raw, and unprocessed dataset unchanged but easily workable!

First, we gather all code where we re-saved the data frame as a new object:

gap_long <- gap_wide %>%
  pivot_longer(cols = c(starts_with("gdpPercap"), 
                        starts_with("lifeExp"), 
                        starts_with("pop")),
               names_to = "obstype_year", values_to = "obs_values") %>%
	separate(obstype_year, into = c("obs_type", "year"), sep = "_")

gap_intermediate <- gap_long %>%
  pivot_wider(names_from = obs_type, values_from = obs_values)

gap_intermediate2 <- gap_intermediate %>%
  relocate(names(gapminder))

gap_intermediate3 <- gap_intermediate2 %>%
	arrange(country, continent)

gap_intermediate4 <- gap_intermediate3 %>%
	mutate(year = as.integer(year)) %>%
	as.data.frame()

Ensure that this all runs smoothly, from start to finish before proceeding. If it does, we can now combine all steps into a single pipeline, to efficiently manipulate our data frame to match the original. Let’s name this new data frame gap_final.

gap_final <- gap_wide %>%
  pivot_longer(cols = c(starts_with("gdpPercap"), 
                        starts_with("lifeExp"), 
                        starts_with("pop")),
               names_to = "obstype_year", values_to = "obs_values") %>%
	separate(obstype_year, into = c("obs_type", "year"), sep = "_") %>%
  pivot_wider(names_from = obs_type, values_from = obs_values) %>%
  relocate(names(gapminder)) %>%
	arrange(country, continent) %>%
	mutate(year = as.integer(year)) %>%
	as.data.frame()
all.equal(gap_final, gapminder)
## [1] TRUE

And now, we have a concise bit of code that gets the raw data to the perfect format to work with.

Try it yourself on your own wide or long data, by getting it first to a long format, then moving it towards an intermediate format! Just have a good think about which column names you want where, and you should be able to adapt this code for yourself!