Reshaping tables

↓ Practice ← Joining tables User-defined functions →

Change layout of values in a table

Source: data import cheat sheet : tidyr

Two main functions to manipulate the layout of the table, pivot_longer transforms the table from wide to long format and pivot_wider, which does the opposite, converts the table from long to wide format.

`pivot_longer` : wide to long format

This function allows collapsing ‘similar’ variables into one variable while guaranteeing the data set’s consistency. For example take the variables pulse1 and pulse2 in the following subset of the pulse data set:

pulses <- pulse  %>%  
      select(name,pulse1,pulse2)  %>% 
      head(3) 
pulses

# A tibble: 3 × 3
  name     pulse1 pulse2
  <chr>     <dbl>  <dbl>
1 Bonnie       86     88
2 Melanie      82    150
3 Consuelo     96    176

We can transform the table as such that all pulse values are under a single variable, say pulse:

dfLong <- pulses %>% pivot_longer(c(pulse1, pulse2), names_to = "pulse", values_to = "level")
dfLong

# A tibble: 6 × 3
  name     pulse  level
  <chr>    <chr>  <dbl>
1 Bonnie   pulse1    86
2 Bonnie   pulse2    88
3 Melanie  pulse1    82
4 Melanie  pulse2   150
5 Consuelo pulse1    96
6 Consuelo pulse2   176

This is called the long version of the original (wide) table and contains the same information.

Alternatively you achieve the same result using ! (negation operator):

dfLong <- pulses %>% pivot_longer(!name, names_to = "pulse", values_to = "level")
dfLong

# A tibble: 6 × 3
  name     pulse  level
  <chr>    <chr>  <dbl>
1 Bonnie   pulse1    86
2 Bonnie   pulse2    88
3 Melanie  pulse1    82
4 Melanie  pulse2   150
5 Consuelo pulse1    96
6 Consuelo pulse2   176

Can you find ohter variables in the pulse data set which can be transformed to long format?

Answer

Yes, for example smokes and alcohol, both are categorical with values {yes,no}.

Let’s reshape the pulse dataset with variables drug = {smokes,alcohol} and use = {yes,no}.

pulse %>%  pivot_longer(c(smokes,alcohol), names_to = "drug", values_to = "use")

# A tibble: 220 × 13
   id     name     height weight   age gender exercise ran   pulse1 pulse2  year drug    use  
   <chr>  <chr>     <dbl>  <dbl> <dbl> <chr>  <chr>    <chr>  <dbl>  <dbl> <dbl> <chr>   <chr>
 1 1993_A Bonnie      173     57    18 female moderate sat       86     88  1993 smokes  no   
 2 1993_A Bonnie      173     57    18 female moderate sat       86     88  1993 alcohol yes  
 3 1993_B Melanie     179     58    19 female moderate ran       82    150  1993 smokes  no   
 4 1993_B Melanie     179     58    19 female moderate ran       82    150  1993 alcohol yes  
 5 1993_C Consuelo    167     62    18 female high     ran       96    176  1993 smokes  no   
 6 1993_C Consuelo    167     62    18 female high     ran       96    176  1993 alcohol yes  
 7 1993_D Travis      195     84    18 male   high     sat       71     73  1993 smokes  no   
 8 1993_D Travis      195     84    18 male   high     sat       71     73  1993 alcohol yes  
 9 1993_E Lauri       173     64    18 female low      sat       90     88  1993 smokes  no   
10 1993_E Lauri       173     64    18 female low      sat       90     88  1993 alcohol yes  
# ℹ 210 more rows

`pivot_wider` : long to wide format

dfWide <- dfLong %>% pivot_wider(names_from = "pulse", values_from = "level")
dfWide

# A tibble: 3 × 3
  name     pulse1 pulse2
  <chr>     <dbl>  <dbl>
1 Bonnie       86     88
2 Melanie      82    150
3 Consuelo     96    176

Below, pulses tibble has been transformed into pulses2. What can you say about this transformation?

pulses2 <- pulses %>% 
           pivot_longer(!name, names_to = "pulse", values_to = "level") %>% 
           pivot_wider(names_from = "pulse", values_from = "level")

Answer

pulses and pulses2 are identical

ggplot and reshaping

Often you will need to reshape your data for ggplot visualisations. For example we would like to compare pulse1 and pulse2 variables with a boxplot. You may plot pulse1 and pulse2 in separate plots one after the other:

require(gridExtra)  # Make sure gridExtra is installed with install.packages("gridExtra").

bp1 <- ggplot(pulse %>% drop_na()) +  # remove missing
  aes(y=pulse1) +
  geom_boxplot()
bp2 <- ggplot(pulse %>% drop_na()) +  # remove missing
  aes(y=pulse2) +
  geom_boxplot()
grid.arrange(bp1, bp2, ncol=2)

As you can see the y-scale of each plot is set dynamically and can be misleading. To resolve this, we would need a single plot with aesthetic mapping of x being the categorical variable pulse and y the pulse values as was shown above.

ggplot(pulse %>% 
         drop_na() %>%  # remove missing 
         pivot_longer( c(pulse1,pulse2), names_to = "pulse", values_to = "level")
       ) +  
  aes(x=pulse, y=level) +
  geom_boxplot()

↓ Practice ← Joining tables User-defined functions →

Reshaping tables

pivot_longer : wide to long format

pivot_wider : long to wide format

ggplot and reshaping

`pivot_longer` : wide to long format

`pivot_wider` : long to wide format