Joining tables

↓ Practice ← Matrices (not on exam) Reshaping tables →

Combine tibbles

*_join(...) is a family of functions for combining two tibbles on common variable(s) called key.

For example, take a group of people (n=7) for which we have collected age and height data:

# height 
df1 <- tibble(name = c("Isa","Jaylinn","Mila","Milas","Yara"),
           height = c(160, 172, 182, 157, 162)) 
df1

# A tibble: 5 × 2
  name    height
  <chr>    <dbl>
1 Isa        160
2 Jaylinn    172
3 Mila       182
4 Milas      157
5 Yara       162

# age
df2 <- tibble(name = c("Fiene","Jaylinn","Mila","Noah","Yara"),
           age = c(20,24,17,23,17)) 
df2

# A tibble: 5 × 2
  name      age
  <chr>   <dbl>
1 Fiene      20
2 Jaylinn    24
3 Mila       17
4 Noah       23
5 Yara       17

The information here is incomplete, for some we have only height and for some only age. Now we want to combine these tibbles into a single table with variables name, age and height. We can do this with a join function which is able to combine observations (rows) by matching on the common variable(s) between the two tibbles. Matching is done by one or more variables, in this case the variable name.

Keep in mind that for a proper joining of data, the matched variable(s), also known as the key, must be unique in both tibbles.

Join : single variable

You may control the layout of the combined table by choosing one of the three join function laid out below.

inner_join

Return all rows in df1 where there are matching values of name in df2 and all columns in both df1 and df2:

inner_join(df1,df2, by = "name")

# A tibble: 3 × 3
  name    height   age
  <chr>    <dbl> <dbl>
1 Jaylinn    172    24
2 Mila       182    17
3 Yara       162    17

left_join

Return all rows from df1 and all columns in both df1 and df2, NA for missing values in df2:

left_join(df1,df2, by = "name")

# A tibble: 5 × 3
  name    height   age
  <chr>    <dbl> <dbl>
1 Isa        160    NA
2 Jaylinn    172    24
3 Mila       182    17
4 Milas      157    NA
5 Yara       162    17

full_join

All rows and all columns in df1 and df2.

full_join(df1,df2, by = "name")

# A tibble: 7 × 3
  name    height   age
  <chr>    <dbl> <dbl>
1 Isa        160    NA
2 Jaylinn    172    24
3 Mila       182    17
4 Milas      157    NA
5 Yara       162    17
6 Fiene       NA    20
7 Noah        NA    23

Join : multiple variables as the key

To illustrate let’s split the pulse dataset into two separate tibbles with common variables name and height:

p1 <- pulse  %>% select(name,height,age,gender,year)
p1

# A tibble: 110 × 5
   name      height   age gender  year
   <chr>      <dbl> <dbl> <chr>  <dbl>
 1 Bonnie       173    18 female  1993
 2 Melanie      179    19 female  1993
 3 Consuelo     167    18 female  1993
 4 Travis       195    18 male    1993
 5 Lauri        173    18 female  1993
 6 George       184    22 male    1993
 7 Cherry       162    20 female  1993
 8 Francesca    169    18 female  1993
 9 Sonja        164    19 female  1993
10 Troy         168    23 male    1993
# … with 100 more rows

p2 <- pulse  %>% select(-age,-gender,-year)
p2

# A tibble: 110 × 10
   id     name      height weight smokes alcohol exercise ran   pulse1 pulse2
   <chr>  <chr>      <dbl>  <dbl> <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl>
 1 1993_A Bonnie       173     57 no     yes     moderate sat       86     88
 2 1993_B Melanie      179     58 no     yes     moderate ran       82    150
 3 1993_C Consuelo     167     62 no     yes     high     ran       96    176
 4 1993_D Travis       195     84 no     yes     high     sat       71     73
 5 1993_E Lauri        173     64 no     yes     low      sat       90     88
 6 1993_F George       184     74 no     yes     low      ran       78    141
 7 1993_G Cherry       162     57 no     yes     moderate sat       68     72
 8 1993_H Francesca    169     55 no     yes     moderate sat       71     77
 9 1993_I Sonja        164     56 no     yes     high     sat       68     68
10 1993_J Troy         168     60 no     yes     moderate ran       88    150
# … with 100 more rows

p1 has the common variables {name,height} (the key) and additional variables {age,gender,year}, and p2 has the same key variables {name,height} with some other measurement variables such as pulse1 pulse2 etc.

Is the combination {name,height} a valid choice as the unique key to join two tibbles p1 and p2?

Answer

Yes. This because the combination is unique in the table. You can check the uniqueness of {name,height} with the command duplicated, see also ?base::duplicated. The statement sum(duplicated(pulse %>% select(name,height)))==0 tests whether there are duplicates in the combination {name,height}

Now we want to join them back together using full_join:

full_join(p1,p2) # common variables are used when no variable is specified.

Joining, by = c("name", "height")

# A tibble: 110 × 13
   name      height   age gender  year id     weight smokes alcohol exercise ran   pulse1 pulse2
   <chr>      <dbl> <dbl> <chr>  <dbl> <chr>   <dbl> <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl>
 1 Bonnie       173    18 female  1993 1993_A     57 no     yes     moderate sat       86     88
 2 Melanie      179    19 female  1993 1993_B     58 no     yes     moderate ran       82    150
 3 Consuelo     167    18 female  1993 1993_C     62 no     yes     high     ran       96    176
 4 Travis       195    18 male    1993 1993_D     84 no     yes     high     sat       71     73
 5 Lauri        173    18 female  1993 1993_E     64 no     yes     low      sat       90     88
 6 George       184    22 male    1993 1993_F     74 no     yes     low      ran       78    141
 7 Cherry       162    20 female  1993 1993_G     57 no     yes     moderate sat       68     72
 8 Francesca    169    18 female  1993 1993_H     55 no     yes     moderate sat       71     77
 9 Sonja        164    19 female  1993 1993_I     56 no     yes     high     sat       68     68
10 Troy         168    23 male    1993 1993_J     60 no     yes     moderate ran       88    150
# … with 100 more rows

It is a better practice to explicitly specify the variables:

pulse_joined <- full_join(p1,p2, by = c("name","height"))
pulse_joined

# A tibble: 110 × 13
   name      height   age gender  year id     weight smokes alcohol exercise ran   pulse1 pulse2
   <chr>      <dbl> <dbl> <chr>  <dbl> <chr>   <dbl> <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl>
 1 Bonnie       173    18 female  1993 1993_A     57 no     yes     moderate sat       86     88
 2 Melanie      179    19 female  1993 1993_B     58 no     yes     moderate ran       82    150
 3 Consuelo     167    18 female  1993 1993_C     62 no     yes     high     ran       96    176
 4 Travis       195    18 male    1993 1993_D     84 no     yes     high     sat       71     73
 5 Lauri        173    18 female  1993 1993_E     64 no     yes     low      sat       90     88
 6 George       184    22 male    1993 1993_F     74 no     yes     low      ran       78    141
 7 Cherry       162    20 female  1993 1993_G     57 no     yes     moderate sat       68     72
 8 Francesca    169    18 female  1993 1993_H     55 no     yes     moderate sat       71     77
 9 Sonja        164    19 female  1993 1993_I     56 no     yes     high     sat       68     68
10 Troy         168    23 male    1993 1993_J     60 no     yes     moderate ran       88    150
# … with 100 more rows

You may check:

all_equal(pulse,pulse_joined) # caution: all_equal is an experimental function

[1] TRUE

⚠️ The function all_equal is an experimental function and may become obsolete!

Bind tibbles by rows and columns

Often we have different data sets with i) the same set of variables or ii) same set of observations but different variables which we would like to combine:

bind_rows(…) function creates a new tibble from a set of tibbles with not necessarily common variables.
bind_cols(…) function creates a new tibble from a set of tibbles with common observations.

bind_rows

Take for example the following two data sets with common variables name,year and colour:

favourite_colour1  <- tibble(name=c("Lucas","Lotte","Noa","Wim"), 
                           year=c(1995,1995,1995,1994), 
                           colour=c("Blue","Green","Yellow","Purple"))
favourite_colour1

# A tibble: 4 × 3
  name   year colour
  <chr> <dbl> <chr> 
1 Lucas  1995 Blue  
2 Lotte  1995 Green 
3 Noa    1995 Yellow
4 Wim    1994 Purple

favourite_colour2  <- tibble(name=c("Raul", "Isaac"), 
                           year=c(1998,1998), 
                           colour=c("Red", "Green"))
favourite_colour2

# A tibble: 2 × 3
  name   year colour
  <chr> <dbl> <chr> 
1 Raul   1998 Red   
2 Isaac  1998 Green

then the following gives the combined tibble:

favourite_colour1_2 <- bind_rows(favourite_colour1, favourite_colour2)
favourite_colour1_2

# A tibble: 6 × 3
  name   year colour
  <chr> <dbl> <chr> 
1 Lucas  1995 Blue  
2 Lotte  1995 Green 
3 Noa    1995 Yellow
4 Wim    1994 Purple
5 Raul   1998 Red   
6 Isaac  1998 Green

The function bind_rows treats tibbles as a collection of unordered variables. Let’s take the same data as in favourite_colour2 but change the order of variables year and colour

favourite_colour3  <- tibble(name=c("Raul", "Isaac"), 
                           colour=c("Red", "Green"), 
                           year=c(1998,1998))
favourite_colour3

# A tibble: 2 × 3
  name  colour  year
  <chr> <chr>  <dbl>
1 Raul  Red     1998
2 Isaac Green   1998

then combining favourite_colour1 and favourite_colour3 will yield the same results:

favourite_colour1_3 <- bind_rows(favourite_colour1, favourite_colour3)
favourite_colour1_3

# A tibble: 6 × 3
  name   year colour
  <chr> <dbl> <chr> 
1 Lucas  1995 Blue  
2 Lotte  1995 Green 
3 Noa    1995 Yellow
4 Wim    1994 Purple
5 Raul   1998 Red   
6 Isaac  1998 Green

What about bind_rows(favourite_colour3, favourite_colour1)?

Answer

The same dataset except the order of variables are taken from favourite_colour3, the first argument to bind_rows.

Another consequence of treating tibbles as a collection of unordered variables is that there is no restriction on the given variables in the tibbles, with other words they might be identical sets of variables as was shown in the examples above but not necessarily:

favourite_colour4  <- tibble(name=c("Raul", "Isaac"), 
                           colour=c("Red", "Green"), 
                           year=c(1998,1998), 
                           height= c(173, 179))
favourite_colour4

# A tibble: 2 × 4
  name  colour  year height
  <chr> <chr>  <dbl>  <dbl>
1 Raul  Red     1998    173
2 Isaac Green   1998    179

favourite_colour1_4 <- bind_rows(favourite_colour1, favourite_colour4)
favourite_colour1_4

# A tibble: 6 × 4
  name   year colour height
  <chr> <dbl> <chr>   <dbl>
1 Lucas  1995 Blue       NA
2 Lotte  1995 Green      NA
3 Noa    1995 Yellow     NA
4 Wim    1994 Purple     NA
5 Raul   1998 Red       173
6 Isaac  1998 Green     179

bind_cols

Let us now combine another tibble with variable ‘height’ for the same set of observations in favourite_colour1_2:

heights <- tibble(height=c(173, 179, 167, 181 , 173, 184))
heights

# A tibble: 6 × 1
  height
   <dbl>
1    173
2    179
3    167
4    181
5    173
6    184

bind_cols(favourite_colour1_2,heights)

# A tibble: 6 × 4
  name   year colour height
  <chr> <dbl> <chr>   <dbl>
1 Lucas  1995 Blue      173
2 Lotte  1995 Green     179
3 Noa    1995 Yellow    167
4 Wim    1994 Purple    181
5 Raul   1998 Red       173
6 Isaac  1998 Green     184

bind_cols expects the same number of obsersavations in each tibble, it is an error otherwise, and you as the user are responsible for the order of observations in each tibble.

↓ Practice ← Matrices (not on exam) Reshaping tables →