Combine tibbles
*_join(...)
is a family of functions for combining two tibbles on common variable(s)
called key
.
For example, take a group of people (n=7) for which we have collected age and height data:
# height
df1 <- tibble(name = c("Isa","Jaylinn","Mila","Milas","Yara"),
height = c(160, 172, 182, 157, 162))
df1
# A tibble: 5 × 2
name height
<chr> <dbl>
1 Isa 160
2 Jaylinn 172
3 Mila 182
4 Milas 157
5 Yara 162
# age
df2 <- tibble(name = c("Fiene","Jaylinn","Mila","Noah","Yara"),
age = c(20,24,17,23,17))
df2
# A tibble: 5 × 2
name age
<chr> <dbl>
1 Fiene 20
2 Jaylinn 24
3 Mila 17
4 Noah 23
5 Yara 17
The information here is incomplete, for some we have only height
and for some only age
. Now we want to combine these tibbles into a single table with variables name
, age
and height
. We can do this with a join function which is able to combine observations (rows) by matching on the common variable(s) between the two tibbles. Matching is done by one or more variables, in this case the variable name
.
Keep in mind that for a proper joining of data, the matched variable(s), also known as the key, must be unique in both tibbles.
You may control the layout of the combined table by choosing one of the three join function laid out below.
Return all rows in df1
where there are matching values of name
in df2
and all columns in both df1
and df2
:
inner_join(df1,df2, by = "name")
# A tibble: 3 × 3
name height age
<chr> <dbl> <dbl>
1 Jaylinn 172 24
2 Mila 182 17
3 Yara 162 17
Return all rows from df1
and all columns in both df1
and df2
, NA
for missing values in df2
:
left_join(df1,df2, by = "name")
# A tibble: 5 × 3
name height age
<chr> <dbl> <dbl>
1 Isa 160 NA
2 Jaylinn 172 24
3 Mila 182 17
4 Milas 157 NA
5 Yara 162 17
All rows and all columns in df1
and df2
.
full_join(df1,df2, by = "name")
# A tibble: 7 × 3
name height age
<chr> <dbl> <dbl>
1 Isa 160 NA
2 Jaylinn 172 24
3 Mila 182 17
4 Milas 157 NA
5 Yara 162 17
6 Fiene NA 20
7 Noah NA 23
To illustrate let’s split the pulse dataset into two separate tibbles with common variables name
and height
:
p1 <- pulse %>% select(name,height,age,gender,year)
p1
# A tibble: 110 × 5
name height age gender year
<chr> <dbl> <dbl> <chr> <dbl>
1 Bonnie 173 18 female 1993
2 Melanie 179 19 female 1993
3 Consuelo 167 18 female 1993
4 Travis 195 18 male 1993
5 Lauri 173 18 female 1993
6 George 184 22 male 1993
7 Cherry 162 20 female 1993
8 Francesca 169 18 female 1993
9 Sonja 164 19 female 1993
10 Troy 168 23 male 1993
# … with 100 more rows
p2 <- pulse %>% select(-age,-gender,-year)
p2
# A tibble: 110 × 10
id name height weight smokes alcohol exercise ran pulse1 pulse2
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1993_A Bonnie 173 57 no yes moderate sat 86 88
2 1993_B Melanie 179 58 no yes moderate ran 82 150
3 1993_C Consuelo 167 62 no yes high ran 96 176
4 1993_D Travis 195 84 no yes high sat 71 73
5 1993_E Lauri 173 64 no yes low sat 90 88
6 1993_F George 184 74 no yes low ran 78 141
7 1993_G Cherry 162 57 no yes moderate sat 68 72
8 1993_H Francesca 169 55 no yes moderate sat 71 77
9 1993_I Sonja 164 56 no yes high sat 68 68
10 1993_J Troy 168 60 no yes moderate ran 88 150
# … with 100 more rows
p1
has the common variables {name,height}
(the key) and additional variables {age,gender,year}
, and p2
has the same key variables {name,height}
with some other measurement variables such as pulse1
pulse2
etc.
AnswerIs the combination
{name,height}
a valid choice as the unique key to join two tibbles p1 and p2?
Now we want to join them back together using full_join
:
full_join(p1,p2) # common variables are used when no variable is specified.
Joining, by = c("name", "height")
# A tibble: 110 × 13
name height age gender year id weight smokes alcohol exercise ran pulse1 pulse2
<chr> <dbl> <dbl> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 Bonnie 173 18 female 1993 1993_A 57 no yes moderate sat 86 88
2 Melanie 179 19 female 1993 1993_B 58 no yes moderate ran 82 150
3 Consuelo 167 18 female 1993 1993_C 62 no yes high ran 96 176
4 Travis 195 18 male 1993 1993_D 84 no yes high sat 71 73
5 Lauri 173 18 female 1993 1993_E 64 no yes low sat 90 88
6 George 184 22 male 1993 1993_F 74 no yes low ran 78 141
7 Cherry 162 20 female 1993 1993_G 57 no yes moderate sat 68 72
8 Francesca 169 18 female 1993 1993_H 55 no yes moderate sat 71 77
9 Sonja 164 19 female 1993 1993_I 56 no yes high sat 68 68
10 Troy 168 23 male 1993 1993_J 60 no yes moderate ran 88 150
# … with 100 more rows
It is a better practice to explicitly specify the variables:
pulse_joined <- full_join(p1,p2, by = c("name","height"))
pulse_joined
# A tibble: 110 × 13
name height age gender year id weight smokes alcohol exercise ran pulse1 pulse2
<chr> <dbl> <dbl> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 Bonnie 173 18 female 1993 1993_A 57 no yes moderate sat 86 88
2 Melanie 179 19 female 1993 1993_B 58 no yes moderate ran 82 150
3 Consuelo 167 18 female 1993 1993_C 62 no yes high ran 96 176
4 Travis 195 18 male 1993 1993_D 84 no yes high sat 71 73
5 Lauri 173 18 female 1993 1993_E 64 no yes low sat 90 88
6 George 184 22 male 1993 1993_F 74 no yes low ran 78 141
7 Cherry 162 20 female 1993 1993_G 57 no yes moderate sat 68 72
8 Francesca 169 18 female 1993 1993_H 55 no yes moderate sat 71 77
9 Sonja 164 19 female 1993 1993_I 56 no yes high sat 68 68
10 Troy 168 23 male 1993 1993_J 60 no yes moderate ran 88 150
# … with 100 more rows
You may check:
all_equal(pulse,pulse_joined) # caution: all_equal is an experimental function
[1] TRUE
⚠️ The function all_equal is an experimental function and may become obsolete!
Often we have different data sets with i) the same set of variables or ii) same set of observations but different variables which we would like to combine:
bind_rows
Take for example the following two data sets with common variables name,year and colour:
favourite_colour1 <- tibble(name=c("Lucas","Lotte","Noa","Wim"),
year=c(1995,1995,1995,1994),
colour=c("Blue","Green","Yellow","Purple"))
favourite_colour1
# A tibble: 4 × 3
name year colour
<chr> <dbl> <chr>
1 Lucas 1995 Blue
2 Lotte 1995 Green
3 Noa 1995 Yellow
4 Wim 1994 Purple
favourite_colour2 <- tibble(name=c("Raul", "Isaac"),
year=c(1998,1998),
colour=c("Red", "Green"))
favourite_colour2
# A tibble: 2 × 3
name year colour
<chr> <dbl> <chr>
1 Raul 1998 Red
2 Isaac 1998 Green
then the following gives the combined tibble:
favourite_colour1_2 <- bind_rows(favourite_colour1, favourite_colour2)
favourite_colour1_2
# A tibble: 6 × 3
name year colour
<chr> <dbl> <chr>
1 Lucas 1995 Blue
2 Lotte 1995 Green
3 Noa 1995 Yellow
4 Wim 1994 Purple
5 Raul 1998 Red
6 Isaac 1998 Green
The function bind_rows
treats tibbles as a collection of unordered variables. Let’s take the same data as in favourite_colour2
but change the order of variables year and colour
favourite_colour3 <- tibble(name=c("Raul", "Isaac"),
colour=c("Red", "Green"),
year=c(1998,1998))
favourite_colour3
# A tibble: 2 × 3
name colour year
<chr> <chr> <dbl>
1 Raul Red 1998
2 Isaac Green 1998
then combining favourite_colour1
and favourite_colour3
will yield the same results:
favourite_colour1_3 <- bind_rows(favourite_colour1, favourite_colour3)
favourite_colour1_3
# A tibble: 6 × 3
name year colour
<chr> <dbl> <chr>
1 Lucas 1995 Blue
2 Lotte 1995 Green
3 Noa 1995 Yellow
4 Wim 1994 Purple
5 Raul 1998 Red
6 Isaac 1998 Green
AnswerWhat about
bind_rows(favourite_colour3, favourite_colour1)
?
favourite_colour3
, the first argument to bind_rows.
Another consequence of treating tibbles as a collection of unordered variables is that there is no restriction on the given variables in the tibbles, with other words they might be identical sets of variables as was shown in the examples above but not necessarily:
favourite_colour4 <- tibble(name=c("Raul", "Isaac"),
colour=c("Red", "Green"),
year=c(1998,1998),
height= c(173, 179))
favourite_colour4
# A tibble: 2 × 4
name colour year height
<chr> <chr> <dbl> <dbl>
1 Raul Red 1998 173
2 Isaac Green 1998 179
favourite_colour1_4 <- bind_rows(favourite_colour1, favourite_colour4)
favourite_colour1_4
# A tibble: 6 × 4
name year colour height
<chr> <dbl> <chr> <dbl>
1 Lucas 1995 Blue NA
2 Lotte 1995 Green NA
3 Noa 1995 Yellow NA
4 Wim 1994 Purple NA
5 Raul 1998 Red 173
6 Isaac 1998 Green 179
bind_cols
Let us now combine another tibble with variable ‘height’ for the same set of observations in favourite_colour1_2
:
heights <- tibble(height=c(173, 179, 167, 181 , 173, 184))
heights
# A tibble: 6 × 1
height
<dbl>
1 173
2 179
3 167
4 181
5 173
6 184
bind_cols(favourite_colour1_2,heights)
# A tibble: 6 × 4
name year colour height
<chr> <dbl> <chr> <dbl>
1 Lucas 1995 Blue 173
2 Lotte 1995 Green 179
3 Noa 1995 Yellow 167
4 Wim 1994 Purple 181
5 Raul 1998 Red 173
6 Isaac 1998 Green 184
bind_cols expects the same number of obsersavations in each tibble, it is an error otherwise, and you as the user are responsible for the order of observations in each tibble.
Copyright © 2023 Biomedical Data Sciences (BDS) | LUMC