Joining tables (solutions)

Primary exercises

For this exercise we will first split the survey dataset into two separate tables in order to join them again! Call these df1 and df2, these will have disjoint set of variables except name and age, the variables name and age combined are unique in all observations and will be used later for joining. Take for example all variables related to arm or hand in df1 and the rest in df2:

df1 : "name"   "span1" "span2" "hand"  "fold"   "clap"  "age"
df2 : "name"   "gender"  "pulse"  "exercise"   "smokes"  "height" "m.i" "age"

df1 <- survey %>%  select(name,age,span1,span2,hand,fold,clap)
df1

# A tibble: 233 × 7
   name      age span1 span2 hand  fold    clap   
   <chr>   <dbl> <dbl> <dbl> <chr> <chr>   <chr>  
 1 Alyson   18.2  18.5  18   right right   left   
 2 Todd     17.6  19.5  20.5 left  right   left   
 3 Gerald   16.9  18    13.3 right left    neither
 4 Robert   20.3  18.8  18.9 right right   neither
 5 Dustin   23.7  20    20   right neither right  
 6 Abby     21    18    17.7 right left    right  
 7 Andre    18.8  17.7  17.7 right left    right  
 8 Michael  35.8  17    17.3 right right   right  
 9 Edward   19    20    19.5 right right   right  
10 Carl     22.3  18.5  18.5 right right   right  
# ℹ 223 more rows

df2 <- survey %>%  select(name,age,gender,pulse,exercise,smokes,height,m.i)
df2

# A tibble: 233 × 8
   name      age gender pulse exercise smokes height m.i     
   <chr>   <dbl> <chr>  <dbl> <chr>    <chr>   <dbl> <chr>   
 1 Alyson   18.2 female    92 some     never    173  metric  
 2 Todd     17.6 male     104 none     regul    178. imperial
 3 Gerald   16.9 male      87 none     occas     NA  <NA>    
 4 Robert   20.3 male      NA none     never    160  metric  
 5 Dustin   23.7 male      35 some     never    165  metric  
 6 Abby     21   female    64 some     never    173. imperial
 7 Andre    18.8 male      83 freq     never    183. imperial
 8 Michael  35.8 female    74 freq     never    157  metric  
 9 Edward   19   male      72 some     never    175  metric  
10 Carl     22.3 male      90 some     never    167  metric  
# ℹ 223 more rows

Join df1 and df2 by name and age such that you obtain the original survey table.

# In the join the order of the variables are different than the original survey tibble 
# but the content is identical.
inner_join(df1,df2,by=c("name","age"))

# A tibble: 233 × 13
   name      age span1 span2 hand  fold    clap    gender pulse exercise smokes height m.i     
   <chr>   <dbl> <dbl> <dbl> <chr> <chr>   <chr>   <chr>  <dbl> <chr>    <chr>   <dbl> <chr>   
 1 Alyson   18.2  18.5  18   right right   left    female    92 some     never    173  metric  
 2 Todd     17.6  19.5  20.5 left  right   left    male     104 none     regul    178. imperial
 3 Gerald   16.9  18    13.3 right left    neither male      87 none     occas     NA  <NA>    
 4 Robert   20.3  18.8  18.9 right right   neither male      NA none     never    160  metric  
 5 Dustin   23.7  20    20   right neither right   male      35 some     never    165  metric  
 6 Abby     21    18    17.7 right left    right   female    64 some     never    173. imperial
 7 Andre    18.8  17.7  17.7 right left    right   male      83 freq     never    183. imperial
 8 Michael  35.8  17    17.3 right right   right   female    74 freq     never    157  metric  
 9 Edward   19    20    19.5 right right   right   male      72 some     never    175  metric  
10 Carl     22.3  18.5  18.5 right right   right   male      90 some     never    167  metric  
# ℹ 223 more rows

In exercise (a) does it make any difference to choose either of inner_join, left_join or full_join? Hint: compare two tables with function all_equal.

# Answer is no, this because both df1 and df2 come from the same source 
# `survey` with equal number of matching observations with matching keys. 
#

all_equal(left_join(df1,df2,by=c("name","age")), survey)   # left_join

[1] TRUE

all_equal(inner_join(df1,df2,by=c("name","age")), survey)  # inner_join

[1] TRUE

all_equal(full_join(df1,df2,by=c("name","age")), survey)   # full_join

[1] TRUE

Are the pairs name and age also good candidates as the key, i.e. is the combination of name and age uniquely identify each observation in the survey data? What about the combination of name with span1 or span2?

# Yes, there are no duplicates in the pairs (name,age) in survey data.
sum(duplicated(survey %>% select(name,age)))==0

[1] TRUE

# The pair name and span1 will also suffice as a key.
sum(duplicated(survey %>% select(name,span1)))==0

[1] TRUE

# Multiple observations in the survey data have the same 
# name span2 combinations, therefore not a good candidate 
# as a key.
sum(duplicated(survey %>% select(name,span2)))==0

[1] FALSE

Extra exercises

The name and span2 were not good combinations as a key, show the rows with identical name and span2 values.

# step 1 : first find the {name,span2} of duplicated rows as a tibble
dups <- survey %>%  filter(duplicated(survey %>% select(name,span2)))  %>% select(name,span2) 
dups

# A tibble: 2 × 2
  name   span2
  <chr>  <dbl>
1 George  19.5
2 Roland  19

# step 2 : join all observations from survey where there is a match with {name,span2} in dups
inner_join(survey, dups , c("name","span2"))

# A tibble: 4 × 13
  name   gender span1 span2 hand  fold  pulse clap    exercise smokes height m.i      age
  <chr>  <chr>  <dbl> <dbl> <chr> <chr> <dbl> <chr>   <chr>    <chr>   <dbl> <chr>  <dbl>
1 George male    20.5  19.5 right left     56 right   freq     never     179 metric  17.4
2 Roland male    19    19   right left     NA right   some     never     180 metric  19.9
3 George male    20    19.5 right right    68 neither freq     regul     190 metric  19.4
4 Roland male    18.5  19   right left     84 right   freq     regul     187 metric  17.9

Explain the result if you use only the name variable as the key for joining in the primary exercise (1). Why are there more rows in the result and what is the excess number of rows?

# The variable 'name' is not unique, so for all those non-unique names the 
# joining of df1 and df2 will result in a product. For example the name 
# 'Benjamin' appears twice in the survey data, therefore it appears twice
# in each for the tibbles df1 and df2. Each row with the name 'Benjamin' 
# in df1 matches twice to the name 'Benjamin' in df2, so in total there 
# will be 4 rows in the final result.
# 
inner_join(df1,df2,by="name") %>% filter(name=="Benjamin")

# A tibble: 4 × 14
  name     age.x span1 span2 hand  fold  clap  age.y gender pulse exercise smokes height m.i     
  <chr>    <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>  <dbl> <chr>    <chr>   <dbl> <chr>   
1 Benjamin  21    18.5  18.1 right left  left   21   male      66 freq     never    175. imperial
2 Benjamin  21    18.5  18.1 right left  left   19.7 male      NA none     never    188. imperial
3 Benjamin  19.7  20    20.5 right right right  21   male      66 freq     never    175. imperial
4 Benjamin  19.7  20    20.5 right right right  19.7 male      NA none     never    188. imperial

# There are 78 more rows extra as the result of the ambiguities in joining
# with 'name' only. 
#
name_age_key_row_count <- inner_join(df1,df2,by=c("name","age")) %>% nrow() # 'name' and 'age' as the key
name_age_key_row_count

[1] 233

name_key_row_count <- inner_join(df1,df2,by="name") %>% nrow() # only 'name' as the key
name_key_row_count

[1] 311

name_key_row_count - name_age_key_row_count # number of additional rows due to non-unique key 'name'

[1] 78

In the previous exercise, using only name as the key, we obtained 78 ‘excess’ observations due to non-unique key name. Calculate this number without applying the join.

survey %>%  
  count(name) %>%               # count occurrences of 'name' (n)
  filter(n>1) %>%               # filter non-unique 'name' (n>1)
  mutate(excess=(n*n)-n) %>%    # excess per observation : product minus exact match
  summarise(excess=sum(excess)) # total excess

# A tibble: 1 × 1
  excess
   <int>
1     78

Add a new variable height to the favourote_colour with values {175, 183} for names {Lotte, Lucas} respectively using a join functions.

favourite_colour  <- tibble(name=c("Lucas","Lotte","Noa","Wim","Marc","Lucy","Pedro"), 
                           year=c(1995,1995,1995,1994,1990,1993,1992), 
                           colour=c("Blue","Green","Yellow","Purple","Green","red","Blue"))

heights <- tibble(name=c("Lotte","Lucas"), height=c(175,183))
left_join(favourite_colour, heights,"name")

# A tibble: 7 × 4
  name   year colour height
  <chr> <dbl> <chr>   <dbl>
1 Lucas  1995 Blue      183
2 Lotte  1995 Green     175
3 Noa    1995 Yellow     NA
4 Wim    1994 Purple     NA
5 Marc   1990 Green      NA
6 Lucy   1993 red        NA
7 Pedro  1992 Blue       NA

What is the mean height of the joint first 15 and last 30 observations?

# The function 'bind_rows' is used here to concatenate the two tibbles from the results of 
# 'head' and 'tail' functions. The resulting tibble 'survey_first15_last30' has 45 observations.
#
survey_first15_last30 <- bind_rows(head(survey,15), tail(survey,30)) # combined observations
mean(survey_first15_last30$height, na.rm = TRUE)  # mean of the combined observations

[1] 168.9558

↑ Lecture ⇄ Practice