Joining tables (practice)

Primary exercises

In this exercise we first split the survey dataset into two separate tables, df1 and df2, in order to join them again. The two tables have disjoint sets of variables except for name and age. The combination of name and age is unique across all observations and will be used later as the join key. For example, put all variables related to the arm or hand in df1 and the rest in df2:

df1 : "name"   "span1" "span2" "hand"  "fold"   "clap"  "age"
df2 : "name"   "gender"  "pulse"  "exercise"   "smokes"  "height" "m.i" "age"
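
The split described above can be sketched on a small stand-in table. This is a minimal illustration only: the table toy, its values, and the names toy1 and toy2 are hypothetical and are not part of the survey data; select() from dplyr is assumed to be the tool used for the split.

```r
library(dplyr)

# Hypothetical toy table standing in for the survey data (not the real dataset)
toy <- tibble(
  name   = c("Ann", "Ann", "Bob", "Cas"),
  age    = c(20, 31, 20, 25),
  span1  = c(18.5, 19.0, 21.2, 17.8),
  height = c(170, 165, 182, 176)
)

# Keep the key columns (name, age) in both halves and divide the rest
toy1 <- toy %>% select(name, age, span1)
toy2 <- toy %>% select(name, age, height)

# The only columns the two halves share are the key columns
shared <- intersect(names(toy1), names(toy2))
print(shared)
```

Keeping the key columns in both halves is what makes it possible to put the rows back together later.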

Join df1 and df2 by name and age such that you obtain the original survey table.
In the previous exercise, does it make any difference whether you choose inner_join, left_join or full_join? Hint: compare two tables with the function all_equal.
Is the pair name and age a good candidate for a key, i.e. does the combination of name and age uniquely identify each observation in the survey data? What about the combination of name with span1 or span2?
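
The ideas behind these exercises can be tried on toy data first. The sketch below uses hypothetical tables toy1 and toy2 (not the survey data). It compares two join types with base R's all.equal() rather than all_equal(), which is deprecated in recent dplyr versions, and it checks a candidate key by counting how often each key combination occurs.

```r
library(dplyr)

# Hypothetical toy tables sharing the key columns name and age
toy1 <- tibble(
  name  = c("Ann", "Ann", "Bob", "Cas"),
  age   = c(20, 31, 20, 25),
  span1 = c(18.5, 19.0, 21.2, 17.8)
)
toy2 <- tibble(
  name   = c("Ann", "Ann", "Bob", "Cas"),
  age    = c(20, 31, 20, 25),
  height = c(170, 165, 182, 176)
)

# Every key in toy1 has exactly one match in toy2, so the join type
# should not change the result
joined_inner <- inner_join(toy1, toy2, by = c("name", "age"))
joined_full  <- full_join(toy1, toy2, by = c("name", "age"))
same_result  <- isTRUE(all.equal(joined_inner, joined_full))

# A candidate key is valid if no key combination occurs more than once
dup_name_age <- toy1 %>% count(name, age) %>% filter(n > 1)
dup_name     <- toy1 %>% count(name) %>% filter(n > 1)

print(same_result)
print(dup_name_age)  # no rows: (name, age) identifies each toy row
print(dup_name)      # the repeated name: name alone is not a key here
```

The same count() and filter() pattern applies to any other combination of columns that is proposed as a key.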

Extra exercises

The combination of name and span2 was not a good key. Show the rows with identical name and span2 values.
Explain the result if you use only the name variable as the key for joining in the first primary exercise. Why are there more rows in the result, and how many excess rows are there?
In the previous exercise, using only name as the key, we obtained 78 ‘excess’ observations because the key name is not unique. Calculate this number without applying the join.
Using a join function, add a new variable height to the favourite_colour table with values {175, 183} for names {Lotte, Lucas} respectively.

library(dplyr)  # provides tibble() and the join functions

favourite_colour <- tibble(name   = c("Lucas", "Lotte", "Noa", "Wim", "Marc", "Lucy", "Pedro"),
                           year   = c(1995, 1995, 1995, 1994, 1990, 1993, 1992),
                           colour = c("Blue", "Green", "Yellow", "Purple", "Green", "red", "Blue"))

What is the mean height of the first 15 and last 30 observations of the survey data combined?
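
The excess rows that appear when joining on a non-unique key can be illustrated on toy data. In the sketch below, toy1 and toy2 are hypothetical tables (not the survey data), and both are assumed to come from splitting the same table, so each name occurs equally often in both. Under that assumption a name that occurs n times produces n * n rows in the join, which is n * n - n more than in the original table.

```r
library(dplyr)

# Hypothetical toy tables; the name "Ann" occurs twice in each
toy1 <- tibble(
  name  = c("Ann", "Ann", "Bob", "Cas"),
  age   = c(20, 31, 20, 25),
  span1 = c(18.5, 19.0, 21.2, 17.8)
)
toy2 <- tibble(
  name   = c("Ann", "Ann", "Bob", "Cas"),
  age    = c(20, 31, 20, 25),
  height = c(170, 165, 182, 176)
)

# Joining on name only pairs every "Ann" row in toy1 with every "Ann" row
# in toy2. Recent dplyr versions warn about such many-to-many matches,
# so the warning is suppressed here.
by_name <- suppressWarnings(inner_join(toy1, toy2, by = "name"))
excess_from_join <- nrow(by_name) - nrow(toy1)

# The same number without joining: sum n * n - n over all names
excess_from_counts <- toy1 %>%
  count(name) %>%
  summarise(excess = sum(n * n - n)) %>%
  pull(excess)

print(excess_from_join)
print(excess_from_counts)
```

Names that occur only once contribute nothing to the sum, so only the repeated names matter.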
