Primary exercises
- For this exercise we will first split the
survey
dataset into two separate tables in order to join them again! Call these
df1
and df2
, these will have disjoint set of
variables except name
and age
, the variables
name
and age
combined are unique in all
observations and will be used later for joining. Take for example all
variables related to arm or hand in df1 and the rest in df2:
df1 : "name" "span1" "span2" "hand" "fold" "clap" "age"
df2 : "name" "gender" "pulse" "exercise" "smokes" "height" "m.i" "age"
df1 <- survey %>% select(name,age,span1,span2,hand,fold,clap)
df1
# A tibble: 233 × 7
name age span1 span2 hand fold clap
<chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
1 Alyson 18.2 18.5 18 right right left
2 Todd 17.6 19.5 20.5 left right left
3 Gerald 16.9 18 13.3 right left neither
4 Robert 20.3 18.8 18.9 right right neither
5 Dustin 23.7 20 20 right neither right
6 Abby 21 18 17.7 right left right
7 Andre 18.8 17.7 17.7 right left right
8 Michael 35.8 17 17.3 right right right
9 Edward 19 20 19.5 right right right
10 Carl 22.3 18.5 18.5 right right right
# ℹ 223 more rows
df2 <- survey %>% select(name,age,gender,pulse,exercise,smokes,height,m.i)
df2
# A tibble: 233 × 8
name age gender pulse exercise smokes height m.i
<chr> <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr>
1 Alyson 18.2 female 92 some never 173 metric
2 Todd 17.6 male 104 none regul 178. imperial
3 Gerald 16.9 male 87 none occas NA <NA>
4 Robert 20.3 male NA none never 160 metric
5 Dustin 23.7 male 35 some never 165 metric
6 Abby 21 female 64 some never 173. imperial
7 Andre 18.8 male 83 freq never 183. imperial
8 Michael 35.8 female 74 freq never 157 metric
9 Edward 19 male 72 some never 175 metric
10 Carl 22.3 male 90 some never 167 metric
# ℹ 223 more rows
- Join df1 and df2 by
name
and age
such that
you obtain the original survey table.
# In the join the order of the variables are different than the original survey tibble
# but the content is identical.
inner_join(df1,df2,by=c("name","age"))
# A tibble: 233 × 13
name age span1 span2 hand fold clap gender pulse exercise smokes height m.i
<chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr>
1 Alyson 18.2 18.5 18 right right left female 92 some never 173 metric
2 Todd 17.6 19.5 20.5 left right left male 104 none regul 178. imperial
3 Gerald 16.9 18 13.3 right left neither male 87 none occas NA <NA>
4 Robert 20.3 18.8 18.9 right right neither male NA none never 160 metric
5 Dustin 23.7 20 20 right neither right male 35 some never 165 metric
6 Abby 21 18 17.7 right left right female 64 some never 173. imperial
7 Andre 18.8 17.7 17.7 right left right male 83 freq never 183. imperial
8 Michael 35.8 17 17.3 right right right female 74 freq never 157 metric
9 Edward 19 20 19.5 right right right male 72 some never 175 metric
10 Carl 22.3 18.5 18.5 right right right male 90 some never 167 metric
# ℹ 223 more rows
- In exercise (a) does it make any difference to choose either of
inner_join
, left_join
or
full_join
? Hint: compare two tables with function
all_equal
.
# Answer is no, this because both df1 and df2 come from the same source
# `survey` with equal number of matching observations with matching keys.
#
all_equal(left_join(df1,df2,by=c("name","age")), survey) # left_join
[1] TRUE
all_equal(inner_join(df1,df2,by=c("name","age")), survey) # inner_join
[1] TRUE
all_equal(full_join(df1,df2,by=c("name","age")), survey) # full_join
[1] TRUE
- Are the pairs
name
and age
also good
candidates as the key, i.e. is the combination of name
and
age
uniquely identify each observation in the survey data?
What about the combination of name
with span1
or span2
?
# Yes, there are no duplicates in the pairs (name,age) in survey data.
sum(duplicated(survey %>% select(name,age)))==0
[1] TRUE
# The pair name and span1 will also suffice as a key.
sum(duplicated(survey %>% select(name,span1)))==0
[1] TRUE
# Multiple observations in the survey data have the same
# name span2 combinations, therefore not a good candidate
# as a key.
sum(duplicated(survey %>% select(name,span2)))==0
[1] FALSE