Group observations into sets and summarise on those sets.
group_by
: group observationsYou can group your data given a set of variables. For example we can
group per age
and gender
:
pulse %>% group_by( age,gender )
# A tibble: 110 × 13
# Groups: age, gender [21]
id name height weight age gender smokes alcohol exercise ran pulse1 pulse2 year
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1993_A Bonnie 173 57 18 female no yes moderate sat 86 88 1993
2 1993_B Melanie 179 58 19 female no yes moderate ran 82 150 1993
3 1993_C Consuelo 167 62 18 female no yes high ran 96 176 1993
4 1993_D Travis 195 84 18 male no yes high sat 71 73 1993
5 1993_E Lauri 173 64 18 female no yes low sat 90 88 1993
6 1993_F George 184 74 22 male no yes low ran 78 141 1993
7 1993_G Cherry 162 57 20 female no yes moderate sat 68 72 1993
8 1993_H Francesca 169 55 18 female no yes moderate sat 71 77 1993
9 1993_I Sonja 164 56 19 female no yes high sat 68 68 1993
10 1993_J Troy 168 60 23 male no yes moderate ran 88 150 1993
# ℹ 100 more rows
Note the tag Groups: age, gender [21] in the output,
meaning that group_by
found 21 {age,gender}
groups in our dataset.
Once groups are marked with group_by
, then we can
analyse those groups with summarise
producing a single row
output per group. For example we can count the number of observation per
group using function n()
:
pulse %>% group_by( age, gender ) %>%
summarise( n = n())
# A tibble: 21 × 3
# Groups: age [13]
age gender n
<dbl> <chr> <int>
1 18 female 15
2 18 male 8
3 19 female 14
4 19 male 15
5 20 female 11
6 20 male 16
7 21 female 3
8 21 male 7
9 22 female 1
10 22 male 5
# ℹ 11 more rows
⚠️You may get a warning like ‘summarise() regrouping … (override with .groups argument)’. It is a reminder to remove the group from the result set. You may ignore this or set the ‘summarise’ argument .groups=‘drop’, seel also ?summarise.
pulse %>% group_by( age, gender ) %>%
summarise( n = n(), .groups='drop')
# A tibble: 21 × 3
age gender n
<dbl> <chr> <int>
1 18 female 15
2 18 male 8
3 19 female 14
4 19 male 15
5 20 female 11
6 20 male 16
7 21 female 3
8 21 male 7
9 22 female 1
10 22 male 5
# ℹ 11 more rows
AnswerWhich function produces the same output as above given {age,gender}?
But of course we want to do more than just count the size of the
groups. We can for example calculate the mean height
and
weight
per {age,gender}
group:
pulse %>% group_by(age,gender) %>%
summarise(size=n(),meanHeight=mean(height), meanWeight=mean(weight))
# A tibble: 21 × 5
# Groups: age [13]
age gender size meanHeight meanWeight
<dbl> <chr> <int> <dbl> <dbl>
1 18 female 15 168. 58.9
2 18 male 8 183. 74.4
3 19 female 14 160. 52
4 19 male 15 173. 72.8
5 20 female 11 166. 58.0
6 20 male 16 178. 74.5
7 21 female 3 172 57.3
8 21 male 7 180. 76
9 22 female 1 151 42
10 22 male 5 178. 76.4
# ℹ 11 more rows
We now have all the tools we need to apply more complex queries on
our data. For example, group per gender
on those that
ran
and summarize on mean age
,
pulse1
and pulse2
. First we need to filter
only those who ran (see explanation on ran
in pulse) and only then group and summarise:
pulse %>% filter(ran == "ran") %>%
group_by( gender ) %>%
summarise( size = n(), meanAge= mean(age), meanPluse1 = mean(pulse1), meanPulse2 = mean( pulse2 ) )
# A tibble: 2 × 5
gender size meanAge meanPluse1 meanPulse2
<chr> <int> <dbl> <dbl> <dbl>
1 female 22 20.6 75.5 126.
2 male 24 19.8 75.5 128.
now for those who sat
:
pulse %>% filter(ran == "sat") %>% group_by( gender ) %>%
summarise( size = n(), meanAge= mean(age), meanPluse1 = mean(pulse1), meanPulse2 = mean( pulse2 ) )
# A tibble: 2 × 5
gender size meanAge meanPluse1 meanPulse2
<chr> <int> <dbl> <dbl> <dbl>
1 female 29 19.6 NA NA
2 male 35 21.9 73.3 72.3
Note that there are missing values, account for it by using
na.rm=TRUE
:
pulse %>% filter(ran == "sat") %>%
group_by( gender ) %>%
summarise( count = n(), meanAge= mean(age),
meanPluse1 = mean(pulse1,na.rm=TRUE), meanPulse2 = mean( pulse2 , na.rm=TRUE) )
# A tibble: 2 × 5
gender count meanAge meanPluse1 meanPulse2
<chr> <int> <dbl> <dbl> <dbl>
1 female 29 19.6 79.1 78
2 male 35 21.9 73.3 72.3
⚠️ It is good practice to check your results. For example, the group sizes in the original survey table who ran and sat must match the sum of sizes in the different summaries shown above under ‘size’ column. For ran==“ran” summary we have 22+24=44 and for ran==“sat” summary we have 29+35=64. We can check them against totals below and we see that they do:
pulse %>% count(ran)
# A tibble: 2 × 2
ran n
<chr> <int>
1 ran 46
2 sat 64
ungroup
: remove groupingTo remove the grouping use we have ungroup
function:
pulse %>% group_by(age,gender) %>% ungroup() # results in the original pulse tibble
# A tibble: 110 × 13
id name height weight age gender smokes alcohol exercise ran pulse1 pulse2 year
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1993_A Bonnie 173 57 18 female no yes moderate sat 86 88 1993
2 1993_B Melanie 179 58 19 female no yes moderate ran 82 150 1993
3 1993_C Consuelo 167 62 18 female no yes high ran 96 176 1993
4 1993_D Travis 195 84 18 male no yes high sat 71 73 1993
5 1993_E Lauri 173 64 18 female no yes low sat 90 88 1993
6 1993_F George 184 74 22 male no yes low ran 78 141 1993
7 1993_G Cherry 162 57 20 female no yes moderate sat 68 72 1993
8 1993_H Francesca 169 55 18 female no yes moderate sat 71 77 1993
9 1993_I Sonja 164 56 19 female no yes high sat 68 68 1993
10 1993_J Troy 168 60 23 male no yes moderate ran 88 150 1993
# ℹ 100 more rows
Copyright © 2024 Biomedical Data Sciences (BDS) | LUMC