Group observations into sets and summarise on those sets.

group_by : group observations

You can group your data given a set of variables. For example we can group per age and gender:

pulse %>% group_by( age,gender )
# A tibble: 110 × 13
# Groups:   age, gender [21]
   id     name      height weight   age gender smokes alcohol exercise ran   pulse1 pulse2  year
   <chr>  <chr>      <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
 1 1993_A Bonnie       173     57    18 female no     yes     moderate sat       86     88  1993
 2 1993_B Melanie      179     58    19 female no     yes     moderate ran       82    150  1993
 3 1993_C Consuelo     167     62    18 female no     yes     high     ran       96    176  1993
 4 1993_D Travis       195     84    18 male   no     yes     high     sat       71     73  1993
 5 1993_E Lauri        173     64    18 female no     yes     low      sat       90     88  1993
 6 1993_F George       184     74    22 male   no     yes     low      ran       78    141  1993
 7 1993_G Cherry       162     57    20 female no     yes     moderate sat       68     72  1993
 8 1993_H Francesca    169     55    18 female no     yes     moderate sat       71     77  1993
 9 1993_I Sonja        164     56    19 female no     yes     high     sat       68     68  1993
10 1993_J Troy         168     60    23 male   no     yes     moderate ran       88    150  1993
# … with 100 more rows

Note the tag Groups: age, gender [21] in the output, meaning that group_by found 21 {age,gender} groups in our dataset.

Once groups are marked with group_by, then we can analyse those groups with summarise producing a single row output per group. For example we can count the number of observation per group using function n():

pulse %>% group_by( age, gender ) %>% 
          summarise( n = n())
# A tibble: 21 × 3
# Groups:   age [13]
     age gender     n
   <dbl> <chr>  <int>
 1    18 female    15
 2    18 male       8
 3    19 female    14
 4    19 male      15
 5    20 female    11
 6    20 male      16
 7    21 female     3
 8    21 male       7
 9    22 female     1
10    22 male       5
# … with 11 more rows

⚠️You may get a warning like ‘summarise() regrouping … (override with .groups argument)’. It is a reminder to remove the group from the result set. You may ignore this or set the ‘summarise’ argument .groups=‘drop’, seel also ?summarise.

pulse %>% group_by( age, gender ) %>% 
          summarise( n = n(), .groups='drop')
# A tibble: 21 × 3
     age gender     n
   <dbl> <chr>  <int>
 1    18 female    15
 2    18 male       8
 3    19 female    14
 4    19 male      15
 5    20 female    11
 6    20 male      16
 7    21 female     3
 8    21 male       7
 9    22 female     1
10    22 male       5
# … with 11 more rows

Which function produces the same output as above given {age,gender}?

‘count’ function:

puls %>% count(age,gnder)


But of course we want to do more than just count the size of the groups. We can for example calculate the mean height and weight per {age,gender} group:

pulse %>% group_by(age,gender) %>% 
          summarise(size=n(),meanHeight=mean(height), meanWeight=mean(weight))
# A tibble: 21 × 5
# Groups:   age [13]
     age gender  size meanHeight meanWeight
   <dbl> <chr>  <int>      <dbl>      <dbl>
 1    18 female    15       168.       58.9
 2    18 male       8       183.       74.4
 3    19 female    14       160.       52  
 4    19 male      15       173.       72.8
 5    20 female    11       166.       58.0
 6    20 male      16       178.       74.5
 7    21 female     3       172        57.3
 8    21 male       7       180.       76  
 9    22 female     1       151        42  
10    22 male       5       178.       76.4
# … with 11 more rows

Queries

We now have all the tools we need to apply more complex queries on our data. For example, group per gender on those that ran and summarize on mean age, pulse1 and pulse2. First we need to filter only those who ran (see explanation on ran in pulse) and only then group and summarise:

pulse %>% filter(ran == "ran") %>% 
          group_by( gender ) %>% 
          summarise( size = n(), meanAge= mean(age), meanPluse1 = mean(pulse1), meanPulse2 = mean( pulse2 ) )
# A tibble: 2 × 5
  gender  size meanAge meanPluse1 meanPulse2
  <chr>  <int>   <dbl>      <dbl>      <dbl>
1 female    22    20.6       75.5       126.
2 male      24    19.8       75.5       128.

now for those who sat:

pulse %>% filter(ran == "sat") %>% group_by( gender ) %>% 
    summarise( size = n(), meanAge= mean(age), meanPluse1 = mean(pulse1), meanPulse2 = mean( pulse2 ) )
# A tibble: 2 × 5
  gender  size meanAge meanPluse1 meanPulse2
  <chr>  <int>   <dbl>      <dbl>      <dbl>
1 female    29    19.6       NA         NA  
2 male      35    21.9       73.3       72.3

Note that there are missing values, account for it by using na.rm=TRUE:

pulse %>% filter(ran == "sat") %>% 
          group_by( gender ) %>% 
          summarise( count = n(), meanAge= mean(age), 
               meanPluse1 = mean(pulse1,na.rm=TRUE), meanPulse2 = mean( pulse2 , na.rm=TRUE) )
# A tibble: 2 × 5
  gender count meanAge meanPluse1 meanPulse2
  <chr>  <int>   <dbl>      <dbl>      <dbl>
1 female    29    19.6       79.1       78  
2 male      35    21.9       73.3       72.3

⚠️ It is good practice to check your results. For example, the group sizes in the original survey table who ran and sat must match the sum of sizes in the different summaries shown above under ‘size’ column. For ran==“ran” summary we have 22+24=44 and for ran==“sat” summary we have 29+35=64. We can check them against totals below and we see that they do:

pulse %>%  count(ran) 
# A tibble: 2 × 2
  ran       n
  <chr> <int>
1 ran      46
2 sat      64

ungroup : remove grouping

To remove the grouping use we have ungroup function:

pulse %>% group_by(age,gender) %>% ungroup() # results in the original pulse tibble
# A tibble: 110 × 13
   id     name      height weight   age gender smokes alcohol exercise ran   pulse1 pulse2  year
   <chr>  <chr>      <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
 1 1993_A Bonnie       173     57    18 female no     yes     moderate sat       86     88  1993
 2 1993_B Melanie      179     58    19 female no     yes     moderate ran       82    150  1993
 3 1993_C Consuelo     167     62    18 female no     yes     high     ran       96    176  1993
 4 1993_D Travis       195     84    18 male   no     yes     high     sat       71     73  1993
 5 1993_E Lauri        173     64    18 female no     yes     low      sat       90     88  1993
 6 1993_F George       184     74    22 male   no     yes     low      ran       78    141  1993
 7 1993_G Cherry       162     57    20 female no     yes     moderate sat       68     72  1993
 8 1993_H Francesca    169     55    18 female no     yes     moderate sat       71     77  1993
 9 1993_I Sonja        164     56    19 female no     yes     high     sat       68     68  1993
10 1993_J Troy         168     60    23 male   no     yes     moderate ran       88    150  1993
# … with 100 more rows


Copyright © 2022 Biomedical Data Sciences (BDS) | LUMC