Descriptive statistics

summarise variable(s)

The main function for computing summaries of variables is summarise. It is used together with descriptive functions such as mean, median or sum, each of which takes a vector and returns a single value. summarise takes a tibble and one or more such summary specifications and returns a tibble with a single row.

For example, let’s say we want to know the mean height and weight of all individuals in the pulse dataset:

pulse %>% summarise(meanHeight=mean(height), meanWeight=mean(weight))
# A tibble: 1 × 2
  meanHeight meanWeight
       <dbl>      <dbl>
1       172.       66.3

The result is a single row with two variables, meanHeight and meanWeight, each holding the mean over all observations.
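
To make the vector-in, single-value-out behaviour concrete, here is a minimal sketch on a small made-up tibble. The name toy and its values are hypothetical and are not taken from the pulse dataset.

```r
library(dplyr)

# Hypothetical toy data, NOT the pulse dataset
toy <- tibble(
  height = c(160, 170, 180),
  weight = c(55, 65, 75)
)

# A descriptive function reduces a whole vector to one value
mean(toy$height)   # 170

# summarise applies such functions to columns and returns a one-row tibble
toy_summary <- toy %>%
  summarise(meanHeight = mean(height), medianWeight = median(weight))
toy_summary
```

Each argument of summarise becomes one column of the one-row result, so the number of columns equals the number of summaries requested.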

We can also summarise a variable’s range, e.g. for age:

pulse %>% summarise(minAge = min(age), maxAge=max(age)) # <=> range(pulse$age)
# A tibble: 1 × 2
  minAge maxAge
   <dbl>  <dbl>
1     18     45

n() is a convenience function that returns the total number of rows when used inside summarise:

pulse %>% summarise( count = n(), meanHeight = mean( height ) )
# A tibble: 1 × 2
  count meanHeight
  <int>      <dbl>
1   110       172.

count : frequency tables

With the count function we can count the frequency of each value of a categorical variable:

pulse %>% count(gender)   # frequency of male/female
# A tibble: 2 × 2
  gender     n
  <chr>  <int>
1 female    51
2 male      59

pulse %>% count(smokes)   # frequency of smoking habit
# A tibble: 2 × 2
  smokes     n
  <chr>  <int>
1 no        99
2 yes       11

pulse %>% count(exercise) # frequency of exercise habit
# A tibble: 3 × 2
  exercise     n
  <chr>    <int>
1 high        14
2 low         37
3 moderate    59

The result enumerates the distinct values of the variable in the first column and their frequency in a new column n.

Multiple variables are allowed. The result is then the count of each combination of values that occurs in the data, also known as a contingency table or cross table:

pulse %>% count(gender, exercise)
# A tibble: 6 × 3
  gender exercise     n
  <chr>  <chr>    <int>
1 female high         3
2 female low         20
3 female moderate    28
4 male   high        11
5 male   low         17
6 male   moderate    31

pulse %>% count(year, gender)
# A tibble: 10 × 3
    year gender     n
   <dbl> <chr>  <int>
 1  1993 female    12
 2  1993 male      14
 3  1995 female    11
 4  1995 male      11
 5  1996 female    10
 6  1996 male      11
 7  1997 female     8
 8  1997 male      15
 9  1998 female    10
10  1998 male       8
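
As a rough illustration of what count does, the sketch below builds the same frequency table explicitly with group_by and summarise. The tibble toy is hypothetical example data, not the pulse dataset.

```r
library(dplyr)

# Hypothetical toy data, NOT the pulse dataset
toy <- tibble(
  gender   = c("female", "male", "male", "female", "male"),
  exercise = c("low", "high", "low", "low", "low")
)

# Frequency of each gender/exercise combination that occurs in the data
by_count <- toy %>% count(gender, exercise)

# Roughly the same table, built step by step
by_group <- toy %>%
  group_by(gender, exercise) %>%
  summarise(n = n(), .groups = "drop")

by_count
```

In this toy data the combination female/high never occurs, so it gets no row in either table; only observed combinations are listed.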

distinct values in variables

To identify distinct values in a variable or a group of variables we use the function distinct:

pulse %>% distinct(year)
# A tibble: 5 × 1
   year
  <dbl>
1  1993
2  1995
3  1996
4  1997
5  1998

pulse %>% distinct(exercise)
# A tibble: 3 × 1
  exercise
  <chr>   
1 moderate
2 high    
3 low     

pulse %>% distinct(ran)
# A tibble: 2 × 1
  ran  
  <chr>
1 sat  
2 ran  

Again, multiple variables are allowed. To identify distinct combinations of gender and exercise:

pulse %>% distinct(gender, exercise)
# A tibble: 6 × 2
  gender exercise
  <chr>  <chr>   
1 female moderate
2 female high    
3 male   high    
4 female low     
5 male   low     
6 male   moderate

distinct produces the same combinations of variables as the count function, but without the frequency column n. Note that distinct lists the combinations in order of first appearance, whereas count sorts them.
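
The relation between the two functions can be checked on a small example. This is only a sketch: toy is hypothetical data, and setequal is used here as one way to compare two tables while ignoring row order.

```r
library(dplyr)

# Hypothetical toy data, NOT the pulse dataset
toy <- tibble(
  gender   = c("male", "female", "male", "female"),
  exercise = c("low", "high", "low", "low")
)

# distinct keeps combinations in order of first appearance
combos_distinct <- toy %>% distinct(gender, exercise)

# count sorts the combinations; drop n to keep only the combinations
combos_count <- toy %>% count(gender, exercise) %>% select(-n)

# Same rows, possibly in a different order
same_rows <- setequal(combos_distinct, combos_count)
same_rows
```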

You may also use distinct to check whether a variable uniquely identifies each observation. For example, let’s check whether all individuals in the pulse dataset have different names, that is, whether each observation is uniquely identifiable by the variable name:

pulse %>% nrow()                    # total number of rows 
[1] 110
pulse %>% distinct(name) %>% nrow() # count the number of distinct names
[1] 106

There are 106 distinct names but 110 observations in total in the pulse dataset. This can only mean that some individuals share a name, so name does not uniquely identify an observation:

nrow(pulse) == nrow( pulse %>% distinct(name)) # is 'name' unique for all observations?
[1] FALSE
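
Once we know a variable is not unique, a natural follow-up is to list the values that occur more than once. A possible sketch, using a hypothetical tibble toy rather than the pulse dataset, combines count with filter:

```r
library(dplyr)

# Hypothetical toy data, NOT the pulse dataset
toy <- tibble(
  id   = c("a", "b", "c", "d"),
  name = c("Bobby", "Travis", "Bobby", "Ursula")
)

# n_distinct counts distinct values directly
is_unique <- n_distinct(toy$name) == nrow(toy)   # FALSE

# Names that occur more than once
shared <- toy %>%
  count(name) %>%
  filter(n > 1)
shared
```

Applied to pulse, the same pattern would list the shared names together with how often each occurs.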

arrange

You may sort rows according to one or more variables with the function arrange.

Try sorting the pulse dataset by name:

pulse %>%  arrange(name) # sorts the rows by name in dictionary order 
# A tibble: 110 × 13
   id     name     height weight   age gender smokes alcohol exercise ran   pulse1 pulse2  year
   <chr>  <chr>     <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
 1 1996_C Adeline     157     41    20 female no     no      moderate ran       70     95  1996
 2 1996_P Adrian      180    102    20 male   no     yes     moderate sat       76     72  1996
 3 1997_O Albert      194    110    25 male   no     no      moderate sat       75     75  1997
 4 1993_V Arlene      140     50    34 female no     no      low      ran       70     98  1993
 5 1998_O Bettie      161     43    19 female no     no      low      sat       90     89  1998
 6 1995_F Bobby       180     85    19 male   yes    yes     moderate ran       68    125  1995
 7 1995_L Bobby       169     68    19 male   no     no      moderate sat       58     58  1995
 8 1993_A Bonnie      173     57    18 female no     yes     moderate sat       86     88  1993
 9 1996_F Brandie     171     67    18 female no     yes     low      sat       76     74  1996
10 1996_K Bridgett    160     49    19 female no     no      low      sat       80     72  1996
# … with 100 more rows

or by height:

pulse %>%  arrange(height) # numerical order
# A tibble: 110 × 13
   id     name     height weight   age gender smokes alcohol exercise ran   pulse1 pulse2  year
   <chr>  <chr>     <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
 1 1998_J Raul         68     63    19 male   no     no      moderate ran       88    136  1998
 2 1998_N Lizzie       93     27    19 female no     no      low      sat      119    120  1998
 3 1993_V Arlene      140     50    34 female no     no      low      ran       70     98  1993
 4 1997_A Katrina     151     42    22 female no     no      low      ran       85    130  1997
 5 1993_T Maura       155     50    19 female no     no      moderate sat       78     79  1993
 6 1995_N Tisha       155     49    18 female no     yes     moderate sat      104     92  1995
 7 1998_G Ursula      155     55    20 female no     yes     high     sat       82     87  1998
 8 1996_C Adeline     157     41    20 female no     no      moderate ran       70     95  1996
 9 1996_J Penelope    158     51    18 female no     no      moderate ran       68     84  1996
10 1995_G Laurie      160     57    19 female no     no      moderate ran       75    130  1995
# … with 100 more rows

By default the data is sorted in ascending order. To sort in descending order, wrap the variable in the desc function:

pulse %>%  arrange(desc(name))
# A tibble: 110 × 13
   id     name    height weight   age gender smokes alcohol exercise ran   pulse1 pulse2  year
   <chr>  <chr>    <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
 1 1997_C William    190   82      19 male   no     no      moderate sat       76     73  1997
 2 1997_F Wesley     172   53      20 male   no     no      low      ran       72    136  1997
 3 1998_G Ursula     155   55      20 female no     yes     high     sat       82     87  1998
 4 1993_X Tyrone     182   75      26 male   yes    yes     moderate sat       80     76  1993
 5 1993_J Troy       168   60      23 male   no     yes     moderate ran       88    150  1993
 6 1993_D Travis     195   84      18 male   no     yes     high     sat       71     73  1993
 7 1996_B Travis     167   70      22 male   yes    yes     low      sat       92     84  1996
 8 1995_N Tisha      155   49      18 female no     yes     moderate sat      104     92  1995
 9 1997_I Tim        170   58.5    20 male   no     no      low      sat       80     82  1997
10 1996_M Taylor     180   77      18 female no     no      moderate ran       47    136  1996
# … with 100 more rows

You may also arrange by multiple variables:

pulse %>%  arrange(height,weight)
# A tibble: 110 × 13
   id     name     height weight   age gender smokes alcohol exercise ran   pulse1 pulse2  year
   <chr>  <chr>     <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
 1 1998_J Raul         68     63    19 male   no     no      moderate ran       88    136  1998
 2 1998_N Lizzie       93     27    19 female no     no      low      sat      119    120  1998
 3 1993_V Arlene      140     50    34 female no     no      low      ran       70     98  1993
 4 1997_A Katrina     151     42    22 female no     no      low      ran       85    130  1997
 5 1995_N Tisha       155     49    18 female no     yes     moderate sat      104     92  1995
 6 1993_T Maura       155     50    19 female no     no      moderate sat       78     79  1993
 7 1998_G Ursula      155     55    20 female no     yes     high     sat       82     87  1998
 8 1996_C Adeline     157     41    20 female no     no      moderate ran       70     95  1996
 9 1996_J Penelope    158     51    18 female no     no      moderate ran       68     84  1996
10 1996_K Bridgett    160     49    19 female no     no      low      sat       80     72  1996
# … with 100 more rows

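Here the data is first ordered by height; rows with equal height are then ordered by weight. Compare the three individuals of height 155, who now appear in order of weight (49, 50, 55).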
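
Tie-breaking can be seen more easily on a handful of rows. The sketch below uses a hypothetical tibble toy (not the pulse dataset) and also combines desc with a second sorting variable.

```r
library(dplyr)

# Hypothetical toy data, NOT the pulse dataset
toy <- tibble(
  name   = c("A", "B", "C", "D"),
  height = c(155, 151, 155, 155),
  weight = c(55, 42, 49, 50)
)

# Ascending height; equal heights are ordered by ascending weight
asc <- toy %>% arrange(height, weight)
asc$name     # "B" "C" "D" "A"

# Descending height; equal heights are still ordered by ascending weight
mixed <- toy %>% arrange(desc(height), weight)
mixed$name   # "C" "D" "A" "B"
```

desc applies only to the variable it wraps, so each sorting variable can have its own direction.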



Copyright © 2023 Biomedical Data Sciences (BDS) | LUMC