Descriptive statistics
summarise variable(s)
The main function for calculating summaries on variables is summarise. Examples of descriptive functions are mean, median, sum etc. Each of these functions consumes a vector and produces a single value. The summarise function takes a tibble along with a specification of descriptives and produces a single row.
For example, let’s say we want to know the mean height and weight of all individuals in the pulse dataset:
pulse %>% summarise(meanHeight=mean(height), meanWeight=mean(weight))
# A tibble: 1 × 2
meanHeight meanWeight
<dbl> <dbl>
1 172. 66.3
The result is a single row with two variables, meanHeight and meanWeight, with the corresponding mean values of all observations.
We can also summarise a variable’s range, e.g. the range of age:
pulse %>% summarise(minAge = min(age), maxAge=max(age)) # <=> range(pulse$age)
# A tibble: 1 × 2
minAge maxAge
<dbl> <dbl>
1 18 45
n() is a convenient function for calculating the total number of rows in the summarise context:
pulse %>% summarise( count = n(), meanHeight = mean( height ) )
# A tibble: 1 × 2
count meanHeight
<int> <dbl>
1 110 172.
count: frequency tables
With the count function we can count the frequency of values in a categorical variable:
pulse %>% count(gender) # frequency of male/female
# A tibble: 2 × 2
gender n
<chr> <int>
1 female 51
2 male 59
pulse %>% count(smokes) # frequency of smoking habit
# A tibble: 2 × 2
smokes n
<chr> <int>
1 no 99
2 yes 11
pulse %>% count(exercise) # frequency of exercise habit
# A tibble: 3 × 2
exercise n
<chr> <int>
1 high 14
2 low 37
3 moderate 59
The result enumerates the distinct values of the variable in the first column and their frequency in a new column n.
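To connect count with the summarise and n() functions described earlier, here is a minimal sketch showing that a frequency table can also be built by hand. The tibble toy is a small hypothetical stand-in for the pulse dataset, and group_by is a dplyr function that is not introduced in this section; it is used here only for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for the pulse dataset
toy <- tibble(
  gender   = c("female", "male", "male", "female", "male"),
  exercise = c("low", "high", "low", "low", "moderate")
)

# Frequency table with count()
by_count <- toy %>% count(gender)

# The same table built by hand: group the rows, then count the rows per group
by_summarise <- toy %>%
  group_by(gender) %>%
  summarise(n = n())

by_count
by_summarise
```

Both tibbles should list the same values of gender with the same frequencies, which suggests reading count as a shorthand for this grouped summary.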
Multiple variables are allowed; the result is then the count of each combination of values that occurs in the data, also known as a contingency table or cross table:
pulse %>% count(gender, exercise)
# A tibble: 6 × 3
gender exercise n
<chr> <chr> <int>
1 female high 3
2 female low 20
3 female moderate 28
4 male high 11
5 male low 17
6 male moderate 31
pulse %>% count(year, gender)
# A tibble: 10 × 3
year gender n
<dbl> <chr> <int>
1 1993 female 12
2 1993 male 14
3 1995 female 11
4 1995 male 11
5 1996 female 10
6 1996 male 11
7 1997 female 8
8 1997 male 15
9 1998 female 10
10 1998 male 8
distinct values in variables
To identify distinct values in a variable or a group of variables we use the function distinct:
pulse %>% distinct(year)
# A tibble: 5 × 1
year
<dbl>
1 1993
2 1995
3 1996
4 1997
5 1998
pulse %>% distinct(exercise)
# A tibble: 3 × 1
exercise
<chr>
1 moderate
2 high
3 low
pulse %>% distinct(ran)
# A tibble: 2 × 1
ran
<chr>
1 sat
2 ran
Again, multiple variables are allowed. To identify distinct combinations of gender and exercise:
pulse %>% distinct(gender, exercise)
# A tibble: 6 × 2
gender exercise
<chr> <chr>
1 female moderate
2 female high
3 male high
4 female low
5 male low
6 male moderate
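‘distinct’ produces the same combinations of variables as the ‘count’ function, except without the frequency column ‘n’. The rows may appear in a different order, as a comparison with the ‘count’ output for gender and exercise shows.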
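The relation between the two functions can be checked with a short sketch. The tibble toy is hypothetical example data, not the pulse dataset, and select is a dplyr function used here only to drop the frequency column. The distinct result is sorted with arrange so that its row order matches that of count.

```r
library(dplyr)

# Hypothetical toy data standing in for the pulse dataset
toy <- tibble(
  gender   = c("female", "male", "male", "female", "male"),
  exercise = c("low", "high", "low", "low", "moderate")
)

# Combinations from count(), with the frequency column n dropped
from_count <- toy %>%
  count(gender, exercise) %>%
  select(-n)

# Combinations from distinct(), sorted so the row order matches count()
from_distinct <- toy %>%
  distinct(gender, exercise) %>%
  arrange(gender, exercise)

from_count
from_distinct

# Compare the two results column by column
same_gender   <- identical(from_count$gender, from_distinct$gender)
same_exercise <- identical(from_count$exercise, from_distinct$exercise)
same_gender && same_exercise
```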
You may also use distinct to check whether a variable has a unique value for each observation. Let’s for example check whether all individuals in the pulse dataset have different names or, more precisely, whether each observation is uniquely identifiable by the variable name:
pulse %>% nrow() # total number of rows
[1] 110
pulse %>% distinct(name) %>% nrow() # count the number of distinct names
[1] 106
There are 106 distinct names but 110 observations in total in the pulse dataset. This can only mean that some individuals in the pulse dataset share a name:
nrow(pulse) == nrow( pulse %>% distinct(name)) # is 'name' unique for all observations?
[1] FALSE
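To see which names are shared, one possible approach is to combine count with a filter on the frequency column. The sketch below uses a small hypothetical tibble toy rather than the pulse dataset, and filter is a dplyr function that is not introduced in this section.

```r
library(dplyr)

# Hypothetical toy data: two names occur more than once
toy <- tibble(
  id   = c("A", "B", "C", "D", "E"),
  name = c("Bobby", "Travis", "Bobby", "Ursula", "Travis")
)

# Count how often each name occurs and keep only names that occur more than once
shared <- toy %>%
  count(name) %>%
  filter(n > 1)

shared
```

Under these assumptions, the same pipeline applied to pulse would list the names that are responsible for the difference between 110 rows and 106 distinct names.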
arrange
You may sort rows according to one or more variables with the function arrange.
Try sorting the pulse dataset by name:
pulse %>% arrange(name) # sorts the rows by name in dictionary order
# A tibble: 110 × 13
id name height weight age gender smokes alcohol exercise ran pulse1 pulse2 year
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1996_C Adeline 157 41 20 female no no moderate ran 70 95 1996
2 1996_P Adrian 180 102 20 male no yes moderate sat 76 72 1996
3 1997_O Albert 194 110 25 male no no moderate sat 75 75 1997
4 1993_V Arlene 140 50 34 female no no low ran 70 98 1993
5 1998_O Bettie 161 43 19 female no no low sat 90 89 1998
6 1995_F Bobby 180 85 19 male yes yes moderate ran 68 125 1995
7 1995_L Bobby 169 68 19 male no no moderate sat 58 58 1995
8 1993_A Bonnie 173 57 18 female no yes moderate sat 86 88 1993
9 1996_F Brandie 171 67 18 female no yes low sat 76 74 1996
10 1996_K Bridgett 160 49 19 female no no low sat 80 72 1996
# ℹ 100 more rows
or by height:
pulse %>% arrange(height) # numerical order
# A tibble: 110 × 13
id name height weight age gender smokes alcohol exercise ran pulse1 pulse2 year
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1998_J Raul 68 63 19 male no no moderate ran 88 136 1998
2 1998_N Lizzie 93 27 19 female no no low sat 119 120 1998
3 1993_V Arlene 140 50 34 female no no low ran 70 98 1993
4 1997_A Katrina 151 42 22 female no no low ran 85 130 1997
5 1993_T Maura 155 50 19 female no no moderate sat 78 79 1993
6 1995_N Tisha 155 49 18 female no yes moderate sat 104 92 1995
7 1998_G Ursula 155 55 20 female no yes high sat 82 87 1998
8 1996_C Adeline 157 41 20 female no no moderate ran 70 95 1996
9 1996_J Penelope 158 51 18 female no no moderate ran 68 84 1996
10 1995_G Laurie 160 57 19 female no no moderate ran 75 130 1995
# ℹ 100 more rows
By default the data is sorted in ascending order; to sort in descending order, use the desc function:
pulse %>% arrange(desc(name))
# A tibble: 110 × 13
id name height weight age gender smokes alcohol exercise ran pulse1 pulse2 year
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1997_C William 190 82 19 male no no moderate sat 76 73 1997
2 1997_F Wesley 172 53 20 male no no low ran 72 136 1997
3 1998_G Ursula 155 55 20 female no yes high sat 82 87 1998
4 1993_X Tyrone 182 75 26 male yes yes moderate sat 80 76 1993
5 1993_J Troy 168 60 23 male no yes moderate ran 88 150 1993
6 1993_D Travis 195 84 18 male no yes high sat 71 73 1993
7 1996_B Travis 167 70 22 male yes yes low sat 92 84 1996
8 1995_N Tisha 155 49 18 female no yes moderate sat 104 92 1995
9 1997_I Tim 170 58.5 20 male no no low sat 80 82 1997
10 1996_M Taylor 180 77 18 female no no moderate ran 47 136 1996
# ℹ 100 more rows
You may also arrange by multiple variables:
pulse %>% arrange(height,weight)
# A tibble: 110 × 13
id name height weight age gender smokes alcohol exercise ran pulse1 pulse2 year
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1998_J Raul 68 63 19 male no no moderate ran 88 136 1998
2 1998_N Lizzie 93 27 19 female no no low sat 119 120 1998
3 1993_V Arlene 140 50 34 female no no low ran 70 98 1993
4 1997_A Katrina 151 42 22 female no no low ran 85 130 1997
5 1995_N Tisha 155 49 18 female no yes moderate sat 104 92 1995
6 1993_T Maura 155 50 19 female no no moderate sat 78 79 1993
7 1998_G Ursula 155 55 20 female no yes high sat 82 87 1998
8 1996_C Adeline 157 41 20 female no no moderate ran 70 95 1996
9 1996_J Penelope 158 51 18 female no no moderate ran 68 84 1996
10 1996_K Bridgett 160 49 19 female no no low sat 80 72 1996
# ℹ 100 more rows
Here the data is first ordered by height and then, for rows with equal height, by weight.
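Ascending and descending order can also be mixed within one call to arrange. The following is a minimal sketch on a hypothetical tibble toy (not the pulse dataset): rows are ordered by ascending height, and rows with equal height are ordered by descending weight.

```r
library(dplyr)

# Hypothetical toy data with three rows sharing the same height
toy <- tibble(
  name   = c("a", "b", "c", "d"),
  height = c(155, 151, 155, 155),
  weight = c(50, 42, 49, 55)
)

# Ascending by height; ties in height are broken by descending weight
sorted <- toy %>% arrange(height, desc(weight))

sorted
```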
Copyright © 2024 Biomedical Data Sciences (BDS) | LUMC