Use
filter
to select observations (rows) from a tibble.
In the previous section we learned how to select variables (columns)
with all its values using the select
function. In this
section we are interested in a collection of observations (rows). For
this we have the function filter
which takes as its first
argument the tibble followed by a logical condition describing
the variable values constraints.
In the context of filter
function a logical condition is
a way to express whether some statement about an observation is true or
false, if true then that observation is returned otherwise it is
ignored. Note that the condition is applied to all observation in the
tibble.
For example, let’s say we want to know which individuals in our
pulse dataset are taller than 190 cm? You can achieve this with
filter
as follows:
filter(pulse, height>190) # filter all observations with height greater than 190 cm
# A tibble: 5 × 13
id name height weight age gender smokes alcohol exercise ran pulse1 pulse2 year
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1993_D Travis 195 84 18 male no yes high sat 71 73 1993
2 1997_K John 191 78 19 male no yes high ran 68 136 1997
3 1997_O Albert 194 110 25 male no no moderate sat 75 75 1997
4 1997_Q Lance 192 105 21 male no no moderate sat 80 73 1997
5 1997_R Christopher 194 95 18 male no yes moderate ran 84 140 1997
or another example would be, who’s weight is exactly 55 kg?
filter(pulse, weight==55) # filter all observations with weight equal to 55 kg
# A tibble: 5 × 13
id name height weight age gender smokes alcohol exercise ran pulse1 pulse2 year
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1993_H Francesca 169 55 18 female no yes moderate sat 71 77 1993
2 1993_W Glenna 163 55 20 female no no low sat 78 74 1993
3 1995_J Gwen 163 55 20 female no no moderate ran 70 119 1995
4 1998_G Ursula 155 55 20 female no yes high sat 82 87 1998
5 1998_I Eleanor 168 55 24 female no no moderate sat 74 70 1998
filter
uses logical vectors in the background
to achieve the result. You’ve already seen an example of a
logical vector
in the section on Vectors. The result of a
comparison with relational operators {==
, !=
,
<
, <=
, >
,
>=
} is a logical vector.
‘==’ stands for ‘equal to’ and ‘!=’ is ‘not equal
to’
In the examples above the comparisons height>190
and
weight==55
are in fact logical vectors with the same length
as the number of observations (rows) in the pulse dataset. The function
filter
takes the logical vector and returns only the rows
in positions for which the value is TRUE.
In the previous examples with filter
we only had a
single condition. We can expand the conditions to ask more complex
questions. For example which males weigh less than 70? Here we
have two conditions:
gender=="male"
: male observationsweight<70
: weigh less than 70And : ‘&’
Both these conditions must be met, with other words for each
observation gender=="male"
and
weight<70
must be true and this is expressed with the
symbol & :
filter(pulse, gender=="male" & weight<70)
# A tibble: 24 × 13
id name height weight age gender smokes alcohol exercise ran pulse1 pulse2 year
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1993_J Troy 168 60 23 male no yes moderate ran 88 150 1993
2 1993_L Frederick 178 58 19 male no no low sat 74 76 1993
3 1993_M Justin 170 68 22 male yes yes moderate sat 70 71 1993
4 1993_N Ernest 187 59 18 male no yes high sat 78 82 1993
5 1993_Q Leslie 170 56 19 male no no low sat 64 63 1993
6 1993_U Jerome 175 60 19 male no no low sat 88 86 1993
7 1993_Y John 176 59 19 male no no moderate sat 68 69 1993
8 1995_A Pedro 170 60 18 male no yes moderate sat 62 59 1995
9 1995_C Oscar 189 60 19 male no yes moderate ran 78 168 1995
10 1995_I Marc 175 65 19 male no yes moderate ran 60 104 1995
# ℹ 14 more rows
Another example: select females with high frequency of exercise:
filter(pulse, gender=="female" & exercise=="high")
# A tibble: 3 × 13
id name height weight age gender smokes alcohol exercise ran pulse1 pulse2 year
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1993_C Consuelo 167 62 18 female no yes high ran 96 176 1993
2 1993_I Sonja 164 56 19 female no yes high sat 68 68 1993
3 1998_G Ursula 155 55 20 female no yes high sat 82 87 1998
Or : ‘|’
The or conditional operator is true when at least one of the conditions is true. For example, list all rows with individuals who drink or smoke or both:
filter(pulse, alcohol=="yes" | smokes=="yes")
# A tibble: 70 × 13
id name height weight age gender smokes alcohol exercise ran pulse1 pulse2 year
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1993_A Bonnie 173 57 18 female no yes moderate sat 86 88 1993
2 1993_B Melanie 179 58 19 female no yes moderate ran 82 150 1993
3 1993_C Consuelo 167 62 18 female no yes high ran 96 176 1993
4 1993_D Travis 195 84 18 male no yes high sat 71 73 1993
5 1993_E Lauri 173 64 18 female no yes low sat 90 88 1993
6 1993_F George 184 74 22 male no yes low ran 78 141 1993
7 1993_G Cherry 162 57 20 female no yes moderate sat 68 72 1993
8 1993_H Francesca 169 55 18 female no yes moderate sat 71 77 1993
9 1993_I Sonja 164 56 19 female no yes high sat 68 68 1993
10 1993_J Troy 168 60 23 male no yes moderate ran 88 150 1993
# ℹ 60 more rows
There are 70 observations who at least do one of the two: drink or smoke.
AnswerWhich group is missing from the result above?
Another example, filter rows from years 1995 and 1997
filter(pulse, year==1995 | year==1997)
# A tibble: 45 × 13
id name height weight age gender smokes alcohol exercise ran pulse1 pulse2 year
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1995_A Pedro 170 60 18 male no yes moderate sat 62 59 1995
2 1995_B Olga 172 60 21 female no no low sat 81 79 1995
3 1995_C Oscar 189 60 19 male no yes moderate ran 78 168 1995
4 1995_D Kelli 178 56 21 female no yes moderate ran 86 150 1995
5 1995_E Scott 175 75 20 male no yes moderate ran 59 92 1995
6 1995_F Bobby 180 85 19 male yes yes moderate ran 68 125 1995
7 1995_G Laurie 160 57 19 female no no moderate ran 75 130 1995
8 1995_H Eliza 164 66 23 female no no low ran 74 168 1995
9 1995_I Marc 175 65 19 male no yes moderate ran 60 104 1995
10 1995_J Gwen 163 55 20 female no no moderate ran 70 119 1995
# ℹ 35 more rows
Here we have the same variable year
in our
|
construct.
AnswerWhat is wrong with this command: filter(pulse, year==1995 & year==1997) ?
is.na()
You have already seen is.na
in action with vectors. It can be used in the
context of filter as a logical condition to test for missing value:
NA
.
Take for example the pulse data, it has missing values in
pulse1
and pulse2
. We can find the row(s) with
missing value for pulse1
with:
filter(pulse, is.na(pulse1))
# A tibble: 1 × 13
id name height weight age gender smokes alcohol exercise ran pulse1 pulse2 year
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1997_G Camille 173 64 20 female no yes moderate sat NA NA 1997
We can see there is a single row with missing value
pulse1
. What about pulse2
?
filter(pulse, is.na(pulse2))
# A tibble: 1 × 13
id name height weight age gender smokes alcohol exercise ran pulse1 pulse2 year
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1997_G Camille 173 64 20 female no yes moderate sat NA NA 1997
Also a single row and it looks like to be the same row as the one
with missing value for pulse1
AnswerCan you think of a logical condition that confirms that the missing
pulse1
andpulse2
values are actually in the same observation?
Equality test with NA, i.e
Negation operator: !
If you are interested only in observation that do not
contain missing values, in this case in pulse1
and
pulse2
:
filter(pulse, ! is.na(pulse1))
# A tibble: 109 × 13
id name height weight age gender smokes alcohol exercise ran pulse1 pulse2 year
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1993_A Bonnie 173 57 18 female no yes moderate sat 86 88 1993
2 1993_B Melanie 179 58 19 female no yes moderate ran 82 150 1993
3 1993_C Consuelo 167 62 18 female no yes high ran 96 176 1993
4 1993_D Travis 195 84 18 male no yes high sat 71 73 1993
5 1993_E Lauri 173 64 18 female no yes low sat 90 88 1993
6 1993_F George 184 74 22 male no yes low ran 78 141 1993
7 1993_G Cherry 162 57 20 female no yes moderate sat 68 72 1993
8 1993_H Francesca 169 55 18 female no yes moderate sat 71 77 1993
9 1993_I Sonja 164 56 19 female no yes high sat 68 68 1993
10 1993_J Troy 168 60 23 male no yes moderate ran 88 150 1993
# ℹ 99 more rows
between(x, left, right)
With between
function it is possible to filter rows
based on variable intervals:
# filter rows with height >=165 and height <= 166
filter(pulse, between(height, 165, 166))
# A tibble: 6 × 13
id name height weight age gender smokes alcohol exercise ran pulse1 pulse2 year
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1993_S Nicole 166 56 21 female yes no moderate sat 83 79 1993
2 1995_M Desiree 165 63 18 female no yes moderate sat 84 84 1995
3 1996_R Lucy 166 50 19 female no yes low ran 76 132 1996
4 1997_E Julianne 165 48 19 female no no low sat 83 84 1997
5 1998_C Dona 165 58 23 female no yes low sat 64 68 1998
6 1998_H Haley 165 60 19 female yes yes low ran 88 120 1998
Note that example above can be explicitly written down with
&
:
filter(pulse, height >= 165 & height <= 166 )
# A tibble: 6 × 13
id name height weight age gender smokes alcohol exercise ran pulse1 pulse2 year
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1993_S Nicole 166 56 21 female yes no moderate sat 83 79 1993
2 1995_M Desiree 165 63 18 female no yes moderate sat 84 84 1995
3 1996_R Lucy 166 50 19 female no yes low ran 76 132 1996
4 1997_E Julianne 165 48 19 female no no low sat 83 84 1997
5 1998_C Dona 165 58 23 female no yes low sat 64 68 1998
6 1998_H Haley 165 60 19 female yes yes low ran 88 120 1998
Both are valid solutions.
Copyright © 2023 Biomedical Data Sciences (BDS) | LUMC