Use filter to select observations (rows) from a tibble.

Logical conditions

In the previous section we learned how to select variables (columns) with all its values using the select function. In this section we are interested in a collection of observations (rows). For this we have the function filter which takes as its first argument the tibble followed by a logical condition describing the variable values constraints.

In the context of filter function a logical condition is a way to express whether some statement about an observation is true or false, if true then that observation is returned otherwise it is ignored. Note that the condition is applied to all observation in the tibble.

For example, let’s say we want to know which individuals in our pulse dataset are taller than 190 cm? You can achieve this with filter as follows:

filter(pulse, height>190) # filter all observations with height greater than 190 cm 
# A tibble: 5 × 13
  id     name        height weight   age gender smokes alcohol exercise ran   pulse1 pulse2  year
  <chr>  <chr>        <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
1 1993_D Travis         195     84    18 male   no     yes     high     sat       71     73  1993
2 1997_K John           191     78    19 male   no     yes     high     ran       68    136  1997
3 1997_O Albert         194    110    25 male   no     no      moderate sat       75     75  1997
4 1997_Q Lance          192    105    21 male   no     no      moderate sat       80     73  1997
5 1997_R Christopher    194     95    18 male   no     yes     moderate ran       84    140  1997

or another example would be, who’s weight is exactly 55 kg?

filter(pulse, weight==55) # filter all observations with weight equal to 55 kg 
# A tibble: 5 × 13
  id     name      height weight   age gender smokes alcohol exercise ran   pulse1 pulse2  year
  <chr>  <chr>      <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
1 1993_H Francesca    169     55    18 female no     yes     moderate sat       71     77  1993
2 1993_W Glenna       163     55    20 female no     no      low      sat       78     74  1993
3 1995_J Gwen         163     55    20 female no     no      moderate ran       70    119  1995
4 1998_G Ursula       155     55    20 female no     yes     high     sat       82     87  1998
5 1998_I Eleanor      168     55    24 female no     no      moderate sat       74     70  1998

filter uses logical vectors in the background to achieve the result. You’ve already seen an example of a logical vector in the section on Vectors. The result of a comparison with relational operators {==, !=, <, <=, >, >=} is a logical vector.

‘==’ stands for ‘equal to’ and ‘!=’ is ‘not equal to’

In the examples above the comparisons height>190 and weight==55 are in fact logical vectors with the same length as the number of observations (rows) in the pulse dataset. The function filter takes the logical vector and returns only the rows in positions for which the value is TRUE.

Multiple conditions : and (‘&’), or (‘|’)

In the previous examples with filter we only had a single condition. We can expand the conditions to ask more complex questions. For example which males weigh less than 70? Here we have two conditions:

And : ‘&’

Both these conditions must be met, with other words for each observation gender=="male" and weight<70 must be true and this is expressed with the symbol & :

filter(pulse, gender=="male" & weight<70)
# A tibble: 24 × 13
   id     name      height weight   age gender smokes alcohol exercise ran   pulse1 pulse2  year
   <chr>  <chr>      <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
 1 1993_J Troy         168     60    23 male   no     yes     moderate ran       88    150  1993
 2 1993_L Frederick    178     58    19 male   no     no      low      sat       74     76  1993
 3 1993_M Justin       170     68    22 male   yes    yes     moderate sat       70     71  1993
 4 1993_N Ernest       187     59    18 male   no     yes     high     sat       78     82  1993
 5 1993_Q Leslie       170     56    19 male   no     no      low      sat       64     63  1993
 6 1993_U Jerome       175     60    19 male   no     no      low      sat       88     86  1993
 7 1993_Y John         176     59    19 male   no     no      moderate sat       68     69  1993
 8 1995_A Pedro        170     60    18 male   no     yes     moderate sat       62     59  1995
 9 1995_C Oscar        189     60    19 male   no     yes     moderate ran       78    168  1995
10 1995_I Marc         175     65    19 male   no     yes     moderate ran       60    104  1995
# ℹ 14 more rows

Another example: select females with high frequency of exercise:

filter(pulse, gender=="female" & exercise=="high")
# A tibble: 3 × 13
  id     name     height weight   age gender smokes alcohol exercise ran   pulse1 pulse2  year
  <chr>  <chr>     <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
1 1993_C Consuelo    167     62    18 female no     yes     high     ran       96    176  1993
2 1993_I Sonja       164     56    19 female no     yes     high     sat       68     68  1993
3 1998_G Ursula      155     55    20 female no     yes     high     sat       82     87  1998

Or : ‘|’

The or conditional operator is true when at least one of the conditions is true. For example, list all rows with individuals who drink or smoke or both:

filter(pulse, alcohol=="yes" | smokes=="yes")
# A tibble: 70 × 13
   id     name      height weight   age gender smokes alcohol exercise ran   pulse1 pulse2  year
   <chr>  <chr>      <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
 1 1993_A Bonnie       173     57    18 female no     yes     moderate sat       86     88  1993
 2 1993_B Melanie      179     58    19 female no     yes     moderate ran       82    150  1993
 3 1993_C Consuelo     167     62    18 female no     yes     high     ran       96    176  1993
 4 1993_D Travis       195     84    18 male   no     yes     high     sat       71     73  1993
 5 1993_E Lauri        173     64    18 female no     yes     low      sat       90     88  1993
 6 1993_F George       184     74    22 male   no     yes     low      ran       78    141  1993
 7 1993_G Cherry       162     57    20 female no     yes     moderate sat       68     72  1993
 8 1993_H Francesca    169     55    18 female no     yes     moderate sat       71     77  1993
 9 1993_I Sonja        164     56    19 female no     yes     high     sat       68     68  1993
10 1993_J Troy         168     60    23 male   no     yes     moderate ran       88    150  1993
# ℹ 60 more rows

There are 70 observations who at least do one of the two: drink or smoke.

Which group is missing from the result above?

Those that neither smoke nor drink.


Another example, filter rows from years 1995 and 1997

filter(pulse, year==1995 | year==1997)
# A tibble: 45 × 13
   id     name   height weight   age gender smokes alcohol exercise ran   pulse1 pulse2  year
   <chr>  <chr>   <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
 1 1995_A Pedro     170     60    18 male   no     yes     moderate sat       62     59  1995
 2 1995_B Olga      172     60    21 female no     no      low      sat       81     79  1995
 3 1995_C Oscar     189     60    19 male   no     yes     moderate ran       78    168  1995
 4 1995_D Kelli     178     56    21 female no     yes     moderate ran       86    150  1995
 5 1995_E Scott     175     75    20 male   no     yes     moderate ran       59     92  1995
 6 1995_F Bobby     180     85    19 male   yes    yes     moderate ran       68    125  1995
 7 1995_G Laurie    160     57    19 female no     no      moderate ran       75    130  1995
 8 1995_H Eliza     164     66    23 female no     no      low      ran       74    168  1995
 9 1995_I Marc      175     65    19 male   no     yes     moderate ran       60    104  1995
10 1995_J Gwen      163     55    20 female no     no      moderate ran       70    119  1995
# ℹ 35 more rows

Here we have the same variable year in our | construct.

What is wrong with this command: filter(pulse, year==1995 & year==1997) ?

An observation in a variable can not take two values at the same time.


Helper functions

is.na()

You have already seen is.na in action with vectors. It can be used in the context of filter as a logical condition to test for missing value: NA.

Take for example the pulse data, it has missing values in pulse1 and pulse2. We can find the row(s) with missing value for pulse1 with:

filter(pulse, is.na(pulse1))
# A tibble: 1 × 13
  id     name    height weight   age gender smokes alcohol exercise ran   pulse1 pulse2  year
  <chr>  <chr>    <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
1 1997_G Camille    173     64    20 female no     yes     moderate sat       NA     NA  1997

We can see there is a single row with missing value pulse1. What about pulse2?

filter(pulse, is.na(pulse2))
# A tibble: 1 × 13
  id     name    height weight   age gender smokes alcohol exercise ran   pulse1 pulse2  year
  <chr>  <chr>    <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
1 1997_G Camille    173     64    20 female no     yes     moderate sat       NA     NA  1997

Also a single row and it looks like to be the same row as the one with missing value for pulse1

Can you think of a logical condition that confirms that the missing pulse1 and pulse2 values are actually in the same observation?

filter(pulse, is.na(pulse1) | is.na(pulse2)) results in a single row.


Equality test with NA, i.e == NA, always results in NA. Therefore the convinient function is.na().

Negation operator: !

If you are interested only in observation that do not contain missing values, in this case in pulse1 and pulse2:

filter(pulse, ! is.na(pulse1))
# A tibble: 109 × 13
   id     name      height weight   age gender smokes alcohol exercise ran   pulse1 pulse2  year
   <chr>  <chr>      <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
 1 1993_A Bonnie       173     57    18 female no     yes     moderate sat       86     88  1993
 2 1993_B Melanie      179     58    19 female no     yes     moderate ran       82    150  1993
 3 1993_C Consuelo     167     62    18 female no     yes     high     ran       96    176  1993
 4 1993_D Travis       195     84    18 male   no     yes     high     sat       71     73  1993
 5 1993_E Lauri        173     64    18 female no     yes     low      sat       90     88  1993
 6 1993_F George       184     74    22 male   no     yes     low      ran       78    141  1993
 7 1993_G Cherry       162     57    20 female no     yes     moderate sat       68     72  1993
 8 1993_H Francesca    169     55    18 female no     yes     moderate sat       71     77  1993
 9 1993_I Sonja        164     56    19 female no     yes     high     sat       68     68  1993
10 1993_J Troy         168     60    23 male   no     yes     moderate ran       88    150  1993
# ℹ 99 more rows

between(x, left, right)

With between function it is possible to filter rows based on variable intervals:

# filter rows with height >=165 and height <= 166
filter(pulse, between(height, 165, 166))
# A tibble: 6 × 13
  id     name     height weight   age gender smokes alcohol exercise ran   pulse1 pulse2  year
  <chr>  <chr>     <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
1 1993_S Nicole      166     56    21 female yes    no      moderate sat       83     79  1993
2 1995_M Desiree     165     63    18 female no     yes     moderate sat       84     84  1995
3 1996_R Lucy        166     50    19 female no     yes     low      ran       76    132  1996
4 1997_E Julianne    165     48    19 female no     no      low      sat       83     84  1997
5 1998_C Dona        165     58    23 female no     yes     low      sat       64     68  1998
6 1998_H Haley       165     60    19 female yes    yes     low      ran       88    120  1998

Note that example above can be explicitly written down with & :

filter(pulse, height >=  165 &  height <= 166 )
# A tibble: 6 × 13
  id     name     height weight   age gender smokes alcohol exercise ran   pulse1 pulse2  year
  <chr>  <chr>     <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
1 1993_S Nicole      166     56    21 female yes    no      moderate sat       83     79  1993
2 1995_M Desiree     165     63    18 female no     yes     moderate sat       84     84  1995
3 1996_R Lucy        166     50    19 female no     yes     low      ran       76    132  1996
4 1997_E Julianne    165     48    19 female no     no      low      sat       83     84  1997
5 1998_C Dona        165     58    23 female no     yes     low      sat       64     68  1998
6 1998_H Haley       165     60    19 female yes    yes     low      ran       88    120  1998

Both are valid solutions.



Copyright © 2024 Biomedical Data Sciences (BDS) | LUMC