Use the select function to select variables (columns) from a tibble.
Given a tibble, select can be used to pick a subset of variables, drop variables, rename them and change their order.
Let’s take the pulse dataset:
pulse
# A tibble: 110 × 13
id name height weight age gender smokes alcohol exercise ran pulse1 pulse2 year
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1993_A Bonnie 173 57 18 female no yes moderate sat 86 88 1993
2 1993_B Melanie 179 58 19 female no yes moderate ran 82 150 1993
3 1993_C Consuelo 167 62 18 female no yes high ran 96 176 1993
4 1993_D Travis 195 84 18 male no yes high sat 71 73 1993
5 1993_E Lauri 173 64 18 female no yes low sat 90 88 1993
6 1993_F George 184 74 22 male no yes low ran 78 141 1993
7 1993_G Cherry 162 57 20 female no yes moderate sat 68 72 1993
8 1993_H Francesca 169 55 18 female no yes moderate sat 71 77 1993
9 1993_I Sonja 164 56 19 female no yes high sat 68 68 1993
10 1993_J Troy 168 60 23 male no yes moderate ran 88 150 1993
# … with 100 more rows
select takes a tibble as its first argument, followed by a comma-separated list of variables of your choice, and returns a tibble with those chosen variables:
select(pulse, name, age)
# A tibble: 110 × 2
name age
<chr> <dbl>
1 Bonnie 18
2 Melanie 19
3 Consuelo 18
4 Travis 18
5 Lauri 18
6 George 22
7 Cherry 20
8 Francesca 18
9 Sonja 19
10 Troy 23
# … with 100 more rows
After this selection, does the pulse tibble still contain the variables ‘name’ and ‘age’?
If you want to keep your selection as a separate tibble, you’ll need to assign the result to a new variable, e.g. pulse_name_age:
pulse_name_age <- select(pulse, name, age)
pulse_name_age
# A tibble: 110 × 2
name age
<chr> <dbl>
1 Bonnie 18
2 Melanie 19
3 Consuelo 18
4 Travis 18
5 Lauri 18
6 George 22
7 Cherry 20
8 Francesca 18
9 Sonja 19
10 Troy 23
# … with 100 more rows
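As a minimal sketch of the point above, the following uses a small stand-in tibble, here called pulse_toy (a hypothetical name, built from the first three rows shown earlier with only a few of the columns), to check that select returns a new tibble and leaves its input unchanged:

```r
library(dplyr)

# pulse_toy is an illustrative stand-in, not the real pulse data
pulse_toy <- tibble(
  id     = c("1993_A", "1993_B", "1993_C"),
  name   = c("Bonnie", "Melanie", "Consuelo"),
  height = c(173, 179, 167),
  age    = c(18, 19, 18)
)

# Without assignment the selection is only printed, not stored
select(pulse_toy, name, age)

# Assign the result to keep it as a separate tibble
toy_name_age <- select(pulse_toy, name, age)

names(toy_name_age)  # only the selected variables
names(pulse_toy)     # the original tibble still has all four variables
```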
The order of the selected variables is reflected in the resulting tibble:
select(pulse, age, name)
# A tibble: 110 × 2
age name
<dbl> <chr>
1 18 Bonnie
2 19 Melanie
3 18 Consuelo
4 18 Travis
5 18 Lauri
6 22 George
7 20 Cherry
8 18 Francesca
9 19 Sonja
10 23 Troy
# … with 100 more rows
You may also deselect variables, in other words keep the complement of your selection. This is done with the - sign:
select(pulse, -smokes, -alcohol)
# A tibble: 110 × 11
id name height weight age gender exercise ran pulse1 pulse2 year
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1993_A Bonnie 173 57 18 female moderate sat 86 88 1993
2 1993_B Melanie 179 58 19 female moderate ran 82 150 1993
3 1993_C Consuelo 167 62 18 female high ran 96 176 1993
4 1993_D Travis 195 84 18 male high sat 71 73 1993
5 1993_E Lauri 173 64 18 female low sat 90 88 1993
6 1993_F George 184 74 22 male low ran 78 141 1993
7 1993_G Cherry 162 57 20 female moderate sat 68 72 1993
8 1993_H Francesca 169 55 18 female moderate sat 71 77 1993
9 1993_I Sonja 164 56 19 female high sat 68 68 1993
10 1993_J Troy 168 60 23 male moderate ran 88 150 1993
# … with 100 more rows
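When several variables are dropped at once, they can also be grouped with c() and negated together. The sketch below assumes a small stand-in tibble called pulse_toy (a hypothetical name, using a few columns from the rows shown above) and suggests that both spellings give the same columns:

```r
library(dplyr)

# pulse_toy is an illustrative stand-in, not the real pulse data
pulse_toy <- tibble(
  name    = c("Bonnie", "Melanie", "Consuelo"),
  age     = c(18, 19, 18),
  smokes  = c("no", "no", "no"),
  alcohol = c("yes", "yes", "yes")
)

# Drop the variables one by one ...
dropped_separately <- select(pulse_toy, -smokes, -alcohol)

# ... or group them with c() and negate the whole group
dropped_together <- select(pulse_toy, -c(smokes, alcohol))

names(dropped_separately)
names(dropped_together)
```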
While selecting, it is also possible to rename the variables, using the form new_name = old_name:
select(pulse, FirstName = name, Age = age)
# A tibble: 110 × 2
FirstName Age
<chr> <dbl>
1 Bonnie 18
2 Melanie 19
3 Consuelo 18
4 Travis 18
5 Lauri 18
6 George 22
7 Cherry 20
8 Francesca 18
9 Sonja 19
10 Troy 23
# … with 100 more rows
What is the variable name in the pulse dataset, ‘Age’ or ‘age’?
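One way to check this yourself is to compare the names of the input and of the result. The sketch below assumes a small stand-in tibble called pulse_toy (a hypothetical name, not the real pulse data):

```r
library(dplyr)

# pulse_toy is an illustrative stand-in, not the real pulse data
pulse_toy <- tibble(
  name = c("Bonnie", "Melanie", "Consuelo"),
  age  = c(18, 19, 18)
)

# New names go on the left of '=', existing names on the right
renamed <- select(pulse_toy, FirstName = name, Age = age)

names(renamed)    # the result carries the new names
names(pulse_toy)  # the input keeps its original names
```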
With select we can also rearrange the positions of the variables in the tibble. When a data set contains a large number of variables, you may want to bring the more ‘important’ variables to the front for inspection. You can do this with select in combination with the helper function everything():
select(pulse, name, age, everything())
# A tibble: 110 × 13
name age id height weight gender smokes alcohol exercise ran pulse1 pulse2 year
<chr> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Bonnie 18 1993_A 173 57 female no yes moderate sat 86 88 1993
2 Melanie 19 1993_B 179 58 female no yes moderate ran 82 150 1993
3 Consuelo 18 1993_C 167 62 female no yes high ran 96 176 1993
4 Travis 18 1993_D 195 84 male no yes high sat 71 73 1993
5 Lauri 18 1993_E 173 64 female no yes low sat 90 88 1993
6 George 22 1993_F 184 74 male no yes low ran 78 141 1993
7 Cherry 20 1993_G 162 57 female no yes moderate sat 68 72 1993
8 Francesca 18 1993_H 169 55 female no yes moderate sat 71 77 1993
9 Sonja 19 1993_I 164 56 female no yes high sat 68 68 1993
10 Troy 23 1993_J 168 60 male no yes moderate ran 88 150 1993
# … with 100 more rows
The everything() function matches all variables. Because name and age have already been selected, select keeps them in front and places the remaining variables after them.
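The reordering can be verified on a small scale by looking only at the column names. This is a sketch on a stand-in tibble called pulse_toy (a hypothetical name, built from a few columns of the rows shown above):

```r
library(dplyr)

# pulse_toy is an illustrative stand-in, not the real pulse data
pulse_toy <- tibble(
  id     = c("1993_A", "1993_B", "1993_C"),
  name   = c("Bonnie", "Melanie", "Consuelo"),
  height = c(173, 179, 167),
  age    = c(18, 19, 18)
)

# name and age come first; everything() supplies the remaining variables
reordered <- select(pulse_toy, name, age, everything())

names(pulse_toy)  # original order
names(reordered)  # name and age moved to the front, no variable is lost
```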
In data sets with a large number of variables, finding variables by name becomes tedious. Several helper functions are available to speed up the search for variable names.
starts_with(), ends_with() and contains()
These functions match fixed patterns in variable names:
# select variables starting with character 'a'
select(pulse, starts_with("a"))
# A tibble: 110 × 2
age alcohol
<dbl> <chr>
1 18 yes
2 19 yes
3 18 yes
4 18 yes
5 18 yes
6 22 yes
7 20 yes
8 18 yes
9 19 yes
10 23 yes
# … with 100 more rows
# select variables ending with 'e'
select(pulse, ends_with("e"))
# A tibble: 110 × 3
name age exercise
<chr> <dbl> <chr>
1 Bonnie 18 moderate
2 Melanie 19 moderate
3 Consuelo 18 high
4 Travis 18 high
5 Lauri 18 low
6 George 22 low
7 Cherry 20 moderate
8 Francesca 18 moderate
9 Sonja 19 high
10 Troy 23 moderate
# … with 100 more rows
# select variables containing character 'i'
select(pulse, contains("i"))
# A tibble: 110 × 4
id height weight exercise
<chr> <dbl> <dbl> <chr>
1 1993_A 173 57 moderate
2 1993_B 179 58 moderate
3 1993_C 167 62 high
4 1993_D 195 84 high
5 1993_E 173 64 low
6 1993_F 184 74 low
7 1993_G 162 57 moderate
8 1993_H 169 55 moderate
9 1993_I 164 56 high
10 1993_J 168 60 moderate
# … with 100 more rows
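One detail that may be worth knowing: by default these helpers ignore case when matching, which can be switched off with the ignore.case argument. The sketch below illustrates this on a stand-in tibble called pulse_toy (a hypothetical name, using a few columns from the rows shown above):

```r
library(dplyr)

# pulse_toy is an illustrative stand-in, not the real pulse data
pulse_toy <- tibble(
  id      = c("1993_A", "1993_B", "1993_C"),
  name    = c("Bonnie", "Melanie", "Consuelo"),
  age     = c(18, 19, 18),
  alcohol = c("yes", "yes", "yes")
)

# Matching ignores case by default, so "A" also matches age and alcohol
any_case <- select(pulse_toy, starts_with("A"))

# With ignore.case = FALSE no variable name starts with an upper-case "A"
exact_case <- select(pulse_toy, starts_with("A", ignore.case = FALSE))

names(any_case)
names(exact_case)  # empty: the result is a tibble without columns
```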
The helper functions can be combined with the logical operators !, | and &, which will be explained later. You have already encountered one of them, the negation operator !, in the lecture on Useful R functions. In short, it complements the result. For example, above we selected the variables starting with the character ‘a’ with select(pulse, starts_with("a")), which resulted in a tibble with the two variables age and alcohol. Putting ! in front of the helper function produces the complement of that result, namely all variables that do not start with ‘a’:
select(pulse, ! starts_with("a"))
# A tibble: 110 × 11
id name height weight gender smokes exercise ran pulse1 pulse2 year
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1993_A Bonnie 173 57 female no moderate sat 86 88 1993
2 1993_B Melanie 179 58 female no moderate ran 82 150 1993
3 1993_C Consuelo 167 62 female no high ran 96 176 1993
4 1993_D Travis 195 84 male no high sat 71 73 1993
5 1993_E Lauri 173 64 female no low sat 90 88 1993
6 1993_F George 184 74 male no low ran 78 141 1993
7 1993_G Cherry 162 57 female no moderate sat 68 72 1993
8 1993_H Francesca 169 55 female no moderate sat 71 77 1993
9 1993_I Sonja 164 56 female no high sat 68 68 1993
10 1993_J Troy 168 60 male no moderate ran 88 150 1993
# … with 100 more rows
Note that age and alcohol do not occur in the result.
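As a brief preview of the other two operators, the sketch below combines two helpers with & (variables matched by both) and | (variables matched by at least one). It assumes dplyr 1.0.0 or later and a stand-in tibble called pulse_toy (a hypothetical name, using a few columns from the rows shown above):

```r
library(dplyr)

# pulse_toy is an illustrative stand-in, not the real pulse data
pulse_toy <- tibble(
  id      = c("1993_A", "1993_B", "1993_C"),
  name    = c("Bonnie", "Melanie", "Consuelo"),
  age     = c(18, 19, 18),
  alcohol = c("yes", "yes", "yes")
)

# & keeps variables that start with "a" AND end with "e"
both <- select(pulse_toy, starts_with("a") & ends_with("e"))

# | keeps variables that start with "a" OR end with "e"
either <- select(pulse_toy, starts_with("a") | ends_with("e"))

# ! keeps variables that do NOT start with "a"
neither <- select(pulse_toy, !starts_with("a"))

names(both)
names(either)
names(neither)
```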
There are several other helper functions which fall beyond the scope of this lecture; visit here for more details.
Copyright © 2022 Biomedical Data Sciences (BDS) | LUMC