Use the select function to select variables (columns) from a tibble.
Given a tibble, select can be used to keep, drop, rename, and reorder variables.
Let’s take the pulse dataset:
pulse
# A tibble: 110 × 13
id name height weight age gender smokes alcohol exercise ran pulse1 pulse2 year
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1993_A Bonnie 173 57 18 female no yes moderate sat 86 88 1993
2 1993_B Melanie 179 58 19 female no yes moderate ran 82 150 1993
3 1993_C Consuelo 167 62 18 female no yes high ran 96 176 1993
4 1993_D Travis 195 84 18 male no yes high sat 71 73 1993
5 1993_E Lauri 173 64 18 female no yes low sat 90 88 1993
6 1993_F George 184 74 22 male no yes low ran 78 141 1993
7 1993_G Cherry 162 57 20 female no yes moderate sat 68 72 1993
8 1993_H Francesca 169 55 18 female no yes moderate sat 71 77 1993
9 1993_I Sonja 164 56 19 female no yes high sat 68 68 1993
10 1993_J Troy 168 60 23 male no yes moderate ran 88 150 1993
# … with 100 more rows
# ℹ Use `print(n = ...)` to see more rows
select takes a tibble as its first argument, followed by a comma-separated list of the variables of your choice, and returns a tibble with only those variables:
select(pulse, name, age)
# A tibble: 110 × 2
name age
<chr> <dbl>
1 Bonnie 18
2 Melanie 19
3 Consuelo 18
4 Travis 18
5 Lauri 18
6 George 22
7 Cherry 20
8 Francesca 18
9 Sonja 19
10 Troy 23
# … with 100 more rows
# ℹ Use `print(n = ...)` to see more rows
After this selection, does the pulse tibble still contain the variables ‘name’ and ‘age’?
If you want to keep your selection as a separate tibble, you’ll need to assign the result to a new variable, e.g. pulse_name_age:
pulse_name_age <- select(pulse, name, age)
pulse_name_age
# A tibble: 110 × 2
name age
<chr> <dbl>
1 Bonnie 18
2 Melanie 19
3 Consuelo 18
4 Travis 18
5 Lauri 18
6 George 22
7 Cherry 20
8 Francesca 18
9 Sonja 19
10 Troy 23
# … with 100 more rows
# ℹ Use `print(n = ...)` to see more rows
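The difference between selecting and assigning can be checked on a small example. The sketch below does not use the real pulse data: it assumes dplyr is installed and builds a hypothetical stand-in tibble called toy from the first three rows shown above, with only four of the columns.

```r
library(dplyr)

# Hypothetical stand-in for pulse: first three rows, four columns only.
toy <- tibble(
  name   = c("Bonnie", "Melanie", "Consuelo"),
  height = c(173, 179, 167),
  age    = c(18, 19, 18),
  smokes = c("no", "no", "no")
)

# select() returns a new tibble; the input tibble is left untouched.
toy_name_age <- select(toy, name, age)

names(toy_name_age)  # columns of the new tibble
names(toy)           # the original still has all of its columns
```

Comparing the two names() results shows that the selection only exists in toy_name_age, because the result was assigned to it.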
The order of the selected variables is reflected in the resulting tibble:
select(pulse, age, name)
# A tibble: 110 × 2
age name
<dbl> <chr>
1 18 Bonnie
2 19 Melanie
3 18 Consuelo
4 18 Travis
5 18 Lauri
6 22 George
7 20 Cherry
8 18 Francesca
9 19 Sonja
10 23 Troy
# … with 100 more rows
# ℹ Use `print(n = ...)` to see more rows
You may also deselect variables, in other words keep the complement of your selection. This is done with the - sign:
select(pulse, -smokes, -alcohol)
# A tibble: 110 × 11
id name height weight age gender exercise ran pulse1 pulse2 year
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1993_A Bonnie 173 57 18 female moderate sat 86 88 1993
2 1993_B Melanie 179 58 19 female moderate ran 82 150 1993
3 1993_C Consuelo 167 62 18 female high ran 96 176 1993
4 1993_D Travis 195 84 18 male high sat 71 73 1993
5 1993_E Lauri 173 64 18 female low sat 90 88 1993
6 1993_F George 184 74 22 male low ran 78 141 1993
7 1993_G Cherry 162 57 20 female moderate sat 68 72 1993
8 1993_H Francesca 169 55 18 female moderate sat 71 77 1993
9 1993_I Sonja 164 56 19 female high sat 68 68 1993
10 1993_J Troy 168 60 23 male moderate ran 88 150 1993
# … with 100 more rows
# ℹ Use `print(n = ...)` to see more rows
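As a rough illustration of deselection as a complement, the sketch below drops two columns and compares the result with selecting the remaining columns by name. It assumes dplyr is installed and uses a hypothetical four-column tibble named toy rather than the full pulse data.

```r
library(dplyr)

# Hypothetical stand-in for pulse: first three rows, four columns only.
toy <- tibble(
  name    = c("Bonnie", "Melanie", "Consuelo"),
  age     = c(18, 19, 18),
  smokes  = c("no", "no", "no"),
  alcohol = c("yes", "yes", "yes")
)

# Dropping smokes and alcohol ...
dropped <- select(toy, -smokes, -alcohol)

# ... leaves the same columns as keeping name and age.
kept <- select(toy, name, age)

names(dropped)
names(kept)
```

Deselection tends to be the shorter option when you want to keep most of the columns and remove only a few.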
While selecting, you can also rename variables at the same time:
select(pulse, FirstName = name, Age = age)
# A tibble: 110 × 2
FirstName Age
<chr> <dbl>
1 Bonnie 18
2 Melanie 19
3 Consuelo 18
4 Travis 18
5 Lauri 18
6 George 22
7 Cherry 20
8 Francesca 18
9 Sonja 19
10 Troy 23
# … with 100 more rows
# ℹ Use `print(n = ...)` to see more rows
What is the variable name in the pulse dataset, ‘Age’ or ‘age’?
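One way to check a question like this yourself is to compare the column names before and after the call. The sketch below assumes dplyr is installed and uses a hypothetical two-column tibble named toy in place of pulse.

```r
library(dplyr)

# Hypothetical stand-in for pulse: first three rows, two columns only.
toy <- tibble(
  name = c("Bonnie", "Melanie", "Consuelo"),
  age  = c(18, 19, 18)
)

# Rename while selecting: new_name = old_name.
renamed <- select(toy, FirstName = name, Age = age)

names(renamed)  # names in the returned tibble
names(toy)      # names in the input tibble
```

The new names appear only in the returned tibble, so they are lost unless the result is assigned to a variable.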
With select we can also rearrange the positions of the variables in the tibble. When a dataset contains a large number of variables, you may want to bring the more ‘important’ variables to the front for inspection. You can do this with select in combination with the helper function everything():
select(pulse, name, age, everything())
# A tibble: 110 × 13
name age id height weight gender smokes alcohol exercise ran pulse1 pulse2 year
<chr> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Bonnie 18 1993_A 173 57 female no yes moderate sat 86 88 1993
2 Melanie 19 1993_B 179 58 female no yes moderate ran 82 150 1993
3 Consuelo 18 1993_C 167 62 female no yes high ran 96 176 1993
4 Travis 18 1993_D 195 84 male no yes high sat 71 73 1993
5 Lauri 18 1993_E 173 64 female no yes low sat 90 88 1993
6 George 22 1993_F 184 74 male no yes low ran 78 141 1993
7 Cherry 20 1993_G 162 57 female no yes moderate sat 68 72 1993
8 Francesca 18 1993_H 169 55 female no yes moderate sat 71 77 1993
9 Sonja 19 1993_I 164 56 female no yes high sat 68 68 1993
10 Troy 23 1993_J 168 60 male no yes moderate ran 88 150 1993
# … with 100 more rows
# ℹ Use `print(n = ...)` to see more rows
Here everything() stands for all remaining variables, that is, every variable other than name and age, and select places them after name and age.
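The reordering can be verified on a small example by looking at the column names only. This is a minimal sketch, assuming dplyr is installed; toy is a hypothetical four-column stand-in for pulse.

```r
library(dplyr)

# Hypothetical stand-in for pulse: first three rows, four columns only.
toy <- tibble(
  id     = c("1993_A", "1993_B", "1993_C"),
  name   = c("Bonnie", "Melanie", "Consuelo"),
  height = c(173, 179, 167),
  age    = c(18, 19, 18)
)

# Move name and age to the front; everything() supplies the rest
# in their original order.
reordered <- select(toy, name, age, everything())

names(toy)        # original order
names(reordered)  # name and age first, then id and height
```

No columns are dropped here: the result has the same variables as the input, only in a different order.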
Copyright © 2023 Biomedical Data Sciences (BDS) | LUMC