Use the select function to select variables (columns) from a tibble.

Given a tibble, select can be used to select, deselect, rename and reorder variables.

Let’s take the pulse dataset:

pulse 
# A tibble: 110 × 13
   id     name      height weight   age gender smokes alcohol exercise ran   pulse1 pulse2  year
   <chr>  <chr>      <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
 1 1993_A Bonnie       173     57    18 female no     yes     moderate sat       86     88  1993
 2 1993_B Melanie      179     58    19 female no     yes     moderate ran       82    150  1993
 3 1993_C Consuelo     167     62    18 female no     yes     high     ran       96    176  1993
 4 1993_D Travis       195     84    18 male   no     yes     high     sat       71     73  1993
 5 1993_E Lauri        173     64    18 female no     yes     low      sat       90     88  1993
 6 1993_F George       184     74    22 male   no     yes     low      ran       78    141  1993
 7 1993_G Cherry       162     57    20 female no     yes     moderate sat       68     72  1993
 8 1993_H Francesca    169     55    18 female no     yes     moderate sat       71     77  1993
 9 1993_I Sonja        164     56    19 female no     yes     high     sat       68     68  1993
10 1993_J Troy         168     60    23 male   no     yes     moderate ran       88    150  1993
# ℹ 100 more rows

select takes a tibble as its first argument, followed by a comma-separated list of variables of your choice, and returns a tibble with those chosen variables:

select(pulse, name, age)
# A tibble: 110 × 2
   name        age
   <chr>     <dbl>
 1 Bonnie       18
 2 Melanie      19
 3 Consuelo     18
 4 Travis       18
 5 Lauri        18
 6 George       22
 7 Cherry       20
 8 Francesca    18
 9 Sonja        19
10 Troy         23
# ℹ 100 more rows

After this selection, does the pulse tibble still contain the variables ‘name’ and ‘age’?

Yes, ‘select’ returns the selection as a tibble and does not modify the underlying tibble. You can check this by entering ‘pulse’ in the R console.
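
The same check can be sketched on a small stand-in tibble. The name pulse_small and the choice of three rows and four columns are assumptions made for illustration; the values are copied from the first rows of the pulse output shown earlier.

```r
library(dplyr)   # select()
library(tibble)  # tibble()

# Hypothetical stand-in for pulse: first three rows, four of the columns
pulse_small <- tibble(
  id     = c("1993_A", "1993_B", "1993_C"),
  name   = c("Bonnie", "Melanie", "Consuelo"),
  age    = c(18, 19, 18),
  gender = c("female", "female", "female")
)

selected <- select(pulse_small, name, age)

names(selected)     # columns of the returned tibble
names(pulse_small)  # columns of the original tibble, still all four
```

Comparing the two names() calls shows that only the returned tibble is reduced to the chosen variables.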


If you want to keep your selection as a separate tibble, assign the result to a new variable, e.g. pulse_name_age:

pulse_name_age <- select(pulse, name, age)
pulse_name_age
# A tibble: 110 × 2
   name        age
   <chr>     <dbl>
 1 Bonnie       18
 2 Melanie      19
 3 Consuelo     18
 4 Travis       18
 5 Lauri        18
 6 George       22
 7 Cherry       20
 8 Francesca    18
 9 Sonja        19
10 Troy         23
# ℹ 100 more rows

Variable order

The order of the selected variables is reflected in the resulting tibble:

select(pulse, age, name)
# A tibble: 110 × 2
     age name     
   <dbl> <chr>    
 1    18 Bonnie   
 2    19 Melanie  
 3    18 Consuelo 
 4    18 Travis   
 5    18 Lauri    
 6    22 George   
 7    20 Cherry   
 8    18 Francesca
 9    19 Sonja    
10    23 Troy     
# ℹ 100 more rows

Deselect variables

You may also deselect variables, in other words keep the complement of the variables you name. This is done by prefixing each variable with the - sign:

select(pulse, -smokes, -alcohol)
# A tibble: 110 × 11
   id     name      height weight   age gender exercise ran   pulse1 pulse2  year
   <chr>  <chr>      <dbl>  <dbl> <dbl> <chr>  <chr>    <chr>  <dbl>  <dbl> <dbl>
 1 1993_A Bonnie       173     57    18 female moderate sat       86     88  1993
 2 1993_B Melanie      179     58    19 female moderate ran       82    150  1993
 3 1993_C Consuelo     167     62    18 female high     ran       96    176  1993
 4 1993_D Travis       195     84    18 male   high     sat       71     73  1993
 5 1993_E Lauri        173     64    18 female low      sat       90     88  1993
 6 1993_F George       184     74    22 male   low      ran       78    141  1993
 7 1993_G Cherry       162     57    20 female moderate sat       68     72  1993
 8 1993_H Francesca    169     55    18 female moderate sat       71     77  1993
 9 1993_I Sonja        164     56    19 female high     sat       68     68  1993
10 1993_J Troy         168     60    23 male   moderate ran       88    150  1993
# ℹ 100 more rows
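
As a rough sketch of what "complement" means here, the example below uses a small stand-in tibble (pulse_small is a hypothetical name, with values copied from the first rows of pulse) and compares the result of deselecting with the column names that remain after removing the two deselected ones.

```r
library(dplyr)   # select()
library(tibble)  # tibble()

# Hypothetical stand-in for pulse with four of its columns
pulse_small <- tibble(
  name     = c("Bonnie", "Melanie", "Consuelo"),
  smokes   = c("no", "no", "no"),
  alcohol  = c("yes", "yes", "yes"),
  exercise = c("moderate", "moderate", "high")
)

kept <- select(pulse_small, -smokes, -alcohol)

# Columns in the result
names(kept)

# All original columns except the deselected ones, in their original order
setdiff(names(pulse_small), c("smokes", "alcohol"))
```

Both calls list the same columns, and the remaining variables keep their original order.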

Select and rename

While selecting, it is also possible to rename variables at the same time, using the form new_name = old_name:

select(pulse, FirstName = name, Age = age)
# A tibble: 110 × 2
   FirstName   Age
   <chr>     <dbl>
 1 Bonnie       18
 2 Melanie      19
 3 Consuelo     18
 4 Travis       18
 5 Lauri        18
 6 George       22
 7 Cherry       20
 8 Francesca    18
 9 Sonja        19
10 Troy         23
# ℹ 100 more rows

What is the variable name in the pulse dataset, ‘Age’ or ‘age’?

‘age’. We only ran select and did not store its result back into the pulse tibble with an assignment (‘<-’), so pulse is unchanged.
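
A minimal sketch of the difference between running select and storing its result, again on a hypothetical stand-in tibble (pulse_small and pulse_renamed are names chosen for illustration):

```r
library(dplyr)   # select()
library(tibble)  # tibble()

# Hypothetical stand-in for pulse with two of its columns
pulse_small <- tibble(
  name = c("Bonnie", "Melanie", "Consuelo"),
  age  = c(18, 19, 18)
)

# Without assignment the renamed tibble is only printed
select(pulse_small, FirstName = name, Age = age)
names(pulse_small)    # original names are still in place

# With assignment the renamed tibble is kept under a new name
pulse_renamed <- select(pulse_small, FirstName = name, Age = age)
names(pulse_renamed)  # the new names
```

Assigning to a new name rather than back to the original keeps both versions available for comparison.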


Reshuffle variables

With select we can also change the position of the variables in the tibble. When a dataset contains a large number of variables, you may want to bring the more ‘important’ variables to the front for inspection. You can do this with select in combination with the helper function everything():

select(pulse, name, age, everything()) 
# A tibble: 110 × 13
   name        age id     height weight gender smokes alcohol exercise ran   pulse1 pulse2  year
   <chr>     <dbl> <chr>   <dbl>  <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
 1 Bonnie       18 1993_A    173     57 female no     yes     moderate sat       86     88  1993
 2 Melanie      19 1993_B    179     58 female no     yes     moderate ran       82    150  1993
 3 Consuelo     18 1993_C    167     62 female no     yes     high     ran       96    176  1993
 4 Travis       18 1993_D    195     84 male   no     yes     high     sat       71     73  1993
 5 Lauri        18 1993_E    173     64 female no     yes     low      sat       90     88  1993
 6 George       22 1993_F    184     74 male   no     yes     low      ran       78    141  1993
 7 Cherry       20 1993_G    162     57 female no     yes     moderate sat       68     72  1993
 8 Francesca    18 1993_H    169     55 female no     yes     moderate sat       71     77  1993
 9 Sonja        19 1993_I    164     56 female no     yes     high     sat       68     68  1993
10 Troy         23 1993_J    168     60 male   no     yes     moderate ran       88    150  1993
# ℹ 100 more rows

The everything function matches all variables. Because name and age have already been selected, the select function places the remaining variables after them, in their original order.
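
The resulting column order can be checked with a short sketch on a hypothetical stand-in tibble (pulse_small is an illustrative name; the values come from the first rows of pulse):

```r
library(dplyr)   # select(), everything()
library(tibble)  # tibble()

# Hypothetical stand-in for pulse with four of its columns
pulse_small <- tibble(
  id     = c("1993_A", "1993_B", "1993_C"),
  name   = c("Bonnie", "Melanie", "Consuelo"),
  height = c(173, 179, 167),
  age    = c(18, 19, 18)
)

reordered <- select(pulse_small, name, age, everything())

# name and age come first, the other columns follow in their original order
names(reordered)

# No column is lost or duplicated by the reordering
ncol(reordered) == ncol(pulse_small)
```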



Copyright © 2024 Biomedical Data Sciences (BDS) | LUMC