Use the select function to select variables (columns) from a tibble.

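Given a tibble, select can be used to keep, drop, rename and reorder variables, and to find variables by matching patterns in their names.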

Let’s take the pulse dataset:

pulse 
# A tibble: 110 × 13
   id     name      height weight   age gender smokes alcohol exercise ran   pulse1 pulse2  year
   <chr>  <chr>      <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
 1 1993_A Bonnie       173     57    18 female no     yes     moderate sat       86     88  1993
 2 1993_B Melanie      179     58    19 female no     yes     moderate ran       82    150  1993
 3 1993_C Consuelo     167     62    18 female no     yes     high     ran       96    176  1993
 4 1993_D Travis       195     84    18 male   no     yes     high     sat       71     73  1993
 5 1993_E Lauri        173     64    18 female no     yes     low      sat       90     88  1993
 6 1993_F George       184     74    22 male   no     yes     low      ran       78    141  1993
 7 1993_G Cherry       162     57    20 female no     yes     moderate sat       68     72  1993
 8 1993_H Francesca    169     55    18 female no     yes     moderate sat       71     77  1993
 9 1993_I Sonja        164     56    19 female no     yes     high     sat       68     68  1993
10 1993_J Troy         168     60    23 male   no     yes     moderate ran       88    150  1993
# … with 100 more rows

select takes a tibble as its first argument, followed by a comma-separated list of variables of your choice, and returns a tibble with only those variables:

select(pulse, name, age)
# A tibble: 110 × 2
   name        age
   <chr>     <dbl>
 1 Bonnie       18
 2 Melanie      19
 3 Consuelo     18
 4 Travis       18
 5 Lauri        18
 6 George       22
 7 Cherry       20
 8 Francesca    18
 9 Sonja        19
10 Troy         23
# … with 100 more rows

After this selection, does the pulse tibble still contain the variables ‘name’ and ‘age’?

Yes, ‘select’ returns the selection as a tibble and does not modify the underlying tibble. You can check this by entering ‘pulse’ in the R console.


If you want to keep your selection as a separate tibble, you’ll need to assign the result to a new variable, e.g. pulse_name_age:

pulse_name_age <- select(pulse, name, age)
pulse_name_age
# A tibble: 110 × 2
   name        age
   <chr>     <dbl>
 1 Bonnie       18
 2 Melanie      19
 3 Consuelo     18
 4 Travis       18
 5 Lauri        18
 6 George       22
 7 Cherry       20
 8 Francesca    18
 9 Sonja        19
10 Troy         23
# … with 100 more rows
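
The point that select leaves its input untouched can be checked on a small example. The sketch below assumes dplyr is installed; pulse_small and small_name_age are hypothetical names, and pulse_small is a three-row stand-in built from values in the first rows printed above, not the real pulse dataset.

```r
library(dplyr)

# Hypothetical three-row stand-in for pulse (values copied from the rows above)
pulse_small <- tibble(
  name   = c("Bonnie", "Melanie", "Consuelo"),
  height = c(173, 179, 167),
  age    = c(18, 19, 18)
)

# Store the selection under a new name
small_name_age <- select(pulse_small, name, age)

names(small_name_age)  # columns of the stored selection
names(pulse_small)     # columns of the original tibble, not changed by select
```

Comparing the two names() calls shows that only the new tibble is reduced to the selected columns.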

Variable order

The order of the selected variables is reflected in the resulting tibble:

select(pulse, age, name)
# A tibble: 110 × 2
     age name     
   <dbl> <chr>    
 1    18 Bonnie   
 2    19 Melanie  
 3    18 Consuelo 
 4    18 Travis   
 5    18 Lauri    
 6    22 George   
 7    20 Cherry   
 8    18 Francesca
 9    19 Sonja    
10    23 Troy     
# … with 100 more rows

Deselect variables

You may also deselect variables, in other words keep the complement of the variables you name. This is done with the - sign:

select(pulse, -smokes, -alcohol)
# A tibble: 110 × 11
   id     name      height weight   age gender exercise ran   pulse1 pulse2  year
   <chr>  <chr>      <dbl>  <dbl> <dbl> <chr>  <chr>    <chr>  <dbl>  <dbl> <dbl>
 1 1993_A Bonnie       173     57    18 female moderate sat       86     88  1993
 2 1993_B Melanie      179     58    19 female moderate ran       82    150  1993
 3 1993_C Consuelo     167     62    18 female high     ran       96    176  1993
 4 1993_D Travis       195     84    18 male   high     sat       71     73  1993
 5 1993_E Lauri        173     64    18 female low      sat       90     88  1993
 6 1993_F George       184     74    22 male   low      ran       78    141  1993
 7 1993_G Cherry       162     57    20 female moderate sat       68     72  1993
 8 1993_H Francesca    169     55    18 female moderate sat       71     77  1993
 9 1993_I Sonja        164     56    19 female high     sat       68     68  1993
10 1993_J Troy         168     60    23 male   moderate ran       88    150  1993
# … with 100 more rows
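
As a small sketch of the same idea, the variables to drop can also be grouped inside c() and negated once. The tibble pulse_small and the result names below are hypothetical, built from a few of the rows printed above, and dplyr is assumed to be installed.

```r
library(dplyr)

# Hypothetical three-row stand-in for pulse (values copied from the rows above)
pulse_small <- tibble(
  name    = c("Bonnie", "Melanie", "Consuelo"),
  age     = c(18, 19, 18),
  smokes  = c("no", "no", "no"),
  alcohol = c("yes", "yes", "yes")
)

# Drop the variables one by one ...
dropped_one_by_one <- select(pulse_small, -smokes, -alcohol)

# ... or group them with c() and negate the whole group
dropped_together <- select(pulse_small, -c(smokes, alcohol))

names(dropped_one_by_one)
names(dropped_together)
```

Both calls are expected to keep the same remaining variables; which form to use is mostly a matter of readability.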

Select and rename

While selecting variables, you can rename them at the same time:

select(pulse, FirstName = name, Age = age)
# A tibble: 110 × 2
   FirstName   Age
   <chr>     <dbl>
 1 Bonnie       18
 2 Melanie      19
 3 Consuelo     18
 4 Travis       18
 5 Lauri        18
 6 George       22
 7 Cherry       20
 8 Francesca    18
 9 Sonja        19
10 Troy         23
# … with 100 more rows

What is the variable name in the pulse dataset, ‘Age’ or ‘age’?

It is still ‘age’, because we only ran select and did not store its result back into the pulse tibble with an assignment (‘<-’).
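
Renaming inside select also drops every variable that is not named. When all variables should be kept, dplyr's rename function takes the same new_name = old_name pairs. The sketch below contrasts the two on a hypothetical three-row stand-in called pulse_small (not the real pulse data); the result names are also made up for illustration.

```r
library(dplyr)

# Hypothetical three-row stand-in for pulse (values copied from the rows above)
pulse_small <- tibble(
  name   = c("Bonnie", "Melanie", "Consuelo"),
  height = c(173, 179, 167),
  age    = c(18, 19, 18)
)

# select: keeps only the named variables, under their new names
renamed_selection <- select(pulse_small, FirstName = name, Age = age)

# rename: keeps all variables and changes only the given names
renamed_all <- rename(pulse_small, FirstName = name, Age = age)

names(renamed_selection)
names(renamed_all)
```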


Reshuffle variables

With select we can also change the positions of the variables in the tibble. When a data set contains a large number of variables, you may want to bring the more ‘important’ variables to the front for inspection. You can do this with select in combination with the helper function everything():

select(pulse, name, age, everything()) 
# A tibble: 110 × 13
   name        age id     height weight gender smokes alcohol exercise ran   pulse1 pulse2  year
   <chr>     <dbl> <chr>   <dbl>  <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
 1 Bonnie       18 1993_A    173     57 female no     yes     moderate sat       86     88  1993
 2 Melanie      19 1993_B    179     58 female no     yes     moderate ran       82    150  1993
 3 Consuelo     18 1993_C    167     62 female no     yes     high     ran       96    176  1993
 4 Travis       18 1993_D    195     84 male   no     yes     high     sat       71     73  1993
 5 Lauri        18 1993_E    173     64 female no     yes     low      sat       90     88  1993
 6 George       22 1993_F    184     74 male   no     yes     low      ran       78    141  1993
 7 Cherry       20 1993_G    162     57 female no     yes     moderate sat       68     72  1993
 8 Francesca    18 1993_H    169     55 female no     yes     moderate sat       71     77  1993
 9 Sonja        19 1993_I    164     56 female no     yes     high     sat       68     68  1993
10 Troy         23 1993_J    168     60 male   no     yes     moderate ran       88    150  1993
# … with 100 more rows

The everything function lists all variables; since name and age have already been selected, select places the remaining variables after them.
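
A minimal sketch of this reordering, run on a hypothetical three-row stand-in called pulse_small instead of the full pulse tibble (dplyr is assumed to be installed):

```r
library(dplyr)

# Hypothetical three-row stand-in for pulse (values copied from the rows above)
pulse_small <- tibble(
  id     = c("1993_A", "1993_B", "1993_C"),
  name   = c("Bonnie", "Melanie", "Consuelo"),
  height = c(173, 179, 167),
  age    = c(18, 19, 18)
)

# Bring name and age to the front; everything() supplies the remaining variables
reordered <- select(pulse_small, name, age, everything())

names(pulse_small)  # original order
names(reordered)    # name and age first, the rest in their original order
```

No variable is lost in the process, only the column order changes.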

Selection by pattern matching

In data sets with a large number of variables, finding variables by hand becomes tedious. Several helper functions are available to speed up the search for variable names.

starts_with(), ends_with() and contains()

These functions help to find fixed patterns in variable names:

# select variables starting with character 'a'
select(pulse, starts_with("a"))
# A tibble: 110 × 2
     age alcohol
   <dbl> <chr>  
 1    18 yes    
 2    19 yes    
 3    18 yes    
 4    18 yes    
 5    18 yes    
 6    22 yes    
 7    20 yes    
 8    18 yes    
 9    19 yes    
10    23 yes    
# … with 100 more rows

# select variables ending with 'e'
select(pulse, ends_with("e"))
# A tibble: 110 × 3
   name        age exercise
   <chr>     <dbl> <chr>   
 1 Bonnie       18 moderate
 2 Melanie      19 moderate
 3 Consuelo     18 high    
 4 Travis       18 high    
 5 Lauri        18 low     
 6 George       22 low     
 7 Cherry       20 moderate
 8 Francesca    18 moderate
 9 Sonja        19 high    
10 Troy         23 moderate
# … with 100 more rows

# select variables containing character 'i'
select(pulse, contains("i"))
# A tibble: 110 × 4
   id     height weight exercise
   <chr>   <dbl>  <dbl> <chr>   
 1 1993_A    173     57 moderate
 2 1993_B    179     58 moderate
 3 1993_C    167     62 high    
 4 1993_D    195     84 high    
 5 1993_E    173     64 low     
 6 1993_F    184     74 low     
 7 1993_G    162     57 moderate
 8 1993_H    169     55 moderate
 9 1993_I    164     56 high    
10 1993_J    168     60 moderate
# … with 100 more rows
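
To see which names each helper picks up without printing whole tibbles, the result can be wrapped in names(). The sketch below assumes dplyr is installed and uses a hypothetical two-row tibble, pulse_small, that only mimics some of the column names of pulse.

```r
library(dplyr)

# Hypothetical two-row stand-in that mimics some column names of pulse
pulse_small <- tibble(
  id       = c("1993_A", "1993_B"),
  name     = c("Bonnie", "Melanie"),
  height   = c(173, 179),
  weight   = c(57, 58),
  age      = c(18, 19),
  alcohol  = c("yes", "yes"),
  exercise = c("moderate", "moderate")
)

# Names matched by each helper
start_a   <- names(select(pulse_small, starts_with("a")))
end_e     <- names(select(pulse_small, ends_with("e")))
contain_i <- names(select(pulse_small, contains("i")))

start_a
end_e
contain_i
```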

The helper functions can be combined with the logical operators !, | and &, which will be explained later. You have already encountered one of them, the negation operator !, in the lecture on Useful R functions. In short, it takes the complement of a result. For example, select(pulse, starts_with("a")) above returned a tibble with the two variables age and alcohol. Placing ! in front of the helper function produces the complement of that result, namely all variables that do not start with ‘a’:

select(pulse, ! starts_with("a"))
# A tibble: 110 × 11
   id     name      height weight gender smokes exercise ran   pulse1 pulse2  year
   <chr>  <chr>      <dbl>  <dbl> <chr>  <chr>  <chr>    <chr>  <dbl>  <dbl> <dbl>
 1 1993_A Bonnie       173     57 female no     moderate sat       86     88  1993
 2 1993_B Melanie      179     58 female no     moderate ran       82    150  1993
 3 1993_C Consuelo     167     62 female no     high     ran       96    176  1993
 4 1993_D Travis       195     84 male   no     high     sat       71     73  1993
 5 1993_E Lauri        173     64 female no     low      sat       90     88  1993
 6 1993_F George       184     74 male   no     low      ran       78    141  1993
 7 1993_G Cherry       162     57 female no     moderate sat       68     72  1993
 8 1993_H Francesca    169     55 female no     moderate sat       71     77  1993
 9 1993_I Sonja        164     56 female no     high     sat       68     68  1993
10 1993_J Troy         168     60 male   no     moderate ran       88    150  1993
# … with 100 more rows

Note that age and alcohol do not occur in the result.
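
The other two operators can be sketched in the same way: | keeps variables matched by either helper, and & keeps only variables matched by both. The example assumes a dplyr version that supports these operators inside select (1.0.0 or later), and pulse_small is again a hypothetical stand-in, not the real pulse data.

```r
library(dplyr)

# Hypothetical two-row stand-in that mimics some column names of pulse
pulse_small <- tibble(
  name     = c("Bonnie", "Melanie"),
  age      = c(18, 19),
  alcohol  = c("yes", "yes"),
  exercise = c("moderate", "moderate"),
  pulse1   = c(86, 82),
  pulse2   = c(88, 150)
)

# Union: variables that start with 'a' OR end with 'e'
either <- names(select(pulse_small, starts_with("a") | ends_with("e")))

# Intersection: variables that start with 'pulse' AND end with '2'
both <- names(select(pulse_small, starts_with("pulse") & ends_with("2")))

either
both
```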

There are several other helper functions that fall beyond the scope of this lecture; see the help page of select (?select) for more details.



Copyright © 2022 Biomedical Data Sciences (BDS) | LUMC