Factor concept

A factor is a vector of categorical values, each taken from a limited set of possibilities called levels.

For example, a factor can be used to store the hair colors of a group of people.
Let’s assume that we can classify each person into one of the following hair colors.
These will be the levels of the factor:

c( "black", "brown", "ginger", "blond", "gray", "white" )

Now, let’s assume that for 10 people we observed the following hair colors.
NA (not available, missing value) stands for a person whose hair color could not be determined.
These will be the values of the factor:

c( "brown", "ginger", "black", "brown", NA, "brown", "black", "brown", "blond", "gray" ) 

Create a factor

Enter the following to create the factor and store it in a variable hairColors:

hairColors <- factor(
  c( "brown", "ginger", "black", "brown", NA, "brown", "black", "brown", "blond", "gray" ),
  levels = c( "black", "brown", "ginger", "blond", "gray", "white" )
)

When printed, both values and levels of a factor are shown:

hairColors
 [1] brown  ginger black  brown  <NA>   brown  black  brown  blond  gray  
Levels: black brown ginger blond gray white

As expected, the class of a factor is:

class( hairColors )
[1] "factor"
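Two behaviours of factor() are worth trying at this point. The sketch below uses a shortened copy of the observations; the variable names observed, autoLevels and strictLevels are made up for illustration and are not part of the course material.

```r
# Shortened copy of the observed hair colors (illustrative data).
observed <- c( "brown", "ginger", "black", "brown", NA )

# Without the 'levels' argument, factor() takes the sorted unique values as levels.
# Levels that were never observed (e.g. "white") are therefore absent.
autoLevels <- factor( observed )
levels( autoLevels )

# A value that is not listed in 'levels' is silently turned into NA.
strictLevels <- factor( observed, levels = c( "black", "brown" ) )
strictLevels
```

Listing the levels explicitly, as done for hairColors, keeps unobserved levels available and fixes their order.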

Levels and coding

In R, a factor is stored as two vectors.
The first vector holds the levels; the position of each level is its integer code: 1=black, 2=brown, …
Type the following to get this vector:

levels( hairColors )
[1] "black"  "brown"  "ginger" "blond"  "gray"   "white" 

To find the number of factor levels, enter:

nlevels( hairColors )
[1] 6

The second vector keeps one integer code per value; a missing value stays NA.
Try the following and work out how the integers relate to the levels:

as.integer( hairColors )
 [1]  2  3  1  2 NA  2  1  2  4  5
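The relation between the two vectors can be checked directly: indexing the levels by the integer codes should give back the original values. This is a minimal sketch in base R; the names codes and decoded are chosen here for illustration.

```r
hairColors <- factor(
  c( "brown", "ginger", "black", "brown", NA, "brown", "black", "brown", "blond", "gray" ),
  levels = c( "black", "brown", "ginger", "blond", "gray", "white" )
)

# Integer codes: one per value, NA for the missing value.
codes <- as.integer( hairColors )

# Use each code as a position in the levels vector to recover the value as text.
decoded <- levels( hairColors )[ codes ]
decoded

# The decoded text matches the factor converted to character.
identical( decoded, as.character( hairColors ) )
```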

Count occurrences

Call fct_count to produce a tibble with the number of occurrences of each level in a factor (missing values are counted in a separate <NA> row):

library( forcats )     # fct_* functions are in this package
# library( tidyverse ) # may be used instead; it also attaches forcats
fct_count( hairColors )
# A tibble: 7 × 2
  f          n
  <fct>  <int>
1 black      2
2 brown      4
3 ginger     1
4 blond      1
5 gray       1
6 white      0
7 <NA>       1
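If forcats is not at hand, similar counts can be obtained with table() from base R. This is offered as a cross-check rather than as the course's method; the name counts is illustrative.

```r
hairColors <- factor(
  c( "brown", "ginger", "black", "brown", NA, "brown", "black", "brown", "blond", "gray" ),
  levels = c( "black", "brown", "ginger", "blond", "gray", "white" )
)

# table() counts every level, including levels with zero occurrences.
# useNA = "ifany" adds a count for missing values when there are any.
counts <- table( hairColors, useNA = "ifany" )
counts
```

Unlike fct_count, table() returns a table object and not a tibble, so it is less convenient in tidyverse pipelines.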

Rename levels

ℹ️ The need to rename levels of a factor often arises when factors are plotted and the levels appear in legends.

Use fct_recode to change levels by providing pairs of new_level = "old_level".
For example, compare hairColors before and after fct_recode:

hairColors
 [1] brown  ginger black  brown  <NA>   brown  black  brown  blond  gray  
Levels: black brown ginger blond gray white
fct_recode( hairColors, BROWN = "brown", GRAY = "gray" )
 [1] BROWN  ginger black  BROWN  <NA>   BROWN  black  BROWN  blond  GRAY  
Levels: black BROWN ginger blond GRAY white
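Note that fct_recode returns a new factor and leaves its input untouched, so the result has to be assigned to keep it. The sketch below illustrates this; the variable name renamed is an assumption made for the example.

```r
library( forcats )

hairColors <- factor(
  c( "brown", "ginger", "black", "brown", NA, "brown", "black", "brown", "blond", "gray" ),
  levels = c( "black", "brown", "ginger", "blond", "gray", "white" )
)

# Store the recoded factor in a new variable.
renamed <- fct_recode( hairColors, BROWN = "brown", GRAY = "gray" )

levels( hairColors )   # original levels, unchanged
levels( renamed )      # levels with the two new names

# Only the level labels changed; the integer codes are the same.
identical( as.integer( hairColors ), as.integer( renamed ) )
```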

Change levels order

ℹ️ The need to reorder levels of a factor often arises when factors are plotted and the elements of the plot are drawn in level order. Also, when fitting models, the first level of the factor is often taken as the baseline/reference.

Use fct_relevel to move the given levels to the front; the remaining levels keep their relative order.
For example, compare hairColors before and after fct_relevel:

hairColors
 [1] brown  ginger black  brown  <NA>   brown  black  brown  blond  gray  
Levels: black brown ginger blond gray white
fct_relevel( hairColors, c( "gray", "white" ) )
 [1] brown  ginger black  brown  <NA>   brown  black  brown  blond  gray  
Levels: gray white black brown ginger blond
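Levels do not have to go to the front: fct_relevel also accepts an after argument that sets the position at which the levels are inserted. A short sketch, with blackLast and whiteThird as illustrative variable names:

```r
library( forcats )

hairColors <- factor(
  c( "brown", "ginger", "black", "brown", NA, "brown", "black", "brown", "blond", "gray" ),
  levels = c( "black", "brown", "ginger", "blond", "gray", "white" )
)

# after = Inf moves the given level to the end.
blackLast <- fct_relevel( hairColors, "black", after = Inf )
levels( blackLast )

# after = 2 places the given level after the first two remaining levels.
whiteThird <- fct_relevel( hairColors, "white", after = 2 )
levels( whiteThird )
```

In both cases only the level order changes; the values of the factor stay the same.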

Collapse (merge) levels

ℹ️ The need to collapse levels of a factor often arises when there are too many rare levels.

Use fct_collapse to merge several old levels into one new level using the notation new_level = c( "old_level_1", "old_level_2" ).
For example, compare hairColors before and after fct_collapse:

hairColors
 [1] brown  ginger black  brown  <NA>   brown  black  brown  blond  gray  
Levels: black brown ginger blond gray white
fct_collapse( hairColors, dark = c( "brown", "black" ), light = c( "blond", "white" ) )
 [1] dark   ginger dark   dark   <NA>   dark   dark   dark   light  gray  
Levels: dark ginger light gray
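A quick way to verify a collapse is to count the values afterwards: the count of each new level should equal the sum of the counts of the old levels it replaces. The sketch below uses table() for the check; the names collapsed and collapsedCounts are illustrative.

```r
library( forcats )

hairColors <- factor(
  c( "brown", "ginger", "black", "brown", NA, "brown", "black", "brown", "blond", "gray" ),
  levels = c( "black", "brown", "ginger", "blond", "gray", "white" )
)

collapsed <- fct_collapse( hairColors, dark = c( "brown", "black" ), light = c( "blond", "white" ) )

# Counts per level after collapsing (missing values are not counted by default).
# dark  = brown (4) + black (2) = 6
# light = blond (1) + white (0) = 1
collapsedCounts <- table( collapsed )
collapsedCounts
```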

Notes

In this course we use the forcats package, which is part of the tidyverse, to perform factor operations.
➡️ Go to the cheat sheet section to obtain a PDF with a summary of the functions for factors.



Copyright © 2024 Biomedical Data Sciences (BDS) | LUMC