A factor is a vector of categorical values, each drawn from a limited set of possibilities (the levels).
For example, a factor can be used to store the hair colors of a group of people.
Let’s assume that we can classify each person into one of the following hair colors.
These will be the levels of the factor:
c( "black", "brown", "ginger", "blond", "gray", "white" )
Now, let’s assume that we observed the following hair colors for 10 people.
NA (not available, missing value) stands for a person whose hair color could not be determined.
These will be the values of the factor:
c( "brown", "ginger", "black", "brown", NA, "brown", "black", "brown", "blond", "gray" )
Enter the following to create the factor and store it in a variable hairColors:
hairColors <- factor(
  c( "brown", "ginger", "black", "brown", NA, "brown", "black", "brown", "blond", "gray" ),
  levels = c( "black", "brown", "ginger", "blond", "gray", "white" )
)
When printed, both values and levels of a factor are shown:
hairColors
[1] brown ginger black brown <NA> brown black brown blond gray
Levels: black brown ginger blond gray white
The class of a factor is, as expected:
class( hairColors )
[1] "factor"
In R, a factor is stored as two vectors.
The first vector holds the levels; each level is coded by its position, a consecutive integer: 1=black, 2=brown, …
Type the following to get this vector:
levels( hairColors )
[1] "black" "brown" "ginger" "blond" "gray" "white"
To find the number of factor levels, enter:
nlevels( hairColors )
[1] 6
The second vector keeps the integer codes corresponding to the values.
Try the following and work out how the integers relate to the levels:
as.integer( hairColors )
[1] 2 3 1 2 NA 2 1 2 4 5
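The relation between the two vectors can be checked directly: indexing the levels by the integer codes should give back the original values. The sketch below uses only base R; the variable names codes, labels and decoded are introduced here for illustration.

```r
# Recreate the factor from the text above
hairColors <- factor(
  c( "brown", "ginger", "black", "brown", NA, "brown", "black", "brown", "blond", "gray" ),
  levels = c( "black", "brown", "ginger", "blond", "gray", "white" )
)

codes  <- as.integer( hairColors )  # integer codes; a missing value stays NA
labels <- levels( hairColors )      # level labels; the position of a label is its code

# Index the labels by the codes to translate the codes back to values
decoded <- labels[ codes ]
decoded

# The decoded values match the factor converted to character
identical( decoded, as.character( hairColors ) )
```

Note that the level white has a code (6) even though no value uses it.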
Use the following function to produce a tibble with the number of occurrences of each level in a factor:
library( forcats ) # fct_* functions are in this package
# library( tidyverse ) # may be used instead of library( forcats )
fct_count( hairColors )
# A tibble: 7 × 2
  f          n
  <fct>  <int>
1 black      2
2 brown      4
3 ginger     1
4 blond      1
5 gray       1
6 white      0
7 <NA>       1
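As a cross-check, similar counts can be obtained in base R with table(). This is an alternative sketch, not the forcats approach used in this course; the variable name counts is introduced here for illustration.

```r
# Recreate the factor from the text above
hairColors <- factor(
  c( "brown", "ginger", "black", "brown", NA, "brown", "black", "brown", "blond", "gray" ),
  levels = c( "black", "brown", "ginger", "blond", "gray", "white" )
)

# table() counts every level of a factor, including unused ones;
# useNA = "ifany" adds a count for missing values when there are any
counts <- table( hairColors, useNA = "ifany" )
counts
```

The result is a table object rather than a tibble, so fct_count is usually more convenient when the counts are processed further with tidyverse functions.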
ℹ️ The need to rename levels of a factor often arises when factors are plotted and the levels appear in legends.
Try to use fct_recode to change levels by providing pairs of new_level = "old_level".
For example, compare hairColors before and after fct_recode:
hairColors
[1] brown ginger black brown <NA> brown black brown blond gray
Levels: black brown ginger blond gray white
fct_recode( hairColors, BROWN = "brown", GRAY = "gray" )
[1] BROWN ginger black BROWN <NA> BROWN black BROWN blond GRAY
Levels: black BROWN ginger blond GRAY white
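fct_recode returns a new factor and leaves its input untouched, so the result must be assigned to keep it. A minimal sketch, assuming the forcats package is installed; the variable name recoded is introduced here for illustration.

```r
library( forcats )

# Recreate the factor from the text above
hairColors <- factor(
  c( "brown", "ginger", "black", "brown", NA, "brown", "black", "brown", "blond", "gray" ),
  levels = c( "black", "brown", "ginger", "blond", "gray", "white" )
)

# Store the recoded factor in a new variable
recoded <- fct_recode( hairColors, BROWN = "brown", GRAY = "gray" )

levels( hairColors )  # original levels, unchanged
levels( recoded )     # renamed levels, same order as before
```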
ℹ️ The need to reorder levels of a factor often arises when factors are plotted and the elements of the plot are drawn in level order. Also, when fitting models, the first level of the factor is often taken as the baseline/reference.
Try to use fct_relevel to move the given levels to the front.
For example, compare hairColors before and after fct_relevel:
hairColors
[1] brown ginger black brown <NA> brown black brown blond gray
Levels: black brown ginger blond gray white
fct_relevel( hairColors, c( "gray", "white" ) )
[1] brown ginger black brown <NA> brown black brown blond gray
Levels: gray white black brown ginger blond
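fct_relevel also takes an after argument that places the given levels somewhere other than the front; after = Inf moves them to the end. A minimal sketch, assuming the forcats package is installed; the variable name blackLast is introduced here for illustration.

```r
library( forcats )

# Recreate the factor from the text above
hairColors <- factor(
  c( "brown", "ginger", "black", "brown", NA, "brown", "black", "brown", "blond", "gray" ),
  levels = c( "black", "brown", "ginger", "blond", "gray", "white" )
)

# Move the level "black" to the last position
blackLast <- fct_relevel( hairColors, "black", after = Inf )

levels( blackLast )       # "black" is now the last level
levels( blackLast )[ 1 ]  # the first level, often used as the baseline in models
```

Only the order of the levels changes; the values stored in the factor stay the same.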
ℹ️ The need to collapse levels of a factor often arises when there are too many rare levels.
Try to use fct_collapse to collapse several old levels into one new level using the notation new_level = c( "old_level_1", "old_level_2" ).
For example, compare hairColors before and after fct_collapse:
hairColors
[1] brown ginger black brown <NA> brown black brown blond gray
Levels: black brown ginger blond gray white
fct_collapse( hairColors, dark = c( "brown", "black" ), light = c( "blond", "white" ) )
[1] dark ginger dark dark <NA> dark dark dark light gray
Levels: dark ginger light gray
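After collapsing, the counts of the old levels are added up in the new level, which can be checked with fct_count. A minimal sketch, assuming the forcats package is installed; the variable name collapsed is introduced here for illustration.

```r
library( forcats )

# Recreate the factor from the text above
hairColors <- factor(
  c( "brown", "ginger", "black", "brown", NA, "brown", "black", "brown", "blond", "gray" ),
  levels = c( "black", "brown", "ginger", "blond", "gray", "white" )
)

# Collapse four of the old levels into two new ones
collapsed <- fct_collapse(
  hairColors,
  dark  = c( "brown", "black" ),
  light = c( "blond", "white" )
)

# dark combines brown (4) and black (2); light combines blond (1) and white (0)
fct_count( collapsed )
```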
In this course we use the forcats package, which is part of the tidyverse, to perform factor operations.
➡️ Go to the cheat sheet section to obtain a PDF with a summary of functions for factors.
Copyright © 2024 Biomedical Data Sciences (BDS) | LUMC