A formula is a way to tell R that one variable
depends on another.
A formula has the form of an expression. For example, to specify a (statistical) model in which y depends on x, write:
y ~ x
The above code does not directly use the values of y and x. It only declares the form of the relation between these two variables.
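As a minimal sketch of this point, a formula can be created from names that do not exist as objects at all, because R stores the expression without evaluating it. The names undefined_response and undefined_predictor are hypothetical and used only for illustration; all.vars() is a base R helper that lists the variable names mentioned in a formula.

```r
# Hypothetical names that are not defined anywhere in the session.
f <- undefined_response ~ undefined_predictor   # no error: nothing is evaluated

# all.vars() lists the variable names mentioned in the formula.
vars <- all.vars(f)
print(vars)

# The names really do not exist as objects.
print(exists("undefined_response"))
```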
A formula can be passed as an argument to functions. To perform a calculation, such a function needs a tibble (data.frame) containing the y and x columns. Typically, the tibble is provided to the function through the argument called data.
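The lookup of formula names in the data argument can be sketched with model.frame(), a base R function that many modelling functions rely on internally. The data frame d and its columns are made up for illustration; the extra column z is there only to show that columns not named in the formula are ignored.

```r
# A small, made-up data frame (illustration only).
d <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(2.1, 3.9, 6.2, 7.8),
  z = c("a", "b", "a", "b")
)

# model.frame() resolves the names in the formula against the columns of `data`.
mf <- model.frame(y ~ x, data = d)

# Only the columns mentioned in the formula are kept, response first.
print(names(mf))
print(nrow(mf))
```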
A formula can be stored in a variable; try:
form <- y ~ x
form
y ~ x
<environment: 0x556c998939f8>
class(form)
[1] "formula"
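Because a formula is an ordinary R object, its parts can be inspected, and a formula can also be built from text. The following is a small sketch using only base R; the variable names lhs, rhs and form_from_text are chosen here for illustration.

```r
form <- y ~ x

# A two-sided formula has three parts: the ~ operator, the left side, the right side.
print(length(form))
lhs <- form[[2]]   # left-hand side (response)
rhs <- form[[3]]   # right-hand side (predictor)
print(lhs)
print(rhs)

# as.formula() converts a character string into a formula.
form_from_text <- as.formula(paste("y", "~", "x"))
print(class(form_from_text))
```

Building a formula from text can be handy when the column names are only known at run time.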
Let’s use the pulse data and a two-sample t-test to check whether the mean pulse before the exercise (pulse1) differs significantly between the two gender groups.
The t.test function can be called with two vectors of numbers.
Let’s put the female pulses into the first vector and the male pulses into the second vector. Try:
femalePulse1 <- pulse %>% filter( gender == "female" ) %>% pull( pulse1 )
malePulse1 <- pulse %>% filter( gender == "male" ) %>% pull( pulse1 )
Then we can call t.test on these two vectors:
t.test( femalePulse1, malePulse1 )
Welch Two Sample t-test
data: femalePulse1 and malePulse1
t = 1.3234, df = 106.3, p-value = 0.1885
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.667268 8.362184
sample estimates:
mean of x mean of y
77.50000 74.15254
However, manually extracting the malePulse1 and femalePulse1 vectors is not necessary, because the pulse tibble already holds the measurements (pulse1) and the grouping (gender) as columns.
Try the code below to see the t.test function used with the formula notation.
You will see the same numerical result as above:
t.test( pulse1 ~ gender, data = pulse )
Welch Two Sample t-test
data: pulse1 by gender
t = 1.3234, df = 106.3, p-value = 0.1885
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
-1.667268 8.362184
sample estimates:
mean in group female mean in group male
77.50000 74.15254
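The equivalence of the two calling styles can also be checked programmatically. The sketch below uses the sleep dataset that ships with R as a stand-in, since the pulse data may not be at hand; its columns extra (measurement) and group (two-level factor) play the roles of pulse1 and gender.

```r
# Stand-in data: the built-in `sleep` dataset (not the pulse data).
first  <- sleep$extra[sleep$group == "1"]
second <- sleep$extra[sleep$group == "2"]

# Vector interface and formula interface.
by_vectors <- t.test(first, second)
by_formula <- t.test(extra ~ group, data = sleep)

# Both calls give the same test statistic and p-value.
print(all.equal(by_vectors$p.value, by_formula$p.value))
print(all.equal(unname(by_vectors$statistic), unname(by_formula$statistic)))
```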
Sometimes it is convenient to store the formula in a separate variable.
The following code also works:
form <- pulse1 ~ gender
t.test( form, data = pulse )
Welch Two Sample t-test
data: pulse1 by gender
t = 1.3234, df = 106.3, p-value = 0.1885
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
-1.667268 8.362184
sample estimates:
mean in group female mean in group male
77.50000 74.15254
Note that the result of the t.test function has the form of a list. Try:
h <- t.test( pulse1 ~ gender, data = pulse )
names( h )
[1] "statistic" "parameter" "p.value" "conf.int" "estimate" "null.value" "stderr" "alternative" "method" "data.name"
h$p.value
[1] 0.1885446
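The other elements can be extracted in the same way. As a sketch, again using the built-in sleep dataset as a stand-in for the pulse data, the following pulls out the test statistic, the confidence interval and the group means; the variable names on the left are chosen here for illustration.

```r
# Stand-in data: the built-in `sleep` dataset (not the pulse data).
h <- t.test(extra ~ group, data = sleep)

t_value     <- h$statistic   # named number, the t statistic
ci          <- h$conf.int    # lower and upper bound of the confidence interval
group_means <- h$estimate    # estimated mean of each group

print(t_value)
print(ci)
print(group_means)

# The confidence level is stored as an attribute of the interval.
print(attr(ci, "conf.level"))
```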
Let’s use the pulse data and try to model individual weight as a linear function of height.
The command for linear regression is lm (for linear model).
Its arguments need to specify a tibble containing the data and a formula specifying the dependence between the variables to be modelled. Try:
fit <- lm(weight ~ height, data = pulse)
fit
Call:
lm(formula = weight ~ height, data = pulse)
Coefficients:
(Intercept) height
-27.4398 0.5465
Again, the returned object resembles a list. Check its names:
names( fit )
[1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign" "qr" "df.residual" "xlevels" "call"
[11] "terms" "model"
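Rather than indexing the list directly, fitted models are usually queried through accessor functions such as coef() and predict(). The sketch below uses the built-in women dataset (columns height and weight) as a stand-in for the pulse data; the value 65 passed to predict() is an arbitrary height chosen for illustration.

```r
# Stand-in data: the built-in `women` dataset (not the pulse data).
fit <- lm(weight ~ height, data = women)

# coef() returns the estimated intercept and slope as a named vector.
coefs <- coef(fit)
print(coefs)

# predict() evaluates the fitted line for new values of the predictor.
new_data  <- data.frame(height = 65)
predicted <- predict(fit, newdata = new_data)
print(predicted)
```

Note that the newdata data frame must contain a column with the same name as the predictor in the formula.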
Formulas can have a more complex form. The exact meaning of such formulas is beyond the scope of this introduction. Here is just one example:
lm(pulse2 ~ pulse1 + exercise + pulse1:exercise, data=pulse)
Call:
lm(formula = pulse2 ~ pulse1 + exercise + pulse1:exercise, data = pulse)
Coefficients:
(Intercept) pulse1 exerciselow exercisemoderate pulse1:exerciselow pulse1:exercisemoderate
25.65586 0.99045 15.52877 0.77866 -0.27842 -0.05222
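Without going into the statistical meaning, the terms of such a formula can be listed with terms(), which needs no data. The sketch below also shows that, in R's formula notation, a * b is shorthand for a + b + a:b, so both spellings describe the same set of terms.

```r
# The formula from the example above, and its shorthand spelling.
f_long  <- pulse2 ~ pulse1 + exercise + pulse1:exercise
f_short <- pulse2 ~ pulse1 * exercise

# terms() parses the formula; the "term.labels" attribute lists its terms.
labels_long  <- attr(terms(f_long), "term.labels")
labels_short <- attr(terms(f_short), "term.labels")

print(labels_long)
print(identical(labels_long, labels_short))
```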
Copyright © 2023 Biomedical Data Sciences (BDS) | LUMC