A formula is a way to tell R that one variable depends on another.
A formula has the form of an expression. For example, to specify a (statistical) model in which y depends on x, write:
y ~ x
The above code does not directly use the values of y and x. It only declares the form of the relation between these two variables.
The formula can be passed as an argument to functions.
Such a function, in order to perform a calculation, needs a tibble (data.frame) containing the y and x columns. Typically, the tibble is provided to the function through the argument called data.
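As a minimal sketch of how a function combines a formula with its data argument, the base R function model.frame can be called directly. The data frame df and its values are made up for illustration and are not part of the pulse data.

```r
# Hypothetical toy data; the column names match the variables in the formula.
df <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 7.8))

# The formula itself holds no numbers; model.frame() looks up y and x in df.
mf <- model.frame(y ~ x, data = df)
names(mf)   # the columns picked out by the formula
nrow(mf)    # one row per observation in df
```

Functions such as t.test and lm perform a similar lookup internally when they receive a formula together with data.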
A formula can be stored in a variable; try:
form <- y ~ x
form
y ~ x
<environment: 0x55aebd992d28>
class(form)
[1] "formula"
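A stored formula can also be inspected without any data. The short sketch below uses only base R functions; note that no objects named y or x have to exist for it to run.

```r
form <- y ~ x      # y and x are only symbols here, not existing objects

all.vars(form)     # names of the variables used in the formula
form[[2]]          # left-hand side (the dependent variable)
form[[3]]          # right-hand side (the explanatory variable)
```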
Let’s use the pulse data and Student’s t-test to check whether the mean pulse before the exercise (pulse1) differs significantly between the two gender groups.
The t.test function can be called with two vectors of numbers (by default it runs the Welch variant of the test, which does not assume equal variances).
Let’s put the female pulses into the first vector and the male pulses into the second vector. Try:
femalePulse1 <- pulse %>% filter( gender == "female" ) %>% pull( pulse1 )
malePulse1 <- pulse %>% filter( gender == "male" ) %>% pull( pulse1 )
Then, we may call t.test on these two vectors:
t.test( femalePulse1, malePulse1 )
Welch Two Sample t-test
data: femalePulse1 and malePulse1
t = 1.3234, df = 106.3, p-value = 0.1885
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.667268 8.362184
sample estimates:
mean of x mean of y
77.50000 74.15254
However, manually extracting the malePulse1 and femalePulse1 vectors is not necessary, because the pulse tibble already stores the measurements (pulse1) and the grouping variable (gender) in separate columns.
Try the code below to see the t.test function used with the formula notation. You will see the same result as above:
t.test( pulse1 ~ gender, data = pulse )
Welch Two Sample t-test
data: pulse1 by gender
t = 1.3234, df = 106.3, p-value = 0.1885
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
-1.667268 8.362184
sample estimates:
mean in group female mean in group male
77.50000 74.15254
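The equivalence of the two calling styles can also be checked on data that ships with R. The sketch below uses the built-in ToothGrowth data set as a stand-in for pulse (an assumption made only so the example is self-contained); the names ojLen, vcLen, byVectors and byFormula are illustrative.

```r
# ToothGrowth: len is numeric, supp is a factor with the levels OJ and VC.
ojLen <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vcLen <- ToothGrowth$len[ToothGrowth$supp == "VC"]

byVectors <- t.test(ojLen, vcLen)                    # two-vector call
byFormula <- t.test(len ~ supp, data = ToothGrowth)  # formula call

# Both calls run the same Welch test, so the p-values agree.
all.equal(byVectors$p.value, byFormula$p.value)
```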
Sometimes, it might be convenient to have the formula stored in a separate variable.
The following code is also possible:
form <- pulse1 ~ gender
t.test( form, data = pulse )
Welch Two Sample t-test
data: pulse1 by gender
t = 1.3234, df = 106.3, p-value = 0.1885
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
-1.667268 8.362184
sample estimates:
mean in group female mean in group male
77.50000 74.15254
Note that the result of the t.test function has the form of a list. Try:
h <- t.test( pulse1 ~ gender, data = pulse )
names( h )
[1] "statistic" "parameter" "p.value" "conf.int" "estimate" "null.value" "stderr"
[8] "alternative" "method" "data.name"
h$p.value
[1] 0.1885446
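The other elements of the result can be extracted in the same way. A small sketch, again using the built-in ToothGrowth data set instead of pulse so that it runs on its own:

```r
h <- t.test(len ~ supp, data = ToothGrowth)

is.list(h)     # the result is stored as a list ...
class(h)       # ... with its own class, used for printing
h$statistic    # the t statistic
h$conf.int     # confidence interval for the difference in means
h$estimate     # the two group means
```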
Let’s use the pulse data and try to model individual weight as a linear function of height.
The command for linear regression is lm (for linear model).
Its arguments need to specify a tibble containing the data and a formula describing the dependence between the variables to be modelled. Try:
fit <- lm(weight ~ height, data = pulse)
fit
Call:
lm(formula = weight ~ height, data = pulse)
Coefficients:
(Intercept) height
-27.4398 0.5465
Again, the returned object resembles a list. Check its names:
names( fit )
[1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign"
[7] "qr" "df.residual" "xlevels" "call" "terms" "model"
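Besides the $ notation, base R offers accessor functions for the most commonly used parts of a fitted model. The sketch below uses the built-in women data set (columns height and weight) as a stand-in for pulse; the height value 65 passed to predict is an arbitrary illustration.

```r
fit <- lm(weight ~ height, data = women)

coef(fit)               # same values as fit$coefficients
head(residuals(fit))    # first few residuals, same as fit$residuals

# Predicted weight for a new, hypothetical observation
predict(fit, newdata = data.frame(height = 65))
```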
Formulas can take more complex forms. The exact meaning of such formulas is beyond the scope of this introduction. Here is just one example:
lm(pulse2 ~ pulse1 + exercise + pulse1:exercise, data=pulse)
Call:
lm(formula = pulse2 ~ pulse1 + exercise + pulse1:exercise, data = pulse)
Coefficients:
(Intercept) pulse1 exerciselow exercisemoderate
25.65586 0.99045 15.52877 0.77866
pulse1:exerciselow pulse1:exercisemoderate
-0.27842 -0.05222
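Without going into the statistics, the structure of such a formula can be inspected with terms. The sketch below also shows that a * b is shorthand for a + b + a:b; the names y, a and b are placeholders, and the built-in mtcars data set is used instead of pulse so that the example runs on its own.

```r
# List the terms of a formula; no data are needed for this step.
labelsLong  <- attr(terms(y ~ a + b + a:b), "term.labels")
labelsShort <- attr(terms(y ~ a * b), "term.labels")
identical(labelsLong, labelsShort)

# The same shorthand in an actual model fit on built-in data.
long  <- lm(mpg ~ wt + factor(am) + wt:factor(am), data = mtcars)
short <- lm(mpg ~ wt * factor(am), data = mtcars)
all.equal(coef(long), coef(short))
```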
Copyright © 2023 Biomedical Data Sciences (BDS) | LUMC