A formula is a way to tell R that one variable
depends on another.
A formula has the form of an expression. For example, to specify a
(statistical) model in which y depends on x,
write:
y ~ x
The above code does not directly use the values of y and
x. It only declares the form of the relation between these two
variables.
The formula can be passed as an argument to functions.
To perform a calculation, such a function needs a
tibble (data.frame) containing the y
and x columns. Typically, the tibble is
provided to the function through the argument called data.
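As a minimal sketch of this idea, the code below creates a formula before any data exist and only afterwards pairs it with a table. The names form_demo, toy and mf, and the numbers in toy, are made up for illustration and are not part of the course data. The base R function model.frame is used here only to show which columns a function would pick up from the data argument:

```r
# A formula only records variable names; y and x need not exist yet.
form_demo <- y ~ x
all.vars(form_demo)   # the variable names mentioned in the formula

# A small made-up table (hypothetical values) with columns named y and x.
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 7.8))

# Only now are y and x looked up, as columns of `toy`.
mf <- model.frame(form_demo, data = toy)
mf
```

The same formula could be combined with a different table, as long as that table has columns called y and x.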
A formula can be stored in a variable; try:
form <- y ~ x
form
y ~ x
<environment: 0x55dea1daaa80>
class(form)
[1] "formula"
Let’s use the pulse data and Student’s t-test to test whether
the mean pulse before the exercise (pulse1) differs
significantly between the two gender groups.
The t.test function can be called with two
vectors of numbers.
Let’s put the female pulses into the first vector and the male pulses into the
second vector. Try:
femalePulse1 <- pulse %>% filter( gender == "female" ) %>% pull( pulse1 )
malePulse1 <- pulse %>% filter( gender == "male" ) %>% pull( pulse1 )
Then, we may call t.test on these two vectors:
t.test( femalePulse1, malePulse1 )
    Welch Two Sample t-test
data:  femalePulse1 and malePulse1
t = 1.3234, df = 106.3, p-value = 0.1885
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.667268  8.362184
sample estimates:
mean of x mean of y 
 77.50000  74.15254 
However, manually extracting the malePulse1 and
femalePulse1 vectors is not necessary given the format of the
data in the pulse tibble.
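The reason is that such a table keeps the measurements in one column and the group labels in another, so the grouping column alone already determines the two vectors. A minimal sketch with a small made-up data frame (toy, value, group and the numbers are hypothetical, not the pulse data) shows this with the base R function split:

```r
# Made-up data: one measurement column and one grouping column.
toy <- data.frame(
  value = c(70, 82, 76, 68, 74, 71),
  group = c("a", "a", "a", "b", "b", "b")
)

# Splitting the measurements by group yields one vector per group,
# the same kind of vectors that filter() and pull() produce by hand.
groups <- split(toy$value, toy$group)
groups$a
groups$b
```

Conceptually, a formula such as value ~ group asks a function to do this kind of grouping itself.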
Try the code below to see the t.test function used with the
formula notation.
You will see the same result as above:
t.test( pulse1 ~ gender, data = pulse )
    Welch Two Sample t-test
data:  pulse1 by gender
t = 1.3234, df = 106.3, p-value = 0.1885
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
 -1.667268  8.362184
sample estimates:
mean in group female   mean in group male 
            77.50000             74.15254 
Sometimes, it might be convenient to have the formula stored in a
separate variable.
The following code is also possible:
form <- pulse1 ~ gender
t.test( form, data = pulse )
    Welch Two Sample t-test
data:  pulse1 by gender
t = 1.3234, df = 106.3, p-value = 0.1885
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
 -1.667268  8.362184
sample estimates:
mean in group female   mean in group male 
            77.50000             74.15254 
Note that the result of the t.test function has the form
of a list. Try:
h <- t.test( pulse1 ~ gender, data = pulse )
names( h )
 [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"    "null.value"  "stderr"      "alternative" "method"      "data.name"  
h$p.value
[1] 0.1885446
Let’s use the pulse data and try to model individual
weight as a linear function of
height.
The command for linear regression is lm (for
linear model).
Its arguments need to specify a tibble containing the
data and a formula specifying the dependence between the variables
to be modelled. Try:
fit <- lm(weight ~ height, data = pulse)
fit
Call:
lm(formula = weight ~ height, data = pulse)
Coefficients:
(Intercept)       height  
   -27.4398       0.5465  
Again, the returned object resembles a list. Check its
names:
names( fit )
 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"        "qr"            "df.residual"   "xlevels"       "call"         
[11] "terms"         "model"        
Formulas can have a more complex form. The exact meaning of such formulas is beyond the scope of this introduction. Here is just an example:
lm(pulse2 ~ pulse1 + exercise + pulse1:exercise, data=pulse)
Call:
lm(formula = pulse2 ~ pulse1 + exercise + pulse1:exercise, data = pulse)
Coefficients:
            (Intercept)                   pulse1              exerciselow         exercisemoderate       pulse1:exerciselow  pulse1:exercisemoderate  
               25.65586                  0.99045                 15.52877                  0.77866                 -0.27842                 -0.05222  
Copyright © 2024 Biomedical Data Sciences (BDS) | LUMC