Formula concept

A formula is a way to tell R that one variable depends on another.
A formula takes the form of an expression. For example, to specify a (statistical) model in which y depends on x, write:

y ~ x

The above code does not directly use the values of y and x. It only declares the form of the relation between these two variables.
A formula can be passed as an argument to a function.
To perform a calculation, such a function needs a tibble (data.frame) containing the y and x columns. Typically, the tibble is provided to the function through the argument called data.

A formula can be stored in a variable; try:

form <- y ~ x
form
y ~ x
<environment: 0x55aebd992d28>
class(form)
[1] "formula"
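
To illustrate that a formula only records the relation and does not evaluate its variables, here is a minimal sketch. It assumes a fresh R session in which neither y nor x has been defined; all.vars is a base R helper that lists the variable names found in a formula.

```r
# A formula is stored unevaluated, so y and x do not need to exist.
form <- y ~ x

# Names of the variables mentioned in the formula.
all.vars(form)

# A two-sided formula behaves like a call with three parts:
# the tilde operator, the left-hand side and the right-hand side.
form[[2]]   # left-hand side: y
form[[3]]   # right-hand side: x
```

The variable names are only looked up later, usually inside the data argument of the function that receives the formula.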

Example: Student’s t-test

Let’s use the pulse data and Student’s t-test to test whether the mean pulse before the exercise (pulse1) differs significantly between the two gender groups.

The t.test function can be called with two vectors of numbers.
Let’s put the female pulses into the first vector and the male pulses into the second vector. Try:

femalePulse1 <- pulse %>% filter( gender == "female" ) %>% pull( pulse1 )
malePulse1 <- pulse %>% filter( gender == "male" ) %>% pull( pulse1 )

Then, we may call t.test on these two vectors:

t.test( femalePulse1, malePulse1 )

    Welch Two Sample t-test

data:  femalePulse1 and malePulse1
t = 1.3234, df = 106.3, p-value = 0.1885
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.667268  8.362184
sample estimates:
mean of x mean of y 
 77.50000  74.15254 

However, manual extraction of the femalePulse1 and malePulse1 vectors is not necessary, because the pulse tibble already holds the measurements and the group labels in separate columns.
Try the code below to see the t.test function used with the formula notation.
You will see the same result as above:

t.test( pulse1 ~ gender, data = pulse )

    Welch Two Sample t-test

data:  pulse1 by gender
t = 1.3234, df = 106.3, p-value = 0.1885
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
 -1.667268  8.362184
sample estimates:
mean in group female   mean in group male 
            77.50000             74.15254 
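
The equivalence of the two calling styles can also be checked programmatically. The pulse data is not reproduced here, so this sketch uses a small data frame named toy with made-up values (an assumption for illustration only); the names femaleVals, maleVals, byVectors and byFormula are likewise hypothetical.

```r
# Toy data, made up for illustration only (not the course's pulse data).
toy <- data.frame(
  gender = rep(c("female", "male"), each = 4),
  pulse1 = c(78, 82, 75, 90, 70, 74, 68, 80)
)

# Two-vector interface: extract each group by hand.
femaleVals <- toy$pulse1[toy$gender == "female"]
maleVals   <- toy$pulse1[toy$gender == "male"]
byVectors  <- t.test(femaleVals, maleVals)

# Formula interface: the groups are read from the gender column.
byFormula <- t.test(pulse1 ~ gender, data = toy)

# Both calls run the same test, so the p-values agree.
all.equal(byVectors$p.value, byFormula$p.value)
```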

Sometimes it is convenient to store the formula in a separate variable.
The following code is also possible:

form <- pulse1 ~ gender
t.test( form, data = pulse )

    Welch Two Sample t-test

data:  pulse1 by gender
t = 1.3234, df = 106.3, p-value = 0.1885
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
 -1.667268  8.362184
sample estimates:
mean in group female   mean in group male 
            77.50000             74.15254 
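
The outputs above are titled "Welch Two Sample t-test" because t.test does not assume equal variances by default. The classical Student's t-test with a pooled variance can be requested with the var.equal argument. The sketch below uses a made-up data frame named toy (an assumption for illustration, not the pulse data).

```r
# Toy data, made up for illustration only.
toy <- data.frame(
  gender = rep(c("female", "male"), each = 4),
  pulse1 = c(78, 82, 75, 90, 70, 74, 68, 80)
)

# Default: Welch's test, which does not assume equal variances.
welch <- t.test(pulse1 ~ gender, data = toy)

# Classical Student's test: pooled variance, df = n1 + n2 - 2.
student <- t.test(pulse1 ~ gender, data = toy, var.equal = TRUE)

welch$method
student$method
student$parameter   # degrees of freedom
```

Whether the equal-variance assumption is appropriate depends on the data, so the choice between the two variants is left to the analyst.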

Note that the result of the t.test function has the form of a list. Try:

h <- t.test( pulse1 ~ gender, data = pulse )
names( h )
 [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"    "null.value"  "stderr"     
 [8] "alternative" "method"      "data.name"  
h$p.value
[1] 0.1885446
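
Other elements can be extracted in the same way. The following sketch again relies on a made-up data frame named toy (an assumption, not the pulse data) and shows the class of the result together with the estimates and the confidence interval.

```r
# Toy data, made up for illustration only.
toy <- data.frame(
  gender = rep(c("female", "male"), each = 4),
  pulse1 = c(78, 82, 75, 90, 70, 74, 68, 80)
)
h <- t.test(pulse1 ~ gender, data = toy)

class(h)      # the result is a list with class "htest"
is.list(h)

h$estimate    # the two group means
h$conf.int    # confidence interval for the difference in means

# The confidence level is stored as an attribute of the interval.
attr(h$conf.int, "conf.level")
```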

Example: Simple linear regression

Let’s use the pulse data and try to model individual weight as a linear function of height.

The command for linear regression is lm (for linear model).
Its arguments need to specify a tibble containing the data and a formula specifying the dependence between the variables to be modelled. Try:

fit <- lm(weight ~ height, data = pulse)
fit

Call:
lm(formula = weight ~ height, data = pulse)

Coefficients:
(Intercept)       height  
   -27.4398       0.5465  

Again, the returned object resembles a list. Check its names:

names( fit )
 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"       
 [7] "qr"            "df.residual"   "xlevels"       "call"          "terms"         "model"        
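
A fitted model is usually inspected through accessor functions rather than by indexing the list directly. The sketch below uses a made-up data frame named toy (an assumption for illustration, not the pulse data); coef and predict are standard functions for lm objects, and the height of 172 is an arbitrary example value.

```r
# Toy data, made up for illustration only.
toy <- data.frame(
  height = c(160, 165, 170, 175, 180, 185),
  weight = c(55, 60, 63, 70, 74, 80)
)
fit <- lm(weight ~ height, data = toy)

class(fit)   # "lm"
coef(fit)    # named vector: (Intercept) and height

# Predict the weight for a new height; the column name must match
# the variable name used on the right-hand side of the formula.
newPerson <- data.frame(height = 172)
predict(fit, newdata = newPerson)
```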

Formulas may have a more complex form. The exact meaning of such formulas is beyond the scope of this introduction. Here is just one example:

lm(pulse2 ~ pulse1 + exercise + pulse1:exercise, data=pulse)

Call:
lm(formula = pulse2 ~ pulse1 + exercise + pulse1:exercise, data = pulse)

Coefficients:
            (Intercept)                   pulse1              exerciselow         exercisemoderate  
               25.65586                  0.99045                 15.52877                  0.77866  
     pulse1:exerciselow  pulse1:exercisemoderate  
               -0.27842                 -0.05222  
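
Without going into the statistical meaning, the terms of a formula can be listed with the terms function, which needs no data. As a small sketch, the variable names long and short below are hypothetical; the example shows that the * operator is a shorthand for the two main terms plus their : interaction.

```r
# The formula from the example above, written out in full.
long <- pulse2 ~ pulse1 + exercise + pulse1:exercise

# The same model written with the * shorthand.
short <- pulse2 ~ pulse1 * exercise

# Both formulas expand to the same set of terms.
attr(terms(long), "term.labels")
attr(terms(short), "term.labels")
```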


Copyright © 2023 Biomedical Data Sciences (BDS) | LUMC