Formulas (not on exam) (solutions)

Primary exercises

A formula in a function; simple linear regression.
Load the survey.csv data to variable survey.
You may make a simple scatter plot with plot( survey$span2, survey$span1 ).
Perform a linear regression of span1 being dependent on span2; store the result in fit variable.
fit behaves also like a list so show names of elements stored in fit.
Find and print the element corresponding to the coefficients of the regression – do you see there a coefficient for span2?
An intended method to access the coefficients is through coef function applied to fit. Try it. Do you get the same result as observed above?

survey <- read_csv( "survey.csv" )

plot( survey$span2, survey$span1 )

fit <- lm( span1 ~ span2, data = survey )
fit


Call:
lm(formula = span1 ~ span2, data = survey)

Coefficients:
(Intercept)        span2  
     1.9367       0.9001

names( fit )

 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"        "qr"            "df.residual"   "xlevels"       "call"         
[11] "terms"         "model"

fit$coefficients

(Intercept)       span2 
  1.9367408   0.9000552

coef( fit )

(Intercept)       span2 
  1.9367408   0.9000552

Extra exercises

Simple regression diagnostic plots (demo).
Apply plot function to the fit object calculated in the above exercise.
These are diagnostic plots for the model produced by lm.

plot( fit )

Another test for association/correlation.
In one of the extra list exercises a test for correlation between two random vectors was performed.
Follow the same steps to test correlation between span1 and span2 of survey.
Is correlation between these vectors significant?

h <- cor.test( survey$span2, survey$span1 ) # vector notation
h <- cor.test( formula = ~ span2 + span1, data = survey )  # formula notation
h


    Pearson's product-moment correlation

data:  span2 and span1
t = 45.203, df = 231, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9329904 0.9594917
sample estimates:
      cor 
0.9478552

names( h )

[1] "statistic"   "parameter"   "p.value"     "estimate"    "null.value"  "alternative" "method"      "data.name"   "conf.int"

h[[ 'estimate' ]]   # correlation coefficient

      cor 
0.9478552

h[[ 'p.value' ]]    # significance, is < 0.05, strongly significant association

[1] 1.057707e-116

Multitopic exercises

(ADV) Fitting multiple simple linear regressions to parts of a table.

Load the pulse.csv dataset into the pulse variable.
Let’s define the following goal: separately for each gender calculate a linear fit of weight as a function of height.

First split the pulse table by gender into the pulseByGender variable.

pulseByGender <- pulse %>% split( .$gender )

Now, pulseByGender is a list and can be accessed with [[...]] operator.
Write the code to perform a linear fit of weight as a function of height for females only (obtained from pulseByGender list).

lm( weight ~ height, data = pulseByGender[[ "female" ]] )


Call:
lm(formula = weight ~ height, data = pulseByGender[["female"]])

Coefficients:
(Intercept)       height  
    -22.691        0.482

The next goal is to perform the above fit multiple times, separately for each gender.
Define a variable genders with all genders present in the split object.

genders <- names( pulseByGender )
genders

[1] "female" "male"

Now a loop is needed.
A variable gender should iterate over all genders.
For each value of gender the respective linear fit should be performed.
Understand and try the following code:

for( gender in genders ) {
  lm( weight ~ height, data = pulseByGender[[ gender ]] )
}

The above code calculates all needed fits but does show/store the results anywhere.
The results need to be stored, for example in a list.
Understand/modify the code as follows:

fitByGender <- list()
for( gender in genders ) {
  fitByGender[[gender]] <- lm( weight ~ height, data = pulseByGender[[ gender ]] )
}
fitByGender

$female

Call:
lm(formula = weight ~ height, data = pulseByGender[[gender]])

Coefficients:
(Intercept)       height  
    -22.691        0.482  


$male

Call:
lm(formula = weight ~ height, data = pulseByGender[[gender]])

Coefficients:
(Intercept)       height  
    17.3348       0.3231

fitByGender[[ "female" ]]


Call:
lm(formula = weight ~ height, data = pulseByGender[[gender]])

Coefficients:
(Intercept)       height  
    -22.691        0.482

A special function lapply allows to write the above code differently.
In R this is a preferred notation (despite being less intuitive at the first glance).
Rewrite the for loop into lapply as follows:

fitByGender <- lapply( genders, function( gender ) {
  lm( weight ~ height, data = pulseByGender[[ gender ]] )
} )
fitByGender

[[1]]

Call:
lm(formula = weight ~ height, data = pulseByGender[[gender]])

Coefficients:
(Intercept)       height  
    -22.691        0.482  


[[2]]

Call:
lm(formula = weight ~ height, data = pulseByGender[[gender]])

Coefficients:
(Intercept)       height  
    17.3348       0.3231

names( fitByGender )

NULL

As you can see, fitByGender is a list but the elements are not named.
This is because lapply names the elements as they were named in the iterated input.
Check names of genders (you will see NULL names).

genders

[1] "female" "male"

names( genders )

NULL

Consequently, set the names of genders elements to be equal to the element values.

names( genders ) <- genders
genders

  female     male 
"female"   "male"

Now, let’s repeat the lapply loop and check the result.
Check (by eye) that fitByGender[[ 'female' ]] shows the same result as at the top of this example.

fitByGender <- lapply( genders, function( gender ) {
  lm( weight ~ height, data = pulseByGender[[ gender ]] )
} )
fitByGender

$female

Call:
lm(formula = weight ~ height, data = pulseByGender[[gender]])

Coefficients:
(Intercept)       height  
    -22.691        0.482  


$male

Call:
lm(formula = weight ~ height, data = pulseByGender[[gender]])

Coefficients:
(Intercept)       height  
    17.3348       0.3231

fitByGender[[ 'female' ]] # compare to lm( weight ~ height, data = pulseByGender[[ "female" ]] )


Call:
lm(formula = weight ~ height, data = pulseByGender[[gender]])

Coefficients:
(Intercept)       height  
    -22.691        0.482

↑ Lecture ⇄ Practice