Primary exercises

  1. Create and investigate a list.
    Three students received different sets of grades (Amy: 1,6,7,9,10; Bob: 6,7,4,3,5,2,2,1,4; Dan: 9,9,10).
    Create a list in the variable scores (the names of the list elements should be the names of the students and the values should be the corresponding grades).
    Print the list, then its class, length and structure (str).
scores <- list(
  Amy = c( 1,6,7,9,10 ),
  Bob = c( 6,7,4,3,5,2,2,1,4 ),
  Dan = c( 9,9,10 )
)
scores
$Amy
[1]  1  6  7  9 10

$Bob
[1] 6 7 4 3 5 2 2 1 4

$Dan
[1]  9  9 10
class( scores )
[1] "list"
length( scores )
[1] 3
str( scores )
List of 3
 $ Amy: num [1:5] 1 6 7 9 10
 $ Bob: num [1:9] 6 7 4 3 5 2 2 1 4
 $ Dan: num [1:3] 9 9 10
  2. Add an element, change an element.
    Reuse scores from the previous exercise.
    Add grades for Eve (7,3,5,8,8,9) to the list and print it.
    Then, merge new grades for Dan (8,8,6,7) with his existing grades (hint: use the combine function c to combine Dan’s existing grades with the new grades, then put the result back into scores; do not retype 9,9,10).
scores[[ 'Eve' ]] <- c(7,3,5,8,8,9)
scores
$Amy
[1]  1  6  7  9 10

$Bob
[1] 6 7 4 3 5 2 2 1 4

$Dan
[1]  9  9 10

$Eve
[1] 7 3 5 8 8 9
scores[[ "Dan" ]] <- c( scores[[ "Dan" ]], c(8,8,6,7) )
scores
$Amy
[1]  1  6  7  9 10

$Bob
[1] 6 7 4 3 5 2 2 1 4

$Dan
[1]  9  9 10  8  8  6  7

$Eve
[1] 7 3 5 8 8 9
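    The same two steps can also be written with the dollar operator and append. This is only a sketch of an alternative spelling; it rebuilds scores so that it runs on its own.

```r
# Rebuild the list from the first exercise so the sketch is self-contained.
scores <- list(
  Amy = c( 1,6,7,9,10 ),
  Bob = c( 6,7,4,3,5,2,2,1,4 ),
  Dan = c( 9,9,10 )
)
scores$Eve <- c( 7,3,5,8,8,9 )                    # $ can also create a new element
scores$Dan <- append( scores$Dan, c( 8,8,6,7 ) )  # append() extends the existing vector
names( scores )
scores$Dan
```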
  3. Single and double bracket operators.
    Reuse scores from the previous exercises.
    Investigate the difference between scores[[ "Bob" ]] and scores[ "Bob" ].
    Look at what is printed and what is the class of each result.
    Then compare scores[[ c( "Amy", "Bob" ) ]] with scores[ c( "Amy", "Bob" ) ].
    Understand why the error is reported.
scores[[ "Bob" ]]             # Returns the value of Bob element (vector)
[1] 6 7 4 3 5 2 2 1 4
scores[ "Bob" ]               # Creates a new list with only Bob there (list)
$Bob
[1] 6 7 4 3 5 2 2 1 4
class( scores[[ "Bob" ]] )
[1] "numeric"
class( scores[ "Bob" ] )
[1] "list"
scores[[ c( "Amy", "Bob" ) ]] # [[ ]] returns one element; a vector index means nested access (Bob inside Amy)
Error in scores[[c("Amy", "Bob")]]: subscript out of bounds
scores[ c( "Amy", "Bob" ) ]   # This creates a list, so many elements are ok
$Amy
[1]  1  6  7  9 10

$Bob
[1] 6 7 4 3 5 2 2 1 4
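    The error comes from how [[ ]] treats an index vector longer than one: it indexes recursively, step by step. The sketch below captures the error with try and then shows the same syntax succeeding on a nested list; the nested list is a made-up example, not part of the exercise.

```r
scores <- list(
  Amy = c( 1,6,7,9,10 ),
  Bob = c( 6,7,4,3,5,2,2,1,4 )
)
# [[ ]] with a vector of length two means "take Amy, then take Bob inside Amy".
# Amy is a plain numeric vector without an element called Bob, hence the error.
res <- try( scores[[ c( "Amy", "Bob" ) ]], silent = TRUE )
class( res )                                  # "try-error"

# In a nested list (hypothetical example) the same syntax is valid:
nested <- list( Amy = list( math = c( 1,6,7 ), biology = c( 7,6,8 ) ) )
nested[[ c( "Amy", "math" ) ]]                # same as nested[[ "Amy" ]][[ "math" ]]
```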
  4. Dollar operator.
    Reuse scores from the previous exercises.
    Investigate the (lack of) difference between scores$Bob and scores[[ "Bob" ]].
    Look at what is printed and what is the class of each result.
    Then compare scores$Bo with scores[[ "Bo" ]].
    Understand why NULL is returned.
scores$Bob        # another way to access Bob
[1] 6 7 4 3 5 2 2 1 4
scores[[ "Bob" ]] # get an element with exact name Bob
[1] 6 7 4 3 5 2 2 1 4
class( scores$Bob )
[1] "numeric"
class( scores[[ "Bob" ]] )
[1] "numeric"
scores$Bo         # partial matching of names: $ still finds Bob
[1] 6 7 4 3 5 2 2 1 4
scores[[ "Bo" ]]  # there is no "Bo" so NULL is returned
NULL
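    The difference is partial matching of names: $ accepts a unique prefix, while [[ ]] matches exactly unless asked otherwise. A minimal sketch follows; the student Bobby is invented here only to show what happens when the prefix is ambiguous.

```r
scores <- list(
  Amy = c( 1,6,7,9,10 ),
  Bob = c( 6,7,4,3,5,2,2,1,4 )
)
byDollar  <- scores$Bo                        # partial match on the unique prefix "Bo"
byExact   <- scores[[ "Bo" ]]                 # exact = TRUE is the default, so NULL
byPartial <- scores[[ "Bo", exact = FALSE ]]  # opt in to partial matching

# Partial matching needs a unique prefix; an ambiguous one gives NULL.
# (Bobby is a made-up extra student for illustration.)
scores$Bobby <- c( 5, 5 )
ambiguous <- scores$Bo
is.null( ambiguous )
```

    Because of this behaviour, [[ ]] with the full name is the safer choice in scripts.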

Extra exercises

  1. A list returned by a function; test for association/correlation.
    For this exercise we need two random numerical vectors.
    Let’s create x and y, each of 30 elements sampled from the normal distribution: x <- rnorm( 30 ) and y <- rnorm( 30 ).
    Print these vectors. You may also produce a scatter plot: plot( x, y ).

    The function cor.test tests for association between corresponding elements of two vectors.
    Use h <- cor.test( x, y ) and print h to see a report of the association test.
    Internally h is stored as a list. Print names of the elements stored in h.
    Now, read the help page for cor.test. In the section Value you will see the description of the elements of h.
    Directly get the values of the elements estimate and p.value.
x <- rnorm( 30 )
y <- rnorm( 30 )
x
 [1]  0.01661926 -0.15022461  0.71947519  1.12436382 -0.06036516  0.81496205  1.35515619 -1.01629995 -1.56629754 -0.84076732 -1.12440072  0.90223238 -0.42194145
[14]  1.34488510  0.62763962  1.45858874  0.06179762  0.92389286 -0.47103586  0.57106113 -0.67810457 -0.17890391  2.25348895 -0.15977933 -0.91689066 -2.56356662
[27]  0.61770931  1.21615706 -0.09609668  0.07336768
y
 [1]  0.44054195  0.97783686  0.03200628 -1.38261288 -0.48167467  0.81241093 -0.93770309 -1.02775399  0.18194063 -1.70221850 -0.81177342 -2.20029373 -2.48058753
[14] -0.16588257 -0.01081708  0.48804528  0.15474823  1.23747511  0.12949083 -0.50415714 -0.46058738  1.30361203 -0.68965551 -0.42002551 -0.57926586 -0.44125630
[27] -0.08788693  1.31997948  0.57857201 -1.45768488
plot( x, y )

h <- cor.test( x, y )
h

    Pearson's product-moment correlation

data:  x and y
t = 0.71083, df = 28, p-value = 0.4831
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.2385735  0.4708248
sample estimates:
      cor 
0.1331392 
names( h )
[1] "statistic"   "parameter"   "p.value"     "estimate"    "null.value"  "alternative" "method"      "data.name"   "conf.int"   
h[[ 'estimate' ]]
      cor 
0.1331392 
h[[ 'p.value' ]]
[1] 0.4830665
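    Since x and y are random, the numbers above change on every run. The sketch below fixes a seed (the value 1 is an arbitrary choice, not from the exercise) and inspects a few more elements of h.

```r
set.seed( 1 )            # arbitrary seed, only to make the sketch reproducible
x <- rnorm( 30 )
y <- rnorm( 30 )
h <- cor.test( x, y )

class( h )               # "htest": a list with a class attribute used for printing
is.list( h )             # the object is still a list underneath
h$estimate               # same element as h[[ 'estimate' ]]
h$conf.int               # confidence interval of the correlation
all.equal( unname( h$estimate ), cor( x, y ) )  # the estimate is the Pearson correlation
```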
  2. A nested list.
    Let’s extend the concept of scores to describe various topics (see the code below).
    Check class and str of scores.
    Calculate how many students are in the scores list.
    Get Dan’s scores in physics.
scores <- list(
  Amy = list(
    math = c( 1,6,7,9,10 ),
    biology = c( 7,6,8 )
  ),
  Bob = list(
    math = c( 6,7,4,3,5,2,2,1,4 ),
    physics = c( 8,7 )
  ),
  Dan = list(
    math = c( 9,9,10 ),
    physics = c( 10, 10, 10 ),
    biology = c( 3, 5, 7 )
  )
)
class( scores )
[1] "list"
str( scores )
List of 3
 $ Amy:List of 2
  ..$ math   : num [1:5] 1 6 7 9 10
  ..$ biology: num [1:3] 7 6 8
 $ Bob:List of 2
  ..$ math   : num [1:9] 6 7 4 3 5 2 2 1 4
  ..$ physics: num [1:2] 8 7
 $ Dan:List of 3
  ..$ math   : num [1:3] 9 9 10
  ..$ physics: num [1:3] 10 10 10
  ..$ biology: num [1:3] 3 5 7
length( scores )      # number of students
[1] 3
length( scores$Bob )  # number of topics for which Bob has scores
[1] 2
scores[[ "Dan" ]][[ "physics" ]]
[1] 10 10 10
scores$Dan$physics
[1] 10 10 10
scores$Dan[[ "physics" ]]
[1] 10 10 10
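    The same nested structure can be queried for all students at once. This is a sketch of one possible approach; the variable names topicsPerStudent and hasPhysics are chosen here for illustration.

```r
scores <- list(
  Amy = list( math = c( 1,6,7,9,10 ), biology = c( 7,6,8 ) ),
  Bob = list( math = c( 6,7,4,3,5,2,2,1,4 ), physics = c( 8,7 ) ),
  Dan = list( math = c( 9,9,10 ), physics = c( 10,10,10 ), biology = c( 3,5,7 ) )
)
topicsPerStudent <- lengths( scores )    # number of topics for each student
topicsPerStudent
# For each student, check whether a physics element exists.
hasPhysics <- sapply( scores, function( s ) "physics" %in% names( s ) )
names( scores )[ hasPhysics ]            # students with physics grades
```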

Multitopic exercises

  1. (ADV) Mean grades for each student. (Call a function for each element. Collect the results of the calls into a list.)
    Consider the scores list from the first exercise (also copied below).
    Calculate the mean grade for each student.

    Use lapply to apply the mean function to each element of scores.
    Also, replace lapply with sapply and compare the results.
    Try to explain what lapply/sapply do.
    Note: the names of the list elements in scores are preserved in the result.

scores <- list(
  Amy = c( 1,6,7,9,10 ),
  Bob = c( 6,7,4,3,5,2,2,1,4 ),
  Dan = c( 9,9,10 )
)
lapply( scores, mean ) # the result is a list
$Amy
[1] 6.6

$Bob
[1] 3.777778

$Dan
[1] 9.333333
sapply( scores, mean ) # the result is converted to a vector
     Amy      Bob      Dan 
6.600000 3.777778 9.333333 
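    One way to see what lapply does is to write the same computation as an explicit loop. The sketch below does that and also shows vapply, a stricter relative of sapply in which the expected type of each result is stated; vapply is not part of the exercise.

```r
scores <- list(
  Amy = c( 1,6,7,9,10 ),
  Bob = c( 6,7,4,3,5,2,2,1,4 ),
  Dan = c( 9,9,10 )
)
# lapply( scores, mean ) written out as an explicit loop:
means <- list()
for( nm in names( scores ) ) {
  means[[ nm ]] <- mean( scores[[ nm ]] )   # one call per element, stored under the same name
}
identical( means, lapply( scores, mean ) )

# vapply works like sapply, but each result must be a single number here:
meansVec <- vapply( scores, mean, FUN.VALUE = numeric( 1 ) )
meansVec
```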
  2. (ADV) Simulate grades. (Define your own function and call it for each element.)
    Consider the scores list from the previous exercise.
    Let’s assume that the grades are not known yet and we need to simulate them.

    A vector nms with several (e.g. 12, see below) unique names of students is provided.
    Each student should have a random number of grades (between 5 and 14).
    The grades should be sampled from the range 1:10.
    Grades 1-4,9,10 are usually rare compared to 6-8, so the probabilities of grades should not be uniform (e.g. the ratios should be 1:1:1:1:2:10:20:20:2:1 for grades 1…10).
    For each student, the grades should be sorted in ascending order.
    The final list should have the same structure as scores (i.e. the names of the list elements should be the names of the students and the values should be the corresponding grades).

    Hints:
    • Use sample to generate a random number - how many grades a student should have.
    • Use sample with the prob and replace arguments - grades with non-uniform probabilities.
    • Put the above into a function genGrades that generates grades for a single student.
    • Use lapply to apply the function to each element of nms. Note that the function does not use the nm argument (but it still needs to be present).
    • Use setNames to assign names to the list elements (or, better, name the elements of nms before lapply).
nms <- c( "Amy", "Bob", "Carl", "Dany", "Ewa", "Frank", "Greg", "Holy", "Ian", "Jan", "Kees", "Leon" )
genGrades <- function(nm) { # nm is a single name, not used in the function
  gradesNum <- sample( 5:14, 1 )
  grades <- sort( sample( 1:10, size = gradesNum, prob = c(1,1,1,1,2,10,20,20,2,1), replace = TRUE ) )
  return( grades )
}
lapply( setNames( nm = nms ), genGrades ) # calls genGrades for each element of nms
$Amy
 [1] 2 4 5 6 6 7 7 7 8 8 8

$Bob
 [1] 5 6 7 7 7 7 7 7 7 7 7 8 8 8

$Carl
 [1] 3 4 5 6 6 7 7 7 8 8 8

$Dany
 [1] 6 6 6 6 6 7 7 7 7 7 8 8 8 8

$Ewa
 [1] 3 6 6 6 7 7 7 8 8 8 8 8 8

$Frank
[1] 6 7 7 7 7 7 8 9

$Greg
 [1]  2  6  6  7  7  8  8  8  8  8  9 10

$Holy
 [1]  6  6  7  7  7  8  8  8  8 10

$Ian
[1] 6 7 7 7 8 8

$Jan
 [1] 2 2 7 7 7 7 7 8 8 8 8 8 8

$Kees
[1] 4 6 7 7 8 8 8

$Leon
 [1] 1 6 6 6 6 7 7 7 7 8 8 8 8
# If the elements of nms have names, the result has the same names.
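    Because the grades are random, it helps to check the structure of the result rather than its values. The sketch below repeats genGrades with a shorter nms and an arbitrary seed (both are assumptions made for brevity) and verifies the requirements from the exercise.

```r
set.seed( 42 )   # arbitrary seed, only to make the sketch reproducible
nms <- c( "Amy", "Bob", "Carl" )
genGrades <- function( nm ) {             # nm is not used, but lapply passes it
  gradesNum <- sample( 5:14, 1 )
  sort( sample( 1:10, size = gradesNum, prob = c(1,1,1,1,2,10,20,20,2,1), replace = TRUE ) )
}
sim <- lapply( setNames( nm = nms ), genGrades )

names( sim )                              # student names are preserved
all( lengths( sim ) %in% 5:14 )           # between 5 and 14 grades each
all( unlist( sim ) %in% 1:10 )            # grades come from 1:10
all( !sapply( sim, is.unsorted ) )        # each vector is in ascending order

# Alternative naming, applied after lapply instead of before:
sim2 <- setNames( lapply( nms, genGrades ), nms )
names( sim2 )
```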
  3. (ADV) Plot scores given in a list. (Convert list to long tibble. Plot it.)
    Plotting functions usually require a table with data in a long format.
    Convert the scores list from the first exercise to a long table, with two columns name and score (each grade should be a separate row).
    Use ggplot to plot the grades from the long table.

    Hints:
    • Write a function which converts a single element of scores to a tibble with two columns name and score.
    • Use lapply to apply the function to each element of scores (you will get a list of tibbles).
    • Use bind_rows to combine the results into a single table (you will get a single, merged tibble).
    • Use ggplot to plot the table. The example below uses geom_dotplot to plot the grades. You may use geom_point instead.
scores <- list(
  Amy = c( 1,6,7,9,10 ),
  Bob = c( 6,7,4,3,5,2,2,1,4 ),
  Dan = c( 9,9,10 )
)
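library( tidyverse )  # provides %>%, tibble, bind_rows and ggplot
d <- names(scores) %>%
  lapply( function( nm ) tibble( name=nm, score=scores[[nm]] ) ) %>%  # one tibble per student
  bind_rows()                                                         # merge into one long table
p <- ggplot( d ) +
  aes( x=name, y=score ) +
  geom_dotplot( binaxis="y", stackdir="center", binwidth=0.5 ) +
  theme_bw() +
  scale_y_continuous( limits=c(1,10), breaks=1:10 )
p  # print the plot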
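    The long table can also be built without lapply. The sketch below is a base-R alternative (a plain data.frame rather than a tibble, so not the approach from the hints): each name is repeated once per grade and the grades are flattened into one vector.

```r
scores <- list(
  Amy = c( 1,6,7,9,10 ),
  Bob = c( 6,7,4,3,5,2,2,1,4 ),
  Dan = c( 9,9,10 )
)
d <- data.frame(
  name  = rep( names( scores ), times = lengths( scores ) ),  # one name per grade
  score = unlist( scores, use.names = FALSE )                 # all grades in list order
)
dim( d )
head( d )
```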

  4. (ADV) Split a table into a list of tables by a column; merge back.
    Some functions might require an input to be provided as a list of tables.
    Let’s assume that the pulse table should be split into a list of table parts based on the exercise column.
    Load the pulse.csv data into the variable pulse.
    Try l <- pulse %>% split( .$exercise ) and investigate the class, length and names of the result l.
    Use double square brackets to extract the part where exercise is low.
    Finally, check that with bind_rows applied to l you can recreate the pulse table (but with a different order of rows).
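library( tidyverse )                # provides read_csv, %>% and bind_rows
pulse <- read_csv( "pulse.csv" )    # assumes pulse.csv is in the working directory
l <- pulse %>% split( .$exercise )  # . represents the object on the left side of %>%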
class( l )
[1] "list"
length( l )
[1] 3
names( l )
[1] "high"     "low"      "moderate"
l[[ "low" ]]
# A tibble: 37 × 13
   id     name      height weight   age gender smokes alcohol exercise ran   pulse1 pulse2  year
   <chr>  <chr>      <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <chr>  <dbl>  <dbl> <dbl>
 1 1993_E Lauri        173     64    18 female no     yes     low      sat       90     88  1993
 2 1993_F George       184     74    22 male   no     yes     low      ran       78    141  1993
 3 1993_L Frederick    178     58    19 male   no     no      low      sat       74     76  1993
 4 1993_P Mathew       185    110    22 male   no     yes     low      sat       77     73  1993
 5 1993_Q Leslie       170     56    19 male   no     no      low      sat       64     63  1993
 6 1993_U Jerome       175     60    19 male   no     no      low      sat       88     86  1993
 7 1993_V Arlene       140     50    34 female no     no      low      ran       70     98  1993
 8 1993_W Glenna       163     55    20 female no     no      low      sat       78     74  1993
 9 1995_B Olga         172     60    21 female no     no      low      sat       81     79  1995
10 1995_H Eliza        164     66    23 female no     no      low      ran       74    168  1995
# ℹ 27 more rows
recreatedPulse <- bind_rows( l )
dim( pulse )
[1] 110  13
dim( recreatedPulse )
[1] 110  13
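    The split and merge steps can be tried on a tiny table where every row is visible. The sketch below uses a made-up data.frame tab (pulse.csv is not reproduced here) and base-R rbind in place of bind_rows; it also shows how sorting by id restores the original row order.

```r
# Made-up miniature table; column names mimic the pulse data.
tab <- data.frame(
  id       = 1:6,
  exercise = c( "low", "high", "low", "moderate", "high", "low" ),
  pulse1   = c( 90, 78, 74, 77, 64, 88 )
)
parts <- split( tab, tab$exercise )       # list of tables, one per exercise value
names( parts )
merged <- do.call( rbind, parts )         # base-R counterpart of bind_rows( parts )
merged$id                                 # rows are now grouped by exercise
restored <- merged[ order( merged$id ), ] # sort by id to recover the original order
restored$id
```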
  5. (ADV) Split a table by a column and write each part to a different file.
    Continue with the setup of the previous exercise.
    Study/type/execute the following example.
    Find the newly created files in your filesystem.
l <- pulse %>% split( .$exercise )
exercises <- names( l )                   # name in l of each table chunk
for( exercise in exercises ) {            # exercise will be a name of a single chunk
  fileName <- paste0( "pulse_", exercise, ".csv" )  # name of the file for the chunk
  message( "Writing file '", fileName, "'..." )
  write_csv( l[[ exercise ]], file = fileName )
}
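    To check that nothing was lost, the files can be read back and merged again. The sketch below does a full round trip in base R with a made-up table and a temporary directory (outDir and tab are assumptions for illustration; the exercise itself writes next to the script with write_csv).

```r
# Made-up miniature table standing in for pulse.
tab <- data.frame(
  id       = 1:6,
  exercise = c( "low", "high", "low", "moderate", "high", "low" ),
  pulse1   = c( 90, 78, 74, 77, 64, 88 )
)
outDir <- file.path( tempdir(), "pulse_parts" )   # scratch directory for the sketch
dir.create( outDir, showWarnings = FALSE )

parts <- split( tab, tab$exercise )
for( exercise in names( parts ) ) {
  fileName <- file.path( outDir, paste0( "pulse_", exercise, ".csv" ) )
  write.csv( parts[[ exercise ]], file = fileName, row.names = FALSE )
}

# Find the written files and read them back into a single table.
files <- list.files( outDir, pattern = "^pulse_.*\\.csv$", full.names = TRUE )
basename( files )
readBack <- do.call( rbind, lapply( files, read.csv ) )
dim( readBack )
```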


Copyright © 2024 Biomedical Data Sciences (BDS) | LUMC