7 Data wrangling

In this chapter, we begin to delve into the manipulation of data in the form of a data frame or tibble. In so doing, we will introduce the tidyverse package and the various verbs (function) it provides.

The tidyverse package is not just a single package but a composite of a group of packages. These include among others the dplyr package. Most of the function we will be employing in this chapter comes from dplyr.

We begin by reading in the blood_donors.xls

df_blood <- readxl::read_xls("./Data/blood_donors_1.xls")
df_blood
Table 7.1:
id hb hct sex bldgrp pdonor
1 10.5 31.8 Male O 3
2 11.9 37.2 Male AB 0
3 1   26   Male A 1
4 8.9 26.8 Male A 3
5 7.8 24.2 Male A 2
6 10   30.9 Male B 1
7 10.4 33.9 Male B 0
8 11.3 35   Male O 1
9 16.4    Male AB 1
10 14.4 43.6 Male AB 1
11 11.2 33.2 Female O 2
12 11.5 35.5 Female O 1
13 10.5 33.7 Female O 2
14 12.2 36.8 Female AB 1
14 16.4 48.8 Female P 2
16 12.7 99   Female AB 0
17 9.8 30.5 Female A 4
18 10.9 33.8 Female O 0
19 11.6 35.4 Female O 3
20 10.6 34.9 Female O 2
21 9.1 28   A 3
22 11.9 36.1 AB 3
23 10.5 34.2 O 2
24 12.3 38.2 AB 2
25 11   35.7 O 2

The output shows we have a 25-row and 6-column tibble.

7.1 Renaming variables

Below we rename the variables hb to hemog and id to studyid using he rename function, and then show the first 5 records with the head function.

df_blood %>% 
    rename(hemog = hb, studyid = id) %>% 
    head(5)
Table 7.2:
studyid hemog hct sex bldgrp pdonor
1 10.5 31.8 Male O 3
2 11.9 37.2 Male AB 0
3 1   26   Male A 1
4 8.9 26.8 Male A 3
5 7.8 24.2 Male A 2

7.2 Sorting data

Below we use the arrange function to sort the bldgrp in ascending order and hb by descending order.

df_blood %>% 
    arrange(bldgrp, desc(hb))
Table 6.1:
id hb hct sex bldgrp pdonor
17 9.8 30.5 Female A 4
21 9.1 28   A 3
4 8.9 26.8 Male A 3
5 7.8 24.2 Male A 2
3 1   26   Male A 1
9 16.4    Male AB 1
10 14.4 43.6 Male AB 1
16 12.7 99   Female AB 0
24 12.3 38.2 AB 2
14 12.2 36.8 Female AB 1
2 11.9 37.2 Male AB 0
22 11.9 36.1 AB 3
7 10.4 33.9 Male B 0
6 10   30.9 Male B 1
19 11.6 35.4 Female O 3
12 11.5 35.5 Female O 1
8 11.3 35   Male O 1
11 11.2 33.2 Female O 2
25 11   35.7 O 2
18 10.9 33.8 Female O 0
20 10.6 34.9 Female O 2
1 10.5 31.8 Male O 3
13 10.5 33.7 Female O 2
23 10.5 34.2 O 2
14 16.4 48.8 Female P 2

7.3 Subsetting data

In this subsection, we demonstrate the use of the filter and select function to select specific records and variables in a tibble. Below we filter to select all records with hb > 12g/dl and keep only the id, hb and sex columns.

df_blood %>% 
    filter(hb > 12) %>% 
    select(id, hb, sex)
Table 7.3:
id hb sex
9 16.4 Male
10 14.4 Male
14 12.2 Female
14 16.4 Female
16 12.7 Female
24 12.3

7.4 Generating new variables

To generate new variables we use the mutate function. Based on our knowledge that the hematocrit is approximately three times the haemoglobin we generate a new variable, hb_from_hct.

df_blood %>% 
    mutate(hb_from_hct = hct/3)
Table 6.2:
id hb hct sex bldgrp pdonor hb_from_hct
1 10.5 31.8 Male O 3 10.6 
2 11.9 37.2 Male AB 0 12.4 
3 1   26   Male A 1 8.67
4 8.9 26.8 Male A 3 8.93
5 7.8 24.2 Male A 2 8.07
6 10   30.9 Male B 1 10.3 
7 10.4 33.9 Male B 0 11.3 
8 11.3 35   Male O 1 11.7 
9 16.4    Male AB 1    
10 14.4 43.6 Male AB 1 14.5 
11 11.2 33.2 Female O 2 11.1 
12 11.5 35.5 Female O 1 11.8 
13 10.5 33.7 Female O 2 11.2 
14 12.2 36.8 Female AB 1 12.3 
14 16.4 48.8 Female P 2 16.3 
16 12.7 99   Female AB 0 33   
17 9.8 30.5 Female A 4 10.2 
18 10.9 33.8 Female O 0 11.3 
19 11.6 35.4 Female O 3 11.8 
20 10.6 34.9 Female O 2 11.6 
21 9.1 28   A 3 9.33
22 11.9 36.1 AB 3 12   
23 10.5 34.2 O 2 11.4 
24 12.3 38.2 AB 2 12.7 
25 11   35.7 O 2 11.9 

7.5 Aggregating data

Data can be aggregated in R using the summarize function. Below we determine the mean and standard deviation of the haemoglobin for the patient in the data.

df_blood %>% 
    summarize(mean_hb = mean(hb), sd_hb = sd(hb))
Table 6.3:
mean_hb sd_hb
11 2.89

Grouping the data by the “bldgrp” before the aggregation yields the aggregated means and standard deviations for the various blood groups.

df_blood %>% 
    group_by(bldgrp) %>% 
    summarize(mean_hb = mean(hb), sd_hb = sd(hb))
Table 7.4:
bldgrp mean_hb sd_hb
A 7.32 3.61 
AB 13.1  1.69 
B 10.2  0.283
O 11    0.427
P 16.4      

7.6 Reshaping data

In longitudinal studies, data is captured from the same individual repeatedly. Such data is recorded either in long or wide formats. A typical example of a data frame in the long form is bpB below.

bp_long <- read_csv(
    file = "./Data/bp_long.txt",
    col_names = TRUE, 
    col_types = c("c", "c", "i")
    )

bp_long
Table 6.4:
id measure sbp
B01 sbp1 141
B01 sbp2 137
B02 sbp1 155
B02 sbp2 153
B03 sbp1 153

In this format, each visit or round of data taking is captured as a new row, but with the appropriate study ID and period of record, captured as the variable measure above. Measurement of systolic blood pressure on day 1 is indicated by sbp1 in the measure variable. Day 2 measurements are indicated as sbp2.

The wide format of the same data can be obtained as below.

bp_wide <- 
    bp_long %>% 
    pivot_wider(
        id_cols = id, 
        names_from = measure, 
        values_from = sbp
    )

bp_wide
Table 6.5:
id sbp1 sbp2
B01 141 137
B02 155 153
B03 153

Here, each study participant’s record for the whole study is on one row of the data and the different measurements of systolic blood pressure are captured as different variables. Next, we convert the wide back to the long format.

bp_wide %>% 
    pivot_longer(
        cols = c(sbp1, sbp2),
        names_to = "time",
        values_to = "syst_bp"
    )
Table 6.6:
id time syst_bp
B01 sbp1 141
B01 sbp2 137
B02 sbp1 155
B02 sbp2 153
B03 sbp1 153
B03 sbp2

7.7 Combining data

In a study to determine the change in weight of athletes running a marathon, data about the athletes were obtained by the investigators. Since the marathon starts in town A and ends in town B, the investigators decided to weigh the athletes just before starting the race. Here they took records of the ID of the athlete’s sid, sex, age and weight at the start (wgtst). The records of five of these athletes are in the data marathonA. At the end point of the marathon, another member of the investigation team recorded their IDs (eid), weight upon completion (wgtend) and the time it took the athletes to complete the marathon (dura).

dataA <- 
    read_delim(
        file = "./Data/marathonA.txt",
        col_names = TRUE,
        delim = "\t",
        col_types = c("c","c","i","d")
    )

dataB <- 
    read_delim(
        file = "./Data/marathonB.txt",
        col_names = TRUE,
        delim = "\t",
        col_types = c("c","c","i","d")
    )

dataA
Table 6.7:
sid sex age wgtst
C001 M 23 57.1
C002 F 27 62.3
C003 M 19 54.5
C004 M 21 59.4
C005 F 32 53.4

dataB
Table 6.7:
eid wgtend dura
C003 53.9 189
C005 53   197
C002 62.2 201
C001 56.8 209

We can determine the weight change only by matching the before and after weight of each individual. This is where merging is very useful. Below, we merge the two data into one. This is done below.

dataA %>% 
    full_join(dataB, by = join_by(sid==eid))
Table 6.8:
sid sex age wgtst wgtend dura
C001 M 23 57.1 56.8 209
C002 F 27 62.3 62.2 201
C003 M 19 54.5 53.9 189
C004 M 21 59.4   
C005 F 32 53.4 53   197