7 Data wrangling

In this chapter, we begin to delve into the manipulation of data in the form of a data frame or tibble. In so doing, we will introduce the tidyverse package and the various verbs (function) it provides.

The tidyverse package is not just a single package but a composite of a group of packages. These include among others the dplyr package. Most of the function we will be employing in this chapter comes from dplyr.

We begin by reading in the blood_donors.xls

df_blood <- readxl::read_xls("./Data/blood_donors_1.xls")
df_blood

Table 7.1:
id	hb	hct	sex	bldgrp	pdonor
1	10.5	31.8	Male	O	3
2	11.9	37.2	Male	AB	0
3	1	26	Male	A	1
4	8.9	26.8	Male	A	3
5	7.8	24.2	Male	A	2
6	10	30.9	Male	B	1
7	10.4	33.9	Male	B	0
8	11.3	35	Male	O	1
9	16.4		Male	AB	1
10	14.4	43.6	Male	AB	1
11	11.2	33.2	Female	O	2
12	11.5	35.5	Female	O	1
13	10.5	33.7	Female	O	2
14	12.2	36.8	Female	AB	1
14	16.4	48.8	Female	P	2
16	12.7	99	Female	AB	0
17	9.8	30.5	Female	A	4
18	10.9	33.8	Female	O	0
19	11.6	35.4	Female	O	3
20	10.6	34.9	Female	O	2
21	9.1	28		A	3
22	11.9	36.1		AB	3
23	10.5	34.2		O	2
24	12.3	38.2		AB	2
25	11	35.7		O	2

The output shows we have a 25-row and 6-column tibble.

7.1 Renaming variables

Below we rename the variables hb to hemog and id to studyid using he rename function, and then show the first 5 records with the head function.

df_blood %>% 
    rename(hemog = hb, studyid = id) %>% 
    head(5)

Table 7.2:
studyid	hemog	hct	sex	bldgrp	pdonor
1	10.5	31.8	Male	O	3
2	11.9	37.2	Male	AB	0
3	1	26	Male	A	1
4	8.9	26.8	Male	A	3
5	7.8	24.2	Male	A	2

7.2 Sorting data

Below we use the arrange function to sort the bldgrp in ascending order and hb by descending order.

df_blood %>% 
    arrange(bldgrp, desc(hb))

Table 6.1:
id	hb	hct	sex	bldgrp	pdonor
17	9.8	30.5	Female	A	4
21	9.1	28		A	3
4	8.9	26.8	Male	A	3
5	7.8	24.2	Male	A	2
3	1	26	Male	A	1
9	16.4		Male	AB	1
10	14.4	43.6	Male	AB	1
16	12.7	99	Female	AB	0
24	12.3	38.2		AB	2
14	12.2	36.8	Female	AB	1
2	11.9	37.2	Male	AB	0
22	11.9	36.1		AB	3
7	10.4	33.9	Male	B	0
6	10	30.9	Male	B	1
19	11.6	35.4	Female	O	3
12	11.5	35.5	Female	O	1
8	11.3	35	Male	O	1
11	11.2	33.2	Female	O	2
25	11	35.7		O	2
18	10.9	33.8	Female	O	0
20	10.6	34.9	Female	O	2
1	10.5	31.8	Male	O	3
13	10.5	33.7	Female	O	2
23	10.5	34.2		O	2
14	16.4	48.8	Female	P	2

7.3 Subsetting data

In this subsection, we demonstrate the use of the filter and select function to select specific records and variables in a tibble. Below we filter to select all records with hb > 12g/dl and keep only the id, hb and sex columns.

df_blood %>% 
    filter(hb > 12) %>% 
    select(id, hb, sex)

Table 7.3:
id	hb	sex
9	16.4	Male
10	14.4	Male
14	12.2	Female
14	16.4	Female
16	12.7	Female
24	12.3

7.4 Generating new variables

To generate new variables we use the mutate function. Based on our knowledge that the hematocrit is approximately three times the haemoglobin we generate a new variable, hb_from_hct.

df_blood %>% 
    mutate(hb_from_hct = hct/3)

Table 6.2:
id	hb	hct	sex	bldgrp	pdonor	hb_from_hct
1	10.5	31.8	Male	O	3	10.6
2	11.9	37.2	Male	AB	0	12.4
3	1	26	Male	A	1	8.67
4	8.9	26.8	Male	A	3	8.93
5	7.8	24.2	Male	A	2	8.07
6	10	30.9	Male	B	1	10.3
7	10.4	33.9	Male	B	0	11.3
8	11.3	35	Male	O	1	11.7
9	16.4		Male	AB	1
10	14.4	43.6	Male	AB	1	14.5
11	11.2	33.2	Female	O	2	11.1
12	11.5	35.5	Female	O	1	11.8
13	10.5	33.7	Female	O	2	11.2
14	12.2	36.8	Female	AB	1	12.3
14	16.4	48.8	Female	P	2	16.3
16	12.7	99	Female	AB	0	33
17	9.8	30.5	Female	A	4	10.2
18	10.9	33.8	Female	O	0	11.3
19	11.6	35.4	Female	O	3	11.8
20	10.6	34.9	Female	O	2	11.6
21	9.1	28		A	3	9.33
22	11.9	36.1		AB	3	12
23	10.5	34.2		O	2	11.4
24	12.3	38.2		AB	2	12.7
25	11	35.7		O	2	11.9

7.5 Aggregating data

Data can be aggregated in R using the summarize function. Below we determine the mean and standard deviation of the haemoglobin for the patient in the data.

df_blood %>% 
    summarize(mean_hb = mean(hb), sd_hb = sd(hb))

Table 6.3:
mean_hb	sd_hb
11	2.89

Grouping the data by the “bldgrp” before the aggregation yields the aggregated means and standard deviations for the various blood groups.

df_blood %>% 
    group_by(bldgrp) %>% 
    summarize(mean_hb = mean(hb), sd_hb = sd(hb))

Table 7.4:
bldgrp	mean_hb	sd_hb
A	7.32	3.61
AB	13.1	1.69
B	10.2	0.283
O	11	0.427
P	16.4

7.6 Reshaping data

In longitudinal studies, data is captured from the same individual repeatedly. Such data is recorded either in long or wide formats. A typical example of a data frame in the long form is bpB below.

bp_long <- read_csv(
    file = "./Data/bp_long.txt",
    col_names = TRUE, 
    col_types = c("c", "c", "i")
    )

bp_long

Table 6.4:
id	measure	sbp
B01	sbp1	141
B01	sbp2	137
B02	sbp1	155
B02	sbp2	153
B03	sbp1	153

In this format, each visit or round of data taking is captured as a new row, but with the appropriate study ID and period of record, captured as the variable measure above. Measurement of systolic blood pressure on day 1 is indicated by sbp1 in the measure variable. Day 2 measurements are indicated as sbp2.

The wide format of the same data can be obtained as below.

bp_wide <- 
    bp_long %>% 
    pivot_wider(
        id_cols = id, 
        names_from = measure, 
        values_from = sbp
    )

bp_wide

Table 6.5:
id	sbp1	sbp2
B01	141	137
B02	155	153
B03	153

Here, each study participant’s record for the whole study is on one row of the data and the different measurements of systolic blood pressure are captured as different variables. Next, we convert the wide back to the long format.

bp_wide %>% 
    pivot_longer(
        cols = c(sbp1, sbp2),
        names_to = "time",
        values_to = "syst_bp"
    )

Table 6.6:
id	time	syst_bp
B01	sbp1	141
B01	sbp2	137
B02	sbp1	155
B02	sbp2	153
B03	sbp1	153
B03	sbp2

7.7 Combining data

In a study to determine the change in weight of athletes running a marathon, data about the athletes were obtained by the investigators. Since the marathon starts in town A and ends in town B, the investigators decided to weigh the athletes just before starting the race. Here they took records of the ID of the athlete’s sid, sex, age and weight at the start (wgtst). The records of five of these athletes are in the data marathonA. At the end point of the marathon, another member of the investigation team recorded their IDs (eid), weight upon completion (wgtend) and the time it took the athletes to complete the marathon (dura).

dataA <- 
    read_delim(
        file = "./Data/marathonA.txt",
        col_names = TRUE,
        delim = "\t",
        col_types = c("c","c","i","d")
    )

dataB <- 
    read_delim(
        file = "./Data/marathonB.txt",
        col_names = TRUE,
        delim = "\t",
        col_types = c("c","c","i","d")
    )

dataA

Table 6.7:
sid	sex	age	wgtst
C001	M	23	57.1
C002	F	27	62.3
C003	M	19	54.5
C004	M	21	59.4
C005	F	32	53.4


dataB

Table 6.7:
eid	wgtend	dura
C003	53.9	189
C005	53	197
C002	62.2	201
C001	56.8	209

We can determine the weight change only by matching the before and after weight of each individual. This is where merging is very useful. Below, we merge the two data into one. This is done below.

dataA %>% 
    full_join(dataB, by = join_by(sid==eid))

Table 6.8:
sid	sex	age	wgtst	wgtend	dura
C001	M	23	57.1	56.8	209
C002	F	27	62.3	62.2	201
C003	M	19	54.5	53.9	189
C004	M	21	59.4
C005	F	32	53.4	53	197

6 Importing data into R

8 Data Cleaning