7 Data wrangling
In this chapter, we begin to delve into the manipulation of data in the form of a
data frame or tibble. In so doing, we will introduce the tidyverse
package
and the various verbs (function) it provides.
The tidyverse
package is not just a single package but a composite of a group
of packages. These include among others the dplyr
package. Most of the function
we will be employing in this chapter comes from dplyr
.
We begin by reading in the blood_donors.xls
df_blood <- readxl::read_xls("./Data/blood_donors_1.xls")
df_blood
id | hb | hct | sex | bldgrp | pdonor |
---|---|---|---|---|---|
1 | 10.5 | 31.8 | Male | O | 3 |
2 | 11.9 | 37.2 | Male | AB | 0 |
3 | 1 | 26 | Male | A | 1 |
4 | 8.9 | 26.8 | Male | A | 3 |
5 | 7.8 | 24.2 | Male | A | 2 |
6 | 10 | 30.9 | Male | B | 1 |
7 | 10.4 | 33.9 | Male | B | 0 |
8 | 11.3 | 35 | Male | O | 1 |
9 | 16.4 | Male | AB | 1 | |
10 | 14.4 | 43.6 | Male | AB | 1 |
11 | 11.2 | 33.2 | Female | O | 2 |
12 | 11.5 | 35.5 | Female | O | 1 |
13 | 10.5 | 33.7 | Female | O | 2 |
14 | 12.2 | 36.8 | Female | AB | 1 |
14 | 16.4 | 48.8 | Female | P | 2 |
16 | 12.7 | 99 | Female | AB | 0 |
17 | 9.8 | 30.5 | Female | A | 4 |
18 | 10.9 | 33.8 | Female | O | 0 |
19 | 11.6 | 35.4 | Female | O | 3 |
20 | 10.6 | 34.9 | Female | O | 2 |
21 | 9.1 | 28 | A | 3 | |
22 | 11.9 | 36.1 | AB | 3 | |
23 | 10.5 | 34.2 | O | 2 | |
24 | 12.3 | 38.2 | AB | 2 | |
25 | 11 | 35.7 | O | 2 |
The output shows we have a 25-row and 6-column tibble.
7.1 Renaming variables
Below we rename the variables hb
to hemog
and id
to studyid
using he
rename
function, and then show the first 5 records with the head
function.
df_blood %>%
rename(hemog = hb, studyid = id) %>%
head(5)
studyid | hemog | hct | sex | bldgrp | pdonor |
---|---|---|---|---|---|
1 | 10.5 | 31.8 | Male | O | 3 |
2 | 11.9 | 37.2 | Male | AB | 0 |
3 | 1 | 26 | Male | A | 1 |
4 | 8.9 | 26.8 | Male | A | 3 |
5 | 7.8 | 24.2 | Male | A | 2 |
7.2 Sorting data
Below we use the arrange
function to sort the bldgrp
in ascending order and
hb
by descending order.
df_blood %>%
arrange(bldgrp, desc(hb))
id | hb | hct | sex | bldgrp | pdonor |
---|---|---|---|---|---|
17 | 9.8 | 30.5 | Female | A | 4 |
21 | 9.1 | 28 | A | 3 | |
4 | 8.9 | 26.8 | Male | A | 3 |
5 | 7.8 | 24.2 | Male | A | 2 |
3 | 1 | 26 | Male | A | 1 |
9 | 16.4 | Male | AB | 1 | |
10 | 14.4 | 43.6 | Male | AB | 1 |
16 | 12.7 | 99 | Female | AB | 0 |
24 | 12.3 | 38.2 | AB | 2 | |
14 | 12.2 | 36.8 | Female | AB | 1 |
2 | 11.9 | 37.2 | Male | AB | 0 |
22 | 11.9 | 36.1 | AB | 3 | |
7 | 10.4 | 33.9 | Male | B | 0 |
6 | 10 | 30.9 | Male | B | 1 |
19 | 11.6 | 35.4 | Female | O | 3 |
12 | 11.5 | 35.5 | Female | O | 1 |
8 | 11.3 | 35 | Male | O | 1 |
11 | 11.2 | 33.2 | Female | O | 2 |
25 | 11 | 35.7 | O | 2 | |
18 | 10.9 | 33.8 | Female | O | 0 |
20 | 10.6 | 34.9 | Female | O | 2 |
1 | 10.5 | 31.8 | Male | O | 3 |
13 | 10.5 | 33.7 | Female | O | 2 |
23 | 10.5 | 34.2 | O | 2 | |
14 | 16.4 | 48.8 | Female | P | 2 |
7.3 Subsetting data
In this subsection, we demonstrate the use of the filter
and select
function
to select specific records and variables in a tibble. Below we filter to select
all records with hb
> 12g/dl and keep only the id
, hb
and sex
columns.
df_blood %>%
filter(hb > 12) %>%
select(id, hb, sex)
id | hb | sex |
---|---|---|
9 | 16.4 | Male |
10 | 14.4 | Male |
14 | 12.2 | Female |
14 | 16.4 | Female |
16 | 12.7 | Female |
24 | 12.3 |
7.4 Generating new variables
To generate new variables we use the mutate
function. Based on our knowledge
that the hematocrit is approximately three times the haemoglobin we generate
a new variable, hb_from_hct
.
df_blood %>%
mutate(hb_from_hct = hct/3)
id | hb | hct | sex | bldgrp | pdonor | hb_from_hct |
---|---|---|---|---|---|---|
1 | 10.5 | 31.8 | Male | O | 3 | 10.6 |
2 | 11.9 | 37.2 | Male | AB | 0 | 12.4 |
3 | 1 | 26 | Male | A | 1 | 8.67 |
4 | 8.9 | 26.8 | Male | A | 3 | 8.93 |
5 | 7.8 | 24.2 | Male | A | 2 | 8.07 |
6 | 10 | 30.9 | Male | B | 1 | 10.3 |
7 | 10.4 | 33.9 | Male | B | 0 | 11.3 |
8 | 11.3 | 35 | Male | O | 1 | 11.7 |
9 | 16.4 | Male | AB | 1 | ||
10 | 14.4 | 43.6 | Male | AB | 1 | 14.5 |
11 | 11.2 | 33.2 | Female | O | 2 | 11.1 |
12 | 11.5 | 35.5 | Female | O | 1 | 11.8 |
13 | 10.5 | 33.7 | Female | O | 2 | 11.2 |
14 | 12.2 | 36.8 | Female | AB | 1 | 12.3 |
14 | 16.4 | 48.8 | Female | P | 2 | 16.3 |
16 | 12.7 | 99 | Female | AB | 0 | 33 |
17 | 9.8 | 30.5 | Female | A | 4 | 10.2 |
18 | 10.9 | 33.8 | Female | O | 0 | 11.3 |
19 | 11.6 | 35.4 | Female | O | 3 | 11.8 |
20 | 10.6 | 34.9 | Female | O | 2 | 11.6 |
21 | 9.1 | 28 | A | 3 | 9.33 | |
22 | 11.9 | 36.1 | AB | 3 | 12 | |
23 | 10.5 | 34.2 | O | 2 | 11.4 | |
24 | 12.3 | 38.2 | AB | 2 | 12.7 | |
25 | 11 | 35.7 | O | 2 | 11.9 |
7.5 Aggregating data
Data can be aggregated in R using the summarize
function. Below we determine
the mean and standard deviation of the haemoglobin for the patient in the data.
mean_hb | sd_hb |
---|---|
11 | 2.89 |
Grouping the data by the “bldgrp” before the aggregation yields the aggregated means and standard deviations for the various blood groups.
bldgrp | mean_hb | sd_hb |
---|---|---|
A | 7.32 | 3.61 |
AB | 13.1 | 1.69 |
B | 10.2 | 0.283 |
O | 11 | 0.427 |
P | 16.4 |
7.6 Reshaping data
In longitudinal studies, data is captured from the same individual repeatedly. Such data is recorded either in long or wide formats. A typical example of a data frame in the long form is bpB below.
bp_long <- read_csv(
file = "./Data/bp_long.txt",
col_names = TRUE,
col_types = c("c", "c", "i")
)
bp_long
id | measure | sbp |
---|---|---|
B01 | sbp1 | 141 |
B01 | sbp2 | 137 |
B02 | sbp1 | 155 |
B02 | sbp2 | 153 |
B03 | sbp1 | 153 |
In this format, each visit or round of data taking is captured as a new row, but with the appropriate study ID and period of record, captured as the variable measure above. Measurement of systolic blood pressure on day 1 is indicated by sbp1 in the measure variable. Day 2 measurements are indicated as sbp2.
The wide format of the same data can be obtained as below.
bp_wide <-
bp_long %>%
pivot_wider(
id_cols = id,
names_from = measure,
values_from = sbp
)
bp_wide
id | sbp1 | sbp2 |
---|---|---|
B01 | 141 | 137 |
B02 | 155 | 153 |
B03 | 153 |
Here, each study participant’s record for the whole study is on one row of the data and the different measurements of systolic blood pressure are captured as different variables. Next, we convert the wide back to the long format.
bp_wide %>%
pivot_longer(
cols = c(sbp1, sbp2),
names_to = "time",
values_to = "syst_bp"
)
id | time | syst_bp |
---|---|---|
B01 | sbp1 | 141 |
B01 | sbp2 | 137 |
B02 | sbp1 | 155 |
B02 | sbp2 | 153 |
B03 | sbp1 | 153 |
B03 | sbp2 |
7.7 Combining data
In a study to determine the change in weight of athletes running a marathon, data about the athletes were obtained by the investigators. Since the marathon starts in town A and ends in town B, the investigators decided to weigh the athletes just before starting the race. Here they took records of the ID of the athlete’s sid, sex, age and weight at the start (wgtst). The records of five of these athletes are in the data marathonA. At the end point of the marathon, another member of the investigation team recorded their IDs (eid), weight upon completion (wgtend) and the time it took the athletes to complete the marathon (dura).
dataA <-
read_delim(
file = "./Data/marathonA.txt",
col_names = TRUE,
delim = "\t",
col_types = c("c","c","i","d")
)
dataB <-
read_delim(
file = "./Data/marathonB.txt",
col_names = TRUE,
delim = "\t",
col_types = c("c","c","i","d")
)
dataA
sid | sex | age | wgtst |
---|---|---|---|
C001 | M | 23 | 57.1 |
C002 | F | 27 | 62.3 |
C003 | M | 19 | 54.5 |
C004 | M | 21 | 59.4 |
C005 | F | 32 | 53.4 |
dataB
eid | wgtend | dura |
---|---|---|
C003 | 53.9 | 189 |
C005 | 53 | 197 |
C002 | 62.2 | 201 |
C001 | 56.8 | 209 |
We can determine the weight change only by matching the before and after weight of each individual. This is where merging is very useful. Below, we merge the two data into one. This is done below.
dataA %>%
full_join(dataB, by = join_by(sid==eid))
sid | sex | age | wgtst | wgtend | dura |
---|---|---|---|---|---|
C001 | M | 23 | 57.1 | 56.8 | 209 |
C002 | F | 27 | 62.3 | 62.2 | 201 |
C003 | M | 19 | 54.5 | 53.9 | 189 |
C004 | M | 21 | 59.4 | ||
C005 | F | 32 | 53.4 | 53 | 197 |