6 Importing data into R
In this chapter, we discuss ways of getting data into R, either by entering it directly or by importing it from other software. In R, the data frame is the preferred data structure for data analysis. With the advent of tidy data, the tibble has become the predominant data structure. In this chapter, we read in various file formats and present each as a tibble.
6.1 Using data in R packages
Many R packages come with data that can be used for practice. To use a dataset from a specific package, that package must first be installed. For instance, to use the Oswego data from the epiDisplay package, we first install the package by running the command below:
install.packages("epiDisplay")
Note that this will install the package and any other packages that the epiDisplay package depends on.
The next step is to make the data available in the R session:
data("Oswego", package = "epiDisplay")
Now that the data is available in the working environment, we can view the first 6 rows:
Oswego %>% head()
age | sex | timesupper | ill | onsetdate | onsettime | bakedham | spinach | mashedpota | cabbagesal | jello | rolls | brownbread | milk | coffee | water | cakes | vanilla | chocolate | fruitsalad |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
11 | M | NA | FALSE | NA | NA | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE |
52 | F | 2e+03 | TRUE | 04/19 | 30 | TRUE | TRUE | TRUE | FALSE | FALSE | TRUE | FALSE | FALSE | TRUE | FALSE | FALSE | TRUE | FALSE | FALSE |
65 | M | 1.83e+03 | TRUE | 04/19 | 30 | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | TRUE | TRUE | FALSE |
59 | F | 1.83e+03 | TRUE | 04/19 | 30 | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | TRUE | TRUE | TRUE | FALSE |
13 | F | NA | FALSE | NA | NA | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE |
63 | F | 1.93e+03 | TRUE | 04/18 | 2.23e+03 | TRUE | TRUE | FALSE | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | TRUE | FALSE | FALSE |
Finally, we can view the documentation of the data by using the help() function:
help(Oswego)
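To find out which datasets a package provides before loading one, the data() function can be called with only the package argument. The sketch below uses the datasets package, which ships with every R installation, so that it runs without installing anything; the object names available and dataset_names are ours and purely illustrative. The same call should work with package = "epiDisplay" once that package is installed.

```r
# List the datasets that ship with a package.
# "datasets" is used here only because it is always installed with R.
available <- data(package = "datasets")

# The result stores one row per dataset; the "Item" column holds the names
dataset_names <- available$results[, "Item"]

# Show the first few dataset names
head(dataset_names)
```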
6.2 Direct entry into R
Data can also be entered manually into R. We first use the data.frame() function from the base package to create a data frame.
data.frame(
name = c("Ama", "Yakubu", "John"),
sex = c("Female", "Male", "Male"),
age = c(12, 9, 4),
school = c("JHS", "Primary", "Creche")
)
name | sex | age | school |
---|---|---|---|
Ama | Female | 12 | JHS |
Yakubu | Male | 9 | Primary |
John | Male | 4 | Creche |
Next, we enter the same data manually as a tibble by using the tibble() function.
tibble(
name = c("Ama", "Yakubu", "John"),
sex = c("Female", "Male", "Male"),
age = c(12, 9, 4),
school = c("JHS", "Primary", "Creche")
)
name | sex | age | school |
---|---|---|---|
Ama | Female | 12 | JHS |
Yakubu | Male | 9 | Primary |
John | Male | 4 | Creche |
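For small datasets typed by hand, it can be easier to enter the values row by row instead of column by column. A minimal sketch, assuming the tibble package is installed, uses its tribble() function with the same values as above; the object name children is ours.

```r
library(tibble)

# Row-wise entry: column names are given as formulas (~name),
# followed by one line of values per record
children <- tribble(
  ~name,    ~sex,     ~age, ~school,
  "Ama",    "Female", 12,   "JHS",
  "Yakubu", "Male",   9,    "Primary",
  "John",   "Male",   4,    "Creche"
)

children
```

The result has the same columns as the tibble() call above; only the layout of the input differs.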
6.3 R data file
When working in R, data is frequently stored as an .Rdata file. This format saves R objects as they are, preserving their names, column types and attributes. Below we will read an already saved .Rdata file.
We then view the first 4 rows of data1_stata, the only dataset in the loaded file:
data1_stata %>% head(n=4)
id | sex | weight | height |
---|---|---|---|
125 | Male | 7 | 64 |
62 | Female | 11 | 96 |
112 | Female | 13 | 115 |
29 | Female | 20 | 106 |
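The call that loads the file is not shown above. As a hedged illustration of how saving and loading such a file works in base R, the sketch below writes a small data frame to a temporary file with save() and restores it with load(). The object name demo_data and the temporary path are hypothetical and are not the file used in this chapter; the values are copied from the table above.

```r
# A small data frame to store (values copied from the table above)
demo_data <- data.frame(
  id     = c(125, 62, 112, 29),
  sex    = c("Male", "Female", "Female", "Female"),
  weight = c(7, 11, 13, 20),
  height = c(64, 96, 115, 106)
)

# Save the object to a temporary file (hypothetical path)
rdata_path <- tempfile(fileext = ".RData")
save(demo_data, file = rdata_path)

# Remove the object, then restore it from the file
rm(demo_data)
loaded_names <- load(rdata_path)

# load() returns the names of the objects it restored
loaded_names
head(demo_data, n = 4)
```

Note that load() restores objects under the names they had when saved, so it can overwrite objects of the same name in the current session.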
6.4 Text files
The first file format that we are going to read is a flat file or text file. These usually have the extension .txt. The data in these files can be separated by various delimiters, including tabs, commas and spaces. In this section, we read in a tab-delimited file as a prototype, as the others are handled similarly.
read_delim(
  file = "./Data/bpA.txt",
  delim = "\t",
  # Compact specification, one letter per column: c = character, d = double
  col_types = "ccdd"
)
id | sex | age | sbp0 |
---|---|---|---|
B01 | M | 73 | 145 |
B02 | F | 47 | 164 |
B03 | M | 59 | 153 |
Next, we read in a comma-delimited text file:
read_delim(
  file = "./Data/blood.txt",
  delim = ",",
  # One letter per column: stno as character, the remaining six as double
  col_types = "cdddddd"
) %>%
  head(n = 4)
stno | age | wgt | hgt | hb | wbc | hct |
---|---|---|---|---|---|---|
1001 | 120 | 18.4 | 128 | 7.7 | 13.7 | 22.6 |
1002 | 96 | 24 | 123 | 8 | 8.5 | 22.8 |
1003 | 168 | 38.5 | 143 | 8.2 | 26.8 | 24.1 |
1004 | 96 | 20.4 | 114 | 8.3 | 10.5 | 23.3 |
Comma-delimited files, which often have the extension .csv, can also be imported with the read_csv() function:
read_csv(
  file = "./Data/blood.txt",
  # One letter per column: stno as character, the remaining six as double
  col_types = "cdddddd"
) %>%
  head(n = 4)
stno | age | wgt | hgt | hb | wbc | hct |
---|---|---|---|---|---|---|
1001 | 120 | 18.4 | 128 | 7.7 | 13.7 | 22.6 |
1002 | 96 | 24 | 123 | 8 | 8.5 | 22.8 |
1003 | 168 | 38.5 | 143 | 8.2 | 26.8 | 24.1 |
1004 | 96 | 20.4 | 114 | 8.3 | 10.5 | 23.3 |
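Column types can also be spelled out by name, which is easier to read when a file has many columns. The sketch below is a self-contained illustration, assuming the readr package is installed: it writes a tiny comma-delimited file to a temporary location and reads it back with an explicit cols() specification. The temporary file and the object name blood_demo are hypothetical; the values are the first rows of the table above.

```r
library(readr)

# Write a tiny comma-delimited file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("stno,age,wgt",
    "1001,120,18.4",
    "1002,96,24"),
  csv_path
)

# Read it back, naming the type of each column explicitly
blood_demo <- read_csv(
  csv_path,
  col_types = cols(
    stno = col_character(),  # study number kept as text
    age  = col_double(),
    wgt  = col_double()
  )
)

blood_demo
```

Naming the columns in cols() makes it clear which type belongs to which column, whereas the compact one-letter form depends on column order.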
6.5 Microsoft Excel
Probably the most common format for transferring data is Microsoft Excel. Excel files come in two formats, with the extensions .xls and .xlsx. Below, reading in an .xlsx file is demonstrated using the readxl package.
id | sex | weight | height |
---|---|---|---|
125 | Male | 7 | 64 |
62 | Female | 11 | 96 |
112 | Female | 13 | 115 |
29 | Female | 20 | 106 |
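The call that produced the table above is not shown. A minimal sketch of reading an .xlsx file with readxl might look like the following. It assumes the readxl package is installed and uses the example workbook datasets.xlsx that ships with readxl (located with readxl_example()), not the file used in this chapter; the object names xlsx_path and mtcars_xl are ours.

```r
library(readxl)

# Path to an example workbook bundled with readxl (not the chapter's file)
xlsx_path <- readxl_example("datasets.xlsx")

# A workbook can hold several sheets; list them first
excel_sheets(xlsx_path)

# Read one sheet by name; the result is a tibble
mtcars_xl <- read_excel(xlsx_path, sheet = "mtcars")

head(mtcars_xl, n = 4)
```

The same read_excel() function also reads .xls files, so the older format does not need separate handling.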
6.6 SPSS files
Files from SPSS are usually saved with the extension .sav. Below we read an SPSS data file using the haven package.
id | sex | weight | height |
---|---|---|---|
125 | Male | 7 | 64 |
62 | Female | 11 | 96 |
112 | Female | 13 | 115 |
29 | Female | 20 | 106 |
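The call that produced the table above is not shown. As a hedged sketch, an SPSS file can be read with read_sav() from haven. The example below assumes the haven package is installed and uses the iris.sav example file bundled with it, not the file used in this chapter; the object names are ours. SPSS value labels arrive as labelled columns, which as_factor() converts to ordinary R factors.

```r
library(haven)

# Path to an example SPSS file bundled with haven (not the chapter's file)
sav_path <- system.file("examples", "iris.sav", package = "haven")

# Read the file; the result is a tibble
iris_spss <- read_sav(sav_path)

# Convert labelled columns (SPSS value labels) into R factors
iris_fct <- as_factor(iris_spss)

head(iris_fct, n = 4)
```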
6.7 Stata files
Stata files, like SPSS data files, can be imported using the haven package. This is illustrated below.
id | sex | weight | height |
---|---|---|---|
125 | Male | 7 | 64 |
62 | Female | 11 | 96 |
112 | Female | 13 | 115 |
29 | Female | 20 | 106 |
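The call that produced the table above is not shown. A self-contained sketch of the round trip, assuming the haven and tibble packages are installed, writes a small tibble to a temporary .dta file with write_dta() and reads it back with read_dta(). The temporary path and the object names children and children_dta are hypothetical; the values are copied from the table above.

```r
library(haven)
library(tibble)

# A small tibble to export (values copied from the table above)
children <- tibble(
  id     = c(125, 62, 112, 29),
  sex    = c("Male", "Female", "Female", "Female"),
  weight = c(7, 11, 13, 20),
  height = c(64, 96, 115, 106)
)

# Write to a temporary Stata file (hypothetical path), then read it back
dta_path <- tempfile(fileext = ".dta")
write_dta(children, dta_path)
children_dta <- read_dta(dta_path)

head(children_dta, n = 4)
```

Writing a file first keeps the example runnable without access to the chapter's data; with a real Stata file, only the read_dta() call is needed.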