2 Introduction
2.1 The statistical data analyst
Statistical data analysis is more than just using computer software to generate results. It requires a basic understanding of the type of data at hand and of the best way to analyse and present it so that it is meaningful to a general audience. Thus, the data analyst:
- Must understand the genesis (study methodology) involved in obtaining the data in the first place. The conclusion from the same data may differ depending on the study methodology used and the hypothesis being tested. It is very prudent, therefore, that the statistical data analyst be involved in the data collection process right from the beginning.
- Should be able to point out errors in the data collection process in the early stages. This avoids wasting valuable resources on data that may not answer the question.
- Should provide valuable advice on the best method of analysing the data at hand.
- Should perform the analysis scientifically and soundly by applying the most current and statistically appropriate principles.
- Should present the result of the analysis in a manner that persons without statistical and analytical expertise can understand with the least effort. This requires the statistical data analyst to be able to explain the analysis in plain language.
- Finally, must know their own limits. There are often instances where the analyst should seek "professional" help, even when feeling confident of being on the right path. It never hurts to seek a second opinion from your peers.

It goes without saying from the preceding discussion that the data analyst must have a firm understanding of statistical and research methodology.
2.2 Statistical software
Some years back, statistical analysis was one of the most tedious processes, done mainly by dedicated statisticians. With the advent of computers and statistical software, it has become far more convenient, with many advantages but also some disadvantages. The main advantages are:
1. The tremendous speed with which large datasets are processed and results obtained.
2. The accuracy of the statistical calculations performed. Computers do not make mistakes, but one has to beware of rounding in some software. Some software performs calculations to a fixed number of decimal places. Therefore, when confronted with a figure such as 1.00377655432, the software may work with 1.0037765, leaving out the last four digits. Calculations using this truncated value are likely to give a different result from those using the non-truncated figure, thus affecting the accuracy of the final result.
3. Many modern statistical software packages can read data from varied sources and formats. This makes it easy to transfer data from one software package to another without having to re-enter the collected data. This transferability has enabled the use of other digital equipment such as smartphones, personal digital assistants and tablets for data collection. Data collected in this manner is ready for cleaning and analysis, bypassing the data entry stage.
4. Plotting graphs is one of the most important uses of modern-day computerised data analysis. Statistical software makes this otherwise tedious process almost hassle-free and gives us the ability to redo a plot from scratch at the click of a button.
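The effect of truncation described above can be illustrated with a short R sketch. It is only an illustration under stated assumptions: the variable names are made up for this example, and the choice of raising each value to the power of 1000 is arbitrary, simply a way of letting a small difference accumulate. R itself stores numbers with roughly 15 to 16 significant digits, so the truncation is imposed here by hand.

```r
# Full-precision value from the text, and a copy truncated by hand
# to 7 decimal places (illustrative names, not part of any package)
x_full  <- 1.00377655432
x_trunc <- trunc(x_full * 1e7) / 1e7

# By default R prints 7 significant digits; ask for more to see
# the difference between the two stored values
print(x_full, digits = 12)
print(x_trunc, digits = 12)

# Use both values in a further calculation (an arbitrary choice:
# raising each to the power of 1000)
result_full  <- x_full^1000
result_trunc <- x_trunc^1000

# The tiny initial difference has grown in the final result
result_full - result_trunc
```

The difference between the two starting values is tiny, yet the gap between the two final results is much larger, which is why it pays to know how many digits a given software package keeps.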
Despite all these advantages, many disadvantages are also inherent in the use of computers and statistical software. Some include:
- Many people with very little or no statistical knowledge can manipulate data and come up with conclusions that are often spurious. The cliché "Garbage In, Garbage Out" could not apply better than in this situation.
- Many commonly used software packages tend to be very reliable and accurate. However, with a large number of user-written statistical programs freely available online, one needs to be cautious of the output they generate. Some of these contain coding errors and thus produce faulty results.
- Unfortunately, the most used, reliable and accurate statistical software tends to be expensive as well. This notwithstanding, there are a few, such as R, that combine being free and open source with versatility and reliability. This forms the basis for my choice of R for this book.
2.3 Obtaining and installing R
R is a free programming language and software environment for data manipulation, calculation and graphical display. It runs on Windows, macOS and Unix systems. It is widely applied in many academic fields, including mathematics, economics and epidemiology. This capability has been enhanced by the many packages written by individuals over the years. R has the great advantage of being able to handle many datasets simultaneously. However, this functionality comes at a cost, which will be discussed in the subsequent chapters. R also has great graphics functionality, but using it well requires practice.
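As a first taste of these three capabilities, the minimal sketch below keeps two datasets in the same R session and performs a calculation, a data manipulation and a plot. It assumes only the example datasets `mtcars` and `iris` that ship with R; the object names on the left-hand side are chosen purely for illustration.

```r
# Two built-in example datasets, held in memory at the same time
cars_data   <- mtcars   # motor car road tests
flower_data <- iris     # iris flower measurements

# Calculation: one summary statistic from each dataset
mean_mpg          <- mean(cars_data$mpg)
mean_petal_length <- mean(flower_data$Petal.Length)

# Data manipulation: keep only the cars with 4 cylinders
four_cyl <- subset(cars_data, cyl == 4)

# Graphical display: a simple scatter plot from one of the datasets
plot(cars_data$wt, cars_data$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
```

Both datasets remain available under their own names throughout the session, so results from one can be used alongside the other without closing or reloading anything.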
Several advanced statistical and mathematical functions, such as regression and survival analysis, are also implemented in R. R and its many packages are obtainable free from http://cran.r-project.org/. The most current version at the time of writing this book is R-4.2.1. Note that from version 4.2.0 onwards the Windows version of R is built for 64-bit systems only; a 32-bit Windows system requires an older release. Download the base file from https://cran.r-project.org/bin/windows/base/, save it on your computer and install it, preferably as an administrator, by following the on-screen instructions.
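Once the installation is complete, a quick check along the following lines can confirm that R is working and show how add-on packages are obtained. This is a sketch rather than a required step: the `survival` package is used only as an example, and the installation line is left commented out because it needs an internet connection.

```r
# Show the version of R that is currently running
R.version.string

# Install an add-on package from CRAN (requires an internet connection).
# "survival" is only an example; remove the leading "#" to run the line.
# install.packages("survival")

# Load the package into the current session if it is installed
if (requireNamespace("survival", quietly = TRUE)) {
  library(survival)
}
```

A package needs to be installed only once on a computer, but it must be loaded with `library()` in every new R session in which it is used.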
2.4 Obtaining and installing RStudio
RStudio is an Integrated Development Environment (IDE) for R and other languages such as Python. It provides an interface that adds more functionality and automation to R. It comes in two forms: a desktop version and a server version that is used within a web browser. The desktop version is downloadable from https://posit.co/download/rstudio-desktop/. For simplicity, and to make the processes in this book easy to follow, it is preferable to download and install the desktop version.