12 Normality of data
Many of the test statistical tests, specifically the parametric tests are done
on the premise that numeric data is normally distributed. Unfortunately, this
is not always so. In this chapter, we look at what is normally distributed data
and how we can tell if our data is normally distributed. For this section, we
will use the hb
variable from the mps.dta
data.
12.1 The normal distribution
The normal distribution, also called the Gaussian Distribution or a bell curve, is defined by two main statistics. These are the mean and standard deviation. The wider the standard deviation, the broader the curve. An example of the normal distribution is shown below:
df_temp <- data.frame(x = rnorm(2000))
df_temp %>%
ggplot(aes(x = x))+
geom_histogram(
aes(y=after_stat(density)),
bins=10, fill = "snow", col = "red") +
stat_function(
fun = dnorm,
args = list(
mean = mean(df_temp$x, na.rm=T),
sd = sd(df_temp$x)), col = "blue",
linewidth = 1.5) +
labs(x = NULL, y = NULL) +
scale_x_continuous(labels = NULL)+
scale_y_continuous(labels = NULL)+
theme_minimal()
These are some features of the normal distribution:
- It is symmetrical.
- The mean, median and mode are the same.
- Approximately 68% of the data falls within one standard deviation of the mean.
- Approximately 95% of the data falls within two standard deviations of the mean
- Approximately 99.7% of the data fall within three standard deviations of the mean.
The normal distribution with a mean of 1 and a standard deviation of 1 is called the standard normal distribution.
12.2 Evaluating normality
There are two main modalities for evaluating normality. These are graphical and formal hypothesis testing.
12.2.1 Graphical evaluation
Histogram: Probably the most well-known modality here is the histogram.
Below we first read the data and keep only the hb
variable:
df_mps <- haven::read_dta("./Data/mps.dta")
We then draw the histogram of the hb
.
df_mps %>%
drop_na(hb) %>%
ggplot(aes(x = hb)) +
geom_histogram(bins = 10, fill = 'white', col = "black") +
labs(title = "Histogram of HB", y = "Frequency", x = 'Hb (g/dL') +
theme_bw()
There is near symmetry with a slightly heavier left tail.
Boxplot: Our next graphical modality is the boxplot as drawn below.
df_mps %>%
drop_na(hb) %>%
ggplot(aes(x = hb)) +
geom_boxplot(fill = 'white', col = "black") +
labs(title = "Boxplot of HB", y = "Frequency", x = 'Hb (g/dL') +
theme_bw()
The same conclusion of good symmetry and a slightly heavier lower tail is seen here.
Q-Q plot: Finally, the Q-Q plot with a line. This graphical modality plots the actual values of the data against a theoretical normal distribution. Thus, if all the dots were to be in a straight line and along the line drawn that would be the ideal normal distribution. We therefore use this principle to determine if our data is for instance heavy at the tails, indicating skewness. The Q-Q plot of our data is as done below:
df_mps %>%
drop_na(hb) %>%
ggpubr::ggqqplot(x = "hb",title = "Q-Q plot of the HB", conf.int = FALSE)
It is seen that apart from a few points mainly on the right tail the rest pretty much follow the line.
12.2.2 Statistical tests for normality
Formal statistical tests are available for testing. H0: The data is sampled from a normally distributed population. Ha: The data is not sampled from a normally distributed population There are a few of these tests but we will be concentrating on the Shapiro-Wilk tests. It is never advisable to do different tests together as they use different algorithms and may produce different results and conclusions.
Shapiro-Wilk test: Below we perform the Shapiro-wilk test for normality.
df_mps %$%
shapiro.test(hb) %>%
broom::tidy()
statistic | p.value | method |
---|---|---|
0.994 | 0.108 | Shapiro-Wilk normality test |
A p-value greater than 0.05 indicates we reject the Null hypothesis and thus conclude our data comes from a normally distributed population.
12.3 Conclusion
In conclusion, it can be seen from the graphical presentations as well as the formal test that the data we have is coming from a normally distributed population.
The various tests can give contradictory results so I recommend evaluating the normality of a population, one should first plot the histogram, and Q-Q plot, and then perform one formal test, and combine these results before making the the judgement that numeric data is or is not normally distributed.