13 Analysis of numeric data

So far, we have dealt with descriptive statistics and the analysis of the sample data collected. However, the aim of most statistical analysis is to make inferences about the population as a whole. In this section, we mainly perform inferential analysis of continuous variables.

13.1 Confidence interval of a mean

To determine the confidence interval of the mean of a numeric variable in R, we use the one-sample Student's t-test. The assumptions for the validity of this test are:

  1. The sample should have been randomly chosen
  2. The population distribution of the variable should be normal. This can be assumed if
    • The distribution of the population is known to be normally distributed
    • The population distribution has one mode, is symmetric and free of outliers, and the sample size is 15 or less
    • The population distribution is at most moderately skewed, has one mode and no outliers, and the sample size is between 16 and 40
    • The sample size is more than 40 and the data has no outliers.

With our sample considered to be randomly selected and a sample size of 140, we apply the one-sample t-test as shown below.
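
The code in this section assumes that the core tidyverse packages and magrittr (the %$% exposition pipe used later comes from magrittr) have been loaded; summarytools, meantables, rstatix and broom are called with the :: prefix and therefore only need to be installed. A minimal setup sketch:

library(tidyverse)   # readr, dplyr, tidyr, ggplot2 and the %>% pipe
library(magrittr)    # provides the %$% exposition pipe used with t.test()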

We first import the data

df_data1 <- 
    read_delim(
        file = "./Data/data1.txt", 
        delim = "\t",
        # compact column types: character (id), factor (sex), integer (weight, height)
        col_types = "cfii"
    ) %>% 
    mutate(sex = factor(sex))

And then summarize it below

df_data1 %>% 
    summarytools::dfSummary(graph.col = F)
Data Frame Summary  
df_data1  
Dimensions: 140 x 4  
Duplicates: 0  

--------------------------------------------------------------------------------------
No   Variable      Stats / Values            Freqs (% of Valid)   Valid      Missing  
---- ------------- ------------------------- -------------------- ---------- ---------
1    id            1. 1                        1 ( 0.7%)          140        0        
     [character]   2. 10                       1 ( 0.7%)          (100.0%)   (0.0%)   
                   3. 100                      1 ( 0.7%)                              
                   4. 101                      1 ( 0.7%)                              
                   5. 102                      1 ( 0.7%)                              
                   6. 103                      1 ( 0.7%)                              
                   7. 104                      1 ( 0.7%)                              
                   8. 105                      1 ( 0.7%)                              
                   9. 106                      1 ( 0.7%)                              
                   10. 107                     1 ( 0.7%)                              
                   [ 130 others ]            130 (92.9%)                              

2    sex           1. Female                 64 (45.7%)           140        0        
     [factor]      2. Male                   76 (54.3%)           (100.0%)   (0.0%)   

3    weight        Mean (sd) : 12.2 (6.3)    28 distinct values   140        0        
     [numeric]     min < med < max:                               (100.0%)   (0.0%)   
                   2 < 11 < 33                                                        
                   IQR (CV) : 7 (0.5)                                                 

4    height        Mean (sd) : 90.9 (21.3)   70 distinct values   139        1        
     [numeric]     min < med < max:                               (99.3%)    (0.7%)   
                   49 < 88 < 137                                                      
                   IQR (CV) : 33 (0.2)                                                
--------------------------------------------------------------------------------------
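
Table 6.2 below summarizes the height variable together with its standard error (se) and the half-width of its 95% confidence interval (ci). The code chunk that produced it is not shown in the text; one way to reproduce a table with these columns is rstatix::get_summary_stats with the "common" summary type (a sketch):

# Sketch: "common" summary statistics for height, including the standard error
# (se) and the 95% confidence-interval half-width (ci)
df_data1 %>% 
    select(height) %>% 
    rstatix::get_summary_stats(type = "common")
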
Table 6.2:
 variable   n     min   max   median   iqr   mean   sd     se     ci
 height     139   49    137   88       33    90.9   21.3   1.81   3.58

The meantables package gives the same summary together with the explicit lower and upper 95% confidence limits (lcl, ucl):

df_data1 %>% 
    meantables::mean_table(height) 
Table 6.3:
 response_var   n     mean   sd     sem    lcl    ucl    min   max
 height         139   90.9   21.3   1.81   87.3   94.4   49    137
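
As a cross-check, the confidence limits above are simply the sample mean plus or minus the appropriate t quantile multiplied by the standard error of the mean. A minimal sketch of that calculation:

# Sketch: 95% CI of the mean height as mean ± t(0.975, n - 1) * SEM
df_data1 %>% 
    summarise(
        n_obs   = sum(!is.na(height)),
        mean_ht = mean(height, na.rm = TRUE),
        sem     = sd(height, na.rm = TRUE) / sqrt(n_obs),
        lcl     = mean_ht - qt(0.975, df = n_obs - 1) * sem,
        ucl     = mean_ht + qt(0.975, df = n_obs - 1) * sem
    )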

For sex-stratified confidence intervals we have

df_data1 %>% 
    group_by(sex) %>% 
    meantables::mean_table(height)
Table 7.4:
 response_var   group_var   group_cat   n    mean   sd   sem    lcl    ucl    min   max
 height         sex         Female      63   92.2   23   2.89   86.4   98     49    137
 height         sex         Male        76   89.8   20   2.29   85.2   94.3   55    131

13.2 Comparing the mean to a hypothesised value

Assume the objective of the data collection was to determine whether the average weight of our population is similar to that of a population with a known mean weight of 14 kg.

Our null hypothesis is:

H0: There is no difference in mean weight between our population and a population with a mean weight of 14 kg.

To test this hypothesis we use the One sample t-test after we have satisfied ourselves that the assumptions for its use have been met.
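
Since the sample size is well above 40, the main requirement to verify is the absence of marked outliers in weight. A quick graphical check (a sketch, in the same ggplot style used later in this section):

# Sketch: histogram of weight to check for obvious outliers or severe skew
df_data1 %>% 
    ggplot(aes(x = weight)) +
    geom_histogram(bins = 15, col = "white") +
    labs(x = "Weight (kg)") +
    theme_bw()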

df_data1 %$% 
    t.test(weight, mu=14) %>% 
    broom::tidy()
Table 6.4:
 estimate   statistic   p.value   parameter   conf.low   conf.high   method              alternative
 12.2       -3.32       0.00113   139         11.2       13.3        One Sample t-test   two.sided

The p-value of 0.001 is the probability of obtaining a sample mean at least as far from 14 kg as the one observed, if the true population mean weight were 14 kg. Since this probability is very small, we reject the null hypothesis H0 at the 5% significance level and conclude that our population mean weight is significantly different from 14 kg. The confidence interval generated is that of our sample mean. Since the hypothesised value of 14 kg lies outside this confidence interval, it leads to the same conclusion: the data are not consistent with a population mean weight of 14 kg.
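
The test statistic reported above can also be reproduced directly from the summary statistics: it is the distance between the sample mean and the hypothesised value, expressed in units of the standard error. A minimal cross-check sketch:

# Sketch: t = (sample mean - 14) / (sd / sqrt(n))
df_data1 %>% 
    summarise(
        n = n(),
        t = (mean(weight) - 14) / (sd(weight) / sqrt(n))
    )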

13.3 Comparing mean of two independent groups

This is possibly the most common use of the t-test. To compare the mean weights of the males and females in our study we state the null hypothesis

H0: There is no difference in weight between the males and females in our population.

To test this assertion we first determine if our sample fits the assumptions for the use of the two-sample t-test. These are:

  1. The sample should have been randomly chosen
  2. The two samples are completely independent
  3. Each population is at least 20 times larger than its respective sample.
  4. The population distribution of the variable should be normal. This can be assumed if
    • The distribution of the population is known to be normal
    • The population distribution has one mode, is symmetric and free of outliers, and the sample size is 15 or less
    • The population distribution is at most moderately skewed, has one mode and no outliers, and the sample size is between 16 and 40
    • The sample size is more than 40 and the data has no outliers.

Our data fulfil all the criteria above, so we apply the test below.
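
Before doing so, the within-group normality assumption can also be probed formally with a Shapiro-Wilk test run separately for each sex — a sketch using rstatix, which this section uses elsewhere:

# Sketch: Shapiro-Wilk test of weight within each sex
# (H0: no deviation from a normal distribution)
df_data1 %>% 
    group_by(sex) %>% 
    rstatix::shapiro_test(weight)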

df_data1 %$% 
    t.test(formula = weight ~ sex) %>% 
    broom::tidy()
Table 6.5:
 estimate   estimate1   estimate2   statistic   p.value   parameter   conf.low   conf.high   method                    alternative
 1.31       12.9        11.6        1.2         0.234     117         -0.856     3.47        Welch Two Sample t-test   two.sided

With a relatively high p-value we conclude that there is insufficient evidence to refute the null hypothesis. In other words, there is insufficient evidence to conclude that the mean weights of males and females differ in the study population. Note, however, that in our sample the females appear heavier than the males, as shown by the estimate1 (mean for females) and estimate2 (mean for males) columns in the output above.

The confidence interval determined above (-0.86 to 3.47) is actually that of the difference in mean weight between females and males. Since this confidence interval contains the null value from H0 above, i.e. 0, we conclude that there isn't enough evidence of a difference in mean weight between the two sexes. Hence the confidence interval and the p-value lead to the same conclusion.

13.4 Comparing means of paired observations

In this section we use the bread.txt data, which contains the weight in grams of some loaves of bread before and after baking. A description of the variables is contained in the data file. Paired observations occur in circumstances where repeated measurements are made on the same object, or where the data collected share some characteristic in common. For the bread data, the same loaf is weighed before and after baking. Determining whether there is a significant difference between the two measurements requires the paired t-test. As always, we begin by importing the data

The data is read into a data frame named df_bread

df_bread <- 
    read.table("./Data/bread.txt", sep = "\t", header = TRUE) %>% 
    mutate(oven = factor(oven), type = factor(type))

And then summarize it

df_bread %>% summarytools::dfSummary(graph.col = F)
Data Frame Summary  
df_bread  
Dimensions: 399 x 5  
Duplicates: 0  

--------------------------------------------------------------------------------------
No   Variable    Stats / Values             Freqs (% of Valid)    Valid      Missing  
---- ----------- -------------------------- --------------------- ---------- ---------
1    sid         Mean (sd) : 200 (115.3)    399 distinct values   399        0        
     [integer]   min < med < max:           (Integer sequence)    (100.0%)   (0.0%)   
                 1 < 200 < 399                                                        
                 IQR (CV) : 199 (0.6)                                                 

2    type        1. maize                   212 (53.1%)           399        0        
     [factor]    2. wheat                   187 (46.9%)           (100.0%)   (0.0%)   

3    before      Mean (sd) : 379.5 (28.1)   120 distinct values   399        0        
     [integer]   min < med < max:                                 (100.0%)   (0.0%)   
                 304 < 379 < 479                                                      
                 IQR (CV) : 40 (0.1)                                                  

4    after       Mean (sd) : 217 (29.8)     121 distinct values   399        0        
     [integer]   min < med < max:                                 (100.0%)   (0.0%)   
                 140 < 215 < 295                                                      
                 IQR (CV) : 48 (0.1)                                                  

5    oven        1. Firewood                199 (49.9%)           399        0        
     [factor]    2. Gas                     200 (50.1%)           (100.0%)   (0.0%)   
--------------------------------------------------------------------------------------

Our next task is to compare the weight of the loaves of bread before and after baking. We begin by looking at the mean and standard deviations of the two weights.

options(huxtable.knit_print_df = TRUE)
df_bread %>% 
    select(before, after) %>% 
    rstatix::get_summary_stats(type= "mean_sd")
Table 6.10:
 variable   n     mean   sd
 before     399   380    28.1
 after      399   217    29.8

It is obvious that the mean weight of the bread before baking is much higher than after, while the standard deviations appear similar. For a formal test of the difference in means we use the paired t-test. As before, we state the assumptions of the paired t-test

  1. The sample should have been randomly chosen
  2. The two samples are not independent (they are related)
  3. Each population is at least 20 times larger than its respective sample.
  4. The population distribution of the difference between the two variables should be normal. This can be assumed if
    • The distribution of the population is known to be normal
    • The population distribution has one mode, is symmetric and free of outliers, and the sample size is 15 or less
    • The population distribution is at most moderately skewed, has one mode and no outliers, and the sample size is between 16 and 40
    • The sample size is more than 40 and the data has no outliers.

The only new assumption we need to evaluate here is the distribution of the difference between the weights before and after baking. We do so below

df_bread %>% 
    mutate(diff_in_wgt = after - before) %>% 
    ggplot(aes(x = diff_in_wgt)) +
    geom_histogram(bins = 10, col = "white") + 
    labs(x = "Difference in weight") +
    theme_bw()
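
A normal Q-Q plot is another common graphical check of this assumption; a sketch using ggplot2:

# Sketch: Q-Q plot of the weight difference against a normal distribution
df_bread %>% 
    mutate(diff_in_wgt = after - before) %>% 
    ggplot(aes(sample = diff_in_wgt)) +
    stat_qq() +
    stat_qq_line() +
    theme_bw()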

Alternatively, we can perform a Shapiro-Wilk test for normality. Its null hypothesis (H0) is that the data do not deviate from a normal distribution. This is done below

df_bread %>% 
    mutate(diff_in_wgt = after - before) %>%
    rstatix::shapiro_test(vars = "diff_in_wgt")
Table 13.1:
 variable      statistic   p
 diff_in_wgt   0.997       0.636

The output and graphical representation above show that the difference in weight is approximately normally distributed. We therefore go ahead and determine the difference in mean weights. First we state our hypothesis

H0: There is no change in weight of loaves of bread after baking

And then perform the test after converting the data to the long format

df_bread %>%
    pivot_longer(
        cols = c(before, after), names_to = "time", values_to = "weight"
    ) %>% 
    rstatix::t_test(formula = weight ~ time, paired = TRUE, detailed = TRUE)
Table 13.2:
 estimate   .y.      group1   group2   n1    n2    statistic   p           df    conf.low   conf.high   method   alternative
 -163       weight   after    before   399   399   -94.1       5.48e-274   398   -166       -159        T-test   two.sided

There was on average a 162.5g reduction in weight of the loaves of bread after baking. This reduction has a 95% confidence interval of 159.1g to 165.9g and is significantly different from 0 (p-value<0.001).
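
As a quick cross-check of this estimate, the paired-test estimate is simply the mean of the per-loaf differences; a minimal sketch:

# Sketch: mean and SD of the per-loaf weight change (after - before)
df_bread %>% 
    summarise(
        mean_diff = mean(after - before),
        sd_diff   = sd(after - before)
    )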

13.5 Test for equality of variances

In using the Student's t-test to determine the difference between the means of two independent groups, we need to be mindful of the variances of each group. The computations for the independent-groups t-test differ depending on whether the variances of the two groups are similar or not. Therefore, to determine whether mean weight differs significantly between males and females, we first need to determine and compare their variances. The var.test() function in R compares the variances of two independent groups and can be used for this purpose. Below we apply this F-test to compare the variances of weight for the two sexes. First we determine the variances.

df_data1 %>% 
    group_by(sex) %>%
    summarise(across(weight, list(var = var, mean = mean)))
Table 13.3:
 sex      weight_var   weight_mean
 Female   50.8         12.9
 Male     30.2         11.6

There seems to be a big difference between the variances of the weights for the two sexes: the variance for the females is almost 1.7 times that for the males. To determine whether this could be a chance finding we apply a formal statistical test. Our null hypothesis is

H0: There is no difference in the variance of the weights for males and females in our population

The F-test actually tests the ratio of the variances, not their difference. In that regard, our null value is 1.
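
As a quick cross-check, the estimate reported by var.test() below is simply this ratio of sample variances; a minimal sketch:

# Sketch: ratio of the sample variances of weight (Female / Male)
df_data1 %>% 
    summarise(
        var_ratio = var(weight[sex == "Female"]) / var(weight[sex == "Male"])
    )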

df_data1 %>% 
    var.test(formula = weight~sex, data = .) %>% 
    broom::tidy() 
Multiple parameters; naming those columns num.df, den.df
Table 13.4:
 estimate   num.df   den.df   statistic   p.value   conf.low   conf.high   method                            alternative
 1.68       63       75       1.68        0.0311    1.05       2.73        F test to compare two variances   two.sided

The significant p-value (at the 0.05 significance level) and a confidence interval that excludes 1 (the null value) indicate that the variances of the two groups differ significantly. In that case the conclusion from the previous analysis remains valid, since R assumes unequal variances by default when the t.test() function is used (the Welch test reported above).
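
Had the variances been similar, the pooled (equal-variance) version of the test could be requested explicitly through the var.equal argument; a sketch:

# Sketch: pooled two-sample t-test; the default (var.equal = FALSE) is the
# Welch test used earlier
t.test(weight ~ sex, data = df_data1, var.equal = TRUE) %>% 
    broom::tidy()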

Next we apply the same principle to determine if the mean heights are similar for males and females in our population. We first determine if the variances are significantly different.

df_data1 %>% 
    group_by(sex) %>%
    summarise(
        across(height, list(var = ~var(., na.rm = TRUE), mean = ~mean(., na.rm = TRUE)))
    )
Table 13.5:
 sex      height_var   height_mean
 Female   528          92.2
 Male     399          89.8

From the results above, the variance for the females looks higher (about 1.3 times) than that for the males; however, we apply a test to determine this formally.

df_data1 %>% 
    var.test(formula = height~sex, data = .) %>% 
    broom::tidy() 
Multiple parameters; naming those columns num.df, den.df
Table 13.6:
 estimate   num.df   den.df   statistic   p.value   conf.low   conf.high   method                            alternative
 1.32       62       75       1.32        0.246     0.823      2.15        F test to compare two variances   two.sided

Both the p-value and the confidence interval indicate that there is insufficient evidence to say the two variances differ. To determine whether mean height differs between males and females we can therefore use the equal-variance (pooled) t-test, specifying var.equal = TRUE as below.

df_data1 %>%
    rstatix::t_test(formula = height~sex, var.equal = TRUE, detailed = TRUE)
Table 13.7:
 estimate   estimate1   estimate2   .y.      group1   group2   n1   n2   statistic   p       df    conf.low   conf.high   method   alternative
 2.38       92.2        89.8        height   Female   Male     63   76   0.654       0.514   137   -4.82      9.59        T-test   two.sided

With a p-value of 0.514 and a 95% confidence interval (-4.82 to 9.59) that contains 0, we conclude that there is insufficient evidence of a difference in mean height between males and females in our population.