10 Descriptive Statistics: Numeric

The initial analysis of numeric data is usually a description of the data at hand, without drawing inferences about the population from which the data was drawn. This gives the analyst an overview of the data, how best to describe it and which analyses suit it. The most basic step in a descriptive analysis of numeric data is to determine the:

  • Measure of Central Tendency: A description of the center of the data. These measures include the mean, median and mode.
  • Measure of Dispersion: A description of how spread out the data is. These measures include the standard deviation, variance, interquartile range and range.
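
As a minimal sketch of these measures, the base R calls below compute each one on a small made-up vector. The values and the helper `stat_mode()` are illustrative assumptions; they are not part of the dataset or of any package.

```r
# Hypothetical toy values, not taken from NewDrug_clean.dta
x <- c(2, 4, 4, 5, 7, 9, 11)

# Measures of central tendency
x_mean   <- mean(x)     # arithmetic mean
x_median <- median(x)   # middle value of the sorted data

# Base R has no function for the statistical mode (mode() returns the
# storage type), so we tabulate the values and pick the most frequent one.
# With ties, which.max() returns the first (smallest) of the tied values.
stat_mode <- function(v) {
    counts <- table(v)
    as.numeric(names(counts)[which.max(counts)])
}
x_mode <- stat_mode(x)

# Measures of dispersion
x_sd    <- sd(x)      # sample standard deviation (n - 1 denominator)
x_var   <- var(x)     # sample variance, the square of sd(x)
x_iqr   <- IQR(x)     # third quartile minus first quartile
x_range <- range(x)   # minimum and maximum
```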

For this section, we will use the NewDrug_clean.dta dataset.

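library(tidyverse)   # %>%, mutate(), ggplot2, tidyr and forcats
library(magrittr)    # %$% exposition pipe used later in this section

# Read the Stata file, convert the labelled columns to factors,
# then drop any remaining value labels
newdrug <-  
    haven::read_dta("./Data/NewDrug_clean.dta") %>% 
    mutate(sex  = haven::as_factor(sex), treat = haven::as_factor(treat)) %>% 
    haven::zap_labels() 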

newdrug %>% summary()
      id                treat         age        sex   
 Length:50          Control:22   Min.   :45.00   F:26  
 Class :character   Newdrug:28   1st Qu.:57.25   M:24  
 Mode  :character                Median :63.00         
                                 Mean   :61.48         
                                 3rd Qu.:65.00         
                                 Max.   :75.00         
      bp1              bp2            bpdiff      
 Min.   : 87.50   Min.   :78.00   Min.   : 0.500  
 1st Qu.: 95.62   1st Qu.:85.22   1st Qu.: 4.800  
 Median : 97.70   Median :88.15   Median : 8.250  
 Mean   : 98.30   Mean   :88.60   Mean   : 9.704  
 3rd Qu.: 99.40   3rd Qu.:92.10   3rd Qu.:13.700  
 Max.   :111.70   Max.   :99.70   Max.   :26.300  

10.1 Single continuous variable

10.1.1 Measures of Central Tendency & Dispersion

Below we determine the mean, median, standard deviation, range (minimum and maximum) and interquartile range of the initial blood pressure (bp1).

newdrug %>% 
    summarise(
        Mean = mean(bp1), 
        Median = median(bp1), 
        Standard_Dev = sd(bp1), 
        Minimum = min(bp1), 
        Maximum = max(bp1),
        IQR = IQR(bp1)
    ) 
Table 7.2:
Mean  Median  Standard_Dev  Minimum  Maximum  IQR
98.3  97.7    5.17          87.5     112      3.78

Alternatively, the psych package reports these measures in more detail. The output includes the skewness and kurtosis, both of which describe the shape of the distribution.

newdrug %$% 
    psych::describe(bp1)
Table 6.1:
vars  n   mean  sd    median  trimmed  mad   min   max  range  skew   kurtosis  se
1     50  98.3  5.17  97.7    97.9     2.97  87.5  112  24.2   0.696  0.617     0.731
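
To make the skew and kurtosis columns more concrete, here is a rough sketch of the moment-based definitions in base R. The helpers `moment_skew()` and `moment_kurt()` and the toy vectors are assumptions made for illustration; psych::describe may apply a small-sample adjustment, so its values can differ slightly from these.

```r
# Moment-based skewness: third central moment scaled by the variance^1.5
moment_skew <- function(x) {
    d <- x - mean(x)
    mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis: fourth central moment scaled by the
# squared variance, minus 3 so that a normal distribution scores 0
moment_kurt <- function(x) {
    d <- x - mean(x)
    mean(d^4) / mean(d^2)^2 - 3
}

symmetric    <- c(1, 2, 3, 4, 5)    # made-up values, symmetric about 3
right_skewed <- c(1, 1, 2, 2, 10)   # made-up values with a long right tail

moment_skew(symmetric)      # 0: no asymmetry
moment_skew(right_skewed)   # positive: the tail points to the right
moment_kurt(symmetric)      # negative: lighter tails than a normal distribution
```

A positive skew, as reported for bp1 above, therefore indicates a longer tail towards high values.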

To add the interquartile range and the quartiles to the output, we do the following.

newdrug %$% 
    psych::describe(bp1, IQR = TRUE, quant = c(.25, .75))
Table 7.3:
vars  n   mean  sd    median  trimmed  mad   min   max  range  skew   kurtosis  se     IQR   Q0.25  Q0.75
1     50  98.3  5.17  97.7    97.9     2.97  87.5  112  24.2   0.696  0.617     0.731  3.78  95.6   99.4
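
The IQR column is simply the difference between the two quartile columns (99.4 - 95.6 is about 3.78 above, up to rounding). A small base R sketch on made-up values shows the same relationship:

```r
x <- c(10, 20, 30, 40, 50)   # made-up values

# First and third quartiles (R's default quantile algorithm, type 7)
q <- quantile(x, probs = c(.25, .75))

# Interquartile range by hand: Q3 minus Q1
iqr_by_hand <- unname(q[2] - q[1])

iqr_by_hand   # 20
IQR(x)        # same value from the built-in function
```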

10.1.2 Graphs - Histogram

newdrug %>% 
    ggplot(aes(x = bp1)) + 
    geom_histogram(bins = 7, col="black", alpha = .5, fill = "red") +
    labs(title = "Histogram of Blood Pressure before intervention",
         x= "BP1")+
    theme_light()
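
A histogram is a count of observations per bin, and `bins = 7` asks ggplot2 to split the range of the data into seven bins. The sketch below shows the counting step in base R on made-up values; the explicit break points are an assumption chosen for readability, and ggplot2 places its own bin edges.

```r
# Made-up values, roughly on the scale of bp1
x <- c(88, 91, 94, 95, 96, 97, 97, 98, 99, 100, 104, 111)

# Count observations per bin without drawing the plot.
# Bins are (85,90], (90,95], ..., (110,115]; the first also includes 85.
h <- hist(x, breaks = seq(85, 115, by = 5), plot = FALSE)

h$counts   # 1 3 6 1 0 1
h$mids     # bin midpoints: 87.5 92.5 97.5 102.5 107.5 112.5
```

The height of each bar in the histogram corresponds to one of these counts.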

10.1.3 Graphs - Boxplot and violin plot

newdrug %>% 
    ggplot(aes(y = bp1)) + 
    geom_boxplot(col="black",  
                 alpha = .2, 
                 fill = "blue", 
                 outlier.fill = "black",
                 outlier.shape = 22) +
    labs(title = "Boxplot of Blood Pressure before intervention",
         y = "BP1")+
    theme_light()
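
The numbers behind a boxplot can be inspected directly. As a sketch on made-up values, base R's `boxplot.stats()` returns the whisker ends, the hinges and the median, and flags as outliers any points more than 1.5 times the box height beyond the hinges. Note that this function uses Tukey's hinges, so the box edges may differ slightly from the quartiles that geom_boxplot draws.

```r
# Made-up values with one unusually high observation
x <- c(90, 94, 95, 96, 97, 98, 99, 100, 101, 112)

bs <- boxplot.stats(x)   # default coef = 1.5

bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
           # 90.0 95.0 97.5 100.0 101.0
bs$out     # points beyond the whiskers: 112
```

Here the box spans 95 to 100, so the upper fence is 100 + 1.5 * 5 = 107.5, and 112 falls outside it.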

10.1.3.1 Graphs - Density plot

newdrug %>% 
    ggplot(aes(x = bp1)) + 
    geom_density(col="black", fill = "yellow", alpha=.6) +
    labs(title = "Density Plot of Blood Pressure before intervention",
         x = "Blood Pressure before intervention")+
    theme_light()
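
A density plot is a smoothed version of the histogram, scaled so that the area under the curve is 1. The sketch below checks this with base R's `density()` on simulated values; the sample is made up and only loosely mimics the scale of bp1.

```r
set.seed(1)                         # reproducible made-up sample
x <- rnorm(50, mean = 98, sd = 5)   # hypothetical values, not the real bp1

d <- density(x)            # Gaussian kernel with the default bandwidth
step <- d$x[2] - d$x[1]    # spacing of the evaluation grid
area <- sum(d$y) * step    # Riemann approximation of the area under the curve

area    # close to 1
d$bw    # bandwidth chosen by the default rule
```

A larger bandwidth gives a smoother curve and a smaller one follows the data more closely.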

10.1.3.2 Graphs - Cumulative Frequency plot

newdrug %>% 
    group_by(bp1) %>% 
    summarize(n = n()) %>% 
    ungroup() %>% 
    mutate(cum = cumsum(n)/sum(n)*100) %>% 
    ggplot(aes(y = cum, x = bp1)) +
    geom_line(col=3, linewidth=1.2)+
    labs(
        title = "Cumulative Frequency Plot of Blood Pressure before intervention",
        x = "BP1",
        y = "Cumulative Frequency")+
    theme_light() 
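
The cumulative percentage plotted above is the share of observations at or below each value. As a sketch on made-up values, the same quantity can be computed by hand or with base R's `ecdf()`:

```r
x <- c(90, 95, 95, 97, 100)   # made-up values

# By hand: count each distinct value, accumulate, convert to percent.
# This mirrors the group_by() / cumsum() steps used for the plot.
cum_pct <- cumsum(table(x)) / length(x) * 100
cum_pct           # 20 60 80 100 for the values 90, 95, 97, 100

# Built-in: ecdf() returns a step function giving the proportion of x <= v
F_hat <- ecdf(x)
F_hat(95) * 100   # 60
```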

10.1.4 Multiple Continuous variables

10.1.4.1 Measures of Central tendency & Dispersion

newdrug %>% 
    select(where(is.numeric)) %>% 
    psych::describe()
Table 6.5:
vars  n   mean  sd    median  trimmed  mad   min   max   range  skew    kurtosis  se
1     50  61.5  6.51  63      62       4.45  45    75    30     -0.602  0.157     0.92
2     50  98.3  5.17  97.7    97.9     2.97  87.5  112   24.2   0.696   0.617     0.731
3     50  88.6  4.56  88.2    88.5     4.52  78    99.7  21.7   0.252   -0.236    0.645
4     50  9.7   6.2   8.25    8.95     5.49  0.5   26.3  25.8   0.931   0.243     0.877

Rows 1 to 4 correspond to age, bp1, bp2 and bpdiff, in that order.

To illustrate graphing multiple continuous variables, we use the two blood pressure variables. We first reshape them into long format, with one row per measurement:

bps <- 
    newdrug %>%
    select(bp1, bp2) %>% 
    pivot_longer(
        cols = c(bp1, bp2),
        names_to = "measure", 
        values_to = "bp") %>% 
    mutate(
        measure = fct_recode(
            measure, "Pre-Treatment" = "bp1", "Post-Treatment" = "bp2"
            )
        )
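
To see what this reshaping does, here is a small sketch on a made-up two-patient data frame (the `wide` data and its values are hypothetical). Each patient contributes one row per measurement after pivoting:

```r
library(tidyr)

# Hypothetical wide data: one row per patient, one column per measurement
wide <- data.frame(
    id  = c("a", "b"),
    bp1 = c(98, 101),
    bp2 = c(90, 88)
)

# Stack bp1 and bp2: the old column names go into `measure`,
# the readings go into `bp`
long <- pivot_longer(
    wide,
    cols = c(bp1, bp2),
    names_to = "measure",
    values_to = "bp"
)

long   # 4 rows: (a, bp1, 98), (a, bp2, 90), (b, bp1, 101), (b, bp2, 88)
```

The long layout is what lets a single aesthetic such as `fill = measure` separate the two measurements in a plot.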

Next, we create a ridgeline plot, with one density curve per measurement:

bps %>% 
    ggplot(aes(y = measure, x = bp, fill = measure)) +
    ggridges::geom_density_ridges2(col = "black", alpha = .5, scale = 1, 
                                    show.legend = FALSE) +
    labs(x = "Blood pressure (mmHg)", 
         y = "Density", 
         fill = "Blood Pressure") +
    theme_bw()
Picking joint bandwidth of 1.52

The same comparison can be drawn as side-by-side boxplots:

bps %>% 
    ggplot(aes(y = measure, x = bp, fill = measure))+
    geom_boxplot(show.legend = FALSE) +
    labs(y = NULL, 
         x = "Blood Pressure", 
         fill = "Blood Pressure") +
    coord_flip()+
    theme_light() 

or as violin plots:

bps %>% 
    ggplot(aes(y = measure, x = bp, fill = measure))+
    geom_violin(show.legend = FALSE) +
    coord_flip()+
    theme_light() 

10.2 Continuous by a single categorical variable

10.2.1 Summary

Here we summarize the initial blood pressure (bp1) within each treatment group.

newdrug %>% 
    group_by(treat) %>% 
    summarize(mean.bp1 = mean(bp1),
              sd.bp1 = sd(bp1),
              var.bp1 = var(bp1),
              se.mean.bp1 = sd(bp1)/sqrt(n()),
              median.bp1 = median(bp1),
              min.bp1 = min(bp1),
              max.bp1 = max(bp1)) %>% 
    ungroup()
Table 6.10:
treat    mean.bp1  sd.bp1  var.bp1  se.mean.bp1  median.bp1  min.bp1  max.bp1
Control  97.1      3.56    12.7     0.76         97.4        89.8     103
Newdrug  99.2      6.05    36.6     1.14         98.2        87.5     112
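
The se.mean.bp1 column is the standard error of the mean, the standard deviation divided by the square root of the group size. For the Control group above that is 3.56 / sqrt(22), about 0.76. A base R sketch of the same per-group calculation on made-up data (the `toy` data frame and the helper `se_mean()` are hypothetical):

```r
# Made-up data: two groups of unequal size
toy <- data.frame(
    treat = c("A", "A", "A", "B", "B", "B", "B"),
    bp    = c(96, 98, 100, 90, 100, 110, 100)
)

# Standard error of the mean: sd / sqrt(n)
se_mean <- function(v) sd(v) / sqrt(length(v))

# Apply the helper within each level of treat
se_by_group <- tapply(toy$bp, toy$treat, se_mean)
se_by_group
```

Because the standard error shrinks with the square root of n, a larger group gives a more precise mean for the same spread.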

10.2.2 Graph - Histogram, Boxplot, Density plot and cumulative frequency

The graphs are similar to those above, so we skip them.

10.3 Continuous by multiple categorical variables

10.3.1 Summary

The same summary can be computed within each combination of treatment group and sex.

newdrug %>% 
    group_by(treat, sex) %>% 
    summarize(mean.bp1 = mean(bp1),
              sd.bp1 = sd(bp1),
              var.bp1 = var(bp1),
              se.mean.bp1 = sd(bp1)/sqrt(n()),
              median.bp1 = median(bp1),
              min.bp1 = min(bp1),
              max.bp1 = max(bp1),
              .groups = "drop") 
Table 6.11:
treat    sex  mean.bp1  sd.bp1  var.bp1  se.mean.bp1  median.bp1  min.bp1  max.bp1
Control  F    97.2      3.82    14.6     1.15         97.4        90.1     103
Control  M    97        3.47    12.1     1.05         97.5        89.8     102
Newdrug  F    98.6      6.01    36.1     1.55         98.4        87.5     112
Newdrug  M    100       6.25    39.1     1.73         98.1        91.7     112

These groups can be compared in a boxplot:

newdrug %>% 
    ggplot(aes(y = bp1, x = sex, fill = treat)) +
    geom_boxplot()+
    labs(
        y = "Blood Pressure (mmHg)",
        x =  "Sex",
        fill = 'Treatment') +
    theme_bw()