25 Survival analysis

All the aforementioned analysis deals with the presence or absence of an outcome. They do not take into consideration how long it took the outcome to occur. Survival analysis is designed for this purpose, analysis of time to events. For instance, analysing how long it takes for a car to develop a faulty engine, for a person to develop a heart attack and divorce after marriage will require survival analysis.

25.1 Terminologies

In survival analysis, subjects are usually followed for a period of time with our main interest being the duration it takes for the event to occur. To understand survival analysis we need to familiarise ourselves with a few terminologies. Consider Figure 42. At the beginning of the year 2001, a study was initiated to determine the survival of women treated for breast cancer. The six horizontal lines represent six patients recruited for this study and indicate their time of recruitment into the study, how long they were in the study and their outcome.

Survival: This is generally termed as the probability that an individual or a subject survives from the beginning of say a study or observation (where t0 = 0) to a specific time (t) without experiencing the event. So in a study where one is determining the survival of breast cancer patients, the five-year survival is the probability that a patient will be alive at five years after diagnosis or observation. This is also called the survival function and is often represented by S(t).

Hazard: The hazard on the other hand is the probability that an individual who has been under observation from time t0 has the event at time t. Relating this to our breast cancer example the hazard at five years is the probability that a patient would be dead at five years. The hazard function is often represented by \(h(t)\) or \(\lambda(t)\). Note that the survivor function deals with the event (death) not occurring while the hazard function deals with the event (death) occurring.

Event: The event we are interested in is death. In this example, death is the event of interest but this could be divorce, relapse of cancer, etc. depending on the outcome of the interest being studied. In our diagram, the event (death) is coloured red. Patient A was recruited in 2001 but died in 2003 while E was recruited into the study in 2003 but died in 2005. Hence both patients were in the study for about 2 years and experienced the event of death.

Censor: Patients who do not experience the event are usually said to be censored, literally meaning the time they will develop the event cannot be determined within the study period. Patient B for instance was recruited at the beginning of the study but lost to follow-up (green dot) in 2004. Patient D was recruited into the study in 2002 but was also lost to follow up in 2005. Patients C and F both stayed in the study till the end without experiencing the event. For patients B, C, D, and F we have no idea at the end of the the follow-up period when their events will occur. For these, their outcome is termed as censored. These six lines give us the characteristic scenarios that may occur in the right censored data.

Median Survival: It is the time when half the study subjects are expected to have experienced the event. In other words, the chance of surviving (or not experiencing the event) beyond that time is 50% for a population with the disease or exposure being studied.

In this section, we will use the lung data from the survival package. First, we have a look at the details of the lung data by using

library(survival)
lung %>% head()
Table 7.2:
inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
3 306 2 74 1 1 90 100 1175
3 455 2 68 1 0 90 90 1225 15
3 1010 1 56 1 0 90 90 15
5 210 2 57 1 1 90 60 1150 11
1 883 2 60 1 0 100 90 0
12 1022 1 74 1 1 50 80 513 0

For a good understanding of the lung data readers are advised to read the help page revealed by the help() function above. Note that time is the duration each participant was followed up while in the study. status on the other hand indicates if the patient experienced the outcome, in this case, death while being followed up or was censored, indicating death never occurred during the follow-up period.

25.2 The survival object

The initial step in survival analysis in R (and for that matter many other statistical software) is the creation of the survival object. This is achieved in R with the function Surv() from the survival package. The function requires at least two arguments to be supplied. These include the time, indicating the duration each subject was in the study and status indicating whether the event occurred or not. This is demonstrated below using the “lung” dataset in the survival package.

Surv.1 <- Surv(time=lung$time, event=lung$status==2)
Surv.1
 [1]  306   455  1010+  210   883  1022+  310   361   218   166   170   654 
[13]  728    71   567   144   613   707    61    88   301    81   624   371 
[25]  394   520   574   118   390    12   473    26   533   107    53   122 
[37]  814   965+   93   731   460   153   433   145   583    95   303   519 
[49]  643   765   735   189    53   246   689    65     5   132   687   345 
[61]  444   223   175    60   163    65   208   821+  428   230   840+  305 
[73]   11   132   226   426   705   363    11   176   791    95   196+  167 
[85]  806+  284   641   147   740+  163 
 [ reached getOption("max.print") -- omitted 138 entries ]

In the Surv()function above time is supplied as a numeric variable while the event is either expressed as 0 (for no event or censored) and 1 (for the event occurring) or FALSE (for no event or censoring) and TRUE (for the occurrence of the event). Our command above used the logical version (TRUE/FALSE) and created a vector of TRUE (for status of 2) and FALSE (for status otherwise).

The survival object is essentially a conversion of the time into an object that incorporates both the time and the status. Below we combine the time and status from the lung dataset and the survival object. It can be seen that the time variable is the same as the survival object if the patient died but in patients with censored outcomes, a plus (+) sign is attached.

cbind(lung[,2:3], Surv.1) %>% head()
Warning in value[[jj]][ri] <- if (is.factor(xij)) as.vector(xij) else xij:
number of items to replace is not a multiple of replacement length
Table 7.3:
time status Surv0.1
306 2 306
455 2 455
1010 1 1010
210 2 210
883 2 883
1022 1 1022

25.3 The survival fit object

Next, a survival object is created in accordance with the analysis intended to be done. The non-stratified analysis is currently fitted with a 1 on the right side of the formula as shown below while stratified survival is done with the stratifying variable on the right side of the formula.

In the analysis below we try to determine the median survival time for all persons included in the study

sfit.1 <- survfit(Surv.1 ~ 1)
sfit.1
Call: survfit(formula = Surv.1 ~ 1)

       n events median 0.95LCL 0.95UCL
[1,] 228    165    310     285     363

The survival fit object sfit.1 like a linear regression or logistic regression object has very little but important information at first hand. It shows the number of records used, the median survival time (the time at which it is expected that half the patients would have the event) and its confidence interval. Our analysis indicates that after 310 days half the patients would have died! The output is simple but there is more that can be revealed in the attributes. These are displayed below

attributes(sfit.1)
$names
 [1] "n"         "time"      "n.risk"    "n.event"   "n.censor"  "surv"     
 [7] "std.err"   "cumhaz"    "std.chaz"  "type"      "logse"     "conf.int" 
[13] "conf.type" "lower"     "upper"     "call"     

$class
[1] "survfit"

25.4 Kaplan-Meier life table

For a compact view of these attributes, we summarise the survfit object. This is displayed below

summary(sfit.1)
Call: survfit(formula = Surv.1 ~ 1)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    5    228       1   0.9956 0.00438       0.9871        1.000
   11    227       3   0.9825 0.00869       0.9656        1.000
   12    224       1   0.9781 0.00970       0.9592        0.997
   13    223       2   0.9693 0.01142       0.9472        0.992
   15    221       1   0.9649 0.01219       0.9413        0.989
   26    220       1   0.9605 0.01290       0.9356        0.986
   30    219       1   0.9561 0.01356       0.9299        0.983
   31    218       1   0.9518 0.01419       0.9243        0.980
   53    217       2   0.9430 0.01536       0.9134        0.974
   54    215       1   0.9386 0.01590       0.9079        0.970
   59    214       1   0.9342 0.01642       0.9026        0.967
   60    213       2   0.9254 0.01740       0.8920        0.960
 [ reached getOption("max.print") -- omitted 127 rows ]

The long table produced in this output is called the Kaplan-Meier life table. It contains the individual survival probabilities with their 95% confidence intervals at every point in time of the study. For instance, the proportion that survived to the end of day 30 is about 0.9561 (95%CI: 0.9299 to 0.983). The rather long life table can be significantly compacted to show only specific times. For instance, if we decide to show the survival in a life table for every 90 days we can specify these times below

summary(sfit.1, times=seq(0,900,90))
Call: survfit(formula = Surv.1 ~ 1)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0    228       0   1.0000  0.0000       1.0000        1.000
   90    201      27   0.8816  0.0214       0.8406        0.925
  180    160      36   0.7217  0.0298       0.6655        0.783
  270    108      30   0.5753  0.0338       0.5128        0.645
  360     70      24   0.4340  0.0358       0.3693        0.510
  450     48      16   0.3287  0.0356       0.2659        0.406
  540     33      10   0.2554  0.0344       0.1962        0.333
  630     22       7   0.1958  0.0330       0.1407        0.272
  720     14       8   0.1246  0.0290       0.0789        0.197
  810      7       5   0.0783  0.0246       0.0423        0.145
  900      3       2   0.0503  0.0228       0.0207        0.123

25.5 Kaplan-Meier curve

The Kaplan-Meier life table can be plotted to give it a graphical representation called the Kaplan-Meier curve. The resulting curve is a plot of the Survival probability against the time. An example is shown in the figure below and plotted with

survminer::ggsurvplot(
    sfit.1, 
    data = lung,
    palette= 'blue',
    surv.median.line = "hv",
    censor =F,
    title = "Kaplan-Meier Curve with 95% CI",
    ggtheme = theme_bw())

The green dotted line in the plot cuts the curve at the median survival time. Also, the plot has the confidence interval superimposed (dotted red lines).

25.6 Survival for subgroups

To determine the survival separately for subgroups we need to fit the survival object specifying our requisite group. For instance, to determine survival among the different sexes we fit the object as below. First, we make sure the sex variable is in the right format, a factor.

lung <- 
    lung %>% 
    mutate(sex = factor(sex, levels = 1:2, labels = c("Male", "Female")))

And then fit the appropriate survfit object. Note that the right-hand side of the formula this time is sex and not 1.

sfit.2 <- survfit(Surv.1 ~ lung$sex)
sfit.2
Call: survfit(formula = Surv.1 ~ lung$sex)

                  n events median 0.95LCL 0.95UCL
lung$sex=Male   138    112    270     212     310
lung$sex=Female  90     53    426     348     550

Once again we are presented with the survfit object displaying the records, the number of events, the median survival and the confidence intervals. These statistics however are separate for each sex. For instance, we realise that the number of deaths in males was more than twice that for females and the median survival time for males of 270 days is lesser than that for females. As before, for a complete life table, we summarise the survfit object.

summary(sfit.2)
Call: survfit(formula = Surv.1 ~ lung$sex)

                lung$sex=Male 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
   11    138       3   0.9783  0.0124       0.9542        1.000
   12    135       1   0.9710  0.0143       0.9434        0.999
   13    134       2   0.9565  0.0174       0.9231        0.991
   15    132       1   0.9493  0.0187       0.9134        0.987
   26    131       1   0.9420  0.0199       0.9038        0.982
   30    130       1   0.9348  0.0210       0.8945        0.977
   31    129       1   0.9275  0.0221       0.8853        0.972
   53    128       2   0.9130  0.0240       0.8672        0.961
   54    126       1   0.9058  0.0249       0.8583        0.956
   59    125       1   0.8986  0.0257       0.8496        0.950
   60    124       1   0.8913  0.0265       0.8409        0.945
   65    123       2   0.8768  0.0280       0.8237        0.933
 [ reached getOption("max.print") -- omitted 87 rows ]

                lung$sex=Female 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    5     90       1   0.9889  0.0110       0.9675        1.000
   60     89       1   0.9778  0.0155       0.9478        1.000
   61     88       1   0.9667  0.0189       0.9303        1.000
   62     87       1   0.9556  0.0217       0.9139        0.999
   79     86       1   0.9444  0.0241       0.8983        0.993
   81     85       1   0.9333  0.0263       0.8832        0.986
   95     83       1   0.9221  0.0283       0.8683        0.979
  107     81       1   0.9107  0.0301       0.8535        0.972
  122     80       1   0.8993  0.0318       0.8390        0.964
  145     79       2   0.8766  0.0349       0.8108        0.948
  153     77       1   0.8652  0.0362       0.7970        0.939
  166     76       1   0.8538  0.0375       0.7834        0.931
 [ reached getOption("max.print") -- omitted 39 rows ]

The rather long life table can best be appreciated with a KM curve. We further go ahead to plot the Kaplan-Meier curve for the two sexes as shown in Figure 44 and plot with

plot(
    sfit.2, 
    col=1:2, 
    lty=1:2, 
    mark.time=F,
    main="K-M curve by sex",
    xlab = "Days", ylab="Survival")
legend("topright", c("Male","Female"), lty=1:2, col=1:2)
abline(h=0.5, lty=2, col=3)
survminer::ggsurvplot(
    sfit.2, 
    data = lung,
    surv.median.line = "hv",
    censor = F,
    title = "Kaplan-Meier Curve by sex (with 95% CI)",
    ggtheme = theme_bw(),
    conf.int = T,
    xlab = "Days")

The figure above shows the KM survival curve separately for males and females. It can be seen that the two plots are completely separate, a sign that the survival is not the same in both sexes. The steeper curve for the males indicates they have a relatively worse outcome over time than the females. The median survival time for instance (indicated by the green dotted line) intersects the males’ curve much earlier than the females indicating a much lower median survival time for the males compared to the females.

25.7 Cumulative Incidence (Hazard)

Another way of expressing the relationship between time to an outcome and the outcome is with the Cumulative Incidence (CI). The CI at any point in time is a measure of the frequency of the outcome over a period of time. This can be determined in R as below. First, we fit the survival object.

sfit.3 <- 
    survfit(
        Surv(
            time = time, 
            event = status==2, 
            type = "mstate") ~ 1, 
        data=lung)
sfit.3
Call: survfit(formula = Surv(time = time, event = status == 2, type = "mstate") ~ 
    1, data = lung)

       n nevent   rmean*
(s0) 228      0 376.2747
TRUE 228    165 645.7253
   *restricted mean time in state (max time = 1022 )

The survival object is similar to that done previously but this time in fitting it we used the argument type="mstate" to specify our preference for the cumulative incidence. We further use the summary() function on the survfit object to generate the CI table.

sfit.3 %>% broom::tidy() %>% head(10)
Table 13.1:
time n.risk n.event n.censor estimate std.error conf.high conf.low state
5 228 0 0 0.995614 0.00437634 1        0.987073 (s0)
11 227 0 0 0.982456 0.00869464 0.999646 0.965562 (s0)
12 224 0 0 0.97807  0.00969918 0.997266 0.959244 (s0)
13 223 0 0 0.969298 0.0114246  0.991951 0.947163 (s0)
15 221 0 0 0.964912 0.0121858  0.989094 0.941322 (s0)
26 220 0 0 0.960526 0.0128956  0.986137 0.935581 (s0)
30 219 0 0 0.95614  0.0135621  0.983094 0.929925 (s0)
31 218 0 0 0.951754 0.0141914  0.979979 0.924342 (s0)
53 217 0 0 0.942982 0.0153564  0.973566 0.91336  (s0)
54 215 0 0 0.938596 0.015899   0.970281 0.907947 (s0)

Similar to the survival the cumulative incidence table can be drawn as above. The obvious addition here is the presence of the prevalence column, indicating the cumulative prevalence of the outcome at each time point.

A pictorial overview of the cumulative incidence can also obtained using the commands below and shown in the figure below.

25.8 Comparing survival between two or more groups

It often arises that survival or hazard between two or more groups is compared. For instance in a randomised controlled trial to determine the survival in patients with a chronic disease 3 drugs can be given to three randomly selected groups. After a period of follow-up, one would want to compare which group survived best. To do this various non-parametric tests are available to help. Standing tall among them is the log-rank test (Peto et al, 1977) which we are going to apply.

25.8.1 The Log-Rank test

The idea behind the log-rank test is to test the proportional hazards among the groups to be compared. The test has a null hypothesis that the hazard is equal in all the populations under consideration. The p-value generated is the probability that the hazard is equal in all populations under consideration. Hence a small p-value indicates some differences exist at least between two members of the group. Below we use the log-rank test to determine if the hazards in the two sexes are significantly different for our lung cancer patients. Remember pictorially and by comparison of the median survival we already know that the females seem to do better than the males but here we put it to a formal test. This is done in R using survdiff() and the initial survival object created.

survdiff(Surv.1 ~ lung$sex)
Call:
survdiff(formula = Surv.1 ~ lung$sex)

                  N Observed Expected (O-E)^2/E (O-E)^2/V
lung$sex=Male   138      112     91.6      4.55      10.3
lung$sex=Female  90       53     73.4      5.68      10.3

 Chisq= 10.3  on 1 degrees of freedom, p= 0.001 

The small p-value above (p= 0.00131) leads us to the conclusion that the hazards within the two groups significantly differ.

25.8.2 Stratified Log-Rank test

In trying to figure out why females tend to do better than males with lung cancer we ask ourselves a few questions. One of these could be: Could it be because there is a difference in ages between the two sexes that took part in the study? To begin to evaluate this assertion we compare the ages of the two sexes.

lung %>% 
    group_by(sex) %>% 
    meantables::mean_table(.x=age)
Table 13.3:
response_var group_var group_cat n mean sd sem lcl ucl min max
age sex Male 138 63.34 9.14 0.777983 61.8  64.88 39 82
age sex Female 90 61.08 8.85 0.932588 59.22 62.93 41 77

Well, it appears the males are on average older than the females. Could this be the reason why the males did worse? To remove the possibility of age being the reason why males may be doing worse we perform a stratified log-rank test below

survdiff(Surv.1 ~ lung$sex + strata(lung$age))
Call:
survdiff(formula = Surv.1 ~ lung$sex + strata(lung$age))

                  N Observed Expected (O-E)^2/E (O-E)^2/V
lung$sex=Male   138      112     96.6      2.47      9.42
lung$sex=Female  90       53     68.4      3.48      9.42

 Chisq= 9.4  on 1 degrees of freedom, p= 0.002 

The analysis above asks the question: Assuming the ages of all the sexes were equal among both sexes, would the survival still depend on the sex of the patient? After the analysis, it appears the age difference is not the reason for the difference in hazards observed. The significant p-value is hardly affected when we stratify by age.

25.9 Cox regression

Similar to our observation when we performed the chi-squared test in categorical data analysis, we only generated a p-value but no comparable effect like the odds ratio and risk ratio when we used the log-rank test. We cannot for instance say that the hazard in the males is twice that of the females. This is where Cox regression analysis kicks in. Cox regression analysis generates the Hazard Ratio (HR) a ratio similar to the odds ratio but this time compares the hazards in the various groups. Like the OR, HR has a null value of 1, when the hazard is equal in both groups being compared. Cox regression has the added advantage of being able to determine the HR for more than one variable at a time thereby correcting for their respective effects and stratifying at the same time. Cox proportional modelling is implemented in R using the function coxph(). Below we determine again the association between sex and the hazard in the lung data.

cox.1 <- coxph(Surv.1 ~ sex, data = lung)
cox.1 %>% 
    broom::tidy(exponentiate = TRUE, conf.int = TRUE)
Table 13.6:
term estimate std.error statistic p.value conf.low conf.high
sexFemale 0.588003 0.167179 -3.17638 0.00149123 0.423718 0.815985

Just like other regression functions in R, the output is scanty. However, the negative coeff indicated that the hazard is less in females compared to males (the referent group). Exponentiating the coefficient gives the hazard ratio. This is shown as the second column of the matrix with the heading exp(coef) and with a value of 0.588. This value indicates that females have about half the hazard of males. This is in concert with what we already know. The last column is the respective p-value associated with the coefficient or hazard ratio. It tests the hypothesis that the hazard ratio is equal to 1 (the null value). Therefore a small p-value makes this a significant finding. The last 2 columns are the confidence interval.

As was mentioned one of the advantages of using the Cox proportional regression is our ability to put in more than one variable at a time. Below we determine the hazard ratios for both sex and age concurrently and after correcting for each other.

cox.2 <- coxph(Surv.1 ~ sex + age, data = lung)
cox.2 %>% 
    broom::tidy(exponentiate = TRUE, conf.int = TRUE)
Table 13.7:
term estimate std.error statistic p.value conf.low conf.high
sexFemale 0.598566 0.167458   -3.06476 0.00217845 0.431094 0.831099
age 1.01719  0.00922327 1.84808 0.064591   0.998969 1.03575 

And then finally we determine the HRs while stratifying for the ECOG performance score (0=good 5=dead)

cox.3 <- coxph(Surv.1 ~ sex + age + strata(ph.ecog), data = lung)
cox.3 %>% 
    broom::tidy(exponentiate = TRUE, conf.int = TRUE)
Table 25.1:
term estimate std.error statistic p.value conf.low conf.high
sexFemale 0.574924 0.170873  -3.23935 0.00119802 0.411304 0.803633
age 1.01083  0.0092538 1.16408 0.244392   0.992662 1.02933 

Putting all three into one Cox’s regression equation we have the table below

lung %>%
    select(sex, ph.ecog, age, status, time) %>%
    survival::coxph(
        survival::Surv(event = status==2, time = time) ~ ., 
        data = .) %>%
    gtsummary::tbl_regression(
        label = list(
            age ~ "Age (years)",
            sex ~ "Sex",
            ph.ecog = "PH.ECOG"),
        pvalue_fun = ~ gtsummary::style_pvalue(.x, digits = 3),
        exponentiate = TRUE) %>%
    gtsummary::bold_labels() %>%
    gtsummary::bold_p()
Characteristic HR1 95% CI1 p-value
Sex


    Male
    Female 0.58 0.41, 0.80 <0.001
PH.ECOG 1.59 1.27, 1.99 <0.001
Age (years) 1.01 0.99, 1.03 0.232
1 HR = Hazard Ratio, CI = Confidence Interval

Conclusion: It can be concluded that the hazards differ significantly between the two sexes, with females surviving better than males with lung cancer. Though the ages of the males in our study were relatively higher than the females this difference in age could not explain completely why males did worse than females. In other words, the difference in survival for the two sexes is irrespective of their age.