Time-to-event modeling with the MonolixSuite: a simple TTE model for the Veterans’ Administration Lung Cancer study

Download data set onlyDownload all Monolix project files | Download simulx R script

This case study is a detailed modeling and simulation workflow on a TTE data set. It is recommended to read before the Introduction to TTE data and library of models in Monolix.

A second case study on time-to-event data is also available: Case study on NCCTG lung cancer data set

In this case study, we develop a model for the survival of patients with advanced lung cancer. The effect of several covariates, including a novel chemotherapy, is investigated.


In a study conducted by the US Veterans Administration, male patients with advanced inoperable lung cancer were given either a standard therapy or a test chemotherapy. Time to death was recorded for 137 patients, while 9 left the study before death. Various covariates were also documented for each patient.
The primary goal is to assess if the test chemotherapy is beneficial. Secondary goals include the analysis of covariates as prognostic variables.
This data set has been initially presented and analyzed in:

D Kalbfleisch and RL Prentice (1980), The Statistical Analysis of Failure Time Data. Wiley, New York.

Data set visualization

The data set includes data for 137 patients, 9 of them are right censored. According to the data set formatting guidelines, the data set includes for each patient a line indicating the start time (time=0 and observation Y=0), and a line indicating either the time of death (Y=1) or the time at which the patient has left the study (Y=0).
In addition the following covariates were recorded:

  • Treatment (“trt” column): “standard” or “test” (tested new chemotherapy)
  • Histological type of the tumor (“celltype” column): “squamous”, “smallcell”, “adeno” or “large”
  • Karnofsky score (“karno” column): performance score that describes the overall patients status at the beginning of the study. The scale goes from 0 (dead) to 100 (completely healthy). More details about the scale can be found here.
  • Time between diagnosis and start of the study (column “diagtime”): in months
  • Age (column “age”): in years
  • Prior therapy (column “priortherapy”): indicates if the patient has received another therapy before the current one

A snapshot of the data set in shown below:

We first start with a visualization of the data set using Datxplore. We open Datxplore, click on Project > New project/data and Browse to select the data set. The columns are assigned in the following way (the covariates must be assigned to CONTINUOUS COVARIATE for continuous covariates or CATEGORICAL COVARIATE for categorical covariates):

Before using the data visualization feature of Monolix, we must indicate that the data is of type TTE (rather than continuous). We can now accept the column assignments by clicking on “Accept” before clicking on DATA VIEWER in the DATA frame.The Kaplan-Meier (KM) estimate of the survival curve as well as the mean number of events curve can be seen. In “Settings”, we can choose to focus only on the KM survival curve and display the censored data.

Remark: In Datxplore KM curves can not be stratified by color. However, in the Monolix plot “Observed data” (in the Plots tab), you can split a KM curve and then a “merge” button appears. It will merge the plots on a one figure with different colors (fixed) for each curve.

The plot for “Observed data” is available without running Monolix tasks, but it requires choosing a model. You can take the simplest model from the TTE library, if you are interested only in the data visualization. Two following figures below are obtained in Monolix with the “merge” option.

In the “Stratify” menu, we can split the KM curve according to categorical covariates, or categorized continuous covariates (groups can be changed with the “modify” button).  Stratifying by treatment trt (and using the “merge” option) leads to the following figure:

At first sight, it seems that there is no clear difference in survival for the standard treatment group and the chemotherapy treatment group. For the Karnofsky score, we create two groups (Group1= karno<50, and Group2= karno>50) and stratify:

Here it seems that patients with higher Karnofsky score (higher general health status) have a tendency to survive longer. We can go similarly through the other covariates.

Modeling with Monolix

We now would like to develop a model for this data set, in order to analyze the influence of the covariates, in particular the treatment. Within the MonolixSuite, the model must be fully specified via its hazard function (no non-parametric or semi-parametric approaches).

Structural model

We start with the specification of the structural model. In the TTE introduction, we have presented a library of common TTE models (defined via their hazard functions), and have detailed the typical shape of the resulting survival curve. Using the KM curve visualization and the overview of the library models presented in the TTE introduction, it seems that an exponential model could be appropriate. We thus choose the file “exponential_model_singleEvent.txt” (because death happens only once per individual) from the library.
We next need to define the distribution of the individual parameters. Most of the time in survival analysis, one assumes that the individual parameters are influenced by the covariates but that all individual with the same covariates have the same individual parameters and thus the same hazard function (but different times of death).  On the opposite, we can also assume a random effect for the individual parameters (often called frailty models), which permits to capture heterogeneity between individuals, which cannot be explained by the covariates. Because Monolix estimation algorithms and plots are most powerful when parameters include random effects, we will start with this approach.

The model contains a single parameter, the characteristic time Te. As this parameter should remain positive, we keep the default log-normal distribution. An initial value for this parameter can be chosen with help of the library overview, for instance 100 for this data set.
The model being fully defined, we can launch the estimation tasks: estimation of the population parameters, estimation of the standard errors via the Fisher Information Matrix, estimation of the log-likelihood and generation of the plots (individual parameters have been skipped because no individual fits plots are possible with TTE data). The Te_pop parameter very rapidly converges towards a random walk around the maximum likelihood estimate (MLE) in the exploratory phase of SAEM (before the red line, see figure below), and then converges to the MLE in the smoothing phase (after the red line). The estimated value (visible using the “Last results” button) is 109, close to our initial estimate. The standard deviation of the random effects (omega_Te parameter) converges to 0.63, which corresponds to a coefficient of variation of \( \textrm{CV} = \sqrt{e^{{\omega_{T_e}}^2} – 1} = 50 % \). The large inter-individual variability probably reflects that sub-populations within the data set have different Te values. Considering covariates on the Te parameter to reduce the random effects is the topic of the next section.

To assess if the exponential model was a good choice, we can display the VPC for Time-to-Event data (click on the list icon next to the task Plots and in the pop-up window, on “Visual Predictive Check”). The figure displays the empirical KM curve overlayed with the 90% prediction interval calculated from model simulations. The figure is then interpreted similarly to a VPC. Here the empirical KM curve lies within the prediction interval, we conclude that there is no model mis-specification.

Covariate search

The next step is to assess if the hazard is significantly different for sub-populations (for instance with the standard treatment or the chemotherapy). For model selection several statistical tools can be used, such as information criteria (AIC and BIC) or hypothesis tests (Wald test and likelihood ratio test). In this case study, we will use the Wald test.

Covariates can be added by clicking on the boxes in the “Individual model” section:

The details of the resulting equations can be seen by clicking on “Formula”. If we add the categorical covariate trt on parameter Te, we obtain:

$$T_{e,i} = \begin{cases} T_{e,\textrm{pop}}e^{\eta_i} & \text{if} \quad \textrm{trt}=0\textrm{  (reference)} \\ T_{e,\textrm{pop}}e^{\beta_{\textrm{trt}}}e^{\eta_i} & \text{if} \quad \textrm{trt}=1 \end{cases} $$

If we instead add the continuous covariate karno on parameter Te, we obtain:


Note that this parameterization corresponds to the typical form of a Cox proportional model. Via transformation of the covariates, other forms are also possible.

The parameterization using betas permits to directly perform statistical tests on the beta parameters. In particular, the Wald test tests the following null hypothesis:

$$H_0:\quad \textrm{”}\beta=0\textrm{”} \quad \textrm{versus} \quad H_1:\quad \textrm{”}\beta\neq 0\textrm{”}$$

The associated p-value is reported in the output file. If the p-value is smaller than a threshold (usually 0.05), we can consider that \(\beta\) is significantly different from zero.

We create 8 models, each having one covariate on the Te parameter. Here are the p-value results for each covariate (each model):

The covariates karno and celltype appear as significant, while all others not. In particular, given the data, the new chemotherapy does not significantly lower the risk, compared to the standard treatment.
Next, we can try a model with both karno and celltype. Because the difference between the “smallcell” cell type and the “adeno” cell type (used as reference) is not significant, we can choose to group these two into one category. Using the “Discrete” button, we create a new categorical covariate, where “adeno” and “smallcell” are put into the G1 reference group.

The results are the following:

From the estimated beta values, we can easily calculate that the risk of death is around exp(0.851)=2.3 times higher for patients with squamous cell types, compared to adeno or small cell typesIn the graphics, the 95% prediction interval for the survival can be stratified by cell type and/or karno score categories, to verify the agreement between model and data:

No model mis-specification can be detected. We thus consider this model as our final model.

The final model is:

$$h_{i}(t)=\begin{cases} \frac{1}{T_{e,\textrm{pop, adeno/smallcell}}e^{\beta\times \textrm{karno}}e^{\eta_i} }& \text{if} \quad \textrm{celltype}=\textrm{adeno or smallcell (G_adeno_smallcell)} \\ \frac{1}{T_{e,textrm{pop, large}}e^{\beta\times \textrm{karno}}e^{\eta_i} }& \text{if} \quad \textrm{celltype}=\textrm{large (G_large)}\\ \frac{1}{T_{e,\textrm{pop, squamous}}e^{\beta\times \textrm{karno}}e^{\eta_i}} & \text{if} \quad \textrm{celltype}=\textrm{squamous (G_squamous )} \end{cases} $$

For a given cell type and a given karno score, we can easily calculate analytically the typical hazard and the associated exponential model survival (i.e if \(h(t)=\lambda\) then \(S(t)=e^{-\lambda t})\). From the survival function, we calculate the probability to survive at least 1 month (30 days), 3 months (90 days) or 6 months (180 days), depending on the cell type and karno score:

The probability density function of the death event can also be calculated for any given karno score and cell type ( \( f(t)=h(t)*S(t) \) ), giving the probability of death over time. For each displayed subpopulation, the plot also represents the distribution of death times in the subpopulation.

Simulations using Simulx

Now that we have developed a parametric model for the survival of patients with advanced lung cancer, we can use the model to explore the expected survival for new cohorts of patients. To perform simulations, we use the simulx function of the R package mlxR. mlxR version >= 3.3.1 is required to run the scripts. Using the Monolix project as input, we can either simulate the same population as in the original data set, or define new populations.

We first simulate the same population as in the original data set. Because, for a given hazard, time of death is a random variable, the Kaplan-Meier survival curve will be different compared to the original data. The simulation is done via the simulx function which returns an R object containing the result of the simulation (time of death for each individual). We can then pass this object to kmplotmlx to obtain the Kaplan-meier survival curve:

# loading the mlxR library

# defining the path to the mlxtran project file
project.file <- "./monolixproject/project_veteran_3_celltype_karno.mlxtran"

# calling simulx, giving the path to the project as argument
res <- simulx(project = project.file)

# plotting the KM survival curve
kmplotmlx(res$Event) + xlab("Time (days)")

We obtain the following Kaplan-Meier survival curve, similar (but different) from that of the original data set:

We next simulate two new patients cohorts (with different covariates compared to the original data set). The first cohort is composed of patients with a quite good karno score, around 70, and mostly with squamous cell types (70% squamous, 10% adeno, 10% smallcell, 10% large). The second cohort contains patients having mostly adeno or smallcell celltypes (40% adeno, 40% smallcell, 10% large, 10% squamous) and a low karno score, around 40.

To simulate these two cohorts, we create two groups, each defined via a data frame of the individuals covariates, and pass them as simulx input argument. For instance for group 1:

# covariate data frame for group 1, with column id and one column per covariate
cov1 <- data.frame(id=1:137,
                   karno = rnorm(137,mean=70, sd=5),
                   celltype = c(rep("squamous",100),rep("large",12),rep("smallcell",12),rep("adeno",13)),
                   diagtime = NaN,                                #unused in model but must be present
                   age = NaN,                                     #unused in model but must be present
                   trt = c(rep("test",100),rep("standard",37)),   #unused in model but must be present
                   priortherapy = c(rep("yes",100),rep("no",37))) #unused in model but must be present

#============= calling simulx for group 1
res1 <- simulx(project = project.file,
               parameter = cov1)

After having put together the results for group 1 and group 2, we obtain the following prediction:

Because the time of death is a random variable, several simulations of the same population will lead to different survival curves. We can assess the uncertainty of the survival curve by doing replicates. This can be done very easily by adding an argument nrep:

res1 <- simulx(project = project.file,
               parameter = cov1,
               nrep = 100)

We then obtain a prediction interval for the two survival curves:


Monolix permits an efficient parametric analysis of this TTE data set. The risk of death for male patients with advanced inoperable lung cancer appears to be constant over time (exponential distribution of death times). Prognostic factors are the histological type of the tumor (cell type) and the overall health Karnofsky score. Unfortunately, no improvement with the new treatment can be shown, given the data. The parametric nature of the model permits to easily predict the survival probability for a typical subject with a given celltype and a given karno score.

Using Simulx, the developed parametric model can be used for more advanced predictions. Indeed, we can define new populations with a complex mix of covariates (different celltypes and different karno scores) and simulate the expected survival for this population. The uncertainty of the prediction can easily be assessed via replicates.