VPC for Time-To-Event data

The construction of a VPC plot for time-to-event data in Monolix requires several steps. First, Monolix simulates datasets using the population model, which contains the structural and statistical models and the estimated population parameters. Then, for each simulated dataset, it estimates the survival function with the Kaplan-Meier estimator for exactly observed events or Turnbull for interval censored. Because in time-to-event problems, time itself is an observation, simulated curves are interpolated on a fixed grid. As a result, at each time point of the new grid, there is a set of probabilities that events occur after that point. Finally, Monolix calculates lower and upper percentiles, which give a prediction interval.

 







Simulation of data sets using a population model

Simulations of time-to-event datasets give, for each individual, times when an event occurs. They keep the same design as in the original data set, the same number of individuals as in the original data set and individual parameters are sampled using the covariates from this original data set and the population parameters estimated in Monolix.

To simulate a time when an event occurs, \(u\) is defined as a probability that an event \(X\) occurs before a certain time \(T\). This probability is computed using the cumulative hazard function \(H(t)\)

 u = P(X\leq T) = 1-exp\left(-\int_0^T h(t,\psi_i) dt\right) = 1-exp(-H(T)),

where \(H\) is a solution to the ODE system \(dH/dt = h\). A value of the probability \(u\) is sampled from a uniform \([0,1]\) distribution, thus in the above equation the time T becomes the unknown. A root finding method from an ODE solver finds its value.

The time T is not a given independent variable, but it is a simulated observation. Moreover, simulations are only till time \(T_{max}\), which corresponds to the last time found in the data set. This value is the final time in integration of ODEs for each individual and it also defines a unique right censoring time for all individuals in the simulated data sets. It is independent of the individual times of events or right censoring in the original data set.

Example:

Individual 3 in the original data set (table on the left) experienced an event at time 4. In simulations (tables on the right) an event for this individual can occur before or after that time, or may occur after \(T_{max}\)  (last table). Similarly, individual 5 left the study at time 3 without experiencing an event in the original data set. But in simulations, an event can occur before or after time 3. If an event has not occurred at \(T_{max}=5\), then it is right censored.

Survival function

Monolix estimates survival functions for each simulated dataset. Depending on the event type, there are two methods:

  1. Kaplan-Meier estimator for exactly observed events. Simulated survival step-functions are directly interpolated on a fixed grid (see next section), while the linear approximation is applied to the empirical  Kaplan-Meier curve.
  2. Turnbull estimator for interval censored data. Monolix uses this method for VPC only if a user includes the “eventType = intervalCensored” option in the structural model
    DEFINITION: 
    Event = {type = event, eventType = intervalCensored, intervalLength = L,
                    maxEventNumber = 1, hazard = h}

    Plotting the empirical survival curve uses Turnbull intervals constructed from the censored intervals of the original data set.

    Simulations of interval censored datasets give, as in the case of exactly observed events, exact times of events for each individual. Then, events are counted in each interval. The definition of intervals depends on whether the structural model contains the option “intervalLength” or not.
    –    If “intervalLength” L is given, then the intervals are

    [0, L], (L, 2\cdot L],\dots , (m\cdot L,T_{max} ]

    –    If “intervalLength” is missing, then the observation period \([0,T_{max}]\) is divided into 10 equal intervals.

 

Prediction interval

To generate the prediction interval in the VPC plot, Monolix performs by default 500 simulations with N individuals as in the original data set. As a result, it produces 500 time series that form 500 survival curves (based on Kaplan-Meier or Turnbull estimators). Since simulated event times are random numbers, all estimated survival functions have different time grids (Fig.(a)). The interpolation of these curves on a fixed time grid allows to calculate the prediction interval (Fig.(b)). In Monolix the default time grid consists of a uniform grid with 100 points. For exactly observed events the grid includes in addition all time points from the original data set to provide all possible information. Exact event times are unknown for interval censored events, so dataset time points have no additional information. The interpolation on the new grid gives at each time point of the grid a set of probabilities corresponding to different simulated survival curves. It allows to calculate lower and upper percentiles (the 5th and 95 percentiles are a default choice) and draw the prediction interval (Fig.(c)).

Example 1: exactly observed events.

The dataset contains times of exactly observed single events of 30 individuals. The survival model has a constant hazard function and the Kaplan-Meier estimator approximates the survival function. The VPC plot uses a linear approximation of the Kaplan-Meier curve (Fig. on the left). By default, the fixed grid has 100 uniformly distributed points together with all time points from the date set. The figure on the left presents the VPC plot with the default settings. The approximation of the empirical and simulated Kaplan-Meier curves is sufficient on the default grid. In contrast, the VPC on the Figure on the right uses a grid with only time points from the original dataset. The accuracy of the prediction interval is decreased.

Example 2: interval censored events.

In this dataset, the times of events from the previous example are the ends of time intervals within which events occurred. The length of the censored intervals is non constant among intervals, and oscillates around 5. The structural model includes this information in eventTime and intervalLength options.

 

Survival function estimators

 

Exact events: Kaplan-Meier estimator

The Kaplan-Meier estimator is a standard method to approximate the survival function \( S(t) \). It describes the probability that an individual survives until time t, knowing that it survived at any earlier time. For single events, it is given by the following formula

\hat{S}(t)=\prod_{i:t_i<t} \left(1-\frac{d_i}{n_i}\right),

where

  • \(t_i\) – times before t, when at least one event occurred,
  • \(d_i\) – number of events at the time \(t_i\),
  • \(n_i\) – number of individuals at risk, that is who did not experience an event until \(t_i\).

The probability \((p_e)\) of an event is the ratio between the number of occurred events \((d_i)\) and the number of individuals at risk \((n_i)\). The complement of it, \((1-p_e)\), gives an estimation of the survival. For each time t, the total number of individuals at risk, who survived, may change. Thus, the current survival probability is the product of all probabilities at previous times \(t_i\), when at least one event occurred. It is similar to calculating the probability that a patient survives 2 days. It is a product of two probabilities: the probability  that the patient survives the first day and a conditional probability that it survives the second day, knowing that it survived the first one.

Example:

On the left of the figure below, a typical example of a time-to-event data set contains information about exact times when individuals experienced an event or when they left the study (drop-out). Five individuals have two observations: the time at which the observation starts ( 0 for everyone) and the time of an event. If a patient leaves the study, then the dataset contains a drop-out time with 0, instead of 1, in the column for the observations. The patient didn’t experience an event but survived until the drop-out time. The Kaplan-Meier estimator takes into account such situations, because these individuals are not counted as individuals at risk (they are not counted in the denominator \(n_i\)) at later times.

The study starts at time \(t_1\). There are no events, so \(d_1=0\) and the value of the Kaplan-Meier estimator is 1. Until the next event time at \(t_2=1\), the estimator remains constant. At that time, one individual experienced an event, hence \(d_2=1\), and all individuals survived until that time so \(n_2=5\). As a result, the probability to survive decreases by 0.2. It is the height of the jump at \(t=1\) in the plot. The Kaplan-Meier estimator remains constant until the next event. At time \(t=3\) there are two events. The number \(n_3\) counts now only 4 individuals – it has decreased by 1 due to the previous event. The final probability at time 3 is a product of all earlier probabilities. At \(t=4\) there is a drop-out: patient 5 left the study without experiencing an event. The survival curve remains constant and a red dot marks the drop-out. The Kaplan-Meier estimator takes into account this situations, because at the next event time \(t=5\), this individual is not counted as an individual at risk – thus the denominator n will be smaller. Finally, at time \(t=5\), there is only one individual left, and one event, so the survival becomes 0.

Remarks

  • The Kaplan-Meier estimator handles correctly the information about individuals who left the study, but there is a bias when the exact times of events are unknown.
  • In data visualization, Monolix assumes that all events are exactly observed. For example: assume that an observation period started at \(t=0\) and at \(t=1\) an event is marked by 1 in the column for the observation. Without any other information, the event could have occurred at \(t=1\) or before. The data set is similar when it consists of the beginning of the study and time interval limits. Just looking at the data set, the case of exact and interval censored events are indistinguishable.
  • If event type is not specified in the structural model, Monolix considers all interval censored events as exactly observed at the end of a censored interval.

 

Interval censored events: Turnbull estimator

 

Interval censored events do not occur at exact times \( t_i \), \(i = (1,…,n) \), but within some time intervals between times \( L_i \) and \( U_i \), \( L_i < t_i \leq U_i \). In case of right censored data, eg. dropout, an event occur within the interval \( (L_i, +\infty) \), where \( L_i\) is the time when an individual leaves the study.

Non parametric maximum likelihood estimator (NPMLE)

The expectation-Maximisation (EM) algorithm is a technique to estimate parameters using “incomplete data”. For example, interval censored data can be used to estimate survival. The algorithm starts with some initial estimate and “expects” missing values, knowing the initial parameters. Then, it maximizes the likelihood considering the expected data as the given data. These processes alternate until some stopping criteria is satisfied.

Notation:

  • The grid of times is constructed from all points \( L_i, U_i \) for \( i = 1,…,n\) in increasing order, i.e.

     0 = \tau_0\leq\tau_1\leq\dots\leq\tau_m\leq\tau_{m+1}=\infty

  • For each individual \( i\): \( \alpha_i \) denotes a weight such that

     \alpha_i = 1\cdot { (\tau_{j-1} , \tau_j] \subset (L_i, U_i] }.

    The Weight \(\alpha_{ij} \) is an event indicator – it indicates whether an event from the interval \( (L_i, U_i] \) occurred at \( \tau_j \).

  • Intervals \( (\tau_{j-1} , \tau_j] \) are called Turnbull intervals.
  • The probability that an event occurs in a j-th Turnbull interval is

     p_j = P(\tau_{j-1}\leq T\leq\tau_j) = S(\tau_{j-1}) - S(\tau_j)

The likelihood function for \(p = (p_1, …, p_{m+1})\) is:

L_S(p)= \prod_{i-1}^n [S(L_i) - S(R_i)] = \prod_{i-1}^n\sum_{j=1}^{m+1}\alpha_{ij} p_j

Finding NPMLE of S, denoted by \(\hat{S}\), means maximizing \(L_S(p)\) under the constraint that \(p_j\) are positive and \(\sum_{j=1}^{m+1} p_j = 1\).
Remark: \(L_S\) depends only on values of S at points \(\tau_j\), and the behaviour of S within the Turnbull intervals is unknown. The convention applied here is: \(\hat{S}(t) = \hat{S}(\tau_j)\) when \(\tau_{j-1}\leq t\leq \tau_j\).

Turnbull algorithm

The original Turnbull algorithm (1976) is the first and the simplest method to maximize \(L_S(p)\) with respect to \(p\). It is an application of the EM algorithm with two steps repeated until convergence:

  • E-step:

    \forall i=1,\dots ,n \quad\quad \hat{p}_i(\tau_j) = \frac{\alpha_{ij} p_j}{\sum_{k=1}^m \alpha_{ik} p_k}

  • M-step:

    \forall j=1,\dots ,m \quad\quad \hat{p}(\tau_j) = (1/n)\sum_{i=1}^{n} \hat{p}_i(\tau_j)

Isotonic regression type method

Monolix uses this method to estimate the survival function in the VPC plots.

If events are exactly observed, then the NPMLE of S is equal to the isotonic regression of

{1-d_1/n_1, 1-d_2/n_2,\dots ,1-d_m/n_m}

,

where \(d_j\) is the number of events until \(\tau_j\) and \(n_j\) is the number of individuals at risk at \(\tau_j\).

The algorithm is the following:

  • Step 0: Initialization – choose an initial guess for \( S^{old}(\tau_j) \)
  • Step 1: Compute the probability of an event at time \( \tau_j \) as

    p_j = S^{old}(\tau_{j-1}) - S^{old}(\tau_j), j = 1,\dots ,m

  • Step 2: Estimate the number of events at \( \tau_j \) as

     \tilde{d}_j = \sum_{i=1}^n \frac{\alpha_{ij} p_j}{\sum_{k=1}^m \alpha_{ik} p_k}, j = 1,\dots ,m

  • Step 3: Compute the estimated number of individuals at risk at time \( \tau_j \)

     \tilde{n}_j = \sum_{k=j}^m \tilde{d}_k, j = 1,\dots ,m

  • Step 4: Compute \(S^{new} \) using the Kaplan-Meier estimator using \( \tilde(d) \) and \( \tilde(n) \) to update the survival function
  • If the updated solution \(S^{new} \) is close to the old solution \(S^{old} \) (in Monolix, \(\Vert S^{new} – S^{old} \Vert < 0.001\) ), stop, otherwise set \(S^{old} = S^{new} \) and repeat steps 1 – 4

Example

Exactly observed events: the 4 individuals below are sorted in increasing order of the times of events or drop-outs, ie. \( 0<t1<t2<t3<t4\).

Interval censored events: the individuals are in increasing order of the times at which censored intervals end.

Observations:

  • The upper interval limits \(U\) are in the increasing order, because they correspond to the exact times in case of “exactly observed” events.
  • The lower interval limits \(L\) are not necessarily ordered with respect to themselves or with respect to the upper interval limits, eg. \( L_4 < L_2 < U_2 \).
  • The upper bound curve assumes that events occur exactly at the upper interval limit \( t_i = U_i \).
  • The lower bound curve assumes that events occur exactly at the lower interval limit \( t_i = L_i \).
Procedure

The time points of the lower and upper interval limits for all individuals are ordered in increasing sequence. In this example, the assumption is

0 < L_1 < U_1 < L_4 < L_2 < U_2 < L_3 < U_4

0 < \tau_1 < \tau_2 < \tau_3 < \tau_4 < \tau_5 < \tau_6 < \tau_7

The Turnbull intervals are the intervals \(\tau_{j-1},\tau_j\), where \(j=1,\dots , m\), with \(\tau_0=0\) and \(\tau_8=\infty\).

The weights \( \alpha_{ij} \) are elements of the following matrix:

First iteration of the algorithm

Step 0: The initial guess is the upper bound curve, that is the Kaplan-Meier estimator under the assumption that events occur at \( U_i \), interpolated on the \( \tau – \) grid.

S^{old}(\tau_0)=1,\quad S^{old}(\tau_1)=1,\quad S^{old}(\tau_2)=S_1,\quad S^{old}(\tau_3)=S_1,\quad S^{old}(\tau_4)=S_1\newline S^{old}(\tau_5)=S_2,\quad S^{old}(\tau_6)=S_2,\quad S^{old}(\tau_7)=S_4,\quad S^{old}(\tau_8)=S_4

Step 1: The probabilities that events occur in the Turnbull intervals are

p_1=0,\quad p_2=1-S_1,\quad p_3=0,\quad p_4=0\newline p_5=S_1-S_2,\quad p_6=0,\quad p_7=S_2-S_4,\quad p_8=0

Step 2: Modified number of events
\(\quad\quad\quad \tilde{d}_1=0\\
\quad\quad\quad\tilde{d}_2=\frac{\alpha_{12} p_2}{\alpha_{12} p_2 }=1\\
\quad\quad\quad\tilde{d}_3=0\\
\quad\quad\quad\tilde{d}_4=0\\
\quad\quad\quad\tilde{d}_5=\frac{\alpha_{25} p_5}{\alpha_{25} p_5 } + \frac{\alpha_{45} p_5}{\alpha_{45} p_5+\alpha_{47} p_7 }=1+\frac{S_1-S_2}{S_1-S_4 }\\
\quad\quad\quad\tilde{d}_6=0\\
\quad\quad\quad\tilde{d}_7=\frac{\alpha_{37} p_7}{\alpha_{37} p_7 } + \frac{\alpha_{47} p_7}{\alpha_{45} p_5+\alpha_{47} p_7 } =1+\frac{S_2-S_4}{S_1-S_4 }\\
\quad\quad\quad\tilde{d}_8=0\)

Step 3: Modified number of individuals at risk
\(\quad\quad\quad\tilde{n}_1=4\\
\quad\quad\quad\tilde{n}_2=4\\
\quad\quad\quad\tilde{n}_3=3\\
\quad\quad\quad\tilde{n}_4=3\\
\quad\quad\quad\tilde{n}_5=3 \\
\quad\quad\quad\tilde{n}_6=1+\frac{S_2-S_4}{S_1-S_4}\\
\quad\quad\quad\tilde{n}_7=1+\frac{S_2-S_4}{S_1-S_4}\\
\quad\quad\quad\tilde{n}_8=0\)

Step 4: Kaplan-Meier estimator for the modified values of \(d\) and \(n\):
\(\quad\quad\quad S^{new}(\tau_0)=1\\
\quad\quad\quad S^{new}(\tau_1) = 1\cdot (1-\frac{0}{4})=1\\
\quad\quad\quad S^{new}(\tau_2) = 1\cdot (1-\frac{1}{4})=\frac{3}{4}\\
\quad\quad\quad S^{new}(\tau_3) = \frac{3}{4}\cdot (1-\frac{0}{3}) = \frac{3}{4}\\
\quad\quad\quad S^{new}(\tau_4) = \frac{3}{4}\cdot (1-\frac{0}{3}) = \frac{3}{4}\\
\quad\quad\quad S^{new}(\tau_5) = \frac{3}{4}\cdot (1-\frac{1+\frac{1}{3}}{3}) = \frac{5}{12}\\
\quad\quad\quad S^{new}(\tau_6) = \frac{5}{12}\cdot (1-\frac{0}{1+\frac{2}{3}}) = \frac{5}{12}\\
\quad\quad\quad S^{new}(\tau_7) = \frac{5}{12}\cdot (1-\frac{1+\frac{2}{3}}{1+\frac{2}{3}}) = 0\\
\quad\quad\quad S^{new}(\tau_8) = 0\)