# VPC for Time-To-Event data

Construction of a VPC plot for time-to-event data in Monolix requires several steps. First, Monolix simulates datasets using the population model, which contains the structural and statistical models and the estimated population parameters. Then, for each simulated dataset, it estimates the survival function with the Kaplan-Meier estimator for exactly observed events or Turnbull for interval censored. Because in time-to-event problems, time itself is an observation, simulated curves are interpolated on a fixed grid. As a result, at each time point of the new grid, there is a set of probabilities that events occur after that point. Finally, Monolix calculates lower and upper percentiles, which give a prediction interval.

# Simulation of data sets using a population model

Simulations of time-to-event datasets give, for each individual, times when an event occurs. They keep the same design as in the original data set, the same number of individuals as in the original data set and individual parameters are sampled using the covariates from this original data set and the population parameters estimated in Monolix.

To simulate a time when an event occurs, $$u$$ is defined as a probability that an event $$X$$ occurs before a certain time $$T$$. This probability is computed using the cumulative hazard function $$H(t)$$

$u = P(X\leq T) = 1-exp\left(-\int_0^T h(t,\psi_i) dt\right) = 1-exp(-H(T)),$

where $$H$$ is a solution to the ODE system $$dH/dt = h$$. Value of the probability $$u$$ is sampled from a uniform $$[0,1]$$ distribution, thus in the above equation time T becomes the unknown. Root finding method from an ODE solver finds its value.

Time T is not a given independent variable, but it is a simulated observation. Moreover, simulations are only till time $$T_{max}$$, which corresponds to the last time found in the data set. This value is the final time in integration of ODEs for each individual and it also defines a unique right censoring time for all individuals in the simulated data sets. It is independent of the times of events or right censoring in the original data set.

#### Example:

Individual 3 in the original data set (table on the left) experienced an event at time 4. In simulations (tables on the right) an event for this individual can occur before or after that time, or may occur after $$T_{max}$$  (last table). Similarly, individual 5 left the study at time 3 without experiencing an event in the original data set. But in simulations, an event can occur before or after time 3. If an event has not occurred at $$T_{max}=5$$, then it is right censored.

#### Survival function

Monolix estimates survival functions for each simulated dataset. Depending on the event type, there are two methods:

1. Kaplan-Meier estimator for exactly observed events. Simulated survival step-functions are directly interpolated on a fixed grid (see next section), while the linear approximation is applied to the empirical  Kaplan-Meier curve.
2. Turnbull estimator for interval censored data. Monolix uses this method for VPC only if a user includes the “eventType = intervalCensored” option in the structural model
DEFINITION:
Event = {type = event, eventType = intervalCensored, intervalLength = L,
maxEventNumber = 1, hazard = h}

Plot of the empirical survival curve uses Turnbull intervals constructed from the censored intervals of the original data set.

Simulations of interval censored datasets give, as in the case of exactly observed events, exact times of events for each individual. Then, events are counted in each interval. The definition of intervals depends if the structural model contains the option “intervalLength” or not.
–    If “intervalLength” is given, then the intervals are

$[0, L], (L, 2\cdot L],\dots , (m\cdot L,T_{max} ]$

–    If “intervalLength” is missing, then the observation period $$[0,T_{max}]$$ is divided into 10 equal intervals.

# Prediction interval

To generate VPC plot, Monolix performs by default 500 simulations with N individuals as in the original data set. As a result, it produces 500 time series that form 500 survival curves (Kaplan-Meier or Turnbull estimators). Time is a random number, so all estimated survival functions have different time grid (Fig.(a)). Interpolation of these curves on a fixed time grid allows to calculate the prediction interval (Fig.(b)). In Monolix a default setting consist of a uniform grid with 100 points. For exactly observed events there are also all time points from the original data set to provide all possible information. Exact time is unknown for interval censored events, so dataset time points have no additional information. Interpolation at the new grid gives at each time point of the grid a set of probabilities corresponding to different simulated survival curves. It allows to calculate lower and upper percentiles (the 5th and 95 percentiles are a default choice) and draw the prediction interval (Fig.(c)).

#### Example 1: exactly observed events.

The dataset contains times of exactly observed single events of 30 individuals. The survival model has a constant hazard function and the Kaplan-Meier estimator approximates the survival function. VPC plot uses a linear approximation of the Kaplan-Meier curve (Fig. on the left). By default, the fixed grid has 100 uniformly distributed points together with all time points from the date set. Figure on the left presents the VPC plot with default settings. Approximation of the empirical and simulated Kaplan-Meier curves is sufficient on the default grid. In contrast, VPC in the Figure on the right uses a grid with time points only from the original dataset. The accuracy of the prediction interval decreased.

#### Example 2: interval censored events.

In this dataset, the times of events from the previous example are the ends of time intervals within which events occurred. The length of the censored intervals is non constant among intervals, and oscillates around 5. The structural model includes this information in eventTime and intervalLength options.

# Survival function estimators

## Exact events: Kaplan-Meier estimator

Kaplan-Meier estimator is a standard method to approximate the survival function $$S(t)$$. It describes the probability that an individual survives until time t, knowing that it survived at any earlier time. For single events, it is given by the following formula

$\hat{S}(t)=\sum_{i:t_i

where

• $$t_i$$ – times before t, when at least one event occurred,
• $$d_i$$ – number of events at the time $$t_i$$,
• $$n_i$$ – number of individuals at risk, that is who did not experience an event until $$t_i$$.

The probability $$(p_e)$$ of an event is the ratio between number of occurred events $$(d_i)$$ and number of individuals at risk $$(n_i)$$. The complement of it, $$(1-p_e)$$, gives an estimation of the survival. For each time t, the total number of individuals at risk, who survived, may change. Thus, current survival probability is the product of all probabilities at previous times $$t_i$$, when at least one event occurred. It is similar to calculating the probability that a patient survives 2 days. It is a product of two probabilities: patient survives the first day and a conditional probability that it survives the second day, knowing that it survived the first one.

#### Example:

A typical example of a time-to-event data set contains information about exact times when individuals experienced an event or when they left a study (drop-out). Below, five individuals have two observations: time when the observation starts ( 0 for everyone) and time of an event. If a patient leaves a study, then the dataset contains a drop-out time with 1, instead of 0, in the column for the observation. The patient didn’t experience an event but survived until the drop-out time. Kaplan-Meier estimator takes into account such situations, because these individuals are not counted as individuals at risk (they are not counted in the denominator $$n_i$$) at later times.

A study starts at time $$t_1$$. There are no events, so $$d_1=0$$ and the value of the survival curve is 1. Until the next event time at $$t_2=1$$, the survival remains constant. Then, one individual experienced an event, $$d_2=1$$, and all individuals survived until that time $$n_2=5$$. As a result, the probability to survive decreases by 0.2. It is the height of the jump at $$t=1$$ in the plot. Survival remains constant until the next event. At time $$t=3$$ there are two events. The number $$n_3$$ counts now only 4 individuals – it has decreased by 1 due to the previous event. The final probability at time 3 is a product with all earlier probabilities. At $$t=4$$ there is a drop-out. Patient 5 left the study without experiencing an event. The survival curve remains constant and a red dot marks the drop-out. The Kaplan-Meier estimator takes into account this situations, because at the next event time $$t=5$$, this individual is not counted as an individual at risk – denominator n will be smaller. Finally, at time $$t=5$$, there is only one individual left, and one event, so the survival becomes 0.

#### Remarks

• Kaplan-Meier estimator handles correctly information about individuals who left the study, but there is a bias when the exact times of events are unknown.
• In data visualization, Monolix assumes that all events are exactly observed. For example: assume that an observation period started at $$t=0$$ and at $$t=1$$ an event is marked by 1 in the column for the observation. Without any other information, event could occurred at $$t=1$$ or before. Similarly when the dataset has the beginning of the study and time interval limits. Just looking at the data set, an exact and interval censored events are indistinguishable.
• If event type is not specified in the structural model, Monolix considers all interval censored events as exactly observed at the end of a censored interval.

## Interval censored events: Turnbull estimator

Interval censored events do not occur at exact times $$t_i$$, $$i = 1,…,n)$$, but within some time intervals between times $$L_i$$ and $$U_i$$, $$L_i < t_i \leq U_i$$. In case of right censored data, eg. dropout, an event occur within the interval $$(L_i, +\infty)$$, where $$L_i$$ is the time when an individual leaves a study.

### Non parametric maximum likelihood estimator (NPMLE)

Expectation-Maximisation (EM) algorithm is a technique to estimate parameters using “incomplete data”. For example, interval censored data to estimate survival. The algorithm starts with some initial estimate and “expects” missing values, knowing the initial parameters. Then, it maximizes the likelihood considering the expected data as the given data. These processes alternates until some stopping criteria is satisfied.

Notation:

• Grid of times constructed from all points $$L_i, U_i$$ for $$i = 1,…,n$$ in increasing order, i.e.

$0 = \tau_0\leq\tau_1\leq\dots\leq\tau_m\leq\tau_{m+1}=\infty$

• For each individual $$i$$: $$\alpha_i$$ denotes a weight such that

$\alpha_i = 1\cdot { (\tau_{j-1} , \tau_j] \subset (L_i, U_i] }.$

Weights $$\alpha_{ij}$$ are events indicators – indicate whether an event from the interval $$(L_i, U_i]$$ occurred at $$\tau_j$$.

• Intervals $$(\tau_{j-1} , \tau_j]$$ are called Turnbull intervals.
• Probability that an event occurs in a j-th Turnbull interval is

$p_j = P(\tau_{j-1}\leq T\leq\tau_j) = S(\tau_{j-1}) - S(\tau_j)$

Likelihood function for $$p = (p_1, …, p_{m+1})$$ is:

$L_S(p)= \prod_{i-1}^n [S(L_i) - S(R_i)] = \prod_{i-1}^n\sum_{j=1}^{m+1}\alpha_{ij} p_j$

Finding NPMLE of S, denoted by $$\hat{S}$$, means maximizing $$L_S(p)$$ under the constraint that $$p_j$$ are positive and $$\sum_{j=1}^{m+1} p_j = 1$$.
Remark: $$L_S$$ depends only on values of S at points $$\tau_j$$, and behaviour within the Turnbull intervals is unknown. Convention applied here is: $$\hat{S}(t) = \hat{S}(\tau_j)$$ when $$\tau_{j-1}\leq t\leq \tau_j$$.

### Turnbull algorithm

Original Turnbull algorithm (1976) is the first and the simplest method to maximize $$L_S(p)$$ with respect to $$p$$. It is an application of the EM algorithm with two steps repeated until convergence:

• E-step:

$\forall i=1,\dots ,n \quad\quad \hat{p}_i(\tau_j) = \frac{\alpha_{ij} p_j}{\sum_{k=1}^m \alpha_{ik} p_k}$

• M-step:

$\forall j=1,\dots ,m \quad\quad \hat{p}(\tau_j) = (1/n)\sum_{i=1}^{n} \hat{p}_i(\tau_j)$

### Isotonic regression type method

Monolix uses this method to estimate survival function in the VPC plots.

If events are exactly observed, then the NPMLE of S is equal to the isotonic regression of

${1-d_1/n_1, 1-d_2/n_2,\dots ,1-d_m/n_m}$

,

where $$d_j$$ is the number of events until $$\tau_j$$ and $$n_j$$ is the number of individuals at risk at $$\tau_j$$.

The algorithm is the following:

• Step 0: Initialization – choose an initial guess for $$S^{old}(\tau_j)$$
• Step 1: Compute the probability of an event at time $$\tau_j$$ as

$p_j = S(\tau_{j-1}) - S(\tau_j), j = 1,\dots ,m$

• Step 2: Estimate the number of events at $$\tau_j$$ as

$\tilde{d}_j = \sum_{i=1}^n \frac{\alpha_{ij} p_j}{\sum_{k=1}^m \alpha_{ik} p_k}, j = 1,\dots ,m$

• Step 3: Compute the estimated number of individuals at risk at time $$\tau_j$$

$\tilde{n}_j = \sum_{k=j}^m d_k, j = 1,\dots ,m$

• Step 4: Compute $$S^{new}$$ using the Kaplan-Meier estimator using $$\tilde(d)$$ and $$\tilde(n)$$ to update the survival function
• If the updated solution $$S^{new}$$ is close to the old solution $$S^{old}$$, stop, otherwise set $$S^{old} = S^{new}$$ and repeat steps 1 – 4

### Example

Exactly observed events: 4 individuals are in the increasing order of times of events or drop-outs $$0<t1<t2<t3<t4$$.

Interval censored events: individuals are in the increasing order of times, when censored intervals end

Observations:

• Upper interval limits $$U$$ are in the increasing order, because they correspond to the exact times in case of “exactly observed” events
• Lower interval limits $$L$$ are not necessarily ordered with respect to themselves or with respect to the upper interval limits, eg. $$L_4 < L_2 < U_2$$
• Upper bound assumes that events occur exactly at the upper interval limit $$t_i = U_i$$
• Lower bound assumes that events occur exactly at the lower interval limit $$t_i = L_i$$
##### Procedure

Time points of the lower and upper interval limits for all individuals are ordered in the increasing sequence. In this example, the assumption is

$0 < L_1 < U_1 < L_4 < L_2 < U_2 < L_3 < U_4$

$0 < \tau_1 < \tau_2 < \tau_3 < \tau_4 < \tau_5 < \tau_6 < \tau_7$

Turnbull intervals are intervals $$\tau_{j-1},\tau_j$$, where $$j=1,\dots , m$$.

Weights $$\alpha_{ij}$$ are elements of the following matrix

First iteration of the algorithm

Step 0: Initial guess is the upper bound curve, that is the Kaplan-Meier estimator under the assumption that events occur at $$U_i$$, interpolated on the $$\tau –$$ grid.

$S^{old}(\tau_0)=1,\quad S^{old}(\tau_1)=1,\quad S^{old}(\tau_2)=S_1,\quad S^{old}(\tau_3)=S_1,\quad S^{old}(\tau_4)=S_1\newline S^{old}(\tau_5)=S_2,\quad S^{old}(\tau_6)=S_2,\quad S^{old}(\tau_7)=S_3,\quad S^{old}(\tau_8)=S_3$

Step 1: Probabilities that events occur in the Turnbull intervals are

$p_1=0,\quad p_2=1-S_1,\quad p_3=0,\quad p_4=0\newline p_5=S_1-S_2,\quad p_6=0,\quad p_7=S_2-S_3,\quad p_8=0$

Step 2: Modified number of events
$$\quad\quad\quad \tilde{d}_1=0\\ \quad\quad\quad\tilde{d}_2=\frac{\alpha_{12} p_2}{\alpha_{12} p_2 }=1\\ \quad\quad\quad\tilde{d}_3=0\\ \quad\quad\quad\tilde{d}_4=0\\ \quad\quad\quad\tilde{d}_5=\frac{\alpha_{25} p_5}{\alpha_{25} p_5 } + \frac{\alpha_{45} p_5}{\alpha_45 p_5+\alpha_{47} p_7 }=1+\frac{S_1-S_2}{S_1-S_4 }\\ \quad\quad\quad\tilde{d}_6=0\\ \quad\quad\quad\tilde{d}_7=\frac{\alpha_{37} p_7}{\alpha_{37} p_7 } + \frac{\alpha_{47} p_7}{\alpha_45 p_5+\alpha_{47} p_7 } =1+\frac{S_2-S_4}{S_1-S_4 }\\ \quad\quad\quad\tilde{d}_8=0$$

Step 3: Modified number of individuals at risk
$$\quad\quad\quad\tilde{n}_1=4\\ \quad\quad\quad\tilde{n}_2=4\\ \quad\quad\quad\tilde{n}_3=3\\ \quad\quad\quad\tilde{n}_4=3\\ \quad\quad\quad\tilde{n}_5=3 \\ \quad\quad\quad\tilde{n}_6=1+\frac{S_2-S_4}{S_1-S_4}\\ \quad\quad\quad\tilde{n}_7=1+\frac{S_2-S_4}{S_1-S_4}\\ \quad\quad\quad\tilde{n}_8=0$$

Step 4: Kaplan-Meier estimator for the modified values of $$d$$ and $$n$$:
$$\quad\quad\quad S^{new}(\tau_0)=1\\ \quad\quad\quad S^{new}(\tau_1) = 1\cdot (1-\frac{0}{4})=1\\ \quad\quad\quad S^{new}(\tau_2) = 1\cdot (1-\frac{1}{4})=\frac{3}{4}\\ \quad\quad\quad S^{new}(\tau_3) = \frac{3}{4}\cdot (1-\frac{0}{3}) = \frac{3}{4}\\ \quad\quad\quad S^{new}(\tau_4) = \frac{3}{4}\cdot (1-\frac{0}{3}) = \frac{3}{4}\\ \quad\quad\quad S^{new}(\tau_5) = \frac{3}{4}\cdot (1-\frac{1+\frac{1}{3}}{3}) = \frac{5}{12}\\ \quad\quad\quad S^{new}(\tau_6) = \frac{5}{12}\cdot (1-\frac{0}{1+\frac{2}{3}}) = \frac{5}{12}\\ \quad\quad\quad S^{new}(\tau_7) = \frac{5}{12}\cdot (1-\frac{1+\frac{2}{3}}{1+\frac{2}{3}}) = 0$$