Count data model

Objectives: learn how to implement a model for count data.

Projects: count1a_project, count1b_project, count2_project


Longitudinal count data is a special type of longitudinal data that can take only nonnegative integer values {0, 1, 2, …} that come from counting something, e.g., the number of seizures, hemorrhages or lesions in each given time period. In this context, data from individual i is the sequence y_i=(y_{ij},1\leq j \leq n_i) where y_{ij} is the number of events observed in the jth time interval I_{ij}.
Count data models can also be used for modeling other types of data such as the number of trials required for completing a given task or the number of successes (or failures) during some exercise. Here, y_{ij} is either the number of trials or successes (or failures) for subject i at time t_{ij}. For any of these data types we will then model y_i=(y_{ij},1 \leq j \leq n_i) as a sequence of random variables that take their values in {0, 1, 2, …}.  If we assume that they are independent, then the model is completely defined by the probability mass functions \mathbb{P}(y_{ij}=k) for k \geq 0 and 1 \leq j \leq n_i. Here, we will consider only parametric distributions for count data.
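
Under the independence assumption, the joint probability of an individual's sequence of counts factorizes into a product of the marginal probability mass functions. Written out, with k_j denoting the value observed at the jth occasion (a notation introduced here only for illustration):

```latex
\mathbb{P}(y_{i1}=k_1,\ldots,y_{i n_i}=k_{n_i}) = \prod_{j=1}^{n_i} \mathbb{P}(y_{ij}=k_j)
```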

Formatting of count data in the MonolixSuite

Count data can take only non-negative integer values that come from counting something, e.g., the number of trials required for completing a given task. The task can, for instance, be repeated several times and the individual's performance followed over time. In the following data set:

1 0 10
1 24 6
1 48 5
1 72 2

10 trials are necessary the first day (t=0), 6 the second day (t=24), etc. Count data can also represent the number of events happening in regularly spaced intervals, e.g., the number of seizures every week. If the time intervals are not regular, the data may be treated as interval-censored repeated time-to-event data, or the interval length can be given as a regressor and used to define the probability distribution in the model.
See the epilepsy attacks data set for a more practical example.
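
For irregular intervals, one possible way to use the interval length in the distribution is to let the expected count scale with it. The following Python sketch is only an illustration under that assumption (a constant event rate per unit time); the names `rate` and `interval_length` and the numeric values are hypothetical and do not come from any Monolix project.

```python
import math

def poisson_pmf(k, mean):
    """P(Y = k) for a Poisson distribution with the given mean (> 0)."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))

def interval_count_pmf(k, rate, interval_length):
    """P(k events in an interval), assuming a constant rate per unit time."""
    # The expected count is proportional to the interval length.
    return poisson_pmf(k, rate * interval_length)

rate = 0.5  # hypothetical value: events per week
p_one_week = interval_count_pmf(0, rate, interval_length=1.0)
p_two_weeks = interval_count_pmf(0, rate, interval_length=2.0)
print(p_one_week, p_two_weeks)
```

Under this assumption, the probability of observing no event over two weeks is the square of the probability over one week.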

Count data with constant distribution over time

  • count1a_project (data = ‘count1_data.txt’, model = ‘count_library/poisson_mlxt.txt’)

A Poisson model is used for fitting the data:

input = lambda

Y = {type = count,  log(P(Y=k)) = -lambda + k*log(lambda) - factln(k) }

output = Y
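
To make the log-probability expression above concrete, here is a minimal Python sketch of the same formula. It is not Monolix code: `math.lgamma(k + 1)` is used as a stand-in for `factln(k)`, and the value of `lam` is arbitrary.

```python
import math

def poisson_log_pmf(k, lam):
    """log P(Y = k) for a Poisson distribution with mean lam > 0."""
    if k < 0 or lam <= 0:
        raise ValueError("k must be >= 0 and lam must be > 0")
    # math.lgamma(k + 1) equals log(k!), the role played by factln(k)
    return -lam + k * math.log(lam) - math.lgamma(k + 1)

lam = 3.0  # arbitrary illustrative value
for k in range(5):
    print(k, math.exp(poisson_log_pmf(k, lam)))
```

Working on the log scale avoids overflow in k! for large counts.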

Residuals for noncontinuous data reduce to NPDEs. We can compare the empirical distribution of the NPDEs with the standard normal distribution, using both the pdf (on the left) and the cdf (on the right):

VPCs for count data compare the observed and predicted frequencies of the categorized data over time:

  • count1b_project (data = ‘count1_data.txt’, model = ‘count_library/poissonMixture_mlxt.txt’)

A mixture of two Poisson distributions is used to fit the same data. For that, we define the probability of k occurrences as the weighted sum of two Poisson distributions with expected numbers of occurrences lambda1 and lambda2. The structural model file reads

input = {lambda1, alpha, mp}

lambda2 = (1+alpha)*lambda1

Y = { type = count,
       P(Y=k) = mp*exp(-lambda1 + k*log(lambda1) - factln(k)) + (1-mp)*exp(-lambda2 + k*log(lambda2) - factln(k)) }

output = Y
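
The mixture probability defined in the model above can be checked numerically with a short Python sketch. This is an illustration only, not Monolix code; the parameter values are arbitrary, and `math.lgamma(k + 1)` stands in for `factln(k)`.

```python
import math

def poisson_pmf(k, lam):
    """P(Y = k) for a Poisson distribution with mean lam > 0."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def mixture_pmf(k, lambda1, alpha, mp):
    """Weighted sum of two Poisson pmfs, with lambda2 = (1 + alpha) * lambda1."""
    if not 0.0 <= mp <= 1.0:
        raise ValueError("mp must be in [0, 1]")
    lambda2 = (1 + alpha) * lambda1
    return mp * poisson_pmf(k, lambda1) + (1 - mp) * poisson_pmf(k, lambda2)

# Arbitrary illustrative values
lambda1, alpha, mp = 2.0, 1.5, 0.3
total = sum(mixture_pmf(k, lambda1, alpha, mp) for k in range(200))
mean = sum(k * mixture_pmf(k, lambda1, alpha, mp) for k in range(200))
print(total, mean)
```

The probabilities sum to 1, and the mean of the mixture is mp*lambda1 + (1-mp)*lambda2.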

The parameter alpha has to be strictly positive so that the two Poisson distributions have different expected numbers of occurrences, and mp has to be in [0, 1] so that the probability is correctly defined. These parameters should therefore be defined with lognormal and probitnormal distributions, respectively, as shown in the following figure.

We see on the VPC below that the data set is well modeled by this mixture of Poisson distributions.

In addition, we can compute the prediction distribution of the modalities, as in the following figure:

Count data with time-varying distribution

  • count2_project (data = ‘count2_data.txt’, model = ‘count_library/poissonTimeVarying_mlxt.txt’)

The distribution of the data changes with time in this example:

We then use a Poisson distribution with a time-varying intensity:

input =  {a,b}
lambda= a*exp(-b*t)
y = {type=count, P(y=k)=exp(-lambda)*(lambda^k)/factorial(k)}

output = y
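
As an illustration of how the intensity drives the distribution over time, the following Python sketch evaluates the same formulas outside Monolix. The values of `a` and `b` and the time points are hypothetical, chosen only to show the decay.

```python
import math

def intensity(t, a, b):
    """Time-varying Poisson intensity: lambda(t) = a * exp(-b * t)."""
    return a * math.exp(-b * t)

def count_pmf(k, t, a, b):
    """P(y = k) at time t for a Poisson distribution with intensity lambda(t)."""
    lam = intensity(t, a, b)
    return math.exp(-lam) * lam ** k / math.factorial(k)

a, b = 5.0, 0.3  # hypothetical parameter values
for t in (0.0, 2.0, 10.0):
    # The expected count and the probability of a zero count at each time
    print(t, intensity(t, a, b), count_pmf(0, t, a, b))
```

With b > 0 the expected count decreases with time, so the probability of observing zero events increases.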

This model seems to fit the data very well: