Kaplan-Meier Survival Probability Estimates
Suppose that 100 subjects of a certain type were tracked over a period of
time to determine how many survived for one year, two years, three years, and so
forth. If all the subjects remained accessible throughout the entire length of
the study, the estimation of year-by-year survival probabilities for subjects of
this type in general would be an easy matter. The survival of 87 subjects at the
end of the first year would give a one-year survival probability estimate of
87/100=0.87; the survival of 76
subjects at the end of the second year would yield a two-year estimate of
76/100=0.76; and so forth.
But in real-life longitudinal research it rarely works out this neatly.
Typically there are subjects lost along the way for reasons unrelated to the
focus of the study. To illustrate the complication in this sort of situation,
consider the following hypothetical scenario. Of the 100 subjects who are "at
risk" at the beginning of the study, 3 become unavailable during the first year
and 5 are known to have died by the end of the first year. Another 3 become
unavailable during the second year and another 10 are known to have died by the
end of the second year. And so on for the other years shown. For the sake of
numerical simplicity I am showing 3 subjects becoming unavailable in each of the
five years. In real-life research the loss rate would of course not normally be
so uniform as this.
Time Period
| At
Risk
| Became Unavailable (Censored)
| Died
| Survived
|
Year
1
| 100
| 3
| 5
| ?
|
Year
2
| ?
| 3
| 10
| ?
|
Year
3
| ?
| 3
| 15
| ?
|
Year
4
| ?
| 3
| 20
| ?
|
Year
5
| ?
| 3
| 25
| ?
|
The question in a
situation of this sort is: What shall we make of the subjects who become
unavailable in a given time period? (Within the context of the Kaplan-Meier
procedure, the subjects who become unavailable are spoken of as
censored.) We in fact do not know whether these subjects survived or
died. Yet, if we were simply to omit them from the study, we would be losing
valuable information: namely, that the 3 subjects who became unavailable during
Year 2 survived at least through Year 1; that the 3
who became unavailable during Year 3 survived at least through
Year 2; and so on.
Kaplan and Meier,
recognizing that any attempt to salvage this information would involve a certain
amount of "fudging," proposed that subjects who become unavailable during a
given time period be counted among those who survive through the end of that
period, but then deleted from the number who are at risk for the next time
period. "These conventions," they wrote,
|
| may be
paraphrased by saying that deaths recorded as of [time]
t are treated as if they occurred slightly before
t, and losses recorded as of [time] t
are treated as occurring slightly after t. In this way
the fudging is kept conceptual, systematic, and automatic. (Kaplan &
Meier, 1958)
| |
Time Period
| At
Risk
| Became Unavailable (Censored)
| Died
| Survived
|
Year
1
| 100
| 3
| 5
| 95
|
Year
2
| 92
| 3
| 10
| 82
|
Year
3
| 79
| 3
| 15
| 64
|
Year
4
| 61
| 3
| 20
| 41
|
Year
5
| 38
| 3
| 25
| 13
| The adjacent table shows how these conventions would work
out for the present example. Of the 100 subjects who are at risk at the
beginning of the study, 3 become unavailable during the first year and 5 die.
The number surviving the first year is therefore 100-5=95 and the number at risk at
the beginning of Year 2 is 100-3-5=92. Another 3 subjects
become unavailable during the second year and another 10 die. So the number
surviving Year 2 is 92-10=82 and the number at risk at
the beginning of Year 3 is 92-3-10=79. And so on for the
other years shown.
As illustrated in the next table, the
Kaplan-Meier procedure then calculates the survival probability estimate for
each of the t time periods, except the first, as a compound conditional
probability.
Time Period
| At
Risk
| Became Unavailable (Censored)
| Died
| Survived
| Kaplan-Meier Survival Probability
Estimate
|
Year
1
| 100
| 3
| 5
| 95
| (95/100)=0.95
|
Year
2
| 92
| 3
| 10
| 82
| (95/100)x(82/92)=0.8467
|
Year
3
| 79
| 3
| 15
| 64
| (95/100)x(82/92)x(64/79)=0.70
|
Year
4
| 61
| 3
| 20
| 41
| (95/100)x(82/92)x(64/79)x(41/61)=0.4611
|
Year
5
| 38
| 3
| 25
| 13
| (95/100)x(82/92)x(64/79)x(41/61)x(13/38)=0.1577
| The
estimate for surviving through Year 1 is simply 95/100=0.95. And if one does
survive through Year 1, the conditional probability of then
surviving through Year 2 is 82/92=0.8913. The estimated
probability of surviving through both Year 1 and Year
2 is therefore (95/100)x(82/92)=0.8467. Similarly,
if one survives through the first two years, the conditional probability
of then surviving through Year 3 is 64/79=0.8101. So the estimated probability
of surviving through Year 1 and Year 2
and Year 3 is (95/100)x(82/92)x(64/79)=0.70. And
similarly for the other time periods.
This cumbersome structure is shown
only to illustrate the logic of the procedure. For practical computational
purposes, the same results can be obtained more efficiently by using the
Kaplan-Meier product-limit estimatorQ
 where S(ti) is the estimated survival
probability for any particular one of the t time periods;
ni is the number of subjects at risk at the
beginning of time period ti; and
di is the number of subjects who die during time
period ti.
The Kaplan-Meier procedure is
not limited to the measurement of survival in the narrow sense of dying or not
dying. It can also be used to estimate the time-defined probabilities for the
failure of an instrument or device of a certain type; or alternatively, to
estimate the time-defined probabilities for some particular type of success
(e.g., finding employment after becoming unemployed).
For purposes of illustration, the following Kaplan-Meier calculator is set
up for 5 time periods and the values that need to be entered for the above
example (total number of subjects along with the number of subjects for each
time period who died or became unavailable) are already in place. To perform the
analysis on the data of this example, click the «Calculate» button. To perform
an analysis on a different set of data with exactly 5 time periods, click the
«Clear» button, enter the relevant values into the yellow cells, and then click
«Calculate». To perform an analysis with fewer or more than 5 time periods,
click the «Reload» button and enter the number of time periods at the prompt.
The user's own labels for the time periods can be substituted for the labels t1,
t2, etc. c
Data EntryQ
|
|
| <==enter here the number
of subjects enrolled <==at the
beginning of the study.
|
Time Period
| At
Risk
| Became Unavailable (Censored)
| Died (Failed) (Succeeded)
| Survival Probability Estimate
| 0.95 Confidence Interval
|
Lower Limit
| Upper Limit
|
|
| The lower and upper limits of the 95% confidence
intervals are calculated according to the efficient-score method (corrected
for continuity) described by Robert Newcombe (1998), based on
the procedure outlined by E. B. Wilson (1927).
References:
Kaplan, E.L. & Meier, P. "Nonparametric estimation from
incomplete observations," Journal of the American Statistical
Association, 53, 457-481 (1958).
Newcombe, Robert G.
"Two-Sided Confidence Intervals for the Single Proportion: Comparison of
Seven Methods," Statistics in Medicine, 17, 857-872
(1998).
Wilson, E. B. "Probable Inference, the Law of Succession, and
Statistical Inference," Journal of the American Statistical
Association, 22, 209-212 (1927).
|