Additional notes:
Survival analysis with STATA
Ivan Iachine
Revision 2.30, 29.10.2003
1 Introduction
These notes contain additional material not included in the slides used at
the lectures. In particular the use of STATA survival analysis procedures
is illustrated in detail. The STATA output presented in these notes was
produced using STATA version 8.1. It is assumed that the reader has some
basic knowledge about STATA and the main concepts of STATA datasets (ie.
what is a variable, etc). In particular, one has to be able to start STATA, to
open files containing STATA datasets, to open the data in the browser and
to enter commands in the STATA command line. The description of STATA
survival analysis procedures in these notes is based on the “RATS” dataset
and Exercise series 1.
2 Preparing a dataset for survival analysis
Note, that STATA allocates memory for datasets statically, meaning that
there is a fixed amount of space reserved for data. The size of this space
can only be increased if all data is deleted from the memory. Consequently,
creating new variables in large datasets can fail. In this case one has to clear
the memory and increase the amount of space reserved for data manually by
1
using the set memory command described in the STATA survival analysis
reference section. Therefore it is recommended to adjust the memory size
before the actual analysis of the data.
Before one can carry out any STATA survival analysis procedures on
a dataset, it must be prepared for usage by means of the STATA’s stset
command. In the following the preparation procedure is described in more
detail. First one has to open the file containing the dataset in STATA (eg.
using the File → Open... menu or by entering the use rats command).
Next, inspect the variables contained in the dataset in order to find the
failure time variable and the censoring indicator variable. The datasets used
in the course observe the following naming conventions: the failure time
variable is called time and the censoring indicator variable is called event
(except for the dataset “MALEMICE” where there are several censoring
indicators event0,event1,event2 associated with different death causes).
Since we only consider right censoring with a single event per individual
without delayed entry in the course, the two variables are sufficient. Next,
enter the following command in the command line:
stset time, failure(event)
You should now see an output that looks like the following:
. stset time, failure(event)
failure event: event != 0 & event < .
obs. time interval: (0, time]
exit on or before: failure
------------------------------------------------------------------------------
40 total obs.
0 exclusions
------------------------------------------------------------------------------
40 obs. remaining, representing
36 failures in single record/single failure data
2
9118 total analysis time at risk, at risk from t = 0
earliest observed entry t = 0
last observed exit t = 344
The dataset name field above should contain the name of the dataset you
have opened previously.
Now you are ready to use STATA’s survival analysis procedures. Their
names usually start with the letters st (eg. stcox, sts, etc). The idea of
preparing a dataset for subsequent survival analysis using stset is that you
do not have to write the name of the failure time variable and censoring in-
dicator each time you call a survival analysis procedure – STATA remembers
this for you. Of course, if you open a different dataset you will have to use
the stset command again. In the following we assume that the dataset has
been correctly prepared for the analysis using stset.
3 Kaplan-Meier estimates and related topics
Kaplan-Meier estimates are classical estimates of the survival function, as
discussed in the theoretical section. In this section we show how to plot
these estimates on a graph with STATA, including the use of stratification
techniques to compare survival of several populations and how to obtain an
estimate of the median survival times from the Kaplan-Meier estimator. It
is shown how to obtain confidence intervals for the median survival time and
for the value of the survival function at a particular age.
3.1 Kaplan-Meier plots
To plot the Kaplan-Meier (K-M) estimates on a graph the command sts
graph is used. This would plot a single graph for the entire dataset. If the
dataset contains several groups, say the variable group contains the group
number, one may want to plot several K-M estimates on the same graph
in order to compare survival of the subpopulations. To do this issue the
3
0
.0
0
0
.2
5
0
.5
0
0
.7
5
1
.0
0
0 100 200 300 400
analysis time
group = 1 group = 2
Kaplan−Meier survival estimates, by group
Figure 1: Exercise 1.1
command:
sts graph, by(group)
The output from STATA’s graphics window is shown on Figure 1. Now it is
possible to compare the survival of the two groups.
3.2 Pointwise confidence intervals
It is possible to use STATA for computation of pointwise confidence intervals
for S(t) (ie. the value of the survival function at a given age t) based on the
Greenwood’s formula introduced in the theoretical section. In order to plot
the confidence intervals on a graph together with the K-M estimate use the
command:
sts graph, by(group) gwood
The default is to compute 95% confidence intervals (see the STATA manual
for information of how to change the confidence level). This will actually
4
0
.2
5
.5
.7
5
1
0 100 200 300 400 0 100 200 300 400
1 2
95% CI Survivor function
analysis time
Graphs by pretreatment regime
Kaplan−Meier survival estimates, by group
Figure 2: Exercise 1.2
produce two graphs, one for group=1, another for group=2, as shown on Fig-
ure 2. This gives some idea about the quality of K-M estimates computed
for our dataset. However, it is a bad idea to read the confidence intervals di-
rectly from the graphs. In Exercise 1.2 we have to compute a 95% confidence
interval for S(200). To compute it directly use the following command:
sts list, by(group) at(200 201)
This produces the following output (only relevant part of the output shown):
5
Beg. Survivor Std.
Time Total Fail Function Error [95% Conf. Int.]
----------------------------------------------------------------------
group=1
200 14 6 0.6842 0.1066 0.4279 0.8439
201 14 0 0.6842 0.1066 0.4279 0.8439
group=2
200 18 4 0.8095 0.0857 0.5689 0.9239
201 18 0 0.8095 0.0857 0.5689 0.9239
----------------------------------------------------------------------
Note, that in order to compute the values at t = 200 we had to request the
values for two ages (200 and 201). This is because specifying the at(200)
option alone in the sts command would have produced values of S(t) at 200
uniformly spaced points (for details see the STATA manual). From the out-
put we can read the estimated value of S(200) for group=1 and group=2 being
equal to 0.6842 and 0.8095 with 95% confidence intervals [0.4279, 0.8439] and
[0.5689, 0.9239] respectively.
3.3 Estimating median survival time
As it was presented in the theoretical section, the median survival time es-
timate can be obtained using the K-M estimate of the survival function,
namely as the smallest observed failure time where the K-M estimate is not
greater than 0.5. To compute the median estimates for the two groups use
the following command:
stsum, by(group)
The STATA output is as follows:
| incidence no. of |------ Survival time -----|
group | time at risk rate subjects 25% 50% 75%
------+--------------------------------------------------------------
1 | 4095 .0041514 19 190 216 234
6
2 | 5023 .0037826 21 232 233 280
------+--------------------------------------------------------------
total | 9118 .0039482 40 205 232 261
The median survival time estimate can be read in the output as the 50%
percentile. In our case the median estimates for the two groups are 216 and
233 respectively.
The 95% confidence intervals for the median survival times may be com-
puted using the stci command:
stci, median by(group)
The STATA output is as follows:
failure _d: event
analysis time _t: time
| no. of
group | subjects 50% Std. Err. [95% Conf. Interval]
------+-------------------------------------------------------------
1 | 19 216 5.171042 190 234
2 | 21 233 2.179595 232 280
------+-------------------------------------------------------------
total | 40 232 2.562933 213 239
The 95% confidence interval for the median for group=1 is [190, 234] and for
group=1 the respective interval is [232, 280].
4 Testing for equality of survival distributions
4.1 Logrank test
The logrank test allows for testing for the equality of two or more survival
distributions, as discussed in the theoretical section. In our case (Exercise
1.4) we wish to test the hypothesis that the two groups (ie. group=1 and
7
group=2) have equal survival distributions. To test this hypothesis using the
logrank test enter the following STATA command:
sts test group, logrank
The output is (only relevant parts shown):
| Events
group | observed expected
------+-------------------------
1 | 17 12.24
2 | 19 23.76
------+-------------------------
Total | 36 36.00
chi2(1) = 3.12
Pr>chi2 = 0.0772
The most important figure for us is the P-value (denoted Pr>chi2 in the
output). In our case it is equal to 0.0772 and the hypothesis about the
equality of the survival distributions is accepted on the 5% significance level,
since 0.0772 > 0.05.
5 Cox regression
Cox regression procedure is used to evaluate the effects of explanatory vari-
ables or covariates on the hazard rate using a proportinal hazard regression
model discussed in the theoretical section shown below for the case of a single
covariate x:
h(t|x) = h0(t)e
βx
Our dataset “RATS” contains a single explanatory variable group that takes
values 1 and 2. The following STATA command executes the Cox regression
procedure to estimate β:
stcox group, nohr
The output is:
8
No. of subjects = 40 Log likelihood = -100.71913
No. of failures = 36 chi2(1) = 2.88
Time at risk = 9118 Prob > chi2 = 0.0898
------------------------------------------------------------------------------
time |
event | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+--------------------------------------------------------------------
group | -.5958958 .3484041 -1.710 0.087 -1.278755 .0869637
------------------------------------------------------------------------------
In the table we see a number of values associated with the group vari-
able. They are (left to right): an estimate the regression coefficient β
(βˆ = −0.5958958), the standard error of the estimates, Wald’s test statistics,
P-value for the Wald’s test and a 95% confidence interval for the regression
coefficient.
Note, that the assignment of group numbers is arbitrary (ie. one could
change the assignment of 1 and 2 without any change in the meaning of
the data). This is a example of nominal scale data. It usually does not
make much sense to include such covariate in the regression equation directly.
Including a variable into the equation implicitly means that we suppose that
there is a monotonic relationship between the increase of the variable value
and the increase in the risk. For nominal scale data it is convenient to use
dummy variables in the regression equation instead of the original variable.
Assume for a moment that the variable group actually takes three values
1, 2 and 3. We want to estimate the relative risk for members of groups 2
and 3 relative to group 1. To do so we construct 2 new variables:
x2 =
1, if group = 2
0, if group 6= 2
x3 =
1, if group = 3
0, if group 6= 3
9
The dummy variables x2, x3 are then included into the regression equation:
h(t|x2, x3) = h0(t)e
β2x2+β3x3
This regression equation implies:
h(t|group = 1) = h0(t)
h(t|group = 2) = h0(t)e
β2
h(t|group = 3) = h0(t)e
β3
This means that eβ2 and eβ3 are the two relative risks we wanted to estimate.
Fortunately, STATA is able to construct dummy variables automatically. The
respective STATA command is:
xi: stcox i.group
Since our real group variable only takes two values a single dummy variable
Y2 named Igroup 2 is created by STATA and the output is:
No. of subjects = 40 Log likelihood = -100.71913
No. of failures = 36 chi2(1) = 2.88
Time at risk = 9118 Prob > chi2 = 0.0898
------------------------------------------------------------------------------
time |
event | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------+--------------------------------------------------------------------
Igroup_2 | .5510687 .1919946 -1.710 0.087 .2783836 1.090857
------------------------------------------------------------------------------
Note, that if the nohr option is not specified then hazard ratio estimates
(ie. eβ) are shown in the first column of the output table (with respective
confidence intervals in the last two columns).
10
6 Exercises
6.1 Dataset 1 (RATS.DTA)
1.1. Plot the K-M survival function estimates for groups 1 and 2. Interpret
the graphs.
1.2. Add 95% pointwise confidence intervals to the K-M plots of Ex.1.1.
Compute a 95% confidence interval for S(200) for groups 1 and 2.
1.3. Estimate median survival for groups 1 and 2. Compare the results
with your findings from Ex.1.1. Compute a 95% confidence interval for me-
dian survival for groups 1 and 2.
1.4. Test the equality of survival distributions for groups 1 and 2 by means
of a logrank test. Interpret your findings and compare with previous results.
1.5. Use Cox regression to estimate the relative risk for members of group
2 relative to group 1. What is the estimate of the regression coefficient?
Interpret the results and compare with the findings of Ex.1.1.
1.6. Determine if the relative risk in Ex.1.5 is significantly different from
1 by examining the output from the Cox procedure and by performing a
likelihood ratio test. Compare the conclusions with the result of Ex.1.4.
11
6.2 Dataset 2 (TWINS.DTA)
2.1. How do you think should the K-M plots for male and female twins look
like? Plot the respective K-M estimates to check your intuition. Interpret
the graphs.
2.2. Compute 95% confidence intervals for S(80) for males and females.
2.3. Estimate median survival for males and females. Compare the re-
sults with your findings from Ex.2.1. Compute 95% confidence intervals for
median survival for males and females.
2.4. Is the difference in male and female survival statistically significant?
Base your answer on a statistical test.
2.5. Investigate the effect of birth year and sex on survival (use the group
of males born in 1870 as baseline).
2.6. Answer the following questions using sex and byear0 as covariates:
Using the group of males born in 1870 as baseline, what is the relative risk
for a male twin born in 1930? What is the relative risk for a female twin
born in 1870 and in 1930?
2.7. Is the effect of birth year on relative risk different for the two sexes, i.e.
is there any interaction between birth year and sex?
2.8. Answer the following questions taking into account the interaction be-
tween birth year and sex. Using the group of males born in 1870 as baseline,
what is the relative risk for a male and for a female twin born in 1870 and
in 1930?
2.9. Investigate the effect of month of birth on twin survival by formulat-
12
ing relevant models and performing statistical tests (Hint: introduce dummy
variables).
6.3 Dataset 3 (VETERAN.DTA)
3.1. Perform a preliminary evaluation of test treatment effect on cancer sur-
vival by K-M plots and analysis of percentile estimates (e.g. median).
3.2. Formulate relevant models for estimation of test treatment effect on
cancer survival controlling for additional covariates and apply the models to
the data.
3.3. Based on statistical tests, determine what covariates have significant
effects on cancer survival.
6.4 Dataset 4 (MALEMICE.DTA)
4.1. Considering death from thymic lymphoma as the true death time and
death from all other causes as censoring, formulate and test hypotheses about
the significance of germ-free environment for mortality.
4.2. Carry out the analysis of Ex.4.1. for reticulum cell sarcoma.
4.3. Interpret the results of Ex.4.1. and Ex.4.2.
13
7 Homework (CANCER.DTA)
1. Investigate the effect of test treatment on survival using Kaplan-Meier
plots. Do you see any benefit of the test treatment?
2. Estimate median survival in the two treatment groups and report your
estimates along with 95% confidence intervals. Use these results to compare
the two treatments.
3. Determine the estimated survival probability at 230 days in the two treat-
ment groups. Compare your findings with the conclusion of question 2.
4. Is survival after the test treatment significantly different from survival
after the standard treatment? Base your answer on the logrank test.
5. Determine the effect of treatment on survival controlling only for age
using Cox regression. Are the effects of treatment statistically significant?
6. Evaluate the effect of treatment on survival controlled for age and Karnof-
sky score using Cox regression. Determine the 95% confidence interval for the
relative risk in the test treatment group relative to the standard treatment
group adjusted for age and Karnofsky score. Is the test treatment signifi-
cantly better than the standard one when you control for age and Karnofsky
score? Compare the result to your findings in question 5.
7. Consider again the analysis of question 6. What can you tell about the
effects of age and Karnofsky score on survival? Why is it a good idea to
control for Karnofsky score?
8. Write down the estimated Cox regression model. What is the relative risk
for a 50 year old patient with Karnofsky score 90 in the test treatment group
relative to a 50 year old patient with Karnofsky score 10 in the standard
treatment group?
14
8 STATA survival analysis reference
Some variable naming conventions:
• time is always used for the survival time
• event is used as death indicator: 1 - indicates death, 0 - censoring
Note: event is not present in MALEMICE
Start the data analysis by:
• Loading the dataset (use the File-Open menu)
• Preparing the data for survival analysis by: stset time event
or for MALEMICE data by one of the following:
stset time, failure(event0), stset time, failure(event1) or
stset time, failure(event2)
Survival analysis command summary:
• Prepare a dataset for survival analysis:
stset time, failure(event)
• Plot K-M estimates by variable group with 95% CI
sts graph, by(group) gwood
• Compute K-M estimates at age 200 by variable group
sts list, by(group) at(200 201)
• Estimate median survival time (by group)
stsum, by(group)
• Determine the 95% CI for the median (by group)
stci, by(group)
• Test for equality of survival time distribution using logrank test.
The subpopulations to be compared are given by the variable group
sts test group, logrank
15
• Do Cox regression on the variable group reporting estimates of the
regression coefficients
stcox group, nohr
• Do Cox regression on the nominal scale variable group by creating
dummy variables. Report estimates of the hazard ratios.
xi: stcox i.group
• Test for significance of several variables x3,x4,x5 in Cox regression
using likelihood ratio test.
stcox x1 x2 x3 x4 x5
est store A
stcox x1 x2
lrtest A
• Do Cox regression on a null model (a model with no covariates). This
is useful for likelihood ratio tests.
stcox, estimate
• Do Cox regression using a categorical variable sex and a continuous
variable byear0 and their interaction (creating dummy variables for
sex)
xi: stcox i.sex*byear0
• Solution to Ex.2.7 and 2.8 using Cox regression:
gen x=sex*byear0
st
本文档为【生存分析讲义】,请使用软件OFFICE或WPS软件打开。作品中的文字与图均可以修改和编辑,
图片更改请在作品中右键图片并更换,文字修改请直接点击文字进行修改,也可以新增和删除文档中的内容。
该文档来自用户分享,如有侵权行为请发邮件ishare@vip.sina.com联系网站客服,我们会及时删除。
[版权声明] 本站所有资料为用户分享产生,若发现您的权利被侵害,请联系客服邮件isharekefu@iask.cn,我们尽快处理。
本作品所展示的图片、画像、字体、音乐的版权可能需版权方额外授权,请谨慎使用。
网站提供的党政主题相关内容(国旗、国徽、党徽..)目的在于配合国家政策宣传,仅限个人学习分享使用,禁止用于任何广告和商用目的。