首页 生存分析讲义

生存分析讲义

生存分析讲义 Additional notes: Survival analysis with STATA Ivan Iachine Revision 2.30, 29.10.2003 1 Introduction These notes contain additional material not included in the slides used at the lectures. In particular the use of STATA survival analysis procedures is...

Additional notes: Survival analysis with STATA Ivan Iachine Revision 2.30, 29.10.2003 1 Introduction These notes contain additional material not included in the slides used at the lectures. In particular the use of STATA survival analysis procedures is illustrated in detail. The STATA output presented in these notes was produced using STATA version 8.1. It is assumed that the reader has some basic knowledge about STATA and the main concepts of STATA datasets (ie. what is a variable, etc). In particular, one has to be able to start STATA, to open files containing STATA datasets, to open the data in the browser and to enter commands in the STATA command line. The description of STATA survival analysis procedures in these notes is based on the “RATS” dataset and Exercise series 1. 2 Preparing a dataset for survival analysis Note, that STATA allocates memory for datasets statically, meaning that there is a fixed amount of space reserved for data. The size of this space can only be increased if all data is deleted from the memory. Consequently, creating new variables in large datasets can fail. In this case one has to clear the memory and increase the amount of space reserved for data manually by 1 using the set memory command described in the STATA survival analysis reference section. Therefore it is recommended to adjust the memory size before the actual analysis of the data. Before one can carry out any STATA survival analysis procedures on a dataset, it must be prepared for usage by means of the STATA’s stset command. In the following the preparation procedure is described in more detail. First one has to open the file containing the dataset in STATA (eg. using the File → Open... menu or by entering the use rats command). Next, inspect the variables contained in the dataset in order to find the failure time variable and the censoring indicator variable. The datasets used in the course observe the following naming conventions: the failure time variable is called time and the censoring indicator variable is called event (except for the dataset “MALEMICE” where there are several censoring indicators event0,event1,event2 associated with different death causes). Since we only consider right censoring with a single event per individual without delayed entry in the course, the two variables are sufficient. Next, enter the following command in the command line: stset time, failure(event) You should now see an output that looks like the following: . stset time, failure(event) failure event: event != 0 & event < . obs. time interval: (0, time] exit on or before: failure ------------------------------------------------------------------------------ 40 total obs. 0 exclusions ------------------------------------------------------------------------------ 40 obs. remaining, representing 36 failures in single record/single failure data 2 9118 total analysis time at risk, at risk from t = 0 earliest observed entry t = 0 last observed exit t = 344 The dataset name field above should contain the name of the dataset you have opened previously. Now you are ready to use STATA’s survival analysis procedures. Their names usually start with the letters st (eg. stcox, sts, etc). The idea of preparing a dataset for subsequent survival analysis using stset is that you do not have to write the name of the failure time variable and censoring in- dicator each time you call a survival analysis procedure – STATA remembers this for you. Of course, if you open a different dataset you will have to use the stset command again. In the following we assume that the dataset has been correctly prepared for the analysis using stset. 3 Kaplan-Meier estimates and related topics Kaplan-Meier estimates are classical estimates of the survival function, as discussed in the theoretical section. In this section we show how to plot these estimates on a graph with STATA, including the use of stratification techniques to compare survival of several populations and how to obtain an estimate of the median survival times from the Kaplan-Meier estimator. It is shown how to obtain confidence intervals for the median survival time and for the value of the survival function at a particular age. 3.1 Kaplan-Meier plots To plot the Kaplan-Meier (K-M) estimates on a graph the command sts graph is used. This would plot a single graph for the entire dataset. If the dataset contains several groups, say the variable group contains the group number, one may want to plot several K-M estimates on the same graph in order to compare survival of the subpopulations. To do this issue the 3 0 .0 0 0 .2 5 0 .5 0 0 .7 5 1 .0 0 0 100 200 300 400 analysis time group = 1 group = 2 Kaplan−Meier survival estimates, by group Figure 1: Exercise 1.1 command: sts graph, by(group) The output from STATA’s graphics window is shown on Figure 1. Now it is possible to compare the survival of the two groups. 3.2 Pointwise confidence intervals It is possible to use STATA for computation of pointwise confidence intervals for S(t) (ie. the value of the survival function at a given age t) based on the Greenwood’s formula introduced in the theoretical section. In order to plot the confidence intervals on a graph together with the K-M estimate use the command: sts graph, by(group) gwood The default is to compute 95% confidence intervals (see the STATA manual for information of how to change the confidence level). This will actually 4 0 .2 5 .5 .7 5 1 0 100 200 300 400 0 100 200 300 400 1 2 95% CI Survivor function analysis time Graphs by pretreatment regime Kaplan−Meier survival estimates, by group Figure 2: Exercise 1.2 produce two graphs, one for group=1, another for group=2, as shown on Fig- ure 2. This gives some idea about the quality of K-M estimates computed for our dataset. However, it is a bad idea to read the confidence intervals di- rectly from the graphs. In Exercise 1.2 we have to compute a 95% confidence interval for S(200). To compute it directly use the following command: sts list, by(group) at(200 201) This produces the following output (only relevant part of the output shown): 5 Beg. Survivor Std. Time Total Fail Function Error [95% Conf. Int.] ---------------------------------------------------------------------- group=1 200 14 6 0.6842 0.1066 0.4279 0.8439 201 14 0 0.6842 0.1066 0.4279 0.8439 group=2 200 18 4 0.8095 0.0857 0.5689 0.9239 201 18 0 0.8095 0.0857 0.5689 0.9239 ---------------------------------------------------------------------- Note, that in order to compute the values at t = 200 we had to request the values for two ages (200 and 201). This is because specifying the at(200) option alone in the sts command would have produced values of S(t) at 200 uniformly spaced points (for details see the STATA manual). From the out- put we can read the estimated value of S(200) for group=1 and group=2 being equal to 0.6842 and 0.8095 with 95% confidence intervals [0.4279, 0.8439] and [0.5689, 0.9239] respectively. 3.3 Estimating median survival time As it was presented in the theoretical section, the median survival time es- timate can be obtained using the K-M estimate of the survival function, namely as the smallest observed failure time where the K-M estimate is not greater than 0.5. To compute the median estimates for the two groups use the following command: stsum, by(group) The STATA output is as follows: | incidence no. of |------ Survival time -----| group | time at risk rate subjects 25% 50% 75% ------+-------------------------------------------------------------- 1 | 4095 .0041514 19 190 216 234 6 2 | 5023 .0037826 21 232 233 280 ------+-------------------------------------------------------------- total | 9118 .0039482 40 205 232 261 The median survival time estimate can be read in the output as the 50% percentile. In our case the median estimates for the two groups are 216 and 233 respectively. The 95% confidence intervals for the median survival times may be com- puted using the stci command: stci, median by(group) The STATA output is as follows: failure _d: event analysis time _t: time | no. of group | subjects 50% Std. Err. [95% Conf. Interval] ------+------------------------------------------------------------- 1 | 19 216 5.171042 190 234 2 | 21 233 2.179595 232 280 ------+------------------------------------------------------------- total | 40 232 2.562933 213 239 The 95% confidence interval for the median for group=1 is [190, 234] and for group=1 the respective interval is [232, 280]. 4 Testing for equality of survival distributions 4.1 Logrank test The logrank test allows for testing for the equality of two or more survival distributions, as discussed in the theoretical section. In our case (Exercise 1.4) we wish to test the hypothesis that the two groups (ie. group=1 and 7 group=2) have equal survival distributions. To test this hypothesis using the logrank test enter the following STATA command: sts test group, logrank The output is (only relevant parts shown): | Events group | observed expected ------+------------------------- 1 | 17 12.24 2 | 19 23.76 ------+------------------------- Total | 36 36.00 chi2(1) = 3.12 Pr>chi2 = 0.0772 The most important figure for us is the P-value (denoted Pr>chi2 in the output). In our case it is equal to 0.0772 and the hypothesis about the equality of the survival distributions is accepted on the 5% significance level, since 0.0772 > 0.05. 5 Cox regression Cox regression procedure is used to evaluate the effects of explanatory vari- ables or covariates on the hazard rate using a proportinal hazard regression model discussed in the theoretical section shown below for the case of a single covariate x: h(t|x) = h0(t)e βx Our dataset “RATS” contains a single explanatory variable group that takes values 1 and 2. The following STATA command executes the Cox regression procedure to estimate β: stcox group, nohr The output is: 8 No. of subjects = 40 Log likelihood = -100.71913 No. of failures = 36 chi2(1) = 2.88 Time at risk = 9118 Prob > chi2 = 0.0898 ------------------------------------------------------------------------------ time | event | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- group | -.5958958 .3484041 -1.710 0.087 -1.278755 .0869637 ------------------------------------------------------------------------------ In the table we see a number of values associated with the group vari- able. They are (left to right): an estimate the regression coefficient β (βˆ = −0.5958958), the standard error of the estimates, Wald’s test statistics, P-value for the Wald’s test and a 95% confidence interval for the regression coefficient. Note, that the assignment of group numbers is arbitrary (ie. one could change the assignment of 1 and 2 without any change in the meaning of the data). This is a example of nominal scale data. It usually does not make much sense to include such covariate in the regression equation directly. Including a variable into the equation implicitly means that we suppose that there is a monotonic relationship between the increase of the variable value and the increase in the risk. For nominal scale data it is convenient to use dummy variables in the regression equation instead of the original variable. Assume for a moment that the variable group actually takes three values 1, 2 and 3. We want to estimate the relative risk for members of groups 2 and 3 relative to group 1. To do so we construct 2 new variables: x2 =   1, if group = 2 0, if group 6= 2 x3 =   1, if group = 3 0, if group 6= 3 9 The dummy variables x2, x3 are then included into the regression equation: h(t|x2, x3) = h0(t)e β2x2+β3x3 This regression equation implies: h(t|group = 1) = h0(t) h(t|group = 2) = h0(t)e β2 h(t|group = 3) = h0(t)e β3 This means that eβ2 and eβ3 are the two relative risks we wanted to estimate. Fortunately, STATA is able to construct dummy variables automatically. The respective STATA command is: xi: stcox i.group Since our real group variable only takes two values a single dummy variable Y2 named Igroup 2 is created by STATA and the output is: No. of subjects = 40 Log likelihood = -100.71913 No. of failures = 36 chi2(1) = 2.88 Time at risk = 9118 Prob > chi2 = 0.0898 ------------------------------------------------------------------------------ time | event | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- Igroup_2 | .5510687 .1919946 -1.710 0.087 .2783836 1.090857 ------------------------------------------------------------------------------ Note, that if the nohr option is not specified then hazard ratio estimates (ie. eβ) are shown in the first column of the output table (with respective confidence intervals in the last two columns). 10 6 Exercises 6.1 Dataset 1 (RATS.DTA) 1.1. Plot the K-M survival function estimates for groups 1 and 2. Interpret the graphs. 1.2. Add 95% pointwise confidence intervals to the K-M plots of Ex.1.1. Compute a 95% confidence interval for S(200) for groups 1 and 2. 1.3. Estimate median survival for groups 1 and 2. Compare the results with your findings from Ex.1.1. Compute a 95% confidence interval for me- dian survival for groups 1 and 2. 1.4. Test the equality of survival distributions for groups 1 and 2 by means of a logrank test. Interpret your findings and compare with previous results. 1.5. Use Cox regression to estimate the relative risk for members of group 2 relative to group 1. What is the estimate of the regression coefficient? Interpret the results and compare with the findings of Ex.1.1. 1.6. Determine if the relative risk in Ex.1.5 is significantly different from 1 by examining the output from the Cox procedure and by performing a likelihood ratio test. Compare the conclusions with the result of Ex.1.4. 11 6.2 Dataset 2 (TWINS.DTA) 2.1. How do you think should the K-M plots for male and female twins look like? Plot the respective K-M estimates to check your intuition. Interpret the graphs. 2.2. Compute 95% confidence intervals for S(80) for males and females. 2.3. Estimate median survival for males and females. Compare the re- sults with your findings from Ex.2.1. Compute 95% confidence intervals for median survival for males and females. 2.4. Is the difference in male and female survival statistically significant? Base your answer on a statistical test. 2.5. Investigate the effect of birth year and sex on survival (use the group of males born in 1870 as baseline). 2.6. Answer the following questions using sex and byear0 as covariates: Using the group of males born in 1870 as baseline, what is the relative risk for a male twin born in 1930? What is the relative risk for a female twin born in 1870 and in 1930? 2.7. Is the effect of birth year on relative risk different for the two sexes, i.e. is there any interaction between birth year and sex? 2.8. Answer the following questions taking into account the interaction be- tween birth year and sex. Using the group of males born in 1870 as baseline, what is the relative risk for a male and for a female twin born in 1870 and in 1930? 2.9. Investigate the effect of month of birth on twin survival by formulat- 12 ing relevant models and performing statistical tests (Hint: introduce dummy variables). 6.3 Dataset 3 (VETERAN.DTA) 3.1. Perform a preliminary evaluation of test treatment effect on cancer sur- vival by K-M plots and analysis of percentile estimates (e.g. median). 3.2. Formulate relevant models for estimation of test treatment effect on cancer survival controlling for additional covariates and apply the models to the data. 3.3. Based on statistical tests, determine what covariates have significant effects on cancer survival. 6.4 Dataset 4 (MALEMICE.DTA) 4.1. Considering death from thymic lymphoma as the true death time and death from all other causes as censoring, formulate and test hypotheses about the significance of germ-free environment for mortality. 4.2. Carry out the analysis of Ex.4.1. for reticulum cell sarcoma. 4.3. Interpret the results of Ex.4.1. and Ex.4.2. 13 7 Homework (CANCER.DTA) 1. Investigate the effect of test treatment on survival using Kaplan-Meier plots. Do you see any benefit of the test treatment? 2. Estimate median survival in the two treatment groups and report your estimates along with 95% confidence intervals. Use these results to compare the two treatments. 3. Determine the estimated survival probability at 230 days in the two treat- ment groups. Compare your findings with the conclusion of question 2. 4. Is survival after the test treatment significantly different from survival after the standard treatment? Base your answer on the logrank test. 5. Determine the effect of treatment on survival controlling only for age using Cox regression. Are the effects of treatment statistically significant? 6. Evaluate the effect of treatment on survival controlled for age and Karnof- sky score using Cox regression. Determine the 95% confidence interval for the relative risk in the test treatment group relative to the standard treatment group adjusted for age and Karnofsky score. Is the test treatment signifi- cantly better than the standard one when you control for age and Karnofsky score? Compare the result to your findings in question 5. 7. Consider again the analysis of question 6. What can you tell about the effects of age and Karnofsky score on survival? Why is it a good idea to control for Karnofsky score? 8. Write down the estimated Cox regression model. What is the relative risk for a 50 year old patient with Karnofsky score 90 in the test treatment group relative to a 50 year old patient with Karnofsky score 10 in the standard treatment group? 14 8 STATA survival analysis reference Some variable naming conventions: • time is always used for the survival time • event is used as death indicator: 1 - indicates death, 0 - censoring Note: event is not present in MALEMICE Start the data analysis by: • Loading the dataset (use the File-Open menu) • Preparing the data for survival analysis by: stset time event or for MALEMICE data by one of the following: stset time, failure(event0), stset time, failure(event1) or stset time, failure(event2) Survival analysis command summary: • Prepare a dataset for survival analysis: stset time, failure(event) • Plot K-M estimates by variable group with 95% CI sts graph, by(group) gwood • Compute K-M estimates at age 200 by variable group sts list, by(group) at(200 201) • Estimate median survival time (by group) stsum, by(group) • Determine the 95% CI for the median (by group) stci, by(group) • Test for equality of survival time distribution using logrank test. The subpopulations to be compared are given by the variable group sts test group, logrank 15 • Do Cox regression on the variable group reporting estimates of the regression coefficients stcox group, nohr • Do Cox regression on the nominal scale variable group by creating dummy variables. Report estimates of the hazard ratios. xi: stcox i.group • Test for significance of several variables x3,x4,x5 in Cox regression using likelihood ratio test. stcox x1 x2 x3 x4 x5 est store A stcox x1 x2 lrtest A • Do Cox regression on a null model (a model with no covariates). This is useful for likelihood ratio tests. stcox, estimate • Do Cox regression using a categorical variable sex and a continuous variable byear0 and their interaction (creating dummy variables for sex) xi: stcox i.sex*byear0 • Solution to Ex.2.7 and 2.8 using Cox regression: gen x=sex*byear0 st

                    本文档为【生存分析讲义】，请使用软件OFFICE或WPS软件打开。作品中的文字与图均可以修改和编辑，
                    图片更改请在作品中右键图片并更换，文字修改请直接点击文字进行修改，也可以新增和删除文档中的内容。 
 该文档来自用户分享，如有侵权行为请发邮件ishare@vip.sina.com联系网站客服，我们会及时删除。

                    [版权声明] 本站所有资料为用户分享产生，若发现您的权利被侵害，请联系客服邮件isharekefu@iask.cn，我们尽快处理。

                    本作品所展示的图片、画像、字体、音乐的版权可能需版权方额外授权，请谨慎使用。

                    网站提供的党政主题相关内容(国旗、国徽、党徽..)目的在于配合国家政策宣传，仅限个人学习分享使用，禁止用于任何广告和商用目的。
                

下载需要：免费已有0 人下载

立即下载

生存分析讲义

你可能还喜欢