Predicting Customer Churn in the Telecommunications Industry: An Application of a SAS Survival Analysis Model

Title: Predicting Customer Churn in the Telecommunications Industry: An Application of Survival Analysis Modeling Using SAS

ABSTRACT

Conventional statistical methods (e.g., logistic regression and decision trees) are very successful at predicting customer churn. However, these methods can hardly predict when customers will churn or how long they will stay. The goal of this study is to apply survival analysis techniques to predict customer churn using data from a telecommunications company. The study helps telecommunications companies understand customer churn risk and churn hazard over time by predicting which customers will churn and when they will churn. The findings can help telecommunications companies optimize their customer retention and/or treatment resources in their churn reduction efforts.

INTRODUCTION

In the telecommunications industry, customers can choose among multiple service providers and actively exercise their right to switch from one provider to another. In this fiercely competitive market, customers demand tailored products and better services at lower prices, while service providers constantly focus on acquisition as their business goal. Given that the telecommunications industry experiences an average annual churn rate of 30-35 percent, and that it costs 5-10 times more to recruit a new customer than to retain an existing one, customer retention has become even more important than customer acquisition. For many incumbent operators, retaining highly profitable customers is the number one business pain. Many telecommunications companies deploy retention strategies that synchronize programs and processes to keep customers longer by providing them with tailored products and services.
With retention strategies in place, many companies have started to include churn reduction among their business goals. To help telecommunications companies manage churn reduction, we need to predict not only which customers are at high risk of churn but also how soon these high-risk customers will churn. The companies can then optimize their marketing intervention resources to prevent as many customers as possible from churning. In other words, if telecommunications companies know which customers are at high risk of churn and when they will churn, they can design customized customer communication and treatment programs in a timely and efficient manner.

Conventional statistical methods (e.g., logistic regression and decision trees) are very successful at predicting customer churn, but they can hardly predict when customers will churn or how long they will stay. Survival analysis, by contrast, was designed from the very beginning to handle survival data, and is therefore an efficient and powerful tool for predicting customer churn.

OBJECTIVES

The objectives of this study are twofold. The first is to estimate the customer survival function and the customer hazard function to gain knowledge of customer churn over the course of customer tenure. The second is to demonstrate how survival analysis techniques are used to identify which customers are at high risk of churn and when they will churn.

DEFINITIONS AND EXCLUSIONS

This section clarifies some of the important concepts and exclusions used in this study.

Churn: In the telecommunications industry, the broad definition of churn is the cancellation of a customer's telecommunications service. This includes both service-provider initiated churn and customer initiated churn. An example of service-provider initiated churn is a customer's account being closed because of payment default.
Customer initiated churn is more complicated, and the reasons behind it vary. In this study, only customer initiated churn is considered, and it is defined by a series of cancel reason codes. Examples of reason codes are: unacceptable call quality, more favorable competitor's pricing plan, misinformation given by sales, customer expectation not met, billing problem, moving, change in business, and so on.

High-Value Customers: Only customers who have received at least three monthly bills are considered in the study. High-value customers are those with average monthly revenue of $X or more over the last three months. If a customer's first invoice covers less than 30 days of service, the customer's monthly revenue is prorated to a full month's revenue.

Granularity: This study examines customer churn at the account level.

Exclusions: This study does not distinguish international customers from domestic customers, although it would be desirable to investigate international customer churn separately from domestic customer churn in the future. The study also excludes employee accounts, since churn for employee accounts is neither a problem nor of interest for the company.

SURVIVAL ANALYSIS AND CUSTOMER CHURN

Survival analysis is a class of statistical methods for studying the occurrence and timing of events. From the beginning, survival analysis was designed for longitudinal data on the occurrence of events, and tracking customer churn is a good example of survival data. Survival data have two common features that are difficult to handle with conventional statistical methods: censoring and time-dependent covariates.

Generally, the survival function and the hazard function are used to describe the status of customer survival over the period of observation. The survival function gives the probability of surviving beyond a certain time point t.
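In standard survival-analysis notation, the survival function and the hazard function named here can be written as follows. The symbols T (a customer's time to churn, measured from the origin of time), f (the density of T), S and h are notation introduced for this note rather than taken from the paper, and the formulas assume that T is treated as a continuous random variable.

```latex
S(t) = \Pr(T > t), \qquad
h(t) = \lim_{\Delta t \to 0}
       \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
     = \frac{f(t)}{S(t)}, \qquad
S(t) = \exp\!\left( -\int_0^t h(u)\,du \right)
```

The last identity shows that either function determines the other, so an estimate of the hazard also yields the survival curve.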
The hazard function, in contrast, describes the risk of the event (here, customer churn) in a short interval after time t, conditional on the customer having survived to time t. The hazard function is therefore more intuitive to use in survival analysis, because it quantifies the instantaneous risk that churn will occur at time t given that the customer has survived to time t.

For survival analysis, the best observation plan is prospective. We begin observing a set of customers at some well-defined point in time (called the origin of time) and then follow them for a substantial period, recording the times at which customer churn occurs. Not every customer needs to experience churn: customers who have not yet churned are called censored cases, while customers who have already churned are called observed cases. Typically, we want not only to predict the timing of customer churn but also to analyze how time-dependent covariates (e.g., customer calls to service centers, changes of plan type, changes of billing options) affect the occurrence and timing of churn.

SAS/STAT has two procedures for regression analysis of survival data: PROC LIFEREG and PROC PHREG. The LIFEREG procedure fits parametric regression models to censored survival data using maximum likelihood estimation. The PHREG procedure performs semi-parametric regression analysis using partial likelihood estimation. PROC PHREG has gained popularity over PROC LIFEREG in the last decade because it handles time-dependent covariates. However, if the shapes of the survival distribution and hazard function are known, PROC LIFEREG produces more efficient estimates (with smaller standard errors) than PROC PHREG does.

SAMPLING STRATEGY

On August 16, 2000, a sample of 41,374 active high-value customers was randomly selected from the entire customer base of a telecommunications company. All of these customers were followed for the next 15 months.
August 16, 2000 is therefore the origin of time, and November 15, 2001 is the observation termination time. During this 15-month observation period, the timing of customer churn was recorded. For each customer in the sample, a variable DUR records the time at which churn occurred or, for censored cases, the last time at which the customer was observed, both measured from the origin of time (August 16, 2000). A second variable, STATUS, distinguishes censored cases from observed cases. It is common to set STATUS = 1 for observed cases and STATUS = 0 for censored cases. In this study, the survival data are singly right censored, so all censored cases have a value of 15 (months) for DUR.

DATA SOURCES

There are four major data sources for this study: block level marketing and financial information, customer level demographic data provided by a third-party vendor, customer internal data, and customer contact records. A brief description of some of the data sources follows.

Demographic Data: Demographic data come from a third-party vendor. In this study, the following are examples of customer level demographic information:
- Primary household member's age
- Gender and marital status
- Number of adults
- Primary household member's occupation
- Household estimated income and wealth ranking
- Number of children and children's age
- Number of vehicles and vehicle value
- Credit card
- Frequent traveler
- Responder to mail orders
- Dwelling and length of residence

Customer Internal Data: Customer internal data come from the company's data warehouse and consist of two parts. The first part covers customer information such as market channel, plan type, bill agency, customer segmentation code, ownership of the company's other products, dispute, late fee charge, discount, promotion/save promotion, additional lines, toll free services, rewards redemption, billing dispute, and so on.
The second part of the customer internal data is the customer's telecommunications usage data. Examples of customer usage variables are:
- Weekly average call counts
- Percentage change of minutes
- Share of domestic/international revenue

Customer Contact Records: The company's Customer Information System (CIS) stores detailed records of customer contacts. These mainly include customer calls to service centers and the company's mail contacts to customers. The contact records are then classified into customer contact categories, such as customer general inquiry, customer request to change service, customer inquiry about cancellation, and so on.

MODELING PROCESS

The modeling process includes the following four major steps.

Exploratory Data Analysis (EDA): Exploratory data analysis was conducted to prepare the data for the survival analysis. A univariate frequency analysis was used to identify value distributions, missing values, and outliers. Some numerical variables were transformed to reduce skewness, because such transformations help improve the fit of a model to the data. Outliers and other extreme values judged unsuitable for the analysis were filtered out; filtering extreme values from the training data tends to produce better models because the parameter estimates are more stable.

Missing values were not a major issue except for the demographic variables. Demographic variables with more than 20% missing values were eliminated. For the remaining observations with missing values, one choice is to use only complete observations, but that ignores useful information in the variables that have nonmissing values. It may also bias the sample, since observations with missing values may have other things in common as well. Therefore, in this study, missing values were replaced using appropriate methods.
For interval variables, replacement values were drawn from random percentiles of the variable's distribution, i.e., values were assigned according to the probability distribution of the nonmissing observations. Missing values for class variables were replaced with the most frequent value (the mode).

Variable Reduction: Starting with 212 variables in the original data set, an initial univariate analysis using PROC FREQ crossed each categorical variable with customer churn status (STATUS) to determine which categorical variables were statistically significant and should be included in the next modeling step. All categorical variables with a chi-square or t statistic p-value of 0.05 or less were kept. This step reduced the number of variables to 115 (&VARLIST1), including all the numerical variables and the categorical variables kept from step one.

The next step used PROC PHREG to further reduce the number of variables. A stepwise selection method was used to create a final model with statistically significant effects of 29 explanatory variables on customer churn over time.

   /* Stepwise Cox regression; STATUS = 0 marks censored cases */
   PROC PHREG DATA = SASOUT2.ALL2 OUTEST = SASOUT2.BETA;
      MODEL DUR*STATUS(0) = &VARLIST1
            / SELECTION = STEPWISE SLENTRY = 0.0025 SLSTAY = 0.0025 DETAILS;
   RUN;

Model Estimation: With only 29 explanatory variables, the final data set has a reasonable number of variables for survival analysis. Before the survival analysis procedures were applied to the final data set, the customer survival function and hazard function were estimated using the PROC LIFETEST code that follows. The purpose of estimating the customer survival function and hazard function is to gain knowledge of customer churn hazard characteristics. Judging from the shape of the hazard function, customer churn in this study shows the typical hazard function of a log-normal model.
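For reference, the log-normal model mentioned here has a closed form. The sketch below uses the usual accelerated failure time parameterization, which is also how PROC LIFEREG specifies its log-normal distribution. The symbols x (covariate vector), \beta (coefficients), \sigma (scale), and \phi and \Phi (standard normal density and distribution function) are notation assumed for this note, not taken from the paper.

```latex
\log T = x^{\top}\beta + \sigma\,\varepsilon, \qquad \varepsilon \sim N(0, 1)

S(t \mid x) = 1 - \Phi(z), \qquad
h(t \mid x) = \frac{\phi(z)}{\sigma\, t\, \bigl[ 1 - \Phi(z) \bigr]}, \qquad
z = \frac{\log t - x^{\top}\beta}{\sigma}
```

Under this form the hazard starts at zero, rises to a single peak, and then declines, which is the characteristic shape of a log-normal hazard.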
As previously discussed, because the shape of the survival distribution and hazard function is known, PROC LIFEREG produces more efficient estimates (with smaller standard errors) than PROC PHREG does.

   /* Life-table estimates of the survival (S) and hazard (H) functions */
   PROC LIFETEST DATA = SASOUT2.ALL3 OUTSURV = SASOUT2.OUTSURV
                 METHOD = LIFE PLOT = (S, H) WIDTH = 1 GRAPHICS;
      TIME DUR*STATUS(0);
   RUN;

The final step is to estimate customer churn. PROC LIFEREG was used to calculate customer survival probability. At this step the final data set was split 50/50 into a model data set and a validation data set. The model data set is used to fit the model, and the validation data set is used to score the survival probability for each customer. A variable USE distinguishes the model data set (USE = 0) from the validation data set (USE = 1). In the validation data set, both DUR and STATUS are set to missing so that validation cases are not used in model estimation.

Source: Jun Xiang Lu, Ph.D. Predicting Customer Churn in the Telecommunications Industry: An Application of Survival Analysis Modeling Using SAS. SAS User Group International (SUGI27) Online Proceedings, 2002, Paper No. 114-27.

Author: Jun Xiang Lu, Ph.D.
Sprint Communications Company, Overland Park, Kansas