New Techniques for Applying Nearest Neighbor Methods to Imputation and Classification


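Graduate student: Manlong Zhu (朱曼龙)
Supervisor: Professor Shichao Zhang (张师超)
Major: Computer Software and Theory
Research area: Data Mining
Grade: 2007

Abstract

In today's Internet era, processing massive amounts of information has become a major need in China's economic development. The nearest neighbor method is one of the most important theories and techniques in massive information processing: it uses known nearest points to estimate or approximate the solution of a problem, and so provides a simple, easily understood, and effective theory and technique for massive-information computing and services. This thesis studies new techniques and algorithms for applying the nearest neighbor method to missing value imputation and classification.

First, the k nearest neighbor algorithm is studied from the perspective of missing value imputation and data classification. Its basic principle is described in detail, and its strengths, weaknesses, and some common improvements are analyzed. On this basis, with higher imputation (classification) accuracy as the main goal, the thesis proposes new improvement strategies that target certain shortcomings of the k nearest neighbor algorithm and verifies their effectiveness both theoretically and experimentally.

On the one hand, the thesis studies new theory and algorithms for nearest neighbor imputation. Because the k nearest neighbor imputation algorithm (kNNI) may be biased in selecting the k nearest neighbors of a missing datum, a new imputation algorithm is proposed: the quadrant nearest neighbor imputation algorithm QENNI, a kind of shell neighbor imputation algorithm. It uses only the nearest neighbors in the quadrant directions around the missing datum to impute the missing value, which avoids the bias in kNNI's choice of k nearest neighbors. Furthermore, the thesis analyzes the shell neighbor imputation algorithm (SNI) [1,2] under three possible weighting schemes. The experiments lead to the conclusion that repeated selection of neighbor data in shell neighbor imputation helps improve imputation, and that fdwSNI, the shell neighbor imputation algorithm weighted by frequency and distance, gives the best imputation results.

For missing value imputation the algorithm matters, but a good evaluation measure clearly provides effective guidance for choosing an algorithm. Through a concrete example, the thesis shows that the commonly used imputation metric RMSE is easily skewed by severe imputation errors, and it proposes a new evaluation measure called goodness. Even when a few severe imputation errors are present, goodness still yields a sound conclusion.

On the other hand, the thesis builds a shell neighbor classification algorithm, SNC, which overcomes the possible bias in kNN's nearest neighbor selection. The algorithm is insensitive to the distance metric and classifies better on large data sets. In addition, in practical data mining applications the data are usually of poor quality or incomplete, so developing mining algorithms with good noise robustness is a challenging task of practical value. Noise elimination is often difficult and expensive, and discarding historical data in exchange for complete information greatly reduces the amount of data available for analysis, wastes resources, and throws away much of the information hidden in the data. kNN is a distance-based, locally optimal algorithm. It ignores the effect that the data distribution over part or all of the data may have on the classification result, which makes the classifier more sensitive to noisy data in the training set. The thesis proposes to consider jointly the data distribution characteristics of the k nearest neighbors, the cluster, and the training set, and builds a new classification algorithm called NCT. Because it makes full use of local, partial, and global data resources, the algorithm has good noise robustness. Experimental results show that NCT not only classifies better but is also robust in noisy environments. In noise-free environments NCT is slightly better than kNN; in noisy environments its classification accuracy is clearly higher than that of kNN, and the higher the noise rate, the more pronounced the advantage. Finally, the cluster information and global information introduced by NCT are combined in other forms. Experiments show that in noisy environments adding the new information improves kNN's classification whichever combination is used, while NCT with the linear interpolation combination raises classification accuracy the most.

In short, the main innovations of the thesis can be summarized as follows:

(1) A quadrant nearest neighbor imputation algorithm, QENNI, is proposed to overcome the possible bias of kNNI in selecting the nearest neighbors of a missing datum.
(2) A new evaluation measure for missing value imputation, goodness, is proposed; when a few data items have severe imputation errors, goodness is a better measure than RMSE.
(3) A new shell neighbor classification algorithm, SNC, is constructed. It overcomes the possible bias in kNN's nearest neighbor selection, is insensitive to the distance metric, and classifies better on large data sets.
(4) An NCT classification algorithm is proposed that jointly considers the data distribution characteristics of the k nearest neighbors, the cluster, and the training set, which effectively strengthens robustness to noise.

To demonstrate their effectiveness, all algorithms proposed in the thesis were tested extensively on real data sets. The results show that the proposed QENNI, SNC, and NCT algorithms all outperform the k nearest neighbor algorithm, and that NCT in particular has a marked advantage in classification in noisy environments.

Keywords: k nearest neighbor algorithm, shell neighbor, missing value imputation, classification

New Technologies for Imputation and Classification Based on NN Approach

Name: Manlong Zhu
Supervisor: Professor Shichao Zhang
Major: Computer Software & Theory
Subject: Data Mining
Grade: 2007

Abstract

In today's Internet age, massive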
information processing has become a major need in China's economic development. The nearest neighbor method is one of the most important theories and techniques in massive information processing. By using known nearest points to estimate or approximate the answer to a user query, it provides a simple, easily understood, and effective theory and technology for processing and sharing massive information. This thesis studies new techniques and algorithms for imputation and classification based on the nearest neighbor method.

First, the thesis studies the k nearest neighbor algorithm from the perspective of missing data imputation and data classification. It describes the basic principle of the algorithm in detail, analyzes its advantages and disadvantages, and summarizes some commonly used improvements. On this basis, and with higher imputation (classification) accuracy as the goal, the thesis presents new strategies that address some shortcomings of the k nearest neighbor algorithm and verifies their effectiveness both theoretically and experimentally.

On the one hand, the thesis studies new theories and algorithms for nearest neighbor imputation. Because the k-Nearest Neighbor Imputation (kNNI) algorithm is often biased in choosing the k nearest neighbors of a missing datum, a new imputation method for missing values is put forward: the Quadrant-Encapsidated-Nearest-Neighbor based Imputation method (QENNI). The algorithm uses only the quadrant nearest neighbors (the points of the encapsulant) around a missing datum to impute it, so it is not biased in selecting nearest neighbors. In addition, the thesis analyzes three possible weighting methods for the Shell Neighbor Imputation (SNI) algorithm [1,2] and concludes that selecting duplicate neighbors helps improve the imputation performance of SNI, and that the Frequency-Distance Weighted Shell Neighbor Imputation method (fdwSNI) gives the best imputation results.
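The neighbor-selection bias described above can be illustrated with a small sketch. The abstract does not spell out how QENNI defines its quadrants or weights its neighbors, so everything here is an assumption made for illustration: records have two observed attributes, quadrants are split by the sign of the coordinate differences from the query, and the imputed value is a plain average. The function names `knn_impute`, `quadrant_neighbors`, and `quadrant_impute` are hypothetical.

```python
import math

def knn_impute(query, data, k):
    """Plain kNN imputation: average the target of the k nearest records."""
    ranked = sorted(data, key=lambda rec: math.dist(query, rec[0]))
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / len(nearest)

def quadrant_neighbors(query, data):
    """Return the target of the nearest record in each quadrant around a 2-D query.

    Assumption: a quadrant is identified by the signs of (dx, dy).
    """
    best = {}
    for features, target in data:
        dx = features[0] - query[0]
        dy = features[1] - query[1]
        quadrant = (dx >= 0, dy >= 0)
        dist = math.hypot(dx, dy)
        if quadrant not in best or dist < best[quadrant][0]:
            best[quadrant] = (dist, target)
    return [target for _, target in best.values()]

def quadrant_impute(query, data):
    """Average the targets of the per-quadrant nearest neighbors."""
    targets = quadrant_neighbors(query, data)
    return sum(targets) / len(targets)

# Toy data: (observed attributes, value of the attribute to impute).
# Three records crowd one side of the query; the other sides hold one record each.
records = [
    ((1.0, 1.0), 10.0), ((1.1, 0.9), 10.0), ((0.9, 1.1), 10.0),
    ((-2.0, 2.0), 0.0), ((-2.0, -2.0), 0.0), ((2.0, -2.0), 0.0),
]
query_point = (0.0, 0.0)

knn_estimate = knn_impute(query_point, records, k=3)
quadrant_estimate = quadrant_impute(query_point, records)
print(knn_estimate, quadrant_estimate)
```

With this toy data, the three nearest records all lie on the same side of the query, so the plain kNN estimate is 10.0, while taking one neighbor per quadrant gives 2.5. Which estimate is closer to the truth depends on the data; the sketch only shows how the two selection rules differ.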

For missing data imputation, the imputation algorithm is important, but a good evaluation method undoubtedly provides effective guidance for algorithm selection. Using a specific example, the thesis points out that the commonly used indicator RMSE tends to be skewed by serious imputation errors, and it proposes a new evaluation method called goodness to overcome this defect. Even when there are a few serious imputation deviations, goodness can still reach a sound conclusion.

On the other hand, the thesis establishes a Shell Neighbor Classification method (SNC), which is not biased in selecting nearest neighbors. The SNC algorithm is not sensitive to the distance metric and achieves better classification accuracy on large data sets. In practical data mining applications, the data are usually of poor quality or incomplete, so developing noise-robust mining algorithms is a practical and challenging task. Eliminating noise is often difficult and expensive. Moreover, discarding historical data (even noisy data) for the sake of complete information greatly reduces the amount of data available for analysis. The k-nearest neighbor (kNN) classifier is distance-based and thus locally optimal. It does not take into account the partial or whole data distribution, which can affect classification accuracy, so kNN classification is sensitive to noisy data. The thesis designs a classification algorithm, called NCT, that incorporates the class distributions in the k nearest Neighbors, the Cluster, and the Training set. This combination enhances the noise tolerance of classification, giving a noise-robust classifier. Experimental results show that the proposed NCT algorithm not only has better classification performance but is also robust in noisy environments.
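The remark about RMSE earlier in this abstract can be checked with a few lines of arithmetic. The abstract does not define the goodness measure, so the sketch below does not attempt to reproduce it. It only shows how a single large error can dominate RMSE, and it uses a simple count of "which method is closer on each value" as a stand-in comparison. That count, the helper names, and the toy numbers are all assumptions for illustration.

```python
import math

def rmse(true_values, imputed):
    """Root mean squared error between true and imputed values."""
    squared = [(t - p) ** 2 for t, p in zip(true_values, imputed)]
    return math.sqrt(sum(squared) / len(squared))

def closer_count(true_values, imputed_a, imputed_b):
    """Count the values on which method A is strictly closer than method B.

    Hypothetical stand-in comparison, not the thesis's goodness measure.
    """
    return sum(
        abs(t - a) < abs(t - b)
        for t, a, b in zip(true_values, imputed_a, imputed_b)
    )

truth = [5.0] * 10
method_a = [5.1] * 9 + [15.0]   # nearly exact on 9 values, one severe error
method_b = [6.0] * 10           # off by 1.0 on every value

rmse_a = rmse(truth, method_a)
rmse_b = rmse(truth, method_b)
wins_a = closer_count(truth, method_a, method_b)
print(rmse_a, rmse_b, wins_a)
```

Here method A is closer on 9 of the 10 values, yet its RMSE is roughly three times that of method B because the single severe error is squared before averaging.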
In noise-free environments the NCT algorithm is slightly better than kNN; in noisy environments the classification accuracy of NCT is significantly higher than that of traditional kNN, and this advantage becomes more pronounced as the noise ratio increases. Finally, the clustering and global information introduced by the NCT algorithm are combined in other forms. The experimental results show that in noisy environments every combination improves the classification result to some degree, but NCT with the linear interpolation combination improves classification accuracy the most.

In short, the main innovations of this thesis can be summarized as follows:

(1) Because the kNNI algorithm is often biased in choosing the k nearest neighbors of a missing datum, a new imputation method for missing values is put forward: the Quadrant-Encapsidated-Nearest-Neighbor based Imputation method (QENNI).
(2) A new evaluation method called goodness is proposed. When there are a few serious imputation deviations, goodness is a better evaluation than RMSE.
(3) An SNC classification model is established that is not biased in selecting nearest neighbors. It is not sensitive to the distance metric and achieves better classification accuracy on large data sets.
(4) A classification algorithm called NCT is designed that incorporates the class distributions in the k nearest neighbors, the cluster, and the training set. It enhances the noise tolerance of classification, giving a noise-robust classifier.

To verify the effectiveness and validity of the proposed algorithms and strategies, many experiments were conducted on real data sets. Experimental results show that the proposed algorithms QENNI, SNC, and NCT are better than the k nearest neighbor algorithm. In particular, the NCT method is significantly better in noisy environments.
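The linear interpolation combination mentioned above can be sketched as a weighted sum of three class distributions: one over the k nearest neighbors, one over the query's cluster, and one over the whole training set. The abstract gives neither the weights nor the clustering step, so the weights `(0.4, 0.4, 0.2)`, the precomputed cluster ids, and the function names below are hypothetical choices for illustration, not the thesis's actual NCT definition. The sketch also assumes one numeric attribute and a non-empty query cluster.

```python
from collections import Counter

def class_distribution(labels, classes):
    """Relative frequency of each class among the given labels."""
    counts = Counter(labels)
    return {c: counts[c] / len(labels) for c in classes}

def nearest_labels(query, train, k):
    """Labels of the k training records closest to the query (1-D attribute)."""
    ranked = sorted(train, key=lambda rec: abs(rec[0] - query))
    return [label for _, label, _ in ranked[:k]]

def knn_predict(query, train, k):
    """Majority vote among the k nearest neighbors."""
    return Counter(nearest_labels(query, train, k)).most_common(1)[0][0]

def nct_predict(query, query_cluster, train, k, weights=(0.4, 0.4, 0.2)):
    """Combine neighbor, cluster, and training-set class distributions linearly.

    Assumptions: weights sum to 1 and query_cluster comes from a separate
    clustering step that is not shown here.
    """
    classes = sorted({label for _, label, _ in train})
    p_n = class_distribution(nearest_labels(query, train, k), classes)
    p_c = class_distribution(
        [label for _, label, cid in train if cid == query_cluster], classes
    )
    p_t = class_distribution([label for _, label, _ in train], classes)
    w_n, w_c, w_t = weights
    score = {c: w_n * p_n[c] + w_c * p_c[c] + w_t * p_t[c] for c in classes}
    return max(score, key=score.get)

# Toy training set: (attribute, label, cluster id).
# The two records closest to 0.0 carry a noisy label "b" inside an "a" cluster.
training = [
    (0.1, "b", 0), (-0.1, "b", 0), (0.3, "a", 0), (-0.4, "a", 0),
    (0.5, "a", 0), (-0.6, "a", 0), (0.7, "a", 0),
    (5.0, "b", 1), (5.2, "b", 1), (5.4, "b", 1),
]

knn_label = knn_predict(0.0, training, k=3)
nct_label = nct_predict(0.0, 0, training, k=3)
print(knn_label, nct_label)
```

In this toy case the two mislabeled neighbors win the plain kNN vote, so kNN returns "b", while the cluster distribution pulls the combined score back to "a". The outcome depends on the chosen weights, which would have to be tuned on real data.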

Keywords: k Nearest Neighbor Algorithm, Shell Neighbor, Missing Data Imputation, Classification
