BMC Bioinformatics (BioMed Central)
Software, Open Access

GeneSrF and varSelRF: a web-based tool and R package for gene selection and classification using random forest

Ramón Diaz-Uriarte

Address: Statistical Computing Team, Structural Biology and Biocomputing Programme, Spanish National Cancer Center (CNIO), Melchor Fernández Almagro 3, Madrid, 28029, Spain
Email: Ramón Diaz-Uriarte - rdiaz02@gmail.com

Published: 3 September 2007
Received: 22 March 2007
Accepted: 3 September 2007
BMC Bioinformatics 2007, 8:328 doi:10.1186/1471-2105-8-328

© 2007 Diaz-Uriarte; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: Microarray data are often used for patient classification and gene selection. An appropriate tool for end users and biomedical researchers should combine user friendliness with statistical rigor, including carefully avoiding selection biases and allowing analysis of multiple solutions, together with access to additional functional information of selected genes. Methodologically, such a tool would be of greater use if it incorporates state-of-the-art computational approaches and makes source code available.

Results: We have developed GeneSrF, a web-based tool, and varSelRF, an R package, that implement, in the context of patient classification, a validated method for selecting very small sets of genes while preserving classification accuracy. Computation is parallelized, making it possible to take advantage of multicore CPUs and clusters of workstations. Output includes bootstrapped estimates of prediction error rate, and assessments of the stability of the solutions.
Clickable tables link to additional information for each gene (GO terms, PubMed citations, KEGG pathways), and output can be sent to PaLS for examination of PubMed references, GO terms, and KEGG and Reactome pathways characteristic of sets of genes selected for class prediction. The full source code is available, so the software can be extended. The web-based application is available from http://genesrf2.bioinfo.cnio.es. All source code is available from Bioinformatics.org or The Launchpad. The R package is also available from CRAN.

Conclusion: varSelRF and GeneSrF implement a validated method for gene selection including bootstrap estimates of classification error rate. They are valuable tools for applied biomedical researchers, especially for exploratory work with microarray data. Because of the underlying technology used (combination of parallelization with web-based application) they are also of methodological interest to bioinformaticians and biostatisticians.

Background

Patient classification and gene selection related to classification are common uses of microarray data (e.g., review and references in [1]), but statistically rigorous and user-friendly tools for gene selection in the context of class prediction are rare. Such a tool should address two major issues. First, it should provide unbiased estimates of the prediction error rate of the procedure. Most users are by now aware of "selection bias", as originally reported in [2,3], but bias caused by trying different methods and/or sets of genes, and choosing the one with the smallest cross-validated error rate [4], is still not widely recognized. [...] [4] or double or full cross-validation [5] to estimate the error rate of the rule or procedure. Second, we need to assess the so-called multiplicity (or lack of uniqueness) problem: variable selection with microarray data can lead to many solutions that have similar prediction errors, but that share few common genes [1,6-9]. Choosing any one particular set of genes without being aware of the variability in solutions can lead to a false sense of certainty in the selected set.

From a user's perspective, an ideal tool should also be user friendly and provide additional resources to ease the interpretation of results [10]. Web-based tools are an excellent platform as they do not require software installation or upgrades from the user. In addition, web-based tools can be designed to allow easy access to information such as Gene Ontology terms, the UCSC and Ensembl databases, KEGG and Reactome pathways, or PubMed references, thus enhancing the biological interpretation of results [11]. Moreover, web-based tools, if implemented appropriately, can harvest computational resources rarely available to most individual users [10], including the increasing availability of multicore processors and easily accessible clusters made with off-the-shelf components [12]. Currently, the major opportunities for improved performance as well as the ability to analyze ever larger data sets do not lie in faster CPUs but in being able to use parallel and distributed computing to exploit multi-core servers and clusters [13,14]. In addition to providing a benefit to the end user (decreased execution time), tools that combine parallelization with web-based programming are important methodological developments.

Finally, a tool that fulfills the above requirements is of much greater relevance if it makes its source code available under an open-source license. Source code availability allows the research community to experiment with, and improve upon, the method and fix bugs, encourages reproducible research, allows claims by method developers to be verified, makes the international research community the owner of the tools needed to carry out its work and, thus, creates the conditions for swift progress upon previous work, concerns of particular importance in bioinformatics [15,16].

We have developed GeneSrF and varSelRF (a web-based application and R package, respectively) that satisfy the above requirements. The only available web-based tools with similar scope are M@CBETH [17] and Prophet [18]. These tools, however, do not examine the multiplicity problem, cannot benefit from multicore processors or computing clusters, and do not make source code available. M@CBETH, in addition, is restricted to two-class problems and does not focus on the gene selection problem. Prophet, in turn, does not seem to solve satisfactorily the biased error rate problem (it reports the error rate as that of the classifier with smallest cross-validated error rate, without evaluating the error rate of the rule itself).

Implementation

The core statistical functionality is provided by the varSelRF package for R [19]. This package implements the procedure in [1] for gene selection using random forests, building upon the randomForest package [20], an R port by A. Liaw and M. Wiener of the original code by L. Breiman and A. Cutler. We use MPI [21] for parallelization via the R packages Rmpi [22] by H. Yu, and snow [23] by L. Tierney, A. J. Rossini, Na Li and H. Sevcikova. In the web-based application, the CGI, initial data validation, and the setting-up and closing of the parallel infrastructure (booting and halting the LAM/MPI universes) are implemented in Python. Our installation runs on a cluster of 30 nodes, each with two dual-core AMD Opteron processors (see Figure 2 for details).

The input for the web-based application is either plain text files or files that come from other tools of the Asterias suite [12]. GeneSrF has been running in production use for over a year. Further documentation and examples for the web-based application are available from its on-line help, and for the R package from the standard R documentation system. A fully commented example of the output is provided in the on-line help [24]. Sample output is shown in Figure 1. Bug-tracking and additional tests are available from Bioinformatics.org and The Launchpad.

Figure 1: Example output. Some figures from the output of the web-based application (see [24]).
a) Out-of-bag error rate vs. the number of genes in the class prediction model, for both the complete, original data set (red line) and the 200 bootstrap samples (black lines). These figures can help identify the best number of genes in the class prediction model. It seems that we can do fairly well using just 2 genes in our model. This is the conclusion we reach both with the complete, original data set and the bootstrap samples.
b) Probability of class membership of each sample, from out-of-bag samples (i.e., bootstrap runs where the sample was not included in the training group). Most samples are well classified, especially those from class ALL (their average out-of-bag probability of membership in their true class is larger than 0.75).
c) Importance spectrum plots can help decide on the number of "relevant variables": we compare the variable importance plots from the original data with variable importance plots that are generated when the class labels and the predictors are independent (class labels are randomly permuted). In this case the first 30 variables have importances well above those from sets with randomly permuted class labels.
d) Selection probability plots: for each of the top ranked genes from the original sample, the probability that it is included among the top ranked k genes (blue: k = 20; red: k = 100) from the (200) bootstrap samples. Thus, these plots can be a measure of our confidence in the stability of choosing a number of k ranked genes. In this case, with k = 20 only the two or three most important genes are repeatedly chosen among the best 20. If we select the first 100 genes, the 30 best ranked ones appear in at least 75% of the bootstrap samples.

Benchmarks and run time

The parallelization has been implemented over bootstrap resamples. The speedups achieved by parallelizing are shown in Figure 2a), where we plot the fold increase in speed achieved by increasing the number of Rslaves (concurrently executing R processes). Parallelization makes a dramatic difference in speed for all the data sets shown. Up to 20 Rmpi slaves, the increases in speed are almost linear with the number of slaves. Beyond 20 slaves, speed increases more slowly with the number of slaves: as is known from the parallelization literature [21,25], in addition to the number of CPUs other factors can become limiting, in our case most likely bandwidth and latency of inter-node communication, and potential bottlenecks from memory and cache in nodes made of dual-core processors [26].

The scaling of user wall time of the R code (varSelRFBoot) with number of arrays and number of genes is shown in Figure 2b), with the default parallelization scheme and with a data set that allows for exploring a range of numbers of arrays and genes. User wall time increases approximately linearly with the number of arrays and number of genes over a realistic range of arrays and genes (e.g., when [...]

Figure 2: Benchmarks and run time.
a) Fold increase in speed from parallelization. Ratios of the user wall time of execution of the R code (varSelRFBoot without previous model fit) between a run with a single Rmpi slave and runs with different numbers of Rmpi slaves (the number of simultaneously executing R processes) for five data sets (see [1] for details). In the legend, the user wall time of the execution with a single Rmpi slave is given in parentheses for each data set. In all cases (except "1", "60(2)", and "90(3)") there were four Rmpi slaves per node. The timings were obtained in an otherwise idle cluster with 30 nodes, each with two dual-core AMD Opteron 2.2 GHz CPUs and 6 GB RAM, running Debian GNU/Linux and a stock 2.6.8 kernel, with version 7.1.2 of LAM/MPI and version 2.1.4 (patched) of R. The value for "60(2)" refers to a configuration with 2 slaves per node (recall that a node with two dual-core CPUs is not identical to a node with 4 CPUs), and the value for "90(3)" to a configuration with 3 slaves per node.
b) Scaling of user wall time. User wall time as a function of number of arrays and number of genes when executing the R function varSelRFBoot without previous model fit. Shown are three replicate runs. In each run, the arrays and genes are selected randomly from the complete original data set. Further details about the Prostate data set are in [1]. Hardware and software as above. We used 4 Rmpi slaves per node (and, thus, a total of 120 slaves).
c) User wall time of the web-based application. User wall time for complete runs (i.e., including upload of files and return of complete HTML page) for ten different data sets (see details in [1]). Under the name of each data set, the number of arrays and the number of genes are indicated. For each data set, three replicate runs were conducted. Hardware and software configuration as above, with the default settings for the web-based application (4 Rmpi slaves per node, and thus a total of 120 slaves).
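The fold increase in speed plotted in panel a) of the benchmark figure is a ratio of wall times: the time with a single Rmpi slave divided by the time with n slaves. The following is a minimal Python sketch of that calculation, not code from the GeneSrF sources. The function names and the timings are hypothetical and chosen only for illustration, and the efficiency measure (speedup divided by number of slaves) is a standard quantity added here, not one reported in the paper.

```python
def fold_speedup(wall_times):
    """Fold increase in speed relative to the single-slave run.

    wall_times maps number of slaves -> user wall time in seconds
    and must contain an entry for 1 slave (the baseline).
    """
    baseline = wall_times[1]
    return {n: baseline / t for n, t in sorted(wall_times.items())}


def parallel_efficiency(speedups):
    """Speedup divided by number of slaves (1.0 means perfectly linear scaling)."""
    return {n: s / n for n, s in speedups.items()}


# Hypothetical timings (seconds); NOT measurements from the paper.
example_times = {1: 3600.0, 10: 380.0, 20: 200.0, 60: 90.0, 120: 60.0}

speedups = fold_speedup(example_times)
efficiency = parallel_efficiency(speedups)

for n in speedups:
    print(f"{n:4d} slaves: speedup {speedups[n]:6.2f}, efficiency {efficiency[n]:.2f}")
```

With these made-up numbers the efficiency stays close to 1 up to 20 slaves and drops afterwards, which is the qualitative pattern the text attributes to communication and memory bottlenecks.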
[...] wall time increases by a factor of slightly over 2).

The run time for the web-based application for a wide range of data sets is shown in Figure 2c). These timings include the time needed to upload the files (and thus can be affected by internet connection speed) and to prepare and return to the user the final figures. Note that in most cases the complete analysis is finished within 20 minutes.

Scripts for timing experiments are included with the source code (directory "Benchmarks").

Results and discussion

Our procedure is explicitly targeted to select very small sets of genes, and has been shown [1] to have a classification error rate on par with other, state-of-the-art, classification procedures. Additionally, our programs allow the exploratory usage of random forest for identifying large subsets of genes potentially relevant for class prediction. In contrast to other tools, such as M@CBETH [17], we are not restricted to two-class problems.

To avoid underestimating the error rate of the classification procedure, we use the bootstrap (the 0.632+ approach of [27]). As in [1], we bootstrap the complete procedure, including selecting the classifier with minimal out-of-bag error rate (thus, this is a "full" or "double" bootstrap procedure, sensu [5]), and thus our estimates of error rate are not affected by selection biases. This contrasts, for instance, with Prophet [18], where the error rate reported is that of the classifier with the smallest cross-validated error rate. Based upon the bootstrap results, we also show the average out-of-bag predictions for each sample, making it easy to assess poorly predicted samples and potential outliers. There are other tools available for performing cross-validation and bootstrap of classification methods, such as the R package ipred [28] by A. Peters and T. Hothorn, the BioConductor package MCRestimate [29] by M. Ruschhaupt, U. Mansmann, P. Warnat, W. Huber and A. Benner, specifically targeted to computing misclassification error rates combining the gene selection and classification steps, or the caGEDA web application [10] that incorporates bootstrap, leave-one-out, and random resampling validation of several classifiers. Our approach, however, has been tailored to our own variable selection procedure and has been parallelized. A unique feature of GeneSrF and varSelRF is their emphasis on examining possible multiple solutions.

Since we obtain 200 resamples in the process of bootstrapping (see above) there is little added computational cost to providing analysis of stability and multiplicity of solutions. We report the number of genes selected and the identity of the individual genes selected in the original [...], including frequencies of every gene selected in the solutions. Moreover, the biological interpretation of the results is enhanced by the access to additional information. If the input file contains gene identifiers for either human, mouse, or rat genomes (in the form of Affymetrix IDs, Clone IDs, GenBank Accession numbers, Ensembl Gene IDs, Unigene clusters, or Entrez Gene IDs), for each gene in the results, the web-based application provides a link to IDClight [11], which allows the user to obtain additional information, including mapping between gene and protein identifiers, PubMed references, Gene Ontology terms, and KEGG and Reactome pathways. The multiple solutions can be further studied by sending sets of selected genes to our tool PaLS [30] to examine PubMed references, Gene Ontology terms, KEGG pathways, or Reactome pathways that are common to a user-selected percentage of genes or lists (bootstrap solutions). A fully commented example of the output is provided in the on-line help [24].

Finally, GeneSrF is one of the very few tools for the analysis of gene expression data that uses parallelization and, as far as we know, the only web-based tool to use parallelization for gene selection and classification. This is an important methodological novelty, as we can no longer expect that increases in single-CPU speed will allow us to analyze larger data sets in shorter time: the rate of increase in CPU speed has slowed down considerably in the last five years but, in contrast, increasing numbers of CPU cores (either in individual machines, including laptops, or via off-the-shelf computing clusters) are becoming much more affordable [13,14]. Thus, further decreases in user wall time (time to wait for a result) and ability to tackle more complex problems will depend on our ability to use parallel, distributed, and concurrent programming. GeneSrF therefore represents a case example of combining parallel computing with a user-friendly web-based application for the analysis of gene expression data and, by making the full source code available, allows other researchers to build upon our developments.

Future work focuses on extending the software to use random forest-related techniques applicable to heterogeneous types of variables, such as addition of categorical data [31] and other clinical information. We are also exploring alternative mechanisms and languages for parallelizing and distributing computations, and we are rewriting most of the code using Pylons [32], a Python web framework, to try to simplify installation of the web-based application. Installation now involves several steps (see [33]), and the most time consuming are setting up and verifying the LAM/MPI environment, and using the correct paths in files involved in controlling the MPI environment and executing and controlling R.

Conclusion

varSelRF and GeneSrF implement a validated method for gene selection and provide bootstrap estimates of classification error rate, take advantage of computing clusters and multicore processors, and encourage careful examination of the multiplicity of solutions problem. Thus, these are both useful tools for applied biomedical researchers using microarray and gene expression data, and represent unique methodological developments in the area of web-based gene expression analysis tools.

Availability and requirements

For GeneSrF:
Project name: GeneSrF
Project home page:
Operating system: Platform independent (web-based application)
Programming language: R, Python
Other requirements: A web browser.
License: None for usage. Web-based code: Affero GPL (open source).
Any restrictions to use by non-academics: None.

For varSelRF:
Project name: varSelRF
Project home page:
Operating system: Linux, Unix
Programming language: R, Python
Other requirements: LAM/MPI
License: GNU GPL
Any restrictions to use by non-academics: None

Abbreviations

CGI, Common Gateway Interface; GO, Gene Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes; LAM, Local Area Multicomputer; MPI, Message Passing Interface.

Acknowledgements

A. Alibés and A. Cañada for their work on IDClight and PaLS. Two anonymous [...] Bioinformatics.org and The Launchpad for project and repository hosting. Funding provided by Fundación de Investigación Médica Mutua Madrileña and Project TIC2003-09331-C02-02 of the Spanish Ministry of Education and Science (MEC). R.D.-U. is partially supported by the Ramón y Cajal programme of the Spanish MEC.

References

1. Díaz-Uriarte R, Alvarez de Andrés S: Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006, 7:3.
2. Ambroise C, McLachlan GJ: Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci USA 2002, 99(10):6562-6566.
3. Simon R, Radmacher MD, Dobbin K, McShane LM: Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. Journal of the National Cancer Institute 2003, 95:14-18.
4. Varma S, Simon R: Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics 2006, 7.
5. Dudoit S, Fridlyand J: Classification in microarray experiments. In Statistical analysis of gene expression microarray data. Edited by: Speed T. New York: Chapman & Hall; 2003:93-158.
6. Somorjai RL, Dolenko B, Baumgartner R: Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions. Bioinformatics 2003, 19:1484-1491.
7. Pan KH, Lih CJ, Cohen SN: Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proc Natl Acad Sci USA 2005, 102:8961-8965.
8. Ein-Dor L, Kela I, Getz G, Givol D, Domany E: Outcome signature genes in breast cancer: is there a unique set? Bioinformatics 2005, 21:171-178.
9. Michiels S, Koscielny S, Hill C: Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet 2005, 365:488-492.
10. Patel S, Lyons-Weiler J: caGEDA: a web application for the integrated analysis of global gene expression patterns in cancer. Applied Bioinformatics 2004, 3:49-62.
11. Alibés A, Yankilevich P, Cañada A, Diaz-Uriarte R: IDconverter and IDClight: conversion and annotation of gene and protein IDs. BMC Bioinformatics 2007, 8:9.
12. Diaz-Uriarte R, Alibés A, Morrissey ER, Cañada A, Rueda O, Neves ML: Asterias: integrated analysis of expression and aCGH data using an open-source, web-based, parallelized software suite. Nucleic Acids Research 2007, 35:W75-W80.
13. Sutter H: The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software. Dr. Dobb's Journal 2005, 30(3):202-210.
14. Kontoghiorghes EJ: Handbook of Parallel Computing and Statistics. Boca Raton, FL: Chapman & Hall, CRC; 2006.
15. Dudoit S, Gentleman RC, Quackenbush J: Open source software for the analysis of microarray data. Biotechniques 2003:45-51.
16. Díaz-Uriarte R: Supervised methods with genomic data: a review and cautionary view. In Data analysis and visualization in genomics and proteomics. Edited by: Azuaje F, Dopazo J. New York: Wiley; 2005:193-214.
17. Pochet NL, Janssens FA, De Smet F, Marchal K, Suykens JA, De Moor BL: M@CBETH: a microarray classification benchmarking tool. Bioinformatics 2005, 21:3185-3186.
18. Medina I, Montaner D, Tarraga J, Dopazo J: Prophet, a web-based tool for class prediction using microarray data. Bioinformatics 2006 in press.
19. R Development Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria; 2004. [ISBN 3-900051-00-3]
20. Liaw A, Wiener M: Classification and Regression by randomForest. R News 2002, 2(3):18-22.
21. Pacheco P: Parallel programming with MPI. San Francisco: Morgan Kaufmann; 1997.
22. Yu H: Rmpi: Interface (Wrapper) to MPI (Message-Passing Interface).
23. Tierney L, Rossini AJ, Li N, Sevcikova H: snow: Simple Network of Workstations.
24. GeneSrF on-line commented example.
25. Foster I: Designing and building parallel programs. Boston: Addison Wesley; 1995.
26. Dongarra J, Gannon D, Fox G, Kennedy K: The Impact of Multicore on Computational Science Software. CTWatch Quarterly 2007, 3:3-10.
27. Efron B, Tibshirani RJ: Improvements on cross-validation: the .632+ bootstrap method. J American Statistical Association 1997, 92:548-560.
28. Peters A, Hothorn T: ipred: Improved Predictors [ project.org/src/contrib/Descriptions/ipred.html].
29. Ruschhaupt M, Mansmann U, Warnat P, Huber W, Benner A: MCRestimate.
30. PaLS.
31. Strobl C, Boulesteix AL, Zeileis A, Hothorn T: Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution. BMC Bioinformatics 2007, 8:25.
32. Pylons.
33. Download page.
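The selection probability plots described for the example output (for each top-ranked gene from the original sample, the fraction of bootstrap samples in which it appears among the top k genes) can be illustrated with a short Python sketch. This is a rough illustration under stated assumptions, not the varSelRF implementation: the function name, the gene labels, and the rankings below are all hypothetical, and each ranking is assumed to be a list of gene identifiers ordered from most to least important.

```python
from collections import Counter


def top_k_selection_probability(original_ranking, bootstrap_rankings, k):
    """Fraction of bootstrap rankings that place each gene among their top k.

    Probabilities are reported for the top k genes of the original ranking,
    in the order in which they appear there.
    """
    if not bootstrap_rankings:
        raise ValueError("at least one bootstrap ranking is required")
    counts = Counter()
    for ranking in bootstrap_rankings:
        # set() so that a gene is counted at most once per bootstrap sample
        counts.update(set(ranking[:k]))
    n_boot = len(bootstrap_rankings)
    return {gene: counts[gene] / n_boot for gene in original_ranking[:k]}


# Hypothetical rankings; NOT data from the paper.
original = ["g1", "g2", "g3", "g4"]
bootstraps = [
    ["g1", "g2", "g3", "g4"],
    ["g2", "g1", "g4", "g3"],
    ["g1", "g3", "g2", "g4"],
    ["g4", "g1", "g2", "g3"],
]

probabilities = top_k_selection_probability(original, bootstraps, k=2)
print(probabilities)
```

Genes whose probability stays near 1 as k varies are the stably selected ones; low values for most of the top k genes point to the multiplicity problem discussed in the Background.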