ChIPOTle: a user-friendly tool for the analysis of ChIP-chip data

Michael J Buck*, Andrew B Nobel† and Jason D Lieb*

Addresses: *Department of Biology and Carolina Center for Genome Sciences, CB 3280, 202 Fordham Hall, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-3280, USA. †Department of Statistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-3260, USA.

Correspondence: Jason D Lieb. E-mail: jlieb@bio.unc.edu

Published: 19 October 2005
Genome Biology 2005, 6:R97 (doi:10.1186/gb-2005-6-11-r97)
Received: 7 June 2005
Revised: 2 August 2005
Accepted: 22 September 2005

The electronic version of this article is the complete one and can be found online.

© 2005 Buck et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

ChIPOTle (Chromatin ImmunoPrecipitation On Tiled arrays) takes advantage of two unique properties of ChIP-chip data: the single-tailed nature of the data, caused by specific enrichment but not specific depletion of genomic fragments; and the predictable enrichment of DNA fragments adjacent to sites of direct protein-DNA interaction. Implemented as a Microsoft Excel macro written in Visual Basic, ChIPOTle uses a sliding window approach that yields improvements in the identification of bona fide sites of protein-DNA interaction.

Rationale

Interactions between proteins and DNA facilitate and regulate many basic cellular functions, including transcription, DNA replication, recombination, and DNA repair. For example, the process of transcription is regulated by a class of proteins referred to as transcription factors, which often bind to specific DNA sequences upstream of gene coding regions. This control mechanism allows cells to respond to developmental or environmental signals by using the same transcription factor to coordinate expression of many genes. Therefore, it is of interest to determine where regulatory proteins of this and other types are bound to the genome.

The genomic-binding location of transcription factors can be determined using chromatin immunoprecipitation (ChIP) followed by detection of the enriched fragments by DNA microarray hybridization. This procedure, also known as ChIP-chip, has been reviewed extensively [1-5]. To appreciate the unique properties of the data generated by the ChIP-chip procedure, it is useful to review briefly the main points of the experimental procedure (Figure 1).

After growing the cells of interest under the desired conditions, chromatin is usually cross-linked with formaldehyde to preserve sites of interaction between proteins and DNA. The cross-linked chromatin is then sheared by sonication or enzymatic digestion. Shearing creates a population of chromatin fragments of varying size, generally ranging from 200 to 1,000 base-pairs. The protein of interest, along with the DNA associated with it, is then isolated by using an antibody specific to that protein or by affinity purification utilizing an epitope or affinity tag fused to the protein. The ChIPed DNA is then purified. Because yields from most samples are low, amplification is often required. DNA fragments enriched in the procedure are then detected by comparative hybridization to a DNA microarray. Standard technical recommendations common to all microarray experiments (for example, the need for dye swaps) apply equally to ChIP-chip experiments. The result of the hybridization allows one to identify which segments of the genome were bound by the protein of interest during immunoprecipitation.
The interpretation of data generated by a ChIP-chip experi- ment is in many respects similar to interpretation of Genome Biology 2005, 6:R97 R97.2 Genome Biology 2005, Volume 6, Issue 11, Article R97 Buck et al. Binding site (a) s agment Crosslink and lyse iched fr Enr (b) 3 2 Sonicate 1 E C B D F G A 0 -1 0 1.5 1 0.5 0.5 1 1.5 2 2.5 2.5 2 Distance from binding site (kb) 2 Reference ChIP (c) 1 Add specific antibody 0 1 (d) 10-4 Immunoprecipitation 10 Reverse cross --1822 links and purify 10 total DNA 10-26 Reverse cross 10 links and purify -30 ChIP-enriched DNA Figure 2 Amplify and label The neighbor effect and calculation of P values. (a) After ChIP, purified DNA fragments bound by the protein of interest will be of various lengths. + Cy5 + Cy3 (b) Actual log2 ratios reported by arrayed elements for Rap1p binding to promoter region of RPL1B (array element 'A') from the Rap1p binding Hybridize to dataset reported by Lieb and coworkers [13]. Arrayed element 'A' microarray contains the actual site of protein-DNA interaction, and so this spot will have the highest ratio (red = high positive ratio; yellow = low ratio; green Figure 1 = negative ratio). Arrayed elements 'B' (RPL1B open reading frame [ORF]) A summary of the ChIP-chip procedure. See the text for details. and 'C' (MRM2 ORF), which are within about 1 kilobase (kb) of the binding site, are also enriched above noise. Arrayed element 'B' has a higher ratio then spot 'C', because the binding site is located closer to element 'B'. The traditional gene expression microarrays, but it differs in two arrayed elements 'D', 'E', 'F', and 'G' are too far from the binding site to be important ways. First, in traditional expression experiments, enriched. (c) Using a 1 kb window with a 0.25 kb step, the value of each each element on the microarray measures the abundance of window is plotted. The location of each window is defined by its central coordinate. (d) The P value of each window is plotted. 
The Bonferroni RNA molecules of a fixed length. (Note that we shall use the corrected P values were calculated based on the observed data, which had term 'arrayed elements' hereafter to describe DNA fragments a log2 background standard deviation of 0.32 with 21,208 comparisons. that are deposited on the surface of the array; the term 'probe' Note that the window with the smallest P value (about 10-30) does not is sometimes used by others.) In contrast, with ChIP-chip correspond to the highest window average. This is due to the fact that the most significant window contains three arrayed elements (A, B, and C), experiments each element measures the abundance of a pop- whereas the windows with the highest average contain only two elements ulation of fragments of various lengths due to the effects of (A and B). In this case, the center of the window with the highest P value is chromatin shearing. As a consequence, arrayed elements rep- located about 80 bases from the actual binding site. resenting genomic regions both at the binding site and near the binding site will detect enrichment (Figure 2). is, there is biological significance associated with both low and high ratio measurements, and these measurements often Depending on the method and degree of chromatin shearing, occur with similar frequencies. In contrast, the measure- and the resolution of the arrayed elements, this effect pro- ments derived from ChIP-chip experiments arise as a mixture duces a 'peak' of signal centered over the binding site, which of two distributions. The first corresponds to the population may span several arrayed elements representing genomically of genomic fragments specifically enriched by the ChIP, and adjacent DNA. 
This 'neighbor effect' is not an expected prop- the second corresponds to the remaining population of erty of noise or other spuriously high ratio measurements, and thus is a source of information that can be used for genomic DNA that is not ChIP enriched and therefore repre- analysis. sents background, or noise. The observed distribution of the log2 ratios is therefore asymmetric about zero, with a distinct, The second difference in the interpretation of ChIP-chip and positively oriented skew (Figure 3a). The left-hand side of the traditional gene expression data is that in expression experi- distribution (the negative log ratios) is approximately Gaus- ments, the data are two-tailed and roughly symmetric. That sian, but the positive log ratios exhibit a heavier non-Gaus- Genome Biology 2005, 6:R97 , Volume 6, Issue 11, Article R97 Buck et al. R97.3 The type of microarray used in a ChIP-chip experiment (a) affects how the data can be analyzed. Two array designs are Q-Q plot for Rap1p ChIP-chip versus Gaussian distribution typically used for ChIP-chips: tiled or promoter-specific 5 arrays. Promoter-specific arrays generally contain a single arrayed element to represent each regulatory region of inter- 4 est. These arrays are valuable when binding is known to be Rap1p confined to regulatory sequences close to transcriptional start 3 sites of the selected genes [7], but they become less powerful 2 when binding is not as well characterized or is spread over a large genomic area. The other type, namely tiled arrays, are 1 best suited to ChIP-chip. The term 'tiled array', or sometimes 'tiling-path array', refers to arrays containing DNA fragments 0 designed to cover large genomic regions or whole chromo- Simulated data somes with few or no gaps between arrayed elements [8,9]. 
Gaussian region -1 mean = 0 Tiled arrays are advantageous because they do not require Standard deviation = 0.35 prior knowledge of potential binding targets, and they allow -4 -2 0 2 4 one to utilize the 'neighbor effect' in data analysis. Gaussian quantiles In this report we describe ChIPOTle (Chromatin Immuno- Precipitation On Tiled arrays), software created expressly for the analysis of ChIP-chip data obtained using tiled arrays, (b) which allow us to exploit both the 'single-tail' and 'neighbor Rap1p binding to chromosome 6 effect'. ChIPOTle uses a sliding window approach to identify potential sites of enrichment, and then estimates the signifi- cance of enrichment for a genomic region using a standard Gaussian error function. ChIPOTle is delivered as a Microsoft Excel macro written in Visual Basic, which should facilitate widespread adoption and provide a platform for custom applications. Before ChIPOTle, to our knowledge the only publicly available program designed expressly for ChIP-chip data analysis was PeakFinder [10]. ChIPOTle offers several improvements, including accurate and powerful P value esti- mation and improved usability. ChIPOTle is available online (Additional data file 1) [11]. Genomic location (bp) Figure 3 The ChIPOTle algorithm Characteristics of ChIP-chip data. (a) A quantile-quantile plot (QQ plot) for one representative Rap1p ChIP-chip experiment (red) against Gaussian ChIPOTle first sorts the arrayed elements by genomic loca- distribution with a standard deviation of 0.35 and a mean of 0 (black bars). tion. To find potential areas of ChIP enrichment, a window of The upper and lower bounds of the black dashed line represent extreme user-defined size (default 1 kilobase) is then moved stepwise values for 10,000 simulated Gaussian distributions with the above (user-defined step size; default 0.25 kilobase) along the tiled parameters. For Rap1 about 92% of the data fit the Gaussian distribution. 
The top 8% is skewed away from the simulated data. (b) A sliding window region. At each step the average log2 ratio for the window is analysis for yeast chromosome VI produced by ChIPOTle for four Rap1p calculated by taking the simple average of all ratios reported replicates [13]. Window size is 1 kilobase (kb) with 0.25 kb step size. The by arrayed elements that overlap with the window to any Rap1p binding sites are identified with arrows. degree. The average is unweighted, and therefore it is not dependent on the proportion of the element within the win- dow; it depends only on whether it is present or absent. The window is then moved unidirectionally along the chromo- some by the step size and the same calculation is repeated for sian tail. For the vast majority of ChIP-chip experiments, the genomic regions of biological interest will be confined to the each distinct window, until the end of the chromosome is positive side of the distribution, and the negative log ratios reached. The arrayed elements need not be evenly spaced or will arise solely from fragments that are considered to be of equal lengths. ChIPOTle can be used with any genome. background. Under the additional assumption that the distri- bution of unenriched fragments is symmetric about zero, we As described in more detail below, the resulting sliding win- can estimate the distribution of background ratios using only dow averages can be represented as a graph, with genomic the observed negative log ratios as a guide [6]. position on the horizontal axis and average log2 ratio on the Genome Biology 2005, 6:R97 R97.4 Genome Biology 2005, Volume 6, Issue 11, Article R97 Buck et al. , genomic binding locations are rep- Through a dialog window, ChIPOTle will ask for the location resented as a series of peaks (Figure 3b). Averaging the log2 of each data column. 
The user will also be prompted to pro- ratios of elements in a window accounts for the neighbor vide the window size, step size, and the desired technique for effect, because the peak generated by a spuriously high signal determining peak significance. For the latter parameter, the will be reduced by averaging its value with the ratios of neigh- user can choose (1) a simple peak height cutoff; (2) assume a boring elements, which are very unlikely also to be high Gaussian background distribution for calculation of window purely by chance. average P values; or (3) estimate the background distribution for calculation of window average P values via a permutation- ChIPOTle assigns a P value to the average log ratio within based simulation. If option 1 is selected then the user is each window, under the null hypothesis that the observed log prompted to enter the peak height; for option 2 the user is ratios are independent, identically distributed, and random prompted to provide the significance P value cutoff; and for variables, having a Gaussian distribution with a mean of zero. option 3 the user is prompted to provide the number of simu- The variance of the observations is estimated by the average lations and the significance P value cutoff to be used in the sum of the squared negative log ratios. Under the null hypoth- permutation analysis. Any region with a P value lower then esis, the distribution of the average log2 ratio within each win- the selected cutoff will be recorded and summarized in the dow is again Gaussian, with mean zero and variance equal to "Significant Regions" and "Peaks" worksheets. the variance of a single log ratio divided by the number of ele- ments in the window. Thus, the nominal P value for a window with average ratio w can be calculated using the standard Parameter optimization error function (ERF) as follows: As described above, ChIPOTle has three important user- defined parameters: P value cutoff, window size, and step size. 
These parameters will affect the output, and can be Pwindow = 1? ERF w (1) n adjusted according to the experiment and the array design. σ The P value cutoff should be set at a level that produces a false whereσ is the standard deviation for the background distri- discovery rate with which the user is comfortable. The "Sig- bution and n is the number of microarray elements used in nificant Negative Regions" sheet provides an empirical esti- the window. The P values reported by ChIPOTle are corrected mate of the number of false-positive findings for the selected for multiple comparisons using the conservative Bonferroni P value cutoff, and so the user can use this information to esti- correction. As an alternative to using a Gaussian distribution mate the false-positive rate and adjust the P value cutoff (see for the background, ChIPOTle can estimate the P value for a below). The numbers of acceptable false-positive and false- region using a permutation-based approach (Additional data negative findings will vary depending on the goals of the file 2). study. The next parameter to set is window size. Ideally, for a given protein-DNA interaction, one would like to capture the maxi- Using ChIPOTle Detailed instructions for the installation and use of ChIPOTle mal amount of ChIP signal associated with a single binding are available in the read-me file that accompanies the pro- event, and none of the noise, in a single window. Therefore, in gram (Additional data file 2). Once ChIPOTle has been cor- most ChIP-chip experiments the window size should be rectly added to the Excel Add-Ins menu or opened manually, adjusted to approximately the average shear size of the chro- a new menu option will appear in the Excel Tools menu. matin. 
The average shear size is suggested because the size of ChIPOTle must be run from an active Excel spreadsheet con- the window must be balanced against making it so large that taining five columns: the name of each arrayed element, chro- noise from adjacent genomic regions is included in the meas- mosome name, start coordinate in base-pairs, end coordinate urement, and against making the window so small that data in base-pairs, and the log2 ratio from the ChIP-chip experi- from adjacent spots is excluded, diminishing the power of ment(s). The ratio values supplied to ChIPOTle can be a sin- windowing to utilize the neighbor effect. Although this gle measurement from a single experiment or an average, parameter is largely independent of array platform or array weighted average, or median of ratio values calculated from resolution, slightly smaller windows may be more effective on multiple replicates. When using data from multiple repli- higher resolution arrays. cates, before combining the data each array must be appro- priately normalized to remove systematic nonbiological Optimization of step size depends on both the array resolu- effects that might otherwise influence the results [1]. For sin- tion and the window size. The step size should be adjusted gle channel experiments, pseudo-ratios must be created such that it is less than half of the array resolution, with array before using ChIPOTle. Pseudo-ratios may be created by resolution defined as the distance between the start of one dividing the intensity value at each arrayed element by the arrayed element and the start of the next. Thereby, the meas- median intensity value for all arrayed elements. urement recorded at each arrayed element will be used in the calculation of at least three windows, ensuring that every Genome Biology 2005, 6:R97 , Volume 6, Issue 11, Article R97 Buck et al. 
R97.5 arrayed element has the opportunity to be centered under a of significantly enriched peaks, and the total number of peak. Window size is also an important factor because some windows. overlap of windows is desirable in order to detect peaks at unknown locations. Taken together, we suggest setting the step size to the maximum value that is both less than half of Properties of ChIP-chip data the array resolution and less than or equal to one-quarter of A plot of the sliding window values generated by ChIPOTle for the window size. For very high-resolution arrays (less than a Rap1p ChIP-chip reveals two important characteristics of about 50 base-pairs), step sizes smaller than the array resolu- this type of data (Figure 3b). The first is an absence of deep tion may not improve results. negative peaks. In ChIP-chip experiments, negative log ratios are not caused by specific depletion of genomic fragments but by noise. Therefore, after averaging with neighboring genomic elements, their window average will tend to be small. ChIPOTle output The second is the presence of tall positive peaks that extend ChIPOTle creates several output sheets with the following names: SummarySheet, Significant Regions, Significant Neg- well above background. ative Regions, Chromosomes aveP, Peaks, and Description. The SummarySheet contains all the input data used to run ChIPOTle, now sorted by chromosome and start coordinate. Comparing ChIPOTle with other techniques For each window that meets the significance criteria specified used to analyze ChIP-chip data by the user, the Significant Regions sheet contains the follow- We compared ChIPOTle with three other analysis techniques ing: chromosome assignment, center coordinate, number of commonly used to analyze ChIP-chip experiments: the single independent arrayed elements within each window, and array error model (SAEM) [6,7,12], percentile rank analysis names of the arrayed elements that comprise the window. 
[13], and PeakFinder (smoothing settings: n = 5, rounds = 7) Significant Negative Regions is similar to Significant Regions, [10]. All four techniques were used to analyze four biological but instead it contains all of the windows that meet the signif- replicates (experiments 5, 6, 8, and 9) from the Rap1p binding icance criteria but are sign-flipped. The number of windows dataset in yeast reported by Lieb and coworkers [13]. To com- reported in this sheet can be used as an estimate of the pare the power of the four techniques quantitatively, they number of false-positive findings expected for the selected or were judged by their ability to identify the 127 promoters of the ribosomal protein genes (RPGs) as targets of Rap1p bind- estimated cutoff. Chromosome aveP contains the names of ing. As a group, these promoters are known targets of Rap1p, the arrayed elements that comprise each window, and the and almost all contain consensus Rap1p-binding sites [14]. By chromosome, center coordinate, and value of all windows, regardless of whether they meet the significance criterion. using this functionally defined set, we avoided using any par- The values from this sheet, for example, were used to make ticular ChIP dataset to define our 'gold standard'. The targets identified by each technique were sorted by P value Figure 3b. (ChIPOTle and SAEM), median percentile rank (percentile rank), or ySmooth value (PeakFinder). We then used receiver The data written to the "Peaks" sheet are similar to those reported in "Significant Regions", except that all neighboring operator characteristic (ROC) plots to show how true posi- windows meeting the significance criteria are collapsed into a tives (sensitivity) were captured in relation to false positives single peak. Therefore, a peak is defined as any window with (specificity) for all values output by each method (Figure 4a). 
a P value that meets the significance criterion defined by the The power of each technique was then quantitated as the area user and all neighboring windows that also meet the signifi- under the ROC curve (AUC). An analysis technique that cance criteria. In this sheet, each peak is listed in order of its selected targets randomly would have an AUC of about 0.5; occurrence along the chromosome, along with the highest higher values are better (maximum = 1). window for each peak, highest raw log2 ratio for any element within the peak, start coordinate of the peak, the width of por- In using the Rap1p ChIP-chip data to identify the promoters tion of the peak above the significance cutoff, 'array density' of RPGs, all of the techniques worked well, but ChIPOTle of the peak, and the P value for that peak. The array density (Figure 4a, black line; AUC = 0.963) performed considerably value is defined as the average number of arrayed elements better then the other techniques (SAEM: AUC = 0.906, per- used to calculate the window values for all windows that com- centile rank AUC = 0.897; PeakFinder: AUC = 0.838). The prise the peak. Therefore, the array density value provides an 95% confidence interval for each AUC value (Figure 4b) was estimate of the number of actual raw data measurements that estimated by bootstrap resampling of RPG occurrence and underlie each peak. enrichment value as measured in each technique (P value, percentile rank, or ySmooth) [15]. The last sheet, Description, contains a summary of the ChIPOTle execution parameters, which include the date and We next compared the ability of ChIPOTle, SAEM, and Peak- time, the selected window size, the step size, the significance Finder to identify accurately the RPG promoters from a ChIP- method chosen and corresponding parameters, the number chip hybridization to a single microarray. This analysis Genome Biology 2005, 6:R97 R97.6 Genome Biology 2005, Volume 6, Issue 11, Article R97 Buck et al. 
(a) Figure 4Identification of ribosomal protein Comparison of ChIPOTle with other ChIP-chip analysis approaches. (a) gene promoters as targets of Rap1p ChIPOTle, the single-array error model (SAEM), median percentile rank, 1 and PeakFinder were used to analyze the same four Rap1p ChIP-chip replicates reported by Lieb and coworkers [13], and judged by their ability 0.8 to determine enrichment of ribosomal protein gene (RPG) promoters. The binding site for Rap1p is found in most (>90%) RPG promoters [14], 0.6 which represent approximately half of Rap1p's total in vivo targets. ChIPOTle (0.963) Receiver operating characteristic (ROC) curves summarize the power of SAEM (0.906) 0.4 each technique and are equivalent to a plot of the true-positive rate Percentile rank (0.897) (fraction of ribosomal promoters) versus the false-positive rate (fraction PeakFinder (0.838) of all genomic elements other than ribosomal promoters). Each technique 0.2 is judged by means of the area under the ROC curve (AUC). An AUC value of 0.5, corresponding to a diagonal ROC curve, is expected by chance, whereas a value of 1.0 indicates a technique that predicts targets 0.2 0.4 0.6 0.8 1 0 perfectly. ChIPOTle (AUC = 0.963) outperformed the other techniques Fraction of all genomic elements tested here (SAEM: AUC = 0.906; median percentile rank: AUC = 0.897; other than ribosomal promoters and PeakFinder: AUC = 0.823). When comparing ChIPOTle with PeakFinder, we used the default settings for smoothing (n = 5 [11-point] (b) The 95% confidence intervals smoothing with 7 rounds). In addition, we attempted to optimize the for ROC AUC settings by trying varying levels of smoothing, including 7-point and 13- 0.98 point, which produced similar results. Rap1p's strongest binding sites are 0.94 located at the telomeres, which are not included with our defined 'true positive' set of RPG promoters. 
Therefore, the false-positive rate will be 0.90 somewhat inflated, which will decrease the AUC for all techniques. This is 0.86 reflected in the ROC curves by the low true-positive rate at the extreme left of the plot. (b) The 95% confidence interval for the AUC for each 0.82 analysis technique was estimated by bootstrap resampling of RPG occurrence and enrichment value (1,000 iterations) as measured in each technique (P value, percentile rank, or ySmooth). Boostrapping of raw data was not practical because of inability to automate all four analysis methods. (c) ROC curves comparing ChIPOTle, SAEM, and PeakFinder with respect to their ability to identify enrichment of RPG promoters from a single experiment. The average true-positive rate (fraction of (c) Identification of ribosomal protein ribosomal promoters) versus false-positive rate (fraction of all genomic gene promoters as targets of Rap1p elements other than ribosomal promoters) for the four individual from a single experiment experiments is plotted. The three techniques performed extremely well, 1 but ChIPOTle (AUC = 0.885) outperformed both SAEM (AUC = 0.835) and PeakFinder (AUC = 0.833). 0.8 0.6 cannot be performed with the percentile rank analysis ChIPOTle (0.885) 0.4 because this technique requires experimental replicates. We SAEM (0.835) PeakFinder (0.833) analyzed each individual experiment independently and 0.2 determined the average true-positive rate versus the false- positive rate (Figure 4c). All three techniques performed extremely well, but ChIPOTle (AUC = 0.885) outperformed 1 0 0.2 0.4 0.6 0.8 both SAEM (AUC = 0.835) and PeakFinder (AUC = 0.833). In Fraction of all genomic elements addition, ChIPOTle produced higher AUC values than both other than ribosomal promoters SAEM and PeakFinder for each individual experiment (data Figure 4 not shown). Discussion ChIPOTle is a Microsoft Excel macro that is designed for use in the analysis of data from ChIP-chip experiments. 
ChIPOTle exploits the unique characteristics of ChIP-chip data, including enrichment of DNA genomically adjacent to sites of protein-DNA interaction, and the single-tailed nature of the data, to define peaks of enrichment and their significance. ChIPOTle is very quick and easy to use. The user is prompted to select the five columns containing their data and the significance technique to be used. The program then returns the genomic regions that were enriched by the ChIP according to the data and the specified statistical parameters. In its current implementation, ChIPOTle is restricted in functionality by the limitations of Excel worksheets to 65,536 rows by 256 columns. Therefore, if the dataset of interest is derived from an array containing more than 65,536 unique elements or if the total number of windows generated exceeds 5.5 million, then the data will have to be separated into subsets (for example, individual chromosomes) if they are to be analyzed using ChIPOTle.

As currently implemented, the significance analysis in ChIPOTle is carried out under the assumption that the log2 ratios of the arrayed elements are independent and Gaussian distributed, with mean zero and common variance. Under this assumption, a nominal P value may be assigned to each window using the standard Gaussian cumulative distribution function, or an appropriate bound having closed form. Multiple comparisons can then be addressed via a Bonferroni correction or through an estimated false-discovery rate. In either case, the tail behavior of the Gaussian distribution will have a strong effect on the corrected P values.

As a more conservative alternative to the Gaussian approach, one could derive nominal P values from each window using a null distribution with heavier tails than the Gaussian. A natural choice, consistent with the observed histogram of log2 ratios, is a t-type distribution. Formally, one may adopt the null hypothesis that the observed log2 ratios are independent and distributed as cT, where c is a positive scaling factor and T has a standard t distribution with v degrees of freedom. In order to obtain nominal P values, one then needs estimates of c and v, and bounds on the probability that a sum of independent t-distributed random variables exceeds a threshold. Estimates of c and v can be obtained through moment-based methods. Suitable probability bounds with good small-sample properties are currently under investigation.

ChIPOTle, while using novel approaches, identifies a set of sites similar to that defined by other techniques (PeakFinder, SAEM, and percentile rank analysis) used for analysis of data from ChIP-chip experiments. However, the use of a sliding window allows ChIPOTle to identify enriched regions more accurately, especially after only one experiment. This is useful because when one is performing a ChIP-chip experiment for the first time with a new protein or antibody, it is often difficult to determine whether the ChIP was successful, especially for a protein with an undefined binding pattern. The ability to [...] be very important for larger, more complex genomes. Complete high-density tiled arrays for mammalian genomes require many arrays for each experiment, meaning that performing the ideal number of replicates can be prohibitively expensive. In mammalian systems, instead of performing all of the replicates of a ChIP-chip experiment on whole-genome arrays, preliminary experiments using whole-genome arrays can be used to find likely targets. Once these likely targets are identified, the array could be redesigned to include all prospective targets and appropriate controls on a single array. In addition to its utility as a general ChIP-chip analysis tool, ChIPOTle will make prescreening more accurate and will enhance the power and accuracy of this approach.

Additional data files
The following additional files are included with the online version of this paper: the Excel Add-In ChIPOTle v 1.0 (Additional data file 1), a pdf file containing detailed instructions for the installation and use of ChIPOTle (Additional data file 2), and an Excel file containing the Rap1p binding data used to make the comparisons between the different techniques (Additional data file 3).

Acknowledgements
This work was supported by NIH grants to M.J.B. (F32HG002989) and J.D.L. (R01GM072518) and by an NSF grant to A.B.N. (DMS-0406361).

References
1. Buck MJ, Lieb JD: ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics 2004, 83:349-360.
2. Kurdistani SK, Grunstein M: In vivo protein-protein and protein-DNA crosslinking for genomewide binding microarray. Methods 2003, 31:90-95.
3. Wells J, Farnham PJ: Characterizing transcription factor binding sites using formaldehyde crosslinking and immunoprecipitation. Methods 2002, 26:48-56.
4. Lieb JD: Genome-wide mapping of protein-DNA interactions by chromatin immunoprecipitation and DNA microarray hybridization. Methods Mol Biol 2003, 224:99-109.
5. Hanlon SE, Lieb JD: Progress and challenges in profiling the dynamics of chromatin and transcription factor binding with DNA microarrays. Curr Opin Genet Dev 2004, 14:697-705.
6. Li Z, Van Calcar S, Qu C, Cavenee WK, Zhang MQ, Ren B: A global transcriptional regulatory role for c-Myc in Burkitt's lymphoma cells. Proc Natl Acad Sci USA 2003, 100:8164-8169.
7. Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, et al.: Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 2002, 298:799-804.
8. Kapranov P, Cawley SE, Drenkow J, Bekiranov S, Strausberg RL, Fodor SP, Gingeras TR: Large-scale transcriptional activity in chromosomes 21 and 22. Science 2002, 296:916-919.
9. Horak CE, Mahajan MC, Luscombe NM, Gerstein M, Weissman SM, Snyder M: GATA-1 binding sites mapped in the beta-globin locus by using mammalian chIp-chip analysis. Proc Natl Acad Sci USA 2002, 99:2924-2929.
10. Glynn EF, Megee PC, Yu HG, Mistrot C, Unal E, Koshland DE, DeRisi JL, Gerton JL: Genome-wide mapping of the cohesin complex in the yeast Saccharomyces cerevisiae. PLoS Biol 2004, 2:E259.
11. ChIPOTle: a user-friendly tool for the analysis of ChIP-chip data []
12. Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey E, Dai H, He YD, et al.: Functional discovery via a compendium of expression profiles. Cell 2000, 102:109-126.
13. Lieb JD, Liu X, Botstein D, Brown PO: Promoter-specific binding of Rap1 revealed by genome-wide maps of protein-DNA association. Nat Genet 2001, 28:327-334.
14. Lascaris RF, Mager WH, Planta RJ: DNA-binding requirements of the yeast protein Rap1p as selected in silico from ribosomal protein gene promoter sequences. Bioinformatics 1999, 15:267-277.
15. Efron B, Gong G: A leisurely look at the bootstrap, the jackknife, and cross-validation. Am Stat 1983, 37:36-48.
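
The Gaussian significance analysis described in the discussion can be illustrated with a short sketch. It is not ChIPOTle's Visual Basic implementation. It assumes, for illustration only, that a window is a fixed number of consecutive arrayed elements and that the common standard deviation (here `sigma`) is already known; the function names are hypothetical. Under the stated null, the mean of n independent log2 ratios with mean zero and variance sigma^2 is Gaussian with variance sigma^2 / n, which gives a one-sided nominal P value per window, followed by a Bonferroni correction over all windows.

```python
import math


def gaussian_upper_tail(z):
    """Return P(Z >= z) for a standard Gaussian Z (one-sided tail)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def window_p_values(log2_ratios, sigma, window_size):
    """Nominal one-sided P value for each window of consecutive elements.

    Assumption (illustrative): a window is `window_size` consecutive values.
    Under the null, the window mean is N(0, sigma^2 / n), so the test
    statistic is z = mean * sqrt(n) / sigma.
    """
    n = window_size
    p_values = []
    for start in range(len(log2_ratios) - n + 1):
        mean = sum(log2_ratios[start:start + n]) / n
        z = mean * math.sqrt(n) / sigma
        p_values.append(gaussian_upper_tail(z))
    return p_values


def bonferroni(p_values):
    """Bonferroni correction: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


if __name__ == "__main__":
    # Hypothetical log2 ratios with an enriched stretch in the middle.
    ratios = [0.1, -0.2, 0.0, 2.1, 2.4, 1.9, 0.1, -0.1]
    nominal = window_p_values(ratios, sigma=0.5, window_size=3)
    for start, (p, q) in enumerate(zip(nominal, bonferroni(nominal))):
        print(f"window starting at element {start}: P={p:.3g}, corrected={q:.3g}")
```

Only the upper tail is used because ChIP-chip data are single-tailed: enrichment is meaningful, depletion is not. Replacing the Gaussian tail with a heavier-tailed null, as the text proposes, would change only `gaussian_upper_tail`.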
