
ESP_BrowserEngine_Guide


FAST Enterprise Search Platform
version: 5.3
BrowserEngine
Document Number: ESP1046, Document Revision: A, December 03, 2009

Copyright

Copyright © 1997-2009 by Fast Search & Transfer ASA (“FAST”). Some portions may be copyrighted by FAST’s licensors. All rights reserved. The documentation is protected by the copyright laws of Norway, the United States, and other countries and international treaties. No copyright notices may be removed from the documentation. No part of this document may be reproduced, modified, copied, stored in a retrieval system, or transmitted in any form or any means, electronic or mechanical, including photocopying and recording, for any purpose other than the purchaser’s use, without the written permission of FAST. Information in this documentation is subject to change without notice. The software described in this document is furnished under a license agreement and may be used only in accordance with the terms of the agreement.

Trademarks

FAST ESP, the FAST logos, FAST Personal Search, FAST mSearch, FAST InStream, FAST AdVisor, FAST Marketrac, FAST ProPublish, FAST Sentimeter, FAST Scope Search, FAST Live Analytics, FAST Contextual Insight, FAST Dynamic Merchandising, FAST SDA, FAST MetaWeb, FAST InPerspective, NXT, FAST Unity, FAST Radar, RetrievalWare, AdMomentum, and all other FAST product names contained herein are either registered trademarks or trademarks of Fast Search & Transfer ASA in Norway, the United States and/or other countries. All rights reserved. This documentation is published in the United States and/or other countries.

Sun, Sun Microsystems, the Sun Logo, all SPARC trademarks, Java, and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.

Netscape is a registered trademark of Netscape Communications Corporation in the United States and other countries.

Microsoft, Windows, Visual Basic, and Internet Explorer are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.

Red Hat is a registered trademark of Red Hat, Inc.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is the registered trademark of Linus Torvalds in the U.S. and other countries.

AIX and IBM Classes for Unicode are registered trademarks or trademarks of International Business Machines Corporation in the United States, other countries, or both.

HP and the names of HP products referenced herein are either registered trademarks or service marks, or trademarks or service marks, of Hewlett-Packard Company in the United States and/or other countries.

Remedy is a registered trademark, and Magic is a trademark, of BMC Software, Inc. in the United States and/or other countries.

XML Parser is a trademark of The Apache Software Foundation.

All other company, product, and service names are the property of their respective holders and may be registered trademarks or trademarks in the United States and/or other countries.

Restricted Rights Legend

The documentation and accompanying software are provided to the U.S. government in a transaction subject to the Federal Acquisition Regulations with Restricted Rights. Use, duplication, or disclosure of the documentation and software by the government is subject to restrictions as set forth in FAR 52.227-19 Commercial Computer Software-Restricted Rights (June 1987).
Contact Us

Web Site
Please visit us at: http://www.fastsearch.com/

Contacting FAST
FAST
Cutler Lake Corporate Center
117 Kendrick Street, Suite 100
Needham, MA 02492 USA
Tel: +1 (781) 304-2400 (8:30am - 5:30pm EST)
Fax: +1 (781) 304-2410

Technical Support and Licensing Procedures
Technical support for customers with active FAST Maintenance and Support agreements, e-mail: fasttech@microsoft.com
For obtaining FAST licenses or software, contact your FAST Account Manager or e-mail: fastcsrv@microsoft.com
For evaluations, contact your FAST Sales Representative or FAST Sales Engineer.

Product Training
E-mail: fastuniv@microsoft.com
To access the FAST University Learning Portal, go to: http://www.fastuniversity.com/

Sales
E-mail: sales@fastsearch.com

Contents

Preface
    Copyright
    Contact Us
Chapter 1: About BrowserEngine
    About the BrowserEngine
    Architecture
Chapter 2: Configuring the BrowserEngine
    Enterprise Crawler considerations
    Configuration via XML File
        Modifying BrowserEngine server settings
        Setting browser attributes
        Configuring the extractor pipeline
        Flash settings
        Example
Chapter 3: Operating the BrowserEngine
    Starting and Stopping
        Starting from the administrator interface
        Stopping from the administrator interface
        Starting from the command line
        Stopping from the command line
    Logging
        Change the BrowserEngine logging
    Monitoring
    Tuning
    Restrictions
Chapter 4: BrowserEngine reference information
    BrowserEngine binary
    XML-RPC Browser Interface
    XML-RPC Status Interface
    Extractor processing examples

Chapter 1
About BrowserEngine

The BrowserEngine is a highly scalable and configurable component that extracts links and text from JavaScript and Adobe Flash files. The BrowserEngine is used by the FAST Enterprise Crawler and may be called from the Document Processing pipeline.

Topics:
• About the BrowserEngine
• Architecture

About the BrowserEngine

The BrowserEngine is a highly scalable and configurable component that extracts links and text from JavaScript and Adobe Flash files. The BrowserEngine is used by the FAST Enterprise Crawler (EC) and can also be used from the Document Processing pipeline.

The BrowserEngine is a new component that replaces functionality previously available only to the Enterprise Crawler. It is intended to provide superior web page content, through the following new features:
• Improved Document Object Model (DOM) coverage
• Cookie extraction
• Frame support
• Extensibility and customization
• Scalable architecture
• Link and metadata extraction from Flash

The new BrowserEngine will enable more links to be extracted, improving the scope of a crawl, as well as improved document content, enhancing the index quality. In addition, customers and partners can modify the behavior of the engine according to individual needs.
Because more thorough emulation of a browser environment requires additional system resources, the design allows the crawler to take advantage of multiple BrowserEngines (on one or more hosts) in order to distribute the load and scale the number of pages processed.

Architecture

The BrowserEngine is a stand-alone ESP component, capable of processing HTML documents containing JavaScript and Flash files. It accomplishes this by emulating a browser's internal environment, without the need for a display.

The BrowserEngine is implemented in Java and runs as a separate process. This provides isolation from other components (in particular, from the Enterprise Crawler) in the case of a fatal error. This design also allows one component to use multiple BrowserEngines, and multiple components to use the same BrowserEngine.

The following diagram illustrates the major functional modules within the BrowserEngine, and shows the datapaths that will be referenced in the following discussion.

Figure 1: BrowserEngine Architecture

To give an overview of how the BrowserEngine works, consider the flow of an HTML page through the internal processing. When the BrowserEngine receives a processing request, it assigns the task to a thread from its pool of idle threads. If the file is a Flash binary content file, it is simply parsed for text and links and the result is returned. Otherwise, it is delivered to the JavaScript Handler.

The first step is to run a user-definable page preprocessor to initialize the DOM tree, before any processing of the page contents takes place. This allows the BrowserEngine to simulate support for browser plug-ins such as Adobe Reader, Apple QuickTime or Windows Media Player, and also permits initialization of settings such as the User-Agent or the screen size. The page preprocessor is written in JavaScript, in order to provide quick and easy customization.
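The kind of initialization a page preprocessor performs can be sketched as follows. This is a minimal illustration, not code from the product: the helper name applyBrowserSettings, the property names (userAgent, width, height, borrowed from common browser DOM conventions) and the sample values are all assumptions, and the stand-in objects exist only so the sketch runs outside the BrowserEngine.

```javascript
// Hypothetical helper: copies browser settings onto the objects passed in.
// Property names follow common browser DOM conventions; whether the
// BrowserEngine exposes them as writable is an assumption.
function applyBrowserSettings(nav, scr, settings) {
  nav.userAgent = settings.userAgent;
  scr.width = settings.width;
  scr.height = settings.height;
  return { navigator: nav, screen: scr };
}

// Stand-in objects so the sketch runs outside the BrowserEngine.
var demo = applyBrowserSettings({}, {}, {
  userAgent: "Mozilla/5.0 (compatible; ExampleCrawler/1.0)", // illustrative value
  width: 1280,
  height: 1024
});

console.log(demo.navigator.userAgent, demo.screen.width, demo.screen.height);
```

Inside an actual page preprocessor file, the same assignments would presumably target the navigator and screen objects of the emulated browser rather than stand-ins.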
After the page preprocessor has initialized the DOM tree, the BrowserEngine parses the HTML document, fetches external dependencies and populates the DOM tree with HTML elements. External dependencies, such as scripts and frames, are looked up in a local dependency cache, or fetched indirectly via the Enterprise Crawler, which acts as a caching proxy. The BrowserEngine is also capable of fetching resources directly from the network, if used by components other than the crawler. The document is loaded just as a real browser would load it, by executing scripts and onLoad handlers.

In addition to the page preprocessor, there is an optional script preprocessor that can modify the source code of every snippet of JavaScript code before it is executed.

After the document is loaded, the constructed DOM tree is passed to a configurable pipeline of extractors. The pipeline stages create a text representation of the HTML document, extract cookies, generate a document checksum, simulate user interactions and extract links. This data and metadata is returned to the calling component.

Chapter 2
Configuring the BrowserEngine

The BrowserEngine can run out of the box with FAST ESP. However, you may want to change the preprocessors and/or the pipeline to fit your needs.

Topics:
• Enterprise Crawler considerations
• Configuration via XML File

Enterprise Crawler considerations

The BrowserEngine does work on behalf of, and in conjunction with, the Enterprise Crawler, and that component must be configured properly to make use of the BrowserEngine.

There are two requirements in configuring the Enterprise Crawler to make use of the BrowserEngine. The first is that one of the following attributes must be enabled, by setting it to the value Yes:
• JavaScript support
• Macromedia Flash support

The Enterprise Crawler also needs to be configured with the location of all available BrowserEngines in the FAST ESP installation.
Normally this setup is done by the FAST ESP installation itself, as each BrowserEngine is enabled. For details, see the section CrawlerGlobalDefaults.xml options in the FAST Enterprise Crawler Guide.

Configuration via XML File

The BrowserEngine is configured with default settings that are appropriate for most FAST ESP installations. You can change the configuration, including the preprocessors and the pipeline, to fit the needs of your installation.

The BrowserEngine is configured through an XML file, located on the Config Server node at:

$FASTSEARCH/etc/config_data/BrowserEngine/BrowserConfig.xml

Changes made to this file, or to any other files used by the BrowserEngine configuration, do not take effect until the BrowserEngine is restarted.

Modifying BrowserEngine server settings

The BrowserEngine XML file includes a server tag that defines the port number range and other attributes used to tune performance.

Parameter / Description

port
    Base port number, which is used to listen for requests from the Enterprise Crawler.
    Note: The BrowserEngine also uses port number "port+1". Both ports must be free.

maxThreads
    The number of BrowserEngine threads created to process documents. This attribute limits the number of documents that can be processed concurrently. Setting this value too high can waste CPU on thread scheduling, which lowers document throughput, and can cause the BrowserEngine to run out of Java heap space. A better solution is to start multiple instances of the BrowserEngine.

maxQueueSize
    The limit on requests that may be accepted and queued, waiting for an available processing thread. If the queue becomes full, the BrowserEngine denies further requests from the Enterprise Crawler until processing threads become available.

Example:

Setting browser attributes

The browser tag in the BrowserEngine XML file includes general browser attributes, and cache, blacklist, flash, and javascript sub-tags with corresponding attributes.

Browser Tag

Parameter / Description

type
    Specifies the browser type to emulate. Legal values are:
    • Mozilla
    • InternetExplorer

allowPopups
    Specifies whether pop-ups are allowed in the BrowserEngine.

useSSL
    Specifies whether the BrowserEngine should use SSL when requesting external dependencies from the Enterprise Crawler. The attribute should be set to false when used in a FAST ESP installation with the crawler.
    Note: This setting only affects the BrowserEngine's interactions with the Enterprise Crawler, which may still use SSL to retrieve the dependency.

evaluationTimeout
    The total maximum time (in seconds) that processing of a document may take, including time spent waiting for external dependencies. Documents that take longer than the specified value are aborted by the BrowserEngine. In this case the Enterprise Crawler stores the original document and follows the links it finds.

terminateTimeout
    The maximum time (in seconds) a thread can run before the BrowserEngine is shut down. This prevents endlessly spinning threads that were not properly timed out by the evaluationTimeout mechanism from hogging all system resources.

Example:

Browser sub-tags

Within the browser tag, there are four configurable tags:
• cache
• blacklist
• flash
• javascript

Cache

Parameter / Description

size
    Specifies the cache size in megabytes (MB). The cache improves performance by reducing the traffic between the BrowserEngine and the Enterprise Crawler whenever there are external dependencies.

ttl
    The maximum time (in milliseconds) that a cache entry may exist in the cache. If the cache becomes full, cache entries are removed in Least Recently Used order.

Example:

Blacklist

Parameter / Description

reqexp
    The blacklist tag contains a list of regular expressions used to exclude requests for external dependencies. Before the BrowserEngine requests an external dependency, it checks whether the URI matches one of the regular expressions. If there is a match, the request is not submitted, and the BrowserEngine continues to process the document without downloading the dependency. A common usage is to block advertisements.

Example:

JavaScript

Parameter / Description

timeout
    Specifies the maximum time (in milliseconds) that the JavaScript engine is allowed to execute a snippet of JavaScript code. If the timeout limit is reached, execution of the JavaScript code is aborted. This prevents the BrowserEngine from becoming stuck in endless loops.

scriptPreprocessor
    Specifies the URL or Java resource path to the script preprocessor JavaScript code.

pagePreProcessor
    Specifies the URL or Java resource path to the page preprocessor JavaScript code.

Example:

Specifying a customized page preprocessor

The page preprocessor is a regular text file containing JavaScript code. The purpose of the page preprocessor is to initialize the DOM tree before document processing begins. This allows the BrowserEngine to simulate support for browser plug-ins, such as Adobe Reader, and allows browser settings such as screen size to be set.

1. Create or modify the page preprocessor file according to your needs, and save it to the directory containing the BrowserEngine configuration file.
2. Edit the BrowserEngine configuration file to specify this page preprocessor.
3. Restart the BrowserEngine.

Example: A page preprocessor which emulates support for the Adobe Reader.

    // Create the plug-in list and its first entry before setting properties on it.
    navigator.plugins = new Array();
    navigator.plugins[0] = new Object();
    navigator.plugins[0].name = "Adobe Reader 7.0";
    navigator.plugins[0].description = "The Adobe Reader plug-in is used to enable viewing of PDF and FDF files from within the browser.";
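When several plug-ins need to be emulated (the architecture overview also mentions Apple QuickTime and Windows Media Player), the per-plug-in assignments can be wrapped in a small helper. The following is a sketch under stated assumptions: registerPlugin is a hypothetical name, the second plug-in name and both short descriptions are illustrative, and a plain object stands in for navigator so the code runs outside the BrowserEngine.

```javascript
// Hypothetical helper: appends one plug-in entry to nav.plugins,
// creating the list first if it does not exist yet.
function registerPlugin(nav, name, description) {
  if (!nav.plugins) {
    nav.plugins = [];
  }
  nav.plugins.push({ name: name, description: description });
  return nav.plugins.length; // number of plug-ins registered so far
}

var nav = {}; // stand-in for the navigator object of the emulated browser
registerPlugin(nav, "Adobe Reader 7.0", "Viewing of PDF and FDF files");
registerPlugin(nav, "QuickTime Plug-in", "Video playback"); // illustrative name

console.log(nav.plugins.length);
```

Creating each entry as an object before assigning its properties avoids setting properties on an array slot that does not exist yet.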
Specifying a customized script preprocessor

The purpose of the script preprocessor is to modify JavaScript code before processing begins. The script preprocessor is a text file containing JavaScript code to be executed before the BrowserEngine executes the current document's JavaScript code. This allows the BrowserEngine to modify the source code before it is executed.

A script preprocessor file must define a function that accepts four parameters:
• page
• sourceCode
• sourceName
• htmlElement

The last line of the script must return the output of that function. See the example below.

1. Create or modify the script preprocessor file according to your needs and save it to the directory containing the BrowserEngine configuration file.
2. Edit t