FAST Enterprise Search Platform
version:5.3
BrowserEngine
Document Number: ESP1046, Document Revision: A, December 03, 2009
Copyright
Copyright © 1997-2009 by Fast Search & Transfer ASA (“FAST”). Some portions may be copyrighted
by FAST’s licensors. All rights reserved. The documentation is protected by the copyright laws of Norway,
the United States, and other countries and international treaties. No copyright notices may be removed
from the documentation. No part of this document may be reproduced, modified, copied, stored in a
retrieval system, or transmitted in any form or any means, electronic or mechanical, including
photocopying and recording, for any purpose other than the purchaser’s use, without the written
permission of FAST. Information in this documentation is subject to change without notice. The software
described in this document is furnished under a license agreement and may be used only in accordance
with the terms of the agreement.
Trademarks
FAST ESP, the FAST logos, FAST Personal Search, FAST mSearch, FAST InStream, FAST AdVisor,
FAST Marketrac, FAST ProPublish, FAST Sentimeter, FAST Scope Search, FAST Live Analytics, FAST
Contextual Insight, FAST Dynamic Merchandising, FAST SDA, FAST MetaWeb, FAST InPerspective,
NXT, FAST Unity, FAST Radar, RetrievalWare, AdMomentum, and all other FAST product names
contained herein are either registered trademarks or trademarks of Fast Search & Transfer ASA in
Norway, the United States and/or other countries. All rights reserved. This documentation is published
in the United States and/or other countries.
Sun, Sun Microsystems, the Sun Logo, all SPARC trademarks, Java, and Solaris are trademarks or
registered trademarks of Sun Microsystems, Inc. in the United States and other countries.
Netscape is a registered trademark of Netscape Communications Corporation in the United States and
other countries.
Microsoft, Windows, Visual Basic, and Internet Explorer are either registered trademarks or trademarks
of Microsoft Corporation in the United States and/or other countries.
Red Hat is a registered trademark of Red Hat, Inc.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is the registered trademark of Linus Torvalds in the U.S. and other countries.
AIX and IBM Classes for Unicode are registered trademarks or trademarks of International Business
Machines Corporation in the United States, other countries, or both.
HP and the names of HP products referenced herein are either registered trademarks or service marks,
or trademarks or service marks, of Hewlett-Packard Company in the United States and/or other countries.
Remedy is a registered trademark, and Magic is a trademark, of BMC Software, Inc. in the United States
and/or other countries.
XML Parser is a trademark of The Apache Software Foundation.
All other company, product, and service names are the property of their respective holders and may be
registered trademarks or trademarks in the United States and/or other countries.
Restricted Rights Legend
The documentation and accompanying software are provided to the U.S. government in a transaction
subject to the Federal Acquisition Regulations with Restricted Rights. Use, duplication, or disclosure of
the documentation and software by the government is subject to restrictions as set forth in FAR 52.227-19
Commercial Computer Software-Restricted Rights (June 1987).
Contact Us
Web Site
Please visit us at: http://www.fastsearch.com/
Contacting FAST
FAST
Cutler Lake Corporate Center
117 Kendrick Street, Suite 100
Needham, MA 02492 USA
Tel: +1 (781) 304-2400 (8:30am - 5:30pm EST)
Fax: +1 (781) 304-2410
Technical Support and Licensing Procedures
Technical support for customers with active FAST Maintenance and Support agreements, e-mail:
fasttech@microsoft.com
For obtaining FAST licenses or software, contact your FAST Account Manager or e-mail:
fastcsrv@microsoft.com
For evaluations, contact your FAST Sales Representative or FAST Sales Engineer.
Product Training
E-mail: fastuniv@microsoft.com
To access the FAST University Learning Portal, go to: http://www.fastuniversity.com/
Sales
E-mail: sales@fastsearch.com
Contents
Preface
  Copyright
  Contact Us
Chapter 1: About BrowserEngine
  About the BrowserEngine
  Architecture
Chapter 2: Configuring the BrowserEngine
  Enterprise Crawler considerations
  Configuration via XML File
    Modifying BrowserEngine server settings
    Setting browser attributes
    Configuring the extractor pipeline
    Flash settings
    Example
Chapter 3: Operating the BrowserEngine
  Starting and Stopping
    Starting from the administrator interface
    Stopping from the administrator interface
    Starting from the command line
    Stopping from the command line
  Logging
    Change the BrowserEngine logging
  Monitoring
  Tuning
  Restrictions
Chapter 4: BrowserEngine reference information
  BrowserEngine binary
  XML-RPC Browser Interface
  XML-RPC Status Interface
  Extractor processing examples
Chapter 1: About BrowserEngine
The BrowserEngine is a highly scalable and configurable component that extracts links and text from
JavaScript and Adobe Flash files. The BrowserEngine is used by the FAST Enterprise Crawler and may be
called from the Document Processing pipeline.
Topics:
• About the BrowserEngine
• Architecture
About the BrowserEngine
The BrowserEngine is a highly scalable and configurable component that extracts links and text from JavaScript
and Adobe Flash files. The BrowserEngine is used by the FAST Enterprise Crawler (EC) and can also be
used from the Document Processing pipeline.
The BrowserEngine is a new component that replaces functionality previously available only to the Enterprise
Crawler. It is intended to provide superior web page content, through the following new features:
• Improved Document Object Model (DOM) coverage
• Cookie extraction
• Frame support
• Extensibility and customization
• Scalable architecture
• Link and metadata extraction from Flash
The new BrowserEngine enables more links to be extracted, which improves the scope of a crawl, and
produces better document content, which enhances index quality. In addition, customers and partners can
modify the behavior of the engine according to individual needs. Because more thorough emulation of a
browser environment requires additional system resources, the design allows the crawler to distribute the
load across multiple BrowserEngines (on one or more hosts) and so scale the number of pages processed.
Architecture
The BrowserEngine is a stand-alone ESP component, capable of processing HTML documents containing
JavaScript and Flash files. It accomplishes this by emulating a browser's internal environment, without the
need for a display.
The BrowserEngine is implemented in Java and runs as a separate process. This provides isolation from
other components (in particular, from the Enterprise Crawler), in the case of a fatal error. This design also
allows one component to use multiple BrowserEngines, and multiple components to share the same
BrowserEngine.
The following diagram illustrates the major functional modules within the BrowserEngine, and shows the
datapaths that will be referenced in the following discussion.
Figure 1: BrowserEngine Architecture
To give an overview of how the BrowserEngine works, consider the flow of an HTML page through the internal
processing. When the BrowserEngine receives a processing request, it assigns the task to a thread from its
pool of idle threads. If the file is a Flash binary content file, it is simply parsed for text and links and the result
returned. Otherwise, it is delivered to the JavaScript Handler. The first step is to run a user-definable page
preprocessor to initialize the DOM tree, before any processing of the page contents takes place. This allows
the BrowserEngine to simulate support for browser plug-ins such as Adobe Reader, Apple QuickTime or
Windows Media Player, and also permits initialization of settings such as User-Agent, or the screen size. The
page preprocessor is written in JavaScript, in order to provide quick and easy customization.
After the page preprocessor has initialized the DOM tree, the BrowserEngine parses the HTML document,
fetches external dependencies and populates the DOM tree with HTML elements. External dependencies,
such as scripts and frames, are looked up in a local dependency cache or fetched indirectly via the
Enterprise Crawler, which acts as a caching proxy. The BrowserEngine can also fetch resources directly
from the network when it is used by components other than the crawler. The document is loaded just as a
real browser would load it, by executing scripts and onLoad handlers.
In addition to the page preprocessor, there is an optional script preprocessor that can modify the source code
of every snippet of JavaScript code before it is executed.
After the document is loaded, the constructed DOM tree is passed to a configurable pipeline of extractors.
The pipeline stages create a text representation of the HTML document, extract cookies, generate a document
checksum, simulate user interactions and extract links. This data and metadata are returned to the calling
component.
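The processing flow described above can be summarized in a short JavaScript sketch. This is only an illustration of the order of the steps, not the BrowserEngine implementation: the function processDocument and the stage names parseFlash, pagePreprocessor, loadDocument and extractors are hypothetical and do not correspond to any BrowserEngine API.

```javascript
// Hypothetical sketch of the request flow described above. The stage names
// (parseFlash, pagePreprocessor, loadDocument, extractors) are illustrative
// and are not BrowserEngine APIs.
function processDocument(doc, stages) {
    if (doc.isFlash) {
        // Flash binaries are parsed directly for text and links.
        return stages.parseFlash(doc);
    }
    // 1. The page preprocessor initializes the DOM tree first.
    const dom = stages.pagePreprocessor();
    // 2. Parse the HTML, resolve dependencies, run scripts and onLoad handlers.
    stages.loadDocument(dom, doc);
    // 3. Run the extractor pipeline in order and merge each stage's output.
    const result = {};
    for (const extractor of stages.extractors) {
        Object.assign(result, extractor(dom));
    }
    return result;
}
```

The sketch keeps the Flash path and the JavaScript Handler path separate, so that the page preprocessor always runs before any page content is parsed.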
Chapter 2: Configuring the BrowserEngine
The BrowserEngine can run out of the box with FAST ESP. However, you may
want to change the preprocessors and/or the pipeline to fit your needs.
Topics:
• Enterprise Crawler considerations
• Configuration via XML File
Enterprise Crawler considerations
The BrowserEngine does work on behalf of, and in conjunction with, the Enterprise Crawler, and that component
must be configured properly to make use of the BrowserEngine.
There are two requirements for configuring the Enterprise Crawler to make use of the BrowserEngine. The
first is that one of the following attributes must be enabled, by setting it to the value Yes:
• JavaScript support
• Macromedia Flash support
The Enterprise Crawler also needs to be configured with the location of all available BrowserEngines in the
FAST ESP installation. Normally this setup is done by the FAST ESP installation itself, as each BrowserEngine
is enabled. For information about the details, please see the section CrawlerGlobalDefaults.xml options in
the FAST Enterprise Crawler Guide.
Configuration via XML File
The BrowserEngine is configured with default settings that are appropriate for most FAST ESP installations.
You can change the configuration, including the preprocessors and the pipeline, to fit the needs of your
installation.
The BrowserEngine is configured through an XML file, located on the Config Server node at:
$FASTSEARCH/etc/config_data/BrowserEngine/BrowserConfig.xml
Changes made to this file, or any other files used by the BrowserEngine configuration, will not take effect
until the BrowserEngine is restarted.
Modifying BrowserEngine server settings
The BrowserEngine XML file includes a server tag that defines the port number range, and other attributes
used to tune the performance.
Parameter           Description
port                Base port number, which is used to listen for requests from the Enterprise
                    Crawler.
                    Note: The BrowserEngine also uses port number "port+1". Both ports must be
                    free.
maxThreads          The number of BrowserEngine threads created to process documents. This
                    attribute limits the number of documents which can be processed concurrently.
                    Note that setting this value too high can result in wasted CPU utilization
                    due to scheduling, resulting in lower document throughput. Also, it can cause
                    the BrowserEngine to run out of Java heap space. Thus, a better solution is
                    to start multiple instances of the BrowserEngine.
maxQueueSize        The limit on requests that may be accepted and queued, waiting for an
                    available processing thread. If the queue becomes full, the BrowserEngine
                    will deny further requests from the Enterprise Crawler until processing
                    threads become available.
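The interaction between maxThreads and maxQueueSize can be illustrated with a small JavaScript model. This is a sketch of the admission behavior as described in the table, under the assumption that a request is processed immediately when a thread is free, queued when all threads are busy, and denied when the queue is full. The class RequestAdmission and its methods are hypothetical and are not part of the BrowserEngine.

```javascript
// Hypothetical model of request admission; not BrowserEngine code.
class RequestAdmission {
    constructor(maxThreads, maxQueueSize) {
        this.maxThreads = maxThreads;
        this.maxQueueSize = maxQueueSize;
        this.active = 0;   // number of busy processing threads
        this.queue = [];   // requests waiting for a thread
    }

    // Returns "processing", "queued" or "rejected" for a new request.
    submit(request) {
        if (this.active < this.maxThreads) {
            this.active += 1;
            return "processing";
        }
        if (this.queue.length < this.maxQueueSize) {
            this.queue.push(request);
            return "queued";
        }
        return "rejected";
    }

    // Called when a thread finishes. Returns the next queued request that
    // the thread picks up, or null if the thread becomes idle.
    finish() {
        if (this.queue.length > 0) {
            return this.queue.shift();
        }
        this.active -= 1;
        return null;
    }
}
```

In this model, raising maxQueueSize only postpones rejections. Real throughput is bounded by the number of threads, which is why the table recommends additional BrowserEngine instances over a very high maxThreads value.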
Example:
Setting browser attributes
The browser tag in the BrowserEngine XML file includes general browser attributes, and cache, blacklist,
flash, and javascript sub-tags with corresponding attributes.
Browser Tag
Parameter           Description
type                Specifies the browser type to emulate. Legal values are:
                    • Mozilla
                    • InternetExplorer
allowPopups         Specifies whether pop-ups are allowed in the BrowserEngine.
useSSL              Specifies whether the BrowserEngine should use SSL when requesting external
                    dependencies from the Enterprise Crawler. Set this attribute to false when
                    the BrowserEngine is used in a FAST ESP installation with the crawler.
                    Note: This setting only affects the BrowserEngine's interactions with the
                    Enterprise Crawler, which may still use SSL to retrieve the dependency.
evaluationTimeout   The total maximum time (in seconds) that processing of a document may take,
                    including time spent waiting for external dependencies. A document that
                    exceeds this limit is aborted by the BrowserEngine. In this case the
                    Enterprise Crawler will store the original document and follow the links it
                    finds.
terminateTimeout    The maximum time (in seconds) a thread can run before the BrowserEngine is
                    shut down. This prevents endlessly spinning threads that are not properly
                    timed out by the evaluationTimeout mechanism from hogging all system
                    resources.
Example:
Browser sub-tags
Within the browser tag, there are four configurable tags:
• cache
• blacklist
• flash
• javascript
Cache
Parameter           Description
size                Specifies the cache size in megabytes (MB). The cache improves the
                    performance by reducing the traffic between the BrowserEngine and the
                    Enterprise Crawler whenever there are external dependencies.
ttl                 The maximum time (in milliseconds) that a cache entry may exist in the cache.
                    If the cache becomes full, cache entries are removed in a Least Recently Used
                    order.
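The combined effect of the size and ttl attributes can be illustrated with a minimal JavaScript sketch of a cache that expires entries after a time-to-live and evicts the least recently used entry when full. The class DependencyCache is hypothetical and is not the BrowserEngine's cache. As simplifying assumptions, it measures size in characters rather than megabytes, and it accepts an injectable clock so that expiry can be demonstrated without waiting.

```javascript
// Hypothetical LRU cache with a time-to-live; not BrowserEngine code.
class DependencyCache {
    constructor(maxSize, ttlMs, now = () => Date.now()) {
        this.maxSize = maxSize;   // capacity, counted in characters here
        this.ttlMs = ttlMs;       // maximum age of an entry in milliseconds
        this.now = now;           // injectable clock (assumption, for testing)
        this.usedSize = 0;
        // A Map iterates in insertion order, so the first key is always
        // the least recently used entry.
        this.entries = new Map();
    }

    get(url) {
        const entry = this.entries.get(url);
        if (entry === undefined) return undefined;
        if (this.now() - entry.storedAt > this.ttlMs) {
            this.remove(url);     // entry has outlived its ttl
            return undefined;
        }
        // Re-insert to mark the entry as most recently used.
        this.entries.delete(url);
        this.entries.set(url, entry);
        return entry.body;
    }

    put(url, body) {
        this.remove(url);
        const size = body.length;
        if (size > this.maxSize) return;   // too large to cache at all
        // Evict least recently used entries until the new one fits.
        while (this.usedSize + size > this.maxSize) {
            this.remove(this.entries.keys().next().value);
        }
        this.entries.set(url, { body, size, storedAt: this.now() });
        this.usedSize += size;
    }

    remove(url) {
        const entry = this.entries.get(url);
        if (entry !== undefined) {
            this.usedSize -= entry.size;
            this.entries.delete(url);
        }
    }
}
```

A dependency shared by many pages, such as a common script file, stays near the most recently used end of the cache and is only fetched from the Enterprise Crawler again after its ttl expires.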
Example:
Blacklist
Parameter           Description
reqexp value        The blacklist tag contains a list of regular expressions used to exclude
                    requests for external dependencies. Before the BrowserEngine requests an
                    external dependency, it checks if the URI matches a regular expression. If
                    there is a match, the request is not submitted, and the BrowserEngine will
                    continue to process the document without downloading the dependency. A
                    common usage is to block advertisements.
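The blacklist check can be sketched in JavaScript as follows. This is an illustration only: the function createBlacklist and the patterns in the usage note are hypothetical. The sketch also assumes that a pattern may match anywhere in the URI. The BrowserEngine itself may require a full match, so verify the behavior against your installation.

```javascript
// Hypothetical blacklist check; not BrowserEngine code.
function createBlacklist(patterns) {
    // Compile each pattern once. No "g" flag, so test() keeps no state
    // between calls.
    const compiled = patterns.map((pattern) => new RegExp(pattern));
    // Returns true if the URI matches any pattern, in which case the
    // dependency would not be requested.
    return function isBlocked(uri) {
        return compiled.some((re) => re.test(uri));
    };
}
```

For example, createBlacklist(["^https?://ads\\.", "/banners/"]) returns a function that reports a script hosted on an ads. host or under a /banners/ path as blocked, while all other dependencies are requested as usual.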
Example:
JavaScript
Parameter           Description
timeout             Specifies the maximum time (in milliseconds) that the JavaScript engine is
                    allowed to execute a snippet of JavaScript code. If the timeout limit is
                    reached, the execution of the JavaScript code is aborted. This prevents the
                    BrowserEngine from becoming stuck in endless loops.
scriptPreprocessor  Specifies the URL or Java resource path to the script preprocessor JavaScript
                    code.
pagePreProcessor    Specifies the URL or Java resource path to the page preprocessor JavaScript
                    code.
Example:
Specifying a customized page preprocessor
The page preprocessor is a regular text file containing JavaScript code. The purpose of the page preprocessor
is to initialize the DOM tree before document processing begins. This allows the BrowserEngine to simulate
support for browser plug-ins, such as Adobe Reader, and allows browser settings such as screen size to be
set.
1. Create or modify the page preprocessor file according to your needs, and save it to the directory containing
the BrowserEngine configuration file.
2. Edit the BrowserEngine configuration file to specify this page preprocessor.
3. Restart the BrowserEngine.
Example: A page preprocessor which emulates support for the Adobe Reader.
navigator.plugins = new Array();
// Create the plug-in entry before setting its properties.
navigator.plugins[0] = new Object();
navigator.plugins[0].name = "Adobe Reader 7.0";
navigator.plugins[0].description = "The Adobe Reader plug-in is used to " +
    "enable viewing of PDF and FDF files from within the browser.";
Specifying a customized script preprocessor
The purpose of the script preprocessor is to modify JavaScript code before processing begins.
The script preprocessor is a text file containing JavaScript code to be executed before the BrowserEngine
executes the current document's JavaScript code. This allows the BrowserEngine to modify the source code
before it is executed. A script preprocessor file must define a function that accepts four parameters:
• page
• sourceCode
• sourceName
• htmlElement
The last line of the script must return the output of that function. See the example below.
1. Create or modify the script preprocessor file according to your needs and save it to the directory containing
the BrowserEngine configuration file.
2. Edit t