Automatic Structured Query Transformation Over Distributed Digital Libraries

M. Elena Renda* (I.I.T. – C.N.R. and Scuola Superiore Sant'Anna, I-56100 Pisa, Italy; elena.renda@iit.cnr.it)
Umberto Straccia (I.S.T.I. – C.N.R., I-56100 Pisa, Italy; umberto.straccia@isti.cnr.it)

ABSTRACT

Structured data and complex schemas are becoming the main way to represent the information many Digital Libraries provide, thus impacting the services they offer. When searching for information among distributed Digital Libraries with heterogeneous schemas, a structured query over a given schema (the global or target schema) has to be transformed into a query over the schema of the digital library it will be submitted to (the source schema). Schema mappings define the rules for this query transformation; schema matching is the problem of learning these mappings. In this paper we address the issue of automatically learning these mappings and transforming a structured query over the target schema into a new structured query over the source schema. We propose a simple and effective schema matching method based on the well-known CORI selection algorithm, together with two ways of applying it. By evaluating the effectiveness of the resulting structured queries, we show that the method works well for accessing distributed, heterogeneous digital libraries.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Query formulation; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Search process; H.3.4 [Information Storage and Retrieval]: Systems and Software—Distributed systems; H.2.5 [Database Management]: Heterogeneous Databases

General Terms

Algorithms, Experimentation, Measurement, Performance.

Keywords

Digital Libraries, Structured Queries, Automatic Query Transformation, Automatic Mappings, Schema Matching.

* This work was done when the author was a research assistant at I.S.T.I. – C.N.R.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SAC'06, April 23–27, 2006, Dijon, France.
Copyright 2006 ACM 1-59593-108-2/06/0004 ...$5.00.

1. INTRODUCTION

Digital Libraries (DLs) [14] differ in the kind, quantity and quality of the information and services they provide, and in the kind of users they are meant to address (see, for instance, the ACM DL (http://www.acm.org/dl), Medline (http://igm.nlm.nih.gov), and NCSTRL (http://www.ncstrl.org)). In particular, they may be extremely heterogeneous under two main aspects:

1. the topic of the information they provide. Imagine a wide-area library, with up to thousands of collections available to search in, dealing with, for instance, Physics, Medicine, Computer Science, and so on; and

2. the metadata schema they use to describe the information they provide. Indeed, DLs use different sets of attributes to represent their document content.

As users cannot deal efficiently with the number and the heterogeneity of the available DLs, it is increasingly necessary that they deal only with a single system offering a unified interface for searching over multiple DLs simultaneously, using one system-wide target metadata schema defined independently of the libraries.
For instance, let us assume that:

- we have a target schema T with attributes A_1, ..., A_n (also called the global or mediated schema);

- a user query q over the target metadata schema T is based on single-valued attributes A_1, ..., A_n, i.e. the query is a set of attribute-value pairs of the form

  q = {(A_1, v_1), ..., (A_k, v_k)},

  where each A_i is an attribute of the target schema T and v_i is a type-correct value for A_i (note that a query attribute A_i may appear in the query multiple times);

- we have available distributed DLs (called information resources) R = {R_1, ..., R_m} to search in;

- each resource R_i has its own metadata schema S_i (the source schema), based on attributes B_{i,1}, ..., B_{i,n_i};

- a query q' over the source schema S_i is a set of attribute-value pairs of the form

  q' = {(B_{i,1}, u_1), ..., (B_{i,k}, u_k)},

  where each B_{i,j} is an attribute of the source schema S_i and u_j is a type-correct value for B_{i,j}.

In such a context, three major tasks have to be executed in order to answer a user query submitted over the target schema T: (i) Automatic Resource Selection (see, e.g., [4, 15]), to select a subset of the most promising DLs among all those available, as it is not cost-effective to submit the query to all possible resources; (ii) Schema Matching (see, e.g., [28]), to reformulate the information need into a new query q' over the query language provided by the selected resource(s), using schema mapping rules; and (iii) Rank Fusion (see, e.g., [29]), to merge the ranked lists returned by each queried resource.

In this paper we deal with the schema matching problem, i.e. with the problem of automatically learning the schema mapping rules for querying heterogeneous distributed DLs. Schema matching is based on mappings that usually are of the form A_i ↦ B_j, stating that attribute A_i of the target schema can be mapped into attribute B_j of the source schema. For instance, suppose we have the metadata target schema T = {author, description} and a resource R_i with the metadata source schema S_i = {creator, subject}. An example of user query over the target schema is

  q = {(author, "Moshe Vardi"), (description, "logic"), (description, "computer science")},

whose intended meaning is to "retrieve all documents written by Moshe Vardi which talk about logic in computer science". Using the mappings

  author ↦ creator,   description ↦ subject,

the above query is rewritten as the query q' over the source schema:

  q' = {(creator, "Moshe Vardi"), (subject, "logic"), (subject, "computer science")}.

We propose a simple and effective method to automatically learn schema mappings in such a scenario, based on the well-known CORI resource selection method [7]. Indeed, similarly to the resource selection problem, where we have to automatically identify the most relevant libraries with respect to a given query, in the schema matching problem we have to identify, for each target attribute, the most relevant source attribute with respect to a given structured query. That is, given (i) an attribute-value pair (A_j, v_j), with A_j an attribute of the target schema T, and (ii) a resource R_i with source schema S_i, we want to identify among all the attributes B_{i,k} ∈ S_i the most relevant one to map A_j to.
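Once such mappings are available, the rewriting step itself is mechanical. The following is a minimal sketch (illustrative, not taken from the paper) that represents queries as lists of attribute-value pairs and mappings as a plain dictionary; the attribute names reuse the author/description example above.

```python
# Hypothetical sketch: rewriting a structured query over the target schema
# into a query over a source schema, given attribute-to-attribute mappings.

from typing import Dict, List, Tuple

Query = List[Tuple[str, str]]  # list of (attribute, value) pairs

def transform_query(query: Query, mappings: Dict[str, str]) -> Query:
    """Rewrite each (target_attribute, value) pair using the mapping rules.

    Pairs whose attribute has no known mapping are dropped here; a real
    system might instead fall back to a default field or keyword search.
    """
    return [(mappings[attr], value) for attr, value in query if attr in mappings]

# Example query from the text: documents by Moshe Vardi about logic in computer science.
target_query = [("author", "Moshe Vardi"),
                ("description", "logic"),
                ("description", "computer science")]

# Mappings (learned or given) for one particular source schema.
mappings = {"author": "creator", "description": "subject"}

print(transform_query(target_query, mappings))
# [('creator', 'Moshe Vardi'), ('subject', 'logic'), ('subject', 'computer science')]
```

The interesting part, of course, is learning the `mappings` dictionary automatically; that is what the rest of the paper addresses.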
While most of the schema matching methods presented in the literature need an a priori schema learning process, and thus require instances of the target schema, a major feature and, to the best of our knowledge, the novelty of our approach is that it can be applied directly to queries over the target schema without requiring instances of it. This is often the case when dealing with queries over distributed DLs, where training data is missing.

The structure of the paper is the following: Section 2 presents an overview of the schema matching problem and its main application fields. Section 3 introduces our formal framework for schema matching applied to query transformation. In Section 4 we report the evaluation of the proposed approach on different data sets. Section 5 concludes.

2. RELATED WORK

Matching is a fundamental task in the manipulation of structured information. It consists in taking as input two schemas/ontologies, each consisting of a set of entities (e.g., tables, XML elements, classes, properties, predicates, metadata), and producing as output a mapping, i.e. a set of semantic relationships (e.g., equivalence, subsumption) between elements of the two given schemas [8, 10, 23, 27].

With the proliferation of DLs over the Web, the development of automated tools for schema matching is of particular importance for automating distributed information retrieval. Indeed, matching has a central role in many well-known application domains, such as the Semantic Web, schema integration, Electronic Commerce (E-Commerce), ontology integration, web-oriented data integration, XML mapping, catalog matching, and so on. Performing schema matching manually, as is usually done when searching the Web, is time-consuming and expensive. Moreover, the number and complexity of the Web information resources to integrate grows enormously, making the schema matching process difficult to handle.

Matching has been addressed by many researchers in two related areas: the matching problem for ontologies and the matching problem for database schemas. Ontology matching differs substantially from schema matching for two reasons [25]: the presence of explicit semantics and the underlying knowledge models. Indeed, while ontologies are logical systems that themselves incorporate semantics (intuitive or formal), database schemas often do not provide explicit semantics for their data. Furthermore, ontology data models are richer than schema data models. The approaches proposed for ontology matching [12, 20, 21, 26] try to exploit knowledge explicitly encoded in the ontologies, while schema matching approaches try to guess the meaning encoded in the schemas. Even considering these differences, however, the two areas are closely related: schemas can be considered as, e.g., simple ontologies with restricted relationship types; on the other hand, the techniques applied in schema matching can be applied to ontology matching as well, taking care of the hierarchies.

All the schema matching approaches proposed in the literature [3, 9, 18, 22, 24] involve the definition of mappings among schemas and, possibly, of a global integrated schema.
Most recent approaches, either implicitly or explicitly, perform schema mapping based on attribute name comparison and/or on comparing properties of the underlying data instances using machine learning techniques [1, 11]. In particular, applying machine learning techniques requires instances from both the target schema and the source schema; in these cases both schemas are relational database tables, and the attribute matching process is based on some comparison between the values in the source table and the target table. An extensive comparison can be found in [28].

Typical applications for schema matching are schema integration, i.e. the problem of constructing a global view given a set of independently developed schemas [30], "message translation" for E-Commerce portals, and also the Semantic Web [2, 13], where matching can be used for mapping messages between autonomous agents. In this paper we consider another scenario for schema matching: given a user query written over the target schema T, automatically learn for each attribute A_i of the query the most promising attribute B_j of the source schema S_k, in order to submit the query to the resource R_k. It has been shown in [16] that automatic structured queries may significantly improve the search process over structured DLs.

3. QUERY TRANSFORMATION

In this section we describe our approach for solving the schema matching problem, in order to automatically learn schema mappings and transform queries over heterogeneous DLs. It is based on a reformulation of the CORI resource selection framework [7]. In the following, to make the paper self-contained, we first introduce the automatic resource selection task and describe how it is performed using the CORI method; after that, we fit the CORI resource selection method into our context.

Automatic Resource Selection Using CORI

The automatic resource selection task is becoming a crucial step when searching over a large number of distributed, heterogeneous resources. Indeed, resource selection improves search efficiency, reducing the costs of distributed search (search time, network bandwidth, computation, and so on), and can also improve search effectiveness, since after selection the resources holding no relevant documents are ideally not queried, gaining both in time and in document relevance.

Some approaches presented in the literature rely on short resource descriptions [17, 31], a sort of content summary that usually includes the terms appearing in the resource and their document frequency; it may also include other simple information, such as the total number of documents (the resource dimension). Web resources and their corresponding search servers are usually non-cooperative, while all the approaches relying on resource descriptions require their cooperation in order to obtain the content summary or information about the documents they index, such as term occurrence statistics. For this reason, query-based sampling techniques have been proposed for deriving content description approximations [5, 19]. These approaches use queries (called probe queries) to retrieve and download a relatively small set of documents (called the resource sample), representative of the topic covered by the resource; the resource sample is then used to build the content summary (the description or approximation of the resource) by extracting term occurrence statistics.
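As an illustration of what such an approximation might look like, the following hedged sketch (not from the paper; record structure and field names are assumptions) builds simple occurrence statistics from a handful of sampled metadata records. These are the kinds of statistics the CORI formulas below consume.

```python
# Hedged sketch: building a content-summary approximation for one resource
# from a small sample of metadata records obtained by query-based sampling.
# Field names and record structure are illustrative assumptions.

from collections import Counter
from typing import Dict, List

MetadataRecord = Dict[str, str]          # attribute -> value

def build_summary(sample: List[MetadataRecord]) -> Dict[str, object]:
    """Extract simple occurrence statistics from the sampled records.

    df[v] -> number of sampled records containing the value v
    cw    -> total number of values observed in the sample
    (A real system would normally tokenize and normalize values as well.)
    """
    df: Counter = Counter()
    cw = 0
    for record in sample:
        values = [v.lower() for v in record.values()]
        cw += len(values)
        df.update(set(values))           # count each value once per record
    return {"df": df, "cw": cw, "n_records": len(sample)}

# Example: two sampled records from one resource.
sample = [
    {"creator": "moshe vardi", "subject": "logic"},
    {"creator": "moshe vardi", "subject": "model checking"},
]
summary = build_summary(sample)
print(summary["df"]["moshe vardi"], summary["cw"])   # 2 4
```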
In [6] it has been shown that resource samples provide statistics that are quite representative of the statistics of the full resource. The resource description is then used to compute, for each information resource, a resource score, i.e. a measure of the relevance of the resource to the query. The resource score establishes how closely an information resource is related to a given user query. CORI is one of the methods used in automatic resource selection to compute the resource score with respect to the query. In the following we describe an adapted version of the CORI resource selection method for the case in which documents are represented by metadata records.

Consider the user query q = {(A_1, v_1), ..., (A_n, v_n)} over the target schema T. At first, we unfold the complex query q into a simple query q_s in which the attributes A_i have been removed, i.e. q_s = {v_1, ..., v_n}. For each resource R_i ∈ R, we compute the resource score G(R_i) (also called goodness), representing the relevance of resource R_i to the query q_s, as follows:

$$G(R_i) = \frac{\sum_{v_k \in q_s} p(v_k \mid R_i)}{|q_s|} \qquad (1)$$

where |q_s| is the number of values in the simple query q_s. The belief p(v_k | R_i) is computed for each value v_k ∈ q_s using the CORI algorithm [4, 7]:

$$p(v_k \mid R_i) = T_{i,k} \cdot I_k \cdot w_k \qquad (2)$$

$$T_{i,k} = \frac{df_{i,k}}{df_{i,k} + 50 + 150 \cdot \frac{cw_i}{\overline{cw}}} \qquad (3)$$

$$I_k = \frac{\log\!\left(\frac{|R| + 0.5}{cf_k}\right)}{\log\!\left(|R| + 1.0\right)} \qquad (4)$$

where:
- w_k is the weight of the value v_k in the query;
- df_{i,k} is the number of records in the approximation of R_i containing the value v_k;
- cw_i is the number of values in the approximation of R_i;
- \overline{cw} is the mean of all the cw_i;
- cf_k is the number of approximated resources containing the value v_k;
- |R| is the number of resources.

In the above formulae, T_{i,k} indicates how many records contain the value v_k in the resource R_i, while I_k, defined in terms of the resource frequency cf_k, is the inverse resource frequency: the higher cf_k, the smaller I_k, i.e., the more a value occurs among the resources, the less discriminating it is. The belief p(v_k | R_i) combines these two measures. Informally, a resource is more relevant if its approximation, computed by query-based sampling, contains many terms related to the original query; but if a query term occurs in many resources, that term is not a good one for discriminating between relevant and non-relevant resources.

Finally, all the information resources R_i ∈ R are ranked according to their resource goodness value G(R_i), and the top-ranked ones are selected as the most relevant.

Schema Matching Using CORI

Let q = {(A_1, v_1), ..., (A_n, v_n)} be a user query over the metadata target schema T and R_j ∈ R a relevant selected resource, with metadata source schema S_j. Our task is to find out how to map each attribute-value pair (A_i, v_i) ∈ q into one or more attribute-value pairs (B_{j,l}, v_i), where B_{j,l} ∈ S_j, using CORI.

The basic and simple idea is the following. Consider a query attribute-value pair (A_i, v_i) ∈ q and let R_j ∈ R be a selected resource to search in. Let B_{j,1}, ..., B_{j,m_j} ∈ S_j be the attributes of the metadata schema of R_j. Our idea is to map the attribute A_i to the attribute B_{j,l} if the value v_i is relevant to the collection of all the values that the attribute B_{j,l} takes in the resource R_j. Essentially, for each of the attributes B_{j,l} ∈ S_j, we make a collection C_{j,l} of the values that B_{j,l} takes in the resource R_j, and then ask CORI which of these collections is most relevant to the query value v_i. If C_{j,l} is among the answers of CORI, then we build the mapping A_i ↦ B_{j,l}.

More formally, consider the resource R_j and the records r_1, ..., r_m of the approximation of R_j, Approx(R_j) (computed by query-based sampling). Each record r_t ∈ Approx(R_j) is a set of attribute-value pairs r_t = {(B_{j,1}, u_1), ..., (B_{j,m_j}, u_{m_j})}. From Approx(R_j) we make a projection on each attribute, i.e. for each attribute B_{j,l} of the schema we build a new set of records:

$$C_{j,l} = \bigcup_{r_t \in Approx(R_j)} \{\, u \mid (B_{j,l}, u) \in r_t \,\}$$

The basic idea is that each projection C_{j,1}, ..., C_{j,m_j} can be seen as a new collection, and we apply CORI to select which of these is the most relevant for each attribute-value pair (A_i, v_i) of the query q.

By using each projection C_{j,l} as a resource, we can apply the resource selection framework to attribute matching: in order to find out whether to match a target attribute-value pair (A_i, v_i) ∈ q into a source attribute-value pair (B_{j,l}, v_i), we verify whether the resource C_{j,l} has been selected among the top-ranked resources relevant to the query q' = {(A_i, v_i)}. That is, we build the query q' = {(A_i, v_i)} and then compute all the goodnesses G(C_{j,1}), ..., G(C_{j,m_j}). If G(C_{j,l}) is the top score, then we map (A_i, v_i) into the attribute-value pair (B_{j,l}, v_i). Once the procedure has been applied to all (A_i, v_i) ∈ q, a complex query q' over the source schema S_j of the selected resource R_j ∈ R is obtained and can be submitted to R_j.
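To make the procedure concrete, here is a minimal, illustrative sketch of Equations (1)–(4) applied to per-attribute projections as described above. It is not the authors' implementation: the data structures, the query weight choice (w_k = 1), the substring matching of values, and the attribute names are simplifying assumptions.

```python
# Illustrative sketch (assumptions noted above): CORI-style scoring of
# per-attribute projections to learn mappings A_i -> B_{j,l} for one resource.

import math
from typing import Dict, List, Tuple

MetadataRecord = Dict[str, str]              # attribute -> value

def project(sample: List[MetadataRecord]) -> Dict[str, List[str]]:
    """Build one projection C_{j,l} per source attribute from the sampled records."""
    projections: Dict[str, List[str]] = {}
    for record in sample:
        for attr, value in record.items():
            projections.setdefault(attr, []).append(value.lower())
    return projections

def goodness(value: str, collection: List[str],
             all_collections: List[List[str]], mean_cw: float) -> float:
    """CORI belief p(v | C) = T * I * w with w = 1 (Eqs. 2-4), for a one-value query."""
    df = sum(1 for v in collection if value in v)      # records containing v (substring match)
    cw = len(collection)                               # number of values in this collection
    cf = sum(1 for c in all_collections if any(value in v for v in c))
    n = len(all_collections)
    if df == 0 or cf == 0:
        return 0.0
    T = df / (df + 50 + 150 * cw / mean_cw)            # Eq. (3)
    I = math.log((n + 0.5) / cf) / math.log(n + 1.0)   # Eq. (4)
    return T * I                                       # Eq. (2), with w_k = 1

def match_query(query: List[Tuple[str, str]],
                sample: List[MetadataRecord]) -> List[Tuple[str, str]]:
    """For each (target_attribute, value) pair, pick the best-scoring source attribute."""
    projections = project(sample)
    collections = list(projections.values())
    mean_cw = sum(len(c) for c in collections) / len(collections)
    rewritten = []
    for _target_attr, value in query:
        scores = {attr: goodness(value.lower(), coll, collections, mean_cw)
                  for attr, coll in projections.items()}
        best = max(scores, key=scores.get)             # top-scoring projection wins
        rewritten.append((best, value))
    return rewritten

# Toy example: a tiny sample of a resource whose schema is {creator, subject}.
sample = [{"creator": "moshe vardi", "subject": "logic in computer science"},
          {"creator": "edsger dijkstra", "subject": "program correctness"}]
query = [("author", "Moshe Vardi"), ("description", "logic")]
print(match_query(query, sample))
# [('creator', 'Moshe Vardi'), ('subject', 'logic')]
```

Note that the sketch keeps only the single top-scoring projection per query pair; the paper also considers selecting among the top-ranked projections rather than just the top one.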
4. EXPERIMENTS

In this section we describe the data (documents and queries), the evaluation measures, and the experimental setup used to evaluate our approach.

EXPERIMENTAL SETUP. The schema mapping task involves a target schema and one source schema with its corresponding resource approximation. The experiments were performed on three different data sets:

- the OAI-DC collection, which is an Open Archives Initiative collection of scientific documents in XML format;

- a set of computer science documents in the XML OAI-RFC 1807 format