Automatic Structured Query Transformation Over
Distributed Digital Libraries
M. Elena Renda*
I.I.T. – C.N.R.
and
Scuola Superiore Sant’Anna
I-56100 Pisa, Italy
elena.renda@iit.cnr.it
Umberto Straccia
I.S.T.I. – C.N.R.
I-56100 Pisa, Italy
umberto.straccia@isti.cnr.it
ABSTRACT
Structured data and complex schemas are becoming the main way
to represent the information many Digital Libraries provide, thus
impacting the services they offer. When searching for information
among distributed Digital Libraries with heterogeneous schemas,
a structured query over a given schema (the global or target
schema) has to be transformed into a query over the schema of the
digital library it will be submitted to (the source schema). Schema
mappings define the rules for this query transformation; schema
matching is the problem of learning these mappings.
In this paper we address the issue of automatically learning
these mappings and transforming a structured query over the tar-
get schema into a new structured query over the source schema.
We propose a simple and effective schema matching method based
on the well known CORI selection algorithm and two ways of ap-
plying it. By evaluating the effectiveness of the obtained struc-
tured queries we show that the method works well in accessing
distributed, heterogeneous digital libraries.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search
and Retrieval—Query formulation; H.3.3 [Information Storage
and Retrieval]: Information Search and Retrieval—Search pro-
cess; H.3.4 [Information Search and Retrieval]: Systems and
Software—Distributed systems; H.2.5 [Database Management]:
Heterogeneous Databases
General Terms
Algorithms, Experimentation, Measurement, Performance.
Keywords
Digital Libraries, Structured Queries, Automatic Query Transformation, Automatic Mappings, Schema Matching.

* This work was done when the author was a research assistant at
I.S.T.I. – C.N.R.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SAC'06, April 23-27, 2006, Dijon, France.
Copyright 2006 ACM 1-59593-108-2/06/0004 ...$5.00.
1. INTRODUCTION
Digital Libraries (DLs) [14] differ in the kind, quantity and quality
of the information and services they provide, and in the kind of
users they are intended to serve (see, for instance, ACM
DL1, Medline2, NCSTRL3). In particular, they may be extremely
heterogeneous under two main aspects:
1. the topic of the information they provide. Imagine a wide-
area library with up to thousands of searchable collections,
dealing with, for instance, Physics, Medicine, Computer
Science, and so on; and
2. the metadata schema they use to describe the information
they provide. Indeed, DLs use different sets of attributes to
represent their document content.
As users cannot deal efficiently with the number and the hetero-
geneity of available DLs, it is increasingly desirable that they
interact with a single system offering a unified interface for
searching over multiple DLs simultaneously, using one system-
wide target metadata schema which is defined independently of the
libraries.
For instance, let us assume that:
- we have a target schema T with attributes A1, ..., An (also
called global or mediated schema);
- a user query Q over the target metadata schema T is based on
single-valued attributes A1, ..., An, i.e. the query is a set of
attribute-value pairs of the form:
Q = {(A1, v1), ..., (An, vn)}
where each Aj is an attribute of the target schema T and vj
is a type correct value for Aj (see footnote 4);
- we have available distributed DLs (called information re-
sources) R = {S1, ..., Sm} to search in;
- each resource Si has its own metadata schema Ti (the source
schema) based on attributes Ai,1, ..., Ai,ni;
1http://www.acm.org/dl
2http://igm.nlm.nih.gov
3http://www.ncstrl.org
4 Note that a query attribute Aj may appear in the query multiple
times.
- a query Qi over the source schema Ti is a set of attribute-value
pairs of the form:
Qi = {(Ai,1, vi,1), ..., (Ai,ni, vi,ni)}
where each Ai,j is an attribute of the source schema Ti and
vi,j is a type correct value for Ai,j.
In such a context, three major tasks have to be executed in order
to answer the user query Q submitted over the target schema T:
(i) the Automatic Resource Selection task (see, e.g. [4, 15]), to
select a subset of the most promising DLs among all those
available, as it is not cost-effective to submit the query to all pos-
sible resources; (ii) the Schema Matching task (see, e.g. [28]), to
reformulate the information need Q into a new query Qi over the
query language provided by the selected resource(s) using schema
mapping rules; and (iii) the Rank Fusion task (see, e.g. [29]), to merge
the ranked lists returned by each queried resource.
In this paper we deal with the schema matching problem,
i.e. with the problem of automatically learning the schema map-
ping rules for querying heterogeneous distributed DLs.
Schema matching is based on mappings that are usually of the
form At -> As, stating that attribute At of the target schema can
be mapped into the attribute As of the source schema.
For instance, suppose we have the metadata target schema
T = {Author, Subject} and the resource Si with the metadata source
schema Ti = {Creator, Description}. An example of user query over
the target schema may be:

Q = {(Author, "Moshe Vardi"), (Subject, "logic"), (Subject, "computer science")}

whose intended meaning is to "retrieve all documents written by
Moshe Vardi, which talk about logic in computer science". Using
the mappings:

Author -> Creator, Subject -> Description

the above query Q will be rewritten as the query Qi over the source
schema:

Qi = {(Creator, "Moshe Vardi"), (Description, "logic"), (Description, "computer science")}
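Once the mappings are learned, the rewriting itself is a simple substitution over the attribute names. A minimal sketch (function and variable names are ours, not the paper's):

```python
def rewrite_query(query, mappings):
    """Rewrite a query over the target schema into a query over the
    source schema by replacing each target attribute via the mappings."""
    return [(mappings[attr], value) for attr, value in query]

mappings = {"Author": "Creator", "Subject": "Description"}
query = [("Author", "Moshe Vardi"),
         ("Subject", "logic"),
         ("Subject", "computer science")]
rewritten = rewrite_query(query, mappings)
```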
We propose a simple and effective method to automatically learn
schema mappings in such a scenario, which is based on the well
known CORI resource selection method [7]. Indeed, similarly to
the resource selection problem, where we have to automatically
identify the most relevant libraries with respect to a given query, in
the schema matching problem we have to identify, for each target
attribute, the most relevant source attribute with respect to a given
structured query. That is, given (i) an attribute-value pair At = vt,
with At being an attribute of the target schema T, and (ii) a
resource Si with the source schema Ti, we want to identify among
all the attributes Ai,j in Ti the most relevant one to map At to.
While most of the schema matching methods presented in the
literature need an a priori schema learning process, thus requiring
instances of the target schema, a major feature and, to the best of
our knowledge, the novelty of our approach is that the method we
propose can directly be applied to queries over the target schema
without requiring instances of it. This is often the case when we
deal with queries over distributed DLs, as training data is missing.
The structure of the paper is the following: Section 2 presents
an overview of the schema matching problem and its main applica-
tion fields. Section 3 introduces our formal framework for schema
matching applied to query transformation. In Section 4 we report
the evaluation of the proposed approach on different data sets. Sec-
tion 5 concludes.
2. RELATED WORK
Matching is a fundamental task in the manipulation of struc-
tured information: it takes as input two schemas/ontologies,
each consisting of a set of entities (e.g., tables, XML ele-
ments, classes, properties, predicates, metadata), and produces as
output a mapping, i.e. a set of semantic relationships (e.g., equivalence,
subsumption) between elements of the two given schemas [8, 10,
23, 27].
With the proliferation of DLs over the Web, the development of
automated tools for schema matching is of particular importance
to automate distributed information retrieval. Indeed, matching
plays a central role in many well-known application domains, such
as the Semantic Web, schema integration, Electronic Commerce (E-
Commerce), ontology integration, web-oriented data integration,
XML mapping, catalog matching, etc.
Performing schema matching manually, as is usually done
when searching the Web, is time-consuming and expensive. More-
over, the number and the complexity of Web information resources
to integrate grow enormously, making the schema matching
process difficult to handle.
Matching has been addressed by many researchers in two re-
lated areas: the matching problem for ontologies and the matching
problem for database schemas. Ontology Matching differs from
Schema Matching substantially for two reasons ([25]): presence of
explicit semantics and knowledge models. Indeed, while ontolo-
gies are logical systems that themselves incorporate semantics (in-
tuitive or formal), database schemas often do not provide explicit
semantics for their data. Furthermore, ontology data models are
richer than schema data models. The approaches proposed for On-
tology Matching [12, 20, 21, 26] try to exploit knowledge explicitly
encoded in the ontologies, while Schema Matching approaches try
to guess the meaning encoded in the schemas.
However, even considering these differences, these two areas can
be seen as closely related. Indeed, schemas can be considered as,
e.g., simple ontologies with restricted relationship types; on the
other hand, the techniques applied in schema matching can be ap-
plied to ontology matching as well, taking care of the hierarchies.
All the Schema Matching approaches proposed in the litera-
ture [3, 9, 18, 22, 24] involve the definition of mappings among
schemas and, possibly, of a global integrated schema. Most re-
cent approaches, either implicitly or explicitly, perform schema
mapping based on attribute name comparison and/or comparing
properties of the underlying data instances using machine learn-
ing techniques [1, 11]. In particular, applying machine learning
techniques requires instances from both the target schema and the
source schema. In these cases both the target schema and the source
schema are relational database tables. The attribute matching pro-
cess is based on some comparison between the values in the source
table and the target table. An extensive comparison can be found
in [28].
Typical applications for schema matching are Schema Integra-
tion, i.e. the problem of constructing a global view, given a set of
independently developed schemas [30], “message translation” for
E-Commerce portals, and also the Semantic Web [2, 13], where
matching can be used for mapping messages between autonomous
agents.
In this paper we consider another scenario for schema matching:
given a user query written over the target schema T, automatically
learn for each attribute Aj in T appearing in the query the
most promising attribute of the source schema Ti, in order to
submit the query to the resource Si. [16] has shown that
automatically transformed structured queries may significantly improve
the search process over structured DLs.
3. QUERY TRANSFORMATION
In this Section we describe our approach for solving the schema
matching problem in order to automatically learn schema mappings
and transform queries over heterogeneous DLs. It is based on a
reformulation of the CORI resource selection framework [7]. In
the following, in order to make the paper self-contained, we first
introduce the automatic resource selection task and describe how
it is performed using the CORI method; after that we fit the CORI
resource selection method into our context.
Automatic Resource Selection Using CORI
The Automatic Resource Selection task is becoming a crucial step
when searching over a large number of distributed, heterogeneous
resources. Indeed, resource selection improves search efficiency,
reducing the costs of distributed search (search time, network band-
width, computation, and so on), and can also improve search effec-
tiveness, since after selection the resources containing no relevant
documents are ideally not queried, saving time and improving the
relevance of the retrieved results.
Some approaches presented in the literature rely on short re-
source descriptions [17, 31], a sort of content summary that usu-
ally includes the terms that appear in the resource and their doc-
ument frequency; furthermore, it may include other simple infor-
mation, such as the total number of documents (the resource size).
Web resources and their corresponding search servers
are usually non-cooperative, while all the approaches relying on
the resource description require their cooperation in order to obtain
content summary or information about the documents they index,
such as term occurrence statistics. For this reason, query-based
sampling techniques have been proposed for deriving content de-
scription approximations [5, 19]. These approaches use queries
(called probe queries) to retrieve and download a relatively small
set of documents (called the resource sample), representative of
the topic covered by the resource; the resource sample is then used
to build the content summary (the description or approximation of
the resource) by extracting term occurrence statistics. In [6] it has
been demonstrated that resource samples provide statistics quite
representative of the statistics of the full resource. The resource de-
scription is then used to compute the resource score for each infor-
mation resource, i.e. a measure of the relevance of a given resource
to the query. The resource score establishes the relatedness of an
information resource to a given user query.
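The probe-query loop described above can be sketched as follows; `search` stands in for the remote resource's (possibly non-cooperative) search interface, and all names here are our own assumptions rather than an API from the cited work:

```python
import random

def sample_resource(search, seed_terms, max_docs=50, rng=None):
    """Query-based sampling sketch: issue single-term probe queries,
    keep the documents returned, and draw further probe terms from
    the text sampled so far."""
    rng = rng or random.Random(0)
    sample, seen = [], set()
    terms = list(seed_terms)
    while terms and len(sample) < max_docs:
        term = terms.pop(rng.randrange(len(terms)))
        for doc in search(term):
            if doc not in seen:
                seen.add(doc)
                sample.append(doc)
                terms.extend(doc.split())  # candidate probe terms
    return sample

# Toy resource with a naive search function standing in for the
# remote search server.
docs = ["logic in computer science", "model checking logic", "genome data"]
sample = sample_resource(lambda t: [d for d in docs if t in d.split()],
                         seed_terms=["logic"])
```

The sample then yields the term occurrence statistics used to build the content summary.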
CORI is one of the methods used in automatic resource selection
to compute the resource score with respect to the query. In the
following we describe an adapted version of the CORI resource
selection method in case documents are represented via metadata
records.
Consider the user query Q = {(A1, v1), ..., (An, vn)} over
the target schema T. At first, we unfold the complex query Q into
a simple query Q' where the attributes Aj have been removed, i.e.
Q' = {v1, ..., vn}. For each resource Si in R, we compute the
resource score G(Si) (also called goodness), representing the
relevance of resource Si to the query Q, as follows:

G(Si) = ( Sum_{vk in Q'} p(vk | Si) ) / |Q'|    (1)
where |Q'| is the number of values in the simple query Q'. The be-
lief p(vk | Si) is computed for each value vk in Q' using the CORI
algorithm [4, 7]:
p(vk | Si) = T_{i,k} * I_k * w_k    (2)

T_{i,k} = df_{i,k} / (df_{i,k} + 50 + 150 * cw_i / avg_cw)    (3)

I_k = log((|R| + 0.5) / cf_k) / log(|R| + 1.0)    (4)
where:
- w_k is the weight of the term in the query;
- df_{i,k} is the number of records in the approximation of Si
containing the value vk;
- cw_i is the number of values in the approximation of Si;
- avg_cw is the mean value of all the cw_i;
- cf_k is the number of approximated resources containing
the value vk;
- |R| is the number of resources.
In the above formulae, T_{i,k} indicates how many records contain
the term vk in the resource Si, while I_k, defined in terms of the re-
source frequency cf_k, is the inverse resource frequency: the higher
cf_k the smaller I_k, i.e., the more a term occurs among the resources
the less discriminating it is.
The belief p(vk | Si) combines these two measures. Informally, a
resource is more relevant if its approximation, computed by query-
based sampling, contains many terms related to the original query;
but if a query term occurs in many resources, that term is not a good
one to discriminate between relevant and non-relevant resources.
Finally, all the information resources Si in R are ranked ac-
cording to their resource goodness value G(Si), and the top-ranked
ones are selected as the most relevant.
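Equations (1)-(4) can be sketched directly in code. This is a minimal illustration under our own assumptions: the constants 50 and 150 are the usual CORI defaults, a resource approximation is modeled as a list of sampled records (sets of values), and the query-term weight w_k is fixed to 1:

```python
import math

def belief(value, resource, resources, avg_cw, w=1.0):
    """p(v_k | S_i), eqs. (2)-(4)."""
    n = len(resources)
    df = sum(value in rec for rec in resource)                   # df_{i,k}
    cw = sum(len(rec) for rec in resource)                       # cw_i
    cf = sum(any(value in rec for rec in r) for r in resources)  # cf_k
    t = df / (df + 50 + 150 * cw / avg_cw)                       # eq. (3)
    i = math.log((n + 0.5) / max(cf, 1)) / math.log(n + 1.0)     # eq. (4)
    return t * i * w                                             # eq. (2)

def goodness(simple_query, resource, resources):
    """G(S_i): average belief over the unfolded query's values, eq. (1)."""
    avg_cw = sum(sum(len(r) for r in res) for res in resources) / len(resources)
    return sum(belief(v, resource, resources, avg_cw)
               for v in simple_query) / len(simple_query)

# Two toy resource approximations (lists of sampled records).
s1 = [{"logic", "vardi"}, {"logic", "automata"}]
s2 = [{"medicine", "genome"}, {"protein"}]
scores = [goodness(["logic"], s, [s1, s2]) for s in (s1, s2)]
```

As expected, the resource whose sample contains the query value scores higher; the `max(cf, 1)` guard only protects against a value that appears in no approximation at all.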
Schema Matching Using CORI
Let Q = {(A1, v1), ..., (An, vn)} be a user query over the meta-
data target schema T and Sk in R a relevant selected resource,
with metadata source schema Tk. Our task is to find out how to map
each attribute-value pair (Aj, vj) in Q into one or more attribute-
value pairs (Ak,l, vj), where Ak,l is in Tk, using CORI.
The basic and simple idea is the following. Consider a query
attribute-value pair (Aj, vj) in Q and let Sk in R be a selected
resource to search into. Let Ak,1, ..., Ak,nk in Tk be all the
attributes of the metadata schema of Sk. Our idea is to map the
attribute Aj to the attribute Ak,l if the value vj is relevant to the
collection of all the values the attribute Ak,l takes in the resource
Sk. Essentially, for each of the attributes Ak,l in Tk, we make a
collection Ck,l of the values the attribute Ak,l takes in the resource
Sk and then ask CORI which of these collections is most relevant
to the query value vj. If Ck,l is among the answers of CORI, then
we build the mapping Aj -> Ak,l.
More formally, consider the resource Sk and the records
r1, ..., rm of the approximation of Sk, Approx(Sk) (computed
by query-based sampling). Each record rh in Approx(Sk) is a set
of attribute-value pairs rh = {(Ak,1, vk,1), ..., (Ak,nk, vk,nk)}.
From Approx(Sk), we make a projection on each attribute,
i.e. we build a new set of records for each attribute Ak,l of the
schema:

Ck,l = Union_{rh in Approx(Sk)} { vk,l : (Ak,l, vk,l) in rh }

The basic idea is that each projection Ck,1, ..., Ck,nk can be seen
as a new collection, and we apply CORI to select which of these
is the most relevant for each attribute-value pair (Aj, vj) of the
query Q.
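Building the projections Ck,l from the sampled records amounts to a group-by on attribute names. A sketch of ours, with each record modeled as a list of (attribute, value) pairs:

```python
from collections import defaultdict

def projections(approx):
    """Build one collection C_{k,l} per source attribute by projecting
    the sampled records of Approx(S_k) on each attribute name."""
    proj = defaultdict(list)
    for record in approx:            # record: iterable of (attr, value)
        for attr, value in record:
            proj[attr].append(value)
    return dict(proj)

approx = [[("Creator", "M. Vardi"), ("Description", "logic")],
          [("Creator", "E. Clarke"), ("Description", "model checking")]]
proj = projections(approx)
```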
By using each projection Ck,l as a resource, we can apply the
resource selection framework for attribute matching: in order to
find out whether to match a target attribute-value pair (Aj, vj) in Q
into a source attribute-value pair (Ak,l, vj), we verify whether
the resource Ck,l has been selected among the top-ranked resources
relevant to the query Q' = {(Aj, vj)}. That is, we build the
query Q' = {(Aj, vj)} and then compute all the goodnesses
G(Ck,1), ..., G(Ck,nk). If G(Ck,l) is the top score,
then we map (Aj, vj) into the attribute-value pair (Ak,l, vj).
Once we apply the procedure to all (Aj, vj) in Q, a complex query
Qk over the source schema Tk of the selected resource Sk in R
is obtained and can be submitted to the resource Sk.
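The matching step above can be sketched end to end. For brevity this toy version scores each projection with a simple match-frequency ratio instead of the full CORI belief, and all names are illustrative assumptions of ours:

```python
from collections import defaultdict

def projections(approx):
    # One collection of values per source attribute.
    proj = defaultdict(list)
    for record in approx:
        for attr, value in record:
            proj[attr].append(value)
    return dict(proj)

def score(value, collection):
    # Stand-in for the CORI goodness: fraction of sampled values
    # matching the query value.
    return sum(v == value for v in collection) / len(collection)

def learn_mappings(query, approx):
    """Map each target attribute to the source attribute whose
    projection scores highest for the query value."""
    proj = projections(approx)
    return {attr: max(proj, key=lambda a: score(value, proj[a]))
            for attr, value in query}

approx = [[("Creator", "Moshe Vardi"), ("Description", "logic")],
          [("Creator", "Moshe Vardi"), ("Description", "databases")]]
query = [("Author", "Moshe Vardi"), ("Subject", "logic")]
mappings = learn_mappings(query, approx)
rewritten = [(mappings[a], v) for a, v in query]
```

The rewritten query can then be submitted to the selected resource.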
4. EXPERIMENTS
In this section we describe the data (documents and queries), the
evaluation measures, and the experimental setup used to evaluate
our approach.
EXPERIMENTAL SETUP. The schema mapping task involves a
target schema, and one source schema with its corresponding re-
source approximation. The experiments were performed on three
different data sets:
- the OAI-DC collection5, which is an Open Archive Initiative
collection of scientific documents in XML format;
- a set of computer science documents in the XML
OAI-RFC format