This document is intended to explain further the objectives and
technical aspects of the QIS. For more questions related to this documentation email: Luis Marenco,
Prakash Nadkarni or
Rixin Wang.
Query Integrator System (QIS) is a ongoing database mediator
framework designed to provide data integration from databases in Bioscience that
continuously evolve. This application framework is being build primarily to
provide data integration among distributed databases in
Neuroscience.
Motivation: This system spawned from primary needs to integrate
information from EAV/CR databases such as
SenseLab, with other common
data source types (Relational, XML and flat files, custom databases) used in
Neuroscience.
The primary goal of QIS is to explore database integration mechanisms that
will take advantage of the current resources and functionality in an incremental
fashion providing users with collection of federated data scattered in multiple
data sources.
Following in Road Map, will describe the incremental QIS functionality
provided while is built.
The current implementation of QIS uses physical databases and functional
structural queries to deliver data from multiple distributed data sources to
requesting applications. Current developments will leverage ontologies that will
be mapped from physical databases to create a semantic mediated queries. At long
range, textual queries engines could use the knowledge in the ontology to
facilitate textual interpretation of free text queries. See figure below

|
At the outer layer the three main units involved in communication: Users, Data
and Knowledge are joined at the inner layer by a series of servers that conform
the QIS. QIS servers communicate using the Internet (blue solid lines); and DSS
use institutional Intranets (red dotted lines) to connect to specific databases.
Each of the QIS servers is intimately related with the type of
functionality provided by each communication unit.
At the Data end, multiple heterogeneous databases (Relational, EAV/CR,
XML, Text, and other) hosted at specific institution connect to a specific Data
Source Server (DSS) serving as a gateway to system.
Users can be human or automated agents (i.e.: A web server or a desktop
application requesting data from QIS) that connect to the system by means of a
Integrator Server (IS). This server processes queries that target specific data
sources.
Knowledge as described in standardized ontologies (i.e: UMLS) is
correlated to a QIS Ontology Server (OS) where subsets of this ontologies are
merged with non yet associated or newly discovered terms found in federated
databases.
Metadata and conceptual data elements for multiple data sources are mapped to OS
concepts to mediate semantic based queries.
Structured Queries: QIS was initially built to allow data integration
using structured queries based on physical descriptions of each database.
QIS mediation is currently using this approach but it requires knowledge of each database
structure to build queries.
Semantic Queries: During the process of building an ontology, the user
can identify elements of interest on it, composing appropriate queries in an
automated or assisted fashion. The specifications of such queries can be saved
for reuse, so that even if there is currently very little data of interest to a
specific query within the federation, the same query may return more results
when re-run in future, as the contents of the federated databases expand.
|
The basic QIS system is composed by three loosely
coupled units: Data Source Server (DSS), Integrator Server (IS) and Ontological Server (OS).
Inter server communication is XML encoded and HTTP transported to avoid network
firewall limitations. For more detailed information check the QIS
components page.
Each of the QIS units is based MS Platform and require the
following applications: Windows 2000, Internet Information Server, SQL Server or
Access database and VB6 runtimes. Specific requirements for each of the QIS
units are explained in the QIS
components page.
The system is in the process of migration to the MS.NET
framework to take advantage of its features and to facilitate the creation of
future Java versions.
Like most mediator systems, QIS uses its own query language to
query disparate types of data sources. QIS query description is directly derived
from SQL-like languages but represented in XML to facilitate legibility, syntax
validation and future feature extensibility. In essence, the query is decomposed
(“pre-parsed”) into its constituent elements, which are represented in terms of
metadata-repository “unique identifiers”. Further, for atomic/column elements in
a query, the IS records, in stored form, whether the element is part of the
output (i.e., one of the fields to be displayed), whether it is used in the
equivalent of a “join” to bridge between two tables, and whether it is part of a
query criterion/filter.
Next we give a brief introduction of the QIS query structure by
analyzing a QIS query expressed in XML, for this example
we use the "getReceptorGeneChromosomeProtein _structure" query from the Membrane
property resource database, with explanations of its constituents:
QIS_query
info
query name= 'getReceptorGeneChromosomeProtein
_structure' description= 'Retrieves genes,
chromosome location and protein structure of membrane receptors' owner= 'anonymous' database= 'Membrane
Properties Resource' serverdsID= 'd53'
from
set id= 'g1' gId= 'Receptor_properties' name= 'Receptor
Properties' alias= '' version= '1'
select
atom id= 'c1' aId= 'g1.Subtype' name= 'Subtype'
atom id= 'c2' aId= 'g1.Gene_chromosome' name= 'Gene
chromosome'
atom id= 'c3' aId= 'g1.structure' name= 'structure'
conditions
cond id= 'n1' aId= 'g1.Subtype' operator= 'LIKE' value= '?'
cond id= 'n2' aId= 'g1.Subtype' operator= 'LIKE' value= '?'
expression value= '(n1 or n2)'
join value='<<join clause>>'
combine
value= '<<combine clause>>'
The previous query was automatically generated by the IS query
design tool (see snapshots: 1 and 2). In general, you should be using the
query design tool to compose your query because it makes the choices
applicable to a particular field as pull-down lists, minimizing typing (and
typographical errors that would cause the query to fail). Once the query has
been determined to give the correct results, you can then save its XML for
future use. (The above text is a "friendly" equivalent of the XML. without the
XML tags., for the purposes of easier explanation)
The explanation below assumes that you are familiar with the
principles of SQL and Boolean searching: if you are not, please study a book
on SQL (such as C.J. Date's excellent "Introduction
to Database Systems"). Jim
Melton's
recent book on SQL-99 is also helpful.
First, the "QIS_query" node encloses the query information
distributed in the following unique nodes:
-
info: Query information, name, description, author,
(version and date: not shown here)
-
from: specifies the sets which will be used for
the query. Sets are the equivalents of tables in an RDBMS; we use the term
"sets" because you may be accessing an XML-based data set instead of tables.
-
When you pick a set, the query designer automatically generates
an alias (id) for the set. In the above example, the alias for the set "Receptor_properties"
is "g1". In the SQL that accesses relational tables, Aliases are often necessary
because sometimes you can use the same set more than once in the same query (in
relational lingo, you perform what is called a "self-join") and you must
distinguish between the two instances of this set. Also, where the table/set has
a long name, aliases are short-forms that reduce typing.
-
The set has an internal identifier (gId) and a name (caption).
The two are almost identical, except that in the caption, underscores are
typically replaced by human-friendly spaces; the caption may also be more (or
less) verbose than the internal name.
-
select: contains one or more "atom" nodes (the
equivalents of columns in an
RDBMS, which obviously must be qualified by the prefix indicating the set) whose
values will be requested.
Each "atom" node contins the
following XML attributes:
-
"id" an alias for the
atom/column, by which the column is referred to elsewhere;
-
"aId" for atom Id (the unique identifier of the atom within the
data source)
-
"name" is the printed name (caption) for the column
. The role of aID and name are similar to gId and name for the set, as described
above.
-
conditions: Uses one or more "cond" nodes to filter
results based on conditions of any atom element. A condition consists of the
atom/column that serves as a constraint, the relational operator used
in the constraint, and the value to which the atom is constrained.
In the above example, there are two conditions: the Subtype column will
be constrained based on two values specified by the user or application. (this
is indicated by the placeholder "?", indicating that this is a parameter
to the query). For example, at runtime, the user may wish to search for alpha-1,
alpha-2, or beta-1 receptor subtypes, and may enter the pattern "alpha%" for the
first condition, and "beta-1" for the second. If you know your SQL, the operator
for string pattern matching is "LIKE".
If you specify multiple conds one after the other, the individual conds,
by default, are combined using the AND Boolean operators. If you wish to combine
them in any other way (e.g., using OR or NOT operators) then the expression
clause (below) is used to build complex Boolean expressions.
-
expression: Optional. Used to build complex Boolean
conditions when you want something other than the default AND behavior. In the
above case, we wish to combine the two conditions to return a broader result,
using the OR operator.
-
join: Is used for joining two sets. While the example
here does not show joins, the principle is similar to that in SQL. A join
essentially defines a pair of atoms/columns in two sets and a join operator that
combines them.
The join operator is one of "INNER", "LEFT_OUTER", "RIGHT_OUTER", "FULL_OUTER".
Refer to a SQL text book for explanations of INNER and OUTER joins. The INNER
join is the default. an example: if we had two columns g1.c1 and g2.c4, and we
wish to perform an inner join, we would use the clause:
operator="INNER" attr1='g1.c1'
attr2='g2.c4'.
-
combine: this lets you combine the results of two or more
sets, where the atoms/columns in these sets are logically identical in type and
number (in the sequential order in which they have been defined in the query).
The operators are "UNION" (set union), "INTERSECT" (set and "MINUS" (set
difference). For example, to return the union of all the rows returned from set
g1 and g2, we would use the clause :
operator="UNION" set1 = "g1" set2="g2".
Yale Center for Medical Informatics. 2004