KWTR: Mapping / Translation / Matching / Aligning (Heterogeneity)

From semanticweb.org

Mapping / Translation / Matching / Aligning (Heterogeneity)

Trends in theories and methods: short term (0-3 years)

Heterogeneity is typically reduced in two steps (for recent surveys on the topic, see [Shvaiko and Euzenat, 2005; Noy, 2004; Doan, Halevy, 2005; Kalfoglou and Schorlemmer, 2003; Rahm and Bernstein, 2001]): (i) match two ontologies, thereby determining the alignment (mappings), and (ii) process the alignment according to the needs of an application (e.g., query answering, web service integration). The number and variety of solutions to the matching problem keep growing at a fast pace. In particular, Figure 4 shows (approximately) how many works devoted to diverse aspects of matching have been published at various conferences around the world in recent years.

Figure 4. Dynamics of publications devoted to matching (source: http://www.ontologymatching.org)

In the future, we expect continuing growth in the number of works on matching, due to the constantly increasing interest of both academia and industry in intelligent solutions to the semantic heterogeneity problem.

Disregarding the timelines, there are some general trends to be mentioned:

  • gradual and incremental improvement of existing approaches;
  • emergence of new approaches obtained by modifying existing ones (usually by groups other than the authors of the original approaches);
  • emergence of completely new approaches.

In the rest of this section, we first discuss matching approaches and then their evaluation. Matching approaches, in turn, are analyzed along the input, process, and output dimensions.

Algorithms can be analyzed from several perspectives. First, consider the data/conceptual models in which ontologies are expressed. For example, the Artemis system [Castano, Antonellis, De Capitani di Vimercati, 2001] supports the relational, OO, and ER models; Cupid [Madhavan, Bernstein, Rahm, 2001] supports XML and relational models; QOM [Ehrig and Staab, 2004] supports RDF and OWL models. Second, approaches differ in the kind of input data they exploit: some rely only on schema-level information (e.g., Cupid [Madhavan, Bernstein, Rahm, 2001], COMA [Do and Rahm, 2001]), others rely only on instance data (e.g., GLUE [Doan, Madhavan, Domingos, and Halevy, 2003]), and still others exploit both schema- and instance-level information (e.g., QOM [Ehrig and Staab, 2004]). Even for the same data model, matching systems do not always use all available constructs. For example, when dealing with attributes, S-Match [Giunchiglia, Shvaiko, Yatskevich, 2005] discards information about datatypes (e.g., string, integer) and uses only the attribute names.
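To make the distinction between schema-level and instance-level information concrete, the following minimal Python sketch implements one basic matcher of each kind. It is an illustration only and does not reproduce Cupid, COMA, GLUE, or QOM: the function names, the toy schemas, and the choice of similarity measures (difflib string similarity for names, Jaccard overlap for instance values) are assumptions made for this example.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Schema-level matcher: string similarity of two attribute names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def instance_similarity(values_a: set, values_b: set) -> float:
    """Instance-level matcher: Jaccard overlap of the observed values, in [0, 1]."""
    if not values_a and not values_b:
        return 0.0
    return len(values_a & values_b) / len(values_a | values_b)


# Hypothetical toy schemas: attribute name -> set of instance values.
source = {"author": {"Eco", "Calvino", "Levi"}, "title": {"Baudolino"}}
target = {"writer": {"Eco", "Levi", "Moravia"}, "book_title": {"Baudolino"}}

# Print both scores for every candidate pair of attributes.
for s_name, s_values in source.items():
    for t_name, t_values in target.items():
        print(s_name, t_name,
              round(name_similarity(s_name, t_name), 2),
              round(instance_similarity(s_values, t_values), 2))
```

In this toy input the pair author/writer has dissimilar names but overlapping values, which is the kind of case where instance data can compensate for what a purely name-based matcher misses.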

Some trends are:

  • Most approaches tend to become more and more generic, i.e., to handle multiple input data/conceptual models.
  • New types of input, such as plain text and query interfaces from the Deep Web, should come into widespread practical use.
  • Approaches will try to handle suitably more and more of the constructs available in the input (e.g., constraints).
  • Finally, new internal representations of the input data should appear, e.g., descriptors of the entries for a learning algorithm.

Considering the general properties of the matching process, and in particular the approximate or exact nature of its computation, another distinction can be made. It is based on the components of the matching process and their organization, namely it distinguishes between basic (elementary) matchers and matching strategies, i.e., how the elementary matchers are combined.

Below, we discuss the expected trends first in basic matchers, then in matching strategies, and finally in matching approaches in general. The expected short term trends are:

New types of basic automatic matchers addressing a larger variety of situations, including more sophisticated ones, than the current state of the art covers. Some possibly emerging examples are:

  • Methods for matching glosses (comments) against entities;
  • Methods for matching processes;
  • Methods for alignment reuse (e.g., by reasoning with the given mappings to deduce new mappings, verify whether the mappings are still correct, and repair them if necessary);
  • Methods exploiting various (new) external resources, e.g., upper level ontologies, such as DOLCE [Gangemi, Guarino, Masolo, Oltramari, 2003], and domain specific corpora [Madhavan, Bernstein, Doan, Halevy, 2005];
  • Approximate (e.g., semantic-based) methods.

New libraries of matchers (or extensions of the existing libraries), which group together the basic automatic matchers based on their common characteristics, e.g., name-based matchers.

New approaches to automate the combination of individual matchers and libraries of matchers. Some existing solutions here can be found in [Doan, Domingos, Halevy, 2001], [Ehrig and Sure, 2004]. Some possibly emerging examples are:

  • Methods for learning the optimal weight assignments, given a set of basic matchers;
  • Combining different techniques (e.g., collaborative filtering, genetic algorithms, statistics) for the optimal/near optimal weight assignments.
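As a rough illustration of learning weight assignments for a given set of basic matchers, the sketch below combines matcher scores by a weighted average and picks the weights that maximize F1 against a small reference alignment. The exhaustive grid search is a deliberately simple stand-in for the learning techniques listed above, and the function names, the threshold, and the toy scores are all assumptions made for this example.

```python
from itertools import product


def combine(scores, weights):
    """Weighted average of basic-matcher scores (sequences of equal length)."""
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total if total else 0.0


def f1(predicted, reference):
    """Harmonic mean of precision and recall for two sets of entity pairs."""
    hits = len(predicted & reference)
    if hits == 0:
        return 0.0
    p, r = hits / len(predicted), hits / len(reference)
    return 2 * p * r / (p + r)


def learn_weights(candidates, reference, threshold=0.5, steps=5):
    """Grid search for the weight vector with the best F1 on the reference.

    candidates: {(source, target): (score_matcher_1, score_matcher_2, ...)}
    Returns (best_f1, best_weights); ties keep the first vector found.
    """
    n = len(next(iter(candidates.values())))
    grid = [i / (steps - 1) for i in range(steps)]
    best = (-1.0, None)
    for weights in product(grid, repeat=n):
        predicted = {pair for pair, scores in candidates.items()
                     if combine(scores, weights) >= threshold}
        score = f1(predicted, reference)
        if score > best[0]:
            best = (score, weights)
    return best


# Hypothetical scores from two basic matchers: (name-based, instance-based).
candidates = {
    ("author", "writer"): (0.3, 0.9),
    ("title", "book_title"): (0.7, 0.6),
    ("author", "book_title"): (0.6, 0.0),
    ("title", "writer"): (0.5, 0.1),
}
reference = {("author", "writer"), ("title", "book_title")}

best_score, best_weights = learn_weights(candidates, reference)
print(best_score, best_weights)
```

A grid search scales exponentially with the number of matchers, so it is only practical for a handful of them; it is used here because it keeps the objective (quality on a reference alignment) easy to see.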

New general matching solutions or default combinations of basic matchers which prove themselves equally good for most of the tasks.

New approaches to tune matching solutions automatically (e.g., thresholds, weights, coefficients, which basic matchers to use). An existing example is [Sayyadian, Lee, Doan, Rosenthal, 2005].

Various application specific approaches, which are particularly tailored to the input/output characteristics.

New matching approaches investigating the quality vs. efficiency trade-off.

New ways of viewing/resolving the matching problem by reducing it to another, already known problem. Some existing examples of these translations are graph matching [Melnik, Garcia-Molina, Rahm, 2002; Euzenat and Valtchev, 2004], propositional validity [Bouquet, Serafini, Zanobini, 2003; Giunchiglia, Shvaiko, 2003], and probabilistic inference [Pan, Ding, Yu, Peng, 2005; Mitra, Noy, Jaiswal, 2005].

With respect to graded answers, the relationship between entities can be expressed through: (i) a confidence measure for each correspondence, usually in the [0,1] range (see, for example, [Euzenat and Valtchev, 2003; Madhavan, Bernstein, Rahm, 2001]); and (ii) the kind of relation holding between the entities. Most systems focus on equivalence, while a few others are able to provide a more expressive result (e.g., equivalence, subsumption, incompatibility); for details, see [Bouquet, Serafini, Zanobini, 2003; Giunchiglia, Shvaiko, Yatskevich, 2004]. We expect the following short term trends:

  • Translations between alignments specified with the help of coefficients in the [0,1] range and logical relations;
  • Expressiveness of alignment (atomic vs. complex);
  • Language(s) for alignment;
  • Formal semantics of alignment;
  • Alignment format;
  • Scalability of alignment;
  • Framework(s) for characterizing the alignment;
  • Application specific alignment.
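The two ways of expressing a matching result discussed above (a confidence in the [0,1] range and a kind of relation) can be carried by a single record. The sketch below is one possible representation, not a standard alignment format: the class and field names, the relation symbols, and the threshold-based translation from graded to crisp correspondences are assumptions made for this example.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Relation(Enum):
    EQUIVALENCE = "="
    SUBSUMED_BY = "<"
    SUBSUMES = ">"
    INCOMPATIBLE = "!"


@dataclass(frozen=True)
class Correspondence:
    source: str
    target: str
    relation: Relation = Relation.EQUIVALENCE
    confidence: float = 1.0

    def __post_init__(self):
        # Reject confidence values outside the [0, 1] range.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")


def to_crisp(alignment, threshold=0.8):
    """Naive translation from graded to crisp correspondences: keep those at
    or above the threshold with confidence 1.0 and drop the rest."""
    return [replace(c, confidence=1.0) for c in alignment
            if c.confidence >= threshold]


# Hypothetical alignment between two small ontologies.
alignment = [
    Correspondence("Author", "Writer", Relation.EQUIVALENCE, 0.92),
    Correspondence("Novel", "Book", Relation.SUBSUMED_BY, 0.85),
    Correspondence("Title", "Price", Relation.EQUIVALENCE, 0.30),
]
crisp = to_crisp(alignment)
for c in crisp:
    print(c.source, c.relation.value, c.target)
```

Thresholding discards the grading information, so it is only the simplest possible translation; a principled mapping between coefficients and logical relations is exactly what the first item in the list above leaves open.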

Finally, we expect the following trends in evaluation of matching approaches in the short term:

  • Continuous (at least annual) ontology matching contests (note 4);
  • Improvements of the ontology matching evaluation methodology;
  • New dataset construction methodologies:
    • New large real-world datasets;
    • New systematic (artificial) tests, e.g., robustness to noise in the data.
  • New quality measures:
    • Combinations of precision and recall;
    • Application specific measures.
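One widely used combination of precision and recall is the F-measure, their weighted harmonic mean. The sketch below computes all three for an alignment represented as a set of entity pairs; the function names and the toy data are assumptions made for this example.

```python
def precision_recall(found, reference):
    """Precision and recall of a found alignment against a reference one.

    Both arguments are sets of (source_entity, target_entity) pairs.
    """
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean; beta > 1 favours recall, beta < 1 precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical data: 6 reference pairs; the matcher found 4, of which 3 are correct.
reference = {(f"s{i}", f"t{i}") for i in range(6)}
found = {("s0", "t0"), ("s1", "t1"), ("s2", "t2"), ("s3", "t9")}

p, r = precision_recall(found, reference)
print(p, r, f_measure(p, r))
```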

Trends in theories and methods: medium term (3-6 years)

Regarding matching approaches, standard(s) for the internal representations of the input data/conceptual models are required. Also, concerning process dimensions within industrial contexts, new methods should tackle the following topics:

  • Knowledge incompleteness. Recent industrial-strength evaluations of matching systems (see, e.g., [Avesani, Giunchiglia, Yatskevich, 2005; Euzenat, Stuckenschmidt, Yatskevich, 2005]) show that lack of background knowledge, most often domain specific knowledge, is one of the key problems of matching systems. In fact, on tasks that involve matching thousands of entities, most state of the art systems achieve lower recall (~30%) than on toy examples, where recall was most often around 80-90%. Thus, we expect the emergence of frameworks that address the knowledge incompleteness problem, ultimately in a fully automated way.
  • Performance. As the above-mentioned industrial-strength evaluations show, besides the effectiveness of the results, there is an issue of performance. In fact, some applications require at least a weak form of real time performance (to avoid having a user wait too long for the system to respond). Execution time indicates the scalability of matchers and their potential to become industrial-strength systems. Also, in the above-mentioned evaluations, some systems ran out of memory on some test cases although they were fast on small and medium test cases, which suggests that their speed was achieved by using a large amount of main memory. Therefore, main memory usage should also be taken into account. We expect significant improvements in the performance characteristics of matching approaches.
  • Interactive approaches (semi-automatic matching). As noted above, automatic ontology matching usually cannot be performed with high quality, especially on huge datasets. We believe that semi-automatic matching is a plausible way to improve the effectiveness of the results. There are tasks at which machines are good, and others at which human users are good. An important point here is to involve the user only when his/her input is maximally useful.
  • Explanations and transparency. Mappings produced by matching systems may not be intuitively obvious to human users, and therefore they need to be explained (see [Shvaiko, Giunchiglia, Pinheiro da Silva, McGuinness, 2005; Dhamankar, Lee, Doan, Halevy, Domingos, 2004]). In fact, if Semantic Web users are going to trust the fact that two terms may have the same meaning, then they need to understand the reasons leading a matching system to produce such a result. Explanations are also useful in semi-automatic matching, especially when matching (large) applications with thousands of entities (e.g., business catalogues, such as UNSPSC and eCl@ss). In these cases automatic matching solutions will find a number of plausible mappings, hence some human effort for performing the rationalization of the mapping suggestions is inevitable. Generally, the key issue here is to represent explanations in a simple and clear way to the user.
  • Social aspects. The impact of social networks, web communities and direct involvement of humans (in a distributed fashion) on ontology matching has to be analyzed and distilled. Let us consider one example. Eventually, once an alignment has been determined, it can be saved, and further reused just like any other data on the Web. Thus, on the one hand, a (large) repository of mappings has the potential to increase the effectiveness of matching systems by providing yet another source of domain specific knowledge. On the other hand, users can publish different and even contradicting alignments. Hence, one of the open problems here is how to manage the contradictory mappings in the repositories.
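The performance item above argues that both execution time and main memory usage should be reported for a matcher. A minimal way to record the two together in Python is sketched below, using the standard library's time.perf_counter and tracemalloc. The profile helper and the stand-in matcher are hypothetical, and tracemalloc only sees memory allocated through the Python allocator, so the figure is an approximation.

```python
import time
import tracemalloc


def profile(matcher, *args):
    """Run matcher(*args) and return (result, seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = matcher(*args)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def exact_name_matcher(source, target):
    """Hypothetical stand-in matcher: quadratic scan for equal names."""
    return [(s, t) for s in source for t in target if s.lower() == t.lower()]


# Toy input: 300 source names and 300 target names, half of which match.
source = [f"concept_{i}" for i in range(300)]
target = [f"Concept_{i}" for i in range(0, 600, 2)]

pairs, seconds, peak_bytes = profile(exact_name_matcher, source, target)
print(f"{len(pairs)} pairs, {seconds:.4f} s, peak {peak_bytes / 1024:.1f} KiB")
```

Running the same helper over inputs of increasing size gives a first impression of how time and memory scale, which is the property the evaluations cited above found lacking in some systems.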

In addition, research on the output dimension should address annotations of the alignment (codifying social aspects) and standard(s) for expressing the alignment.
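As a small illustration of the repository problem raised under social aspects, the sketch below annotates each published mapping with its publisher and reports the entity pairs for which different publishers assert different relations. The tuple layout, the relation symbols, and the publisher names are assumptions made for this example; deciding which of the conflicting mappings to trust is the open problem and is not addressed here.

```python
from collections import defaultdict


def find_conflicts(published):
    """Return the entity pairs for which publishers assert different relations.

    published: iterable of (publisher, source, target, relation) tuples.
    """
    claims = defaultdict(dict)
    for publisher, source, target, relation in published:
        claims[(source, target)][publisher] = relation
    return {pair: by_publisher for pair, by_publisher in claims.items()
            if len(set(by_publisher.values())) > 1}


# Hypothetical repository content: "=" is equivalence, "!" is incompatibility.
repository = [
    ("alice", "Car", "Automobile", "="),
    ("bob", "Car", "Automobile", "="),
    ("alice", "Van", "Truck", "="),
    ("carol", "Van", "Truck", "!"),
]

conflicts = find_conflicts(repository)
for pair, by_publisher in conflicts.items():
    print(pair, by_publisher)
```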

Regarding the evaluation of matching approaches in the medium term, we expect the following trends:

Extensive experiments across different domains with multiple test cases from each domain:

  • New hard and large real-world datasets.

More accurate evaluation measures:

  • User-related measures.

Automating acquisition of expert mappings, especially for large applications.

Trends in theories and methods: long term (6-12 years)

In the long term we expect the appearance of multilingual matching approaches, i.e., those matching across multiple languages, such as English, Italian, and French. Also, substantial progress in the field in general should have been made by that time, which in turn should cause some paradigm shifts. Thus, new visions and requirements of matching should appear. With respect to multilingual matching approaches, we expect the following long term trends in evaluation:

  • Evaluation methodology for multilingual matching approaches;
  • Multilingual datasets;
  • Quality measures for multilingual matching approaches.

Trends in tools: short term (0-3 years)

Below, we discuss the future trends in tools, distinguishing between (relevant) commercially available ones and research prototypes.

Most of the commercially available matching tools focus on visualization of the input ontologies (expressed in, e.g., XML, database, or flat file formats) and of the correspondences between them. It is also possible to specify, over the correspondences, some data transformation operations (e.g., by means of functoids), such as adding, multiplying, and dividing the values of fields in the source document and storing the result in a field in the target document. However, the matching operation itself is not automated at all: all the correspondences have to be specified manually. Some examples of these tools are Altova MapForce (note 8), BizTalk Schema Mapper (note 9), Cape Clear XSLT Mapper (note 10), and Stylus Studio XSLT Mapper (note 11). In the short term we expect an increase in the number of such tools.

In contrast to the commercial tools, research matching prototypes focus on automating the correspondence discovery operation and related themes. In general, the majority of the research tools focus on only one of the steps of reducing heterogeneity, namely on matching ontologies; fewer focus on processing the alignments; and only some of them can be called infrastructures, since they consider match as one operation among others. Since the quality of match in general still has to be improved, there is an effort on the design and development of a matching testbed environment [Euzenat, 2004]. It is too early to speak about software quality in research tools. However, some positive trends are worth mentioning, such as the modularity and extensibility of the architectures of most research prototypes. We expect gradual and incremental improvements along the lines mentioned above in the short term.

Trends in tools: medium term (3-6 years)

We expect the following trends in the medium term:

  • Scalability of visualization of the alignment between input ontologies;
  • User interfaces;
  • Configuration/customizing technology;
  • Industrial-strength research prototypes, including tools for matching ontologies, processing the alignment, and infrastructures.

Trends in tools: long term (6-12 years)

In the long term, we expect the emergence of good quality matching tools, in the sense of both system characteristics (e.g., complexity, design features, performance, quality) and process characteristics (e.g., maintenance). Finally, it is worth noting that, for example, engineers of information integration systems would rather use existing matching systems than build their own. However, it is quite difficult to connect state of the art matching systems to other systems or to embed them into new environments. They are usually packaged as stand-alone systems designed for communication with a human user. In addition, they are not provided with an interface described in terms of abstract data types and logical functionality. In the long term, we expect substantial progress on frameworks for integrating different matching systems into new environments.

Trends in services and applications: short term (0-3 years)

Matching is an important operation in traditional applications, such as schema integration, data warehousing, enterprise information integration (EII), and so on. Some examples of commercially available tools, e.g., for EII, are IBM Information Integrator, Liquid Data for WebLogic from BEA Systems, SAP NetWeaver, and the EII platform from Denodo Technologies. However, it is worth mentioning that, even in these tools, support for handling the semantic heterogeneity problem is still in its early stages.

Let us describe a concrete example of a traditional application, namely catalogue integration. In B2B applications, trade partners store their products in electronic catalogues. Catalogues are tree-like structures, namely concept hierarchies with properties. Typical examples of catalogues are the product directories of http://www.amazon.com, http://www.ebay.com, etc. In order for a private company to participate in a marketplace (e.g., eBay), it needs to determine correspondences between entries of its catalogues and entries of the single catalogue of the marketplace. This process of mapping entries among catalogues is referred to as the catalogue matching problem, see [Bouquet, Serafini, Zanobini, 2003]. Once the correspondences between the entries of the catalogues have been identified, they are further analyzed in order to generate query expressions that automatically translate data instances between the catalogues (see, for example, [Velegrakis, Miller, Mylopoulos, 2005]). Finally, once the catalogues are aligned, users of the marketplace have unified access to the products which are on sale.

We expect the above-mentioned applications to play as crucial a role in the short term as in the medium and long term. For example, according to Aberdeen Group, the EII market will grow by 60% annually with around $250M in revenue in 2005 (note 13). Note that below we discuss only new applications, in addition to those already mentioned.

Trends in services and applications: medium term (3-6 years)

There is an emerging line of applications which can be characterized by their dynamics (e.g., agents, peer-to-peer systems, web services). Such applications, in contrast to traditional ones, require a run-time matching operation and take advantage of more "explicit" conceptual models. Let us discuss some of them.

P2P Databases. P2P networks are characterized by extreme flexibility and dynamics. Peers may appear and disappear on the network, and their databases are autonomous in their language, their contents, how they can change their schemas, and so on. Since peers are autonomous, they might use different terminology, even if they refer to the same domain of interest. Thus, in order to establish (meaningful) information exchange between peers, one of the steps is to identify and characterize the relationships between their schemas. The next step is to use these relationships for query answering, for example, using techniques applied in data integration systems, namely Local-as-View (LAV), Global-as-View (GAV), or Global-Local-as-View (GLAV) [Lenzerini, 2002]. However, P2P applications pose additional requirements on matching algorithms. In P2P settings, the assumption that all the peers rely on one global schema, as in data integration, cannot be made, because the global schema may need to be updated any time the system evolves (see [Giunchiglia, Zaihrayeu, 2002]). Thus, while in data integration schema matching operations can be performed at design time, in P2P applications peers need a means of coordinating their databases on the fly, which requires a run time schema matching operation.

Trends in services and applications: long term (6-12 years)

It is hard to foresee what is going to happen in the long term, since the Semantic Web in particular and computer science in general are very dynamic and continuously evolving fields. Of course, in the long term, we expect different variations (e.g., P2P trading grid) of the applications mentioned so far. However, as one possible new scenario, we could see semantic matching services embedded inside operating systems.
