KWTR: Web Data Extraction / Information Extraction

From semanticweb.org

Main Contributors:

Diana Maynard - University of Sheffield (USFD) (d.maynard@dcs.shef.ac.uk)

York Sure - University of Karlsruhe (UKARL) (sure@aifb.uni-karlsruhe.de)

See the list of contributors


  • 1. CURRENT TRENDS IN SEMANTIC WEB (In the following part we intend to identify the state of the art of Semantic Web based theories, methods, applications and tools in your research field.)
    • 1.1. One or more examples (case studies) in which semantic web has been used.
  Name of the institutions: Bayer, IChemE
  Industry / sector: Chemical Engineering / Employment	  
  Business activities improved by the SW solutions: Employment, Awareness of Activities in the Field
  Research Needs: tracking of information over time, monitoring new business trends and markets
  Name of the project: h-TechSight
  Tools and applications implemented in the project: GATE, WebQL, MASH, ToolBox
  Name of the institutions: Innovantage
  Industry / sector: Recruitment	  
  Business activities improved by the SW solutions: information syndication from websites of UK companies
  Research Needs: instance unification
  Name of the project: JOCI
  Tools and applications implemented in the project: Recruitment Intelligence Collector
    • 1.2. The first 4 Semantic Web based tools used in your research fields.
Name: GATE
Website: http://gate.ac.uk
White paper: 
Main characteristics: architecture for language engineering; contains an Information 
   Extraction system plus many processing resources for ontology-based applications
Open problems: 
                     
Name: MAGPIE
Website:  http://kmi.open.ac.uk/projects/magpie/main.html
White paper: http://owl.man.ac.uk/api.shtml
Main characteristics: tool to semantically mark web documents on the fly using ontologies
Open problems: 
 
Name: KIM
Website: http://www.ontotext.com/kim/
White paper: 
Main characteristics: software platform for semantic annotation of text, 
indexing and retrieval, and query and exploration of formal knowledge
Open problems: 
                                            
Name: T-rex
Website: http://tyne.shef.ac.uk/t-rex/
White paper: 
Main characteristics: architecture acting as a testbed for experimenting 
with several extraction algorithms and several extraction scenarios, especially 
extraction from the web
Open problems: 
Others:
* Text2Onto – ontology learning from text
* JENA – Semantic Web Java framework – RDF, RDFS, OWL support

    • 1.3. A short summary of the first 3 best papers in the field.
Reference:  
* H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02), 2002.
Short abstract: 

Reference:

* K. Bontcheva, V. Tablan, D. Maynard, and H. Cunningham. Evolving GATE to Meet New Challenges in Language Engineering. Natural Language Engineering, 10 (3/4), pp. 349-373, 2004.
    • 1.4. A short list of open problems in theories and methods.
* Scalability - many tools do not scale up easily to large ontologies and large quantities of data
* Specific vs generic - NLP applications work best when restricted to a very specific domain and a simple text type; if a general domain and/or a complex text type is required, performance must generally be sacrificed
* Availability of training data - most systems work best when large amounts of training data are available, but such data can be time-consuming to construct

  • 2. TRENDS ON THEORIES AND METHODS, SERVICES AND APPLICATIONS
    • 2.1. Research projects in which contributors are involved, along with a general description. Moreover, suggest for each project the possible future uses and applications related to the Semantic Web, the acceptance and diffusion in each period considered, the benefits, and the problems that will probably occur.
Name of the project: MediaCampaign
Type: IST-2004
Duration: 30 months
Partners: 
Research Institutions: JRS (AT) – Uni Twente (NL) – TNO (NL)
Industrial Partners: Nielsen Media Research (UK) – Ontotext (BG) – HS Art (AT) – Softeco (IT)
Core activities: 
* Components for advertisement analysis of modalities audio, video and text 
  -- very high relevance -- will be solved in the long term
Market opportunities: 
* Components for advertisement analysis of modalities audio, video and text 
  -- in the long term (10 years) very high acceptance and diffusion in the market
Benefits for industry and practitioners: 
* Components for advertisement analysis of modalities audio, video and text -- very high relevance
Comments: MediaCampaign is a recently started IST project, the aim of which 
consists of discovering, inter-relating and navigating cross-media campaign knowledge 
and extensively automating the detection and tracking of media campaigns on television, 
Internet and in the press. The campaign and adverts will be represented in an ontology  
and instances extracted from multimedia documents. These instances will be used by reasoning 
to inter-relate the documents and discover new media campaigns.
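
The pipeline described in the comment above (ontology, instance extraction, inter-relating documents) can be illustrated with a minimal sketch. It is not MediaCampaign code: the toy ontology, the document texts, and the function names `extract_instances` and `relate_documents` are all hypothetical, and plain string matching stands in for real multimedia analysis and ontology reasoning.

```python
import re
from collections import defaultdict

# Hypothetical toy ontology: class name -> known instance labels.
ONTOLOGY = {
    "Brand": ["Acme Cola", "Zeta Motors"],
    "Medium": ["television", "Internet", "press"],
}

def extract_instances(text):
    """Return the set of (class, label) pairs whose label occurs in the text."""
    found = set()
    for cls, labels in ONTOLOGY.items():
        for label in labels:
            pattern = r"\b" + re.escape(label) + r"\b"
            if re.search(pattern, text, re.IGNORECASE):
                found.add((cls, label))
    return found

def relate_documents(docs, cls="Brand"):
    """Group document ids by the instances of one class that they mention."""
    groups = defaultdict(set)
    for doc_id, text in docs.items():
        for found_cls, label in extract_instances(text):
            if found_cls == cls:
                groups[label].add(doc_id)
    # Keep only instances shared by at least two documents (a candidate campaign).
    return {label: ids for label, ids in groups.items() if len(ids) > 1}

# Made-up documents standing in for TV, web and press material.
docs = {
    "tv-01": "Acme Cola spot aired on television last night.",
    "web-07": "Banner for acme cola seen on the Internet.",
    "press-03": "Zeta Motors full-page advert in the press.",
}
campaigns = relate_documents(docs)
```

Here `campaigns` links the TV and web documents through the shared brand instance, which is the simplest form of the cross-media inter-relation the project aims at; a real system would replace the shared-label test with inference over the ontology.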

Name of the project: SEKT
Type: IST
Duration: Jan 2004 – Dec 2006
Partners: see www.sekt-project.com for details
Core activities: 
* data mining for ontology learning -- high relevance -- will be solved in the medium and long term
* ontology evolution -- high relevance -- will be solved in the medium and long term
* semantic annotation -- high relevance -- will be solved in the medium and long term
Market opportunities: semantic-based knowledge management in digital libraries, 
in the legal domain, and on large corporate intranets 
Benefits for industry and practitioners: 
* ease of finding information -- high relevance
* information syndication -- high relevance
Technological Problems (missing theories and methods):
* scalability issues -- high relevance -- will be solved in the long term
* domain portability -- high relevance -- will be solved in the long term
Technological Problems (missing tools):
* scalable semantic repositories -- very high relevance -- will be solved in the medium term

Other projects: MUSING is an EU IST project running from April 2006 to March 2010
    • 2.2. Some topics that will not be solved in the short and medium term. For each of them there is a short explanation of the main reasons and (if possible) some references.
Topics: 
Reason: 
  • 3. TRENDS ON TOOLS
    • 3.1. A list of the most relevant semantic based demos in the area.
Name: Ontology-based IE
Description: ontology-based IE
Website: http://gate.ac.uk/projects/sekt
Main features:
* ontology-based information extraction -- high relevance
* human-assisted semantic annotation -- high relevance
Open problems:
* efficiency -- medium relevance -- will be solved in the medium term
Name: CLIE
Description: controlled language information extraction
Website: http://gate.ac.uk/projects/sekt
References: 
Main features:
* language-based ontology authoring tool -- high relevance
Open problems:
* evaluation of improvement -- medium relevance -- will be solved in the medium term
    • 3.2. A short description of tools that are still missing. A description of the business activities and problems they should solve will be provided.

none

  • 4. Please feel free to add any comment or suggestion.

none
