KWTR: Web Data Extraction / Information Extraction
From semanticweb.org
Main Contributors:
Diana Maynard - University of Sheffield (USFD) (d.maynard@dcs.shef.ac.uk)
York Sure - University of Karlsruhe (UKARL) (sure@aifb.uni-karlsruhe.de)
- 1. CURRENT TRENDS IN SEMANTIC WEB (In the following part we intend to identify the state of the art of Semantic Web based theories, methods, applications and tools in your research field.)
- 1.1. One or more examples (case studies) in which semantic web has been used.
Name of the institutions: Bayer, IChemE
Industry / sector: Chemical Engineering / Employment
Business activities improved by the SW solutions: Employment, Awareness of Activities in the Field
Research Needs: tracking of information over time, monitoring new business trends and markets
Name of the project: h-TechSight
Tools and applications implemented in the project: GATE, WebQL, MASH, ToolBox

Name of the institutions: Innovantage
Industry / sector: Recruitment
Business activities improved by the SW solutions: information syndication from websites of UK companies
Research Needs: instance unification
Name of the project: JOCI
Tools and applications implemented in the project:
Recruitment Intelligence Collector
- 1.2. The first 4 Semantic Web based tools used in your research fields.
Name: GATE
Website: http://gate.ac.uk
White paper:
Main characteristics: architecture for language engineering, contains Information Extraction system plus many processing resources for ontology-based applications
Open problems:

Name: MAGPIE
Website: http://kmi.open.ac.uk/projects/magpie/main.html
White paper: http://owl.man.ac.uk/api.shtml
Main characteristics: tool to semantically mark web documents on the fly using ontologies
Open problems:

Name: KIM
Website: http://www.ontotext.com/kim/
White paper:
Main characteristics: software platform for semantic annotation of text, indexing and retrieval, and query and exploration of formal knowledge
Open problems:

Name: T-rex
Website: http://tyne.shef.ac.uk/t-rex/
White paper:
Main characteristics: architecture acting as a testbed for experimenting with several extraction algorithms and several extraction scenarios, especially extraction from the web
Open problems:

Others:
* Text2Onto – ontology learning from text
* JENA – Semantic Web Java framework – RDF, RDFS, OWL support
- 1.3. A short summary of the first 3 best papers in the field.
Reference:
* H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02), 2002.
Short abstract:
Reference:
* K. Bontcheva, V. Tablan, D. Maynard, H. Cunningham. Evolving GATE to Meet New Challenges in Language Engineering. Natural Language Engineering. 10 (3/4), pp. 349-373. 2004.
- 1.4. A short list of open problems in theories and methods.
* Scalability - many tools do not scale up easily to large ontologies and large quantities of data
* Specific vs generic - NLP applications work best when restricted to a very specific domain and a simple text type; if a general domain and/or a complex text type is required, performance must generally be sacrificed
* Availability of training data - most systems work best when large amounts of training data are available, but such data can be time-consuming to construct
- 2. TRENDS ON THEORIES AND METHODS, SERVICES AND APPLICATIONS
- 2.1. Research projects in which contributors are involved, along with a general description. Moreover, suggest for each project the possible future uses and applications related to the Semantic Web, the acceptance and diffusion in each period considered, the benefits, and the problems that will probably occur.
Name of the project: MediaCampaign
Type: IST-2004
Duration: 30 months
Partners:
Research Institution: JRS (AU) – Uni Twente (NL) – TNO (NL)
Industrial Partners: Nielsen Media Research (UK) – Ontotext (BG) – HS Art (AU) – Softeco (IT)
Core activities:
* Components for advertisement analysis of the audio, video and text modalities -- very high relevance -- will be solved in the long term
Market opportunities:
* Components for advertisement analysis of the audio, video and text modalities -- in the long term (10 years) very high acceptance and diffusion in the market
Benefits for industry and practitioners:
* Components for advertisement analysis of the audio, video and text modalities -- very high relevance
Comments: MediaCampaign is a recently started IST project that aims to discover, inter-relate and navigate cross-media campaign knowledge, and to automate as far as possible the detection and tracking of media campaigns on television, on the Internet and in the press. Campaigns and adverts will be represented in an ontology, and instances will be extracted from multimedia documents. These instances will be used in reasoning to inter-relate the documents and discover new media campaigns.
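
The idea in the MediaCampaign comments, that instances extracted from documents can be used to inter-relate those documents, can be illustrated with a minimal sketch. This is not the project's method: the document identifiers, the classes `Brand` and `Product`, the instance labels, and the function `relate_documents` are all hypothetical, and simple set overlap stands in for ontology-based reasoning.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical extraction output: document id -> set of (ontology class, instance label).
extracted = {
    "tv_spot_01": {("Brand", "Acme"), ("Product", "Acme Cola")},
    "press_ad_07": {("Brand", "Acme"), ("Product", "Acme Cola")},
    "web_banner_3": {("Brand", "Globex")},
}


def relate_documents(instances_by_doc):
    """Map each pair of documents to the set of instances they share."""
    # Invert the index: instance -> documents that mention it.
    docs_by_instance = defaultdict(set)
    for doc, instances in instances_by_doc.items():
        for inst in instances:
            docs_by_instance[inst].add(doc)

    # Every pair of documents mentioning the same instance becomes a link.
    shared = defaultdict(set)
    for inst, docs in docs_by_instance.items():
        for a, b in combinations(sorted(docs), 2):
            shared[(a, b)].add(inst)
    return dict(shared)


links = relate_documents(extracted)
for pair, instances in links.items():
    print(pair, sorted(instances))
```

Document pairs that share several instances are candidates for belonging to the same campaign; a real system would presumably replace the overlap test with inference over the campaign ontology.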

Name of the project: SEKT
Type: IST
Duration: Jan 2004 – Dec 2006
Partners: see www.sekt-project.com for details
Core activities:
* data mining for ontology learning -- high relevance -- will be solved in the medium and long term
* ontology evolution -- high relevance -- will be solved in the medium and long term
* semantic annotation -- high relevance -- will be solved in the medium and long term
Market opportunities: semantic-based knowledge management in digital libraries, in the legal domain, and on large corporate intranets
Benefits for industry and practitioners:
* ease of finding information -- high relevance
* information syndication -- high relevance
Technological Problems (missing theories and methods):
* scalability issues -- high relevance -- will be solved in the long term
* domain portability -- high relevance -- will be solved in the long term
Technological Problems (missing tools):
* scalable semantic repositories -- very high relevance -- will be solved in the medium term

Other projects: MUSING is an EU IST project running from April 2006 to March 2010
- 2.2. Some topics that will not be solved in the short and medium term. For each of them there is a short explanation of the main reasons and (if possible) some references.
Topics: Reason:
- 3. TRENDS ON TOOLS
- 3.1. A list of the most relevant semantic based demos in the area.
Name: Ontology-based IE
Description: ontology-based IE
Website: http://gate.ac.uk/projects/sekt
Main features:
* ontology-based information extraction -- high relevance
* human-assisted semantic annotation -- high relevance
Open problems:
* efficiency -- medium relevance -- will be solved in the medium term

Name: CLIE
Description: controlled language information extraction
Website: http://gate.ac.uk/projects/sekt
References:
Main features:
* language-based ontology authoring tool -- high relevance
Open problems:
* evaluation of improvement -- medium relevance -- will be solved in the medium term
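
As a rough illustration of what ontology-based information extraction produces, the sketch below annotates text spans with classes from a toy ontology using plain label lookup. It does not use GATE or reproduce the SEKT demo: the `ONTOLOGY` dictionary, its class names, and the `annotate` function are assumptions made for this example, and real systems go well beyond exact string matching.

```python
import re

# Hypothetical toy ontology: class name -> known instance labels.
ONTOLOGY = {
    "Organization": ["University of Sheffield", "Ontotext"],
    "Tool": ["GATE", "KIM"],
}


def annotate(text, ontology=ONTOLOGY):
    """Return (start, end, label, class) tuples sorted by character offset."""
    annotations = []
    for cls, labels in ontology.items():
        for label in labels:
            # Whole-word, case-sensitive match of the instance label.
            pattern = r"\b" + re.escape(label) + r"\b"
            for match in re.finditer(pattern, text):
                annotations.append((match.start(), match.end(), label, cls))
    return sorted(annotations)


text = "GATE was developed at the University of Sheffield."
result = annotate(text)
for start, end, label, cls in result:
    print(f"{start}-{end}: {label} -> {cls}")
```

Each annotation links a span of text to an ontology class, which is the kind of output that a human annotator could then confirm or correct in a human-assisted setting.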
- 3.2. A short description of tools that are still missing. A description of business activities and problems they should solve, will be provided.
none
- 4. Please feel free to add any comment or suggestion.
none