PoolParty Extractor

From semanticweb.org.edu
Status: stable
Last release: 1.0 (March 2011)
License: Pay Licensed Closed Source
Affiliation: Semantic Web Company

PoolParty Extractor (PPX) provides entity extraction from unstructured text based on SKOS thesauri, as well as XML-to-SKOS mappings, which enable semantic search over heterogeneous data sources. The SKOS thesaurus can be created with, for example, the PoolParty thesaurus management system.

PPX creates an index from such a thesaurus, analyzes text supplied by the user and suggests controlled and free tags for this text based on the indexed thesaurus.

The PoolParty Extractor can process text documents and perform statistical analysis on them. Because PPX creates an extraction model from a thesaurus, the suggested tags can be controlled semantic tags instead of conventional keyword tags. That is, PPX checks whether the analyzed text contains any words or phrases that match a concept label stored in the loaded thesaurus. These concepts can be associated with synonyms, abbreviations, multilingual labels, etc., so they make for much better tags and improve the findability of tagged content.
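The label matching described above can be illustrated with a minimal sketch. It is not PPX's implementation: the toy thesaurus, the `example.org` concept URIs, and the function names `build_label_index` and `match_concepts` are assumptions made for illustration, and the matching is a plain case-insensitive whole-word search without stemming.

```python
import re

# Hypothetical toy thesaurus: concept URI -> labels
# (preferred label, synonyms, abbreviations). Not real PPX data.
THESAURUS = {
    "http://example.org/concept/semantic-web": ["Semantic Web", "Web of Data"],
    "http://example.org/concept/skos": [
        "Simple Knowledge Organization System",
        "SKOS",
    ],
}


def build_label_index(thesaurus):
    """Map each lower-cased label to the concept it belongs to."""
    index = {}
    for concept, labels in thesaurus.items():
        for label in labels:
            index[label.lower()] = concept
    return index


def match_concepts(text, index):
    """Return the concepts whose labels occur in the text as whole words."""
    lowered = text.lower()
    found = set()
    for label, concept in index.items():
        if re.search(r"\b" + re.escape(label) + r"\b", lowered):
            found.add(concept)
    return found


index = build_label_index(THESAURUS)
concepts = match_concepts("SKOS is a building block of the web of data.", index)
```

Because every label, including synonyms and abbreviations, points to the same concept, a text that only mentions "web of data" is still tagged with the Semantic Web concept.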

Additionally, concepts can be mapped to resources from the Linked Data cloud, where even more background knowledge is available, such as longitude and latitude or category and type information from DBpedia or Yago. This enriches the semantic fingerprint of a document tagged with such concepts and offers more possibilities for improved content recommendation and similarity calculations. The calculated tag suggestions can also be used to classify documents according to pre-defined categories.

The PoolParty Extractor uses a rule-based system to perform sentence splitting, tokenization, stemming, and phrase construction on natural language documents. Each word and phrase extracted from the document is assigned a relevancy score, and the extractions with the highest scores are chosen as the keywords and key phrases that represent the input document. The extractor's scoring function takes the frequency and the position of words into account: the more frequent a word is, the higher its score, and words at the beginning of the text are considered more relevant and therefore score higher than words at the end of the document.
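The exact scoring function is not given here, so the following is only a sketch of one way to combine frequency and position. The linear decay, the `position_decay` parameter, and the function names are assumptions; sentence splitting, stemming, and phrase construction are left out.

```python
import re
from collections import defaultdict


def score_terms(text, position_decay=0.5):
    """Score each token by frequency, weighting early occurrences higher.

    Every occurrence adds a weight between (1 - position_decay) and 1.0,
    so frequent tokens accumulate more score and tokens near the start
    of the text contribute more than tokens near the end.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    total = len(tokens)
    scores = defaultdict(float)
    for position, token in enumerate(tokens):
        scores[token] += 1.0 - position_decay * (position / total)
    return dict(scores)


def top_terms(text, k=3):
    """Return the k highest-scoring tokens, ties broken alphabetically."""
    scores = score_terms(text)
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


scores = score_terms("alpha beta gamma beta")
best = top_terms("alpha beta gamma beta", k=2)
```

In this toy input, "beta" ranks first because it occurs twice, and "alpha" outranks "gamma" because it appears earlier in the text.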

The extracted words and phrases are looked up in a thesaurus, after which tag suggestions, divided into concept tags and free tags, can be presented to the user or automatically associated with the document. Possible further services using PPX include:

  • retrieval of additional information from Linked Data resources,
  • autocompletion of user-entered terms,
  • facets provided from a thesaurus for search assistants or navigational aids,
  • moderated search or query expansion powered by the thesaurus.
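
As an illustration of the last service in the list, thesaurus-powered query expansion could look roughly like the sketch below. The thesaurus layout (`pref`, `alt`, `narrower`), the `ex:` concept identifiers, and the function `expand_query` are hypothetical and do not describe a PPX API.

```python
# Hypothetical toy thesaurus keyed by concept id. "pref" and "alt" mimic
# SKOS preferred and alternative labels, "narrower" lists child concepts.
THESAURUS = {
    "ex:car": {
        "pref": "car",
        "alt": ["automobile", "motor car"],
        "narrower": ["ex:suv"],
    },
    "ex:suv": {
        "pref": "SUV",
        "alt": ["sport utility vehicle"],
        "narrower": [],
    },
}


def labels_of(concept_id, thesaurus):
    """Return the preferred label followed by the alternative labels."""
    entry = thesaurus[concept_id]
    return [entry["pref"]] + entry["alt"]


def expand_query(term, thesaurus, include_narrower=True):
    """Expand a search term with the labels of every matching concept.

    A concept matches when the term equals one of its labels, ignoring
    case. Terms unknown to the thesaurus are returned unchanged.
    """
    wanted = term.lower()
    expanded = []
    for concept_id, entry in thesaurus.items():
        if wanted not in (label.lower() for label in labels_of(concept_id, thesaurus)):
            continue
        concept_ids = [concept_id]
        if include_narrower:
            concept_ids += entry["narrower"]
        for cid in concept_ids:
            for label in labels_of(cid, thesaurus):
                if label not in expanded:
                    expanded.append(label)
    return expanded or [term]


expanded = expand_query("automobile", THESAURUS)
```

A search for "automobile" would then also retrieve documents that only mention "car", "motor car", or the narrower concept "SUV".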