EntitySearch2010
This page describes the evaluation process used at the Entity Search Track of the Semantic Search 2010 workshop.
Task definition
Data set
We used the Billion Triples Challenge 2009 dataset, modified to replace blank nodes with URIs so that blank nodes become retrievable without specifying a context.
Blank nodes are encoded using the following rule:
A blank node with id BNID maps to http://example.org/URLEncode(BNID). Since the blank node ids in that dataset are unique, this convention is sufficient to map blank nodes to distinct URIs.
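The rule above can be sketched in a few lines of Python. This is only an illustration: the function name blank_node_to_uri is hypothetical, and it assumes that URLEncode means percent-encoding of every reserved character, which the page does not specify.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # prefix given in the encoding rule


def blank_node_to_uri(bnid: str) -> str:
    """Map a blank node id to a URI by percent-encoding the id."""
    # Assumption: URLEncode escapes all reserved characters, so safe="" is
    # used and even "/" is percent-encoded.
    return BASE + quote(bnid, safe="")
```

Because percent-encoding is reversible, two different ids can never produce the same URI, which matches the uniqueness argument in the text.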
Click here to download the original data set hosted at DERI.
Click here to download the modified data set hosted at KIT.
Query set
We merged two sets of queries: an easier set of popular queries and a more difficult set of long-tail queries.
- We selected 42 popular queries (each asked by >= 10 users) from a Microsoft Live Search query log. Only queries that contained at least one named entity (as recognized by the Edinburgh NER tool) were considered. This MS Live Search query log is not available for general use. TODO: HARRY: double check this
- We selected 50 queries from a Yahoo query log sample. Only queries asked by >= 3 users were included. The sample originally contained 4496 queries, from which we selected 50 queries that were classified as entity queries. This classification was done manually, using the methodology described in (1).
Click here to download the final query set of 92 queries.
Result format and Submission
We asked for submissions in the TREC qrels format. We assigned a query id to each query, from q1 to q92. We collected the submissions through EasyChair, as attachments to the system descriptions.
Click here for a description of the qrels format
Some participants submitted URIs that were not used as subjects in the original data. These were counted as irrelevant results.
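A minimal sketch of how this rule could be applied when scoring a submitted result. The function name and both data structures are hypothetical, and treating unjudged results as irrelevant is an additional assumption not stated on this page.

```python
def effective_relevance(qid, uri, subjects, judgments):
    """Return the relevance used for scoring one submitted (qid, uri) result.

    subjects:  set of URIs that occur as subjects in the original data
    judgments: dict mapping (qid, uri) to an assessed relevance value
    """
    if uri not in subjects:
        # Rule from the text: non-subject URIs count as irrelevant.
        return 0
    # Assumption: a result with no judgment is also treated as irrelevant.
    return judgments.get((qid, uri), 0)
```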
Pooling
TODO: Daniel
The output format is (qid, uri) pairs in a tab-separated file.
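The pooling procedure itself is not documented on this page. The sketch below only illustrates the output format, under the common TREC-style assumption that the pool is the union of the top results of every run. The function names, the structure of a run, and the depth parameter are all hypothetical.

```python
import csv


def build_pool(runs, depth):
    """Union of the top-`depth` URIs per query over all runs (assumed rule)."""
    pool = set()
    for run in runs:  # each run: dict mapping qid -> ranked list of URIs
        for qid, ranked in run.items():
            for uri in ranked[:depth]:
                pool.add((qid, uri))
    return sorted(pool)


def write_pool(pairs, handle):
    """Write (qid, uri) pairs as tab-separated lines, one pair per line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(pairs)
```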
Merging results with collection
We used two different solutions for merging the pool with the data.
Option 1: collection building
Steps for single-machine collection building:
- Store the data on the cluster at /user/pmika/btc/data
- Group the data by subject using group-by-subject.sh. Note that this excludes documents with more than 10,000 triples. (Note to self: this could do the sorting already.) The results appear in /user/pmika/btc/bysubject.
- Get the list of subject URIs. The script metaurls.pig projects the data produced in the previous step and also sorts the URLs. (Note to self: this currently uses the original collection, which needs to be changed, because it makes it necessary to re-sort the by-subject data according to the original order.)
- Fetch the URLs from the cluster using fetch-urls.sh
- Generate a minimal perfect hash, i.e. a mapping from URIs to docids using compute-mph.sh.
- Fetch the data from the cluster using fetchcollection.sh.
- Build the collection using build-collection.sh, which uses the CollectionBuilder Java class. This takes several days.
- Use the collection from the command-line or build a web service around it.
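The grouping and docid steps above can be illustrated in memory with a small Python sketch. It is not the actual group-by-subject.sh or compute-mph.sh logic: the function names are hypothetical, triples are assumed to be already parsed into (subject, predicate, object) tuples, and a plain dictionary over the sorted URIs stands in for a real minimal perfect hash.

```python
from collections import defaultdict

MAX_TRIPLES = 10_000  # threshold stated in the steps above


def group_by_subject(triples, max_triples=MAX_TRIPLES):
    """Group (s, p, o) triples by subject and drop oversized documents."""
    groups = defaultdict(list)
    for s, p, o in triples:
        groups[s].append((p, o))
    # Documents with more than max_triples triples are excluded.
    return {s: po for s, po in groups.items() if len(po) <= max_triples}


def assign_docids(subjects):
    """Map each subject URI to a dense docid in sorted URI order.

    Stand-in for a minimal perfect hash: every URI gets a unique id
    in the range 0..n-1, but the mapping is stored explicitly.
    """
    return {uri: docid for docid, uri in enumerate(sorted(subjects))}
```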
Option 2: merging on Hadoop
The input is qid-entity-pool-queries.txt which contains (qid, query, uri) tuples.
- As with the previous option, group the data by subject.
- Use query-results.pig to merge the pool with the collection.
The output is qid-entity-pool-docs.txt with (qid, query, uri, doc) tuples.
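Conceptually, this merge is a join on the URI. The Python sketch below shows that join on in-memory data. It is not a translation of query-results.pig: the function name is hypothetical, and dropping pooled URIs that have no document in the collection (an inner join) is an assumption.

```python
def merge_pool_with_collection(pool_rows, docs_by_subject):
    """Join (qid, query, uri) rows with the by-subject documents.

    pool_rows:       iterable of (qid, query, uri) tuples
    docs_by_subject: dict mapping a subject URI to its document
    Returns a list of (qid, query, uri, doc) tuples.
    """
    merged = []
    for qid, query, uri in pool_rows:
        doc = docs_by_subject.get(uri)
        if doc is not None:  # assumption: unmatched URIs are dropped
            merged.append((qid, query, uri, doc))
    return merged
```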
Assessment using Mechanical Turk
TODO: Henry
Reporting the results
Computing retrieval performance
TODO: Daniel... e.g., which version and what settings of trec_eval, how the directory structure was organized, etc.
Computing agreement
The Java class MTurkStatistics takes as input a file of the form
HITId \t SS1 \t SS2 \t SS3
where each SS is of the form
TurkerId s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12
It computes:
- Fleiss' kappa per HIT
- Distribution of judgments
- Distribution of HITs across judges
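For reference, the per-HIT Fleiss' kappa can be sketched in Python as follows. This is not the MTurkStatistics code: the function names are hypothetical, and the parser assumes that the scores inside each SS field are separated by whitespace. With n judges, N items and n_ij judges assigning item i to category j, the sketch computes the standard quantities: per-item agreement P_i = (sum_j n_ij^2 - n) / (n(n-1)), its mean, and the chance agreement P_e = sum_j p_j^2.

```python
from collections import Counter


def parse_line(line):
    """Parse one input line into (hit_id, {turker_id: [scores]})."""
    hit_id, *fields = line.rstrip("\n").split("\t")
    ratings = {}
    for field in fields:
        # Assumption: TurkerId and scores are whitespace-separated.
        turker_id, *scores = field.split()
        ratings[turker_id] = scores
    return hit_id.strip(), ratings


def fleiss_kappa(ratings):
    """Fleiss' kappa for one HIT.

    ratings: list of equal-length score lists, one list per judge.
    Returns NaN when every judgment falls in a single category.
    """
    n = len(ratings)               # judges per item
    items = list(zip(*ratings))    # one tuple of n scores per item
    N = len(items)
    counts = [Counter(item) for item in items]
    # Mean observed agreement over items.
    p_bar = sum(
        (sum(c * c for c in cnt.values()) - n) / (n * (n - 1)) for cnt in counts
    ) / N
    # Chance agreement from the overall category proportions.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_e = sum((t / (N * n)) ** 2 for t in totals.values())
    if p_e == 1.0:
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)
```

A line is processed with parse_line, and the resulting score lists are passed to fleiss_kappa as list(ratings.values()).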
(1) Jeffrey Pound, Peter Mika, Hugo Zaragoza. Ad-Hoc Object Retrieval in the Web of Data. WWW 2010. http://www.zaragozas.info/hugo/academic/pdf/pound_WWW2010.pdf