SemSearch2010
TODO list
- Thanh to send around an updated Call for Papers for review by Haofen, Marko, Peter (deadline December 24)
- Peter to see if we can get Y queries for evaluation (deadline Dec 24)
- Peter to see if we can get an updated RDFa/microformats dataset (deadline Dec 24)
- Look into OpenLink LOD dataset on Amazon (Peter, Thanh)
- Come up with a final description of the task (Retrieval task only for now)
- Define API for search engines to implement (query -> ranked list of resources)
- Convert bnodes to URIs in all datasets
- Write Mechanical Turk application: given a query and results, generate tasks that ask users to rate the relevance of each result to the query
- Write application to fetch results from MTurk and compute metrics (this should probably also have a simple API for plugging in new metrics)
Evaluation setup
There are a couple of options for all parts of the configuration ;)
Data
- DBP: DBpedia only
- BTC2009: BTC 2009 dataset: crawl data from Semantic Web search engines (not guaranteed to be complete with respect to any particular LOD dataset)
- LOD: Selected LOD datasets
- RDFA: Microformat/RDFa dump available as part of Yahoo's WebScope program: this dataset was originally part of the BTC 2008 dataset, but we can get an updated version. Requires users to sign a license.
Queries
- AOL: The AOL query log
- YALL: Small (e.g. 500 query) sample of the Yahoo query log, licensed through WebScope
- YENT: Small (e.g. 500 query) sample of the Yahoo query log, containing only entity queries, licensed through WebScope
Query processing
- KEYW: Keyword queries only
- ANN: Annotated queries, e.g. entity name + context
- SPARQL: SPARQL queries
Evaluation
- MTURK: Amazon Mechanical Turk. Assuming 10 participants, 100 queries, 10 results per query and 5 judgments per result, we have at most 10 x 100 x 10 x 5 = 50000 judgments to make. Assuming we can have them evaluated at $0.05 per judgment, we need $2500 for evaluation. This is a gross overestimate: we might not need duplicate judgments per result.
- COMM: We evaluate it ourselves.
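The MTURK estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the function name and the choice to work in cents (to avoid floating-point rounding) are assumptions made for illustration.

```python
def judgment_budget(participants, queries, results_per_query,
                    judgments_per_result, price_cents):
    """Return (number of judgments, total cost in cents) as an upper bound."""
    judgments = (participants * queries * results_per_query
                 * judgments_per_result)
    return judgments, judgments * price_cents


# Figures from the MTURK item: 10 participants, 100 queries,
# 10 results per query, 5 judgments per result, $0.05 per judgment.
judgments, cents = judgment_budget(10, 100, 10, 5, 5)
print(judgments, cents / 100)

# Hypothetical variant with a single judgment per result.
judgments, cents = judgment_budget(10, 100, 10, 1, 5)
print(judgments, cents / 100)
```

Dropping the duplicate judgments cuts the bound by the redundancy factor, which is why the 5-judgment figure is an overestimate.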
Relevance Judgment
- 4P: 4 point scale as in Pound et al.: Not Relevant, Somewhat Relevant, Relevant, Perfect Match
- 3P: 3 point variant: Not Relevant, Relevant, Perfect Match
- AB: Side-by-side judgments: do you prefer result A over result B?
Conclusion:
- 3 point scale: Not Relevant, Relevant, Perfect Match
Metrics
- PR: Precision, Recall, F-Measure, MAP...
- SR: Structural Relevance based measures
- XX: Some ranking from pairwise preference judgments
Calculations: We used trec_eval to compute the metrics:
trec_eval -q -c -l2 assessment_file submitted-searchresult-file
For ndcg:
trec_eval -q -c -m ndcg.0=0,1=0,2=1,3=3 assessment_file submitted-searchresult-file
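To make the gain mapping in the ndcg command concrete, here is a minimal sketch of NDCG with the same label-to-gain table (0 -> 0, 1 -> 0, 2 -> 1, 3 -> 3). It uses the common log2(rank + 1) discount; trec_eval's own implementation may differ in details, and the function names are made up for illustration.

```python
import math

# Gain per relevance label, mirroring ndcg.0=0,1=0,2=1,3=3.
GAINS = {0: 0.0, 1: 0.0, 2: 1.0, 3: 3.0}


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains, start=1))


def ndcg(ranked_ids, judgments, gains=GAINS):
    """ranked_ids: result ids in rank order.
    judgments: dict mapping result id -> relevance label.
    Results without a judgment are treated as label 0."""
    run_gains = [gains.get(judgments.get(rid, 0), 0.0) for rid in ranked_ids]
    # Ideal ranking: all judged results sorted by decreasing gain.
    ideal_gains = sorted((gains.get(label, 0.0)
                          for label in judgments.values()), reverse=True)
    ideal = dcg(ideal_gains)
    return dcg(run_gains) / ideal if ideal > 0 else 0.0


judged = {"a": 3, "b": 2, "c": 0}
print(ndcg(["a", "b", "c"], judged))  # best possible ordering
print(ndcg(["c", "b", "a"], judged))  # worst ordering of the same results
```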
Problems:
- Some result lists contained duplicates -> the second occurrence of a result was marked with "-duplicate" and counted as not relevant.
- Some results were URL-encoded (&amp; instead of & etc.) and needed to be resolved beforehand.
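The two clean-up steps above could be sketched as follows. This assumes the encoding problem was HTML entity escaping, which is one reading of the note, and that every repeat (not only the second occurrence) gets the same suffix. The function name is hypothetical.

```python
import html


def clean_ranking(result_ids):
    """Unescape entities, then tag repeated results so they no longer
    match any assessed resource and are scored as not relevant."""
    seen = set()
    cleaned = []
    for raw in result_ids:
        # Caveat: html.unescape also rewrites legacy entities that lack
        # a semicolon, so "&copy=1" inside a URI would be altered.
        rid = html.unescape(raw)
        if rid in seen:
            cleaned.append(rid + "-duplicate")
        else:
            seen.add(rid)
            cleaned.append(rid)
    return cleaned


print(clean_ranking([
    "http://example.org/r?a=1&amp;b=2",
    "http://example.org/r?a=1&b=2",
    "http://example.org/other",
]))
```

If a list can contain the same result more than twice, the suffix would need a counter so that the tagged entries stay distinct.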
Proposal
Two tasks:
Retrieval task
This task approximates the task of a semantic search engine ranking objects in response to keyword queries.
Queries are keyword queries with a known entity focus. The task is to produce, for each query, a list of at most 10 resources ordered by relevance (thus a result has to be a resource in the corpus). A perfect match is a resource that gives a description of the entity mentioned in the query. A relevant result is a resource that is related to the entity, or at least mentions the entity in a text field.
Suggestion: RDFA + YENT + KEYW + MTURK + 3P + PR
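As an illustration of what a submission for this task might look like, the sketch below writes one query's ranked list in the standard TREC run format (query id, the literal Q0, document id, rank, score, run tag) and enforces the limit of 10 resources. Whether submissions use exactly this format is an assumption; the function name, run tag and URIs are invented.

```python
def format_run(query_id, ranked, run_tag="example-run", max_results=10):
    """ranked: list of (resource_uri, score) pairs, best first.
    Returns TREC-style run lines, truncated to max_results."""
    lines = []
    for rank, (uri, score) in enumerate(ranked[:max_results], start=1):
        lines.append(f"{query_id} Q0 {uri} {rank} {score} {run_tag}")
    return lines


# Hypothetical ranking with more than 10 candidates.
candidates = [(f"http://example.org/resource/{i}", 1.0 / (i + 1))
              for i in range(15)]
for line in format_run("q1", candidates):
    print(line)
```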
Data integration and summarization task
This task approximates the task of a search engine that needs to create an infobox dynamically as a response to a keyword query.
Queries are keyword queries with a known entity focus. The task is to produce a single integrated result, i.e. one result combining information from one or more resources in the dataset. A result is an ordered list of at most 10 key/value pairs. The key is the name of a property that is an existing property of one of the resources combined. The value is either a literal of at most 100 characters or a link to a website (URL).
Suggestion: RDFA + YENT + KEYW + MTURK + AB + XX
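The constraints on an integrated result can be checked mechanically. The following is a rough sketch only: the function names, the error messages and the rule that a URL means an http(s) address are assumptions, not part of the task definition.

```python
from urllib.parse import urlparse

MAX_PAIRS = 10
MAX_LITERAL_LENGTH = 100


def is_url(value):
    """Assumed rule: a link is any http(s) URL with a host part."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def validate_result(pairs, known_properties):
    """pairs: ordered list of (key, value) strings.
    known_properties: property names found on the combined resources.
    Returns a list of problems; an empty list means the result is valid."""
    problems = []
    if len(pairs) > MAX_PAIRS:
        problems.append(f"too many pairs: {len(pairs)}")
    for key, value in pairs:
        if key not in known_properties:
            problems.append(f"unknown property: {key}")
        if not is_url(value) and len(value) > MAX_LITERAL_LENGTH:
            problems.append(f"literal too long for {key}")
    return problems


infobox = [("name", "Berlin"), ("homepage", "http://www.berlin.de/")]
print(validate_result(infobox, {"name", "homepage"}))
```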
Mechanical Turk Notes
Instructions on current XProc pipeline, still being debugged to match current Amazon APIs.
Questions we are seeking answers to
- What's a realistic fee?:: We can vary the fee over the course of an experiment and see what impact that has.
- What kind of turnaround/throughput can we expect?:: I think there are a lot of somewhat frustrated Turkers out there who are looking for work, but there's no guarantee that's correct
- Can we get the quality we need?:: Is getting three independent results per query enough? Too much?
- How bad are existing results anyway?:: That is, what percentage of hits get thrown away?
- Is ambiguity a problem?:: Does the context give enough information to the Turkers so they agree on the meaning of the query?
Experiment design
Independent variables for an experiment on the query+context task:
# Number of queries (e.g. 20000)
# Number of results we check (e.g. 20)
# Redundancy of checking (e.g. x3)
# Fee we pay per task (e.g. 0.10 USD)
Dependent variables:
# Total cost (e.g. 7800 GBP)
Pre-pilot
100 queries, 30 results checked, 3 checkers per query, 10 results per task, 0.10 USD/task, 99 USD / 58.24 GBP cost
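The pre-pilot cost can be re-derived from these parameters. The per-task arithmetic gives 900 tasks and 90 USD in fees; the quoted 99 USD matches if a 10% Amazon commission is added on top, which is an inference from the numbers rather than something stated in these notes. The function below is a sketch under that assumption, working in cents to keep the arithmetic exact.

```python
import math


def experiment_cost(queries, results_checked, checkers,
                    results_per_task, fee_cents, commission_percent=10):
    """Return (number of tasks, total cost in USD).
    commission_percent is an assumed Amazon surcharge on the fees."""
    tasks_per_query = math.ceil(results_checked / results_per_task)
    tasks = queries * tasks_per_query * checkers
    total_usd = tasks * fee_cents * (100 + commission_percent) / 10000
    return tasks, total_usd


# Pre-pilot parameters: 100 queries, 30 results, 3 checkers,
# 10 results per task, 0.10 USD per task.
print(experiment_cost(100, 30, 3, 10, 10))
```

With the example values from the experiment design (20000 queries, 20 results, x3 redundancy, 0.10 USD per task) and the same 10 results per task, the same function gives 120000 tasks, which at the exchange rate implied by 99 USD / 58.24 GBP lands roughly in the range of the 7800 GBP example.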
What are the material and infrastructure prerequisites to making this happen?
- Hand-selected 100 queries -- from a domain, or just culled from the top 200 list, or . . .
- HST doesn't think the "top 100" is going to work -- the list he got from wordtracker is heavily weighted to pop stars and rock groups, and the top 30 hits on them from Yahoo are mostly pretty good. . .
- Already have a utility pipeline which takes an arbitrary AMT operation, xincludes and ships it and gets result back
- Pipeline to go from query to Yahoo to transformation into AMT QuestionForm document
- This is a pain, because the QF allows only very minimal formatting, and no linking or frames, so this pipeline will have to retrieve and downsample each hit from HTML to QF markup. This may blow the whole deal, so HST is going to have a go at it right away.
- HST now (next day) thinks this really won't work. Amazon have replied that they will allow links in the (near?) future, so HST proposes to go ahead with a textual URL which has to be copied and pasted by the Turkers in order to perform the task.
- Still two alternatives:
1. Use the Amazon form for the questions and answers, link is to the cached/cleaned resource;
1. Cheat, link is to a full-fledged HTML Form, which ends by giving the Turker a cookie for filling in to the Amazon form
- In any case we must (well, almost certainly) cache the result resources, as in "Save Web Page, complete", which involves at least
* IMG/@src, IMG/@usemap, IMG/@longdesc
* OBJECT/@data (only if OBJECT/@type is image/... ?), OBJECT/@usemap
* FRAME/@src, FRAME/@longdesc, IFRAME/@src, IFRAME/@longdesc
* LINK/@href if @type is text/css
* INPUT/@src, INPUT/@usemap
* SCRIPT/@src?
* BODY/@background
* will miss background images in stylesheets. . .
1. Got to work harder:
* wget --exclude-domains=doubleclick.net,googlesyndication.com -nd -P twg -E -H -k -p [URI]
* tidy -m -f errs -q -asxml, check exit status (0, 1 are OK, 2 is a problem)
* Remove all SCRIPT elements
* Trash all url(...) in css content or files
1. This will be non-trivial if the resource is broken vanilla HTML and not tidiable (although sx with large error count may work. . .)
1. Maybe a better alternative is converting to a dead image format. Image Wizard (http://www.popularshareware.com/HTML-To-Image-Wizard_download_27786.html) seems to work well -- I've produced PNGs for a few complex cases and they look pretty good. . .
1. Using this on Windoz/Cygwin, as it uses IE
1. Other possibilities: PDFCreator, Print to file, Pearl Crescent Page Saver (http://pearlcrescent.com/products/pagesaver/)
1. Using pagesaver on Linux, by hacking around its not having a filename argument, using {{{showpage}}} script which watches for the appearance of a file in the output directory and renames it. Have emailed them offering to share code and do the fix ourselves.
1. Problem here: Both of these are too slow -- 2:42 on Windoz and 3:30 on Linux to do 20 pages
1. What the page looks like. What about three panes:
1. Instructions, never changes -- or as-it-were result page, with [] skip this boxes and a [go] button?
1. Down-sampled version of resource itself
1. Always shown, or on click?
1. Link in summary to this, or to real resource?
1. Two queries for qualification test -- say 30 results to check, different kinds of relevance distinctions
1. Probably should pilot the qualification test itself, looking at results by hand, to see if auto-qualifying will work or not. . .
1. Can do the qualification pilot and qualification itself by hand, just using ss.xpdl
- Shell script to run over file of queries, run QF pipeline, create HITs for each one, log the HITId
- Pipeline to run over reviewable queries
- retrieve answers
- save response time
- compare them
- accept where there's unanimity
- flag others for human review
- May need automation of review and payoff
- Only put up 20 the first time around, review design after we get answers
- Possibly vary fee at this point
- Tabulate/graph the response times
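The "compare them / accept where there's unanimity / flag others for human review" step of the review pipeline could look roughly like this. The tuple layout, the function name and the rule that a HIT needs at least three answers before it can be accepted are assumptions for illustration; the real pipeline is XProc-based.

```python
from collections import defaultdict


def triage(answers, required=3):
    """answers: iterable of (hit_id, worker_id, label) tuples.
    Returns (accepted, flagged): accepted maps hit_id -> agreed label,
    flagged lists the hit_ids that need human review."""
    by_hit = defaultdict(list)
    for hit_id, _worker_id, label in answers:
        by_hit[hit_id].append(label)

    accepted, flagged = {}, []
    for hit_id, labels in by_hit.items():
        # Accept only when enough checkers answered and all of them agree.
        if len(labels) >= required and len(set(labels)) == 1:
            accepted[hit_id] = labels[0]
        else:
            flagged.append(hit_id)
    return accepted, flagged


sample = [
    ("h1", "w1", "relevant"), ("h1", "w2", "relevant"), ("h1", "w3", "relevant"),
    ("h2", "w1", "relevant"), ("h2", "w2", "not relevant"), ("h2", "w3", "relevant"),
]
print(triage(sample))
```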
Indirect experiment design
So, the outline of what we need for hosting our own forms is
- Python fast-CGI (use moinmoin as a model) script with taskid as parameter
Returns an HTML form with taskid and count as hidden parameters. Hidden parameters are done with a display:none style element.
- Do we handle this with a switch visible in the pipelines, or separate states/pipelines etc.?
- For now, we'll go with separate states/pipelines
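A minimal sketch of the form-rendering part of such a script, assuming the request's query string is available (for example from QUERY_STRING in a CGI or FastCGI environment). The function name, the form action and the submit button are placeholders; only taskid, count and the display:none approach come from the notes above.

```python
import html
from urllib.parse import parse_qs


def render_task_form(query_string, action="/cgi-bin/submit"):
    """Build the HTML form for one task from the request's query string."""
    params = parse_qs(query_string)
    # Escape everything that ends up inside an attribute value.
    taskid = html.escape(params.get("taskid", [""])[0])
    count = html.escape(params.get("count", ["0"])[0])
    return (
        f'<form method="post" action="{html.escape(action)}">\n'
        f'  <input name="taskid" value="{taskid}" style="display:none">\n'
        f'  <input name="count" value="{count}" style="display:none">\n'
        '  <input type="submit" value="Submit">\n'
        '</form>'
    )


print(render_task_form("taskid=t42&count=3"))
```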