
Automatic Set Expansion for List Question Answering

Richard C. Wang, Nico Schlaefer, William W. Cohen, and Eric Nyberg

Language Technologies Institute

Carnegie Mellon University

5000 Forbes Avenue

Pittsburgh, PA 15213

{rcwang,nico,wcohen,ehn}@cs.cmu.edu

Abstract

This paper explores the use of set expansion (SE) to improve question answering (QA) when the expected answer is a list of entities belonging to a certain class. Given a small set of seeds, SE algorithms mine textual resources to produce an extended list including additional members of the class represented by the seeds. We explore the hypothesis that a noise-resistant SE algorithm can be used to extend candidate answers produced by a QA system and generate a new list of answers that is better than the original list produced by the QA system. We further introduce a hybrid approach which combines the original answers from the QA system with the output from the SE algorithm. Experimental results for several state-of-the-art QA systems show that the hybrid system performs better than the QA systems alone when tested on list question data from past TREC evaluations.

1 Introduction

Question answering (QA) systems are designed to retrieve precise answers to questions posed in natural language. A list question expects a list as its answer, e.g. the coffee-producing countries in South America. The ability to answer list questions has been tested as part of the yearly TREC QA evaluation (Dang et al., 2006; Dang et al., 2007). This paper focuses on the use of set expansion to improve list question answering. A set expansion (SE) algorithm receives as input a few members of a class or set, and mines various textual resources (e.g. web pages) to produce an extended list including additional members of the class or set that are not in the input. A well-known online SE system is Google Sets¹. This system is publicly accessible, but since it is a proprietary system that might be changed at any time, its results cannot be replicated reliably. We explore the hypothesis that a SE algorithm, when carefully designed to handle noisy inputs, can be applied to the output from a QA system to produce an overall list of answers for a given question that is better than the answers produced by the QA system itself. We propose to exploit large, redundant sources of structured and/or semi-structured data and use linguistic analysis to seed a shallow analysis of these sources. This is a hard problem since the linguistic evidence used as seeds is noisy. More precisely, we combine the QA system Ephyra (Schlaefer et al., 2007) with the SE system SEAL (Wang and Cohen, 2007) to create a hybrid approach that performs better than either system by itself when tested on data from the TREC 13-15 evaluations. In addition, we apply our SE algorithm to answers generated by the five QA systems that performed the best on the list questions in the TREC 15 evaluation and report improvements in F1 scores for four of these systems.

Section 2 of the paper gives an overview of the QA and SE systems used for our experiments. Section 3 describes how the SE system was adapted to deal with noisy seeds produced by QA systems, and Section 4 presents the details of the experimental design. Experimental results are discussed in Section 5, and the paper concludes in Section 6 with a discussion of planned future work.

¹ http://labs.google.com/sets

2 System Overview

2.1 Ephyra Question Answering System

Ephyra (Schlaefer et al., 2006; Schlaefer et al., 2007) is a QA system that has been evaluated in the TREC QA track (Dang et al., 2006; Dang et al., 2007). The system combines three answer extraction techniques for factoid and list questions: (1) an answer type classification approach; (2) a syntactic pattern learning and matching approach; and (3) a semantic extractor that uses a semantic role labeling system. The answer type based extractor classifies questions by their answer types and extracts candidates of the expected types. The Ephyra pattern matching approach learns textual patterns that relate question key terms to possible answers and applies these patterns to candidate sentences to extract factoid answers. The semantic approach generates a semantic representation of the question that is based on predicate-argument structures and extracts answer candidates from similar structures in the corpus. The source code of the answer extractors is included in OpenEphyra, an open source release of the system.²

The answer candidates from these extractors are combined and ranked by a statistical answer selection framework (Ko et al., 2007), which estimates the probability of an answer based on a number of answer validation and similarity features. Validation features use resources such as gazetteers and Wikipedia to verify an answer, whereas similarity features measure the syntactic and semantic similarity to other candidates, e.g. using string distance measures and WordNet relations.

2.2 Set Expander for Any Language (SEAL)

Set expansion (SE) refers to expanding a given partial set of objects into a more complete set. SEAL³ (Wang and Cohen, 2007) is a SE system which accepts input elements (seeds) of some target set S_t and automatically finds other probable elements of S_t in semi-structured documents such as web pages. SEAL also works on unstructured text, but its extraction mechanism benefits from structuring elements such as HTML tags. The algorithm is independent of the human language from which the seeds are taken, and also independent of the markup language used to annotate the documents. Examples of SEAL's input and output are shown in Figure 1.

² http://www.ephyra.info/
³ http://rcwang.com/seal

Figure 1: Examples of SEAL's input and output. English entities are reality TV shows, Chinese entities are popular Taiwanese food, and Japanese entities are famous cartoon characters.

In more detail, SEAL comprises three major components: the Fetcher, the Extractor, and the Ranker. The Fetcher focuses on retrieving web pages. The URLs of the web pages come from top results retrieved from Google and Yahoo! using the concatenation of all seeds as the query. The Extractor automatically constructs page-specific extraction rules, or wrappers, for each page that contains the seeds. Every wrapper is defined by two character strings, which specify the left-context and right-context necessary for an entity to be extracted from a page. These strings are chosen to be maximally-long contexts that bracket at least one occurrence of every seed string on a page. Most of the wrappers contain HTML tags, which illustrates the importance of structuring information in the source documents. All entity mentions bracketed by these contextual strings derived from a particular page are extracted from the same page. Finally, the Ranker builds a graph, and then ranks the extracted mentions globally based on the weights computed by performing a random graph walk.

Figure 2: An example graph constructed by SEAL. Every edge from node x to y actually has an inverse relation edge from node y to x that is not shown here (e.g. m_1 is extracted by w_1).

An example graph is shown in Figure 2, where each node d_i represents a document, w_i a wrapper, and m_i an extracted entity mention. The graph models the relationship between documents, wrappers, and mentions. In order to measure the relative importance of each node within the graph, the Ranker performs a graph walk until all node weights converge. The idea is that nodes are weighted higher if they are connected to many other highly weighted nodes.
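To make the graph-walk weighting concrete, the sketch below implements a generic random walk with restart over a small document-wrapper-mention graph in the spirit of Figure 2. It is not SEAL's actual Ranker: the restart probability, the uniform transition weights, the convergence tolerance, the toy edge list, and the function name are all illustrative assumptions.

```python
from collections import defaultdict

def rank_nodes(edges, restart=0.15, tol=1e-8, max_iter=200):
    """Iterate a random walk with restart until all node weights converge.

    edges: list of (source, target) pairs; each edge is also traversed in the
    inverse direction, mirroring the inverse-relation edges of Figure 2.
    Returns a dict mapping node -> stationary weight.
    """
    adj = defaultdict(set)
    for x, y in edges:          # build a symmetric adjacency list
        adj[x].add(y)
        adj[y].add(x)

    nodes = list(adj)
    weight = {n: 1.0 / len(nodes) for n in nodes}   # uniform initial weights

    for _ in range(max_iter):
        new = {n: restart / len(nodes) for n in nodes}
        for n in nodes:
            share = (1.0 - restart) * weight[n] / len(adj[n])
            for m in adj[n]:
                new[m] += share                     # mass flows to neighbours
        delta = sum(abs(new[n] - weight[n]) for n in nodes)
        weight = new
        if delta < tol:                             # weights have converged
            break
    return weight

# Toy graph: documents d, wrappers w, extracted mentions m.
edges = [("d1", "w1"), ("d1", "m1"), ("w1", "m1"),
         ("w1", "m2"), ("d2", "w2"), ("w2", "m2")]
print(sorted(rank_nodes(edges).items(), key=lambda kv: -kv[1]))
```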

We apply this SE algorithm to answer candidates for list questions generated by Ephyra and other TREC QA systems to find additional instances of correct answers that were not in the original candidate set.

3 Proposed Approach

SEAL was originally designed to handle only relevant input seeds. When provided with a mixture of relevant and irrelevant answers from a QA system, the performance would suffer. In this section, we propose three modifications to SEAL to improve its ability to handle noisy input seeds.

3.1 Aggressive Fetcher

For each expansion, SEAL's fetcher concatenates all seeds and sends them as one query to the search engines. However, when the seeds are noisy, the documents fetched are constrained by the irrelevant seeds, which decreases the chance of finding good documents. To overcome this problem, we designed an aggressive fetcher (AF) that increases the chance of composing queries containing only relevant seeds. It sends a two-seed query for every possible pair of seeds to the search engines. If there are n input seeds, then the total number of queries sent would be $\binom{n}{2} = n(n-1)/2$. For example, suppose SEAL is given a set of noisy seeds: Boston, Seattle and Carnegie-Mellon (assuming Carnegie-Mellon is irrelevant), then by using AF, one query will contain only relevant seeds (as shown in Table 1). The documents are then collected and sent to SEAL's extractor for learning wrappers.

       Queries                                Quality
-AF    #1: Boston Seattle Carnegie-Mellon     Low
+AF    #1: Boston Seattle                     High
       #2: Boston Carnegie-Mellon             Low
       #3: Seattle Carnegie-Mellon            Low

Table 1: Example queries and their quality given the seeds Boston, Seattle and Carnegie-Mellon, where Carnegie-Mellon is assumed to be irrelevant.
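As a rough sketch of the Aggressive Fetcher's query construction (the search-engine calls themselves are omitted), the snippet below enumerates every unordered pair of seeds, yielding the n(n-1)/2 queries described above. The function name is hypothetical, and the example seeds follow Table 1.

```python
from itertools import combinations

def aggressive_queries(seeds):
    """Return one query per unordered pair of seeds (n choose 2 queries)."""
    return [" ".join(pair) for pair in combinations(seeds, 2)]

print(aggressive_queries(["Boston", "Seattle", "Carnegie-Mellon"]))
# ['Boston Seattle', 'Boston Carnegie-Mellon', 'Seattle Carnegie-Mellon']
```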

3.2 Lenient Extractor

SEAL's extractor requires the longest common contexts to bracket at least one instance of every seed per web page. However, when seeds are noisy, such common contexts usually do not exist or are too short to be useful. To solve this problem, we propose a lenient extractor (LE) which only requires the contexts to bracket at least one instance of a minimum of two seeds, instead of every seed. This increases the chance of finding longest common contexts that bracket only relevant seeds. For instance, suppose SEAL is given the seeds from the previous example (Boston, Seattle and Carnegie-Mellon) and the passage below. Then the extractor would learn the wrappers shown in Table 2.

“While attending a hearing in Boston City Hall, Alan, a professor at Boston University, met Tina, his former student at Seattle University, who is studying at Carnegie-Mellon University Art School and will be working in Seattle City Hall.”

       Learned Wrappers
-LE    #1: at [...] University
+LE    #1: at [...] University
       #2: in [...] City Hall

Table 2: Wrappers learned by SEAL's extractor when given the passage in Section 3.2 and the seeds Boston, Seattle and Carnegie-Mellon.

As illustrated, with lenient extraction, SEAL is now able to learn the second wrapper because it brackets one instance of at least two seeds (Boston and Seattle). This can be very helpful if the list question is asking for city names rather than university names. The extractor then uses these wrappers to extract additional answer candidates, by searching for other strings that fit into the placeholders of the wrappers. Note that the example was simplified for ease of presentation. The wrappers are actually character-based (as opposed to word-based) and are likely to contain HTML tags when generated from real web pages.
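The sketch below illustrates lenient wrapper learning on the plain-text passage above, using character-based contexts as described. It is a simplification of SEAL's Extractor, not its implementation: a hypothetical minimum context length stands in for the maximally-long-context requirement, HTML is ignored, and all function names are illustrative.

```python
import re
from itertools import combinations

def common_suffix(a, b):
    """Longest common suffix of two strings (the shared left-context)."""
    i = 0
    while i < min(len(a), len(b)) and a[-1 - i] == b[-1 - i]:
        i += 1
    return a[len(a) - i:]

def common_prefix(a, b):
    """Longest common prefix of two strings (the shared right-context)."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]

def lenient_wrappers(page, seeds, min_len=4):
    """Learn (left, right) contexts that bracket at least two distinct seeds."""
    wrappers = set()
    for s1, s2 in combinations(seeds, 2):
        starts1 = [m.start() for m in re.finditer(re.escape(s1), page)]
        starts2 = [m.start() for m in re.finditer(re.escape(s2), page)]
        for i in starts1:
            for j in starts2:
                left = common_suffix(page[:i], page[:j])
                right = common_prefix(page[i + len(s1):], page[j + len(s2):])
                if len(left) >= min_len and len(right) >= min_len:
                    wrappers.add((left, right))
    return wrappers

def apply_wrappers(page, wrappers):
    """Extract the strings that fit into each wrapper's placeholder."""
    found = set()
    for left, right in wrappers:
        pattern = re.escape(left) + r"(.+?)" + re.escape(right)
        found.update(m.group(1) for m in re.finditer(pattern, page))
    return found

passage = ("While attending a hearing in Boston City Hall, Alan, a professor at "
           "Boston University, met Tina, his former student at Seattle University, "
           "who is studying at Carnegie-Mellon University Art School and will be "
           "working in Seattle City Hall.")
seeds = ["Boston", "Seattle", "Carnegie-Mellon"]
wrappers = lenient_wrappers(passage, seeds)
print(apply_wrappers(passage, wrappers))
# a set containing 'Boston', 'Seattle' and 'Carnegie-Mellon'
```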

3.3 Hinted Expander

Most QA systems use keywords from the question to guide the retrieval of relevant documents and the extraction of answer candidates. We believe these keywords are also important for SEAL to identify additional instances of correct answers. For example, if the seeds are George Washington, John Adams, and Thomas Jefferson, then without using any context from the question, SEAL would output a mixture of founding fathers and presidents of the U.S.A. To solve this problem, we devised a hinted expansion (HE) technique that utilizes the context given in the question to constrain SEAL's search space on the Web. This is achieved by appending keywords from the question to every query that is sent to the search engines. The rationale is that the retrieved documents will also match the keywords, which may increase the chance of finding those documents that contain our desired set of answers.

4 Experimental Design

We conducted experiments in two phases. In the first phase, we evaluated the SE approach by applying SEAL to answers generated by Ephyra. In the second phase, we evaluated the approach by applying SEAL to the output from QA systems that performed the best on the list questions in the TREC 15 evaluation. In both phases, the answers found by SEAL were retrieved from the Web instead of the AQUAINT newswire corpus used in the TREC evaluations. However, we rejected answers if they could only be found on the Web and not in the AQUAINT corpus, to avoid an unfair advantage over the QA systems: TREC participants were allowed to extract candidates from the Web (or any other source), but they had to identify a supporting document in the AQUAINT corpus for each answer and thus could not return answers that were not covered by the corpus.

Preliminary experiments showed that we can obtain a good balance between the amount and quality of the documents fetched by using only rare question terms as hint words. In particular, we select the three question words that occur least frequently in a sample of the AQUAINT corpus as hints. The candidate answers were evaluated by using the answer keys, composed of regular expression patterns, obtained from the TREC website. We did not extend the patterns with additional correct answers found in our experiments. These answer keys were not officially used in the TREC evaluation; thus the baseline scores we computed for Ephyra and other QA systems in our experiments are slightly different from those officially reported.
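A minimal sketch of hint-word selection and the Hinted Expander's query construction, assuming a precomputed word-frequency table in place of counts from an AQUAINT sample; the frequencies, the example question, and the function names are illustrative rather than taken from the paper's data.

```python
import re
from collections import Counter

def select_hint_words(question, corpus_counts, num_hints=3):
    """Pick the question words that occur least frequently in the corpus sample."""
    words = re.findall(r"\w+", question.lower())
    return sorted(set(words), key=lambda w: corpus_counts.get(w, 0))[:num_hints]

def hinted_query(seed_pair, hints):
    """Append the hint words to a seed query (Hinted Expander)."""
    return " ".join(list(seed_pair) + hints)

# Hypothetical corpus frequencies standing in for counts from an AQUAINT sample.
corpus_counts = Counter({"name": 120000, "titles": 9000, "of": 5000000,
                         "movies": 15000, "other": 800000, "than": 900000,
                         "superman": 40, "that": 4000000, "christopher": 2500,
                         "reeve": 30, "acted": 7000, "in": 6000000})
question = ("Name titles of movies, other than Superman movies, "
            "that Christopher Reeve acted in.")
hints = select_hint_words(question, corpus_counts)
print(hints)   # ['reeve', 'superman', 'christopher']
print(hinted_query(("Rear Window", "The Remains of the Day"), hints))
```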

4.1 Ephyra

We evaluated our SE approach on Ephyra using the list questions from TREC 13, 14, and 15 (55, 93, and 89 questions, respectively). For each question, the top four answer candidates from Ephyra were given as input seeds to SEAL. Initial experiments showed that by adding additional seeds, the effectiveness of our approach can be improved at the expense of a longer runtime.

We report both mean average precision (MAP) and F1 scores. For the F1 scores, we drop answer candidates with low confidence scores by applying a relative cut-off threshold: an answer candidate is dropped if the ratio of its confidence score and the score of the top answer is below a threshold. An optimal threshold for a question is a threshold that maximizes the F1 score for that particular question. For each TREC dataset, we conducted three experiments: (1) evaluation of answer candidates using MAP; (2) evaluation using average F1 with an optimal threshold for each question; and (3) evaluation using average F1 with thresholds trained by 5-fold cross validation. For each of those 5-fold validations, only one threshold was determined for all questions in the training folds.
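The sketch below spells out the relative cut-off and the per-question optimal threshold, simplified in that answers are matched by exact string comparison rather than against the TREC regular-expression patterns; the function names, the threshold grid, and the toy data are illustrative.

```python
def apply_relative_cutoff(candidates, threshold):
    """Keep candidates whose confidence is at least threshold * top confidence.

    candidates: list of (answer, confidence) pairs for one question.
    """
    if not candidates:
        return []
    top = max(conf for _, conf in candidates)
    return [ans for ans, conf in candidates if conf >= threshold * top]

def f1_score(predicted, gold):
    """F1 of a predicted answer list against a gold answer set."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

def optimal_threshold(candidates, gold, grid=None):
    """Per-question optimal threshold: the relative cut-off maximizing F1."""
    grid = grid or [i / 100 for i in range(101)]
    return max(grid, key=lambda t: f1_score(apply_relative_cutoff(candidates, t), gold))

candidates = [("colombia", 0.9), ("peru", 0.7), ("france", 0.2)]
gold = {"colombia", "peru", "ecuador"}
t = optimal_threshold(candidates, gold)
print(t, f1_score(apply_relative_cutoff(candidates, t), gold))
# any threshold in roughly [0.23, 0.77] is optimal here (F1 = 0.8)
```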

           Ephyra    Ephyra's      SEAL      SEAL+LE   SEAL+LE+AF   SEAL+LE+AF+HE
                     Top 4 Ans.
TREC 13    25.95%    21.39%        23.76%    31.43%    34.22%       35.26%
TREC 14    14.45%     8.71%        14.47%    17.04%    16.58%       18.82%
TREC 15    13.42%     9.02%        13.17%    16.87%    17.12%       18.95%

Table 3: Mean average precision of Ephyra, its top four answers, and various SEAL configurations, where LE is Lenient Extractor, AF is Aggressive Fetcher, and HE is Hinted Expander.

           Ephyra    Ephyra's      SEAL      SEAL+LE   SEAL+LE+AF   SEAL+LE+AF+HE
                     Top 4 Ans.
TREC 13    35.74%    26.29%        30.53%    36.47%    40.08%       40.80%
TREC 14    22.83%    14.05%        20.62%    22.81%    22.66%       24.88%
TREC 15    22.42%    14.57%        19.88%    23.30%    24.04%       25.65%

Table 4: Average F1 of Ephyra, its top four answers, and various SEAL configurations when using an optimal threshold for each question.

4.2 Top QA Systems

We evaluated two SE approaches, SEAL and Google Sets, on the five QA systems that performed the best on the list questions in TREC 15. For each question, the top four answer candidates⁴ from those systems were given as input seeds to SEAL and Google Sets. Unlike the candidates found by Ephyra, these candidates were provided without confidence scores; hence, we assumed they all have a score of 1.0. In our experiments with SEAL, we first determined a single threshold that optimizes the average of the F1 scores of the top five systems in both TREC 13 and 14. We then obtained evaluation results for the top systems in TREC 15 by using this trained threshold. When performing hinted expansion, the keywords (or hint words) for each question were extracted by Ephyra's question analysis component. In our experiments with Google Sets, we requested Small Sets of items and again measured the performance in terms of F1 scores. We also tried requesting Large Sets but the results were worse.

5 Results and Discussion

In Tables 3 and 4, we present evaluation results for all answers from Ephyra, only the top four answers, and various configurations of SEAL using the top four answers as seeds. Table 3 shows the MAP for each dataset (TREC 13, 14, and 15), and Table 4 shows for each dataset the average F1 score when using optimal per-question thresholds. The results indicate that SEAL achieves the best performance when configured with all three proposed extensions. In terms of MAP, the best-configured SEAL improves the quality of the input answers (relatively) by 65%, 116%, 110% for each dataset respectively, and improves Ephyra's overall performance by 36%, 30%, 41%. In terms of optimal F1, SEAL improves the quality of the input answers by 55%, 77%, 76% and Ephyra's overall performance by 14%, 9%, 14% respectively. These results illustrate that a SE system is capable of improving a QA system's performance on list questions, if we know how to select good thresholds.

⁴ Obtained from http://trec.nist.gov/results

In practice, the thresholds are unknown and must be estimated from a training set. Table 5 shows evaluation results using 5-fold cross validation for each dataset (TREC 13, 14, and 15) independently, and the combination of all three datasets (All). For each validation, we determine the threshold that maximizes the F1 score on the training folds, and we also determine the F1 score on the test fold by applying the trained threshold. We repeat this validation for each of the five test folds and present the average threshold and F1 score for each configuration and dataset. The F1 scores give an estimate of the performance on unseen data and allow a fair comparison across systems.
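A sketch of this threshold training, reusing f1_score and apply_relative_cutoff from the Section 4.1 sketch above; the round-robin fold assignment and the threshold grid are assumptions, since the paper does not specify how folds were constructed.

```python
def train_threshold(question_sets, grid=None):
    """Pick the single cut-off that maximizes average F1 over a set of questions.

    question_sets: list of (candidates, gold) pairs, one per question, in the
    format used by apply_relative_cutoff / f1_score defined earlier.
    """
    grid = grid or [i / 100 for i in range(101)]
    def avg_f1(t):
        return sum(f1_score(apply_relative_cutoff(c, t), g)
                   for c, g in question_sets) / len(question_sets)
    return max(grid, key=avg_f1)

def cross_validate(question_sets, folds=5):
    """5-fold CV: train a threshold on four folds, score it on the held-out fold."""
    scores, thresholds = [], []
    for k in range(folds):
        test = question_sets[k::folds]   # simple round-robin fold assignment
        train = [q for i, q in enumerate(question_sets) if i % folds != k]
        t = train_threshold(train)
        thresholds.append(t)
        scores.append(sum(f1_score(apply_relative_cutoff(c, t), g)
                          for c, g in test) / len(test))
    return sum(scores) / folds, sum(thresholds) / folds   # avg F1, avg threshold
```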

           Ephyra                   SEAL+LE+AF+HE            Hybrid
           Avg. F1   Avg. Thresh.   Avg. F1   Avg. Thresh.   Avg. F1   Avg. Thresh.
TREC 13    25.55%    0.3808         30.71%    0.3257         29.04%    0.0796
TREC 14    15.78%    0.2636         15.60%    0.1889         17.13%    0.0108
TREC 15    15.19%    0.1192         15.64%    0.2581         16.47%    0.0123
All        18.03%    0.2883         19.15%    0.2606         19.59%    0.0164

Table 5: Average F1 of Ephyra, the best-configured SEAL, and the hybrid system, along with thresholds trained by 5-fold cross validation.

TREC 15         Baseline   Top 4 Ans.   Google Sets           SEAL+LE+AF+HE         Hybrid
QA Systems      Avg. F1    Avg. F1      Avg. F1    ΔF1        Avg. F1    ΔF1        Avg. F1    ΔF1
lccPA06         44.96%     32.67%       37.89%     -15.72%    40.00%     -11.04%    45.30%      0.76%
cuhkqaepisto    18.27%     17.02%       15.96%     -12.68%    19.75%       8.08%    19.13%      4.70%
NUSCHUAQA1      18.40%     14.99%       16.70%      -9.21%    18.74%       1.86%    18.06%     -1.81%
FDUQAT15A       19.71%     14.32%       18.79%      -4.63%    19.78%       0.38%    20.61%      4.57%
QACTIS06C       17.52%     15.22%       17.05%      -2.72%    18.45%       5.26%    18.38%      4.85%
Average         23.77%     18.84%       21.28%     -10.49%    23.34%      -1.81%    24.30%      2.20%

Table 6: Average F1 of the QA systems, their top four answers, Google Sets, the best-configured SEAL, the hybrid system, and their relative improvements over the QA systems.

Here, we also introduce a hybrid system (Hybrid) that intersects the answers found by both systems by multiplying their probabilistic scores.
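A minimal sketch of this combination, assuming both systems report answers as normalized strings with confidence scores in [0, 1] and that the relative cut-off from Section 4.1 is applied to the combined scores; the function name and the toy data are illustrative.

```python
def hybrid_answers(qa_scores, seal_scores, threshold):
    """Intersect QA and SEAL answers, multiplying their probabilistic scores.

    qa_scores, seal_scores: dicts mapping a normalized answer string to a
    confidence in [0, 1]. Answers found by only one system are discarded.
    """
    combined = {a: qa_scores[a] * seal_scores[a]
                for a in qa_scores.keys() & seal_scores.keys()}
    if not combined:
        return []
    top = max(combined.values())
    # Apply the relative cut-off to the combined scores.
    return sorted(a for a, s in combined.items() if s >= threshold * top)

print(hybrid_answers({"rear window": 0.9, "superman": 0.6},
                     {"rear window": 0.8, "the bostonians": 0.7},
                     threshold=0.25))
# ['rear window']
```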

Tables 3, 4, and 5 show that the effectiveness of the SE approach depends on the quality of the initial answer candidates. The improvements are most apparent for the TREC 13 dataset, where Ephyra has a much higher performance compared to TREC 14 and 15. However, the best-configured SEAL did not improve the F1 score on TREC 14, as reported in Table 5. We suspect that this is due to the comparatively low quality of Ephyra's top four answers for this dataset. The experiments also illustrate that by intersecting the answer candidates found by Ephyra and SEAL, we can eliminate poor answer candidates and partially compensate for the low precision of Ephyra on the harder TREC datasets. However, this comes at the expense of a lower recall, which slightly hurts the performance on the comparatively easier TREC 13 questions. We also evaluated Google Sets on the top four answers from Ephyra for TREC 13-15 and obtained F1 scores of 12%, 11%, and 9% respectively (compared to 29%, 17%, and 16% for our hybrid approach with trained thresholds).

Table 6 shows F1 scores for the SE approach applied to the output from the five QA systems with the highest performance on the list questions in TREC 15. Again, Hybrid intersects the answers found by the QA system and SEAL by multiplying their confidence scores. Two thresholds were trained separately on the top five systems in both TREC 13 and 14; one for SEAL (0.2376) and another for Hybrid (0.2463). As shown, the performance of Google Sets is worse than SEAL and Hybrid, but better than the top four answers on average. We believe our SE system outperforms Google Sets because we have methods to handle noisy inputs (i.e. AF, LE) and a method for guiding the SE algorithm to search in the right space on the Web (i.e. HE).

The results show that both SEAL and Hybrid are capable of improving four out of the five systems. We observed that one reason why SEAL did not improve "lccPA06" was the incompleteness of the answer keys. Table 7 shows one of many examples where SEAL was penalized for finding additional correct answers. As illustrated, Hybrid improved all systems except "NUSCHUAQA1". The reason is that even though SEAL improved the baseline, their overlapping answer set is too small, thus hurting the recall of Hybrid substantially.

Question 154.6: Name titles of movies, other than "Superman" movies, that Christopher Reeve acted in.

lccPA06 (F1: 75%)            SEAL+LE+AF+HE (F1: 40%)
+ Rear Window                + Rear Window
+ The Remains of the Day     + The Remains of the Day
+ Snakes and Ladders         - The Bostonians
- Superman                   - Somewhere in Time
                             - Village of the Damned
                             - In the Gloaming

Table 7: Example of SEAL being penalized for finding correct answers (all are correct except the last one). Answers found in the answer keys are marked with "+". All four answers from "lccPA06" were used as seeds.

Question 170.6: What are the titles of songs written by John Prine?

NUSCHUAQA1 (F1: 25%)               SEAL+LE+AF+HE (F1: 44%)
+ I Just Want to Dance With You    + I Just Want to Dance With You
- Titled In Spite of Ourselves     + Christmas in Prison
+ Christmas in Prison              + Sam Stone
- Grammy-Winning                   - Grandpa was a Carpenter
                                   - Sabu Visits the Twin Cities Alone
                                   + Angel from Montgomery

Table 8: Example demonstrating SEAL's ability to handle noisy input seeds. All four answers from "NUSCHUAQA1" were used as seeds. Again, SEAL is penalized for finding correct answers (all answers are correct).

Unfortunately, for the top TREC 15 systems we only had access to the answers that were actually submitted by the participants, whereas for Ephyra we could utilize the entire list of generated answer candidates, including those that fell below the cutoff threshold for list questions. Nevertheless, the hybrid approach could improve the baseline by more than 2% on average in terms of F1 score. Table 8 shows that the best-configured SEAL is capable of expanding only the relevant seeds even when given a set of noisy seeds. Neither Google Sets nor the original SE algorithm without the proposed extensions could expand these seeds with additional candidates.

For each of the sampled list questions, SEAL requires on average about 5 seconds for querying the search engines, 10 seconds for crawling the Web, 20 seconds for extracting answer candidates from the web pages, and 5 seconds for ranking the candidates. Note that the SE system has not been optimized extensively. The runtime of the web page retrieval step and much of the search is due to network latency and can be reduced if the search is performed locally.

6 Conclusion and Future Work

We have shown that our SE approach is capable of improving the performance of QA systems on list questions by utilizing only their top four answer candidates as seeds. We have also illustrated a feasible and effective method for integrating a SE approach into any QA system. We would like to emphasize that for each of the experiments we conducted, all that the SE system received as input were the top four noisy answers from a QA system and three keywords from the TREC questions. We have shown that higher quality candidates support more effective set expansion. In the future, we will investigate how to utilize more answer candidates from the QA system and determine the minimal quality of those candidates required for the SE approach to make an improvement.

We have also shown that, in terms of F1 scores with trained thresholds, the hybrid method improves the Ephyra QA system on all datasets and also improves four out of the five systems that performed the best on the list questions in TREC 15. However, the final list of answers only comprises candidates found by both the QA system and the SE algorithm. In future experiments, we will investigate other methods of merging answer candidates, such as taking the union of answers from both systems. We expect further improvements from adding candidates that are found only by the QA system, but it is unclear how the confidence measures from the two systems can be combined effectively.

We would also like to emphasize that the SE approach is entirely language independent, and thus can be readily applied to answer candidates in other languages. In future experiments, we will investigate its performance on question answering tasks in languages such as Chinese and Japanese.

As pointed out previously, the performance of the SE approach highly depends on the accuracy of the seeds. However, QA systems are usually not optimized to provide a few high-precision results, but treat precision and recall as equally important. This leaves room for further improvements, e.g. by applying stricter answer validation techniques to the seeds used for SE.

We also plan to analyze the effectiveness of our approach across different question types and evaluate it on more complex questions such as the rigid list questions in the new TAC QA evaluation, which ask for opinion holders and subjects.

Acknowledgements

This work was supported in part by the Google Research Awards program, IBM Open Collaboration Agreement #W0652159, and the Defense Advanced Research Projects Agency (DARPA) under Contract No. NBCHD030010.

References

H. T. Dang, J. Lin, and D. Kelly. 2006. Overview of the TREC 2006 question answering track. Proceedings of the Fifteenth Text REtrieval Conference.

H. T. Dang, D. Kelly, and J. Lin. 2007. Overview of the TREC 2007 question answering track. Proceedings of the Sixteenth Text REtrieval Conference.

J. Ko, L. Si, and E. Nyberg. 2007. A probabilistic framework for answer selection in question answering. Proceedings of NAACL-HLT.

N. Schlaefer, P. Gieselmann, and G. Sautter. 2006. The Ephyra QA system at TREC 2006. Proceedings of the Fifteenth Text REtrieval Conference.

N. Schlaefer, G. Sautter, J. Ko, J. Betteridge, M. Pathak, and E. Nyberg. 2007. Semantic extensions of the Ephyra QA system in TREC 2007. To appear in: Proceedings of the Sixteenth Text REtrieval Conference.

R. C. Wang and W. W. Cohen. 2007. Language-independent set expansion of named entities using the web. Proceedings of the IEEE International Conference on Data Mining.
