Sunday, August 24, 2008

Set Retrieval vs. Ranked Retrieval

After last week's post about a racially targeted web search engine, you'd think I'd avoid controversy for a while. To the contrary, I now feel bold enough to bring up what I have found to be my most controversial position within the information retrieval community: my preference for set retrieval over ranked retrieval.

This will be the first of several posts along this theme, so I'll start by introducing the terms.
  • In a ranked retrieval approach, the system responds to a search query by ranking all documents in the corpus based on its estimate of their relevance to the query.

  • In a set retrieval approach, the system partitions the corpus into two subsets of documents: those it considers relevant to the search query, and those it does not.
An information retrieval system can combine set retrieval and ranked retrieval by first determining a set of matching documents and then ranking the matching documents. Most industrial search engines, such as Google, take this approach, at least in principle. But, because the set of matching documents is typically much larger than the set of documents displayed to a user, these approaches are, in practice, ranked retrieval.

What is set retrieval in practice? In my view, a set retrieval approach satisfies two expectations:
  • The number of documents reported to match my search should be meaningful--or at least should be a meaningful estimate. More generally, any summary information reported about this set should be useful.

  • Displaying a random subset of the set of matching documents to the user should be a plausible behavior, even if it is not as good as displaying the top-ranked matches. In other words, relevance ranking should help distinguish more relevant results from less relevant results, rather than distinguishing relevant results from irrelevant results.
Despite its popularity, the ranked retrieval model suffers because it does not provide a clear split between relevant and irrelevant documents. This weakness makes it impossible to obtain even basic analysis of the query results, such as the number of relevant documents, let alone more sophisticated analysis, such as result quality. In contrast, a set retrieval model partitions the corpus into two subsets of documents: those that are considered relevant, and those that are not. A set retrieval model does not rank the retrieved documents; instead, it establishes a clear split between documents that are in and out of the retrieved set. As a result, set retrieval models enable rich analysis of query results, which can then be applied to improve the user experience.
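To make the two models concrete, here is a minimal sketch in Python: a toy corpus, a deliberately crude matching rule, and a made-up scoring function, all invented for illustration rather than drawn from any real system.

```python
# Toy illustration of set retrieval vs. ranked retrieval.
# The corpus, matching rule, and scoring function are all invented for this example.

corpus = {
    1: "set retrieval partitions the corpus into matching and non-matching documents",
    2: "ranked retrieval orders every document by an estimated relevance score",
    3: "faceted navigation works best over a well-defined result set",
    4: "cooking pasta requires salted boiling water",
}

def set_retrieve(query, docs):
    """Set retrieval: return the subset of documents containing every query term."""
    terms = query.lower().split()
    return {doc_id for doc_id, text in docs.items() if all(t in text for t in terms)}

def score(query, text):
    """A crude relevance score: count of query-term occurrences (illustrative only)."""
    return sum(text.count(t) for t in query.lower().split())

def ranked_retrieve(query, docs):
    """Ranked retrieval: order *all* documents by score, highest first."""
    return sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)

def combined_retrieve(query, docs):
    """The hybrid described above: determine the matching set, then rank only that set."""
    matches = set_retrieve(query, docs)
    return sorted(matches, key=lambda d: score(query, docs[d]), reverse=True)

query = "retrieval set"
print("matching set:", set_retrieve(query, corpus))   # a meaningful count: len(...) documents match
print("ranking of everything:", ranked_retrieve(query, corpus))
print("ranked matches:", combined_retrieve(query, corpus))
```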

27 comments:

Max L. Wilson said...

See, this is interesting. One of the things that set retrieval would enable is ranking by novelty.

The idea, to be clear, is to put on the first page of results the 10 that, together, cover the most breadth of a topic. I think it would be really interesting, as part of this, to see novelty-oriented result information, so that the snippet of text you see covers what that result covers that the other results do not. If your dataset has keywords etc. too, then you could show the keywords that it has that the other results do not, perhaps.

I've not done anything specific in this area myself, but I've seen some guys from Spain send out a few papers, including some IP&M papers. In particular, I spoke to David Losada's student about the idea at SIGIR '07.

I suspect it might be more applicable to vertical search. I guess it has most applicability to domain-learning tasks, as prioritizing novelty might exclude your answer that was in result 2 from the relevance ranking.

Daniel Tunkelang said...

Max, that's a good point, but I feel you're understating it. Since relevance ranking is fundamentally a sort order for the results, it precludes any explicit choice of sort order. It only makes sense to sort a set. I'll dive deeper into this topic in an upcoming post.

Michael Bendersky said...

Hi Daniel,
Great post!

I wanted to comment on a connection between set retrieval and ranked retrieval.

I think people often assume the set retrieval model implicitly, without actually stating it. Two immediate examples I can think of are pseudo-relevance feedback and query clarity. In both models, the top-k documents are just assumed to be a relevant set.

I believe that stating the set-retrieval model explicitly might be helpful in understanding both of these models, and also other retrieval/re-ranking models that utilize a subset of the retrieved results.

Anonymous said...

Daniel -

I'm really beginning to enjoy your blog. I discovered it a few weeks ago, and have slowly been digesting your older posts.

On the subject of relevance ranking helping distinguish more relevant results from less relevant results.. I have a vague recollection of some of the early web search engines, circa 1994 or 1995 (Infoseek?) actually displaying their normalized [0..100] relevance scores to the left of every search result. That way you could tell whether the top 10 links were all in the 99 to 92 range, or whether the scores started in the 90s, but then quickly tapered off after the 4th or 5th result to the 70s or 40s or whatever.

It is a very simple, very old idea. But no one does it anymore. Why is that? It's very frustrating. It would be much more useful to get a sense of the relevance-ranking score distribution. It might even be nice to show a binned histogram of the top 1000 retrieved documents, so that you could see not only the range but the distribution of the scores in a summarized, exploratory manner.
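As a rough sketch of the kind of display I mean (the scores below are invented, purely to illustrate the binning):

```python
# Sketch of the score-distribution display described above.
# The scores are invented; a real engine would supply its own normalized [0..100] values.
from collections import Counter

scores = [99, 97, 95, 94, 93, 72, 68, 41, 38, 22]  # hypothetical top-10 normalized scores

# Bin into deciles (90-99, 80-89, ...) and print a crude text histogram.
bins = Counter((s // 10) * 10 for s in scores)
for low in range(90, -1, -10):
    count = bins.get(low, 0)
    print(f"{low:3d}-{low + 9:3d} | {'#' * count}")
```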

David Fauth said...

Jeremy,

I seem to remember that tools like Verity computed a relevance ranking and would allow you to display it. However, as Daniel says, this is already some sort of a ranking/ordering system implying a sort order.

Yahoo has BOSS, which allows developers access to their search engine. What I don't know is whether it is retrieving only a set or if there is an implied ranking with it.

I'm looking forward to the rest of this series.

Daniel Tunkelang said...

Mike, it warms my heart to see you make those particular connections.

Query clarity was actually one of the ideas that made me focus so intently on set retrieval. My colleagues and I at Endeca have taken a query-less approach that evaluates a clarity-like measure for document sets, regardless of how they are obtained. We've implemented lots of applications on top of this measure.

As for pseudo-relevance feedback, I did see a poster at ECIR '08 suggesting ways to optimize the number of documents used for feedback. I can't find it online, but the title was "Robust Query-specific Pseudo Feedback Document Selection for Query Expansion."

Daniel Tunkelang said...

Jeremy, thanks! As for displaying relevance scores, I think the problem is that most ranking approaches don't actually compute a probability of relevance--all they compute is some score you can use for ranking. That makes it impossible to band results or offer a meaningful histogram.

That said, I'm not sure how well calibrated the probabilities were for engines that did offer them.

Daniel Tunkelang said...

David, I'm pretty sure that Yahoo doesn't let you change the ranking. I've never seen any web search API let you do that. But if I'm wrong, please let me know!

Anonymous said...

Hi Daniel,

With Yahoo! BOSS you can re-order the results returned by the API. It is one of the highlighted features at:

http://developer.yahoo.com/search/boss/

Daniel Tunkelang said...

Sérgio, do you mean that you can retrieve the top 50 according to Yahoo's relevance ranking strategy and re-order those, or that you can change the sort order used to determine what are the top 50? My sense is that you can only do the former, but maybe I'm missing something in the documentation.

Anonymous said...

Yes, you are right. Only the former seems to be possible. I misinterpreted the previous comments on this thread.

David Fauth said...

I believe that Yahoo still returns what they believe is the right "set". You only get to reorder the results, not change how the set is obtained.

Anonymous said...

@David F.: I seem to remember that tools like Verity computed a relevance ranking and would allow you to display it. However, as Daniel says, this is already some sort of a ranking/ordering system implying a sort order.

Yes, I know it does imply some sort of order, but I am basically reacting to this statement from Daniel, that says if you're going to order it anyway, you should give the user a way of distinguishing more-relevant from less-relevant:

Displaying a random subset of the set of matching documents to the user should be a plausible behavior, even if it is not as good as displaying the top-ranked matches. In other words, relevance ranking should help distinguish more relevant results from less relevant results, rather than distinguishing relevant results from irrelevant results.

And one (simple, old) way of doing that is simply to show the final overall score for each document. That way you can see if the step from result 3 to result 4 is a score change from 0.947 to 0.921, or a change from 0.947 to 0.223. Being able to see that precipitous drop-off when going from rank 3 to rank 4 would let you see exactly where "more relevant" starts to become "less relevant".

And regarding the Yahoo BOSS API: realize that there is more than one API. There is the standard BOSS API, but there is also a "BOSS Academic" API and a "BOSS Custom" API. I've talked with a few of the guys at Yahoo, and I have the impression that the other two versions of the API (Academic and Custom) do allow you to change how the set is actually produced, rather than only reorder results once you've got them back.

Daniel Tunkelang said...

Well, if the scoring has meaning beyond rank order, I'm all for exposing it. But I think it's rare for a modern information retrieval model to deliver more than a rank order of results. Certainly I don't see the TREC evaluations taking any such scoring into account, which would be straightforward if, say, relevance scores were returned as probabilities.

Re Yahoo: I'd be delighted if they offered the ability to change the overall sort. I'm very skeptical, since allowing that degree of control could have significant implications for the way they'd have to optimize their data structures. I know a little about this stuff from my work at Endeca. :-)

Anonymous said...

But I think it's rare for a modern information retrieval model to deliver more than a rank order of results.

Rare to deliver to the consumer? Or rare to have computed, internally?

If they're not computing some sort of numerical score, then how are they doing the ranking? They're certainly not classifying each result into categories "apple", "banana" and "pear" -- and as everybody knows pears come before bananas. Right? So they must have some sort of number? So even if it's rare to expose that number to the end user, don't they still have that number, somewhere, in the system?

Certainly I don't see the TREC evaluations taking any such scoring into account, which would be straightforward if, say, relevance scores were returned as probabilities.

Well, probabilities would be nice. But they don't have to be probabilities. They just have to be magnitude-differentiated ordinals of some sort. Suppose for example the search engine were some sort of metasearch, doing Borda count ranked list fusion, i.e. the sum of "k - rank(d, E)", where k is a constant (say, 1000), and rank(d, E) is the rank of document (or web page) d returned by search engine E. Take 5 different search engines, figure out at what rank from each engine a document is located, and do the summation across all the engines. (If a document is not returned by an engine, then put its rank = k, so that the score is 0.) Sort or rank by the sum total.

At the end of that process, you'll have an integer-valued sum of integers. No probabilities, no real-valued numbers. But there will still be a gap, an integer difference, between the rank#1 doc and the rank#2 doc, between rank#7 and rank#8, etc.

And so you'll still be able to see when a borda-count fusion score goes from 4,235 to 4,227, versus when a score goes from 4,227 to 2,506.

So this sort of information has to exist, in the engine, in some form, right? Some sort of magnitude-differentiated ordinal? Or are modern search engines sorting without any kind of magnitude? Is that even possible? Even when I'm summing ranks, and each individual sublist starts out with no difference in magnitude from one ranked result to the next, the sum starts to develop a magnitude difference. Is it really that different when combining multi-word queries?
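Here is a quick sketch of that fusion, with made-up engines and rankings; the point is just that the summed scores carry visible magnitude gaps even though each input is only a rank order:

```python
# Borda-count rank fusion, as described above.
# Rankings are hypothetical; the fused scores come out as magnitude-differentiated integers.

K = 1000  # the constant k; an unreturned document gets rank k and contributes k - K = 0

# Each engine returns an ordered list of document ids (made-up results).
engine_results = {
    "engine_a": ["d1", "d2", "d3", "d5"],
    "engine_b": ["d2", "d1", "d4"],
    "engine_c": ["d1", "d3", "d2", "d6"],
}

def borda_fuse(results, k=K):
    """Score each document as the sum over engines of (k - rank), with rank starting at 1.
    Documents an engine does not return get rank k, i.e. contribute 0."""
    all_docs = {d for docs in results.values() for d in docs}
    scores = {}
    for d in all_docs:
        total = 0
        for docs in results.values():
            rank = docs.index(d) + 1 if d in docs else k
            total += k - rank
        scores[d] = total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

for doc, fused_score in borda_fuse(engine_results):
    print(doc, fused_score)  # integer scores with visible gaps between adjacent ranks
```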

Re Yahoo: I'd be delighted if they offered the ability to change the overall sort. I'm very skeptical, since allowing that degree of control could have significant implications for the way they'd have to optimize their data structures. I know a little about this stuff from my work at Endeca. :-)

From talking with the Yahoo guys, in their BOSS Custom offering, they indeed will go out and do some customized index stuff for you. I'm just passing along what they say.

Daniel Tunkelang said...

Jeremy, I meant rare for the algorithms to deliver more than rank ordering.

Here's an excerpt from a paper by Fernando Diaz:

A ranked retrieval model assigns some rank or score to each document in the collection and ranks documents according to the score. The user then scans the documents according to the ranking. The score function for a ranked retrieval model maps documents to real values. Given a query, the model provides a function from documents to scores. The values of the function provide the desired ranking of the documents.

Fernando goes on to examine the behavior of score functions for ranked retrieval models. But I want to emphasize his point that ranked retrieval models only use scores to obtain ranking. They don't promise anything about the scores themselves--e.g., whether the magnitude of difference in score has a meaningful interpretation. Indeed, Fernando's paper is an attempt to obtain useful scores from any scoring approach through regularization.

Re Yahoo: I looked at the BOSS Custom page. It sounds interesting, but delightfully vague. I don't blame them; supporting this sort of customization is a serious endeavor.

Anonymous said...

Ok, I understand ya now. Rare for the search engines to deliver. And so I was just saying that they have the ability to deliver, if they so chose. And it's just unpleasing that they choose not to.

But yes, I now understand your counterpoint also.. that even when there are magnitude differences between scores, those might not have any semantic interpretation. I'd been operating under the assumption that there would be something semantically interpretable if you could see the distribution or histogram of the scores returned by a query. Maybe not a strong one.. but definitely something that the human brain could make use of. But now I realize.. that's just an assumption.

Thanks for reminding me of Fernando's work. I'd read a shorter version of this regularization idea, but this one looks great.. lots more info.

But there has to be other, "HCIR" work out there, looking at this sort of thing? Because I do remember seeing some systems that did actually expose their (unregularized) scores, that did offer that transparency. But I wonder if anyone has done the HCI experiments to see how useful that information really is. You couldn't measure it through clickthroughs or something like that, right? Because it's not necessarily something that is designed to increase your clickthrough. So you'd have to measure it some other way.

Thoughts?

Anonymous said...

The regularization work that you cite always assumes that there is value in the actual distribution of baseline scores (Jeremy talks about this earlier). Regularization fails beautifully if you swap the baseline scores with an arbitrary monotonically decreasing function. And in fact, the black art of getting regularization to work for some baselines is in score pre-processing and normalization. I usually used zero-mean unit variance because it's simple to implement, theoretically appealing for regularization, empirically effective for metasearch, and worked. The regularized scores, then, are crazy things that never look like probabilities.
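For concreteness, a minimal sketch of that zero-mean, unit-variance pre-processing (the baseline scores are toy values, and the regularization step itself is not shown):

```python
# Zero-mean, unit-variance normalization of baseline retrieval scores, as mentioned above.
# The scores are toy values; only the pre-processing step is illustrated.
import statistics

raw_scores = [12.4, 11.9, 11.7, 9.3, 5.2, 5.1, 5.0, 4.8]  # hypothetical baseline scores

mean = statistics.mean(raw_scores)
stdev = statistics.pstdev(raw_scores)  # population std dev; sample std dev would also work

normalized = [(s - mean) / stdev for s in raw_scores]
print([round(z, 2) for z in normalized])  # note: these need not look like probabilities
```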

Anonymous said...

Set retrieval has been studied before; see:

[1] W. Dai and R. Srihari. Minimal document set retrieval. In CIKM ’05: Proceedings of the 14th ACM international conference on Information and knowledge management, pages 752–759, New York, NY, USA, 2005. ACM Press.
[2] A. Griffiths, H. C. Luckhurst, and P. Willett. Using interdocument similarity information in document retrieval systems. Journal of the American Society for Information Science, 37(1):3–11, 1986.
[3] M. A. Hearst and J. O. Pedersen. Reexamining the cluster hypothesis: scatter/gather on retrieval results. In SIGIR ’96: Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, pages 76–84, New York, NY, USA, 1996. ACM Press.
[4] N. Jardine and C. J. van Rijsbergen. The use of hierarchic clustering in information retrieval. Information Storage and Retrieval, 7:217–240, 1971.
[5] X. Liu and W. B. Croft. Experiments on retrieval of optimal clusters. Technical Report IR-478, University of Massachusetts Amherst, 2006.

And for a regularization way to do it, try Section 8.3.1 here.

Anonymous said...

Neither clarity nor PRF in their most effective senses use set-based feedback. The query model in clarity (and similarly in relevance modeling) is very closely tied to the baseline retrieval score. Specifically, as published, the query model is the linear combination of document vectors weighted by the relevance score. The interpolation weights for top-k can be thought of as being constant for the top-k and 0 for everything lower. When implemented according to theory, the interpolation weights drop off precipitously, providing a very nice adaptive top-k.

In practice, clarity and relevance models use top-k in addition to relevance weighting for efficiency reasons. It's theoretically justified because the scores in general drop off by the cutoff point.

This doesn't preclude using top-k clarity or PRF without paying attention to weights in situations where weights are bogus or missing. But in my experience, if you have that information you should use it.
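A toy sketch of the contrast between relevance-score weighting and a hard top-k cutoff when building the feedback query model (the documents, scores, and term distributions are all invented):

```python
# Contrast relevance-score weighting with uniform top-k weighting when building a
# feedback query model. Documents, scores, and "vectors" (term distributions) are made up.
from collections import defaultdict

# (retrieval score, term distribution) for each retrieved document -- hypothetical values
retrieved = [
    (8.1, {"jaguar": 0.5, "car": 0.4, "speed": 0.1}),
    (7.9, {"jaguar": 0.6, "cat": 0.3, "zoo": 0.1}),
    (2.3, {"jaguar": 0.2, "os": 0.5, "apple": 0.3}),
    (0.4, {"cat": 0.7, "zoo": 0.3}),
]

def feedback_model(docs, weights):
    """Linear combination of document term distributions under the given weights."""
    model = defaultdict(float)
    for w, (_, dist) in zip(weights, docs):
        for term, p in dist.items():
            model[term] += w * p
    return dict(model)

# (a) weights proportional to the retrieval scores (mass-normalized)
total = sum(s for s, _ in retrieved)
score_weights = [s / total for s, _ in retrieved]

# (b) uniform weights over the top-k, zero below the cutoff
k = 2
topk_weights = [1.0 / k if i < k else 0.0 for i in range(len(retrieved))]

print("score-weighted:", feedback_model(retrieved, score_weights))
print("top-%d weighted:" % k, feedback_model(retrieved, topk_weights))
```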

Daniel Tunkelang said...

Fernando, I'll have to cite your work more often!

Thanks for the clarification about the regularization work. But, if the raw scores are meaningful--which your approach counts on--why don't they get considered for user interfaces or evaluation purposes? I've had the strong impression that ranked retrieval approaches toss out the raw scores because they have no use for them. If that impression is off base, I'd love to correct it.

Re clarity, I'm not sure I follow the adaptive top-k explanation. I thought the cut-off was explicitly chosen (ranging from 20 to 500 documents in the examples in the original paper). Is your point that, because these contributions drop off rapidly, the score is fairly insensitive to the cutoff?

Daniel Tunkelang said...

Oh, and thanks for all the set retrieval references! The early ones are familiar, but I'll have to check out the CIKM '05 paper and the tech report, as well as your dissertation.

Anonymous said...

The value of presenting scores is a UI issue, so I'm not really qualified to answer. The readers of this blog would certainly know how to interpret scores and probabilities, but my suspicion is that "regular users" would need some training. Scores probably got dropped for presentation in an effort to reduce clutter. That said, behind the curtain, I can only comment that it would be a very good thing to look at the score.

With respect to scores, top-k, and clarity/PRF, I can point to this figure. This is a theoretical curve for a single query. The horizontal axis is the rank; the vertical axis is the weight in the clarity/PRF interpolation. The curve labeled y is the mass-normalized score as used in clarity/PRF. The red line is the weight used by top-k clarity/PRF. When the red line is consistent with the black line, top-k equals relevance weighting. However, the decay of the weights depends on the query. Here are some example curves for TREC queries with QL baselines. Top 100 is definitely a reasonable pick, but I would just use the scores directly.

Anonymous said...

@Daniel: nice post, well nice posts, well nice blog :) I have to say that I still feel that ranked retrieval is an improvement over set retrieval, though.

@Fernando (and just to add some more... noise): which metric should one use to compare two different score functions? I'm asking because I did some work on that - the ADM metric.

Anonymous said...

Some goof published this paper on comparing score functions:

F. Diaz. Performance prediction using spatial autocorrelation. In SIGIR ’07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 583–590, New York, NY, USA, 2007. ACM Press.

Discussions of the relationship to clarity and other precision predictors are contained there too.

Anonymous said...

I think we can have our cake and eat it, too.

Suppose you have, for each document, a probabilistic estimate of its being in the relevant set. The sum of these estimates is the best estimate of the number of relevant docs. The estimates can also be used to infer other statistics, such as term counts, site counts, etc.

You can then sample results according to their probabilities, which might be theoretically interesting, but seems like it'd be confusing to users in practice.

If all probs are 0/1, the probabilistic model reduces to set retrieval.

A threshold can be used to convert a probabilistic model to a 0/1 model. Varying the threshold trades precision for recall if the relevance estimates are at all reasonable. I always thought the TREC evals were taking this view, but with ranked results rather than scored ones.
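A small sketch of that view, with invented probabilities, showing the expected count and the threshold that collapses the model to 0/1 set retrieval:

```python
# Sketch of the probabilistic view described above: per-document probabilities of being
# in the relevant set, the expected number of relevant documents, and a threshold that
# converts the model to 0/1 set retrieval. All values are invented.

probs = {"d1": 0.95, "d2": 0.90, "d3": 0.60, "d4": 0.30, "d5": 0.05}

# Expected number of relevant documents = sum of the per-document probabilities.
print("expected # relevant:", sum(probs.values()))

def to_set(probabilities, threshold):
    """Convert the probabilistic model to a 0/1 set-retrieval model."""
    return {d for d, p in probabilities.items() if p >= threshold}

# A high threshold favors precision; a low threshold favors recall.
for t in (0.8, 0.5, 0.1):
    print(f"threshold {t}: retrieved set = {sorted(to_set(probs, t))}")
```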

Daniel Tunkelang said...

Agreed: a probabilistic retrieval model gives us the best of both worlds. But that's more powerful than a ranked retrieval model, and, as far as I can tell, TREC's predominant focus on ranked retrieval models doesn't help us build better (i.e., better calibrated) probabilistic models.

Sunday, August 24, 2008

Set Retrieval vs. Ranked Retrieval

After last week's post about a racially targeted web search engine, you'd think I'd avoid controversy for a while. To the contrary, I now feel bold enough like to bring up what I have found to be my most controversial position within the information retrieval community: my preference for set retrieval over ranked retrieval.

This will be the first of several posts along this theme, so I'll start by introducing the terms.
  • In a ranked retrieval approach, the system responds to a search query by ranking all documents in the corpus based on its estimate of their relevance to the query.

  • In a set retrieval approach, the system partitions the corpus into two subsets of documents: those it considers relevant to the search query, and those it does not.
An information retrieval system can combine set retrieval and ranked retrieval by first determining a set of matching documents and then ranking the matching documents. Most industrial search engines, such as Google, take this approach, at least in principle. But, because the set of matching documents is typically much larger than the set of documents displayed to a user, these approaches are, in practice, ranked retrieval.

What is set retrieval in practice? In my view, a set retrieval approach satisfies two expectations:
  • The number of documents reported to match my search should be meaningful--or at least should be a meaningful estimate. More generally, any summary information reported about this set should be useful.

  • Displaying a random subset of the set of matching documents to the user should be a plausible behavior, even if it is not as good as displaying the top-ranked matches. In other words, relevance ranking should help distinguish more relevant results from less relevant results, rather than distinguishing relevant results from irrelevant results.
Despite its popularity, the ranked retrieval model suffers because it does not provide a clear split between relevant and irrelevant documents. This weakness makes it impossible to obtain even basic analysis of the query results, such as the number of relevant documents, let alone a more complicated one, such as the result quality. In contrast, a set retrieval model partitions the corpus into two subsets of documents: those that are considered relevant, and those that are not. A set retrieval model does not rank the retrieved documents; instead, it establishes a clear split between documents that are in and out of the retrieved set. As a result, set retrieval models enable rich analysis of query results, which can then be applied to improve user experience.

27 comments:

Max L. Wilson said...

See this is interesting. One of the things that set retrieval would enable is ranking by novelty.

The idea, to be clear, is to put in the first page of results, the 10 that, together, cover the most breadth of a topic. I think it would be really interesting, as part of this, to see novelty oriented result information, so that the snippit of text you see covers what that result covers that the other results do not. if your dataset has keywords etc too, then you could should the keywords that it has that the other results do not. perhaps.

ive not done anything specific in this area myself. but i've seen some guys from spain send out a few papers including some IP&M papers. In particular, i spoke to David Losisa's student about the idea at SIGIR07.

i suspect it might be more applicable to vertical search. i guess it has most applicability to domain learning tasks, as prioritizing novelty might exclude your answer that was in result 2 from the relevance ranking.

Daniel Tunkelang said...

Max, that's a good point, but I feel you're understating it. Since relevance ranking is a fundamentally a sort order for the results, it precludes any explicit choice of sort order. It only makes sense to sort a set. I'll dive deeper into this topic in an upcoming post.

Michael Bendersky said...

Hi Daniel,
Great post!

I wanted to comment on a connection between the set and the ranked retrieval.

I think people often assume the set retrieval model implicitly, without actually stating it. Two immediate examples I can think of are pseudo-relevance feedback and query-clarity. In both models, top-k are just assumed to be a relevant set.

I believe that stating the set-retrieval model explicitly might be helpful in understanding both of these models, and also other retrieval/re-ranking models that utilize a sub-set of the retrieved results.

Anonymous said...

Daniel -

I'm really beginning to enjoy your blog. I discovered it a few weeks ago, and have slowly been digesting your older posts.

On the subject of relevance ranking helping distinguish more relevant results from less relevant results.. I have a vague recollection of some of the early web search engines, circa 1994 or 1995 (Infoseek?) actually displaying their normalized [0..100] relevance scores to the left of every search result. That way you could tell whether the top 10 links were all in the 99 to 92 range, or whether the scores started in the 90s, but then quickly tapered off after the 4th or 5th result to the 70s or 40s or whatever.

It is a very simple, very old idea. But no one does it anymore. Why is that? It's very frustrating. It would be much more useful to get a sense of the relevance-ranking score distribution. It might even be nice to show a binned histogram of the top 1000 retrieved documents, so that you could see not only the range but the distribution of the scores in a summarized, exploratory manner.

David Fauth said...

Jeremy,

I seem to remember the relevance ranking and tools like Verity would allow you to display it. However, as Daniel says, this is already some sort of a ranking/ordering system implying a sort order.

Yahoo has BOSS which allows developers access to their search engine. What I don't know is it is retrieving only a set or if there is implied ranking with it.

I'm looking forward to the rest of this series.

Daniel Tunkelang said...

Mike, it warms my heart to see you make those particular connections.

Query clarity was actually one of the ideas that made me focus so intently on set retrieval. My colleagues and I at Endeca have actually taken a query-less approach that evaluates a clarity-like measure for document sets, regardless of how they are obtained. We've implemented lots of applications on top of this measure.

As for pseudo-relevance feedback, I did see a poster at ECIR '08 suggesting ways to optimize the number of documents used for feedback. I can't find it online, but the title was "Robust Query-specific Pseudo Feedback Document Selection for Query Expansion."

Daniel Tunkelang said...

Jeremy, thanks! As for displaying relevance scores, I think the problem is that most ranking approaches don't actually compute a probability of relevance--all they compute is some score you can use for ranking. That makes it impossible to band results or offer a meaningful histogram.

That said, I'm not sure how well calibrate the probabilities were for engines that did offer them.

Daniel Tunkelang said...

David, pretty sure that Yahoo doesn't let you change the ranking. I've never seen any web search API let you do that. But if I'm wrong, please let me know!

Anonymous said...

Hi Daniel,

With Yahoo! BOSS you can re-order the results returned by the API. It is one of the highlighted features at:

http://developer.yahoo.com/search/boss/

Daniel Tunkelang said...

Sérgio, do you mean that you can retrieve the top 50 according to Yahoo's relevance ranking strategy and re-order those, or that you can change the sort order used to determine what are the top 50? My sense is that you can only do the former, but maybe I'm missing something in the documentation.

Anonymous said...

Yes, you are right. Only the former seems to be possible. I misinterpreted the previous comments on this thread.

David Fauth said...

I believe that Yahoo still returns what they believe is the right "set". You only get to reorder the results not change how the set is obtained.

Anonymous said...

@David F.: I seem to remember the relevance ranking and tools like Verity would allow you to display it. However, as Daniel says, this is already some sort of a ranking/ordering system implying a sort order.

Yes, I know it does imply some sort of order, but I am basically reacting to this statement from Daniel, that says if you're going to order it anyway, you should give the user a way of distinguishing more-relevant from less-relevant:

Displaying a random subset of the set of matching documents to the user should be a plausible behavior, even if it is not as good as displaying the top-ranked matches. In other words, relevance ranking should help distinguish more relevant results from less relevant results, rather than distinguishing relevant results from irrelevant results.

And one (simple, old) way of doing that is simply to show the final overall score for each document. That way you can see if the step from result 3 to result 4 is a score change from 0.947 to 0.921, or a change from 0.947 to 0.223. Being able to see that precipitous drop-off when going from rank 3 to rank 4, would let you see exactly where "more relevant" starts to become "less relevant".

And regarding the Yahoo BOSS API; realize that there is more than one API. There is the standard BOSS API, but there is also a "BOSS Academic" API and a "BOSS Custom" API. I've talked with a few of the guys at Yahoo, and I have the impression that some of the other two versions of the API (Academic and Custom) do allow you to change how the set is actually produced, rather than only reorder results once you've got them back.

Daniel Tunkelang said...

Well, if the scoring has meaning beyond rank order, I'm all for exposing it. But I think it's rare for a modern information retrieval model to deliver more than a rank order of results. Certainly I don't see the TREC evaluations taking any such scoring into account, which would be straightforward if, say, relevance scores were returned as probabilities.

Re Yahoo: I'd be delighted if they offered the ability to change the overall sort. I'm very skeptical, since allowing that degree of control could have significant implications for the way they'd have to optimize their data structures. I know a little about this stuff from my work at Endeca. :-)

Anonymous said...

But I think it's rare for a modern information retrieval model to deliver more than a rank order of results.

Rare to deliver to the consumer? Or rare to have computed, internally?

If they're not computing some sort of numerical score, then how are they doing the ranking? They're certainly not classifying each result into categories "apple", "banana" and "pear" -- and as everybody knows pears come before bananas. Right? So they must have some sort of number? So even if its rare to expose that number to the end user, don't they still have that number, somewhere, in the system?

Certainly I don't see the TREC evaluations taking any such scoring into account, which would be straightforward if, say, relevance scores were returned as probabilities.

Well, probabilities would be nice. But they don't have to be probabilities. They just have to be magnitude-differentiated ordinals of some sort. Suppose for example the search engine were some sort of metasearch, doing Borda count ranked list fusion, i.e. the sum of "k - rank(d, E)", where k is a constant (say, 1000), and rank(d, E) is the rank of document (or web page) d returned by search engine E. Take 5 different search engines, figure out at what rank from each engine a document is located, and do the summation across all the engines. (If a document is not returned by an engine, then put its rank = k, so that the score is 0.) Sort or rank by the sum total.

At the end of that process, you'll have an integer-valued sum of integers. No probabilities, no real-valued numbers. But there will still be a gap, an integer difference, between the rank#1 doc and the rank#2 doc, between rank#7 and rank#8, etc.

And so you'll still be able to see when a borda-count fusion score goes from 4,235 to 4,227, versus when a score goes from 4,227 to 2,506.

So this sort of information has to exist, in the engine, in some form, right? Some sort of magnitude-differentiated ordinal? Or are modern search engines sorting without any kind of magnitude? Is that even possible? Even when I'm summing ranks, and each individual sublist starts out with no difference in magnitude from one ranked result to the next, the sum starts to develop a magnitude difference. Is it really that different when combining multi-word queries?

Re Yahoo: I'd be delighted if they offered the ability to change the overall sort. I'm very skeptical, since allowing that degree of control could have significant implications for the way they'd have to optimize their data structures. I know a little about this stuff from my work at Endeca. :-)

From talking with the Yahoo guys, in their BOSS Custom offering, they indeed will go out and do some customized index stuff for you. I'm just passing along what they say.

Daniel Tunkelang said...

Jeremy, I meant rare for the algorithms to deliver more than rank ordering.

Here's an excerpt from a paper by Fernando Diaz:

A ranked retrieval model assigns some rank or score to each document in the collection and ranks documents according to the score. The user then scans the documents according to the ranking. The score function for a ranked retrieval model maps documents to real values. Given a query, the model provides a function from documents to scores. The values of the function provide the desired ranking of the documents.

Fernando goes on to examine the behavior of score functions for ranked retrieval models. But I want to emphasize his point that ranked retrieval models only use scores to obtain ranking. They don't promise anything about the scores themselves--e.g., whether the magnitude of difference in score has a meaningful interpretation. Indeed, Fernando's paper is an attempt to obtain useful scores from any scoring approach through regularization.

Re Yahoo: I looked at the BOSS Custom page. It sounds interesting, but delightfully vague. I don't blame them; supporting this sort of customization is a serious endeavor.

Anonymous said...

Ok, I understand ya now. Rare for the search engines to deliver. And so I was just saying that they have the ability to deliver, if they so chose to. And it's just unpleasing that they choose not to.

But yes, I now understand your counterpoint also.. that even when there are magnitude differences between scores, those might not have any semantic interpretation. I'd been operating under the assumption that there would be something semantically interpretable if you could see the distribution or histogram of the scores returned by a query. Maybe not a strong one.. but definitely something that the human brain could make use of. But now I realize.. that's just an assumption.

Thanks for reminding me of Fernando's work. I'd read a shorter version of this regularization idea, but this one looks great.. lots more info.

But there has to be other, "HCIR" work out there, looking at this sort of thing? Because I do remember seeing some systems that did actually expose their (unregularized) scores, that did offer that transparency. But I wonder if anyone has done the HCI experiments to see how useful that information really is. You couldn't measure it though clickthroughs or something like that, right? Because it's not necessarily something that is designed to increase your clickthrough. So you'd have to measure it some other way.

Thoughts?

Anonymous said...

The regularization work that you cite always assumes that there is value in the actual distribution of baseline scores (Jeremy talks about this earlier). Regularization fails beautifully if you swap the baseline scores with an arbitrary monotonically decreasing function. And in fact, the black art of getting regularization to work for some baselines is in score pre-processing and normalization. I usually used zero-mean unit variance because it's simple to implement, theoretically appealing for regularization, empirically effective for metasearch, and worked. The regularized scores, then, are crazy things that never look like probabilities.

Anonymous said...

Set retrieval has been studied before here,

[1] W. Dai and R. Srihari. Minimal document set retrieval. In CIKM ’05: Proceedings of the 14th ACM international conference on Information and knowledge management, pages 752–759, New York, NY, USA, 2005. ACM Press.
[2] A. Griffiths, H. C. Luckhurst, and P. Willett. Using interdocument similarity information in document retrieval systems. Journal of the American Society for Information Science, 37(1):3–11, 1986.
[3] M. A. Hearst and J. O. Pedersen. Reexamining the cluster hypothesis: scatter/gather on retrieval results. In SIGIR ’96: Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, pages 76–84, New York, NY, USA, 1996. ACM Press.
[4] N. Jardine and C. J. van Rijsbergen. The use of hierarchic clustering in information retrieval. Information Storage and Retrieval, 7:217–240, 1971.
[5] X. Liu and W. B. Croft. Experiments on retrieval of optimal clusters. Technical Report IR-478, University of Massachusetts Amherst, 2006.

And for a regularization way to do it, try Section 8.3.1 here.

Anonymous said...

Neither clarity nor PRF in their most effective senses use set-based feedback. The query model in clarity (and similarly in relevance modeling) is very closely tied to the baseline retrieval score. Specifically, as published, the query model is the linear combination of document vectors weighted by the relevance score. The interpolation weights for top-k can be thought of as being constant for the top-k and 0 for everything lower. When implemented according to theory, the interpolation weights drop off precipitously, providing a very nice adaptive top-k.

In practice, clarity and relevance models use top-k in addition to relevance weighting for efficiency reasons. It's theoretically justified because the scores in general drop off by the cutoff point.

This doesn't preclude using a top-k clarity of PRF without paying attention to weights in situations where weights are bogus or missing. But in my experience, if you have that information you should use it.

Daniel Tunkelang said...

Fernando, I'll have to cite your work more often!

Thanks for the clarification about the regularization work. But, if the raw scores are meaningful--which your approach counts on--why don't they get considered for user interfaces or evaluation purposes. I've had the strong impression that ranked retrieval approaches toss out the raw scores because they have no use for them. If that impression is off base, I'd love to correct it.

Re clarity, I'm not sure I follow the adaptive top-k explanation. I thought the cut-off was explicitly chosen (ranging from 20 to 500 documents in the examples in the original paper). Is your point that, because these contributions drop off rapidly, the score is fairly insensitive to the cutoff?

Daniel Tunkelang said...

Oh, and thanks for all the set retrieval references! The early ones are familiar, but I'll have to check out the CIKM '05 paper and the tech report, as well as your dissertation.

Anonymous said...

The value of presenting scores is a UI issue, so I'm not really qualified to answer. The readers of this blog would certainly know how to interpret scores and probabilities, but my suspicion is that "regular users" would need some training. Scores probably got dropped for presentation in an effort to reduce clutter. That said, behind the curtain, I can only comment that it would be a very good thing to look at the score.

With respect to scores, topk and clarity/PRF, I can point to this figure. This is a theoretical curve for a single query. The horizontal axis is the rank, the vertical axis is the weight in the clarity/PRF interpolation. The curve labeled y is the mass normalized score as used in clarity/PRF. The red line is the weight used by topk clarity/PRF. When the red line is consistent with the black line topk=relevance weighting. However, the decay of the weights depends on the query. Here are some example curves for TREC queries with QL baselines. Top 100 is definitely a reasonable pick but I would just use the scores directly

Anonymous said...

@Daniel: nice post, well nice posts, well nice blog :) I have to say that I still feel that ranked retrieval is an improvement over set retrieval, though.

@Fernando (and just to add some more... noise): which metric should one use to compare two different score functions? I'm asking because I did some work on that - the ADM metric.

Anonymous said...

some goof published this paper on comparing score functions,


F. Diaz. Performance prediction using spatial autocorrelation. In SIGIR ’07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 583–590, New York, NY, USA, 2007. ACM Press.

discussions of relationship to clarity and other precision predictors contained too.

Anonymous said...

I think we can have our cake and eat it, too.

Suppose you have a probabilistic estimate of being in the relevant set. The expectation of this variable is the best estimate of the number of relevant docs. It can also be used to infer other statistics, such as term counts, site counts, etc.

You can then sample results according to their probabilities, which might be theoretically interesting, but seems like it'd be confusing to users in practice.

If all probs are 0/1, the probabilistic model reduces to set retrieval.

A threshold can be used to convert a probabilistic model to a 0/1 model. Varying the threshold trades precision for recall if the relevance estimates are at all reasonable. I always thought the TREC evals were taking this view, but with ranked results rather than scored ones.

Daniel Tunkelang said...

Agreed: a probabilistic retrieval model gives us the best of both worlds. But that's more powerful than a ranked retrieval model, and, as far as I can tell, TREC's predominant focus on ranked retrieval models doesn't help us build better (i.e., better calibrated) probabilistic models.

Sunday, August 24, 2008

Set Retrieval vs. Ranked Retrieval

After last week's post about a racially targeted web search engine, you'd think I'd avoid controversy for a while. To the contrary, I now feel bold enough like to bring up what I have found to be my most controversial position within the information retrieval community: my preference for set retrieval over ranked retrieval.

This will be the first of several posts along this theme, so I'll start by introducing the terms.
  • In a ranked retrieval approach, the system responds to a search query by ranking all documents in the corpus based on its estimate of their relevance to the query.

  • In a set retrieval approach, the system partitions the corpus into two subsets of documents: those it considers relevant to the search query, and those it does not.
An information retrieval system can combine set retrieval and ranked retrieval by first determining a set of matching documents and then ranking the matching documents. Most industrial search engines, such as Google, take this approach, at least in principle. But, because the set of matching documents is typically much larger than the set of documents displayed to a user, these approaches are, in practice, ranked retrieval.

What is set retrieval in practice? In my view, a set retrieval approach satisfies two expectations:
  • The number of documents reported to match my search should be meaningful--or at least should be a meaningful estimate. More generally, any summary information reported about this set should be useful.

  • Displaying a random subset of the set of matching documents to the user should be a plausible behavior, even if it is not as good as displaying the top-ranked matches. In other words, relevance ranking should help distinguish more relevant results from less relevant results, rather than distinguishing relevant results from irrelevant results.
Despite its popularity, the ranked retrieval model suffers because it does not provide a clear split between relevant and irrelevant documents. This weakness makes it impossible to obtain even basic analysis of the query results, such as the number of relevant documents, let alone a more complicated one, such as the result quality. In contrast, a set retrieval model partitions the corpus into two subsets of documents: those that are considered relevant, and those that are not. A set retrieval model does not rank the retrieved documents; instead, it establishes a clear split between documents that are in and out of the retrieved set. As a result, set retrieval models enable rich analysis of query results, which can then be applied to improve user experience.

27 comments:

Max L. Wilson said...

See this is interesting. One of the things that set retrieval would enable is ranking by novelty.

The idea, to be clear, is to put in the first page of results, the 10 that, together, cover the most breadth of a topic. I think it would be really interesting, as part of this, to see novelty oriented result information, so that the snippit of text you see covers what that result covers that the other results do not. if your dataset has keywords etc too, then you could should the keywords that it has that the other results do not. perhaps.

ive not done anything specific in this area myself. but i've seen some guys from spain send out a few papers including some IP&M papers. In particular, i spoke to David Losisa's student about the idea at SIGIR07.

i suspect it might be more applicable to vertical search. i guess it has most applicability to domain learning tasks, as prioritizing novelty might exclude your answer that was in result 2 from the relevance ranking.

Daniel Tunkelang said...

Max, that's a good point, but I feel you're understating it. Since relevance ranking is a fundamentally a sort order for the results, it precludes any explicit choice of sort order. It only makes sense to sort a set. I'll dive deeper into this topic in an upcoming post.

Michael Bendersky said...

Hi Daniel,
Great post!

I wanted to comment on a connection between the set and the ranked retrieval.

I think people often assume the set retrieval model implicitly, without actually stating it. Two immediate examples I can think of are pseudo-relevance feedback and query-clarity. In both models, top-k are just assumed to be a relevant set.

I believe that stating the set-retrieval model explicitly might be helpful in understanding both of these models, and also other retrieval/re-ranking models that utilize a sub-set of the retrieved results.

Anonymous said...

Daniel -

I'm really beginning to enjoy your blog. I discovered it a few weeks ago, and have slowly been digesting your older posts.

On the subject of relevance ranking helping distinguish more relevant results from less relevant results.. I have a vague recollection of some of the early web search engines, circa 1994 or 1995 (Infoseek?) actually displaying their normalized [0..100] relevance scores to the left of every search result. That way you could tell whether the top 10 links were all in the 99 to 92 range, or whether the scores started in the 90s, but then quickly tapered off after the 4th or 5th result to the 70s or 40s or whatever.

It is a very simple, very old idea. But no one does it anymore. Why is that? It's very frustrating. It would be much more useful to get a sense of the relevance-ranking score distribution. It might even be nice to show a binned histogram of the top 1000 retrieved documents, so that you could see not only the range but the distribution of the scores in a summarized, exploratory manner.

David Fauth said...

Jeremy,

I seem to remember the relevance ranking and tools like Verity would allow you to display it. However, as Daniel says, this is already some sort of a ranking/ordering system implying a sort order.

Yahoo has BOSS which allows developers access to their search engine. What I don't know is it is retrieving only a set or if there is implied ranking with it.

I'm looking forward to the rest of this series.

Daniel Tunkelang said...

Mike, it warms my heart to see you make those particular connections.

Query clarity was actually one of the ideas that made me focus so intently on set retrieval. My colleagues and I at Endeca have actually taken a query-less approach that evaluates a clarity-like measure for document sets, regardless of how they are obtained. We've implemented lots of applications on top of this measure.

As for pseudo-relevance feedback, I did see a poster at ECIR '08 suggesting ways to optimize the number of documents used for feedback. I can't find it online, but the title was "Robust Query-specific Pseudo Feedback Document Selection for Query Expansion."

Daniel Tunkelang said...

Jeremy, thanks! As for displaying relevance scores, I think the problem is that most ranking approaches don't actually compute a probability of relevance--all they compute is some score you can use for ranking. That makes it impossible to band results or offer a meaningful histogram.

That said, I'm not sure how well calibrate the probabilities were for engines that did offer them.

Daniel Tunkelang said...

David, pretty sure that Yahoo doesn't let you change the ranking. I've never seen any web search API let you do that. But if I'm wrong, please let me know!

Anonymous said...

Hi Daniel,

With Yahoo! BOSS you can re-order the results returned by the API. It is one of the highlighted features at:

http://developer.yahoo.com/search/boss/

Daniel Tunkelang said...

Sérgio, do you mean that you can retrieve the top 50 according to Yahoo's relevance ranking strategy and re-order those, or that you can change the sort order used to determine what are the top 50? My sense is that you can only do the former, but maybe I'm missing something in the documentation.

Anonymous said...

Yes, you are right. Only the former seems to be possible. I misinterpreted the previous comments on this thread.

David Fauth said...

I believe that Yahoo still returns what they believe is the right "set". You only get to reorder the results not change how the set is obtained.

Anonymous said...

@David F.: I seem to remember the relevance ranking and tools like Verity would allow you to display it. However, as Daniel says, this is already some sort of a ranking/ordering system implying a sort order.

Yes, I know it does imply some sort of order, but I am basically reacting to this statement from Daniel, that says if you're going to order it anyway, you should give the user a way of distinguishing more-relevant from less-relevant:

Displaying a random subset of the set of matching documents to the user should be a plausible behavior, even if it is not as good as displaying the top-ranked matches. In other words, relevance ranking should help distinguish more relevant results from less relevant results, rather than distinguishing relevant results from irrelevant results.

And one (simple, old) way of doing that is simply to show the final overall score for each document. That way you can see if the step from result 3 to result 4 is a score change from 0.947 to 0.921, or a change from 0.947 to 0.223. Being able to see that precipitous drop-off when going from rank 3 to rank 4, would let you see exactly where "more relevant" starts to become "less relevant".

And regarding the Yahoo BOSS API; realize that there is more than one API. There is the standard BOSS API, but there is also a "BOSS Academic" API and a "BOSS Custom" API. I've talked with a few of the guys at Yahoo, and I have the impression that some of the other two versions of the API (Academic and Custom) do allow you to change how the set is actually produced, rather than only reorder results once you've got them back.

Daniel Tunkelang said...

Well, if the scoring has meaning beyond rank order, I'm all for exposing it. But I think it's rare for a modern information retrieval model to deliver more than a rank order of results. Certainly I don't see the TREC evaluations taking any such scoring into account, which would be straightforward if, say, relevance scores were returned as probabilities.

Re Yahoo: I'd be delighted if they offered the ability to change the overall sort. I'm very skeptical, since allowing that degree of control could have significant implications for the way they'd have to optimize their data structures. I know a little about this stuff from my work at Endeca. :-)

Anonymous said...

But I think it's rare for a modern information retrieval model to deliver more than a rank order of results.

Rare to deliver to the consumer? Or rare to have computed, internally?

If they're not computing some sort of numerical score, then how are they doing the ranking? They're certainly not classifying each result into categories "apple", "banana" and "pear" -- and as everybody knows pears come before bananas. Right? So they must have some sort of number? So even if its rare to expose that number to the end user, don't they still have that number, somewhere, in the system?

Certainly I don't see the TREC evaluations taking any such scoring into account, which would be straightforward if, say, relevance scores were returned as probabilities.

Well, probabilities would be nice. But they don't have to be probabilities. They just have to be magnitude-differentiated ordinals of some sort. Suppose for example the search engine were some sort of metasearch, doing Borda count ranked list fusion, i.e. the sum of "k - rank(d, E)", where k is a constant (say, 1000), and rank(d, E) is the rank of document (or web page) d returned by search engine E. Take 5 different search engines, figure out at what rank from each engine a document is located, and do the summation across all the engines. (If a document is not returned by an engine, then put its rank = k, so that the score is 0.) Sort or rank by the sum total.

At the end of that process, you'll have an integer-valued sum of integers. No probabilities, no real-valued numbers. But there will still be a gap, an integer difference, between the rank#1 doc and the rank#2 doc, between rank#7 and rank#8, etc.

And so you'll still be able to see when a borda-count fusion score goes from 4,235 to 4,227, versus when a score goes from 4,227 to 2,506.

So this sort of information has to exist, in the engine, in some form, right? Some sort of magnitude-differentiated ordinal? Or are modern search engines sorting without any kind of magnitude? Is that even possible? Even when I'm summing ranks, and each individual sublist starts out with no difference in magnitude from one ranked result to the next, the sum starts to develop a magnitude difference. Is it really that different when combining multi-word queries?
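Here is a rough Python sketch of that Borda-count fusion; the engine rankings, document ids, and the borda_fuse helper are invented just to illustrate the arithmetic:

    K = 1000  # the constant k; a document an engine doesn't return contributes k - k = 0

    def borda_fuse(ranked_lists):
        """Fuse several engines' ranked lists by summing k - rank(d, E)."""
        scores = {}
        for results in ranked_lists:
            for rank, doc in enumerate(results, start=1):
                scores[doc] = scores.get(doc, 0) + (K - rank)
            # documents missing from this engine implicitly get rank k, i.e. add 0
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)

    # Three hypothetical engines' top results:
    fused = borda_fuse([["d1", "d2", "d3"], ["d2", "d1", "d4"], ["d1", "d4", "d2"]])
    print(fused)  # integer scores, with visible gaps between adjacent ranks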

Re Yahoo: I'd be delighted if they offered the ability to change the overall sort. I'm very skeptical, since allowing that degree of control could have significant implications for the way they'd have to optimize their data structures. I know a little about this stuff from my work at Endeca. :-)

From talking with the Yahoo guys, in their BOSS Custom offering, they indeed will go out and do some customized index stuff for you. I'm just passing along what they say.

Daniel Tunkelang said...

Jeremy, I meant rare for the algorithms to deliver more than rank ordering.

Here's an excerpt from a paper by Fernando Diaz:

A ranked retrieval model assigns some rank or score to each document in the collection and ranks documents according to the score. The user then scans the documents according to the ranking. The score function for a ranked retrieval model maps documents to real values. Given a query, the model provides a function from documents to scores. The values of the function provide the desired ranking of the documents.

Fernando goes on to examine the behavior of score functions for ranked retrieval models. But I want to emphasize his point that ranked retrieval models only use scores to obtain a ranking. They don't promise anything about the scores themselves--e.g., whether the magnitude of the difference in scores has a meaningful interpretation. Indeed, Fernando's paper is an attempt to obtain useful scores from any scoring approach through regularization.
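To make that concrete, here is a tiny illustration in Python (my own, not from the paper): any monotone transform of a score function produces exactly the same ranking but arbitrarily different score gaps, so the raw magnitudes by themselves don't mean much.

    import math

    raw = {"d1": 12.0, "d2": 11.9, "d3": 4.0}               # hypothetical raw scores
    transformed = {d: math.exp(s) for d, s in raw.items()}  # a monotone transform

    assert sorted(raw, key=raw.get, reverse=True) == \
           sorted(transformed, key=transformed.get, reverse=True)  # same ranking
    print(raw["d1"] - raw["d2"])                  # a gap of about 0.1 between ranks 1 and 2...
    print(transformed["d1"] - transformed["d2"])  # ...or a gap of roughly 15,500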

Re Yahoo: I looked at the BOSS Custom page. It sounds interesting, but delightfully vague. I don't blame them; supporting this sort of customization is a serious endeavor.

Anonymous said...

Ok, I understand ya now. Rare for the search engines to deliver. And so I was just saying that they have the ability to deliver, if they so chose. And it's just unpleasing that they choose not to.

But yes, I now understand your counterpoint also.. that even when there are magnitude differences between scores, those might not have any semantic interpretation. I'd been operating under the assumption that there would be something semantically interpretable if you could see the distribution or histogram of the scores returned by a query. Maybe not a strong one.. but definitely something that the human brain could make use of. But now I realize.. that's just an assumption.

Thanks for reminding me of Fernando's work. I'd read a shorter version of this regularization idea, but this one looks great.. lots more info.

But there has to be other "HCIR" work out there, looking at this sort of thing? Because I do remember seeing some systems that did actually expose their (unregularized) scores, that did offer that transparency. But I wonder if anyone has done the HCI experiments to see how useful that information really is. You couldn't measure it through clickthroughs or something like that, right? Because it's not necessarily something that is designed to increase your clickthrough. So you'd have to measure it some other way.

Thoughts?

Anonymous said...

The regularization work that you cite always assumes that there is value in the actual distribution of baseline scores (Jeremy talks about this earlier). Regularization fails beautifully if you replace the baseline scores with an arbitrary monotonically decreasing function. And in fact, the black art of getting regularization to work for some baselines is in score pre-processing and normalization. I usually used zero-mean, unit-variance normalization because it's simple to implement, theoretically appealing for regularization, empirically effective for metasearch, and it worked. The regularized scores, then, are crazy things that never look like probabilities.
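As a rough sketch of that zero-mean, unit-variance pre-processing (the per-query score list is made up, and this is not the actual regularization code):

    import statistics

    def standardize(scores):
        """Shift and scale one query's retrieval scores to mean 0 and unit variance."""
        mean = statistics.mean(scores)
        stdev = statistics.pstdev(scores)
        if stdev == 0:
            return [0.0 for _ in scores]
        return [(s - mean) / stdev for s in scores]

    print(standardize([4.2, 3.9, 3.8, 1.1]))  # [0.76, 0.52, 0.44, -1.72]
    # the normalized scores are unbounded and can be negative: nothing like probabilities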

Anonymous said...

Set retrieval has been studied before here,

[1] W. Dai and R. Srihari. Minimal document set retrieval. In CIKM ’05: Proceedings of the 14th ACM international conference on Information and knowledge management, pages 752–759, New York, NY, USA, 2005. ACM Press.
[2] A. Griffiths, H. C. Luckhurst, and P. Willett. Using interdocument similarity information in document retrieval systems. Journal of the American Society for Information Science, 37(1):3–11, 1986.
[3] M. A. Hearst and J. O. Pedersen. Reexamining the cluster hypothesis: scatter/gather on retrieval results. In SIGIR ’96: Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, pages 76–84, New York, NY, USA, 1996. ACM Press.
[4] N. Jardine and C. J. van Rijsbergen. The use of hierarchic clustering in information retrieval. Information Storage and Retrieval, 7:217–240, 1971.
[5] X. Liu and W. B. Croft. Experiments on retrieval of optimal clusters. Technical Report IR-478, University of Massachusetts Amherst, 2006.

And for a regularization-based way to do it, try Section 8.3.1 here.

Anonymous said...

Neither clarity nor PRF, in their most effective senses, uses set-based feedback. The query model in clarity (and similarly in relevance modeling) is very closely tied to the baseline retrieval score. Specifically, as published, the query model is the linear combination of document vectors weighted by the relevance score. The interpolation weights for top-k can be thought of as being constant for the top k and 0 for everything lower. When implemented according to theory, the interpolation weights drop off precipitously, providing a very nice adaptive top-k.

In practice, clarity and relevance models use top-k in addition to relevance weighting for efficiency reasons. It's theoretically justified because the scores in general drop off by the cutoff point.

This doesn't preclude using top-k clarity or PRF without paying attention to weights in situations where the weights are bogus or missing. But in my experience, if you have that information, you should use it.
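For concreteness, a rough sketch in Python contrasting the two weighting schemes just described, score-proportional weights versus a flat top-k cutoff; the score list is hypothetical and this is not the published clarity/PRF code:

    def score_weights(scores):
        """Mass-normalized retrieval scores: each document weighted by its score."""
        total = sum(scores)
        return [s / total for s in scores]

    def topk_weights(scores, k):
        """Constant weight for the top k documents, zero for everything below."""
        return [1.0 / k if i < k else 0.0 for i in range(len(scores))]

    scores = [0.31, 0.27, 0.22, 0.05, 0.04, 0.03]   # scores drop off sharply after rank 3
    print(score_weights(scores))    # adapts to the drop-off automatically
    print(topk_weights(scores, 3))  # hard cutoff at k = 3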

Daniel Tunkelang said...

Fernando, I'll have to cite your work more often!

Thanks for the clarification about the regularization work. But, if the raw scores are meaningful--which your approach counts on--why don't they get considered for user interfaces or evaluation purposes? I've had the strong impression that ranked retrieval approaches toss out the raw scores because they have no use for them. If that impression is off base, I'd love to correct it.

Re clarity, I'm not sure I follow the adaptive top-k explanation. I thought the cut-off was explicitly chosen (ranging from 20 to 500 documents in the examples in the original paper). Is your point that, because these contributions drop off rapidly, the score is fairly insensitive to the cutoff?

Daniel Tunkelang said...

Oh, and thanks for all the set retrieval references! The early ones are familiar, but I'll have to check out the CIKM '05 paper and the tech report, as well as your dissertation.

Anonymous said...

The value of presenting scores is a UI issue, so I'm not really qualified to answer. The readers of this blog would certainly know how to interpret scores and probabilities, but my suspicion is that "regular users" would need some training. Scores probably got dropped for presentation in an effort to reduce clutter. That said, behind the curtain, I can only comment that it would be a very good thing to look at the score.

With respect to scores, top-k, and clarity/PRF, I can point to this figure. This is a theoretical curve for a single query. The horizontal axis is the rank, the vertical axis is the weight in the clarity/PRF interpolation. The curve labeled y (the black curve) is the mass-normalized score as used in clarity/PRF. The red line is the weight used by top-k clarity/PRF. When the red line tracks the black curve, top-k weighting coincides with relevance weighting. However, the decay of the weights depends on the query. Here are some example curves for TREC queries with QL baselines. Top 100 is definitely a reasonable pick, but I would just use the scores directly.

Anonymous said...

@Daniel: nice post, well nice posts, well nice blog :) I have to say that I still feel that ranked retrieval is an improvement over set retrieval, though.

@Fernando (and just to add some more... noise): which metric should one use to compare two different score functions? I'm asking because I did some work on that - the ADM metric.

Anonymous said...

some goof published this paper on comparing score functions,


F. Diaz. Performance prediction using spatial autocorrelation. In SIGIR ’07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 583–590, New York, NY, USA, 2007. ACM Press.

It also contains discussion of the relationship to clarity and other precision predictors.

Anonymous said...

I think we can have our cake and eat it, too.

Suppose you have, for each document, a probabilistic estimate of its being in the relevant set. The sum of these probabilities (i.e., the expected count) is the best estimate of the number of relevant docs. It can also be used to infer other statistics, such as term counts, site counts, etc.

You can then sample results according to their probabilities, which might be theoretically interesting, but seems like it'd be confusing to users in practice.

If all probs are 0/1, the probabilistic model reduces to set retrieval.

A threshold can be used to convert a probabilistic model to a 0/1 model. Varying the threshold trades precision for recall if the relevance estimates are at all reasonable. I always thought the TREC evals were taking this view, but with ranked results rather than scored ones.
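To make that concrete, here is a minimal sketch in Python; the per-document probabilities, the threshold values, and the retrieve_set helper are made up for illustration:

    probs = {"d1": 0.95, "d2": 0.90, "d3": 0.60, "d4": 0.10, "d5": 0.05}

    # Expected number of relevant docs is the sum of the membership probabilities.
    print("expected relevant docs:", round(sum(probs.values()), 2))  # 2.6

    def retrieve_set(probabilities, threshold):
        """Convert probabilities to a 0/1 relevant set; a higher threshold
        trades recall for precision, assuming the estimates are reasonable."""
        return {d for d, p in probabilities.items() if p >= threshold}

    print(retrieve_set(probs, 0.5))   # d1, d2, d3 make the cut
    print(retrieve_set(probs, 0.85))  # only d1 and d2: a tighter, higher-precision set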

Daniel Tunkelang said...

Agreed: a probabilistic retrieval model gives us the best of both worlds. But that's more powerful than a ranked retrieval model, and, as far as I can tell, TREC's predominant focus on ranked retrieval models doesn't help us build better (i.e., better calibrated) probabilistic models.