Comments on "Set Retrieval vs. Ranked Retrieval" (The Noisy Channel, http://thenoisychannel.com), by Daniel Tunkelang. Feed retrieved 2008-09-16.

[2008-09-03 23:31]
Agreed: a probabilistic retrieval model gives us the best of both worlds. But that's more powerful than a ranked retrieval model, and, as far as I can tell, TREC's predominant focus on ranked retrieval models doesn't help us build better (i.e., better-calibrated) probabilistic models.
-- Daniel Tunkelang

[2008-09-03 18:30]
I think we can have our cake and eat it, too.

Suppose you have a probabilistic estimate of each document being in the relevant set. The expectation of this variable is the best estimate of the number of relevant documents. It can also be used to infer other statistics, such as term counts, site counts, etc.

You could then sample results according to their probabilities, which might be theoretically interesting, but seems like it would be confusing to users in practice.

If all the probabilities are 0 or 1, the probabilistic model reduces to set retrieval.

A threshold can be used to convert a probabilistic model to a 0/1 model. Varying the threshold trades precision for recall, provided the relevance estimates are at all reasonable.
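[Editor's note: the thresholding idea above can be sketched concretely. This is a minimal illustration with made-up documents and probabilities, not output from any real engine.]

```python
# Sketch: a probabilistic retrieval model reduced to set retrieval by
# thresholding. All document names and probabilities are hypothetical.

def expected_relevant(probs):
    """Expectation over relevance probabilities: the best estimate
    of the number of relevant documents."""
    return sum(probs.values())

def retrieve_set(probs, threshold):
    """Thresholding converts the probabilistic model to a 0/1 (set) model."""
    return {doc for doc, p in probs.items() if p >= threshold}

probs = {"d1": 0.95, "d2": 0.80, "d3": 0.40, "d4": 0.10, "d5": 0.05}

print(expected_relevant(probs))           # about 2.3 expected relevant docs
print(sorted(retrieve_set(probs, 0.5)))   # high threshold: precision over recall
print(sorted(retrieve_set(probs, 0.1)))   # low threshold: recall over precision
```

Raising the threshold shrinks the set toward high-precision results; lowering it grows the set toward high recall, which is exactly the trade described above.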
I always thought the TREC evals were taking this view, but with ranked results rather than scored ones.
-- Anonymous

[2008-09-02 10:48]
Some goof published this paper on comparing score functions:

F. Diaz. Performance prediction using spatial autocorrelation. In SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 583–590, New York, NY, USA, 2007. ACM Press.

It also contains discussions of the relationship to clarity and other precision predictors.
-- Anonymous

[2008-09-02 07:02]
@Daniel: nice post... well, nice posts... well, nice blog :) I have to say that I still feel that ranked retrieval is an improvement over set retrieval, though.

@Fernando (and just to add some more... noise): which metric should one use to compare two different score functions? I'm asking because I did some work on that: the ADM metric.
-- Anonymous

[2008-08-29 10:42]
The value of presenting scores is a UI issue, so I'm not really qualified to answer. The readers of this blog would certainly know how to interpret scores and probabilities, but my suspicion is that "regular users" would need some training. Scores probably got dropped from presentation in an effort to reduce clutter.
That said, behind the curtain, I can only comment that it would be a very good thing to look at the score.

With respect to scores, top-k, and clarity/PRF, I can point to this figure (http://ciir.cs.umass.edu/~fdiaz/score-weighting-vs-top-k.pdf). This is a theoretical curve for a single query: the horizontal axis is the rank, the vertical axis is the weight in the clarity/PRF interpolation. The curve labeled y is the mass-normalized score as used in clarity/PRF. The red line is the weight used by top-k clarity/PRF. When the red line is consistent with the black curve, top-k matches relevance weighting. However, the decay of the weights depends on the query. Here (http://ciir.cs.umass.edu/~fdiaz/score-distribution.pdf) are some example curves for TREC queries with QL baselines. Top 100 is definitely a reasonable pick, but I would just use the scores directly.
-- Anonymous

[2008-08-28 21:29]
Oh, and thanks for all the set retrieval references! The early ones are familiar, but I'll have to check out the CIKM '05 paper and the tech report, as well as your dissertation.
-- Daniel Tunkelang

[2008-08-28 21:08]
Fernando, I'll have to cite your work more often!

Thanks for the clarification about the regularization work. But if the raw scores are meaningful, which your approach counts on, why don't they get considered for user interfaces or evaluation purposes? I've had the strong impression that ranked retrieval approaches toss out the raw scores because they have no use for them.
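[Editor's note: the contrast above, mass-normalized score weights versus a flat top-k cutoff in clarity/PRF-style feedback, can be sketched as follows. The scores and document term vectors are invented for illustration.]

```python
# Sketch: building a feedback query model as a weighted linear combination
# of document term vectors, under two weighting schemes. All numbers are
# hypothetical.

def mass_normalize(scores):
    """Score weighting: normalize retrieval scores to sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]

def topk_weights(scores, k):
    """Top-k weighting: constant for the top k documents, zero below."""
    return [1.0 / k if i < k else 0.0 for i in range(len(scores))]

def query_model(doc_vectors, weights):
    """Linear combination of document term vectors under the given weights."""
    model = {}
    for vec, w in zip(doc_vectors, weights):
        for term, p in vec.items():
            model[term] = model.get(term, 0.0) + w * p
    return model

# Hypothetical ranked list whose scores drop off precipitously after rank 2,
# so score weighting acts like an adaptive top-2.
scores = [10.0, 8.0, 0.5, 0.3, 0.2]
docs = [{"jaguar": 0.6, "car": 0.4},
        {"jaguar": 0.5, "speed": 0.5},
        {"cat": 0.9, "jaguar": 0.1},
        {"os": 1.0},
        {"cat": 1.0}]

print(query_model(docs, mass_normalize(scores)))  # ranks 3-5 contribute little
print(query_model(docs, topk_weights(scores, 3)))  # rank 3 counts as much as rank 1
```

With score weighting, the noisy low-ranked documents barely affect the model; the flat top-k cutoff gives them full weight, which is the difference the comment is pointing at.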
If that impression is off base, I'd love to correct it.

Re clarity, I'm not sure I follow the adaptive top-k explanation. I thought the cutoff was explicitly chosen (ranging from 20 to 500 documents in the examples in the original paper). Is your point that, because these contributions drop off rapidly, the score is fairly insensitive to the cutoff?
-- Daniel Tunkelang

[2008-08-28 13:19]
Neither clarity nor PRF, in their most effective senses, uses set-based feedback. The query model in clarity (and similarly in relevance modeling) is very closely tied to the baseline retrieval score. Specifically, as published, the query model is the linear combination of document vectors weighted by the relevance score. The interpolation weights for top-k can be thought of as being constant for the top k and 0 for everything lower. When implemented according to theory, the interpolation weights drop off precipitously, providing a very nice adaptive top-k.

In practice, clarity and relevance models use top-k in addition to relevance weighting, for efficiency reasons. It's theoretically justified because the scores generally drop off by the cutoff point.

This doesn't preclude using top-k clarity or PRF without paying attention to weights, in situations where the weights are bogus or missing. But in my experience, if you have that information, you should use it.
-- Anonymous

[2008-08-28 13:12]
Set retrieval has been studied before; see:

[1] W. Dai and R. Srihari. Minimal document set retrieval.
In CIKM '05: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pages 752–759, New York, NY, USA, 2005. ACM Press.
[2] A. Griffiths, H. C. Luckhurst, and P. Willett. Using interdocument similarity information in document retrieval systems. Journal of the American Society for Information Science, 37(1):3–11, 1986.
[3] M. A. Hearst and J. O. Pedersen. Reexamining the cluster hypothesis: scatter/gather on retrieval results. In SIGIR '96: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 76–84, New York, NY, USA, 1996. ACM Press.
[4] N. Jardine and C. J. van Rijsbergen. The use of hierarchic clustering in information retrieval. Information Storage and Retrieval, 7:217–240, 1971.
[5] X. Liu and W. B. Croft. Experiments on retrieval of optimal clusters. Technical Report IR-478, University of Massachusetts Amherst, 2006.

And for a regularization way to do it, try Section 8.3.1 here (http://ciir.cs.umass.edu/~fdiaz/fdiaz-dissertation-final.pdf).
-- Anonymous

[2008-08-28 13:07]
The regularization work that you cite always assumes that there is value in the actual distribution of baseline scores (Jeremy talks about this earlier). Regularization fails beautifully if you swap the baseline scores with an arbitrary monotonically decreasing function. And in fact, the black art of getting regularization to work for some baselines is in score pre-processing and normalization. I usually used zero-mean, unit-variance normalization because it's simple to implement, theoretically appealing for regularization, empirically effective for metasearch, and it worked.
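[Editor's note: the zero-mean, unit-variance preprocessing mentioned above is just a z-score over a result list's scores. A minimal sketch, with made-up baseline scores:]

```python
import math

def zscore(scores):
    """Zero-mean, unit-variance normalization of a list of retrieval scores.
    Uses the population variance (divide by n)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    return [(s - mean) / math.sqrt(var) for s in scores]

raw = [4.2, 3.9, 3.1, 1.0, 0.8]  # hypothetical query-likelihood scores
print(zscore(raw))  # mean ~0, variance ~1; note these are not probabilities
```

As the next comment observes, the normalized values can be negative and unbounded, so they never look like probabilities; they only put different baselines on a comparable scale.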
The regularized scores, then, are crazy things that never look like probabilities.
-- Anonymous

[2008-08-28 11:41]
OK, I understand you now. Rare for the search engines to deliver. I was just saying that they have the ability to deliver, if they chose to, and it's just displeasing that they choose not to.

But yes, I now understand your counterpoint as well: even when there are magnitude differences between scores, those might not have any semantic interpretation. I'd been operating under the assumption that there would be something semantically interpretable if you could see the distribution or histogram of the scores returned by a query. Maybe not a strong interpretation, but definitely something the human brain could make use of. Now I realize that's just an assumption.

Thanks for reminding me of Fernando's work. I'd read a shorter version of this regularization idea, but this one looks great: lots more info.

But there has to be other "HCIR" work out there looking at this sort of thing? I do remember seeing some systems that actually exposed their (unregularized) scores, that did offer that transparency. But I wonder if anyone has done the HCI experiments to see how useful that information really is. You couldn't measure it through clickthroughs or something like that, right? Because it's not necessarily something that is designed to increase your clickthrough.
So you'd have to measure it some other way.

Thoughts?
-- Anonymous

[2008-08-27 17:12]
Jeremy, I meant rare for the algorithms to deliver more than a rank ordering.

Here's an excerpt from a paper by Fernando Diaz (http://ciir.cs.umass.edu/~fdiaz/LSR-IR.pdf):

"A ranked retrieval model assigns some rank or score to each document in the collection and ranks documents according to the score. The user then scans the documents according to the ranking. The score function for a ranked retrieval model maps documents to real values. Given a query, the model provides a function from documents to scores. The values of the function provide the desired ranking of the documents."

Fernando goes on to examine the behavior of score functions for ranked retrieval models. But I want to emphasize his point that ranked retrieval models only use scores to obtain a ranking. They don't promise anything about the scores themselves, e.g., whether the magnitude of a difference in score has a meaningful interpretation. Indeed, Fernando's paper is an attempt to obtain useful scores from any scoring approach through regularization.

Re Yahoo: I looked at the BOSS Custom page (http://developer.yahoo.com/search/boss/custom.html). It sounds interesting, but delightfully vague.
I don't blame them; supporting this sort of customization is a serious endeavor.
-- Daniel Tunkelang

[2008-08-27 11:47]
"But I think it's rare for a modern information retrieval model to deliver more than a rank order of results."

Rare to deliver to the consumer? Or rare to have computed internally?

If they're not computing some sort of numerical score, then how are they doing the ranking? They're certainly not classifying each result into the categories "apple", "banana", and "pear" (and, as everybody knows, pears come before bananas), right? So they must have some sort of number. And even if it's rare to expose that number to the end user, don't they still have that number, somewhere, in the system?

"Certainly I don't see the TREC evaluations taking any such scoring into account, which would be straightforward if, say, relevance scores were returned as probabilities."

Well, probabilities would be nice, but they don't have to be probabilities. They just have to be magnitude-differentiated ordinals of some sort. Suppose, for example, the search engine were some sort of metasearch doing Borda-count ranked-list fusion, i.e., the sum of k - rank(d, E), where k is a constant (say, 1000) and rank(d, E) is the rank of document (or web page) d returned by search engine E. Take 5 different search engines, figure out at what rank each engine places a document, and sum across all the engines. (If a document is not returned by an engine, set its rank to k, so that its contribution is 0.) Then sort or rank by the sum total.

At the end of that process, you'll have an integer-valued sum of integers. No probabilities, no real-valued numbers.
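[Editor's note: the Borda-count fusion described above can be written down directly. The ranked lists here come from three hypothetical engines, purely for illustration.]

```python
# Sketch: Borda-count ranked-list fusion as described in the comment.
# Each engine E contributes k - rank(d, E); a document missing from an
# engine gets rank k, i.e. contributes 0.

def borda_fuse(ranked_lists, k=1000):
    totals = {}
    all_docs = {d for lst in ranked_lists for d in lst}
    for doc in all_docs:
        score = 0
        for lst in ranked_lists:
            rank = lst.index(doc) + 1 if doc in lst else k
            score += k - rank
        totals[doc] = score
    # Sort by the integer-valued sum, highest first.
    return sorted(totals.items(), key=lambda kv: -kv[1])

engines = [["a", "b", "c"],
           ["b", "a", "d"],
           ["a", "c", "b"]]
print(borda_fuse(engines))
# [('a', 2996), ('b', 2994), ('c', 1995), ('d', 997)]
```

Note the gaps between the fused scores: "a" and "b" are close (2996 vs. 2994), while "c" trails far behind, which is exactly the magnitude information the comment argues users could see.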
But there will still be a gap, an integer difference, between the rank-1 doc and the rank-2 doc, between rank 7 and rank 8, etc.

And so you'll still be able to see when a Borda-count fusion score goes from 4,235 to 4,227, versus when a score goes from 4,227 to 2,506.

So this sort of information has to exist in the engine, in some form, right? Some sort of magnitude-differentiated ordinal? Or are modern search engines sorting without any kind of magnitude? Is that even possible? Even when I'm summing ranks, and each individual sublist starts out with no difference in magnitude from one ranked result to the next, the sum starts to develop magnitude differences. Is it really that different when combining multi-word queries?

"Re Yahoo: I'd be delighted if they offered the ability to change the overall sort. I'm very skeptical, since allowing that degree of control could have significant implications for the way they'd have to optimize their data structures. I know a little about this stuff from my work at Endeca. :-)"

From talking with the Yahoo guys about their BOSS Custom offering: they will indeed go out and do some customized index work for you. I'm just passing along what they say.
-- Anonymous

[2008-08-26 20:49]
Well, if the scoring has meaning beyond rank order, I'm all for exposing it. But I think it's rare for a modern information retrieval model to deliver more than a rank order of results. Certainly I don't see the TREC evaluations taking any such scoring into account, which would be straightforward if, say, relevance scores were returned as probabilities.

Re Yahoo: I'd be delighted if they offered the ability to change the overall sort.
I'm very skeptical, since allowing that degree of control could have significant implications for the way they'd have to optimize their data structures. I know a little about this stuff from my work at Endeca. :-)
-- Daniel Tunkelang

[2008-08-26 12:03]
@David F.: "I seem to remember the relevance ranking, and tools like Verity would allow you to display it. However, as Daniel says, this is already some sort of a ranking/ordering system implying a sort order."

Yes, I know it does imply some sort of order, but I am basically reacting to this statement from Daniel, which says that if you're going to order the results anyway, you should give the user a way of distinguishing more-relevant from less-relevant:

"Displaying a random subset of the set of matching documents to the user should be a plausible behavior, even if it is not as good as displaying the top-ranked matches. In other words, relevance ranking should help distinguish more relevant results from less relevant results, rather than distinguishing relevant results from irrelevant results."

And one (simple, old) way of doing that is simply to show the final overall score for each document. That way you can see whether the step from result 3 to result 4 is a score change from 0.947 to 0.921, or a change from 0.947 to 0.223. Being able to see that precipitous drop-off from rank 3 to rank 4 would let you see exactly where "more relevant" starts to become "less relevant".

And regarding the Yahoo BOSS API: realize that there is more than one API. There is the standard BOSS API, but there are also a "BOSS Academic" API and a "BOSS Custom" API.
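[Editor's note: the drop-off described above (0.947 to 0.921 versus 0.947 to 0.223) is easy to surface mechanically. A sketch with hypothetical scores:]

```python
# Sketch: find where "more relevant" ends by locating the largest gap
# between consecutive scores. The score values are invented.

def biggest_drop(scores):
    """Return the 1-based rank after which the largest score gap occurs,
    together with the gap size. Assumes scores are in descending order."""
    gaps = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    i = max(range(len(gaps)), key=gaps.__getitem__)
    return i + 1, gaps[i]

scores = [0.961, 0.953, 0.947, 0.223, 0.219]  # hypothetical result scores
rank, gap = biggest_drop(scores)
print(rank, gap)  # the cliff sits right after rank 3
```

A UI could use this to draw a visual break in the result list, or an engine could use it as a simple adaptive cutoff instead of a fixed top-k.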
I've talked with a few of the guys at Yahoo, and I have the impression that the other two versions of the API (Academic and Custom) do allow you to change how the set is actually produced, rather than only reorder results once you've got them back.
-- Anonymous

[2008-08-26 10:48]
I believe that Yahoo still returns what they believe is the right "set". You only get to reorder the results, not change how the set is obtained.
-- David Fauth

[2008-08-26 10:32]
Yes, you are right. Only the former seems to be possible. I misinterpreted the previous comments on this thread.
-- Anonymous

[2008-08-26 08:25]
Sérgio, do you mean that you can retrieve the top 50 according to Yahoo's relevance ranking strategy and re-order those, or that you can change the sort order used to determine what the top 50 are? My sense is that you can only do the former, but maybe I'm missing something in the documentation.
-- Daniel Tunkelang

[2008-08-26 05:04]
Hi Daniel,

With Yahoo! BOSS you can re-order the results returned by the API.
It is one of the highlighted features at http://developer.yahoo.com/search/boss/.
-- Anonymous

[2008-08-26 00:07]
David, I'm pretty sure that Yahoo doesn't let you change the ranking. I've never seen any web search API let you do that. But if I'm wrong, please let me know!
-- Daniel Tunkelang

[2008-08-26 00:06]
Jeremy, thanks! As for displaying relevance scores, I think the problem is that most ranking approaches don't actually compute a probability of relevance; all they compute is some score you can use for ranking. That makes it impossible to band results or offer a meaningful histogram.

That said, I'm not sure how well calibrated the probabilities were for the engines that did offer them.
-- Daniel Tunkelang

[2008-08-26 00:01]
Mike, it warms my heart to see you make those particular connections.

Query clarity was actually one of the ideas that made me focus so intently on set retrieval. My colleagues and I at Endeca have taken a query-less approach that evaluates a clarity-like measure for document sets, regardless of how they are obtained. We've implemented lots of applications on top of this measure.

As for pseudo-relevance feedback, I did see a poster at ECIR '08 suggesting ways to optimize the number of documents used for feedback.
I can't find it online, but the title was "Robust Query-specific Pseudo Feedback Document Selection for Query Expansion."
-- Daniel Tunkelang

[2008-08-25 22:02]
Jeremy,

I seem to remember the relevance ranking, and tools like Verity would allow you to display it. However, as Daniel says, this is already some sort of a ranking/ordering system implying a sort order.

Yahoo has BOSS, which allows developers access to their search engine. What I don't know is whether it retrieves only a set, or whether there is an implied ranking with it.

I'm looking forward to the rest of this series.
-- David Fauth

[2008-08-25 17:09]
Daniel,

I'm really beginning to enjoy your blog. I discovered it a few weeks ago, and have slowly been digesting your older posts.

On the subject of relevance ranking helping distinguish more relevant results from less relevant results: I have a vague recollection of some of the early web search engines, circa 1994 or 1995 (Infoseek?), actually displaying their normalized [0..100] relevance scores to the left of every search result. That way you could tell whether the top 10 links were all in the 99-to-92 range, or whether the scores started in the 90s but then quickly tapered off after the 4th or 5th result to the 70s or 40s or whatever.

It is a very simple, very old idea, but no one does it anymore. Why is that? It's very frustrating. It would be much more useful to get a sense of the relevance-ranking score distribution.
It might even be nice to show a binned histogram of the top 1000 retrieved documents, so that you could see not only the range but the distribution of the scores in a summarized, exploratory manner.
-- Anonymous

[2008-08-25 12:35]
Hi Daniel,

Great post!

I wanted to comment on a connection between set and ranked retrieval.

I think people often assume the set retrieval model implicitly, without actually stating it. Two immediate examples I can think of are pseudo-relevance feedback and query clarity. In both models, the top k results are simply assumed to be a relevant set.

I believe that stating the set-retrieval model explicitly might be helpful in understanding both of these models, as well as other retrieval/re-ranking models that utilize a subset of the retrieved results.
-- Michael Bendersky
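[Editor's note: the binned histogram of [0..100] relevance scores suggested in the thread could be sketched like this. The scores are simulated, not drawn from any real engine.]

```python
import random

def ascii_histogram(scores, bins=10, width=40):
    """Bin [0..100] scores and render one text bar per bin,
    so the user can see the score distribution at a glance."""
    counts = [0] * bins
    for s in scores:
        counts[min(int(s * bins / 100), bins - 1)] += 1
    peak = max(counts)
    lines = []
    for i, c in enumerate(counts):
        lo, hi = i * 100 // bins, (i + 1) * 100 // bins
        bar = "#" * (c * width // peak if peak else 0)
        lines.append(f"{lo:3d}-{hi:3d} | {bar} {c}")
    return "\n".join(lines)

# Simulated scores for the top 1000 retrieved documents.
random.seed(0)
scores = [min(100, random.expovariate(1 / 20)) for _ in range(1000)]
print(ascii_histogram(scores))
```

A glance at the bars shows whether the top results cluster tightly in one score band or taper off quickly, which is exactly the exploratory signal the comment asks for.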