Friday, May 16, 2008

A Utilitarian View of IR Evaluation

In many information retrieval papers that propose new techniques, the authors validate those techniques by demonstrating improved mean average precision over a standard test collection. The value of such results--at least to a practitioner--hinges on whether mean average precision correlates to utility for users. Not only do user studies place this correlation in doubt, but I have yet to see an empirical argument defending the utility of average precision as an evaluation measure. Please send me any references if you are aware of them!
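For concreteness, here is a minimal sketch of how mean average precision is typically computed over a test collection (my own illustration, not code from any particular paper or toolkit):

    # Minimal sketch of (mean) average precision, assuming binary relevance judgments.
    def average_precision(ranking, relevant):
        # ranking: document ids in ranked order; relevant: set of judged-relevant ids
        hits, total = 0, 0.0
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                total += hits / rank  # precision at this relevant document's rank
        return total / len(relevant) if relevant else 0.0

    def mean_average_precision(queries):
        # queries: list of (ranking, relevant-set) pairs, one per topic
        return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

Whether a higher value of this number translates into a better experience for real users is precisely the correlation I'm questioning.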

Of course, user studies are fraught with complications, the most practical one being their expense. I'm not suggesting that we need to replace Cranfield studies with user studies wholesale. Rather, I see the purpose of user studies as establishing the utility of measures that can then be evaluated by Cranfield studies. As with any other science, we need to work with simplified, abstract models to achieve progress, but we also need to ground those models by validating them in the real world.

For example, consider the scenario where a collection contains no documents that match a user's need. In this case, it is ideal for the user to reach this conclusion as accurately, quickly, and confidently as possible. Holding the interface constant, are there evaluation measures that correlate to how well users perform on these three criteria? Alternatively, can we demonstrate that some interfaces lead to better user performance than others? If so, can we establish measures suitable for those interfaces?

The "no documents" case is just one of many real-world scenarios, and I don't mean to suggest we should study it at the expense of all others. That said, I think it's a particularly valuable scenario that, as far as I can tell, has been neglected by the information retreival community. I use it to drive home the argument that practical use cases should drive our process of defining evaluation measures.

7 comments:

Anonymous said...

Oh, where to begin!

The Turpin user study you cite, "User performance versus precision measures for simple search tasks", is interesting but FAR from a definitive nail in the coffin for MAP. Two major "flaws" (I really hesitate to use that word) I see -- first, the MAP of the systems they're testing is in the range of 0.7-0.9 (if I remember correctly), and they aim to show that there's no noticeable difference in the systems from the users' point of view. This MAP range is really well above the range of MAP scores we see in TREC-style evaluations, which is usually closer to 0.3-0.5 for most tasks. Second, measuring user satisfaction in a controlled lab environment with search tasks dictated by someone other than the users is... well... unsatisfying. I'm sure I don't need to convince you of that.

The real issue, IMHO:

Relevance is not only subjective, but it's also just part of the picture when talking about utility & user satisfaction. The Cranfield/TREC methodology allows for subjectivity -- queries and relevance judgements are given by a single person, with the intent that a relevant document is relevant for that assessor, not absolutely relevant for everyone. However, Cranfield and evaluation measures like MAP only look at relevance, not all the other factors that make a system truly useful: authority, diversity, recency, and many others.

It has been useful for us IR researchers to focus on relevance -- retrieving relevant documents is undoubtedly the leading indicator of the effectiveness of a system. Although I don't think we've solved this problem, we all need to be aware that it is not the only criterion for a successful IR system.

Anonymous said...

Correction: my recollection of systems tested within the MAP range of 0.7-0.9 was a little off: they evaluated systems with MAP values of 0.55, 0.65, ... 0.95. This is still well above the best-performing ad hoc systems at TREC, which typically score in the 0.3-0.35 range.

What implication does this have? Hard to say for sure, but it's roughly the difference between systems having a relevant document in the top two ranks 75% of the time vs. 30% of the time (assuming 1-2 relevant documents per query). AP, like Reciprocal Rank, is very sensitive at the high end of the range, so a difference in MAP between 0.85 and 0.95 is due to very small perturbations in the rankings. Differences between 0.25 and 0.35, on the other hand, require a much larger shuffle of the document ranks.
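To make the sensitivity point concrete, here's a back-of-the-envelope sketch (my own toy numbers, simplified to a single relevant document per query, not figures from the Turpin paper). With one relevant document, AP reduces to 1/rank of that document:

    # With exactly one relevant document, AP = 1 / (rank of that document).
    for rank in (1, 2, 3, 4):
        print(rank, round(1.0 / rank, 2))
    # rank 1 -> 1.0, rank 2 -> 0.5  : a one-position slip near the top costs 0.5 AP
    # rank 3 -> 0.33, rank 4 -> 0.25: the same slip further down costs only ~0.08 AP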

Daniel Tunkelang said...

Jon, I'll concede that just because I like the Turpin study doesn't mean it's right, and I appreciate you keeping me honest. The end doesn't justify the means.

Still, I can't leave your assertion that "retrieving relevant documents is undoubtedly the leading indicator of the effectiveness of a system" unchallenged. My concern is not just the subjective nature of relevance assessment--if that were the only issue, then we could see the average judgment of assessors as an "objective" relevance measure. The problem is the subjective nature of relevance itself. Ultimately, relevant information is just whatever information I want / need right now.

Anonymous said...

"The problem is the subjective nature of relevance itself. "

I agree that this is part of the problem, Daniel. That's the reason why the relevance assessor is the same person who has the information need and crafts the query/topic. There is no absolute (topical) relevance. A measure like AP is evaluating how well a system returns relevant documents according to one person's idea of relevance at the time the assessment was made. There always are and always will be disagreements between assessors, but that's the point (and what makes our job interesting)!

Assessors are typically not asked to penalize redundant documents, documents with misspellings or grammatical errors, documents from known unreliable sources, etc. These are all facets of what a user may consider useful or not... but they are not traditionally considered aspects of topical relevance.

Relevance is a necessary but not sufficient criterion for utility.

Daniel Tunkelang said...

I agree that, at least in many circumstances, I'm interested in retrieving documents that are relevant to an information need. I'll put aside the use case I described, where the optimal outcome is to quickly ascertain that no such documents exist.

But there is a difference between a task goal and a query goal. As I perform a sequence of queries to meet my information need, I'm not necessarily concerned with retrieving relevant documents on each query. Rather, I'd like to be learning how to improve my query strategy to ultimately complete my task as successfully as possible.

So yes, relevance is a necessary criterion. But not necessarily in the way that the Cranfield/TREC experiments measure it.

Anonymous said...

...accurately, quickly, and confidently as possible. Holding the interface constant, are there evaluation measures that correlate to how well users perform on these three criteria? ...

I think measures that factor in time as well as the amount of relevant material retrieved will be useful to evaluate at least the first two criteria you have mentioned. You might want to look at papers by Mika Kaki, especially this one. Another interesting read would be this SIGIR 2007 poster on correlations between IR measures and user satisfaction.

Daniel Tunkelang said...

Giri, thanks for the links! While I'm intrigued by the user satisfaction measures in the SIGIR '07 poster, I'm more interested in the objective measures of task effectiveness and efficiency. The Kaki paper seems more on target, especially for evaluating a recall-oriented task.
