Wednesday, June 11, 2008
How Google Measures Search Quality
The executive summary: rather than relying on click-through data to judge quality, Google employs armies of raters who manually rate search results for randomly selected queries using different ranking algorithms. These manual ratings drive the evaluation and evolution of Google's ranking algorithms.
I'm intrigued that Google seems to wholeheartedly embrace the Cranfield paradigm. Of course, they don't publicize their evaluation measures, so perhaps they're optimizing something more interesting than mean average precision (sketched below for reference).
More questions for Amit. :)
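For readers who haven't bumped into it, mean average precision is the workhorse Cranfield-style metric: for each query, sum the precision at every rank where a relevant document appears, divide by the number of relevant documents, then average across queries. A minimal sketch in Python; the function names and toy data are mine, purely for illustration, and say nothing about what Google actually computes:

```python
def average_precision(ranking, relevant):
    """Average precision for one query: the sum of precision@k over the ranks k
    at which a relevant document is retrieved, divided by |relevant|."""
    hits, precision_sum = 0, 0.0
    for k, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """queries maps each query to (ranked doc ids, set of relevant doc ids)."""
    return sum(average_precision(r, rel) for r, rel in queries.values()) / len(queries)

# Toy example: relevant docs are {A, C}; the system returns [A, B, C].
print(average_precision(["A", "B", "C"], {"A", "C"}))  # (1/1 + 2/3) / 2 = 0.833...
```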
5 comments:
- Max L. Wilson said...
-
my goodness that must be a brilliant dataset to have.
- June 12, 2008 at 1:32 AM
- Daniel Tunkelang said...
-
Indeed, but I'd still prefer the usage data. It boggles my mind to think they're not taking full advantage of it to test hypotheses and improve user experience, particularly when it comes to crowd-sourcing relevance assessment. I can't believe that Peter Norvig and colleagues would be that profligate.
- June 12, 2008 at 3:02 PM
-
-
I have no doubt that they take advantage of user data when appropriate (and I strongly doubt that they're using average precision; more likely they're using variations of DCG/NDCG). But ultimately you're still going to need raters to find those documents that are relevant but that don't show up for any query users typically think of. Especially if you want to get at questions about recall, there are always going to be relevant results that are only retrieved by oddball algorithms that could never be deployed---even to a small subset of users---because they perform far too poorly on average.
- June 16, 2008 at 10:39 PM
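One concrete way to read the point above about oddball algorithms is TREC-style pooling: collect the top results from several different runs and have raters judge the whole pool, so relevant documents the production ranker never surfaces still get judged. A minimal sketch, with made-up run names and pool depth:

```python
def build_judgment_pool(runs, depth=20):
    """Union of the top-`depth` results from each run for one query.
    Raters judge everything in the pool, which surfaces relevant documents
    that no deployable ranker (or typical user) would ever expose."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool

# Hypothetical runs for a single query: a production ranking plus an
# experimental recall-heavy ranking that would be too poor to ship.
runs = {
    "production": ["d1", "d2", "d3"],
    "recall_heavy_experiment": ["d9", "d4", "d1"],
}
print(sorted(build_judgment_pool(runs)))  # ['d1', 'd2', 'd3', 'd4', 'd9']
```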
- Daniel Tunkelang said...
-
I agree that it makes more sense that they'd be using a DCG-like measure, particularly if, as someone suggested to me, they dramatically discount results after rank 3.
But I still don't buy the need for raters. Unless they're just that unwilling to jeopardize user experience for sufficiently many queries to test a particular algorithm or parameter setting--but that seems odd given their volume.
Also, while I agree that evaluating recall is much harder than evaluating precision, I wasn't even aware that Google cared much about recall. Isn't that directly at odds with their use of measures like DCG to measure search quality?
- June 17, 2008 at 2:27 PM
-
-
Even NDCG has a recall component — the 'N'. The normalization requires some sense of how many relevant documents there are in the collection per query. Without this, it's impossible to compare performance across queries.
They must be using all that user data somehow. But, I think the original post was really making reference to judging the quality of different ranking algorithms exclusively with click-through usage data. As we all know, this is extremely shallow and noisy. So much so that the Datawocky post seems to suggest that patterns don't emerge to distinguish between ranking algorithms using this data alone.
Think of it this way: usage data may tell you whether swapping documents at positions 2 and 3 is good or bad. But, it may not — this is such a minimal change in the ranking that users may not notice or this will just be lost in the noise. There may be a much bigger benefit for a particular query swapping documents at positions 2 and 12. But, you'll never get this information from click-through data alone since users don't typically click past the first 10 results. Thus, there is a need for deeper relevance assessment that goes beyond what a typical user sees.
In my experience using machine learning for document ranking, relevance judgements on somewhere between 20 and 50 documents per query are reasonable for training. Again, this is many more documents than a typical user would look at for any query.
- June 17, 2008 at 3:47 PM
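To make the point about the 'N' concrete, here is a minimal DCG/NDCG sketch. The graded 0-3 labels, the 2^rel - 1 gain, and the log2 discount are common textbook conventions, not anything Google has disclosed:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for a ranked list of graded relevance labels."""
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(retrieved_relevances, all_judged_relevances):
    """Normalize by the ideal DCG over *all* judged documents for the query,
    which is exactly where knowing how many relevant documents exist matters."""
    ideal = dcg(sorted(all_judged_relevances, reverse=True))
    return dcg(retrieved_relevances) / ideal if ideal > 0 else 0.0

# Toy example: the system's top 3 carry labels [3, 0, 2], but the judged pool
# for this query also contains an unretrieved document labeled 2.
print(ndcg([3, 0, 2], [3, 2, 2, 0]))  # about 0.82; below 1.0 because of the missed doc
```

If the judging pool contained only the three retrieved documents, the score would come out misleadingly high, which is the commenter's point about needing judgments that go deeper than what a typical user sees.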