This will be the first of several posts on this theme, so I'll start by introducing the terms.
- In a ranked retrieval approach, the system responds to a search query by ranking all documents in the corpus based on its estimate of their relevance to the query.
- In a set retrieval approach, the system partitions the corpus into two subsets of documents: those it considers relevant to the search query, and those it does not.
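To make the contrast concrete, here is a minimal sketch of the two definitions above. Everything in it is an assumption for illustration: the function names, the toy term-overlap score, and the "contains every query term" relevance test are hypothetical stand-ins, not how any particular search engine works.

```python
def term_overlap(query, doc):
    """Toy relevance estimate (an assumption): count of query terms found in the document."""
    doc_terms = set(doc.lower().split())
    return sum(1 for term in set(query.lower().split()) if term in doc_terms)


def contains_all_terms(query, doc):
    """Toy relevance decision (an assumption): the document contains every query term."""
    return term_overlap(query, doc) == len(set(query.lower().split()))


def ranked_retrieval(corpus, query, score):
    """Return every document in the corpus, ordered by estimated relevance, best first."""
    return sorted(corpus, key=lambda doc: score(query, doc), reverse=True)


def set_retrieval(corpus, query, is_relevant):
    """Partition the corpus into (relevant, not relevant) for the query."""
    relevant = [doc for doc in corpus if is_relevant(query, doc)]
    not_relevant = [doc for doc in corpus if not is_relevant(query, doc)]
    return relevant, not_relevant


corpus = [
    "set retrieval partitions the corpus",
    "ranked retrieval orders the corpus",
    "a recipe for lentil soup",
]

# Ranked retrieval: all three documents come back, in score order.
ranking = ranked_retrieval(corpus, "set retrieval", term_overlap)

# Set retrieval: the corpus is split into matches and non-matches.
matches, non_matches = set_retrieval(corpus, "set retrieval", contains_all_terms)
```

Note that the ranked version never says "no": the lentil soup recipe is still in the result list, just at the bottom. The set version commits to a boundary.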
What is set retrieval in practice? In my view, a set retrieval approach satisfies two expectations:
- The number of documents reported to match my search should be meaningful, or at least a meaningful estimate. More generally, any summary information reported about this set should be useful.
- Displaying a random subset of the set of matching documents to the user should be a plausible behavior, even if it is not as good as displaying the top-ranked matches. In other words, relevance ranking should help distinguish more relevant results from less relevant results, rather than distinguishing relevant results from irrelevant results.
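These two expectations can be sketched in a few lines. The matching set, the scores, and the helper names below are all hypothetical. The point is only that a count and a random sample are both sensible operations when every member of the set is presumed relevant.

```python
import random

# Hypothetical matching set: (document id, relevance score) pairs that the
# system has already judged relevant to the query.
matching_set = [("doc-a", 0.91), ("doc-b", 0.84), ("doc-c", 0.77), ("doc-d", 0.62)]


def summarize(matches):
    """Expectation 1: summary information about the set, here just its size."""
    return {"match_count": len(matches)}


def random_page(matches, page_size, seed=None):
    """Expectation 2: a random subset of the matches should be a plausible display."""
    rng = random.Random(seed)
    return rng.sample(matches, min(page_size, len(matches)))


def top_ranked_page(matches, page_size):
    """Ranking within the set: prefer more relevant matches over less relevant ones."""
    return sorted(matches, key=lambda pair: pair[1], reverse=True)[:page_size]


summary = summarize(matching_set)
sampled = random_page(matching_set, page_size=2, seed=0)
best = top_ranked_page(matching_set, page_size=2)
```

If `random_page` would embarrass the system because the matching set is padded with irrelevant documents, then the reported count is not meaningful either, and the system is really doing ranked retrieval with a cutoff.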