Monday, July 28, 2008

Not as Cuil as I Expected

Today's big tech news is the launch of Cuil, the latest challenger to Google's hegemony in Web search. Given the impressive team of Xooglers that put it together, I had high expectations for the launch.

My overall reaction: not bad, but not good enough to take seriously as a challenge to Google. They may be "The World's Biggest Search Engine" based on the number of pages indexed, but they return zero results for a number of queries where Google does just fine, including noisy channel blog (compare to Google). But I'm not taking it personally--after all, their own site doesn't show up when you search for their name (again, compare to Google). As for their interface features (column display, explore by category, query suggestions), they're fine, but neither the concepts nor the quality of their implementation strike me as revolutionary.

Perhaps I'm expecting too much on day 1. But they're not just trying to beat Gigablast; they're trying to beat Google, and they surely expected to get lots of critical attention the moment they launched. Regardless of the improvements they've made in indexing, they clearly need to do more work on their crawler. It's hard to judge the quality of results when it's clear that at least some of the problem is that the most relevant documents simply aren't in their index. I'm also surprised not to see Wikipedia documents showing up much for my searches--particularly for searches where I'm quite sure the most relevant document is in Wikipedia. Again, it's hard to tell whether this is an indexing issue or a results quality issue.

I wish them luck--I speak for many in my desire to see Google face worthy competition in web search.

Sunday, July 27, 2008

Catching up on SIGIR '08

Now that SIGIR '08 is over, I hope to see more folks blogging about it. I'm jealous of everyone who had the opportunity to attend, not only because of the culinary delights of Singapore, but because the program seems to reflect an increasing interest of the academic community in real-world IR problems.

Some notes from looking over the proceedings:
  • Of the 27 paper sessions, 2 include the word "user" in their titles, 2 include the word "social", 2 focus on Query Analysis & Models, and 1 is about exploratory search. Compared to the last few SIGIR conferences, this is a significant increase in focus on users and interaction.

  • A paper on whether test collections predict users' effectiveness offers an admirable defense of the Cranfield paradigm, much along the lines I've been advocating.

  • A nice paper from Microsoft Research looks at the problem of whether to personalize results for a query, recognizing that not all queries benefit from personalization. This approach may well be able to reap the benefits of personalization while avoiding much of its harm.

  • Two papers on tag prediction: Real-time Automatic Tag Recommendation (ACM Digital Library subscription required) and Social Tag Prediction. Semi-automated tagging tools are one of the best ways to leverage the best of both human and machine capabilities.
And I haven't even gotten to the posters! I'm sad to see that they dropped the industry day, but perhaps they'll bring it back next year in Boston.

Wednesday, July 23, 2008

Knol: Google takes on Wikipedia

Just a few days ago, I was commenting on a New York Times article about Wikipedia's new approval system that the biggest problem with Wikipedia is anonymous authorship. By synchronous coincidence, Google unveiled Knol today, which is something of a cross between Wikipedia and Squidoo. Its most salient feature is that each entry will have a clearly identified author. They even allow authors to verify their identities using credit cards or phone directories.

It's a nice idea, since anonymous authorship is a major factor in the adversarial nature of information retrieval on the web. Not only does the accountability of authorship inhibit vandalism and edit wars, but it also allows readers to decide for themselves whom to trust--at least to the extent that readers are able and willing to obtain reliable information about the authors. Without question, they are addressing Wikipedia's biggest weakness.

But it's too little, too late. Wikipedia is already there. And, despite complaints about its inaccuracy and bias, Wikipedia is a fantastic, highly utilized resource. The only way I see for Knol to supplant Wikipedia in a reasonable time frame is through a massive cut-and-paste to make up for the huge difference in content.

Interestingly, Wikipedia does not seem to place any onerous restrictions on verbatim copying. However, unless a single author is 100% responsible for authoring a Wikipedia entry, it isn't clear that anyone can simply copy the entry into Knol.

I know that it's dangerous to bet against Google. But I'm really skeptical about this latest effort. It's a pity, because I think their emphasis is the right one. But for once I wish they'd been a bit more humble and accepted that they aren't going to build a better Wikipedia from scratch.

Saturday, July 19, 2008

Predictably Irrational

As regular readers have surely noticed by now, I've been on a bit of a behavioral psychology kick lately. Some of this reflects long-standing personal interest and my latest reading. But I also feel increasingly concerned that researchers in information seeking--especially those working on tools--have neglected the impact of cognitive bias.

For those who are unfamiliar with the last few decades of research in this field, I highly recommend a recent lecture by behavioral economist Dan Ariely on predictable irrationality. Not only is he a very informative and entertaining speaker, but he chooses very concrete and credible examples, starting with his contemplating how we experience pain based on his own experience of suffering
third-degree burns over 70 percent of his body. I promise you, the lecture is an hour well spent, and the time will fly by.

A running theme through this and my other posts on cognitive bias is that the way information is presented to us has dramatic effects on how we interpret that information.

This is great news for anyone who wants to manipulate people. In fact, I once asked Dan about the relative importance of people's inherent preferences vs. those induced by presentation on retail web sites, and he all but dismissed the former (i.e., you can sell ice cubes to Eskimos if you can manipulate their cognitive biases appropriately). But it's sobering news for those of us who want to empower users to evaluate information objectively to support decision making.

Friday, July 18, 2008

Call to Action - A Follow-Up

The call to action I sent out a couple of weeks ago has generated healthy interest.

One of the several people who responded is the CTO of one of Endeca's competitors, whom I laud for understanding that the need to better articulate and communicate the technology of information access transcends competition among vendors. While we have differences on how to achieve this goal, I at least see hope from his responsiveness.

The rest were analysts representing some of the leading firms in the space. They not only expressed interest, but also contributed their own ideas on how to make this effort successful. Indeed, I met with two analysts this week to discuss next steps.

Here is where I see this going.

For any effort to communicate the technology of information access to be effective, the forum has to establish credibility as both vendor-neutral and analyst-neutral. Ideally, that means having at least two major vendors and two major analysts on board. What we want to avoid is having only one major vendor or analyst, since that would create a reasonable perception of bias.

I'd also like to involve academics in information retrieval and library and information science. As one of the analysts suggested, we could reach out to the leading iSchools, who have expressed an open interest in engaging the broader community.

What I'd like to see come together is a forum, probably a one-day workshop, that brings together credible representatives from the vendor, analyst, and academic communities. With a critical mass of participants and enough diversity to assuage concerns of bias, we can start making good on this call to action.

Tuesday, July 15, 2008

Beyond a Reasonable Doubt

In Psychology of Intelligence Analysis, Richards Heuer advocates that we quantify expressions of uncertainty: "To avoid ambiguity, insert an odds ratio or probability range in parentheses after expressions of uncertainty in key judgments."

His suggestion reminds me of my pet peeve about the unquantified notion of reasonable doubt in the American justice system. I've always wanted (but never had the opportunity) to ask a judge what probability of innocence constitutes a reasonable doubt.

Unfortunately, as Heuer himself notes elsewhere in his book, we human beings are really bad at estimating probabilities. I suspect (with a confidence of 90 to 95%) that quantifying our uncertainties as probability ranges will only convey a false sense of precision.

So, what can we do to better communicate uncertainty? Here are a couple of thoughts:
  • We can calibrate estimates based on past performance (see the sketch after this list). It's unclear what will happen if people realize that their estimates are being translated, but, at worst, it feels like good fodder for research in judgment and decision making.

  • We can ask people to express relative probability judgments. While these are also susceptible to bias, at least they don't demand as much precision. And we can always vary the framing of questions to try to factor out the cognitive biases they induce.
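
To make the first thought concrete, here is a minimal Python sketch of calibration against past performance. The analyst's track record below is invented, and real calibration studies worry far more carefully about sample sizes and binning; this just shows the mechanics of translating stated confidence into an observed hit rate.

```python
# A minimal sketch of calibrating stated confidences against past performance.
# The track record below is hypothetical, purely for illustration.
from collections import defaultdict

def calibration_table(history):
    """history: list of (stated_confidence, was_correct) pairs, was_correct in {0, 1}.
    Groups statements into bins of width 0.1 and returns the observed hit rate per bin."""
    bins = defaultdict(list)
    for stated, correct in history:
        bins[round(stated, 1)].append(correct)
    return {b: sum(v) / len(v) for b, v in sorted(bins.items())}

def calibrate(stated, table):
    """Translate a stated confidence into the empirical hit rate for its bin,
    falling back to the stated value when the bin has no history."""
    return table.get(round(stated, 1), stated)

# Hypothetical analyst who says "90%" more often than the evidence warrants.
history = [(0.9, 1), (0.9, 0), (0.9, 1), (0.9, 0), (0.6, 1), (0.6, 1), (0.6, 0)]
table = calibration_table(history)
print(calibrate(0.9, table))   # 0.5: this analyst's "90 percents" have panned out half the time
print(calibrate(0.6, table))   # roughly 0.67
```
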
Also, when we talk about uncertainty, it is important to distinguish between aleatory and epistemic uncertainty.

When I flip a coin, I am certain it has a 50% chance of landing heads, because I know the probability distribution of the event space. This is aleatory uncertainty, and forms the basis of probability and statistics.

But when I reason about less contrived uncertain events, such as estimating the likelihood that my bank will collapse this year, the challenge is my ignorance of the probability distribution. This is epistemic uncertainty, and it's a lot messier.

If you'd like to learn more about aleatory and epistemic uncertainty, I recommend Nassim Nicholas Taleb's Fooled by Randomness (which is a better read than his better-known Black Swan).

In summary, we have to accept the bad news that the real world is messy. As a mathematician and computer scientist, I've learned to pursue theoretical rigor as an ideal. Like me, you may find it very disconcerting to not be able to treat all real-world uncertainty in terms of probability spaces. Tell it to the judge!

Sunday, July 13, 2008

Small is Beautiful

Today's New York Times has an article by John Markoff called On a Small Screen, Just the Salient Stuff. It argues that the design constraints of the iPhone (and of mobile devices in general) lead to an improved user experience, since site designers do a better job of focusing on the information that users will find relevant.

Of course, on a blog entitled The Noisy Channel, I can't help praising approaches that strive to improve the signal-to-noise ratio in information seeking applications. And I'm glad to see them quoting Ben Shneiderman, a colleague of mine at the University of Maryland who has spent much of his career focusing on HCIR issues.

Still, I think they could have taken the idea much further. Their discussion of more efficient or ergonomic use of real estate boils down to stripping extraneous content (a good idea, but hardly novel), and making sites vertically oriented (i.e., no horizontal scrolling). They don't consider the question of what information is best to present in the limited space--which, in my mind, is the most important question to consider as we optimize interaction. Indeed, many of the questions raised by small screens also apply to other interfaces, such as voice.

Perhaps I am asking too much to expect them to call out the extreme inefficiency of ranked lists, compared to summarization-oriented approaches. Certainly the mobile space opens great opportunities for someone to get this right on the web.

Friday, July 11, 2008

Psychology of Intelligence Analysis

In the course of working with some of Endeca's more interesting clients, I started reading up on how the intelligence agencies address the challenges of making decisions, especially in the face of incomplete and contradictory evidence. I ran into a book called Psychology of Intelligence Analysis by former CIA analyst Richards Heuer. The entire book is available online, or you can hunt down a hard copy of the out-of-print book from your favorite used book seller.

Given the mixed record of the intelligence agencies over the past few decades, you might be wondering if the CIA is the best source for learning how to analyze intelligence. But this book is a gem. Even if the agencies don't always practice what they preach (and the book makes a good case as to why), the book is an excellent tour through the literature on judgment and decision making.

If you're already familiar with work by Herb Simon, Danny Kahneman, and Amos Tversky, then a lot of the ground he covers will be familiar--especially the third of the book that enumerates cognitive biases. I'm a big fan of the judgment and decision making literature myself. But I still found some great nuggets, particularly Chapter 8 on Analysis of Competing Hypotheses. Unlike most of the literature, which focuses exclusively on demonstrating our systematic departures from rationality, Heuer hopes to offer at least some constructive advice.

As someone who builds tools to help people make decisions using information that not only may be incomplete and contradictory, but also challenging to find in the first place, I'm very sensitive to how people's cognitive biases affect their ability to use these tools effectively. One of the HCIR '07 presentations by Jolie Martin and Michael Norton (who have worked with Max Bazerman) showed how the manner in which information was partitioned on retail web sites drove decisions, i.e., re-organizing the same information affected consumers' decision processes.

It may be tempting for us on the software side to wash our hands of our users' cognitive biases. But such an approach would be short-sighted. As Heuer shows in his well-researched book, people not only have cognitive biases, but are unable to counter those biases simply by being made aware of them. Hence, if software tools are to help people make effective decisions, it is the job of us tool builders to build with those biases in mind, and to support processes like Analysis of Competing Hypotheses that try to compensate for human bias.
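
Since I mention Analysis of Competing Hypotheses, here is a toy sketch of its central scoring step. Heuer's method involves more than this (assessing the diagnosticity of evidence, sensitivity analysis, and so on), and the hypotheses and evidence below are invented, but it conveys the key move: score every hypothesis against all of the evidence at once, and pay the most attention to evidence that argues against a hypothesis.

```python
# Toy sketch of the scoring step in Analysis of Competing Hypotheses (ACH).
# The matrix is invented for illustration; it is not Heuer's full procedure.
# Ratings: -1 = inconsistent, 0 = neutral or ambiguous, +1 = consistent.
evidence_matrix = {
    "H1: outage caused by a config change":    {"e1": +1, "e2": -1, "e3": 0},
    "H2: outage caused by a hardware fault":   {"e1": 0,  "e2": +1, "e3": +1},
    "H3: outage caused by an external attack": {"e1": -1, "e2": -1, "e3": 0},
}

def inconsistency_score(ratings):
    """ACH ranks hypotheses by how much evidence argues against them,
    not by how much evidence appears to confirm them."""
    return sum(1 for r in ratings.values() if r < 0)

ranked = sorted(evidence_matrix.items(), key=lambda kv: inconsistency_score(kv[1]))
for hypothesis, ratings in ranked:
    print(f"{hypothesis}: {inconsistency_score(ratings)} piece(s) of inconsistent evidence")
```
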

Thursday, July 10, 2008

Nice Selection of Machine Learning Papers

John Langford just posted a list of seven ICML '08 papers that he found interesting. I appreciate his taste in papers, and I particularly liked a paper on Learning Diverse Rankings with Multi-Armed Bandits that addresses learning a diverse ranking of documents based on users' clicking behavior. If you liked the Less is More work that Harr Chen and David Karger presented at SIGIR '06, then I recommend you check this one out.
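
For readers who want a feel for the idea, here is a much-simplified sketch in the spirit of ranked bandits: one epsilon-greedy bandit per rank position, each learning from which document earns the first click. This is my paraphrase of the general approach, not the authors' algorithm, and the simulated click model is invented.

```python
# Simplified ranked-bandits sketch: one epsilon-greedy bandit per rank position.
# The user model and all constants are invented for illustration.
import random

DOCS = ["d1", "d2", "d3", "d4", "d5"]
K = 3           # depth of the ranking we learn
EPSILON = 0.1   # exploration rate

# counts[rank][doc] = [clicks, impressions] for doc shown at that rank
counts = [{d: [0, 0] for d in DOCS} for _ in range(K)]

def click_rate(rank, doc):
    clicks, impressions = counts[rank][doc]
    return clicks / impressions if impressions else 0.0

def pick(rank, already_shown):
    candidates = [d for d in DOCS if d not in already_shown]
    if random.random() < EPSILON:
        return random.choice(candidates)
    return max(candidates, key=lambda d: click_rate(rank, d))

def simulated_first_click(ranking):
    """Invented user model: 60% of users want d1, 40% want d4; each clicks
    the first position where the wanted document appears."""
    wanted = "d1" if random.random() < 0.6 else "d4"
    return ranking.index(wanted) if wanted in ranking else None

for _ in range(20000):
    ranking = []
    for rank in range(K):
        ranking.append(pick(rank, ranking))
    clicked_rank = simulated_first_click(ranking)
    for rank, doc in enumerate(ranking):
        counts[rank][doc][1] += 1
        if clicked_rank == rank:
            counts[rank][doc][0] += 1

# Greedy read-out of what each per-rank bandit has learned.
learned = []
for rank in range(K):
    learned.append(max((d for d in DOCS if d not in learned),
                       key=lambda d: click_rate(rank, d)))
print(learned)  # tends toward ["d1", "d4", ...]: a ranking that serves both user populations
```
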

Tuesday, July 8, 2008

Librarian 2.0

Many of the words that mark milestones in the history of technology, such as calculator and word processor, originally corresponded to people. Calculating had at least two lives as a technology breakthrough--first as a process, and then as an automatic means for executing that process. Thanks to inventions like calculators and computers, human beings have moved up the value chain to become scientists and engineers who take low-level details for granted.

Similarly, the advances in information science and retrieval have dramatically changed the role of a reference librarian.

Hopefully some of you are old enough to remember card catalogs. They were certainly functional if you knew the exact title or author you were looking for, assuming the title wasn't too generic or the author too prolific. Where card catalogs fell short was in supporting exploratory search. In many cases, your best bet was quite literally to explore the stacks and hope that locality within the Dewey Decimal system sufficed to support your information seeking needs. Alternatively, you could follow citation paths--the dead-tree precursor of surfing a hypertext collection.

For exploratory tasks, library patrons would turn to reference librarians, who would clarify the patrons' needs through a process called the reference interview. According to Wikipedia:
A reference interview is composed of two segments:

1. An initial segment in which the librarian encourages the user to fully discuss the request.
2. A final segment in which the librarian asks questions to relate the request to the materials available in the library

A reference interview is structured (ideally) according to the following series of steps. First the library user states a question or describes a problem. The librarian then clarifies the user's information need, sometimes leading him or her back from a request for a specific resource (which may not be the best one for the problem at hand) to the actual information need as it manifests in the library user's life. Following that, the librarian suggests information resources that address the user's information need, explaining the nature and scope of information they contain and soliciting feedback. The reference interview closes when the librarian has provided the appropriate information or a referral to an outside resource where it can be found, and the user confirms that he or she has received the information needed.
Fast forward to the present day. Thanks to modern search engines, title and author search are no longer tedious processes. Moreover, search engines are somewhat forgiving of users, offering spelling correction and inexact query matching. Libraries are still catching up with advances in technology, but the evolution is clearly under way.

However, search engines have not obviated the need for a reference interview. Excepting the simple cases of known item search, the typical information seeker needs help translating an information need into one or more search queries. And that information need may change as the seeker learns from the process.

But it should come as no surprise that information seeking support systems need to be more than search engines. The ideal information seeking support system emulates a reference librarian, stepping users through a structured process of clarification. Indeed, this is exactly what my colleagues and I at Endeca are trying to do in our work with libraries and more broadly in pursuing a vision of human computer information retrieval.
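
As a crude illustration of what "emulating a reference librarian" might look like in code, here is a minimal sketch of a system that interposes a clarification step between the user's stated request and the query it actually runs. The taxonomy, the prompts, and the stub search function are all invented placeholders, not anything Endeca ships.

```python
# Minimal sketch of a reference-interview-style clarification loop.
# TAXONOMY, the prompts, and search() are invented placeholders.
TAXONOMY = {
    "python": {
        "the programming language": "python programming language",
        "the snake": "python reptile",
    },
}

def search(query):
    return [f"(result for '{query}')"]   # stand-in for a real engine

def reference_interview(request):
    """Segment 1: the user states the request in their own words.
    Segment 2: the system clarifies the request against the senses it
    knows about before suggesting resources."""
    for term, senses in TAXONOMY.items():
        if term in request.lower():
            options = list(senses.items())
            print(f"When you say '{term}', do you mean:")
            for i, (label, _) in enumerate(options, 1):
                print(f"  {i}. {label}")
            choice = int(input("> ")) - 1
            return search(options[choice][1])
    return search(request)   # nothing ambiguous detected; run the request as given

if __name__ == "__main__":
    print(reference_interview(input("What are you looking for? ")))
```
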

What then becomes of librarians? Much as calculators and computers did not obviate the need for mathematicians, I don't see technology obviating the need for information scientists. Library schools have already evolved into information schools, and I have no doubt that their graduates will help establish the next generation of information seeking technology that makes today's search engines seem as quaint as card catalogs.

Sunday, July 6, 2008

Resolving the Battle Royale between Information Retrieval and Information Science

The following is the position paper I submitted to the NSF Information Seeking Support Systems Workshop last month. The workshop report is still being assembled, but I wanted to share my own contribution to the discussion, since it is particularly appropriate to the themes of The Noisy Channel.


Resolving the Battle Royale between Information Retrieval and Information Science


Daniel Tunkelang

Endeca

ABSTRACT

We propose an approach to help resolve the “battle royale” between the information retrieval and information science communities. The information retrieval side favors the Cranfield paradigm of batch evaluation, criticized by the information science side for its neglect of the user. The information science side favors user studies, criticized by the information retrieval side for their scale and repeatability challenges. Our approach aims to satisfy the primary concerns of both sides.

Categories and Subject Descriptors

H.1.2 [Human Factors]: Human information processing.

H.3.3 [Information Systems]: Information Search and Retrieval - Information Filtering, Retrieval Models

H.5.2 [Information Systems]: Information Interfaces and Presentation - User Interfaces

General Terms

Design, Experimentation, Human Factors

Keywords

Information science, information retrieval, information seeking, evaluation, user studies

1. INTRODUCTION

Over the past few decades, a growing community of researchers has called for the information retrieval community to think outside the Cranfield box. Perhaps the most vocal advocate is Nick Belkin, whose "grand challenges" in his keynote at the 2008 European Conference on Information Retrieval [1] all pertained to the interactive nature of information seeking he claims the Cranfield approach neglects. Belkin cited similar calls to action going back as far as Karen Spärck Jones, in her 1988 acceptance speech for the Gerald Salton award [2], and again from Tefko Saracevic, when he received the same award in 1997 [3]. More recently, we have the Information Seeking and Retrieval research program proposed by Peter Ingwersen and Kalervo Järvelin in The Turn, published in 2005 [4].

2. IMPASSE BETWEEN IR AND IS

Given the advocacy of Belkin and others, why hasn't there been more progress? As Ellen Voorhees noted in defense of Cranfield at the 2006 Workshop on Adaptive Information Retrieval, "changing the abstraction slightly to include just a bit more characterization of the user will result in a dramatic loss of power or increase in cost of retrieval experiments" [5]. Despite user studies that have sought to challenge the Cranfield emphasis on batch information retrieval measures like mean average precision—such as those of Andrew Turpin and Bill Hersh [6]—the information retrieval community, on the whole, remains unconvinced by these experiments because they are smaller in scale and less repeatable than the TREC evaluations.

As Tefko Saracevic has said, there is a "battle royale" between the information retrieval community, which favors the Cranfield paradigm of batch evaluation despite its neglect of the user, and the information science community, which favors user studies despite their scale and repeatability challenges [7]. How do we move forward?

3. PRIMARY CONCERNS OF IR AND IS

Both sides have compelling arguments. If an evaluation procedure is not repeatable and cost-effective, it has little practical value. Nonetheless, it is essential that an evaluation procedure measure the interactive nature of information seeking.

If we are to find common ground to resolve this dispute, we need to satisfy the primary concerns of both sides:

· Real information seeking tasks are interactive, so the results of the evaluation procedure must be meaningful in an interactive context.

· The evaluation procedure must be repeatable and cost-effective.

In order to move beyond the battle royale and resolve the impasse between the IR and IS communities, we need to address both of these concerns.

4. PROPOSED APPROACH


A key point of contention in the battle royale is whether we should evaluate systems by studying individual users or measuring system performance against test collections.

The short answer is that we need to do both. In order to ground the results of evaluation in realistic contexts, we need to conduct user studies that relate proposed measures to success in interactive information seeking tasks. Otherwise, we optimize under the artificial constraint that a task involves only a single user query.

Such an approach presumes that we have a characterization of information seeking tasks. This characterization is an open problem that is beyond the scope of this position paper but has been addressed by other information seeking researchers, including Ingwersen and Järvelin [4]. We presume access to a set of tasks that, if not exhaustive, at least applies to a valuable subset of real information seeking problems.

Consider, as a concrete example, the task of a researcher who, given a comprehensive digital library of technical publications, wants to determine with confidence whether his or her idea is novel. In other words, the researcher wants either to discover prior art that anticipates the idea, or to state with confidence that there is no such art. Patent inventors and lawyers performing e-discovery perform analogous tasks. We can measure task performance objectively as a combination of accuracy and efficiency, and we can also consider subjective measures like user confidence and satisfaction. Let us assume that we are able to quantify a task success measure that incorporates these factors.

Given this task and success measure, we would like to know how well an information retrieval system supports the user performing it. As the information scientists correctly argue, user studies are indispensable. But, as we employ user studies to determine which systems are most helpful to users, we need to go a step further and correlate user success to one or more system measures. We can then evaluate these system measures in a repeatable, cost-effective process that does not require user involvement.

For example, let us hypothesize that mean average precision (MAP) on a given TREC collection is such a measure. We hypothesize that users pursuing the prior art search task are more successful using a system with higher MAP than those using a system with lower MAP. In order to test this hypothesis, we can present users with a family of systems that, insofar as possible, vary only in MAP, and see how well user success correlates to the system’s MAP. If the correlation is strong, then we validate the utility of MAP as a system measure and invest in evaluating systems using MAP against the specified collection in order to predict their utility for the prior art task.
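
To make the validation step concrete, here is a sketch of the correlation analysis described above, with fabricated numbers standing in for the MAP scores and user-study outcomes of a hypothetical family of systems. A real validation would involve many more systems and users, plus significance testing.

```python
# Sketch of validating a system measure (MAP) against user task success.
# All numbers are fabricated for illustration.
import statistics

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    n = len(xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)

# Hypothetical family of systems that vary (as far as possible) only in MAP.
map_scores   = [0.18, 0.22, 0.25, 0.31, 0.35]   # batch evaluation on the test collection
task_success = [0.41, 0.48, 0.47, 0.60, 0.66]   # task success measure from the user study

r = pearson(map_scores, task_success)
print(f"Correlation between MAP and task success: r = {r:.2f}")
# A strong correlation would justify using MAP on this collection as a cheap,
# repeatable proxy for the prior art task; a weak one would send us looking
# for a better system measure.
```
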

The principle here is a general one: it can be used not only to compare different algorithms, but also to evaluate more sophisticated interfaces, such as document clustering [8] or faceted search [9]. The only requirement is that we hypothesize and validate system measures that correlate with user success.

5. WEAKNESSES OF APPROACH

Our proposed approach has two major weaknesses.

The first weakness is that, in a realistic interactive information retrieval context, distinct queries are not independent. Rather, a typical user executes a sequence of queries in pursuit of an information need, each query informed by the results of the previous ones.

In a batch test, we must decide the query sequence in advance, and cannot model how the user’s queries depend on system response. Hence, we are limited to computing measures that can be evaluated for each query independently. Nonetheless, we can choose measures which correlate to effectiveness in realistic settings. Hopefully these measures are still meaningful, even when we remove the test queries from their realistic context.

The second challenge is that we do not envision a way to compare different interfaces in a batch setting. It seems that testing the relative merits of different interfaces requires real—or at least simulated—users.

If, however, we hold the interface constant, then we can define performance measures that apply to those interfaces. For example, we can develop standardized versions of well-studied interfaces, such as faceted search and clustering. We can then compare the performance of different systems that use these interfaces, e.g., different clustering algorithms.

6. AN ALTERNATIVE APPROACH

An alternative way to tackle the evaluation problem leverages the “human computation” approach championed by Luis Von Ahn [10]. This approach uses “games with a purpose” to motivate people to perform information-related tasks, such as image tagging and optical character recognition (OCR).

A particularly interesting "game" in our present context is Phetch, in which one or more "Seekers" compete to find an image based on a text description provided by a "Describer" [11]. The Describer’s goal is to help the Seekers succeed, while the Seekers compete with one another to find the target image within a fixed time limit, using a search engine that has indexed the images based on tagging results from the ESP Game. In order to discourage a shotgun approach, the game penalizes Seekers for wrong guesses.

This game goes quite far in capturing the essence of interactive information retrieval. If we put aside the competition among the Seekers, then we see that an individual Seeker, aided by the human Describer and the algorithmic (but human-indexed) search engine, is pursuing an information retrieval task. Moreover, the Seeker is incented to be both effective and efficient.

How can we leverage this framework for information retrieval evaluation? Even though the game envisions both Describers and Seekers to be human beings, there is no reason we cannot allow computers to play too--in either or both roles. Granted, the game, as currently designed, focuses on image retrieval without giving the human players direct access to the image tags, but we could imagine a framework that is more amenable to machine participation, e.g., providing a machine player with a set of tags derived from those in the index when that player is presented with an image. Alternatively, there may be a domain more suited than image retrieval to incorporating computer players.

The main appeal of the game framework is that it allows all participants to be judged based on an objective criterion that reflects the effectiveness and efficiency of the interactive information retrieval process. A good Describer should, on average, outscore a bad Describer over the long term; likewise, a good Seeker should outscore a bad one. We can even vary the search engine available to Seekers, in order to compare competing search engine algorithms or interfaces.
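
As a toy illustration of this evaluation idea, the following simulation holds the task fixed, varies the quality of the search engine available to the Seeker, and compares average scores under a Phetch-like rule that rewards fast finds and penalizes wrong guesses. The engines, the user model, and the scoring constants are all invented; Phetch's actual scoring rules differ.

```python
# Toy simulation: compare search engines by the average score a Seeker earns
# under an invented Phetch-like scoring rule. All constants are made up.
import random

def average_seeker_score(engine_quality, rounds=10000, max_attempts=10):
    """engine_quality: invented probability that the engine surfaces the target
    image on any given attempt."""
    total = 0
    for _ in range(rounds):
        score = 0
        for attempt in range(1, max_attempts + 1):
            if random.random() < engine_quality:
                score += max(0, 110 - 10 * attempt)   # a find on attempt 1 earns 100
                break
            score -= 5                                 # penalty for a wrong guess
        total += score
    return total / rounds

for name, quality in [("engine A", 0.35), ("engine B", 0.20)]:
    print(f"{name}: average score {average_seeker_score(quality):.1f}")
```
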

7. CONCLUSION

Our goal is ambitious: we aspire towards an evaluation framework that satisfies information scientists as relevant to real-world information seeking, but nonetheless offers the practicality of the Cranfield paradigm that dominates information retrieval. The near absence of collaboration between the information science and information retrieval communities has been a greatly missed opportunity, not only for both research communities but also for the rest of the world, which could benefit from practical advances in our understanding of information seeking. We hope that the approach we propose takes at least a small step towards resolving this battle royale.

8. REFERENCES

[1] Belkin, N. J., 2008. Some(What) Grand Challenges for Information Retrieval. ACM SIGIR Forum 42, 1 (June 2008), 47-54.

[2] Spärck Jones, K. 1988. A look back and a look forward. In Proceedings of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 13-29.

[3] Saracevic, T. 1997. Users lost: reflections of the past, future and limits of information science. ACM SIGIR Forum 31, 2 (July 1997), 16-27.

[4] Ingwersen, P. and Järvelin, K. 2005. The turn. Integration of information seeking and retrieval in context. Springer.

[5] Voorhees, E. 2006. Building Test Collections for Adaptive Information Retrieval: What to Abstract for What cost? In First International Workshop on Adaptive Information Retrieval (AIR).

[6] Turpin, A. and Scholer, F. 2006. User performance versus precision measures for simple search tasks. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 11-18.

[7] Saracevic, T. (2007). Relevance: A review of the literature and a framework for thinking on the notion in information science. Part II: nature and manifestations of relevance. Journal of the American Society for Information Science and Technology 58(3), 1915-1933.

[8] Cutting, D., Karger, D., Pedersen, J., and Tukey, J. 1992. Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. In Proceedings of the 15th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval, 318-329.

[9] Workshop on Faceted Search. 2006. In Proceedings of the 29th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval.

[10] Von Ahn, L. 2006. Games with a Purpose. IEEE Computer 39, 6 (June 2006), 92-94.

[11] Von Ahn, L., Ginosar, S., Kedia, M., Liu, R., and Blum, M. 2006. Improving accessibility of the web with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 79-82.


Wednesday, July 2, 2008

A Call to Action

I sent the following open letter to the leading enterprise providers and industry analysts in the information access community. I am inspired by the recent efforts of researchers to bring industry events to major academic conferences. I'd like to see industry--particularly enterprise providers and industry analysts--return the favor, embracing these events to help bridge the gap between research and practice.

Dear friends in the information access community,

I am reaching out to you with this open letter because I believe we, the leading providers and analysts in the information access community, share a common goal of helping companies understand, evaluate, and differentiate the technologies in this space.

Frankly, I feel that we as a community can do much better at achieving this goal. In my experience talking with CTOs, CIOs, and other decision makers in enterprises, I've found that too many people fail to understand either the state of current technology or the processes they need to put in place to leverage that technology. Indeed, a recent AIIM report confirms what I already knew anecdotally--that there is a widespread failure in the enterprise to understand and derive value from information access.

In order to advance the state of knowledge, I propose that we engage an underutilized resource: the scholarly community of information retrieval and information science researchers. Not only has this community brought us many of the foundations of the technology we provide, but it has also developed a rigorous tradition of evaluation and peer review.

In addition, this community has been increasingly interested in connection with practitioners, as demonstrated by the industry days held at top-tier scholarly conferences, such as SIGIR, CIKM, and ECIR. I have participated in a few of these, and I was impressed with the quality of both the presenters and the attendees. Web search leaders, such as Google, Yahoo, and Microsoft, have embraced these events, as have smaller companies that specialize in search and related technologies, such as information extraction. Enterprise information access providers, however, have been largely absent at these events, as have industry analysts.

I suggest that we take at least the following steps to engage the scholarly community of information retrieval and information science researchers:
  • Collaborate with the organizers of academic conferences such as SIGIR, CIKM, and ECIR to promote participation of enterprise information access providers and analysts in conference industry days.

  • Participate in workshops that are particularly relevant to enterprise information access providers, such as the annual HCIR and exploratory search workshops.
The rigor and independence of these conferences and workshops make them ideal vendor-neutral forums. I hope that you all will join me in working to strengthen the connection between the commercial and scholarly communities, thus furthering everyone's understanding of the technology that drives our community forward.

Please contact me at dt@endeca.com or join in an open discussion at http://thenoisychannel.blogspot.com/2008/07/call-to-action.html if you are interested in participating in this effort.

Sincerely,
Daniel Tunkelang

Tuesday, July 1, 2008

Clarification before Refinement on Amazon

I just noticed today that a search on Amazon (e.g., this search for algorithms) does not provide the options to sort the results or to refine by anything other than category. Once you do select a category (e.g., books), you are given additional refinement options, as well as the ability to sort.

While I find this interface less than ideal (e.g., even if all of your search results are in a single category, it still makes you select that category explicitly), I do commend them for recognizing the need to have users clarify before they refine. The implication--one we've been pursuing at Endeca--is that it is incumbent on the system to detect when its understanding of the user's intent is ambiguous enough to require a clarification dialogue.
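
Here is a minimal sketch of one way a system might detect that kind of ambiguity: measure how widely the result set spreads across top-level categories and, above a threshold, ask the user to clarify before offering sorts and refinements. The categories, counts, and threshold are invented; I have no idea what Amazon actually does, and our own logic at Endeca is more involved.

```python
# Minimal sketch of clarify-before-refine: trigger a clarification dialogue
# when results spread too widely across categories. All numbers are invented.
import math

def category_entropy(counts):
    """Shannon entropy (in bits) of the distribution of results over categories."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def needs_clarification(counts, threshold=1.5):
    return category_entropy(counts) > threshold

# Invented result counts for an ambiguous query.
results_by_category = {"Books": 1200, "Software": 450, "Music": 300, "Toys & Games": 150}

if needs_clarification(results_by_category):
    print("Which department did you mean? " + ", ".join(results_by_category))
else:
    print("Showing results, with sort and refinement options.")
```
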

Monday, July 28, 2008

Not as Cuil as I Expected

Today's big tech news is the launch of Cuil, the latest challenger to Google's hegemony in Web search. Given the impressive team of Xooglers that put it together, I had high expectations for the launch.

My overall reaction: not bad, but not good enough to take seriously as a challenge to Google. They may be "The World's Biggest Search Engine" based on the number of pages indexed, but they return zero results for a number of queries where Google does just fine, including noisy channel blog (compare to Google). But I'm not taking it personally--after all, their own site doesn't show up when you search for their name (again, compare to Google). As for their interface features (column display, explore by category, query suggestions), they're fine, but neither the concepts nor the quality of their implementation strike me as revolutionary.

Perhaps I'm expecting too much on day 1. But they're not just trying to beat Gigablast; they're trying to beat Google, and they surely expected to get lots of critical attention the moment they launched. Regardless of the improvements they've made in indexing, they clearly need to do more work on their crawler. It's hard to judge the quality of results when it's clear that at least some of the problem is that the most relevant documents simply aren't in their index. I'm also surprised to not see Wikipedia documents showing up much for my searches--particularly for searches when I'm quite sure the most relevant document is in Wikipedia. Again, it's hard to tell if this is an indexing or results quality issue.

I wish them luck--I speak for many in my desire to see Google face worthy competition in web search.

Sunday, July 27, 2008

Catching up on SIGIR '08

Now that SIGIR '08 is over, I hope to see more folks blogging about it. I'm jealous of everyone who had the opportunity to attend, not only because of the culinary delights of Singapore, but because the program seems to reflect an increasing interest of the academic community in real-world IR problems.

Some notes from looking over the proceedings:
  • Of the 27 paper sessions, 2 include the word "user" in their titles, 2 include the word "social", 2 focus on Query Analysis & Models, and 1 is about exploratory search. Compared to the last few SIGIR conferences, this is a significant increase in focus on users and interaction.

  • A paper on whether test collections predict users' effectiveness offers an admirable defense of the Cranfield paradigm, much along the lines I've been advocating.

  • A nice paper from Microsoft Research looks at the problem of whether to personalize results for a query, recognizing that not all queries benefit from personalization. This approach may well be able to reap the benefits of personaliztion while avoiding much of its harm.

  • Two papers on tag prediction: Real-time Automatic Tag Recommendation (ACM Digital Library subscription required) and Social Tag Prediction. Semi-automated tagging tools are one of the best ways to leverage the best of both human and machine capabilities.
And I haven't even gotten to the posters! I'm sad to see that they dropped the industry day, but perhaps they'll bring it back next year in Boston.

Wednesday, July 23, 2008

Knol: Google takes on Wikipedia

Just a few days ago, I was commenting on a New York Times article about Wikipedia's new approval system that the biggest problem with Wikipedia is anonymous authorship. By synchronous coincidence, Google unveiled Knol today, which is something of a cross between Wikipedia and Squidoo. It's most salient feature is that each entry will have a clearly identified author. They even allow authors to verify their identities using credit cards or phone directories.

It's a nice idea, since anonymous authorship is a a major factor in the adversarial nature of information retrieval on the web. Not only does the accountability of authorship inhibit vandalism and edit wars, but it also allows readers to decide for themselves whom to trust--at least to the extent that readers are able and willing to obtain reliable information about the authors. Without question, they are addressing Wikipedia's biggest weakness.

But it's too little, too late. Wikipedia is already there. And, despite complaints about its inaccuracy and bias, Wikipedia is a fantastic, highly utilized resource. The only way I see for Knol to supplant Wikipedia in reasonable time frame is through a massive cut-and-paste to make up for the huge difference in content.

Interestingly, Wikipedia does not seem to place any onerous restrictions on verbatim copying. However, unless a single author is 100% responsible for authoring a Wikipedia entry, it isn't clear that anyone can simply copy the entry into Knol.

I know that it's dangerous to bet against Google. But I'm really skeptical about this latest effort. It's a pity, because I think their emphasis is the right one. But for once I wish they'd been a bit more humble and accepted that they aren't going to build a better Wikipedia from scratch.

Saturday, July 19, 2008

Predictably Irrational

As regular readers have surely noticed by now, I've been on a bit of a behavioral psychology kick lately. Some of this reflects long-standing personal interest and my latest reading. But I also feel increasingly concerned that researchers in information seeking--especially those working on tools--have neglected the impact of cognitive bias.

For those who are unfamiliar with last few decades of research in this field, I highly recommend a recent lecture by behavioral economist Dan Ariely on predictable irrationality. Not only is he a very informative and entertaining speaker, but he chooses very concrete and credible examples, starting with his contemplating how we experience pain based on his own experience of suffering
third-degree burns over 70 percent of his body. I promise you, the lecture is an hour well spent, and the time will fly by.

A running theme of through this and my other posts on cognitive bias is that the way a information is presented to us has dramatic effects on how we interpret that information.

This is great news for anyone who wants to manipulate people. In fact, I once asked Dan about the relative importance of people's inherent preferences vs. those induced by presentation on retail web sites, and he all but dismissed the former (i.e., you can sell ice cubes to Eskimos, if you can manipulate their cognitive biases appropriately). But it's sobering news for those of us who want to empower user to evaluate information objectively to support decision making.

Friday, July 18, 2008

Call to Action - A Follow-Up

The call to action I sent out a couple of weeks ago has generated healthy interest.

One of the several people who responded is the CTO of one of Endeca's competitors, whom I laud for understanding that the need to better articulate and communicate the technology of information access transcends competition among vendors. While we have differences on how to achieve this goal, I at least see hope from his responsiveness.

The rest were analysts representing some of the leading firms in the space. They not only expressed interest, but also contributed their own ideas on how to make this effort successful. Indeed, I met with two analysts this week to discuss next steps.

Here is where I see this going.

In order for any efforts to communicate the technology of information access to be effective, the forum has to establish credibility as a vendor-neutral and analyst-neutral forum. Ideally, that means having at least two major vendors and two major analysts on board. What we want to avoid is having only one major vendor or analyst, since that will create a reasonable perception of bias.

I'd also like to involve academics in information retrieval and library and information science. As one of the analysts suggested, we could reach out to the leading iSchools, who have expressed an open interest in engaging the broader community.

What I'd like to see come together is a forum, probably a one-day workshop, that brings together credible representatives from the vendor, analyst, and academic communities. With a critical mass of participants and enough diversity to assuage concerns of bias, we can start making good on this call to action.

Tuesday, July 15, 2008

Beyond a Reasonable Doubt

In Psychology of Intelligence Analysis, Richards Heuer advocates that we quantify expressions of uncertainty: "To avoid ambiguity, insert an odds ratio or probability range in parentheses after expressions of uncertainty in key judgments."

His suggestion reminds me of my pet peeve about the unquantified notion of reasonable doubt in the American justice system. I've always wanted (but never had the opportunity) to ask a judge what probability of innocence constitutes a reasonable doubt.

Unfortunately, as Heuer himself notes elsewhere in his book, we human beings are really bad at estimating probabilities. I suspect (with a confidence of 90 to 95%) that quantifying our uncertainties as probability ranges will only suggest a false sense of precision.

So, what can we do to better communicate uncertainty? Here are a couple of thoughts:
  • We can calibrate estimates based on past performance. It's unclear what will happen if people realize that their estimates are being translated, but, at worst, it feels like good fodder for research in judgment and decision making.

  • We can ask people to express relative probability judgments. While these are also susceptible to bias, at least they don't demand as much precision. And we can always vary the framing of questions to try to factor out the cognitive biases they induce.
Also, we talk about uncertainty, it is important that we distinguish between aleatory and epistemic uncertainty.

When I flip a coin, I am certain it has a 50% chance of landing heads, because I know the probability distribution of the event space. This is aleatory uncertainty, and forms the basis of probability and statistics.

But when I reason about less contrived uncertain events, such as estimating the likelihood that my bank will collapse this year, the challenge is my ignorance of the probability distribution. This is epistemic uncertainty, and it's a lot messier.

If you'd like to learn more about aleatory and existential uncertainty, I recommend Nicholas Nassim Taleb's Fooled by Randomness (which is a better read than his better-known Black Swan).

In summary, we have to accept the bad news that the real world is messy. As a mathematician and computer scientist, I've learned to pursue theoretical rigor as an ideal. Like me, you may find it very disconcerting to not be able to treat all real-world uncertainty in terms of probability spaces. Tell it to the judge!

Sunday, July 13, 2008

Small is Beautiful

Today's New York Times has an article by John Markoff called On a Small Screen, Just the Salient Stuff. It argues that the design constraints of the iPhone (and of mobile devices in general) lead to an improved user experience, since site designers do a better job of focusing on the information that users will find relevant.

Of course, on a blog entitled The Noisy Channel, I can't help praising approaches that strive to improve the signal-to-noise ratio in information seeking applications. And I'm glad to see them quoting Ben Shneiderman, a colleague of mine at the University of Maryland who has spent much of his career focusing on HCIR issues.

Still, I think they could have taken the idea much further. Their discussion of more efficient or ergonomic use of real estate boils down to stripping extraneous content (a good idea, but hardly novel), and making sites vertically oriented (i.e., no horizontal scrolling). They don't consider the question of what information is best to present in the limited space--which, in my mind, is the most important question to consider as we optimize interaction. Indeed, many of the questions raised by small screens also apply to other interfaces, such as voice.

Perhaps I am asking too much to expect them to call out the extreme inefficiency of ranked lists, compared to summarization-oriented approaches. Certainly the mobile space opens great opportunities for someone to get this right on the web.

Friday, July 11, 2008

Psychology of Intelligence Analysis

In the course of working with some of Endeca's more interesting clients, I started reading up on how the intelligence agencies address the challenges of making decisions, especially in the face of incomplete and contradictory evidence. I ran into a book called Psychology of Intelligence Analysis by former CIA analyst Richards Heuer. The entire book is available online, or you can hunt down a hard copy of the out-of-print book from your favorite used book seller.

Given the mixed record of the intelligence agencies over the past few decades, you might be wondering if the CIA is the best source for learning how to analyze intelligence. But this book is a gem. Even if the agencies don't always practice what they preach (and the book makes a good case as to why), the book is an excellent tour through the literature on judgment and decision making.

If you're already familiar with work by Herb Simon, Danny Kahneman, and Amos Tversky, then a lot of the ground he covers will be familiar--especially the third of the book that enumerates cognitive biases. I'm a big fan of the judgment and decision making literature myself. But I still found some great nuggets, particularly Chapter 8 on Analysis of Competing Hypotheses. Unlike most of the literature that focuses exclusively on demonstrating our systematic departures from rationality, Heuer hopes offer at least some constructive advice.

As someone who builds tools to help people make decisions using information that not only may be incomplete and contradictory, but also challenging to find in the first place, I'm very sensitive to how people's cognitive biases affect their ability to use these tools effectively. One of the HCIR '07 presentations by Jolie Martin and Michael Norton (who have worked with Max Bazerman) showed how the manner in which information was partitioned on retail web sites drove decisions, i.e., re-organizing the same information affected consumer's decision process.

It may be tempting for us on the software side to wash our hands of our users' cognitive biases. But such an approach would be short-sighted. As Heuer shows in his well-researched book, people not only have cognitive biases, but are unable to counter those biases simply by being made aware of them. Hence, if software tools are to help people make effective decisions, it is the job of us tool builders to build with those biases in mind, and to support processes like Analysis of Competing Hypotheses that try to compensate for human bias.

Thursday, July 10, 2008

Nice Selection of Machine Learning Papers

John Langford just posted a list of seven ICML '08 papers that he found interesting. I appreciate his taste in papers, and I particularly liked a paper on Learning Diverse Rankings with Multi-Armed Bandits that addresses learning a diverse ranking of documents based on users' clicking behavior. If you liked the Less is More work that Harr Chen and David Karger presented at SIGIR '06, then I recommend you check this one out.

Tuesday, July 8, 2008

Librarian 2.0

Many of the words that mark milestones in the history of technology, such as calculator and word processor, originally corresponded to people. Calculating had at least two lives as a technology breakthrough--first as a process, and then as a automatic means for executing that process. Thanks to inventions like calculators and computers, human beings have moved up the value chain to become scientists and engineers who take low-level details for granted.

Similarly, the advances in information science and retrieval have dramatically changed the role of a reference librarian.

Hopefully some of you old enough to remember card catalogs, They were certainly functional if you knew the exact title or author you were looking for, assuming the title wasn't too generic or author too prolific. Where card catalogs fell short was in supporting exploratory search. In many cases, your best bet was to quite literally explore the stacks and hope that locality within the Dewey Decimal system sufficed for to support your information seeking needs. Alternatively, you could follow citation paths--the dead-tree precursor of surfing a hypertext collection.

For exploratory tasks, library patrons would turn to reference librarians, who would clarify the patrons' needs through a process called the reference interview. According to Wikipedia:
A reference interview is composed of two segments:

1. An initial segment in which the librarian encourages the user to fully discuss the request.
2. A final segment in which the librarian asks questions to relate the request to the materials available in the library

A reference interview is structured (ideally) according to the following series of steps. First the library user states a question or describes a problem. The librarian then clarifies the user's information need, sometimes leading him or her back from a request for a specific resource (which may not be the best one for the problem at hand) to the actual information need as it manifests in the library user's life. Following that, the librarian suggests information resources that address the user's information need, explaining the nature and scope of information they contain and soliciting feedback. The reference interview closes when the librarian has provided the appropriate information or a referral to an outside resource where it can be found, and the user confirms that he or she has received the information needed.
Fast forward to the present day. Thanks to modern search engines, title and author search are no longer tedious processes. Moreover, search engines are somewhat forgiving of users, offering spelling correction and inexact query matching. Libraries are still catching up with advances in technology, but the evolution is clearly under way.

However, search engines have not obviated the need for a reference interview. Excepting the simple cases of known item search, the typical information seeker needs help translating an information need into one or more search queries. And that information need may change as the seeker learns from the process.

But it should come as no surprise that information seeking support systems need to be more than search engines. The ideal information seeking support system emulates a reference librarian, stepping users through a structured process of clarification. Indeed, this is exactly what my colleagues and I at Endeca are trying to do in our work with libraries and more broadly in pursuing a vision of human computer information retrieval.

What then becomes of librarians? Much as calculators and computers did not obviate the need for mathematicians, I don't see technology obviating the need for information scientists. Library schools have already evolved into information schools, and I have no doubt that their graduates will help establish the next generation of information seeking technology that makes today's search engines seem as quaint as card catalogs.

Sunday, July 6, 2008

Resolving the Battle Royale between Information Retrieval and Information Science

The following is the position paper I submitted to the NSF Information Seeking Support Systems Workshop last month. The workshop report is still being assembled, but I wanted to share my own contribution to the discussion, since it is particularly appropriate to the themes of The Noisy Channel.


Resolving the Battle Royale between Information Retrieval and Information Science


Daniel Tunkelang

Endeca

ABSTRACT

We propose an approach to help resolve the “battle royale” between the information retrieval and information science communities. The information retrieval side favors the Cranfield paradigm of batch evaluation, criticized by the information science side for its neglect of the user. The information science side favors user studies, criticized by the information retrieval side for their scale and repeatability challenges. Our approach aims to satisfy the primary concerns of both sides.

Categories and Subject Descriptors

H.1.2 [Human Factors]: Human information processing.

H.3.3 [Information Systems]: Information Search and Retrieval - Information Filtering, Retrieval Models

H.5.2 [Information Systems]: Information Interfaces and Presentation - User Interfaces

General Terms

Design, Experimentation, Human Factors

Keywords

Information science, information retrieval, information seeking, evaluation, user studies

1. INTRODUCTION

Over the past few decades, a growing community of researchers has called for the information retrieval community to think outside the Cranfield box. Perhaps the most vocal advocate is Nick Belkin, whose "grand challenges" in his keynote at the 2008 European Conference on Information Retrieval [1] all pertained to the interactive nature of information seeking that he claims the Cranfield approach neglects. Belkin cited similar calls to action going back as far as Karen Spärck Jones, in her 1988 acceptance speech for the Gerard Salton award [2], and again from Tefko Saracevic, when he received the same award in 1997 [3]. More recently, we have the Information Seeking and Retrieval research program proposed by Peter Ingwersen and Kalervo Järvelin in The Turn, published in 2005 [4].

2. IMPASSE BETWEEN IR AND IS

Given the advocacy of Belkin and others, why hasn't there been more progress? As Ellen Voorhees noted in defense of Cranfield at the 2006 Workshop on Adaptive Information Retrieval, "changing the abstraction slightly to include just a bit more characterization of the user will result in a dramatic loss of power or increase in cost of retrieval experiments" [5]. Despite user studies that have sought to challenge the Cranfield emphasis on batch information retrieval measures like mean average precision—such as those of Andrew Turpin and Bill Hersh [6]—the information retrieval community, on the whole, remains unconvinced by these experiments because they are smaller in scale and less repeatable than the TREC evaluations.

As Tefko Saracevic has said, there is a "battle royale" between the information retrieval community, which favors the Cranfield paradigm of batch evaluation despite its neglect of the user, and the information science community, which favors user studies despite their scale and repeatability challenges [7]. How do we move forward?

3. PRIMARY CONCERNS OF IR AND IS

Both sides have compelling arguments. If an evaluation procedure is not repeatable and cost-effective, it has little practical value. Nonetheless, it is essential that an evaluation procedure measure the interactive nature of information seeking.

If we are to find common ground to resolve this dispute, we need to satisfy the primary concerns of both sides:

· Real information seeking tasks are interactive, so the results of the evaluation procedure must be meaningful in an interactive context.

· The evaluation procedure must be repeatable and cost-effective.

In order to move beyond the battle royale and resolve the impasse between the IR and IS communities, we need to address both of these concerns.

4. PROPOSED APPROACH


A key point of contention in the battle royale is whether we should evaluate systems by studying individual users or measuring system performance against test collections.

The short answer is that we need to do both. In order to ground the results of evaluation in realistic contexts, we need to conduct user studies that relate proposed measures to success in interactive information seeking tasks. Otherwise, we optimize under the artificial constraint that a task involves only a single user query.

Such an approach presumes that we have a characterization of information seeking tasks. This characterization is an open problem that is beyond the scope of this position paper but has been addressed by other information seeking researchers, including Ingwersen and Järvelin [4]. We presume access to a set of tasks that, if not exhaustive, at least applies to a valuable subset of real information seeking problems.

Consider, as a concrete example, the task of a researcher who, given a comprehensive digital library of technical publications, wants to determine with confidence whether his or her idea is novel. In other words, the researcher wants either to discover prior art that anticipates the idea, or to state with confidence that there is no such art. Patent inventors and lawyers performing e-discovery perform analogous tasks. We can measure task performance objectively as a combination of accuracy and efficiency, and we can also consider subjective measures like user confidence and satisfaction. Let us assume that we are able to quantify a task success measure that incorporates these factors.
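For concreteness, here is one way such a composite measure might be quantified in Python; the components, normalization, and weights are assumptions made for the sake of illustration rather than anything specified above.

    # Sketch of a composite task-success measure combining objective and
    # subjective components. Weights and normalizations are assumptions.

    def task_success(accuracy, minutes_spent, confidence, satisfaction,
                     time_budget=30.0,
                     weights=(0.4, 0.2, 0.2, 0.2)):
        """Return a score in [0, 1].

        accuracy      -- fraction of correct prior-art judgments, in [0, 1]
        minutes_spent -- time taken; efficiency = fraction of budget left
        confidence    -- self-reported, in [0, 1]
        satisfaction  -- self-reported, in [0, 1]
        """
        efficiency = max(0.0, 1.0 - minutes_spent / time_budget)
        components = (accuracy, efficiency, confidence, satisfaction)
        return sum(w * c for w, c in zip(weights, components))

    # Example: accurate but slow, fairly confident and satisfied.
    print(round(task_success(0.9, 25.0, 0.7, 0.8), 3))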

Given this task and success measure, we would like to know how well an information retrieval system supports the user performing it. As the information scientists correctly argue, user studies are indispensable. But, as we employ user studies to determine which systems are most helpful to users, we need to go a step further and correlate user success to one or more system measures. We can then evaluate these system measures in a repeatable, cost-effective process that does not require user involvement.

For example, let us hypothesize that mean average precision (MAP) on a given TREC collection is such a measure. We hypothesize that users pursuing the prior art search task are more successful using a system with higher MAP than those using a system with lower MAP. In order to test this hypothesis, we can present users with a family of systems that, insofar as possible, vary only in MAP, and see how well user success correlates to the system’s MAP. If the correlation is strong, then we validate the utility of MAP as a system measure and invest in evaluating systems using MAP against the specified collection in order to predict their utility for the prior art task.
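A minimal sketch of that validation step, assuming we already have each system's MAP on the collection and its mean task-success score from the user study (the systems and numbers below are hypothetical):

    # Sketch: correlate system MAP with mean user task success.
    # The systems and numbers are hypothetical.

    from math import sqrt

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sqrt(sum((x - mx) ** 2 for x in xs))
        sy = sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    systems = {
        # system: (MAP on the TREC-style collection, mean task success)
        "A": (0.18, 0.52),
        "B": (0.24, 0.61),
        "C": (0.31, 0.64),
        "D": (0.36, 0.71),
    }

    maps = [m for m, _ in systems.values()]
    successes = [s for _, s in systems.values()]

    r = pearson(maps, successes)
    print(f"correlation between MAP and task success: r = {r:.2f}")
    # A strong positive r would support using MAP on this collection as a
    # repeatable, user-free proxy for the prior art task; a weak r would not.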

The principle here is a general one: it can be used not only to compare different algorithms, but also to evaluate more sophisticated interfaces, such as document clustering [8] or faceted search [9]. The only requirement is that we hypothesize and validate system measures that correlate to user success.

5. WEAKNESSES OF APPROACH

Our proposed approach has two major weaknesses.

The first weakness is that, in a realistic interactive information retrieval context, distinct queries are not independent. Rather, a typical user executes a sequence of queries in pursuit of an information need, each query informed by the results of the previous ones.

In a batch test, we must decide the query sequence in advance, and cannot model how the user’s queries depend on system response. Hence, we are limited to computing measures that can be evaluated for each query independently. Nonetheless, we can choose measures which correlate to effectiveness in realistic settings. Hopefully these measures are still meaningful, even when we remove the test queries from their realistic context.
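As a reminder of what "evaluated for each query independently" looks like in practice, here is a minimal sketch of mean average precision computed one query at a time over a fixed run; the document identifiers and relevance judgments are made up.

    # Sketch: mean average precision computed query by query, with no
    # dependence between queries. Run and relevance judgments are hypothetical.

    def average_precision(ranked_doc_ids, relevant_ids):
        hits, precisions = 0, []
        for rank, doc_id in enumerate(ranked_doc_ids, start=1):
            if doc_id in relevant_ids:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

    def mean_average_precision(runs, qrels):
        return sum(average_precision(runs[q], qrels[q]) for q in qrels) / len(qrels)

    runs = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9", "d4"]}
    qrels = {"q1": {"d1", "d5"}, "q2": {"d2"}}
    print(round(mean_average_precision(runs, qrels), 3))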

The second challenge is that we do not envision a way to compare different interfaces in a batch setting. It seems that testing the relative merits of different interfaces requires real—or at least simulated—users.

If, however, we hold the interface constant, then we can define performance measures that apply to those interfaces. For example, we can develop standardized versions of well-studied interfaces, such as faceted search and clustering. We can then compare the performance of different systems that use these interfaces, e.g., different clustering algorithms.

6. AN ALTERNATIVE APPROACH

An alternative way to tackle the evaluation problem leverages the “human computation” approach championed by Luis Von Ahn [10]. This approach uses “games with a purpose” to motivate people to perform information-related tasks, such as image tagging and optical character recognition (OCR).

A particularly interesting "game" in our present context is Phetch, in which one or more "Seekers" compete to find an image based on a text description provided by a "Describer" [11]. The Describer’s goal is to help the Seekers succeed, while the Seekers compete with one another to find the target image within a fixed time limit, using a search engine that has indexed the images based on tagging results from the ESP Game. In order to discourage a shotgun approach, the game penalizes Seekers for wrong guesses.

This game goes quite far in capturing the essence of interactive information retrieval. If we put aside the competition among the Seekers, then we see that an individual Seeker, aided by the human Describer and the algorithmic--but human-indexed--search engine, is pursuing an information retrieval task. Moreover, the Seeker is incented to be both effective and efficient.

How can we leverage this framework for information retrieval evaluation? Even though the game envisions both Describers and Seekers to be human beings, there is no reason we cannot allow computers to play too--in either or both roles. Granted, the game, as currently designed, focuses on image retrieval without giving the human players direct access to the image tags, but we could imagine a framework that is more amenable to machine participation, e.g., providing a machine player with a set of tags derived from those in the index when that player is presented with an image. Alternatively, there may be a domain more suited than image retrieval to incorporating computer players.

The main appeal of the game framework is that it allows all participants to be judged based on an objective criterion that reflects the effectiveness and efficiency of the interactive information retrieval process. A good Describer should, on average, outscore a bad Describer over the long term; likewise, a good Seeker should outscore a bad one. We can even vary the search engine available to Seekers, in order to compare competing search engine algorithms or interfaces.
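To illustrate how such scoring could discriminate between search engines, here is a toy simulation of Phetch-style rounds in which a Seeker earns a fixed reward for finding the target within the time limit and pays a penalty for each wrong guess; the scoring constants and the behavioral model are invented assumptions, not taken from the actual game.

    # Toy simulation of game-style evaluation: average Seeker score as a
    # proxy for the effectiveness and efficiency of the underlying engine.
    # Scoring constants and simulated behavior are illustrative assumptions.

    import random

    FIND_REWARD = 100
    WRONG_GUESS_PENALTY = 10
    TIME_LIMIT = 60.0  # seconds

    def play_round(engine_quality, rng):
        """Return the Seeker's score for one round.

        engine_quality in (0, 1]: better engines surface the target sooner
        and provoke fewer wrong guesses (a modeling assumption).
        """
        time_to_find = rng.expovariate(engine_quality / 20.0)
        wrong_guesses = rng.randint(0, int(5 * (1.0 - engine_quality)) + 1)
        score = -WRONG_GUESS_PENALTY * wrong_guesses
        if time_to_find <= TIME_LIMIT:
            score += FIND_REWARD
        return score

    def evaluate(engine_quality, rounds=2000, seed=0):
        rng = random.Random(seed)
        return sum(play_round(engine_quality, rng) for _ in range(rounds)) / rounds

    for name, quality in [("engine X", 0.4), ("engine Y", 0.8)]:
        print(name, round(evaluate(quality), 1))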

7. CONCLUSION

Our goal is ambitious: we aspire towards an evaluation framework that satisfies information scientists as relevant to real-world information seeking, but nonetheless offers the practicality of the Cranfield paradigm that dominates information retrieval. The near absence of collaboration between the information science and information retrieval communities has been a greatly missed opportunity not only for both researcher communities but also for the rest of the world who could benefit from practical advances in our understanding of information seeking. We hope that the approach we propose takes at least a small step towards resolving this battle royale.

8. REFERENCES

[1] Belkin, N. J., 2008. Some(What) Grand Challenges for Information Retrieval. ACM SIGIR Forum 42, 1 (June 2008), 47-54.

[2] Spärck Jones, K. 1988. A look back and a look forward. In Proceedings of the 11th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval, 13-29.

[3] Saracevic, T. 1997. Users lost: reflections of the past, future and limits of information science. ACM SIGIR Forum 31, 2 (July 1997), 16-27.

[4] Ingwersen, P. and Järvelin, K. 2005. The turn. Integration of information seeking and retrieval in context. Springer.

[5] Voorhees, E. 2006. Building Test Collections for Adaptive Information Retrieval: What to Abstract for What cost? In First International Workshop on Adaptive Information Retrieval (AIR).

[6] Turpin, A. and Scholer, F. 2006. User performance versus precision measures for simple search tasks. In Proceedings of the 29th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval, 11-18.

[7] Saracevic, T. 2007. Relevance: A review of the literature and a framework for thinking on the notion in information science. Part II: nature and manifestations of relevance. Journal of the American Society for Information Science and Technology 58(3), 1915-1933.

[8] Cutting, D., Karger, D., Pedersen, J., and Tukey, J. 1992. Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. In Proceedings of the 15th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval, 318-329.

[9] Workshop on Faceted Search. 2006. In Proceedings of the 29th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval.

[10] Von Ahn, L. 2006. Games with a Purpose. IEEE Computer 39, 6 (June 2006), 92-94.

[11] Von Ahn, L., Ginosar, S., Kedia, M., Liu, R., and Blum, M. 2006. Improving accessibility of the web with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 79-82.


Wednesday, July 2, 2008

A Call to Action

I sent the following open letter to the leading enterprise providers and industry analysts in the information access community. I am inspired by the recent efforts of researchers to bring industry events to major academic conferences. I'd like to see industry--particularly enterprise providers and industry analysts--return the favor, embracing these events to help bridge the gap between research and practice.

Dear friends in the information access community,

I am reaching out to you with this open letter because I believe we, the leading providers and analysts in the information access community, share a common goal of helping companies understand, evaluate, and differentiate the technologies in this space.

Frankly, I feel that we as a community can do much better at achieving this goal. In my experience talking with CTOs, CIOs, and other decision makers in enterprises, I've found that too many people fail to understand either the state of current technology or the processes they need to put in place to leverage that technology. Indeed, a recent AIIM report confirms what I already knew anecdotally--that there is a widespread failure in the enterprise to understand and derive value from information access.

In order to advance the state of knowledge, I propose that we engage an underutilized resource: the scholarly community of information retrieval and information science researchers. Not only has this community brought us many of the foundations of the technology we provide, but it has also developed a rigorous tradition of evaluation and peer review.

In addition, this community has been increasingly interested in connection with practitioners, as demonstrated by the industry days held at top-tier scholarly conferences, such as SIGIR, CIKM, and ECIR. I have participated in a few of these, and I was impressed with the quality of both the presenters and the attendees. Web search leaders, such as Google, Yahoo, and Microsoft, have embraced these events, as have smaller companies that specialize in search and related technologies, such as information extraction. Enterprise information access providers, however, have been largely absent at these events, as have industry analysts.

I suggest that we take at least the following steps to engage the scholarly community of information retrieval and information science researchers:
  • Collaborate with the organizers of academic conferences such as SIGIR, CIKM, and ECIR to promote participation of enterprise information access providers and analysts in conference industry days.

  • Participate in workshops that are particularly relevant to enterprise information access providers, such as the annual HCIR and exploratory search workshops.
The rigor and independence of the conferences and workshops make them ideal as vendor-neutral forums. I hope that you all will join me in working to strengthen the connection between the commercial and scholarly communities, thus furthering everyone's understanding of the technology that drives our community forward.

Please contact me at dt@endeca.com or join in an open discussion at http://thenoisychannel.blogspot.com/2008/07/call-to-action.html if you are interested in participating in this effort.

Sincerely,
Daniel Tunkelang

Tuesday, July 1, 2008

Clarification before Refinement on Amazon

I just noticed today that a search on Amazon (e.g., this search for algorithms) does not provide the options to sort the results or to refine by anything other than category. Once you do select a category (e.g., books), you are given additional refinement options, as well as the ability to sort.

While I find this interface less than ideal (e.g., even if all of your search results are in a single category, it still makes you select that category explicitly), I do commend them for recognizing the need to have users clarify before they refine. The implication--one we've been pursuing at Endeca--is that it is incumbent on the system to detect when its understanding of the user's intent is ambiguous enough to require a clarification dialogue.
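A minimal sketch of one way a system might make that decision: if the matching items are spread across several top-level categories (measured here by the entropy of the category distribution exceeding an assumed threshold), ask the user to clarify before offering sort and refinement options. The categories, counts, and threshold are all hypothetical.

    # Sketch: decide whether to ask for clarification before refinement.
    # Categories, counts, and the entropy threshold are hypothetical.

    from math import log2

    def category_entropy(counts):
        total = sum(counts.values())
        probs = [c / total for c in counts.values() if c > 0]
        return -sum(p * log2(p) for p in probs)

    def needs_clarification(counts, threshold=1.0):
        """Ask the user to pick a category when results are spread widely."""
        return category_entropy(counts) > threshold

    # Result counts by top-level category for the query "algorithms".
    results_by_category = {"Books": 800, "Software": 500, "Kindle Store": 300}

    if needs_clarification(results_by_category):
        print("Which category did you mean?", sorted(results_by_category))
    else:
        top = max(results_by_category, key=results_by_category.get)
        print("Showing results in", top, "with sort and refinement options")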

