Friday, August 29, 2008

Improving The Noisy Channel: A Call for Ideas

Over the past five months, this blog has grown from a suggestion Jeff Dalton put in my ear to a community to which I'm proud to belong.

Some milestones:
  • Over 70 posts to date.
  • 94 subscribers, as reported by Google Reader.
  • 100 unique visitors on a typical day.
To be honest, I thought I'd struggle to keep up with posting weekly, and that I'd need to convince my mom to read this blog so that I wouldn't be speaking to an empty room. The results so far have wildly exceeded the expectations I came in with.

But now that I've seen the potential of this blog, I'd like to "take it to the next level," as the MBA types say.

My goals:
  • Increase the readership. My motive isn't (only) to inflate my own ego. I've seen that this blog succeeds most when it stimulates conversation, and a conversation needs participants.

  • Increase participation. Given the quantity and quality of comments on recent posts, it's clear that readers here contribute the most valuable content. I'd like to step that up a notch by having readers guest-blog, and perhaps go as far as turning The Noisy Channel into a group blog about information seeking that transcends my personal take on the subject. I'm very open to suggestions here.

  • Add some style. Various folks have offered suggestions for improving the blog, such as changing platforms to WordPress, modifying the layout to better use screen real estate, adding more images, etc. I'm the first to admit that I am not a designer, and I'd really appreciate ideas from you all on how to make this site more attractive and usable.
In short, I'm asking you to help me help you make The Noisy Channel a better and noisier place. Please post your comments here or email me if you'd prefer to make suggestions privately.

Wednesday, August 27, 2008

Transparency in Information Retrieval

It's been hard to find time to write another post while keeping up with the comment stream on my previous post about set retrieval! I'm very happy to see this level of interest, and I hope to continue catalyzing such discussions.

Today, I'd like to discuss transparency in the context of information retrieval. Transparency is an increasingly popular term in discussions of search--perhaps not surprisingly, since users are finally starting to question the idea of search as a black box.

The idea of transparency is simple: users should know why a search engine returns a particular response to their query. Note the emphasis on "why" rather than "how". Most users don't care what algorithms a search engine uses to compute a response. What they do care about is how the engine ultimately "understood" their query--in other words, what question the engine thinks it's answering.

Some of you might find this description too anthropomorphic. But a recent study reported that most users expect search engines to read their minds--never mind that the general case goes beyond AI-complete (should we create a new class of ESP-complete problems?). What frustrates users most, though, is when a search engine not only fails to read their minds, but also gives no indication of where the communication broke down, let alone how to fix it. In short, a failure to provide transparency.

What does this have to do with set retrieval vs. ranked retrieval? Plenty!

Set retrieval predates the Internet by a few decades, and was the first approach used to implement search engines. These search engines allowed users to enter queries by stringing together search terms with Boolean operators (AND, OR, etc.). Today, Boolean retrieval seems arcane, and most people see set retrieval as suitable for querying databases rather than search engines.

The biggest problem with set retrieval is that users find it extremely difficult to compose effective Boolean queries. Nonetheless, there is no question that set retrieval offers transparency: what you ask is what you get. And, if you prefer a particular sort order for your results, you can specify it.
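To make that transparency concrete, here is a minimal sketch of Boolean set retrieval over an inverted index (the toy corpus and query are invented for illustration; real engines tokenize and scale far more carefully):

    # Toy corpus; a real engine would tokenize far more carefully.
    corpus = {
        1: "boolean retrieval predates the web",
        2: "ranked retrieval orders documents by score",
        3: "boolean retrieval queries combine terms with operators",
    }

    # Inverted index: term -> set of document ids.
    index = {}
    for doc_id, text in corpus.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)

    # "boolean AND retrieval" is exactly a set intersection: the
    # result set follows directly from the query, nothing hidden.
    matches = index["boolean"] & index["retrieval"]
    print(sorted(matches))  # [1, 3]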

In contrast, ranked retrieval makes it much easier for users to compose queries: users simply enter a few top-of-mind keywords. And for many use cases (in particular, known-item search), a state-of-the-art implementation of ranked retrieval yields results that are good enough.

But ranked retrieval approaches generally shed transparency. At best, they employ standard information retrieval models that, although published in all of their gory detail, are opaque to their users--who are unlikely to be SIGIR regulars. At worst, they employ secret, proprietary models, either to protect their competitive differentiation or to thwart spammers.
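For contrast, here is a toy ranker in the spirit of one such standard model, TF-IDF weighting (the corpus and the smoothing are invented for illustration, not any particular engine's formula). The user sees only the final ordering; the scores that produced it stay hidden:

    import math
    from collections import Counter

    corpus = {
        1: "transparency in information retrieval",
        2: "ranked retrieval hides its scoring from users",
        3: "users compose keyword queries",
    }
    docs = {d: Counter(text.split()) for d, text in corpus.items()}
    N = len(docs)

    def idf(term):
        # Smoothed inverse document frequency.
        df = sum(1 for tf in docs.values() if term in tf)
        return math.log((N + 1) / (df + 1)) + 1

    def score(query, doc_id):
        # Sum of tf * idf over the query terms.
        tf = docs[doc_id]
        return sum(tf[t] * idf(t) for t in query.split())

    ranking = sorted(docs, key=lambda d: score("retrieval users", d),
                     reverse=True)
    print(ranking)  # [2, 1, 3] -- but why? The scores stay hidden.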

Either way, the only clues that most ranked retrieval engines provide to users are text snippets from the returned documents. Those snippets may validate the relevance of the results that are shown, but the user does not learn what distinguishes the top-ranked results from other documents that contain some or all of the query terms.

If the user is satisfied with one of the top results, then transparency is unlikely to even come up. Even if the selected result isn't optimal, users may do well to satisfice. But when the search engine fails to read the user's mind, transparency offers the best hope of recovery.

But, as I mentioned earlier, users aren't great at composing queries for set retrieval--which is how ranked retrieval became so popular in the first place, despite its lack of transparency. How do we resolve this dilemma?

To be continued...

Sunday, August 24, 2008

Set Retrieval vs. Ranked Retrieval

After last week's post about a racially targeted web search engine, you'd think I'd avoid controversy for a while. To the contrary, I now feel bold enough to bring up what I have found to be my most controversial position within the information retrieval community: my preference for set retrieval over ranked retrieval.

This will be the first of several posts along this theme, so I'll start by introducing the terms.
  • In a ranked retrieval approach, the system responds to a search query by ranking all documents in the corpus based on its estimate of their relevance to the query.

  • In a set retrieval approach, the system partitions the corpus into two subsets of documents: those it considers relevant to the search query, and those it does not.
An information retrieval system can combine set retrieval and ranked retrieval by first determining a set of matching documents and then ranking the matching documents. Most industrial search engines, such as Google, take this approach, at least in principle. But, because the set of matching documents is typically much larger than the set of documents displayed to a user, these approaches are, in practice, ranked retrieval.
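To make the hybrid concrete, here is a minimal sketch (with invented index and scoring); when the match set dwarfs the page of results actually shown, the set step is invisible to the user and the system behaves, for all practical purposes, like ranked retrieval:

    def matching_set(query_terms, index):
        # Set retrieval step: documents containing all query terms.
        sets = [index.get(t, set()) for t in query_terms]
        return set.intersection(*sets) if sets else set()

    def first_page(matches, relevance_score, page_size=10):
        # Ranked retrieval step: order the match set, keep one page.
        return sorted(matches, key=relevance_score, reverse=True)[:page_size]

    index = {"set": {1, 2, 5, 8}, "retrieval": {1, 3, 5}}
    matches = matching_set(["set", "retrieval"], index)       # {1, 5}
    page = first_page(matches, relevance_score=lambda d: -d)  # [1, 5]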

What is set retrieval in practice? In my view, a set retrieval approach satisfies two expectations:
  • The number of documents reported to match my search should be meaningful--or at least should be a meaningful estimate. More generally, any summary information reported about this set should be useful.

  • Displaying a random subset of the set of matching documents to the user should be a plausible behavior, even if it is not as good as displaying the top-ranked matches. In other words, relevance ranking should help distinguish more relevant results from less relevant results, rather than distinguishing relevant results from irrelevant results.
Despite its popularity, the ranked retrieval model suffers because it does not provide a clear split between relevant and irrelevant documents. This weakness makes it impossible to obtain even basic analysis of the query results, such as the number of relevant documents, let alone more complicated analysis, such as measures of result quality. In contrast, a set retrieval model does not rank the retrieved documents; instead, it establishes a clear split between the documents that are in the retrieved set and those that are not. As a result, set retrieval models enable rich analysis of query results, which can then be applied to improve the user experience.

Saturday, August 23, 2008

Back from the Cone of Silence

Regular readers may have noticed the lack of posts this week. My apologies to anyone who was waiting by the RSS feed. Yesterday was the submission deadline for HCIR '08, which means that today is a new day! So please stay tuned for your regularly scheduled programming.

Saturday, August 16, 2008

Thinking Outside the Black Box

I was reading Techmeme today, and I noticed an LA Times article about RushmoreDrive, described on its About Us page as "a first-of-its-kind search engine for the Black community." My first reaction, blogged by others already, was that this idea was dumb and racist. In fact, it took some work to find positive commentary about RushmoreDrive.

But I've learned from the way the blogosphere handled the Cuil launch not to trust anyone who evaluates a search engine without having tried it, myself included. My wife and I have been the only white people at Amy Ruth's, and the service was as gracious as the chicken and waffles were delicious, so I decided I'd try my luck on a search engine not targeted at my racial profile.

The search quality is solid, comparable to that of Google, Yahoo, and Microsoft. In fact, the site looks a lot like a re-skinning (no pun intended) of Ask.com, a corporate sibling of IAC-owned RushmoreDrive. Like Ask.com, RushmoreDrive emphasizes search refinement through narrowing and broadening suggestions.

What I find ironic is that the whole controversy about racial bias in relevance ranking reveals the much bigger problem--that relevance ranking should not be a black box (ok, maybe this time I'll take responsibility for the pun). I've been beating this drum at The Noisy Channel ever since I criticized Amit Singhal for Google's lack of transparency. I think that sites like RushmoreDrive are inevitable if search engines refuse to cede more control of search results to users.

I don't know how much information race provides as a prior to influence statistical ranking approaches, but I'm skeptical that the effects are useful or even noticeable beyond a few well-chosen examples. I'm more inclined to see RushmoreDrive as a marketing ploy by the folks at IAC--and perhaps a successful one. I doubt that Google is running scared, but I think this should be a wake-up call to folks who are convinced that personalized relevance ranking is the end goal of user experience for search engines.

Friday, August 15, 2008

New Information Retrieval Book Available Online

Props to Jeff Dalton for alerting me about the new book on information retrieval by Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. You can buy a hard copy, but you can also access it online for free at the book website.

Wednesday, August 13, 2008

David Huynh's Freebase Parallax

One of the perks of working in HCIR is that you get to meet some of the coolest people in academic and industrial research. I met David Huynh a few years ago, while he was a graduate student at MIT, working in the Haystack group and on the Simile project. You've probably seen some of his work: his Timeline project has been deployed all over the web.

Despite efforts by me and others to persuade David to stay in the Northeast, he went out west a few months ago to join Metaweb, a company with ambitions "to build a better infrastructure for the Web." While I (and others) remain unpersuaded by Freebase, Metaweb's "open database of the world’s information," I am happy to see that David is still doing great work.

I encourage you to check out David's latest project: Freebase Parallax. In it, he does something I've never seen outside Endeca (excepting David's earlier work on a Nested Faceted Browser): he allows you to navigate using the facets of multiple entity types, joining between sets of entities through their relationships. At Endeca, we call this "record relationship navigation"--we presented it at HCIR '07, showing how it can enable social navigation.

David includes a video where he eloquently demonstrates how Parallax works, and the interface is quite compelling. I'm not sure how well it scales with large data sets, but David's focus has been on interfaces rather than systems. My biggest complaint--which isn't David's fault--is that the Freebase content is a bit sparse. But his interface strikes me as a great fit for exploratory search.

Conversation with Seth Grimes

I had a great conversation with Intelligent Enterprise columnist Seth Grimes today. Apparently there's an upside to writing critical commentary on Google's aspirations in the enterprise!

One of the challenges in talking about enterprise search is that no one seems to agree on what it is. Indeed, as I've been discussing with Ryan Shaw, I use the term broadly to describe information access scenarios distinct from web search, where an organization has some ownership or control of the content (in contrast to the somewhat adversarial relationship that web search companies have with the content they index). But I realize that many folks define enterprise search more narrowly as a search box hooked up to the intranet.

Perhaps a better way to think about enterprise search is as a problem rather than a solution. Many people expect a search box because they're familiar with searching the web using Google. I don't blame anyone for expecting the same interface to work for enterprise information collections. Unfortunately, wishful thinking and clever advertising notwithstanding, it doesn't.

I've blogged about this subject from several different perspectives over the past weeks, so I'll refer recent readers to earlier posts on the subject rather than bore the regulars.

But I did want to mention a comment Seth made that I found particularly insightful. He defined enterprise search even more broadly than I do, suggesting that it encompassed any information seeking performed in the pursuit of enterprise-centric needs. In that context, he does see Google as the leader in enterprise search--not because of their enterprise offerings, but rather because of the web search they offer for free.

I'm not sure how I feel about his definition, but I think he raises a point that enterprise vendors often neglect. No matter how much information an enterprise controls, there will always be valuable information outside the enterprise. I find today's APIs to that information woefully inadequate; for example, I can't even choose a sort order through any of the web search APIs. But I am optimistic that those APIs will evolve, and that we will see "federated" information seeking that goes beyond merging ranked lists from different sources.
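To illustrate what mere merging buys you, and what it doesn't, here is a sketch of round-robin interleaving of ranked lists from two invented sources; without comparable scores, or control over sort order from the underlying APIs, something this crude is arguably the best a federated client can do:

    from itertools import zip_longest

    def interleave(*ranked_lists):
        # Round-robin merge that drops duplicates, preserving
        # first-seen order; no source exposes scores to do better.
        seen, merged = set(), []
        for group in zip_longest(*ranked_lists):
            for item in group:
                if item is not None and item not in seen:
                    seen.add(item)
                    merged.append(item)
        return merged

    web_results = ["a.com", "b.com", "c.com"]
    news_results = ["b.com", "d.com"]
    print(interleave(web_results, news_results))
    # ['a.com', 'b.com', 'd.com', 'c.com']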

Indeed, I look forward to the day that web search providers take a cue from the enterprise and drop the focus on black box relevance ranking in favor of an approach that offers users control and interaction.

Monday, August 11, 2008

Position papers for NSF IS3 Workshop

I just wanted to let folks know that the position papers for the NSF Information Seeking Support Systems Workshop are now available at this link.

Here is a listing to whet your curiosity:
  • Supporting Interaction and Familiarity
    James Allan, University of Massachusetts Amherst, USA

  • From Web Search to Exploratory Search: Can we get there from here?
    Peter Anick, Yahoo! Inc., USA

  • Complex and Exploratory Web Search (with Daniel Russell)
    Anne Aula, Google, USA

  • Really Supporting Information Seeking: A Position Paper
    Nicholas J. Belkin, Rutgers University, USA

  • Transparent and User-Controllable Personalization For Information Exploration
    Peter Brusilovsky, University of Pittsburgh, USA

  • Faceted Exploratory Search Using the Relation Browser
    Robert Capra, UNC, USA

  • Towards a Model of Understanding Social Search
    Ed Chi, Palo Alto Research Center, USA

  • Building Blocks For Rapid Development of Information Seeking Support Systems
    Gary Geisler, University of Texas at Austin, USA

  • Collaborative Information Seeking in Electronic Environments
    Gene Golovchinsky, FX Palo Alto Laboratory, USA

  • NeoNote: User Centered Design Suggestions for a Global Shared Scholarly Annotation System
    Brad Hemminger, UNC, USA

  • Speaking the Same Language About Exploratory Information Seeking
    Bill Kules, The Catholic University of America, USA

  • Musings on Information Seeking Support Systems
    Michael Levi, U.S. Bureau of Labor Statistics, USA

  • Social Bookmarking and Information Seeking
    David Millen, IBM Research, USA

  • Making Sense of Search Result Pages
    Jan Pedersen, Yahoo, USA

  • A Multilevel Science of Social Information Foraging and Sensemaking
    Peter Pirolli, Xerox PARC, USA

  • Characterizing, Supporting and Evaluating Exploratory Search
    Edie Rasmussen, University of British Columbia, Canada

  • The Information-Seeking Funnel
    Daniel Rose, A9.com, USA

  • Complex and Exploratory Web Search (with Anne Aula)
    Daniel Russell, Google, USA

  • Research Agenda: Visual Overviews for Exploratory Search
    Ben Shneiderman, University of Maryland, USA

  • Five Challenges for Research to Support IS3
    Elaine Toms, Dalhousie University, Canada

  • Resolving the Battle Royale between Information Retrieval and Information Science
    Daniel Tunkelang, Endeca, USA

Sunday, August 10, 2008

Why Enterprise Search Will Never Be Google-y

As I prepared to end my trilogy of Google-themed posts, I ran into two recently published items. They provide an excellent context for what I intended to talk about: the challenges and opportunities of enterprise search.

The first is Google's announcement of an upgrade to their search appliance that allows one box to index 10 million documents and offers improved search quality and personalization.

The second is an article by Chris Sherman in the Enterprise Search Sourcebook 2008 entitled Why Enterprise Search Will Never Be Google-y.

First, the Google announcement. These are certainly improvements for the GSA, and Google does seem to be aiming to compete with the Big Three: Autonomy, Endeca, and FAST (now a subsidiary of Microsoft). But these improvements should be seen in the context of the state of the art. In particular, Google's scalability claims, while impressive, still fall short of the market leaders in enterprise search. Moreover, the bottleneck in enterprise search hasn't been the scale of document indexing, but rather the effectiveness with which people can access and interact with the indexed content. Interestingly, Google's strongest selling point for the GSA--the claim that it works "out of the box"--is also its biggest weakness: even with the new set of features, the GSA does not offer the flexibility or rich functionality that enterprises have come to expect.

Second, the Chris Sherman piece. Here is an excerpt:
Enterprise search and web search are fundamentally different animals, and I'd argue that enterprise search won't--and shouldn't--be Google-y any time soon....Like web search, Google's enterprise search is easy to use--if you're willing to go along with how Google's algorithms view and present your business information....Ironically, enterprises, with all of their highly structured and carefully organized silos of information, require a very different and paradoxically more complex approach.
I highly recommend that you read the whole article (it's only two pages), not only because it is informative and well written, but also because the author isn't working for one of the Big Three.

The upshot? There is no question that Google is raising the bar for simple search in the enterprise. I wouldn't recommend that anyone try to compete with the GSA on its turf.

But information needs in the enterprise go far beyond known-item search. What enterprises want when they ask for "enterprise search" is not just a search box, but an interactive tool that helps them (or their customers) work through the process of articulating and fulfilling their information needs, for tasks as diverse as customer segmentation, knowledge management, and e-discovery.

If you're interested in search and want to be on the cutting edge of innovation, I suggest you think about the enterprise.

Thursday, August 7, 2008

Where Google Isn't Good Enough

My last post, Is Google Good Enough?, challenged would-be Google killers to identify and address clear consumer needs for which Google isn't good enough as a solution. I like helping my readers, so here are some ideas.
  • Shopping. Google Product Search (fka Froogle) is not one of Google's crown jewels. At best, it works well when you know the exact name of the product you are looking for. But it pales in comparison to any modern ecommerce site, such as Amazon or Home Depot. What makes a shopping site successful? Put simply, it helps users find what they want, even when they didn't know exactly what they wanted when they started.

  • Finding a job. Google has not thrown its hat into the ring of job search, and even the page they offer for finding jobs at Google could use some improvement. The two biggest job sites, Monster and CareerBuilder, succeed in terms of the number of jobs posted, but aren't exactly optimized for user experience. Dice does better, but only for technology jobs. Interestingly, the best job-finding site may be LinkedIn--not because of their search implementation (which is adequate but not innovative), but because of their success in getting millions of professionals to provide high-quality data.

  • Finding employees. Again, LinkedIn has probably come closest to providing a good employee-finding site. The large job sites (all of which I've used at some point) not only fail to support exploratory search, but also suffer from a skew toward ineligible candidates and the nuisance of recruiters posing as job seekers. Here again, Google has not tried to compete.

  • Planning a trip. Sure, you can use Expedia, Travelocity, or Kayak to find a flight, hotel, and car rental. But there's a lot of room for improvement when it comes to planning a trip, whether for business or pleasure. The existing tools do a poor job of putting together a coordinated itinerary (e.g., meals, activities), and also don't integrate with relevant information sources, such as local directories and reviews. This is another area where Google has not tried to play.
Note two general themes here. The first is thinking beyond the mechanics of search and focusing on the ability to meet user needs at the task level. The second is the need for exploratory search. These only scratch the surface of opportunities in consumer-facing "search" applications. The opportunities within the enterprise are even greater, but I'll save that for my next post.

Tuesday, August 5, 2008

Is Google Good Enough?

As Chief Scientist of Endeca, I spend a lot of my time explaining to people why they should not be satisfied with an information-seeking interface that only offers them keyword search as an input mechanism and a ranked list of results as output. I tell them about query clarification dialogs, faceted navigation, and set analysis. More broadly, I evangelize exploratory search and human computer information retrieval as critical to addressing the inherent weakness of conventional ranked retrieval. If you haven't heard me expound on the subject, feel free to check out this slide show: Is Search Broken?
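To give a flavor of what faceted navigation involves under the hood, here is a minimal sketch of computing facet counts over a result set (the schema and records are invented): an interface can then show each refinement along with the number of results it would yield, so users never navigate to an empty page.

    from collections import Counter

    results = [
        {"title": "red shirt",  "color": "red",  "size": "M"},
        {"title": "blue shirt", "color": "blue", "size": "M"},
        {"title": "red dress",  "color": "red",  "size": "S"},
    ]

    def facet_counts(results, facet):
        # Count how many current results carry each value of the facet.
        return Counter(doc[facet] for doc in results)

    print(facet_counts(results, "color"))  # Counter({'red': 2, 'blue': 1})
    print(facet_counts(results, "size"))   # Counter({'M': 2, 'S': 1})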

But today I wanted to put my ideology aside and ask the simple question: Is Google good enough? Here is a good faith attempt to make the case for the status quo. I'll focus on web search, since, as I've discussed before on this blog, enterprise search is different.

1) Google does well enough on result quality, enough of the time.

While Google doesn't publish statistics about user satisfaction, it's widely accepted that Google usually succeeds in returning results that users find relevant. Granted, so do all of the major search engines: you can compare Google and Yahoo results graphically at this site. But the question is not whether other search engines are also good enough--or even whether they are better. The point is that Google is good enough.

2) Google doesn't support exploratory search. But it often leads you to a tool that does.

The classic instance of this synergy is when Google leads you to a Wikipedia entry. For example, I look up Daniel Kahneman on Google. The top result is his Wikipedia entry. From there, I can traverse links to learn about his research areas, his colleagues, etc.

3) Google is a benign monopoly that mitigates choice overload.

Many people, myself included, have concerns about Google's increasing role in mediating our access to information. But it's hard to ignore the upside of a single portal that gives you access to everything in one place: web pages, blogs, maps, email, etc. And it's all "free"--at least insofar as ad-supported services can be said to be free.

In summary, Google sets the bar pretty high. There are places where Google performs poorly (e.g., shopping) or doesn't even try to compete (e.g., travel). But when I see the series of companies lining up to challenge Google, I have to wonder how many of them have identified and addressed clear consumer needs for which Google isn't good enough as a solution. Given Google's near-monopoly in web search, parity or even incremental advantage isn't enough.

Friday, August 29, 2008

Improving The Noisy Channel: A Call for Ideas

Over the past five months, this blog has grown from a suggestion Jeff Dalton put in my ear to a community to which I'm proud to belong.

Some milestones:
  • Over 70 posts to date.
  • 94 subscribers, as reported by Google Reader.
  • 100 unique visitors on.a typical day.
To be honest, I thought I'd struggle to keep up with posting weekly, and that I'd need to convince my mom to read this blog so that I wouldn't be speaking to an empty room. The results so far have wildly exceeded the expectations I came in with.

But now that I've seen the potential of this blog, I'd like to "take it to the next level," as the MBA types say.

My goals:
  • Increase the readership. My motive isn't (only) to inflate my own ego. I've seen that this blog succeeds most when it stimulates conversation, and a conversation needs participants.

  • Increase participation. Given the quantity and quality of comments on recent posts, it's clear that readers here contribute the most valuable content. I'd like to step that up a notch by having readers guest-blog and perhaps going as far as to turning The Noisy Channel into a group blog about information seeking that transcends my personal take on the subject. I've very open to suggestions here.

  • Add some style. Various folks have offered suggestions for improving the blog, such as changing platforms to WordPress, modifying the layout to better use screen real estate, adding more images, etc. I'm the first to admit that I am not a designer, and I'd really appreciate ideas from you all on how to make this site more attractive and usable.
In short, I'm asking you to help me help you make The Noisy Channel a better and noisier place. Please post your comments here or email me if you'd prefer to make suggestions privately.

Wednesday, August 27, 2008

Transparency in Information Retrieval

It's been hard to find time to write another post while keeping up with the comment stream on my previous post about set retrieval! I'm very happy to see this level of interest, and I hope to continue catalyzing such discussions.

Today, I'd like to discuss transparency in the context of information retrieval. Transparency is an increasingly popular term these days in the context of search--perhaps not surprising, since users are finally starting to question the idea of search as a black box.

The idea of transparency is simple: users should know why a search engine returns a particular response to their query. Note the emphasis on "why" rather than "how". Most users don't care what algorithms a search engine uses to compute a response. What they do care about is how the engine ultimately "understood" their query--in other words, what question the engine thinks it's answering.

Some of you might find this description too anthropomorphic. But a recent study reported that most users expect search engines to read their minds--never mind that the general case goes beyond AI-complete (should we create a new class of ESP-complete problems)? But what frustrates users most is when a search engine not only fails to read their minds, but gives no indication of where the communication broke down, let alone how to fix it. In short, a failure to provide transparency.

What does this have to do with set retrieval vs. ranked retrieval? Plenty!

Set retrieval predates the Internet by a few decades, and was the first approach used to implement search engines. These search engines allowed users to enter queries by stringing together search terms with Boolean operators (AND, OR, etc.). Today, Boolean retrieval seem arcane, and most people see set retrieval as suitable for querying databases, rather than for querying search engines.

The biggest problem with set retrieval is that users find it extremely difficult to compose effective Boolean queries. Nonetheless, there is no question that set retrieval offers transparency: what you ask is what you get. And, if you prefer a particular sort order for your results, you can specify it.

In contrast, ranked retrieval makes it much easier for users to compose queries: users simply enter a few top-of-mind keywords. And for many use cases (in particular, known-item search) , a state-of-the-art implementation of ranked retrieval yields results that are good enough.

But ranked retrieval approaches generally shed transparency. At best, they employ standard information retrieval models that, although published in all of their gory detail, are opaque to their users--who are unlikely to be SIGIR regulars. At worst, they employ secret, proprietary models, either to protect their competitive differentiation or to thwart spammers.

Either way, the only clues that most ranked retrieval engines provide to users are text snippets from the returned documents. Those snippets may validate the relevance of the results that are shown, but the user does not learn what distinguishes the top-ranked results from other documents that contain some or all of the query terms.

If the user is satisfied with one of the top results, then transparency is unlikely to even come up. Even if the selected result isn't optimal, users may do well to satisfice. But when the search engine fails to read the user's mind, transparency offer the best hope of recovery.

But, as I mentioned earlier, users aren't great at composing queries for set retrieval, which was how ranked retrieval became so popular in the first place despite its lack of transparency. How do we resolve this dilemma?

To be continued...

Sunday, August 24, 2008

Set Retrieval vs. Ranked Retrieval

After last week's post about a racially targeted web search engine, you'd think I'd avoid controversy for a while. To the contrary, I now feel bold enough like to bring up what I have found to be my most controversial position within the information retrieval community: my preference for set retrieval over ranked retrieval.

This will be the first of several posts along this theme, so I'll start by introducing the terms.
  • In a ranked retrieval approach, the system responds to a search query by ranking all documents in the corpus based on its estimate of their relevance to the query.

  • In a set retrieval approach, the system partitions the corpus into two subsets of documents: those it considers relevant to the search query, and those it does not.
An information retrieval system can combine set retrieval and ranked retrieval by first determining a set of matching documents and then ranking the matching documents. Most industrial search engines, such as Google, take this approach, at least in principle. But, because the set of matching documents is typically much larger than the set of documents displayed to a user, these approaches are, in practice, ranked retrieval.

What is set retrieval in practice? In my view, a set retrieval approach satisfies two expectations:
  • The number of documents reported to match my search should be meaningful--or at least should be a meaningful estimate. More generally, any summary information reported about this set should be useful.

  • Displaying a random subset of the set of matching documents to the user should be a plausible behavior, even if it is not as good as displaying the top-ranked matches. In other words, relevance ranking should help distinguish more relevant results from less relevant results, rather than distinguishing relevant results from irrelevant results.
Despite its popularity, the ranked retrieval model suffers because it does not provide a clear split between relevant and irrelevant documents. This weakness makes it impossible to obtain even basic analysis of the query results, such as the number of relevant documents, let alone a more complicated one, such as the result quality. In contrast, a set retrieval model partitions the corpus into two subsets of documents: those that are considered relevant, and those that are not. A set retrieval model does not rank the retrieved documents; instead, it establishes a clear split between documents that are in and out of the retrieved set. As a result, set retrieval models enable rich analysis of query results, which can then be applied to improve user experience.

Saturday, August 23, 2008

Back from the Cone of Silence

Regular readers may have noticed the lack of posts this week. My apologies to anyone who was waiting by the RSS feed. Yesterday was the submission deadline for HCIR '08, which means that today is a new day! So please stay tuned for your regularly scheduled programming.

Saturday, August 16, 2008

Thinking Outside the Black Box

I was reading Techmeme today, and I noticed an LA Times article about RushmoreDrive, described on its About Us page as "a first-of-its-kind search engine for the Black community." My first reaction, blogged by others already, was that this idea was dumb and racist. In fact, it took some work to find positive commentary about RushmoreDrive.

But I've learned from the way the blogosphere handled the Cuil launch not to trust anyone who evaluates a search engine without having tried it, myself included. My wife and I have been the only white people at Amy Ruth's and the service was as gracious as the chicken and waffles were delicious; I decided I'd try my luck on a search engine not targeted at my racial profile.

The search quality is solid, comparable to that of Google, Yahoo, and Microsoft. In fact, the site looks a lot like a re-skinning (no pun intended) of Ask.com, a corporate sibling of IAC-owned RushmoreDrive. Like Ask.com, RushmoreDrive emphasizes search refinement through narrowing and broadening refinements.

What I find ironic is that the whole controversy about racial bias in relevance ranking reveals the much bigger problem--that relevance ranking should not be a black box (ok, maybe this time I'll take responsibility for the pun). I've been beating this drum at The Noisy Channel ever since I criticized Amit Singhal for Google's lack of transparency. I think that sites like RushmoreDrive are inevitable if search engines refuse to cede more control of search results to users.

I don't know how much information race provides as prior to influence statistical ranking approaches, but I'm skeptical that the effects are useful or even noticeable beyond a few well-chosen examples. I'm more inclined to see RushmoreDrive as a marketing ploy by the folks at IAC--and perhaps a successful one. I doubt that Google is running scared, but I think this should be a wake-up call to folks who are convinced that personalized relevance ranking is the end goal of user experience for search engines.

Friday, August 15, 2008

New Information Retrieval Book Available Online

Props to Jeff Dalton for alerting me about the new book on information retrieval by Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. You can buy a hard copy, but you can also access it online for free at the book website.

Wednesday, August 13, 2008

David Huynh's Freebase Parallax

One of the perks of working in HCIR is that you get to meet some of the coolest people in academic and industrial research. I met David Huynh a few years ago, while he was a graduate student at MIT, working in the Haystack group and on the Simile project. You've probably seen some of his work: his Timeline project has been deployed all over the web.

Despite efforts by me and other to persuade David to stay in the Northeast, he went out west a few months ago to join Metaweb, a company with ambitions "to build a better infrastructure for the Web." While I (and others) am not persuaded by Freebase, Metaweb's "open database of the world’s information," I am happy to see that David is still doing great work.

I encourage you to check out David's latest project: Freebase Parallax. In it, he does something I've never seen outside Endeca (excepting David's earlier work on a Nested Faceted Browser) he allows you to navigate using the facets of multiple entity types, joining between sets of entities through their relationships. At Endeca, we call this "record relationship navigation"--we presented it at HCIR '07, showing an how it can enable social navigation.

David includes a video where he eloquently demonstrates how Parallax works, and the interface is quite compelling. I'm not sure how well it scales with large data sets, but David's focus has been on interfaces rather than systems. My biggest complaint--which isn't David's fault--is that the Freebase content is a bit sparse. But his interface strikes me as a great fit for exploratory search.

Conversation with Seth Grimes

I had an great conversation with Intelligent Enterprise columnist Seth Grimes today. Apparently there's an upside to writing critical commentary on Google's aspirations in the enterprise!

One of the challenges in talking about enterprise search is that no one seems to agree on what it is. Indeed, as I've been discussing with Ryan Shaw , I use the term broadly to describe information access scenarios distinct from web search where an organization has some ownership or control of the content (in contrast to the somewhat adversarial relationship that web search companies have with the content they index). But I realize that many folks define enterprise search more narrowly to be a search box hooked up to the intranet.

Perhaps a better way to think about enterprise search is as a problem rather than solution. Many people expect a search box because they're familiar with searching the web using Google. I don't blame anyone for expecting that the same interface will work for enterprise information collections. Unfortunately, wishful thinking and clever advertising notwithstanding, it doesn't.

I've blogged about this subject from several different perspectives over the past weeks, so I'll refer recent readers to earlier posts on the subject rather than bore the regulars.

But I did want to mention a comment Seth made that I found particularly insightful. He defined enterprise search even more broadly than I do, suggesting that it encompassed any information seeking performed in the pursuit of enterprise-centric needs. In that context, he does see Google as the leader in enterprise search--not because of their enterprise offerings, but rather because of the web search they offer for free.

I'm not sure how I feel about his definition, but I think he raises a point that enterprise vendors often neglect. No matter how much information an enterprise controls, there will always be valuable information outside the enterprise. I find today's APIs to that information woefully inadequate; for example, I can't even choose a sort order through any of the web search APIs. But I am optimistic that those APIs will evolve, and that we will see "federated" information seeking that goes beyond merging ranked lists from different sources.

Indeed, I look forward to the day that web search providers take a cue from the enterprise and drop the focus on black box relevance ranking in favor of an approach that offers users control and interaction.

Monday, August 11, 2008

Position papers for NSF IS3 Workshop

I just wanted to let folks know that the position papers for the NSF Information Seeking Support Systems Workshop are now available at this link.

Here is a listing to whet your curiosity:
  • Supporting Interaction and Familiarity
    James Allan, University of Massachusetts Amherst, USA

  • From Web Search to Exploratory Search: Can we get there from here?
    Peter Anick, Yahoo! Inc., USA

  • Complex and Exploratory Web Search (with Daniel Russell)
    Anne Aula, Google, USA

  • Really Supporting Information Seeking: A Position Paper
    Nicholas J. Belkin, Rutgers University, USA

  • Transparent and User-Controllable Personalization For Information Exploration
    Peter Brusilovsky, University of Pittsburgh, USA

  • Faceted Exploratory Search Using the Relation Browser
    Robert Capra, UNC, USA

  • Towards a Model of Understanding Social Search
    Ed Chi, Palo Alto Research Center, USA

  • Building Blocks For Rapid Development of Information Seeking Support Systems
    Gary Geisler, University of Texas at Austin, USA

  • Collaborative Information Seeking in Electronic Environments
    Gene Golovchinsky, FX Palo Alto Laboratory, USA

  • NeoNote: User Centered Design Suggestions for a Global Shared Scholarly Annotation System
    Brad Hemminger, UNC, USA

  • Speaking the Same Language About Exploratory Information Seeking
    Bill Kules, The Catholic University of America, USA

  • Musings on Information Seeking Support Systems
    Michael Levi, U.S. Bureau of Labor Statistics, USA

  • Social Bookmarking and Information Seeking
    David Millen, IBM Research, USA

  • Making Sense of Search Result Pages
    Jan Pedersen, Yahoo, USA

  • A Multilevel Science of Social Information Foraging and Sensemaking
    Peter Pirolli, XEROX PARC USA

  • Characterizing, Supporting and Evaluating Exploratory Search
    Edie Rasmussen, University of British Columbia, Canada

  • The Information-Seeking Funnel
    Daniel Rose, A9.com, USA

  • Complex and Exploratory Web Search (with Anne Aula)
    Daniel Russell, Google, USA

  • Research Agenda: Visual Overviews for Exploratory Search
    Ben Shneiderman, University of Maryland, USA

  • Five Challenges for Research to Support IS3
    Elaine Toms, Dalhousie University, Canada

  • Resolving the Battle Royale between Information Retrieval and Information Science
    Daniel Tunkelang, Endeca, USA

Sunday, August 10, 2008

Why Enterprise Search Will Never Be Google-y

As I prepared to end my trilogy of Google-themed posts, I ran into two recently published items. They provide an excellent context for what I intended to talk about: the challenges and opportunities of enterprise search.

The first is Google's announcement of an upgrade to their search appliance that allows one box to index 10 million documents and offers improved search quality and personalization.

The second is an article by Chris Sherman in the Enterprise Search Sourcebook 2008 entitled Why Enterprise Search Will Never Be Google-y.

First, the Google announcement. These are certainly improvements for the GSA, and Google does seem to be aiming to compete with the Big Three: Autonomy, Endeca, FAST (now a subsidiary of Microsoft). But these improvements should be seen in the context of state of the art. In particular, Google's scalability claims, while impressive, still fall short of the market leaders in enterprise search. Moreover, the bottleneck in enterprise search hasn't been the scale of document indexing, but rather the effectiveness with which people can access and interact with the indexed content. Interestingly, Google's strongest selling point for the GSA, their claim it works "out of the box", is also its biggest weakness: even with the new set of features, the GSA does not offer the flexibility or rich functionality that enterprises have come to expect.

Second, the Chris Sherman piece. Here is an excerpt:
Enterprise search and web search are fundamentally different animals, and I'd argue that enterprise search won't--and shouldn't--be Google-y any time soon....Like web search, Google's enterprise search is easy to use--if you're willing to go along with how Google's algorithms view and present your business information....Ironically, enterprises, with all of their highly structures and carefully organized silos of information, require a very different and paradoxically more complex approach.
I highly recommend you read the whole article (it's only 2 pages), not only because it informative and well written, but also because the author isn't working for one of the Big Three.

The upshot? There is no question that Google is raising the bar for simple search in the enterprise. I wouldn't recommend that anyone try to compete with the GSA on its turf.

But information needs in the enterprise go far beyond known-item search, What enterprises want when they ask for "enterprise search" is not just a search box, but an interactive tool that helps them (or their customers) work through the process of articulating and fulfilling their information needs, for tasks as diverse as customer segmentation, knowledge management, and e-discovery.

If you're interested in search and want to be on the cutting edge of innovation, I suggest you think about the enterprise.

Thursday, August 7, 2008

Where Google Isn't Good Enough

My last post, Is Google Good Enough?, challenged would-be Google killers to identify and address clear consumer needs for which Google isn't good enough as a solution. I like helping my readers, so here are some ideas.
  • Shopping. Google Product Search (fka Froogle) is not one of Google's crown jewels. At best, it works well when you know the exact name of the product you are looking for. But it pales in contrast to any modern ecommerce site, such as Amazon or Home Depot. What makes a shopping site successful? Put simply, it helps users find what they want, even when they didn't know exactly what they wanted when they started.

  • Finding a job. Google has not thrown its hat into the ring of job search, and even the page they offer for finding jobs at Google could use some improvement. The two biggest job sites, Monster and Careerbuilder, succeed in terms of the number of jobs posted, but aren't exactly optimized for user experience. Dice does better, but only for technology jobs. Interestingly, the best job finding site may be LinkedIn--not because of their search implementation (which is adequate but not innovative), but because of their success in getting millions of professionals to provide high-quality data.

  • Finding employees. Again, LinkedIn has probably come closest to providing a good employee finding site. The large job sites (all of which I've used at some point) not only fail to support exploratory search, but also suffer from a skew towards ineligible candidates and a nuisance of recruiters posing as job seekers. Here again, Google has not tried to compete.

  • Planning a trip. Sure, you can use Expedia, Travelocity, or Kayak to find a flight, hotel, and car rental. But there's a lot of room for improvement when it comes to planning a trip, whether for business or pleasure. The existing tools do a poor job of putting together a coordinated itinerary (e.g., meals, activities), and also don't integrate with relevant information sources, such as local directories and reviews. This is another area where Google has not tried to play.
Note two general themes here. The first is thinking beyond the mechanics of search and focusing on the ability to meet user needs at the task level. The second is the need for exploratory search. These only scratch the surface of opportunities in consumer-facing "search" applications. The opportunities within the enterprise are even greater, but I'll save that for my next post.

Tuesday, August 5, 2008

Is Google Good Enough?

As Chief Scientist of Endeca, I spend a lot of my time explaining to people why they should not be satisfied with an information seekin interface that only offers them keyword search as an input mechanism and a ranked list of results as output. I tell them about query clarification dialogs, faceted navigation, and set analysis. More broadly, I evangelize exploratory search and human computer information retrieval as critical to addressing the inherent weakness of conventional ranked retrieval. If you haven't heard me expound on the subject, feel free to check out this slide show on Is Search Broken?.

But today I wanted to put my ideology aside and ask the the simple question: Is Google good enough? Here is a good faith attempt to make the case for the status quo. I'll focus on web search, since, as I've discussed before on this blog, enterprise search is different.

1) Google does well enough on result quality, enough of the time.

While Google doesn't publish statistics about user satisfaction, it's commonplace that Google usually succeeds in returning results that users find relevant. Granted, so do all of the major search engines: you can compare Google and Yahoo graphically at this site. But the question is not whether other search engines are also good enough--or even whether they are better. The point is that Google is good enough.

2) Google doesn't support exploratory search. But it often leads you to a tool that does.

The classic instance of this synergy is when Google leads you to a Wikipedia entry. For example, I look up Daniel Kahneman on Google. The top results is his Wikipedia entry. From there, I can traverse links to learn about his research areas, his colleagues, etc.

3) Google is a benign monopoly that mitigates choice overload.

Many people, myself includes, have concerns about Google's increasing role in mediating our access to information. But it's hard to ignore the upside of a single portal that gives you access to everything in one place: web pages, blogs, maps, email, etc, And it's all "free"--at least in so far as ad-supported services can be said to be free.

In summary, Google sets the bar pretty high. There are places where Google performs poorly (e.g., shopping) or doesn't even try to compete (e.g., travel). But when I see the series of companies lining up to challenge Google, I have to wonder how many of them have identified and addressed clear consumer needs for which Google isn't good enough as a solution. Given Google's near-monopoly in web search, parity or even incremental advantage isn't enough.

Friday, August 29, 2008

Improving The Noisy Channel: A Call for Ideas

Over the past five months, this blog has grown from a suggestion Jeff Dalton put in my ear to a community to which I'm proud to belong.

Some milestones:
  • Over 70 posts to date.
  • 94 subscribers, as reported by Google Reader.
  • 100 unique visitors on.a typical day.
To be honest, I thought I'd struggle to keep up with posting weekly, and that I'd need to convince my mom to read this blog so that I wouldn't be speaking to an empty room. The results so far have wildly exceeded the expectations I came in with.

But now that I've seen the potential of this blog, I'd like to "take it to the next level," as the MBA types say.

My goals:
  • Increase the readership. My motive isn't (only) to inflate my own ego. I've seen that this blog succeeds most when it stimulates conversation, and a conversation needs participants.

  • Increase participation. Given the quantity and quality of comments on recent posts, it's clear that readers here contribute the most valuable content. I'd like to step that up a notch by having readers guest-blog and perhaps going as far as to turning The Noisy Channel into a group blog about information seeking that transcends my personal take on the subject. I've very open to suggestions here.

  • Add some style. Various folks have offered suggestions for improving the blog, such as changing platforms to WordPress, modifying the layout to better use screen real estate, adding more images, etc. I'm the first to admit that I am not a designer, and I'd really appreciate ideas from you all on how to make this site more attractive and usable.
In short, I'm asking you to help me help you make The Noisy Channel a better and noisier place. Please post your comments here or email me if you'd prefer to make suggestions privately.

Wednesday, August 27, 2008

Transparency in Information Retrieval

It's been hard to find time to write another post while keeping up with the comment stream on my previous post about set retrieval! I'm very happy to see this level of interest, and I hope to continue catalyzing such discussions.

Today, I'd like to discuss transparency in the context of information retrieval. Transparency is an increasingly popular term these days in the context of search--perhaps not surprising, since users are finally starting to question the idea of search as a black box.

The idea of transparency is simple: users should know why a search engine returns a particular response to their query. Note the emphasis on "why" rather than "how". Most users don't care what algorithms a search engine uses to compute a response. What they do care about is how the engine ultimately "understood" their query--in other words, what question the engine thinks it's answering.

Some of you might find this description too anthropomorphic. But a recent study reported that most users expect search engines to read their minds--never mind that the general case goes beyond AI-complete (should we create a new class of ESP-complete problems)? But what frustrates users most is when a search engine not only fails to read their minds, but gives no indication of where the communication broke down, let alone how to fix it. In short, a failure to provide transparency.

What does this have to do with set retrieval vs. ranked retrieval? Plenty!

Set retrieval predates the Internet by a few decades, and was the first approach used to implement search engines. These search engines allowed users to enter queries by stringing together search terms with Boolean operators (AND, OR, etc.). Today, Boolean retrieval seem arcane, and most people see set retrieval as suitable for querying databases, rather than for querying search engines.

The biggest problem with set retrieval is that users find it extremely difficult to compose effective Boolean queries. Nonetheless, there is no question that set retrieval offers transparency: what you ask is what you get. And, if you prefer a particular sort order for your results, you can specify it.

In contrast, ranked retrieval makes it much easier for users to compose queries: users simply enter a few top-of-mind keywords. And for many use cases (in particular, known-item search) , a state-of-the-art implementation of ranked retrieval yields results that are good enough.

But ranked retrieval approaches generally shed transparency. At best, they employ standard information retrieval models that, although published in all of their gory detail, are opaque to their users--who are unlikely to be SIGIR regulars. At worst, they employ secret, proprietary models, either to protect their competitive differentiation or to thwart spammers.

Either way, the only clues that most ranked retrieval engines provide to users are text snippets from the returned documents. Those snippets may validate the relevance of the results that are shown, but the user does not learn what distinguishes the top-ranked results from other documents that contain some or all of the query terms.

If the user is satisfied with one of the top results, then transparency is unlikely to even come up. Even if the selected result isn't optimal, users may do well to satisfice. But when the search engine fails to read the user's mind, transparency offer the best hope of recovery.

But, as I mentioned earlier, users aren't great at composing queries for set retrieval, which was how ranked retrieval became so popular in the first place despite its lack of transparency. How do we resolve this dilemma?

To be continued...

Sunday, August 24, 2008

Set Retrieval vs. Ranked Retrieval

After last week's post about a racially targeted web search engine, you'd think I'd avoid controversy for a while. To the contrary, I now feel bold enough like to bring up what I have found to be my most controversial position within the information retrieval community: my preference for set retrieval over ranked retrieval.

This will be the first of several posts along this theme, so I'll start by introducing the terms.
  • In a ranked retrieval approach, the system responds to a search query by ranking all documents in the corpus based on its estimate of their relevance to the query.

  • In a set retrieval approach, the system partitions the corpus into two subsets of documents: those it considers relevant to the search query, and those it does not.
An information retrieval system can combine set retrieval and ranked retrieval by first determining a set of matching documents and then ranking the matching documents. Most industrial search engines, such as Google, take this approach, at least in principle. But, because the set of matching documents is typically much larger than the set of documents displayed to a user, these approaches are, in practice, ranked retrieval.
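
As a rough illustration of that two-stage approach--my own toy scoring, not how Google or any other engine actually works--here's a sketch that computes a matching set and then ranks it:

    from collections import Counter

    # Invented toy corpus: document id -> text.
    docs = {
        1: "the quick brown fox",
        2: "quick brown fox jumps over brown logs",
        3: "the lazy dog sleeps",
    }

    def matching_set(query):
        """Set retrieval stage: documents containing every query term."""
        terms = query.split()
        return {d for d, text in docs.items()
                if all(t in text.split() for t in terms)}

    def rank(candidates, query):
        """Ranking stage: order matches by a crude term-frequency score."""
        terms = query.split()
        def score(d):
            counts = Counter(docs[d].split())
            return sum(counts[t] for t in terms)
        return sorted(candidates, key=score, reverse=True)

    print(rank(matching_set("quick brown"), "quick brown"))  # [2, 1]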

What is set retrieval in practice? In my view, a set retrieval approach satisfies two expectations:
  • The number of documents reported to match my search should be meaningful--or at least should be a meaningful estimate. More generally, any summary information reported about this set should be useful.

  • Displaying a random subset of the set of matching documents to the user should be a plausible behavior, even if it is not as good as displaying the top-ranked matches. In other words, relevance ranking should help distinguish more relevant results from less relevant results, rather than distinguishing relevant results from irrelevant results.
Despite its popularity, ranked retrieval suffers from its failure to provide a clear split between relevant and irrelevant documents. That failure makes it impossible to compute even basic statistics about the query results, such as the number of relevant documents, let alone richer summaries like result quality. A set retrieval model, by contrast, draws an explicit boundary between the documents that are in the retrieved set and those that are not, rather than ranking them. As a result, set retrieval enables rich analysis of query results, which can then be applied to improve the user experience.
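
To make that payoff concrete, here's a minimal sketch of the kind of analysis an honest matching set supports--the documents and metadata fields are invented for illustration:

    from collections import Counter

    # Invented matching set: document id -> metadata.
    matches = {
        101: {"year": 2007, "source": "news"},
        102: {"year": 2008, "source": "blog"},
        103: {"year": 2008, "source": "news"},
    }

    print(len(matches), "matching documents")              # a meaningful count
    print(Counter(m["source"] for m in matches.values()))  # Counter({'news': 2, 'blog': 1})
    print(Counter(m["year"] for m in matches.values()))    # Counter({2008: 2, 2007: 1})
    # Under pure ranked retrieval there is no true match set, so neither
    # the count nor these facet summaries is well defined.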

Saturday, August 23, 2008

Back from the Cone of Silence

Regular readers may have noticed the lack of posts this week. My apologies to anyone who was waiting by the RSS feed. Yesterday was the submission deadline for HCIR '08, which means that today is a new day! So please stay tuned for your regularly scheduled programming.

Saturday, August 16, 2008

Thinking Outside the Black Box

I was reading Techmeme today, and I noticed an LA Times article about RushmoreDrive, described on its About Us page as "a first-of-its-kind search engine for the Black community." My first reaction, blogged by others already, was that this idea was dumb and racist. In fact, it took some work to find positive commentary about RushmoreDrive.

But I've learned from the way the blogosphere handled the Cuil launch not to trust anyone who evaluates a search engine without having tried it, myself included. My wife and I have been the only white people at Amy Ruth's, where the service was as gracious as the chicken and waffles were delicious, so I decided I'd try my luck on a search engine not targeted at my racial profile.

The search quality is solid, comparable to that of Google, Yahoo, and Microsoft. In fact, the site looks a lot like a re-skinning (no pun intended) of Ask.com, a corporate sibling of IAC-owned RushmoreDrive. Like Ask.com, RushmoreDrive emphasizes search refinement through narrowing and broadening suggestions.

What I find ironic is that the whole controversy about racial bias in relevance ranking reveals the much bigger problem--that relevance ranking should not be a black box (ok, maybe this time I'll take responsibility for the pun). I've been beating this drum at The Noisy Channel ever since I criticized Amit Singhal for Google's lack of transparency. I think that sites like RushmoreDrive are inevitable if search engines refuse to cede more control of search results to users.

I don't know how much information race provides as a prior to influence statistical ranking approaches, but I'm skeptical that the effects are useful or even noticeable beyond a few well-chosen examples. I'm more inclined to see RushmoreDrive as a marketing ploy by the folks at IAC--and perhaps a successful one. I doubt that Google is running scared, but I think this should be a wake-up call to folks who are convinced that personalized relevance ranking is the end goal of user experience for search engines.

Friday, August 15, 2008

New Information Retrieval Book Available Online

Props to Jeff Dalton for alerting me to the new book on information retrieval by Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. You can buy a hard copy, but you can also access it online for free at the book website.

Wednesday, August 13, 2008

David Huynh's Freebase Parallax

One of the perks of working in HCIR is that you get to meet some of the coolest people in academic and industrial research. I met David Huynh a few years ago, while he was a graduate student at MIT, working in the Haystack group and on the Simile project. You've probably seen some of his work: his Timeline project has been deployed all over the web.

Despite efforts by me and others to persuade David to stay in the Northeast, he went out west a few months ago to join Metaweb, a company with ambitions "to build a better infrastructure for the Web." While I (like others) am not persuaded by Freebase, Metaweb's "open database of the world's information," I am happy to see that David is still doing great work.

I encourage you to check out David's latest project: Freebase Parallax. In it, he does something I've never seen outside Endeca (excepting David's earlier work on a Nested Faceted Browser): he allows you to navigate using the facets of multiple entity types, joining between sets of entities through their relationships. At Endeca, we call this "record relationship navigation"--we presented it at HCIR '07, showing how it can enable social navigation.
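
Here's how I'd sketch the underlying idea in code--toy data and my own simplification, not Freebase's actual model or API: select a set of entities, join through a relationship to a second entity type, then facet on the result:

    # Invented toy data: two entity types linked by a relationship.
    films = {
        "Film A": {"director": "Director X", "genre": "drama"},
        "Film B": {"director": "Director Y", "genre": "comedy"},
        "Film C": {"director": "Director X", "genre": "drama"},
    }
    directors = {
        "Director X": {"country": "USA"},
        "Director Y": {"country": "Canada"},
    }

    # Step 1: facet over the first entity type.
    dramas = {f for f, attrs in films.items() if attrs["genre"] == "drama"}

    # Step 2: join through the relationship to a second entity set.
    drama_directors = {films[f]["director"] for f in dramas}

    # Step 3: facet on the related set's own attributes.
    by_country = {}
    for d in drama_directors:
        by_country.setdefault(directors[d]["country"], set()).add(d)
    print(by_country)  # {'USA': {'Director X'}}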

David includes a video where he eloquently demonstrates how Parallax works, and the interface is quite compelling. I'm not sure how well it scales with large data sets, but David's focus has been on interfaces rather than systems. My biggest complaint--which isn't David's fault--is that the Freebase content is a bit sparse. But his interface strikes me as a great fit for exploratory search.

Conversation with Seth Grimes

I had a great conversation with Intelligent Enterprise columnist Seth Grimes today. Apparently there's an upside to writing critical commentary on Google's aspirations in the enterprise!

One of the challenges in talking about enterprise search is that no one seems to agree on what it is. Indeed, as I've been discussing with Ryan Shaw, I use the term broadly to describe information access scenarios, distinct from web search, in which an organization has some ownership or control of the content (in contrast to the somewhat adversarial relationship that web search companies have with the content they index). But I realize that many folks define enterprise search more narrowly as a search box hooked up to the intranet.

Perhaps a better way to think about enterprise search is as a problem rather than a solution. Many people expect a search box because they're familiar with searching the web using Google. I don't blame anyone for expecting that the same interface will work for enterprise information collections. Unfortunately, wishful thinking and clever advertising notwithstanding, it doesn't.

I've blogged about this subject from several different perspectives over the past weeks, so I'll refer recent readers to earlier posts on the subject rather than bore the regulars.

But I did want to mention a comment Seth made that I found particularly insightful. He defined enterprise search even more broadly than I do, suggesting that it encompassed any information seeking performed in the pursuit of enterprise-centric needs. In that context, he does see Google as the leader in enterprise search--not because of their enterprise offerings, but rather because of the web search they offer for free.

I'm not sure how I feel about his definition, but I think he raises a point that enterprise vendors often neglect. No matter how much information an enterprise controls, there will always be valuable information outside the enterprise. I find today's APIs to that information woefully inadequate; for example, I can't even choose a sort order through any of the web search APIs. But I am optimistic that those APIs will evolve, and that we will see "federated" information seeking that goes beyond merging ranked lists from different sources.
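
As a toy illustration of the control I'm wishing for--invented data, since no current web search API returns anything like this--imagine pooling results from multiple sources and letting the user, not the engine, choose the sort order:

    # Invented results from two hypothetical sources.
    web_results = [
        {"title": "A", "date": "2008-07-01", "source": "web"},
        {"title": "B", "date": "2008-08-12", "source": "web"},
    ]
    news_results = [
        {"title": "C", "date": "2008-08-20", "source": "news"},
    ]

    # Federation beyond merging ranked lists: pool everything, then apply
    # a user-chosen sort key (here, most recent first) across all sources.
    pooled = web_results + news_results
    for r in sorted(pooled, key=lambda r: r["date"], reverse=True):
        print(r["source"], r["date"], r["title"])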

Indeed, I look forward to the day that web search providers take a cue from the enterprise and drop the focus on black box relevance ranking in favor of an approach that offers users control and interaction.

Monday, August 11, 2008

Position papers for NSF IS3 Workshop

I just wanted to let folks know that the position papers for the NSF Information Seeking Support Systems Workshop are now available at this link.

Here is a listing to whet your curiosity:
  • Supporting Interaction and Familiarity
    James Allan, University of Massachusetts Amherst, USA

  • From Web Search to Exploratory Search: Can we get there from here?
    Peter Anick, Yahoo! Inc., USA

  • Complex and Exploratory Web Search (with Daniel Russell)
    Anne Aula, Google, USA

  • Really Supporting Information Seeking: A Position Paper
    Nicholas J. Belkin, Rutgers University, USA

  • Transparent and User-Controllable Personalization For Information Exploration
    Peter Brusilovsky, University of Pittsburgh, USA

  • Faceted Exploratory Search Using the Relation Browser
    Robert Capra, UNC, USA

  • Towards a Model of Understanding Social Search
    Ed Chi, Palo Alto Research Center, USA

  • Building Blocks For Rapid Development of Information Seeking Support Systems
    Gary Geisler, University of Texas at Austin, USA

  • Collaborative Information Seeking in Electronic Environments
    Gene Golovchinsky, FX Palo Alto Laboratory, USA

  • NeoNote: User Centered Design Suggestions for a Global Shared Scholarly Annotation System
    Brad Hemminger, UNC, USA

  • Speaking the Same Language About Exploratory Information Seeking
    Bill Kules, The Catholic University of America, USA

  • Musings on Information Seeking Support Systems
    Michael Levi, U.S. Bureau of Labor Statistics, USA

  • Social Bookmarking and Information Seeking
    David Millen, IBM Research, USA

  • Making Sense of Search Result Pages
    Jan Pedersen, Yahoo, USA

  • A Multilevel Science of Social Information Foraging and Sensemaking
    Peter Pirolli, Xerox PARC, USA

  • Characterizing, Supporting and Evaluating Exploratory Search
    Edie Rasmussen, University of British Columbia, Canada

  • The Information-Seeking Funnel
    Daniel Rose, A9.com, USA

  • Complex and Exploratory Web Search (with Anne Aula)
    Daniel Russell, Google, USA

  • Research Agenda: Visual Overviews for Exploratory Search
    Ben Shneiderman, University of Maryland, USA

  • Five Challenges for Research to Support IS3
    Elaine Toms, Dalhousie University, Canada

  • Resolving the Battle Royale between Information Retrieval and Information Science
    Daniel Tunkelang, Endeca, USA

Sunday, August 10, 2008

Why Enterprise Search Will Never Be Google-y

As I prepared to end my trilogy of Google-themed posts, I ran into two recently published items. They provide an excellent context for what I intended to talk about: the challenges and opportunities of enterprise search.

The first is Google's announcement of an upgrade to their search appliance that allows one box to index 10 million documents and offers improved search quality and personalization.

The second is an article by Chris Sherman in the Enterprise Search Sourcebook 2008 entitled Why Enterprise Search Will Never Be Google-y.

First, the Google announcement. These are certainly improvements for the GSA, and Google does seem to be aiming to compete with the Big Three: Autonomy, Endeca, and FAST (now a subsidiary of Microsoft). But these improvements should be seen in the context of the state of the art. In particular, Google's scalability claims, while impressive, still fall short of the market leaders in enterprise search. Moreover, the bottleneck in enterprise search hasn't been the scale of document indexing, but rather the effectiveness with which people can access and interact with the indexed content. Interestingly, Google's strongest selling point for the GSA--the claim that it works "out of the box"--is also its biggest weakness: even with the new set of features, the GSA does not offer the flexibility or rich functionality that enterprises have come to expect.

Second, the Chris Sherman piece. Here is an excerpt:
Enterprise search and web search are fundamentally different animals, and I'd argue that enterprise search won't--and shouldn't--be Google-y any time soon....Like web search, Google's enterprise search is easy to use--if you're willing to go along with how Google's algorithms view and present your business information....Ironically, enterprises, with all of their highly structured and carefully organized silos of information, require a very different and paradoxically more complex approach.
I highly recommend you read the whole article (it's only 2 pages), not only because it is informative and well written, but also because the author isn't working for one of the Big Three.

The upshot? There is no question that Google is raising the bar for simple search in the enterprise. I wouldn't recommend that anyone try to compete with the GSA on its turf.

But information needs in the enterprise go far beyond known-item search. What enterprises want when they ask for "enterprise search" is not just a search box, but an interactive tool that helps them (or their customers) work through the process of articulating and fulfilling their information needs, for tasks as diverse as customer segmentation, knowledge management, and e-discovery.

If you're interested in search and want to be on the cutting edge of innovation, I suggest you think about the enterprise.

Thursday, August 7, 2008

Where Google Isn't Good Enough

My last post, Is Google Good Enough?, challenged would-be Google killers to identify and address clear consumer needs for which Google isn't good enough as a solution. I like helping my readers, so here are some ideas.
  • Shopping. Google Product Search (fka Froogle) is not one of Google's crown jewels. At best, it works well when you know the exact name of the product you are looking for. But it pales in comparison to any modern ecommerce site, such as Amazon or Home Depot. What makes a shopping site successful? Put simply, it helps users find what they want, even when they didn't know exactly what they wanted when they started.

  • Finding a job. Google has not thrown its hat into the ring of job search, and even the page they offer for finding jobs at Google could use some improvement. The two biggest job sites, Monster and Careerbuilder, succeed in terms of the number of jobs posted, but aren't exactly optimized for user experience. Dice does better, but only for technology jobs. Interestingly, the best job finding site may be LinkedIn--not because of their search implementation (which is adequate but not innovative), but because of their success in getting millions of professionals to provide high-quality data.

  • Finding employees. Again, LinkedIn has probably come closest to providing a good employee finding site. The large job sites (all of which I've used at some point) not only fail to support exploratory search, but also suffer from a skew towards ineligible candidates and a nuisance of recruiters posing as job seekers. Here again, Google has not tried to compete.

  • Planning a trip. Sure, you can use Expedia, Travelocity, or Kayak to find a flight, hotel, and car rental. But there's a lot of room for improvement when it comes to planning a trip, whether for business or pleasure. The existing tools do a poor job of putting together a coordinated itinerary (e.g., meals, activities), and also don't integrate with relevant information sources, such as local directories and reviews. This is another area where Google has not tried to play.
Note two general themes here. The first is thinking beyond the mechanics of search and focusing on the ability to meet user needs at the task level. The second is the need for exploratory search. These only scratch the surface of opportunities in consumer-facing "search" applications. The opportunities within the enterprise are even greater, but I'll save that for my next post.

Tuesday, August 5, 2008

Is Google Good Enough?

As Chief Scientist of Endeca, I spend a lot of my time explaining to people why they should not be satisfied with an information seeking interface that only offers them keyword search as an input mechanism and a ranked list of results as output. I tell them about query clarification dialogs, faceted navigation, and set analysis. More broadly, I evangelize exploratory search and human computer information retrieval as critical to addressing the inherent weakness of conventional ranked retrieval. If you haven't heard me expound on the subject, feel free to check out my slide show, Is Search Broken?

But today I wanted to put my ideology aside and ask the simple question: Is Google good enough? Here is a good faith attempt to make the case for the status quo. I'll focus on web search, since, as I've discussed before on this blog, enterprise search is different.

1) Google does well enough on result quality, enough of the time.

While Google doesn't publish statistics about user satisfaction, it's common knowledge that Google usually succeeds in returning results that users find relevant. Granted, so do all of the major search engines: you can compare Google and Yahoo graphically at this site. But the question is not whether other search engines are also good enough--or even whether they are better. The point is that Google is good enough.

2) Google doesn't support exploratory search. But it often leads you to a tool that does.

The classic instance of this synergy is when Google leads you to a Wikipedia entry. For example, I look up Daniel Kahneman on Google. The top result is his Wikipedia entry. From there, I can traverse links to learn about his research areas, his colleagues, etc.

3) Google is a benign monopoly that mitigates choice overload.

Many people, myself included, have concerns about Google's increasing role in mediating our access to information. But it's hard to ignore the upside of a single portal that gives you access to everything in one place: web pages, blogs, maps, email, etc. And it's all "free"--at least insofar as ad-supported services can be said to be free.

In summary, Google sets the bar pretty high. There are places where Google performs poorly (e.g., shopping) or doesn't even try to compete (e.g., travel). But when I see the series of companies lining up to challenge Google, I have to wonder how many of them have identified and addressed clear consumer needs for which Google isn't good enough as a solution. Given Google's near-monopoly in web search, parity or even incremental advantage isn't enough.