Tuesday, April 8, 2008

Q&A with Amit Singhal

Amit Singhal, who is head of search quality at Google, gave a very entertaining keynote at ECIR '08 that focused on the adversarial aspects of Web IR. Specifically, he discussed some of the techniques used in the arms race to game Google's ranking algorithms. Perhaps he revealed more than he intended!

During the question and answer session, I reminded Amit of the admonition against security through obscurity that is well accepted in the security and cryptography communities. I questioned whether his team is pursuing the wrong strategy by failing to respect this maxim. Amit replied that a relevance analog to security by design was an interesting challenge (which he delegated to the audience), but he appealed to the subjectivity of relevance as a reason for it being harder to make relevance as transparent as security.

While I accept the difficulty of this challenge, I reject the suggestion that subjectivity makes it harder. To begin with, Google and other web search engines rank results objectively, rather than based on user-specific considerations. Furthermore, the subjectivity of relevance should make the adversarial problem easier rather than harder, as has been observed in the security industry.

But the challenge is indeed a daunting one. Is there a way we can give control to users and thus make the search engines objective referees rather than paternalistic gatekeepers?

At Endeca, we emphasize the transparency of our engine as a core value of our offering to enterprises. Granted, our clients generally do not have an adversarial relationship with their data. Still, I am convinced that the same approach not only can work on the web, but will be the only way to end the arms race between spammers and Amit's army of tweakers.

6 comments:

Mark Watkins said...

From the "loyal opposition" (the author of "How Pagerank wrecked the web" founded a new search company) - essentially arguing that Google's ranking algorithms (effective, but, ultimately, arbitrary in some sense) are in fact the origin of the arms race. There must be some "less gamable" ranking system out there....

Daniel Tunkelang said...

Well, the initial success of PageRank, at least as far as I can tell, came from it being harder to game than the IR measures that other search engines were using at the time. Since then, of course, it's been an arms race.

I'd really love to see the relevance arms race replaced with a principled approach based on attention economics.

Mark said...

Well back in the day Page Rank succeeded because it worked better than existing approaches, as well as being harder to game.

But yes the "attention economics" approach would be interesting, if there were some way to measure that, that did not incent people to mount DoS attacks to simulate attention 8).

Anonymous said...

Back in the day PageRank succeeded because it was a baseline approximation for user data. With richer user visitation data (through e.g. toolbars) PageRank becomes moot. See here.

Mark said...

I might be wrong, but if we built a ranking system based on user visitation data, won't that be gamed as well? - instead of "link farms", won't we get "traffic farms" that artificially inflate the user visitation values of some sites by directing extra (fake) traffic to them?

Aaswath Raman said...

This was also a topic of discussion while I was at MS -- we did in fact publish some papers on spam and adversarial IR based on collaborations with MSR, but we were well aware that blackhats/spammers out there were reading these papers. When spam proves to be an existential threat to the relevance of search engines, things like transparency are sometimes hard to justify.

However, Google (and Yahoo and Live) have done a good job of providing at least some transparency to siteowners with their webmaster tools (i.e., are you being hit with spam filters, etc.). These didn't exist until relatively recently, but I'd say everyone's now come around to seeing the benefits of engaging with legitimate siteowners and offering them information in return for their registering themselves, their sites and sitemaps in a formal way. This is a sea-change from a previously adversarial relationship with all siteowners to one that tries to engage with normal/"good" sites.

Some other thoughts:
1) Search engines are always looking for proxies for relevance (like links) and aspects of an attention economics-approach to this are in play already. Unless I'm misunderstanding however, this too is currently prone to be gamed as well -- for example, botnets can be frighteningly effective.

2) Can there ever be a truly (or even quasi-) objective definition of relevance across the web? Obviously engines have ways of measuring their effectiveness, but those definitions are ultimately subjective ones. Would an "open" standard of relevance achieve this?

I'm inclined to think web relevance will remain subjective by virtue of the nature of the dataset, and because the stakes are high, monetarily speaking for all involved. I'll also say given the amount of money involved, spammers will try very, very hard to game whatever system is put out there. There's a few billion too many involved, and thus, unsurprisingly, a large number of smart folks working on spamming. Perhaps I have a dimmer view of human nature after dealing with this for a while, but I'm skeptical we can ever end the arms race with spammers unless the monetary incentive decreases in some way :)

Another thought is that an ancillary beneficiary to spam succeeding is often the search engine's ad wing itself in terms of fees. This isn't to suggest any conspiracy, just an example of the weird dynamics often at play vis-a-vis spam.
