Wednesday, June 4, 2008

Idea Navigation

Last summer, my colleague Vladimir Zelevinsky worked with two interns, Robin Stewart (MIT) and Greg Scott (Tufts), on a novel approach to information exploration. They call it "idea navigation": the basic idea is to extract subject-verb-object triples from unstructured text, group them into hierarchies, and then expose them in a faceted search and browsing interface. I like to think of it as an exploratory search take on question answering.
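To make that concrete, here is a minimal sketch of what subject-verb-object extraction can look like. It uses spaCy's dependency parser purely for illustration; this is not the prototype's actual pipeline, just the flavor of the triples involved.

    # Illustrative only: extract (subject, verb, object) triples with spaCy.
    # The idea navigation prototype used its own extraction pipeline.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def extract_svo(text):
        """Yield (subject, verb, object) lemma triples from raw text."""
        for token in nlp(text):
            if token.dep_ == "nsubj" and token.head.pos_ == "VERB":
                verb = token.head
                for child in verb.children:
                    if child.dep_ == "dobj":
                        yield (token.lemma_, verb.lemma_, child.lemma_)

    # "The senators approved the budget." -> ('senator', 'approve', 'budget')
    for triple in extract_svo("The senators approved the budget."):
        print(triple)

Triples like these are what get rolled up into the faceted hierarchies.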

We found out later that Powerset developed similar functionality that they called "Powermouse" in their private beta and now call "Factz". While the idea navigation prototype is on a smaller scale (about 100k news articles from October 2000), it does some cool things that I haven't seen on Powerset, like leveraging verb hypernyms from WordNet.
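The WordNet piece is easy to illustrate: verb hypernyms let specific verbs (say, "purchase") roll up to more general ones, so triples can be grouped at whatever level of the verb hierarchy is useful. A minimal lookup via NLTK (my example, not the prototype's code):

    # Illustrative only: walk verb hypernyms in WordNet via NLTK.
    from nltk.corpus import wordnet as wn

    for synset in wn.synsets("purchase", pos=wn.VERB):
        for hyper in synset.hypernyms():
            print(synset.name(), "->", hyper.name())
    # e.g. buy.v.01 -> get.v.01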

Click on the frame below to see the presentation they delivered at CHI '08.



Idea Navigation: Structured Browsing for Unstructured Text

5 comments:

Anonymous said...

One of the nicer instances of verb-object extraction I've seen is IHOP (information hyperlinked over proteins), which operates over proteins and their interactions. Here's their relations page for TP53, a widely studied human tumor suppressor:

ihop-net.org/UniPub/iHOP/gs/92798.html

Another live app using the same kind of approach is TextRunner from U. Washington:

cs.washington.edu/research/textrunner/

The CoNLL bakeoffs focused on this kind of lightweight predicate/argument parsing for a few years. For instance, see:

lsi.upc.edu/~srlconll/st05/st05.html

As to extending to gerunds and other nominalizations, check out this corpus and related work:

nlp.cs.nyu.edu/meyers/NomBank.html

Daniel Tunkelang said...

Thanks for the links! The TextRunner application is very cool, even if it doesn't seem to do much with the verbs. But it seems more interesting than anything else I've seen on the open web, and of course it indexes a much broader and more heterogeneous corpus than Wikipedia.

Anonymous said...

See also Dawn Lawrie's dissertation from 2003: a statistical approach to idea navigation, or "concept subsumption hierarchies," as she calls 'em.

http://www.cs.loyola.edu/~lawrie/papers/lawrieThesis.pdf

Scroll through for some good screenshots.

Dawn does this with NN phrases in her hierarchies, but there is no reason why you couldn't extract Adj-Noun phrases, Noun-Verb-Noun phrases, etc., and then use the same underlying statistical language model approaches to building the subsumption hierarchies.
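For concreteness, the co-occurrence subsumption test that this line of work builds on (Sanderson & Croft, 1999) is simple to sketch. This is a rough illustration of that test, not Lawrie's language-model method: term x subsumes term y when documents containing y nearly always contain x, but not the reverse.

    # Illustrative sketch of co-occurrence subsumption (Sanderson & Croft style).
    from collections import defaultdict

    def subsumption_pairs(doc_terms, threshold=0.8):
        """doc_terms: iterable of sets of terms, one set per document.
        Returns (parent, child) pairs where parent subsumes child."""
        docs_with = defaultdict(set)        # term -> ids of docs containing it
        for i, terms in enumerate(doc_terms):
            for t in terms:
                docs_with[t].add(i)
        pairs = []
        for x in docs_with:
            for y in docs_with:
                if x == y:
                    continue
                both = len(docs_with[x] & docs_with[y])
                p_x_given_y = both / len(docs_with[y])
                p_y_given_x = both / len(docs_with[x])
                if p_x_given_y >= threshold and p_y_given_x < threshold:
                    pairs.append((x, y))    # x is the broader (parent) term
        return pairs

The same test works whatever the extracted units are, which is why swapping in Adj-Noun or Noun-Verb-Noun phrases is plausible.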

Daniel Tunkelang said...

Jeremy, thanks for the link. The approach looks promising, and I'm curious how it compares to the WordNet-driven Castanet work at Berkeley. Granted, there's something nice about not depending on a limited lexicon.

As for the idea navigation work, I see it more as suggesting an interface than as an approach to the information extraction problem of identifying the N-V-N triples. The really simple idea is to think of question answering as a problem best served by an exploratory interface.

Anonymous said...

The Castanet work does cite the 1999 Sanderson and Croft work, upon which this 2003 Lawrie work is also based. So I'm sure there are some similarities.

One offhand difference, though, I think, is that the Castanet work appears to create mutually exclusive, partitioned hierarchies, whereas the Lawrie work allows for multiple parents.

However, that is just my impression after a quick skim; I didn't read the Castanet work in full, and it has also been 5-6 years since I read the Lawrie work in detail.
